How to Find All Existing and Archived URLs on a Website


There are many good reasons you might need to find all the URLs on a website, but your exact goal will determine what you're looking for. For example, you might want to:

Identify every indexed URL to analyze issues such as cannibalization or index bloat
Collect current and historic URLs Google has seen, especially for site migrations
Uncover all 404 URLs to recover from post-migration errors
In each scenario, a single tool won't give you everything you need. Unfortunately, Google Search Console isn't exhaustive, and a "site:example.com" search is limited and difficult to extract data from.

In this post, I'll walk you through some tools for building your URL list before deduplicating the data in a spreadsheet or a Jupyter Notebook, depending on your website's size.

Old sitemaps and crawl exports
If you're looking for URLs that recently disappeared from the live site, there's a chance someone on your team may have saved a sitemap file or a crawl export before the changes were made. If you haven't already, check for these files; they can often give you what you need. But if you're reading this, you probably didn't get that lucky.

Archive.org
Archive.org is a valuable tool for SEO tasks, funded by donations. If you search for a domain and select the "URLs" option, you can access up to 10,000 listed URLs.

However, there are a few limitations:

URL limit: You can only retrieve up to 10,000 URLs, which is insufficient for larger sites.
Quality: Many URLs may be malformed or reference resource files (e.g., images or scripts).
No export option: There isn't a built-in way to export the list.
To bypass the lack of an export button, use a browser scraping plugin like Dataminer.io. However, these limitations mean Archive.org may not provide a complete solution for larger sites. Also, Archive.org doesn't indicate whether Google indexed a URL, but if Archive.org found it, there's a good chance Google did, too.
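
If you'd rather pull archived URLs programmatically, the Wayback Machine also exposes a CDX API. Here's a minimal Python sketch for fetching unique original URLs for a domain; the domain and output filename are placeholders to swap for your own:

```python
import requests

# Wayback Machine CDX API: returns archived capture records for a domain.
# "example.com" and "archive_urls.txt" are placeholders.
params = {
    "url": "example.com",
    "matchType": "domain",   # include subdomains
    "fl": "original",        # return only the original URL field
    "collapse": "urlkey",    # deduplicate repeated captures of the same URL
    "output": "text",
    "limit": 50000,
}
resp = requests.get("http://web.archive.org/cdx/search/cdx", params=params, timeout=60)
resp.raise_for_status()

urls = sorted(set(resp.text.splitlines()))
with open("archive_urls.txt", "w") as f:
    f.write("\n".join(urls))

print(f"Saved {len(urls)} archived URLs")
```

You'll still want to filter out resource files (images, scripts) afterward, since the archive lists everything it captured.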

Moz Pro
While you might typically use a backlink index to find external sites linking to you, these tools also discover URLs on your own site in the process.


How to use it:
Export your inbound links in Moz Pro to get a quick and easy list of target URLs from your site. If you're dealing with a massive website, consider using the Moz API to export data beyond what's manageable in Excel or Google Sheets.

It's important to note that Moz Pro doesn't confirm whether URLs are indexed or discovered by Google. However, since most sites apply the same robots.txt rules to Moz's bots as they do to Google's, this method generally works well as a proxy for Googlebot's discoverability.

Google Search Console
Google Search Console offers several valuable sources for building your list of URLs.

Links reports:


Similar to Moz Pro, the Links section provides exportable lists of target URLs. Unfortunately, these exports are capped at 1,000 URLs each. You can apply filters for specific pages, but since filters don't apply to the export, you might need to rely on browser scraping tools, which are limited to 500 filtered URLs at a time. Not ideal.

Performance → Search Results:


This export gives you a list of pages receiving search impressions. While the export is limited, you can use the Google Search Console API for larger datasets. There are also free Google Sheets plugins that simplify pulling more extensive data.
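
As a rough illustration of the API route, here's a minimal Python sketch using the Search Console API's Search Analytics query (via google-api-python-client) to collect pages with impressions. The site URL, service-account key file, and date range are placeholders:

```python
from google.oauth2 import service_account
from googleapiclient.discovery import build

# Placeholders: your verified property and a service-account key with
# Search Console access.
SITE_URL = "https://example.com/"
KEY_FILE = "service-account.json"

creds = service_account.Credentials.from_service_account_file(
    KEY_FILE, scopes=["https://www.googleapis.com/auth/webmasters.readonly"]
)
service = build("searchconsole", "v1", credentials=creds)

pages = set()
start_row = 0
while True:
    # Page through results 25,000 rows at a time, keyed by page.
    body = {
        "startDate": "2024-01-01",
        "endDate": "2024-03-31",
        "dimensions": ["page"],
        "rowLimit": 25000,
        "startRow": start_row,
    }
    resp = service.searchanalytics().query(siteUrl=SITE_URL, body=body).execute()
    rows = resp.get("rows", [])
    if not rows:
        break
    pages.update(row["keys"][0] for row in rows)
    start_row += len(rows)

print(f"Collected {len(pages)} pages with impressions")
```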

Indexing → Pages report:


This section offers exports filtered by issue type, though these are also limited in scope.

Google Analytics
The Engagement → Pages and Screens default report in GA4 is an excellent source for collecting URLs, with a generous limit of 100,000 URLs.


Even better, you can apply filters to create different URL lists, effectively working around the 100k limit. For example, if you want to export only blog URLs, follow these steps:

Step 1: Add a segment to the report

Step 2: Click "Create a new segment."


Step 3: Define the segment with a narrower URL pattern, such as URLs containing /blog/


Note: URLs found in Google Analytics may not be discoverable by Googlebot or indexed by Google, but they still provide valuable insights.
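
If the UI export limits get in the way, the GA4 Data API can also pull page paths programmatically. A minimal sketch, assuming the google-analytics-data package is installed and credentials are configured via the environment; the property ID and date range are placeholders:

```python
from google.analytics.data_v1beta import BetaAnalyticsDataClient
from google.analytics.data_v1beta.types import (
    DateRange, Dimension, Metric, RunReportRequest,
)

# Placeholder GA4 property ID; credentials are read from
# GOOGLE_APPLICATION_CREDENTIALS in the environment.
PROPERTY_ID = "123456789"

client = BetaAnalyticsDataClient()
request = RunReportRequest(
    property=f"properties/{PROPERTY_ID}",
    dimensions=[Dimension(name="pagePath")],
    metrics=[Metric(name="screenPageViews")],
    date_ranges=[DateRange(start_date="90daysAgo", end_date="today")],
    limit=100000,
)
response = client.run_report(request)

paths = {row.dimension_values[0].value for row in response.rows}
print(f"Collected {len(paths)} page paths from GA4")
```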

Server log files
Server or CDN log files are perhaps the ultimate tool at your disposal. These logs capture an exhaustive list of every URL path requested by users, Googlebot, or other bots during the recorded period.

Considerations:

File size: Log files can be massive, so many sites only retain the last two weeks of data.
Complexity: Interpreting log files can be challenging, but various tools are available to simplify the process, or you can roll your own, as in the sketch below.
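
For a quick DIY pass, here's a minimal Python sketch that extracts unique request paths from an access log. It assumes a common/combined log format, and "access.log" is a placeholder filename, so adjust the regex to match your server's configuration:

```python
import re

# Assumes a combined/common log format where the request line looks like:
#   "GET /some/path HTTP/1.1"
REQUEST_RE = re.compile(r'"(?:GET|POST|HEAD) (\S+) HTTP/[\d.]+"')

paths = set()
with open("access.log", encoding="utf-8", errors="replace") as log:
    for line in log:
        match = REQUEST_RE.search(line)
        if match:
            # Strip query strings so /page?a=1 and /page?a=2 collapse together.
            paths.add(match.group(1).split("?")[0])

with open("log_paths.txt", "w") as out:
    out.write("\n".join(sorted(paths)))

print(f"Found {len(paths)} unique paths")
```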
Combine, and good luck
Once you've gathered URLs from all of these sources, it's time to combine them. If your site is small enough, use Excel, or for larger datasets, tools like Google Sheets or a Jupyter Notebook. Ensure all URLs are consistently formatted, then deduplicate the list.
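
As a rough sketch of the Jupyter approach, the snippet below combines several exported URL lists, normalizes them, and deduplicates. The filenames and the normalization rules (lowercase host, strip trailing slash, placeholder domain for bare paths) are assumptions to adapt to your own data:

```python
from urllib.parse import urlsplit, urlunsplit
import pandas as pd

# Placeholder filenames: one URL per line from each source.
SOURCES = ["archive_urls.txt", "gsc_pages.txt", "ga4_paths.txt", "log_paths.txt"]

def normalize(url: str) -> str:
    """Lowercase scheme/host and strip trailing slashes so variants dedupe."""
    url = url.strip()
    if url.startswith("/"):                 # log and GA4 paths have no host
        url = "https://example.com" + url   # placeholder domain
    parts = urlsplit(url)
    path = parts.path.rstrip("/") or "/"
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path, parts.query, ""))

frames = []
for source in SOURCES:
    df = pd.read_csv(source, header=None, names=["url"])
    df["source"] = source
    frames.append(df)

combined = pd.concat(frames, ignore_index=True)
combined["url"] = combined["url"].map(normalize)
deduped = combined.drop_duplicates(subset="url")

deduped.to_csv("all_urls.csv", index=False)
print(f"{len(deduped)} unique URLs across {len(SOURCES)} sources")
```

Keeping the source column makes it easy to see later which tool discovered each URL.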

And voilà, you now have a comprehensive list of current, old, and archived URLs. Good luck!
