How to Find All Current and Archived URLs on a Website

There are many good reasons you might want to find all the URLs on a website, but your exact goal will determine what you're searching for. For example, you may want to:

Identify every indexed URL to analyze issues such as cannibalization or index bloat
Collect current and historic URLs Google has seen, especially for site migrations
Find all 404 URLs to recover from post-migration errors
In each scenario, a single tool won't give you everything you need. Unfortunately, Google Search Console isn't exhaustive, and a "site:example.com" search is limited and difficult to extract data from.

In this post, I'll walk you through some tools to build your URL list before deduplicating the data using a spreadsheet or Jupyter Notebook, depending on your site's size.

Old sitemaps and crawl exports
If you're looking for URLs that disappeared from the live site recently, there's a chance someone on your team may have saved a sitemap file or a crawl export before the changes were made. If you haven't already, check for these files; they can often provide what you need. But if you're reading this, you probably didn't get so lucky.

Archive.org
Archive.org is a valuable tool for SEO tasks, funded by donations. If you search for a domain and select the "URLs" option, you can access up to 10,000 listed URLs.

However, there are a few limitations:

URL limit: You can only retrieve up to 10,000 URLs, which is insufficient for larger sites.
Quality: Many URLs may be malformed or reference resource files (e.g., images or scripts).
No export option: There isn't a built-in way to export the list.
To bypass the lack of an export button, use a browser scraping plugin like Dataminer.io. However, these limitations mean Archive.org may not provide a complete solution for larger sites. Also, Archive.org doesn't indicate whether Google indexed a URL, but if Archive.org found it, there's a good chance Google did, too.
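
If you're comfortable with a little scripting, you can skip the scraping plugin entirely: the Wayback Machine also exposes a CDX API that returns archived URLs in bulk. Here's a minimal sketch in Python; the domain, limit, and filters are placeholders to adjust for your own site.

```python
# Minimal sketch: pull archived URLs from the Wayback Machine's CDX API.
# "example.com" is a placeholder; tune limit/filters for your site.
import requests

resp = requests.get(
    "https://web.archive.org/cdx/search/cdx",
    params={
        "url": "example.com",
        "matchType": "domain",       # include subdomains
        "fl": "original",            # return only the original URL field
        "collapse": "urlkey",        # deduplicate by normalized URL key
        "filter": "statuscode:200",  # skip redirects and errors
        "limit": 50000,
    },
    timeout=60,
)
resp.raise_for_status()
urls = sorted(set(resp.text.splitlines()))
print(f"{len(urls)} unique archived URLs")
```

This route also sidesteps the 10,000-URL ceiling in the web interface, though malformed and resource-file URLs will still need filtering afterward.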

Moz Pro
Although you might typically use a link index to find external sites linking to you, these tools also discover URLs on your own site in the process.


How to use it:
Export your inbound links in Moz Pro to get a quick and easy list of target URLs from your site. If you're dealing with a massive website, consider using the Moz API to export data beyond what's manageable in Excel or Google Sheets.

It's important to note that Moz Pro doesn't confirm whether URLs are indexed or discovered by Google. However, since most sites apply the same robots.txt rules to Moz's bots as they do to Google's, this method generally works well as a proxy for Googlebot's discoverability.
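
If you do go the API route, here's a rough sketch of what a request might look like. Treat the request body and response fields below as assumptions; confirm the exact parameter and field names in Moz's Links API documentation before relying on this.

```python
# Rough sketch against the Moz Links API v2 "links" endpoint.
# Body parameters and response field names are assumptions; verify
# them against Moz's official API docs.
import requests

ACCESS_ID = "your-access-id"       # generated in your Moz account
SECRET_KEY = "your-secret-key"

resp = requests.post(
    "https://lsapi.seomoz.com/v2/links",
    auth=(ACCESS_ID, SECRET_KEY),       # HTTP Basic auth
    json={
        "target": "example.com/",       # placeholder: your site
        "target_scope": "root_domain",  # links pointing anywhere on the domain
        "limit": 50,
    },
    timeout=60,
)
resp.raise_for_status()
results = resp.json().get("results", [])

# Collect the on-site pages that inbound links point at. The "target"
# field name is an assumption; inspect one result to verify the shape.
target_pages = {row.get("target") for row in results}
```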

Google Search Console
Google Search Console offers several valuable tools for building your list of URLs.

Links reports:


Similar to Moz Pro, the Links section provides exportable lists of target URLs. Unfortunately, these exports are capped at 1,000 URLs each. You can apply filters for specific pages, but since filters don't apply to the export, you might need to rely on browser scraping tools, limited to 500 filtered URLs at a time. Not ideal.

Performance → Search Results:


This export gives you a list of pages receiving search impressions. While the export is limited, you can use the Google Search Console API for larger datasets. There are also free Google Sheets plugins that simplify pulling more extensive data.
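
For reference, here's a minimal sketch of paginating the Search Analytics endpoint with the official Python client. It assumes you've already completed an OAuth flow and saved credentials to token.json, and "sc-domain:example.com" is a placeholder for your own property.

```python
# Minimal sketch: page through Search Console's searchanalytics.query
# to collect every page with impressions. Assumes OAuth credentials
# saved in token.json; the property name is a placeholder.
from google.oauth2.credentials import Credentials
from googleapiclient.discovery import build

creds = Credentials.from_authorized_user_file(
    "token.json",
    scopes=["https://www.googleapis.com/auth/webmasters.readonly"],
)
service = build("searchconsole", "v1", credentials=creds)

rows, start_row = [], 0
while True:
    resp = service.searchanalytics().query(
        siteUrl="sc-domain:example.com",
        body={
            "startDate": "2024-01-01",
            "endDate": "2024-12-31",
            "dimensions": ["page"],
            "rowLimit": 25000,   # API maximum per request
            "startRow": start_row,
        },
    ).execute()
    batch = resp.get("rows", [])
    rows.extend(batch)
    if len(batch) < 25000:
        break
    start_row += 25000

pages = [r["keys"][0] for r in rows]
print(f"{len(pages)} pages with impressions")
```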

Indexing → Pages report:


This section provides exports filtered by issue type, though these are also limited in scope.

Google Analytics
The Engagement → Pages and Screens default report in GA4 is an excellent source for collecting URLs, with a generous limit of 100,000 URLs.


Even better, you can apply filters to create specific URL lists, effectively surpassing the 100k limit. For example, if you want to export only blog URLs, follow these steps:

Step 1: Add a segment to the report

Step 2: Click "Create a new segment."

Step 3: Define the segment with a narrower URL pattern, such as URLs containing /blog/


Note: URLs found in Google Analytics may not be discoverable by Googlebot or indexed by Google, but they still provide valuable insights.
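
If you'd rather pull this data programmatically than click through the UI, the GA4 Data API can return page paths directly. Here's a minimal sketch assuming the google-analytics-data package and Application Default Credentials; the property ID is a placeholder.

```python
# Minimal sketch: fetch page paths from the GA4 Data API.
# "123456789" is a placeholder property ID; auth assumes Application
# Default Credentials are already configured.
from google.analytics.data_v1beta import BetaAnalyticsDataClient
from google.analytics.data_v1beta.types import (
    DateRange, Dimension, Metric, RunReportRequest,
)

client = BetaAnalyticsDataClient()
request = RunReportRequest(
    property="properties/123456789",
    dimensions=[Dimension(name="pagePath")],
    metrics=[Metric(name="screenPageViews")],
    date_ranges=[DateRange(start_date="2024-01-01", end_date="2024-12-31")],
    limit=100000,
)
response = client.run_report(request)
paths = [row.dimension_values[0].value for row in response.rows]
print(f"{len(paths)} page paths")
```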

Server log files
Server or CDN log files are perhaps the ultimate tool at your disposal. These logs capture an exhaustive list of every URL path requested by users, Googlebot, or other bots during the recorded period.

Considerations:

Data size: Log files can be enormous, so many sites only retain the last two weeks of data.
Complexity: Analyzing log files can be challenging, but various tools are available to simplify the process; see the sketch below for a DIY starting point.
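
Here's a quick sketch of what that parsing can look like for a log in the common/combined format. The filename and regex are assumptions; adapt them to your server's actual log configuration.

```python
# Quick sketch: extract unique URL paths from an access log in the
# common/combined log format. "access.log" is a placeholder filename.
import re

# Matches the request portion of a log line: method, path, protocol
request_re = re.compile(r'"(?:GET|POST|HEAD) (\S+) HTTP/[\d.]+"')

paths = set()
with open("access.log", encoding="utf-8", errors="replace") as f:
    for line in f:
        m = request_re.search(line)
        if m:
            # Strip query strings so /page?a=1 and /page?a=2 collapse together
            paths.add(m.group(1).split("?")[0])

print(f"{len(paths)} unique paths")
```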
Combine, and good luck
Once you've gathered URLs from all these sources, it's time to combine them. If your site is small enough, use Excel; for larger datasets, use tools like Google Sheets or a Jupyter Notebook. Make sure all URLs are consistently formatted, then deduplicate the list.
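
If you've gone the Jupyter Notebook route, a short pandas sketch like this one handles the normalization and deduplication. The urls.csv input and its url column are placeholders for however you've stitched the exports together, and the normalization rules (forcing one scheme, trimming trailing slashes, dropping query strings) are choices you may want to revisit for your site.

```python
# Minimal sketch: normalize and deduplicate a combined URL list.
# "urls.csv" with a "url" column is a placeholder input.
import pandas as pd
from urllib.parse import urlsplit

df = pd.read_csv("urls.csv")

def normalize(url: str) -> str:
    parts = urlsplit(str(url).strip())
    host = parts.netloc.lower().removeprefix("www.")
    path = parts.path.rstrip("/") or "/"
    # Force one scheme and drop query strings/fragments so variants collapse
    return f"https://{host}{path}"

df["normalized"] = df["url"].map(normalize)
deduped = df.drop_duplicates("normalized")
deduped.to_csv("all_urls_deduped.csv", index=False)
print(f"{len(df)} rows in, {len(deduped)} unique URLs out")
```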

And voilà: you now have a comprehensive list of current, old, and archived URLs. Good luck!
