How Crawlers Use Robots.txt and Sitemaps

When search engine or AI crawlers land on your website, their first stop is usually the robots.txt file. This small but important file acts as a set of instructions — telling crawlers what they can and cannot access, and where they should go next.

Step 1: Crawlers Visit Robots.txt

The crawler requests yourdomain.com/robots.txt. This file contains directives such as:

  • Allow / Disallow → Control which URLs the crawler is permitted to crawl.

  • Crawl-delay → Asks the crawler to wait a set number of seconds between requests so it doesn’t overload your server (not every crawler honours this directive).

  • Sitemap → Points the crawler to your XML sitemap, which lists all important pages.

Example based on Howarth Timber’s robots.txt (shown here with a placeholder domain):

User-agent: *
Sitemap: https://www.website.co.uk/xmlsitemap.php
Disallow: /account.php
Disallow: /cart.php
Disallow: /login.php*
Disallow: /admin/
Crawl-delay: 10

Here:

  • The Sitemap: directive is right at the top, ensuring crawlers can quickly find the XML sitemap.

  • The Disallow rules stop crawlers from wasting time on non-public pages (like the account, cart, login, and admin areas).

  • The Crawl-delay: 10 asks bots to wait 10 seconds between requests so they don’t overload the server, though not every crawler honours this (see the sketch below).
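
To make this concrete, here is a minimal sketch of how a crawler might interpret these directives, using Python’s standard-library robots.txt parser. The rules are copied from the example above, and www.website.co.uk is simply that example’s placeholder domain.

from urllib.robotparser import RobotFileParser

# The rules from the example above, parsed directly so nothing is
# fetched over the network (www.website.co.uk is the placeholder domain).
rules = """\
User-agent: *
Sitemap: https://www.website.co.uk/xmlsitemap.php
Disallow: /account.php
Disallow: /cart.php
Disallow: /login.php*
Disallow: /admin/
Crawl-delay: 10
"""

parser = RobotFileParser()
parser.modified()                 # record a fetch time so can_fetch() treats the rules as loaded
parser.parse(rules.splitlines())

# Allowed vs. disallowed paths for any user agent ("*")
print(parser.can_fetch("*", "/products/"))  # True  - no Disallow rule matches
print(parser.can_fetch("*", "/cart.php"))   # False - blocked by Disallow: /cart.php

# The requested pause between requests, in seconds
print(parser.crawl_delay("*"))              # 10

# Sitemap URLs declared in the file
print(parser.site_maps())                   # ['https://www.website.co.uk/xmlsitemap.php']

Note that the parser only reports the Crawl-delay and Sitemap values; it is still up to the crawler itself to honour them.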

Step 2: Robots.txt Directs Crawlers to the Sitemap

Once the crawler sees the Sitemap: line, it follows the link to the sitemap index file.

This sitemap index points to separate sitemaps for pages, products, categories, and news. Breaking content into multiple sitemaps keeps things organised and scalable.
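
For illustration, a sitemap index is itself a small XML file in the standard sitemaps.org format that simply lists the child sitemaps. The structure below is a simplified sketch; the individual file names are placeholders rather than the site’s actual sitemap files.

<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>https://www.website.co.uk/sitemap-pages.xml</loc>
  </sitemap>
  <sitemap>
    <loc>https://www.website.co.uk/sitemap-products.xml</loc>
  </sitemap>
  <sitemap>
    <loc>https://www.website.co.uk/sitemap-categories.xml</loc>
  </sitemap>
  <sitemap>
    <loc>https://www.website.co.uk/sitemap-news.xml</loc>
  </sitemap>
</sitemapindex>

Each <loc> entry points at a child sitemap, which in turn lists the actual page URLs for that section of the site.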

Step 3: Crawlers Discover and Index Pages

From here, the crawler systematically works through each sitemap file, finding the key URLs you want indexed. Since non-public pages like the admin, account, and cart areas are already excluded in robots.txt, crawlers focus only on valuable content.
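
As a rough sketch of this step (again using only Python’s standard library, and the placeholder sitemap URL from the robots.txt example above), a crawler could walk the index and collect every page URL it exposes:

import urllib.request
import xml.etree.ElementTree as ET

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def fetch_xml(url):
    """Download a sitemap file and return its parsed XML root."""
    with urllib.request.urlopen(url) as response:
        return ET.fromstring(response.read())

# The sitemap index advertised in robots.txt (placeholder URL)
index = fetch_xml("https://www.website.co.uk/xmlsitemap.php")

# Each <sitemap><loc> entry in the index points at a child sitemap
# (pages, products, categories, news, ...)
child_sitemaps = [loc.text for loc in index.findall("sm:sitemap/sm:loc", NS)]

# Each child sitemap lists the actual page URLs in <url><loc> entries
page_urls = []
for sitemap_url in child_sitemaps:
    child = fetch_xml(sitemap_url)
    page_urls.extend(loc.text for loc in child.findall("sm:url/sm:loc", NS))

print(f"{len(page_urls)} URLs discovered for crawling")

A real crawler would also re-check each discovered URL against the robots.txt rules and pause between requests to honour the Crawl-delay before fetching anything.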

Why This Matters

  • Efficiency: Bots waste less time on irrelevant pages.

  • Compatibility: Standardised sitemap references make it easier for all crawlers, including AI scrapers.

  • Scalability: A sitemap index allows easy expansion as the site grows.