Google Issues Reminder on Robots.txt Usage to Block Action URLs


Gary Illyes of Google has reiterated the importance of utilizing robots.txt to prevent crawlers from accessing URLs that execute actions like adding items to carts or wishlists, thereby conserving server resources.

In a recent LinkedIn post, Illyes emphasized longstanding advice directed at website owners: Employ the robots.txt file to shield action-oriented URLs from web crawlers.

Illyes pointed out the common issue of excessive crawler traffic burdening servers, typically caused by bots crawling URLs intended for user-specific actions.

He stated:

"When reviewing the crawl activity that sites report, it frequently consists of URLs that execute actions such as 'add to cart' or 'add to wishlist.' These are irrelevant to crawlers and are likely unwanted."

To mitigate server strain, Illyes recommended blocking URLs containing parameters like "?add_to_cart" or "?add_to_wishlist" in the robots.txt file.

As an example, he suggested:

"If your site includes URLs such as:
https://example.com/product/scented-candle-v1?add_to_cart
and
https://example.com/product/scented-candle-v1?add_to_wishlist

it's advisable to implement a disallow directive for them in your robots.txt file."
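A minimal robots.txt sketch along those lines might look like the following. The paths and parameter names are taken from the illustrative URLs above and would need to be adapted to a site's actual URL structure; Google's crawlers support the "*" wildcard in robots.txt path patterns.

```
User-agent: *
# Block action URLs that only make sense for interactive users.
# Parameter names below match the illustrative URLs; adjust to your own site.
Disallow: /*?add_to_cart
Disallow: /*?add_to_wishlist
```

The "*" wildcard matches any sequence of characters, so these rules cover any product path that carries the listed parameters.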

While requiring the HTTP POST method for such actions can also discourage crawlers from fetching these URLs, Illyes cautioned that crawlers can still issue POST requests, so robots.txt remains relevant.
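Because robots.txt matching is based purely on the URL pattern, a compliant crawler will skip a disallowed action URL regardless of request method. The sketch below is a simplified illustration of how a wildcard disallow pattern like the ones above matches the example URLs; it is not a full robots.txt parser and ignores Allow/Disallow precedence, percent-encoding, and user-agent group selection.

```python
import re

def rule_to_regex(pattern: str) -> re.Pattern:
    """Turn a robots.txt path pattern ('*' wildcard, optional '$' end anchor)
    into a regex anchored at the start of the path."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape everything, then restore '*' as "any sequence of characters".
    body = re.escape(pattern).replace(r"\*", ".*")
    return re.compile("^" + body + ("$" if anchored else ""))

def is_disallowed(path_and_query: str, disallow_patterns: list[str]) -> bool:
    return any(rule_to_regex(p).match(path_and_query) for p in disallow_patterns)

# Patterns taken from the example rules above.
patterns = ["/*?add_to_cart", "/*?add_to_wishlist"]

print(is_disallowed("/product/scented-candle-v1?add_to_cart", patterns))   # True
print(is_disallowed("/product/scented-candle-v1", patterns))               # False
```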

Reaffirming Traditional Best Practices

Alan Perkins, participating in the discussion, underscored the historical basis of this guidance, dating back to web standards introduced in the 1990s for analogous reasons.

Quoting from a document titled "A Standard for Robot Exclusion" from 1994:

"In 1993 and 1994, there were instances where robots accessed WWW servers against their host's wishes…robots traversed unsuitable parts of WWW servers, such as excessively deep virtual trees, redundant information, transient data, or cgi-scripts with unintended effects (like voting)."

The robots.txt standard, which proposed rules that well-behaved crawlers should follow, emerged as a consensus among web stakeholders in 1994.

Adherence and Exceptions

Illyes affirmed Google's commitment to honoring robots.txt directives, noting rare documented exceptions involving "user-initiated or contractually obligated fetches."

This adherence to robots.txt has long been a cornerstone of Google’s web crawling policies.