The robots.txt file is a handy tool that gives search engine crawlers recommendations about how to crawl your website. Managing it well is part of robust technical SEO practice.
While the file carries considerable weight, it has limitations. According to Google itself, robots.txt is not a surefire way to keep web pages out of its index; its main purpose is to prevent excessive crawler requests from overloading your site or server.
If you have crawl controls in place for your website, then it is crucial that you implement them properly. This is even more critical when one uses dynamic URLs, or other methods, which can create what is potentially an unlimited number of pages.
This document covers some common robots.txt file errors, the impact on your website and visibility to search, and options to fix these issues should they be identified.
What Is Robots.txt?
A robots.txt file uses a simple plain text format and must be placed in the root directory of your website. If it is placed in a subdirectory, search engines will disregard it. Despite its significant capabilities, a robots.txt file is typically a simple document that can be created quickly using a basic editor like Notepad. You are also free to add extra messaging to engage any users who open the file.
Alternative methods can accomplish some of the objectives typically addressed by robots.txt. Pages can embed a robots meta tag directly within their code, while the X-Robots-Tag HTTP header offers another avenue to influence the display of content in search results, including whether it appears at all.
What Functions Can Robots.txt Serve?
Robots.txt offers a range of capabilities across various content types:
Webpages can be disallowed from crawling: While they may be listed in search results, they will not have a text description. The non-HTML contents that are on the page will not be crawled either.
Media files can be disallowed from appearing in Google searches: This includes pictures, videos, and audio files. While the file is still publicly accessible online and may be viewed and linked to, it will not appear in Google searches.
Resource files, such as less important third-party scripts, can be blocked: If Google crawls a page that depends on that resource to load, Googlebot will see the page as if the resource did not exist, which may affect how the page is indexed.
One point to note: robots.txt on its own cannot fully prevent a webpage from appearing in Google’s search results. To achieve that, add a noindex meta tag to the head section of the page.
What are the Risks of Robots.txt Errors?
Mistakes in robots.txt can lead to unintended outcomes, but they’re typically not catastrophic. The positive aspect is that by rectifying errors in your robots.txt file, you can quickly recover from any issues, often completely.
Google’s advice to web developers addresses the topic of robots.txt errors with the following guidance:
“Web crawlers are generally very flexible and typically will not be swayed by minor mistakes in the robots.txt file. In general, the worst that can happen is that incorrect [or] unsupported directives will be ignored.
Bear in mind though that Google can’t read minds when interpreting a robots.txt file; we have to interpret the robots.txt file we fetched. That said, if you are aware of problems in your robots.txt file, they’re usually easy to fix.”
If your website shows unusual behavior in search results, examining your robots.txt file is advisable to identify any errors, syntax issues, or overly broad rules.
Let’s look more closely at each of these potential mistakes and at how to make sure your robots.txt file is valid.
Robots.txt Located Incorrectly
Search robots can only locate the file if it is located in your root directory. Therefore, there should be only a forward slash between the .com (or equivalent domain) of your website and the ‘robots.txt’ filename in the URL of your robots.txt file.
If there’s a subfolder present, your robots.txt file likely isn’t visible to the search robots, and your website might behave as if there’s no robots.txt file at all. To resolve this issue, move your robots.txt file to your root directory.
It’s important to note that this may require root access to your server. Some content management systems may default to uploading files to a “media” subdirectory, so you may need to bypass this to ensure your robots.txt file is in the correct location.
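The location rule can be checked mechanically. The following is a minimal sketch, assuming Python and using the placeholder host www.example.com; the helper names expected_robots_url and is_at_root are hypothetical and not part of any tool mentioned here.

```python
from urllib.parse import urlsplit, urlunsplit


def expected_robots_url(page_url: str) -> str:
    """Return the root-level robots.txt URL for the host of any page URL."""
    parts = urlsplit(page_url)
    # Keep scheme and host (including any port); replace the path and
    # drop the query string and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


def is_at_root(robots_url: str) -> bool:
    """True when a robots.txt URL has only a single slash before the filename."""
    return urlsplit(robots_url).path == "/robots.txt"


print(expected_robots_url("https://www.example.com/blog/post?id=7"))
# prints https://www.example.com/robots.txt
print(is_at_root("https://www.example.com/media/robots.txt"))
# prints False
```

If the file your CMS uploaded resolves to anything other than the first URL, crawlers will most likely behave as if the site had no robots.txt at all.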
Noindex Directive in Robots.txt
This issue is common on old websites that have been live for several years. Google ceased honouring noindex directives in robots.txt files as of September 1, 2019. If your robots.txt file was generated before that date or includes noindex commands, chances are those pages will still be indexed in Google’s search results.
The remedy is to implement an alternative “noindex” technique. One option is the robots meta tag, which you can add to the head section of any webpage to prevent Google from indexing it.
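As a rough illustration of both page-level options, the sketch below shows the meta tag as a string and a small WSGI wrapper that adds the X-Robots-Tag header. It assumes a Python WSGI application; the names add_noindex and demo_app and the /internal/ prefix are hypothetical and chosen only for the example.

```python
# The on-page variant: place this inside the <head> of the page.
NOINDEX_META = '<meta name="robots" content="noindex">'


def add_noindex(app, private_prefixes=("/internal/",)):
    """Wrap a WSGI app so responses under the given prefixes carry
    an 'X-Robots-Tag: noindex' header."""
    def wrapped(environ, start_response):
        path = environ.get("PATH_INFO", "")

        def start(status, headers, exc_info=None):
            if path.startswith(tuple(private_prefixes)):
                headers = list(headers) + [("X-Robots-Tag", "noindex")]
            return start_response(status, headers, exc_info)

        return app(environ, start)
    return wrapped


def demo_app(environ, start_response):
    """Stand-in application used only to demonstrate the wrapper."""
    start_response("200 OK", [("Content-Type", "text/html; charset=utf-8")])
    return [b"<!doctype html><title>demo</title>"]


application = add_noindex(demo_app)
```

Either signal only works if the crawler can actually fetch the page, so a URL carrying noindex should not also be disallowed in robots.txt.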
Misuse of Wildcards
Robots.txt uses two wildcard characters:
- Asterisk (*): Matches any sequence of valid characters, including none, akin to a Joker in a deck of cards.
- Dollar sign ($): Indicates the end of a URL, enabling you to enforce rules solely on the final part of the URL, such as the filetype extension.
Use wildcards sparingly, because a single rule can restrict a much broader section of your website than you intended. A misplaced asterisk can inadvertently block robot access to your entire site.
Verify your wildcard rules with a robots.txt testing tool to confirm that they neither block pages you want crawled nor open up pages you meant to restrict.
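To see how far a wildcard rule reaches, it can help to translate it into a regular expression and try it against a few paths. This is a simplified sketch of the two wildcards described above, not a full robots.txt parser; the function names are hypothetical, and empty patterns and percent-encoding are deliberately not handled.

```python
import re


def pattern_to_regex(pattern: str) -> re.Pattern:
    """Translate a robots.txt path pattern into a compiled regex."""
    anchored = pattern.endswith("$")
    body = pattern[:-1] if anchored else pattern
    # Escape each literal piece, then join the pieces with '.*' so that
    # every '*' matches any run of characters (including none).
    regex = ".*".join(re.escape(piece) for piece in body.split("*"))
    return re.compile(regex + ("$" if anchored else ""))


def rule_matches(pattern: str, path: str) -> bool:
    """True when the pattern applies to the URL path (matched from its start)."""
    return pattern_to_regex(pattern).match(path) is not None


print(rule_matches("/*.pdf$", "/docs/guide.pdf"))      # prints True
print(rule_matches("/*.pdf$", "/docs/guide.pdf?v=2"))  # prints False
print(rule_matches("/private", "/private-offers/"))    # prints True
print(rule_matches("/*", "/any/page/at/all"))          # prints True
```

The last two lines show the typical surprises: a rule is a prefix match, so /private also covers /private-offers/, and a bare /* covers every URL on the site.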
Restricting Access to Scripts and Stylesheets
It may appear logical to prevent crawler access to external JavaScripts and cascading stylesheets (CSS). However, it’s important to remember that Googlebot requires access to CSS and JS files to properly interpret your HTML and PHP pages. If your pages show unusual behavior in Google’s results or appear incorrectly indexed, investigate whether you’re blocking crawler access to essential external files.
A simple solution is to remove the line from your robots.txt file that restricts access. Alternatively, if there are specific files you need to block, include an exception that allows access to the necessary CSS and JavaScript.
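The exception approach might look like the rule set below, checked here with Python's standard urllib.robotparser. The /assets/ layout and the helper name can_crawl are assumptions made for the example. Python's parser applies the first matching rule in file order, whereas Google documents choosing the most specific rule, so the Allow lines are listed first to get the same outcome under both interpretations.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical layout: block /assets/ but keep its CSS and JS crawlable.
ROBOTS_TXT = """\
User-agent: *
Allow: /assets/css/
Allow: /assets/js/
Disallow: /assets/
"""


def can_crawl(robots_txt: str, agent: str, url: str) -> bool:
    """Parse robots.txt text and report whether the agent may fetch the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


for url in (
    "https://www.example.com/assets/css/site.css",
    "https://www.example.com/assets/js/app.js",
    "https://www.example.com/assets/fonts/brand.woff2",
):
    print(url, can_crawl(ROBOTS_TXT, "Googlebot", url))
```

Note that urllib.robotparser does not interpret the * and $ wildcards inside paths, so this kind of check is only suitable for plain prefix rules.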
Absence of XML Sitemap URL
This aspect pertains more to SEO considerations than anything else. You have the option to include the URL of your XML sitemap in your robots.txt file. Since this is typically the first location Googlebot checks when crawling your site, it provides the crawler with a head start in understanding the structure and key pages of your site.
While not strictly an error—omitting a sitemap shouldn’t detrimentally affect the fundamental functionality and appearance of your website in search results—it’s still beneficial to append your sitemap URL to robots.txt if you aim to enhance your SEO strategies.
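A quick way to confirm that a sitemap is declared is to read the Sitemap lines back out of the file. The sketch below uses urllib.robotparser from the Python standard library (site_maps() needs Python 3.8 or later); the helper name sitemap_urls and the example.com URLs are placeholders.

```python
from urllib.robotparser import RobotFileParser


def sitemap_urls(robots_txt: str) -> list[str]:
    """Return every sitemap URL declared in the robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    # site_maps() returns None when no Sitemap line is present.
    return parser.site_maps() or []


with_sitemap = """\
User-agent: *
Disallow: /tmp/

Sitemap: https://www.example.com/sitemap.xml
"""

print(sitemap_urls(with_sitemap))
# prints ['https://www.example.com/sitemap.xml']
print(sitemap_urls("User-agent: *\nDisallow: /tmp/\n"))
# prints []
```

Unlike Allow and Disallow rules, the Sitemap line is expected to contain a full URL.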
Opting for Absolute URLs
While using absolute URLs in elements like canonicals and hreflang tags aligns with best practices, the opposite holds true for URLs within the robots.txt file. Using relative paths in the robots.txt file is the preferred method for specifying which sections of a site should be excluded from crawler access.
Google’s robots.txt documentation elaborates on this, stating:
“A directory or page, relative to the root domain, that may be crawled by the user agent just mentioned.”
When using an absolute URL, there’s no assurance that crawlers will interpret it as intended or adhere to the disallow/allow rule.
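A small check like the following can flag rules that were written with absolute URLs. It is a sketch under simple assumptions (one directive per line, # starts a comment), and the function name absolute_url_rules is hypothetical.

```python
def absolute_url_rules(robots_txt: str) -> list[tuple[int, str]]:
    """Return (line number, line) for Allow/Disallow rules using a full URL."""
    findings = []
    for number, raw in enumerate(robots_txt.splitlines(), start=1):
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        field, sep, value = line.partition(":")
        if not sep:
            continue  # blank line or something that is not a directive
        if field.strip().lower() in ("allow", "disallow"):
            if value.strip().lower().startswith(("http://", "https://")):
                findings.append((number, line))
    return findings


sample = """\
User-agent: *
Disallow: https://www.example.com/private/
Disallow: /tmp/
Sitemap: https://www.example.com/sitemap.xml
"""

print(absolute_url_rules(sample))
# prints [(2, 'Disallow: https://www.example.com/private/')]
```

The flagged rule would be rewritten as Disallow: /private/ so that it is relative to the root of the domain.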
Accessing Development Sites
Blocking crawlers from your live website is a serious mistake, but so is allowing them to crawl and index pages that are still in development. It’s considered best practice to add a disallow directive to the robots.txt file of a website under construction, so that it stays out of search results until it is finished.
Similarly, it’s important to remove the disallow directive upon launching a finished website. Forgetting to eliminate this directive from robots.txt is a common oversight among web developers, potentially hindering the proper crawling and indexing of your entire site.
If your development site appears to be attracting real-world traffic or your recently launched website isn’t performing well in search results, inspect your robots.txt file for a universal user agent disallow rule:
User-Agent: *
Disallow: /
If you detect this rule when it shouldn’t be present (or don’t see it when it should), make the necessary adjustments to your robots.txt file and verify that your website’s search visibility updates accordingly.
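This check is easy to automate, for example as part of a launch checklist. The sketch below uses Python's standard urllib.robotparser to test whether a given crawler is locked out of the site root; the helper name blocks_whole_site is hypothetical.

```python
from urllib.robotparser import RobotFileParser


def blocks_whole_site(robots_txt: str, agent: str = "Googlebot") -> bool:
    """True when the rules stop the given crawler from fetching the site root."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(agent, "/")


staging_rules = "User-Agent: *\nDisallow: /\n"
live_rules = "User-Agent: *\nDisallow: /admin/\n"

print(blocks_whole_site(staging_rules))  # prints True
print(blocks_whole_site(live_rules))     # prints False
```

On a development site the expected answer is True; on a launched site it should be False.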
Deprecated and Unsupported Elements
Although the guidelines for robots.txt files have remained relatively unchanged over time, two elements that are still frequently included are not supported by Google:
- Crawl-delay
- Noindex
While Bing does support crawl-delay, Google does not, yet it is often specified by webmasters. Previously, crawl settings could be configured in Google Search Console, but this feature was removed toward the end of 2023.
In July 2019, Google announced it would no longer support the noindex directive in robots.txt files. Until then, webmasters could place a noindex directive in their robots.txt file, though the practice was neither widely supported nor standardized. The preferred way to implement noindex was, and remains, the on-page robots meta tag and/or the X-Robots-Tag HTTP header at the page level.
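Leftover lines of this kind are easy to find with a short scan. The following sketch lists any crawl-delay or noindex lines in a file; the set of directive names comes from this section, while the function name and sample file are made up for illustration.

```python
# Directives this section describes as not supported by Google.
UNSUPPORTED_BY_GOOGLE = {"crawl-delay", "noindex"}


def unsupported_lines(robots_txt: str) -> list[tuple[int, str]]:
    """Return (line number, line) for directives Google does not act on."""
    findings = []
    for number, raw in enumerate(robots_txt.splitlines(), start=1):
        line = raw.split("#", 1)[0].strip()  # ignore comments
        field, sep, _value = line.partition(":")
        if sep and field.strip().lower() in UNSUPPORTED_BY_GOOGLE:
            findings.append((number, line))
    return findings


legacy = """\
User-agent: *
Crawl-delay: 10
Noindex: /old-campaign/
Disallow: /tmp/
"""

print(unsupported_lines(legacy))
# prints [(2, 'Crawl-delay: 10'), (3, 'Noindex: /old-campaign/')]
```

A crawl-delay line may still be worth keeping for Bing; a noindex line should be replaced with a page-level noindex.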
How to Fix a Robots.txt Error
If you find that a robots.txt problem is hurting your site’s visibility, the first step is to fix the robots.txt file and then test that the new directives behave as intended. SEO crawling tools can speed this up considerably, saving you from having to wait for the search engines to recrawl your site.
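Before uploading a corrected file, one option is to compare the old and new rules against a list of URLs you care about. The sketch below does this with Python's standard urllib.robotparser; the function name crawl_changes and the sample paths are hypothetical, and because this parser ignores * and $ wildcards in paths it is only a rough stand-in for a dedicated testing tool.

```python
from urllib.robotparser import RobotFileParser


def _load(robots_txt: str) -> RobotFileParser:
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def crawl_changes(before: str, after: str, urls, agent: str = "Googlebot"):
    """Map each URL whose crawlability differs to (allowed_before, allowed_after)."""
    old, new = _load(before), _load(after)
    changes = {}
    for url in urls:
        was, now = old.can_fetch(agent, url), new.can_fetch(agent, url)
        if was != now:
            changes[url] = (was, now)
    return changes


broken = "User-agent: *\nDisallow: /\n"
fixed = "User-agent: *\nDisallow: /admin/\n"

print(crawl_changes(broken, fixed, ["/", "/products/", "/admin/login"]))
# prints {'/': (False, True), '/products/': (False, True)}
```

URLs that do not appear in the result behave the same under both files, which makes unintended side effects of a fix easier to spot.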
Once you have confirmed that the robots.txt file works as intended, request a recrawl of your website. Google Search Console and Bing Webmaster Tools can both help here: submit an updated sitemap and request a recrawl of any pages that were incorrectly delisted.
Ultimately you are at the mercy of Googlebot’s schedule, and there is no guarantee of how long it will take for missing pages to reappear in the Google search index. The steps above help shorten that period; beyond them, all you can do is keep checking until Googlebot has picked up the corrected robots.txt file.
Summary
When it comes to robots.txt errors, prevention is indeed preferable to remediation. For a large revenue-generating website, a single mistake, such as an errant wildcard that eliminates your site from Google, can instantly impact earnings. Modifications to robots.txt should be executed with caution by professional developers, thoroughly verified, and, when appropriate, validated by a second opinion.
Whenever feasible, testing in a sandbox editor before deploying live on your production server can help avert inadvertent availability issues. Identify the issue, rectify the robots.txt file as needed, and resubmit your sitemap for a fresh crawl.