Traditionally, you would use a robots.txt file on your server to manage which pages, folders, subdomains, or other content search engines are allowed to crawl. But did you know that there’s also such a thing as the X-Robots-Tag HTTP header? In this post, we’ll discuss what the possibilities are and how this might be a better option for your blog.
Quick recap: robots.txt
Before we continue, let’s take a look at what a robots.txt file does. In a nutshell, it tells search engines not to crawl a particular page, file, or directory of your website.
Using it helps both you and search engines such as Google. By keeping crawlers out of certain unimportant areas of your website, you can save on your crawl budget and reduce load on your server.
Please note that using the robots.txt file to hide your entire website from search engines is definitely not recommended.
Say hello to X-Robots-Tag
Back in 2007, Google announced that they added support for the X-Robots-Tag directive. What this meant was that you could not only restrict search engine access via a robots.txt file, you could also programmatically set various robots-related directives in the headers of an HTTP response.
Now, you might be thinking “But can’t I just use the robots meta tag instead?”. The answer is yes. And no. If you plan on programmatically keeping a particular page that is written in HTML out of the search results, then using the meta tag should suffice. But if you plan on doing the same for, let’s say, an image, which has no HTML head to put a meta tag in, then you could use the HTTP response approach to do this in code. Obviously, you can always use the latter method if you don’t feel like adding additional HTML to your website.
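To make the difference concrete, here is a minimal sketch in Python (not the PHP used later in this post) of a tiny WSGI application that puts the directive in a meta tag for HTML pages and in the X-Robots-Tag header for images. The names `app`, `call`, and `IMAGE_SUFFIXES`, as well as the example paths, are assumptions made up for this illustration.

```python
# Assumed list of image extensions for this sketch; adjust to your own site.
IMAGE_SUFFIXES = (".png", ".jpg", ".jpeg", ".gif")


def app(environ, start_response):
    """Tiny WSGI app: meta tag for HTML, X-Robots-Tag header for images."""
    path = environ.get("PATH_INFO", "/")
    if path.lower().endswith(IMAGE_SUFFIXES):
        # An image has no <head>, so the directive travels in the headers.
        headers = [
            ("Content-Type", "application/octet-stream"),
            ("X-Robots-Tag", "noindex"),
        ]
        body = b""  # placeholder; a real app would return the image bytes
    else:
        # An HTML page can carry the same directive in a robots meta tag.
        headers = [("Content-Type", "text/html; charset=utf-8")]
        body = (
            b'<html><head><meta name="robots" content="noindex"></head>'
            b"<body>Hello</body></html>"
        )
    start_response("200 OK", headers)
    return [body]


def call(path):
    """Invoke the app directly and return (status, headers as dict, body)."""
    captured = {}

    def start_response(status, headers):
        captured["status"] = status
        captured["headers"] = dict(headers)

    body = b"".join(app({"PATH_INFO": path}, start_response))
    return captured["status"], captured["headers"], body


image_status, image_headers, image_body = call("/uploads/photo.JPG")
page_status, page_headers, page_body = call("/about/")
print(image_headers.get("X-Robots-Tag"))
print("X-Robots-Tag" in page_headers)
```

Calling the app directly like this avoids starting a server; in a real deployment the same function could be served by any WSGI server.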
X-Robots-Tag directives
As Sebastian explained in 2008, there are two different kinds of directives: crawler directives and indexer directives. I’ll briefly explain the difference below.
Crawler directives
The robots.txt file only contains the so-called ‘crawler directives’, which tell search engines where they are or aren’t allowed to go. By using the Allow directive, you can specify where search engines are allowed to crawl. Disallow does the exact opposite. Additionally, you can use the Sitemap directive to help search engines out and crawl your website even faster.
Note that it’s also possible to fine-tune the directives for a specific search engine by using the User-agent directive in combination with the other directives.
As Sebastian points out and explains thoroughly in another post, pages can still show up in search results if enough links point to them, even though you have explicitly excluded them with the Disallow directive. This basically means that if you really want to hide something from the search engines, and thus from people using search, robots.txt won’t suffice.
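The crawler directives described above can be tried out locally with the `urllib.robotparser` module from Python's standard library. The sketch below is only an illustration: the robots.txt content, the paths, the sitemap URL, and the bot name ExampleBot are all made up. Note that Python's parser applies the first rule that matches, whereas search engines may pick the most specific rule, so the Allow line is deliberately placed before the broader Disallow line so that both readings agree.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; every path and URL here is an assumption.
ROBOTS_TXT = """\
User-agent: *
Allow: /private/press-kit.html
Disallow: /private/

User-agent: ExampleBot
Disallow: /

Sitemap: https://example.com/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Rules under "User-agent: *" apply to any crawler without its own group.
can_fetch_blog = parser.can_fetch("Googlebot", "https://example.com/blog/")
can_fetch_private = parser.can_fetch("Googlebot", "https://example.com/private/notes.html")
can_fetch_press_kit = parser.can_fetch("Googlebot", "https://example.com/private/press-kit.html")

# ExampleBot has its own group, which disallows everything.
examplebot_can_fetch_blog = parser.can_fetch("ExampleBot", "https://example.com/blog/")

# site_maps() is available from Python 3.8 onwards.
sitemaps = parser.site_maps()

print(can_fetch_blog, can_fetch_private, can_fetch_press_kit)
print(examplebot_can_fetch_blog, sitemaps)
```

Keep in mind that this only answers the question "may this URL be crawled?". It says nothing about whether the URL can still end up in the search results.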
Indexer directives
Indexer directives are directives that are set on a per-page and/or per-element basis. Up until July 2007, there were two directives: the microformat rel="nofollow", which means that the link should not pass authority / PageRank, and the Meta Robots tag.
With the Meta Robots tag, you can really prevent search engines from showing pages you want to keep out of the search results. The same result can be achieved with the X-Robots-Tag HTTP header. As described earlier, the X-Robots-Tag gives you more flexibility by also allowing you to control how specific files and file types are indexed.
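To illustrate how a consumer of the header might read indexer directives, here is a rough Python sketch that splits an X-Robots-Tag value into individual directives. The function names are hypothetical, and the parser is deliberately simplified: it assumes a plain comma-separated list and does not handle values that are prefixed with a specific crawler name or that contain a date.

```python
def parse_x_robots_tag(value):
    """Split a value such as "noindex, nofollow" into a set of directives."""
    return {part.strip().lower() for part in value.split(",") if part.strip()}


def may_index(directives):
    """Return False if the directives ask to keep the resource out of the index."""
    # "none" is shorthand for "noindex, nofollow".
    return not ({"noindex", "none"} & directives)


def may_follow(directives):
    """Return False if the directives ask crawlers not to follow links."""
    return not ({"nofollow", "none"} & directives)


directives = parse_x_robots_tag("NoIndex , nofollow")
print(sorted(directives))
print(may_index(directives), may_follow(directives))
```

The same two helper functions would work on the content attribute of a Meta Robots tag, since both carry the same kind of comma-separated directive list.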
Example uses of the X-Robots-Tag
Theory is nice and all, but let’s see how you could use the X-Robots-Tag in the wild!
If you want to prevent search engines from showing files you’ve generated with PHP, you could add the following at the top of the header.php file, before any output is sent:
header("X-Robots-Tag: noindex", true);
This would not prevent search engines from following the links on those pages. If you want to do that, then alter the previous example as follows:
header("X-Robots-Tag: noindex, nofollow", true);
Now, although using this method in PHP has its benefits, you’ll most likely end up wanting to block specific filetypes altogether. The more practical approach would be to add the X-Robots-Tag to your Apache server configuration or a .htaccess file.
Imagine you run a website which also has some .doc files, but you don’t want search engines to index that file type for a particular reason. On Apache servers (with mod_headers enabled), you should add the following lines to the configuration or a .htaccess file:
<FilesMatch "\.doc$">
Header set X-Robots-Tag "noindex, noarchive, nosnippet"
</FilesMatch>
Or, if you want to do this for both .doc and .pdf files:
<FilesMatch "\.(doc|pdf)$">
Header set X-Robots-Tag "noindex, noarchive, nosnippet"
</FilesMatch>
If you’re running Nginx instead of Apache, you can get a similar result by adding the following to the server configuration:
location ~* \.(doc|pdf)$ {
    add_header X-Robots-Tag "noindex, noarchive, nosnippet";
}
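The file patterns in these server snippets are regular expressions, in which an unescaped dot matches any character, so a pattern meant to match a file extension is normally written with an escaped dot (`\.`). The Python sketch below shows the difference on a few made-up file names. Python's `re` module is not the regex engine Apache or Nginx use, but for a pattern this simple the behaviour should be the same.

```python
import re

# Unescaped dot: "any character followed by doc or pdf at the end".
unescaped = re.compile(r".(doc|pdf)$")
# Escaped dot: "a literal dot followed by doc or pdf at the end".
escaped = re.compile(r"\.(doc|pdf)$")

# Hypothetical file names, chosen to show the difference.
names = ["report.pdf", "notes.doc", "mydoc", "index.html"]

matches_unescaped = [name for name in names if unescaped.search(name)]
matches_escaped = [name for name in names if escaped.search(name)]

print(matches_unescaped)  # also contains "mydoc"
print(matches_escaped)    # only the real .pdf and .doc files
```

With the unescaped pattern, a URL that merely ends in the letters "doc" would receive the header too, which is rarely what you want.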
There are cases in which the robots.txt file itself might show up in search results. By using a variation of the previous method, you can prevent this from happening to your website:
<FilesMatch "^robots\.txt$">
Header set X-Robots-Tag "noindex"
</FilesMatch>
And in Nginx:
location = /robots.txt {
    add_header X-Robots-Tag "noindex";
}
Conclusion
As you can see based on the examples above, the X-Robots-Tag HTTP header is a very powerful tool. Use it wisely and with caution, as you won’t be the first to block your entire site by accident. Nevertheless, it’s a great addition to your toolset if you know how to use it.