Block your site’s search result pages

Why should you block your internal search result pages from Google? Well, how would you feel if you were in dire need of an answer to your search query and ended up on the internal search pages of some website? That's one crappy experience. Google thinks so too, and prefers that you not have these internal search pages indexed.

Google considers these search results pages to be of lower quality than your actual informational pages. That doesn’t mean these internal search pages are useless, but it makes sense to block these internal search pages.

Back in 2007

Ten years ago, Google, or more specifically Matt Cutts, told us that we should block these pages in our robots.txt. The reason for that:

Typically, web search results don’t add value to users, and since our core goal is to provide the best search results possible, we generally exclude search results from our web search index. (Not all URLs that contains things like “/results” or “/search” are search results, of course.)
– Matt Cutts (2007)

Nothing changed, really. Even after 10 years of SEO changes, this remains the same. The Google Webmaster Guidelines still state that you should “Use the robots.txt file on your web server to manage your crawling budget by preventing crawling of infinite spaces such as search result pages.” Furthermore, the guidelines state that webmasters should avoid techniques like automatically generated content, in this case, “Stitching or combining content from different web pages without adding sufficient value”.

However, blocking internal search pages in your robots.txt doesn’t seem the right solution. In 2007, it even made more sense to simply redirect the user to the first result of these internal search pages. These days, I’d rather use a slightly different solution.

Blocking internal search pages in 2017

I believe nowadays, using a noindex, follow meta robots tag is the way to go instead. It seems Google ‘listens’ to that meta robots tag and sometimes ignores the robots.txt. That happens, for instance, when a surplus of backlinks to a blocked page tells Google it is of interest to the public anyway. We’ve already mentioned this in our Ultimate guide to robots.txt.

The 2007 reason is still the same in 2017, by the way: linking to search pages from search pages delivers a poor experience for a visitor. For Google, on a mission to deliver the best result for your query, it makes a lot more sense to link directly to an article or another informative page.

Yoast SEO will block internal search pages for you

If you’re on WordPress and using our plugin, you’re fine. We’ve got you covered:

Block internal search pages

That’s located at SEO › Titles & Metas › Archives. Most other content management systems allow for templates for your site’s search results as well, so adding a simple line of code to that template will suffice:
<meta name="robots" content="noindex,follow"/>

Meta robots AND robots.txt?

If you're planning to block internal search pages both by adding that meta robots tag and by disallowing them in your robots.txt, think again: the meta robots tag alone will do. If Google obeys your robots.txt, it will never crawl those pages and therefore never see the meta robots tag, so the follow part can't do its job and you risk losing the link value of those pages. That's not what you want. So just use the meta robots tag!

Back to you

Did you block your internal search results? And how did you do that? Go check for yourself! Any further insights or experiences are appreciated; just drop us a line in the comments.

Read more: ‘Robots.txt: the ultimate guide’ »

SEO basics: What is crawlability?

Ranking in the search engines requires a website with flawless technical SEO. Luckily, the Yoast SEO plugin takes care of (almost) everything on your WordPress site. Still, if you really want to get the most out of your website and keep outranking the competition, some basic knowledge of technical SEO is a must. In this post, I'll explain one of the most important concepts of technical SEO: crawlability.

What is the crawler again?

A search engine like Google consists of a crawler, an index and an algorithm. The crawler follows the links. When Google’s crawler finds your website, it’ll read it and its content is saved in the index.

A crawler follows the links on the web. A crawler is also called a robot, a bot, or a spider. It goes around the internet 24/7. Once it comes to a website, it saves the HTML version of a page in a gigantic database, called the index. This index is updated every time the crawler comes around your website and finds a new or revised version of it. Depending on how important Google deems your site and the amount of changes you make on your website, the crawler comes around more or less often.

Read more: ‘SEO basics: what does Google do’ »

And what is crawlability?

Crawlability has to do with the possibilities Google has to crawl your website. Crawlers can be blocked from your site. There are a few ways to block a crawler from your website. If your website or a page on your website is blocked, you’re saying to Google’s crawler: “do not come here”. Your site or the respective page won’t turn up in the search results in most of these cases.
There are a few things that could prevent Google from crawling (or indexing) your website:

  • If your robots.txt file blocks the crawler, Google will not come to your website or specific web page.
  • Before crawling your website, the crawler will take a look at the HTTP header of your page. This HTTP header contains a status code. If this status code says that a page doesn’t exist, Google won’t crawl your website. In the module about HTTP headers of our (soon to be launched!) Technical SEO training we’ll tell you all about that.
  • If the robots meta tag on a specific page blocks the search engine from indexing that page, Google will crawl that page, but won’t add it to its index.

This flow chart might help you understand the process bots follow when attempting to index a page:

Want to learn all about crawlability?

Although crawlability is just the very basics of technical SEO (it has to do with all the things that enable Google to index your site), for most people it’s already pretty advanced stuff. Nevertheless, if you’re blocking – perhaps even without knowing! – crawlers from your site, you’ll never rank high in Google. So, if you’re serious about SEO, this should matter to you.

If you really want to understand all the technical aspects concerning crawlability, you should definitely check out our Technical SEO 1 training, which will be released this week. In this SEO course, we’ll teach you how to detect technical SEO issues and how to solve them (with our Yoast SEO plugin).

Keep reading: ‘How to get Google to crawl your site faster’ »

 

Ask Yoast: should I redirect my affiliate links?

There are several reasons for cloaking or redirecting affiliate links. For instance, it’s easier to work with affiliate links when you redirect them, plus you can make them look prettier. But do you know how to cloak affiliate links? We explained how the process works in one of our previous posts. This Ask Yoast is about the method of cloaking affiliate links we gave you in that post. Is it still a good idea to redirect affiliate links via the script we described?

Elias Nilson emailed us, saying that he read our article about cloaking affiliate links and he’s wondering if the solution is still up-to-date.

“Is it still a good idea to redirect affiliate links via the script you describe in your post?”

Check out the video or read the answer below!

Redirect affiliate links

Read this transcript to figure out if it is still a valid option to redirect affiliate links via the described script. Want to see the script directly? Read this post: ‘How to cloak affiliate links’:

Honestly, yes. Recently we updated the post about cloaking affiliate links, so the post, and therefore the script, is still up to date. Link cloaking (which sounds negative because we use the word 'cloaking') is basically hiding from Google that you're an affiliate. And if you're an affiliate, that's still what you want to do, because Google usually ranks original content that is not by affiliates better than it ranks affiliate content.

So, yes, I'd still recommend that method. The link will be below this post, so you can see the original post that we are referencing. It's a very simple method to cloak your affiliate links and I think it works in probably the best way that I know.
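For context, here's a minimal sketch of what such a redirect script can look like. This is not the exact script from the post we reference; the file name (/out/index.php), the lookup array and the URLs below are assumptions purely for illustration, so follow the linked post for the actual method:

<?php
// Hypothetical /out/index.php: a minimal affiliate redirect sketch.
// Keep the /out/ directory disallowed in robots.txt so search engines
// don't crawl these redirect URLs.
$affiliate_links = array(
	'example-product' => 'https://partner.example.com/product?ref=your-id',
);

$id = isset( $_GET['id'] ) ? $_GET['id'] : '';

// Make sure the redirect itself is never indexed or followed.
header( 'X-Robots-Tag: noindex, nofollow', true );

if ( isset( $affiliate_links[ $id ] ) ) {
	// Send the visitor on to the affiliate URL with a temporary redirect.
	header( 'Location: ' . $affiliate_links[ $id ], true, 302 );
} else {
	// Unknown ID: fall back to the homepage.
	header( 'Location: https://www.example.com/', true, 302 );
}
exit;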

So, keep going. Good luck.

Ask Yoast

In the series Ask Yoast we answer SEO questions from followers. Need help with SEO? Let us help you out! Send your question to ask@yoast.com.

Read more: ‘How to cloak your affiliate links’ »

Ask Yoast: nofollow layered navigation links?

If you have a big eCommerce site with lots of products, layered navigation can help your users narrow down their search results. Layered or faceted navigation is an advanced way of filtering by providing groups of filters for (many) product attributes. This filtering process can create a lot of URLs though, because users can filter, and thereby group, items in many ways, and each of those groups becomes available on a separate URL. So what should you do with all these URLs? Do you want Google to crawl them all?

In this Ask Yoast, we’ll answer a question from Daniel Jacobsen:

“Should I nofollow layered navigation links? And if so, why? Are there any disadvantages of this?”

Check out the video or read the answer below!

Layered navigation links

Read this transcript to learn how to deal with layered or faceted navigation links:

“The question is: “Why would you want to do that?” If you have too many URLs, so if you have a layered or a faceted navigation that has far too many options – creating billions of different types of URLs for Google to crawl – then probably yes. At the same time you need to ask yourself: “Why does my navigation work that way?” And, “Can we make it any different?” But in a lot of eCommerce systems that’s very hard. So in those cases adding a nofollow to those links does actually help to prevent Google from indexing each and every one of the versions of your site.

I’ve worked on a couple of sites with faceted navigation that had over a billion variations in URLs, even though they only had like 10,000 products. If that’s the sort of problem you have, then yes, you need to nofollow them and maybe you even need to use your robots.txt file to exclude some of those variants. For specific stuff that you don’t want indexed, for instance if you don’t want color indexed, you could add a robots.txt line that says: “Disallow everything that has color in the URL”. At that point you strip down what Google crawls and what it thinks is important. The problem with that is that if Google has links pointing at that version from somewhere else, those links don’t count for your site’s ranking either.

So it’s a bit of a quid pro quo, where you have to think about what is the best thing to do. It’s a tough decision. I really would suggest getting an experienced technical SEO to look at your site if it really is a problem, because it’s not a simple cut-and-paste solution that works the same for every site.

Good luck!”
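As a rough illustration of the robots.txt approach mentioned in the transcript, a wildcard disallow for a single filter could look something like the lines below. This assumes the color filter lives in a query parameter called color; adjust the pattern to your own URL structure:

User-agent: *
# Block any URL whose query string contains the color filter.
Disallow: /*?*color=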

Ask Yoast

In the series Ask Yoast we answer SEO questions from followers! Need help with SEO? Let us help you out! Send your question to ask@yoast.com.

Read more: ‘Internal search for online shops: an essential asset’ »

Playing with the X-Robots-Tag HTTP header

Traditionally, you will use a robots.txt file on your server to manage what pages, folders, subdomains or other content search engines will be allowed to crawl. But did you know that there’s also such a thing as the X-Robots-Tag HTTP header? In this post we’ll discuss what the possibilities are and how this might be a better option for your blog.

Quick recap: robots.txt

Before we continue, let’s take a look at what a robots.txt file does. In a nutshell, what it does is tell search engines to not crawl a particular page, file or directory of your website.

Using this helps both you and search engines such as Google. By not providing access to certain, unimportant areas of your website, you can save on your crawl budget and reduce the load on your server.

Please note that using the robots.txt file to hide your entire website from search engines is definitely not recommended.

Say hello to X-Robots-Tag

Back in 2007, Google announced that it had added support for the X-Robots-Tag directive. What this meant was that you could not only restrict access for search engines via a robots.txt file, but could also set robots directives programmatically in the headers of an HTTP response. Now, you might be thinking “But can’t I just use the robots meta tag instead?”. The answer is yes. And no. If you plan on programmatically blocking a particular page that is written in HTML, then using the meta tag should suffice. But if you want to keep, let’s say, an image or another non-HTML file out of the index, the HTTP response approach lets you do that in code. Obviously, you can always use the latter method if you don’t feel like adding additional HTML to your website.

X-Robots-Tag directives

As Sebastian explained in 2008, there are two different kinds of directives: crawler directives and indexer directives. I’ll briefly explain the difference below.

Crawler directives

The robots.txt file only contains the so-called ‘crawler directives’, which tell search engines where they are or aren’t allowed to go. By using the Allow directive, you can specify where search engines are allowed to crawl; Disallow does the exact opposite. Additionally, you can use the Sitemap directive to help search engines out and let them crawl your website even faster.

Note that it’s also possible to fine-tune the directives for a specific search engine by using the User-agent directive in combination with the other directives.

As Sebastian points out and explains thoroughly in another post, pages can still show up in search results if there are enough links pointing to them, even when they are explicitly excluded with the Disallow directive. This basically means that if you really want to hide something from the search engines, and thus from people using search, robots.txt won’t suffice.

Indexer directives

Indexer directives are directives that are set on a per-page and/or per-element basis. Up until July 2007, there were two directives: the microformat rel=”nofollow”, which means that the link it’s applied to should not pass authority / PageRank, and the meta robots tag.

With the Meta Robots tag, you can really prevent search engines from showing pages you want to keep out of the search results. The same result can be achieved with the X-Robots-Tag HTTP header. As described earlier, the X-Robots-Tag gives you more flexibility by also allowing you to control how specific file(types) are indexed.

Example uses of the X-Robots-Tag

Theory is nice and all, but let’s see how you could use the X-Robots-Tag in the wild!

If you want to prevent search engines from showing files you’ve generated with PHP, you could add the following at the top of your header.php file, before any output is sent:

header("X-Robots-Tag: noindex", true);

This would not prevent search engines from following the links on those pages. If you want to do that, then alter the previous example as follows:

header("X-Robots-Tag: noindex, nofollow", true);

Now, although using this method in PHP has its benefits, you’ll most likely end up wanting to block specific filetypes altogether. The more practical approach would be to add the X-Robots-Tag to your Apache server configuration or a .htaccess file.

Imagine you run a website which also has some .doc files, but you don’t want search engines to index that filetype for a particular reason. On Apache servers, you could add the following to the configuration or to a .htaccess file:

<FilesMatch "\.doc$">
Header set X-Robots-Tag "noindex, noarchive, nosnippet"
</FilesMatch>

Or, if you’d want to do this for both .doc and .pdf files:

<FilesMatch "\.(doc|pdf)$">
Header set X-Robots-Tag "noindex, noarchive, nosnippet"
</FilesMatch>

If you’re running Nginx instead of Apache, you can get a similar result by adding the following to the server configuration:

location ~* \.(doc|pdf)$ {
	add_header X-Robots-Tag "noindex, noarchive, nosnippet";
}

There are cases in which the robots.txt file itself might show up in search results. By using an alteration of the previous method, you can prevent this from happening to your website:

<FilesMatch "robots\.txt">
Header set X-Robots-Tag "noindex"
</FilesMatch>

And in Nginx:

location = /robots.txt {
	add_header X-Robots-Tag "noindex";
}
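After adding any of these snippets, it's worth checking that the header actually comes through. One quick way (example.com and document.pdf are placeholders for your own domain and file) is to request only the headers of an affected file and look for the X-Robots-Tag line in the response:

curl -I https://example.com/document.pdf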

Conclusion

As you can see based on the examples above, the X-Robots-Tag HTTP header is a very powerful tool. Use it wisely and with caution, as you won’t be the first to block your entire site by accident. Nevertheless, it’s a great addition to your toolset if you know how to use it.

Read more: ‘Meta robots tag: the ultimate guide’ »

Noindex a post or page in WordPress, the easy way!

Some posts and pages should not show up in search results. To make sure they don’t show up, you should tell search engines to exclude them. You do this with a meta robots noindex tag. Setting a page to noindex makes sure search engines never show it in their results. Here, we’ll explain how easy it is to noindex a post in WordPress if you use Yoast SEO.

Why keep a post out of the search results?

Why would you NOT want a page to show up in the search results? Well, most sites have pages that shouldn’t show up in the search results. For example, you might not want people to land on the ‘thank you’ page you redirect people to when they’ve contacted you. Or your ‘checkout success’ page. Finding those pages in Google is of no use to anyone.

Not sure if you should noindex or nofollow a post? Read Michiel’s post: Which pages should I noindex or nofollow?

How to set a post to noindex with Yoast SEO

Setting a post or page to noindex is simple when you’re running Yoast SEO. Below your post, in the Yoast SEO meta box, just click on the Advanced tab:

The Advanced tab in the Yoast SEO meta box harbours the indexing options

On the Advanced tab, you’ll see some questions. The first is: “Allow search engines to show this post in search results?” If you select ‘Yes’, your post can show up in Google. If you select ‘No’, you’ll set the post to noindex. This means it won’t show up in the search results.

Select No from the dropdown menu to noindex this post

The default setting of the post – in this case, Yes – is the setting you’ve selected for this post type in the Search Appearance tab of Yoast SEO. If you want to prevent complete sections of your site from showing up in Google, you can set that there. This is further explained in Edwin’s post: Show x in search results?.

Please note that if the post you’re setting to noindex is already in the search results, it might take some time for the page to disappear. The search engines will first have to re-index the page to find the noindex tag. And do not noindex posts frivolously: if they were getting traffic before, you’re losing that traffic.

Were you considering using the robots.txt file to keep something out of the search results? Read why you shouldn’t use the robots.txt file for that.

Do links on noindexed pages have value?

When you set a post to noindex, Yoast SEO automatically assumes you want to set it to noindex, follow. This means that search engines will still follow the links on those pages. If you do not want the search engines to follow the links, your answer to the following question should be No:

Simply answer No if you don’t want Google to follow links on this page

This will set the meta robots to nofollow, which changes the search engines’ behavior: they’ll ignore all the links on the page. Use this with caution though! In doubt whether you need it? Just check Michiel’s post right here.
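Under the hood, these dropdowns simply control the meta robots tag that ends up in your page’s <head>. With both set to ‘No’, the output will look roughly like this; the exact attributes may differ slightly per Yoast SEO version:

<meta name="robots" content="noindex, nofollow" />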

PS. Did you noindex a post or page in WordPress when you didn’t mean to? No worries, as you can fix an accidental noindex easily!

Read more: The ultimate guide to the meta robots tag »

How to optimize your crawl budget

Google doesn’t always spider every page on a site instantly. In fact, sometimes, it can take weeks. This might get in the way of your SEO efforts. Your newly optimized landing page might not get indexed. At that point, it’s time to optimize your crawl budget. We’ll discuss what a ‘crawl budget’ is and what you can do to optimize it in this article.

What is a crawl budget?

Crawl budget is the number of pages Google will crawl on your site on any given day. This number varies slightly from day to day, but overall, it’s relatively stable. Google might crawl 6 pages on your site each day, it might crawl 5,000 pages, it might even crawl 4,000,000 pages every single day. The number of pages Google crawls, your ‘budget’, is generally determined by the size of your site, the ‘health’ of your site (how many errors Google encounters) and the number of links to your site. Some of these factors are things you can influence, we’ll get to that in a bit.

How does a crawler work?

A crawler like Googlebot gets a list of URLs to crawl on a site. It goes through that list systematically. It grabs your robots.txt file every once in a while to make sure it’s still allowed to crawl each URL and then crawls the URLs one by one. Once a spider has crawled a URL and parsed its contents, it adds any new URLs it found on that page to the to-do list.

Several events can make Google feel a URL has to be crawled. It might have found new links pointing at content, or someone has tweeted it, or it might have been updated in the XML sitemap, etc, etc… There’s no way to make a list of all the reasons why Google would crawl a URL, but when it determines it has to, it adds it to the to-do list.

When is crawl budget an issue?

Crawl budget is not a problem if Google has to crawl a lot of URLs on your site and it has allotted a lot of crawls. But, say your site has 250,000 pages and Google crawls 2,500 pages on this particular site each day. It will crawl some (like the homepage) more than others. It could take up to 200 days before Google notices particular changes to your pages if you don’t act. Crawl budget is an issue now. On the other hand, if it crawls 50,000 a day, there’s no issue at all.

To quickly determine whether your site has a crawl budget issue, follow the steps below. This does assume your site has a relatively small number of URLs that Google crawls but doesn’t index (for instance because you added meta noindex).

  1. Determine how many pages you have on your site; the number of URLs in your XML sitemaps might be a good start.
  2. Go into Google Search Console.
  3. Go to “Legacy Tools” -> “Crawl stats” and take note of the average pages crawled per day.
  4. Divide the number of pages by the “Average crawled per day” number.
  5. If you end up with a number higher than ~10 (so you have 10x more pages than what Google crawls each day), you should optimize your crawl budget. If you end up with a number lower than 3, you can go read something else.

What URLs is Google crawling?

You really should know which URLs Google is crawling on your site. The only ‘real’ way of knowing that is looking at your site’s server logs. For larger sites, I personally prefer using Logstash + Kibana. For smaller sites, the guys at Screaming Frog have released quite a nice little tool, aptly called SEO Log File Analyser (note the S, they’re Brits).

Get your server logs and look at them

Depending on your type of hosting, you might not always be able to grab your log files. However, if you even so much as think you need to work on crawl budget optimization because your site is big, you should get them. If your host doesn’t allow you to get them, it’s time to change hosts.

Fixing your site’s crawl budget is a lot like fixing a car. You can’t fix it by looking at the outside, you’ll have to open up that engine. Looking at logs is going to be scary at first. You’ll quickly find that there is a lot of noise in logs. You’ll find a lot of commonly occurring 404s that you think are nonsense. But you have to fix them. You have to get through the noise and make sure your site is not drowned in tons of old 404s.

Read more: Website maintenance: Check and fix 404 error pages »

Increase your crawl budget

Let’s look at the things that actually improve how many pages Google can crawl on your site.

Website maintenance: reduce errors

Step one in getting more pages crawled is making sure that the pages that are crawled return one of two possible return codes: 200 (for “OK”) or 301 (for “Go here instead”). All other return codes are not OK. To figure this out, you have to look at your site’s server logs. Google Analytics and most other analytics packages will only track pages that served a 200. So you won’t find many of the errors on your site in there.

Once you’ve got your server logs, try to find common errors and fix them. The simplest way of doing that is by grabbing all the URLs that didn’t return 200 or 301 and then ordering them by how often they were accessed. Fixing an error might mean that you have to fix code. Or you might have to redirect a URL elsewhere. If you know what caused the error, you can try to fix the source too.

Another good source to find errors is Google Search Console. Read this post by Michiel for more info on that. If you’ve got Yoast SEO Premium, you can even redirect them away easily using the redirects manager.

Block parts of your site

If you have sections of your site that really don’t need to be in Google, block them using robots.txt. Only do this if you know what you’re doing, of course. One of the common problems we see on larger eCommerce sites is that they have a gazillion ways to filter products. Every filter might add new URLs for Google. In cases like these, you really want to make sure that you’re letting Google spider only one or two of those filters and not all of them.
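As a sketch of what that can look like, say your faceted navigation adds query parameters such as color and size (placeholder names; swap in your own) and you’ve decided Google shouldn’t crawl those combinations. A couple of wildcard disallows would do it:

User-agent: *
# Hypothetical filter parameters; adjust these to your own URL structure.
Disallow: /*?*color=
Disallow: /*?*size=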

Reduce redirect chains

When you 301 redirect a URL, something weird happens. Google will see that new URL and add it to the to-do list. It doesn’t always follow the redirect immediately; it just goes on with its list. When you chain redirects, for instance when you redirect non-www to www and then http to https, you have two redirects everywhere, so everything takes longer to crawl.
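One way to avoid that particular chain is to combine the hostname and protocol redirects into a single hop. As a sketch for Apache, assuming https://www.example.com is your preferred version (adjust the host accordingly), something like this in your .htaccess catches both cases with one 301:

RewriteEngine On
# Redirect anything that is not https, or not on the www host, in one step.
RewriteCond %{HTTPS} off [OR]
RewriteCond %{HTTP_HOST} !^www\. [NC]
RewriteRule ^ https://www.example.com%{REQUEST_URI} [L,R=301]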

Get more links

This is easy to say, but hard to do. Getting more links is not just a matter of being awesome, it’s also a matter of making sure others know that you’re awesome. It’s a matter of good PR and good engagement on social media. We’ve written extensively about link building; I’d suggest reading these 3 posts:

  1. Link building from a holistic SEO perspective
  2. Link building: what not to do?
  3. 6 steps to a successful link building strategy

When you have an acute indexation problem, you should definitely look at your crawl errors, blocking parts of your site and at fixing redirect chains first. Link building is a very slow method to increase your crawl budget. On the other hand: if you intend to build a large site, link building needs to be part of your process.

TL;DR: crawl budget optimization is hard

Crawl budget optimization is not for the faint of heart. If you’re doing your site’s maintenance well, or your site is relatively small, it’s probably not needed. If your site is medium-sized and well maintained, it’s fairly easy to do based on the above tricks.

Keep reading: Robots.txt: the ultimate guide »

The ultimate guide to robots.txt

The robots.txt file is one of the main ways of telling a search engine where it can and can’t go on your website. All major search engines support the basic functionality it offers, but some of them respond to some extra rules which can be useful too. This guide covers all the ways to use robots.txt on your website, but, while it looks simple, any mistakes you make in your robots.txt can seriously harm your site, so make sure you read and understand the whole of this article before you dive in.

What is a robots.txt file?

Crawl directives

The robots.txt file is one of a number of crawl directives. We have guides on all of them and you’ll find them here:

Crawl directives guides by Yoast »

A robots.txt file is a text file which is read by search engine spiders and follows a strict syntax. These spiders are also called robots – hence the name – and the syntax of the file is strict simply because it has to be computer readable. That means there’s no room for error here – something is either 1, or 0.

Also called the “Robots Exclusion Protocol”, the robots.txt file is the result of a consensus among early search engine spider developers. It’s not an official standard set by any standards organization, but all major search engines adhere to it.

What does the robots.txt file do?

humans.txt

Once upon a time, some developers sat down and decided that, since the web is supposed to be for humans, and since robots get a file on a website, the humans who built it should have one, too. So they created the humans.txt standard as a way of letting people know who worked on a website, amongst other things.

Search engines index the web by spidering pages, following links to go from site A to site B to site C and so on. Before a search engine spiders any page on a domain it hasn’t encountered before, it will open that domain’s robots.txt file, which tells the search engine which URLs on that site it’s allowed to crawl.

Search engines typically cache the contents of the robots.txt, but will usually refresh it several times a day, so changes will be reflected fairly quickly.

Where should I put my robots.txt file?

The robots.txt file should always be at the root of your domain. So if your domain is www.example.com, it should be found at https://www.example.com/robots.txt.

It’s also very important that your robots.txt file is actually called robots.txt. The name is case sensitive, so get that right or it just won’t work.

Pros and cons of using robots.txt

Pro: managing crawl budget

It’s generally understood that a search spider arrives at a website with a pre-determined “allowance” for how many pages it will crawl (or, how much resource/time it’ll spend, based on a site’s authority/size/reputation), and SEOs call this the crawl budget. This means that if you block sections of your site from the search engine spider, you can allow your crawl budget to be used for other sections.

It can sometimes be highly beneficial to block the search engines from crawling problematic sections of your site, especially on sites where a lot of SEO clean-up has to be done. Once you’ve tidied things up, you can let them back in.

A note on blocking query parameters

One situation where crawl budget is particularly important is when your site uses a lot of query string parameters to filter and sort. Let’s say you have 10 different query parameters, each with different values that can be used in any combination. This leads to hundreds if not thousands of possible URLs. Blocking all query parameters from being crawled will help make sure the search engine only spiders your site’s main URLs and won’t go into the enormous trap that you’d otherwise create.

This line blocks all URLs on your site containing a query string:

Disallow: /*?*

Con: not removing a page from search results

Even though you can use the robots.txt file to tell a spider where it can’t go on your site, you can’t use it to tell a search engine which URLs not to show in the search results – in other words, blocking it won’t stop it from being indexed. If the search engine finds enough links to that URL, it will include it; it just won’t know what’s on that page, so the result will show up as little more than a bare URL without a description.

If you want to reliably block a page from showing up in the search results, you need to use a meta robots noindex tag. That means that, in order to find the noindex tag, the search engine has to be able to access that page, so don’t block it with robots.txt.

Noindex directives

Whether adding ‘noindex’ directives to your robots.txt file lets you control indexing behaviour, and keep these ‘fragments’ from showing up in the search results, remains an ongoing area of research and contention in SEO. Test results vary, and the search engines are unclear about what is and isn’t supported.

Con: not spreading link value

If a search engine can’t crawl a page, it can’t spread the link value across the links on that page. When a page is blocked with robots.txt, it’s a dead-end. Any link value which might have flowed to (and through) that page is lost.

robots.txt syntax

WordPress robots.txt

We have an entire article on how best to setup your robots.txt for WordPress. Don’t forget you can edit your site’s robots.txt file in the Yoast SEO Tools → File editor section.

A robots.txt file consists of one or more blocks of directives, each starting with a user-agent line. The “user-agent” is the name of the specific spider it addresses. You can either have one block for all search engines, using a wildcard for the user-agent, or specific blocks for specific search engines. A search engine spider will always pick the block that best matches its name.

These blocks look like this (don’t be scared, we’ll explain below):

User-agent: * 
Disallow: /

User-agent: Googlebot
Disallow:

User-agent: bingbot
Disallow: /not-for-bing/

Directives like Allow and Disallow are not case sensitive, so it’s up to you whether you write them lowercase or capitalize them. The values, however, are case sensitive: /photo/ is not the same as /Photo/. We like to capitalize directives because it makes the file easier (for humans) to read.

The User-agent directive

The first bit of every block of directives is the user-agent, which identifies a specific spider. The user-agent field is matched against that specific spider’s (usually longer) user-agent, so for instance the most common spider from Google has the following user-agent:

Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html) 

So if you want to tell this spider what to do, a relatively simple User-agent: Googlebot line will do the trick.

Most search engines have multiple spiders. They will use a specific spider for their normal index, for their ad programs, for images, for videos, etc.

Search engines will always choose the most specific block of directives they can find. Say you have 3 sets of directives: one for *, one for Googlebot and one for Googlebot-News. If a bot comes by whose user-agent is Googlebot-Video, it would follow the Googlebot restrictions. A bot with the user-agent Googlebot-News would use the more specific Googlebot-News directives.
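To make that concrete, a sketch of those three blocks (the paths below are placeholders) could look like this. A bot identifying as Googlebot-Video would obey the Googlebot block, while Googlebot-News would obey its own, more specific block:

User-agent: *
Disallow: /not-for-any-bot/

User-agent: Googlebot
Disallow: /not-for-googlebot/

User-agent: Googlebot-News
Disallow: /not-for-googlebot-news/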

The most common user agents for search engine spiders

Here’s a list of the user-agents you can use in your robots.txt file to match the most commonly used search engines:

Search engine | Field | User-agent
Baidu | General | baiduspider
Baidu | Images | baiduspider-image
Baidu | Mobile | baiduspider-mobile
Baidu | News | baiduspider-news
Baidu | Video | baiduspider-video
Bing | General | bingbot
Bing | General | msnbot
Bing | Images & Video | msnbot-media
Bing | Ads | adidxbot
Google | General | Googlebot
Google | Images | Googlebot-Image
Google | Mobile | Googlebot-Mobile
Google | News | Googlebot-News
Google | Video | Googlebot-Video
Google | AdSense | Mediapartners-Google
Google | AdWords | AdsBot-Google
Yahoo! | General | slurp
Yandex | General | yandex

The Disallow directive

The second line in any block of directives is the Disallow line. You can have one or more of these lines, specifying which parts of the site the specified spider can’t access. An empty Disallow line means you’re not disallowing anything, so basically it means that a spider can access all sections of your site.

The example below would block all search engines that “listen” to robots.txt from crawling your site.

User-agent: * 
Disallow: /

The example below would, with only one character less, allow all search engines to crawl your entire site.

User-agent: * 
Disallow:

The example below would block Google from crawling the Photo directory on your site – and everything in it.

User-agent: googlebot 
Disallow: /Photo

This means all the subdirectories of the /Photo directory would also not be spidered. It would not block Google from crawling the /photo directory, as these lines are case sensitive.

This would also block Google from accessing URLs containing /Photo, such as /Photography/.

How to use wildcards/regular expressions

“Officially”, the robots.txt standard doesn’t support regular expressions or wildcards; however, all major search engines do understand them. This means you can use lines like this to block groups of files:

Disallow: /*.php 
Disallow: /copyrighted-images/*.jpg

In the example above, * is expanded to whatever filename it matches. Note that the rest of the line is still case sensitive, so the second line above will not block a file called /copyrighted-images/example.JPG from being crawled.

Some search engines, like Google, allow for more complicated regular expressions, but be aware that some search engines might not understand this logic. The most useful feature this adds is the $, which indicates the end of a URL. In the following example you can see what this does:

Disallow: /*.php$

This means /index.php can’t be crawled, but /index.php?p=1 could be. Of course, this is only useful in very specific circumstances and also pretty dangerous: it’s easy to unblock things you didn’t actually want to unblock.

Non-standard robots.txt crawl directives

As well as the Disallow and User-agent directives there are a couple of other crawl directives you can use. These directives are not supported by all search engine crawlers so make sure you’re aware of their limitations.

The Allow directive

While not in the original “specification”, there was talk very early on of an allow directive. Most search engines seem to understand it, and it allows for simple, and very readable directives like this:

Disallow: /wp-admin/ 
Allow: /wp-admin/admin-ajax.php

The only other way of achieving the same result without an allow directive would have been to specifically disallow every single file in the wp-admin folder.

The host directive

Supported by Yandex (and not by Google, despite what some posts say), this directive lets you decide whether you want the search engine to show example.com or www.example.com. Simply specifying it like this does the trick:

host: example.com

But because only Yandex supports the host directive, we wouldn’t advise you to rely on it, especially as it doesn’t allow you to define a scheme (http or https) either. A better solution that works for all search engines would be to 301 redirect the hostnames that you don’t want in the index to the version that you do want. In our case, we redirect www.yoast.com to yoast.com.

The crawl-delay directive

Yahoo!, Bing and Yandex can sometimes be fairly crawl-hungry, but luckily they all respond to the crawl-delay directive, which slows them down. And while these search engines have slightly different ways of reading the directive, the end result is basically the same.

A line like the one below would instruct Yahoo! and Bing to wait 10 seconds after a crawl action, while Yandex would only access your site once in every 10 seconds. It’s a semantic difference, but still interesting to know. Here’s the example crawl-delay line:

crawl-delay: 10

Do take care when using the crawl-delay directive. By setting a crawl delay of 10 seconds you’re only allowing these search engines to access 8,640 pages a day. This might seem plenty for a small site, but on large sites it isn’t very many. On the other hand, if you get next to no traffic from these search engines, it’s a good way to save some bandwidth.

The sitemap directive for XML Sitemaps

Using the sitemap directive you can tell search engines – specifically Bing, Yandex and Google – where to find your XML sitemap. You can, of course, also submit your XML sitemaps to each search engine using their respective webmaster tools solutions, and we strongly recommend you do, because search engine webmaster tools programs will give you lots of valuable information about your site. If you don’t want to do that, adding a sitemap line to your robots.txt is a good quick alternative.
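The directive itself is just a single line with the absolute URL of your sitemap. The URL below is an example; a site running Yoast SEO would typically point it at sitemap_index.xml:

Sitemap: https://www.example.com/sitemap_index.xml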

Validate your robots.txt

There are various tools out there that can help you validate your robots.txt, but when it comes to validating crawl directives, we always prefer to go to the source. Google has a robots.txt testing tool in its Google Search Console (under the ‘Old version’ menu) and we’d highly recommend using that:

robots.txt Tester

Be sure to test your changes thoroughly before you put them live! You wouldn’t be the first to accidentally use robots.txt to block your entire site, and to slip into search engine oblivion!

Read more: WordPress SEO: The definitive guide to higher rankings for WordPress sites »

hreflang: the ultimate guide

hreflang tags are a technical solution for sites that have similar content in multiple languages. The owner of a multilingual site wants search engines to send people to the content in their own language. Say a user is Dutch and the page that ranks is English, but there’s also a Dutch version. You would want Google to show the Dutch page in the search results for that Dutch user. This is the kind of problem hreflang was designed to solve.

In this (very long) article we’ll discuss what hreflang tags are for, what their SEO benefit is, and how to implement them.

hreflang tags are among the hardest specs I’ve ever seen come out of a search engine. Doing it right is tough and takes time. The aim of this guide is to prevent you from falling into common traps, so be sure to read it thoroughly if you’re embarking on an hreflang project.

Need help implementing hreflang as part of your international SEO project? Our Multilingual SEO training is designed to help you understand the process and put it into practice. You’ll have a killer international SEO strategy in no time. 

What are hreflang tags for?

hreflang tags are a method to mark up pages that are similar in meaning but aimed at different languages and/or regions. There are three common use cases:

  • Content with regional variations like en-us and en-gb.
  • Content in different languages like en, de and fr.
  • A combination of different languages and regional variations.

hreflang tags are fairly commonly used to target different markets that use the same language – for example, to differentiate between the US and the UK, or between Germany and Austria.

What’s the SEO benefit of hreflang?

So why are we even talking about hreflang? What is the SEO benefit? From an SEO point of view, there are two main reasons why you should implement it.

First of all, if you have a version of a page that you have optimized for the users’ language and location, you want them to land on that page. Having the right language and location dependent information improves their user experience and thus leads to fewer people bouncing back to the search results. Fewer people bouncing back to the search results leads to higher rankings.

The second reason is that hreflang prevents the problem of duplicate content. If you have the same content in English on different URLs aimed at the UK, the US, and Australia, the difference on these pages might be as small as a change in prices and currency. Without hreflang, Google might not understand what you’re trying to do and see it as duplicate content. With hreflang, you make it very clear to the search engine that it’s (almost) the same content, just optimized for different people.

What is hreflang?

hreflang is code, which you can show to search engines in three different ways – and there’s more on that below. By using this code, you specify all the different URLs on your site(s) that have the same content. These URLs can have the same content in a different language, or the same language but targeted at a different region.

What does hreflang achieve?

Who supports hreflang?

hreflang is supported by Google and Yandex. Bing doesn’t have an equivalent but does support language meta tags.

In a complete hreflang implementation, every URL specifies which other variations are available. When a user searches, Google goes through the following process:

  1. it determines that it wants to rank a URL;
  2. it checks whether that URL has hreflang annotations;
  3. it presents the searcher with the results with the most appropriate URL for that user.

The user’s current location and his language settings determine the most appropriate URL. A user can have multiple languages in his browser’s settings. For example, I have Dutch, English, and German in there. The order in which these languages appear in my settings determines the most appropriate language.

Should you use hreflang?

Tip: homepage first!

If you’re not sure on whether you want to implement hreflang on your entire site, start with your homepage! People searching for your brand will get the right page. This is a lot easier to implement and it will “catch” a large part of your traffic.

Now that we’ve learned what hreflang is and how it works, we can decide whether you should use it. You should use it if:

  • you have the same content in multiple languages;
  • you have content aimed at different geographic regions but in the same language.

It doesn’t matter whether the content you have resides on one domain or multiple domains. You can link variations within the same domain but can also link between domains.

Architectural implementation choices

One thing is very important when implementing hreflang: don’t be too specific! Let’s say you have three types of pages:

  • German
  • German, specifically aimed at Austria
  • German, specifically aimed at Switzerland

You could choose to implement them using three hreflang attributes like this:

  • de-de targeting German speakers in Germany
  • de-at targeting German speakers in Austria
  • de-ch targeting German speakers in Switzerland

However, which of these three results should Google show to someone searching in German in Belgium? The first page would probably be the best. To make sure that every user searching in German who does not match either de-at or de-ch gets that one, change that hreflang attribute to just de. In many cases, specifying just the language is a smart thing to do.

It’s good to know that when you create sets of links like this, the most specific one wins. The order in which the search engine sees the links doesn’t matter; it’ll always try to match from most specific to least specific.

Technical implementation – the basics

Regardless of which type of implementation you choose – and there’s more on that below – there are three basic rules.

1. Valid hreflang attributes

The hreflang attribute needs to contain a value that consists of the language, which can be combined with a region. The language attribute needs to be in ISO 639-1 format (a two-letter code).

Wrong region codes

Google can deal with some of the common mistakes with region codes, although you shouldn’t take any chances. For instance, it can deal with en-uk just as well as with the “correct” en-gb. However, en-eu does not work, as eu doesn’t define a country.

The region is optional and should be in ISO 3166-1 Alpha 2 format, more precisely, it should be an officially assigned element. Use this list from Wikipedia to verify you’re using the right region and language codes. This is where things often go wrong: using the wrong region code is a very common problem.

2. Return links

The second basic rule is about return links. Regardless of your type of implementation, each URL needs return links to every other URL, and these links should point at the canonical versions, more on that below. The more languages you have the more you might be tempted to limit those return links – but don’t. If you have 80 languages, you’ll have hreflang links for 80 URLs, and there’s no getting around it.

3. hreflang link to self

The third and final basic rule is about self-links: every URL should also include an hreflang link pointing to itself. It may feel weird to do this, just as those return links might feel weird, but these self-links are essential and your implementation will not work without them.

Technical implementation choices

There are three ways to implement hreflang:

  • using link elements in the <head>
  • using HTTP headers
  • or using an XML sitemap.

Each has its uses, so we’ll explain them and discuss which you should choose.

1. HTML hreflang link elements in your <head>

The first method to implement hreflang we’ll discuss is HTML hreflang link elements. You do this by adding code like this to the <head> section of every page:

<link rel="alternate" href="http://example.com/" 
  hreflang="en" />
<link rel="alternate" href="http://example.com/en-gb/" 
  hreflang="en-gb" />
<link rel="alternate" href="http://example.com/en-au/" 
  hreflang="en-au" />

As every variation needs to link to every other variation, these implementations can become quite big and slow your site down. If you have 20 languages, choosing HTML link elements would mean adding 20 link elements as shown above to every page. That’s 1.5KB on every page load that no user will ever use, but that every browser will still have to download. On top of that, your CMS will have to do several database calls to generate all these links. This markup is purely meant for search engines. That’s why I would not recommend this for larger sites, as it adds far too much unnecessary overhead.

2. hreflang HTTP headers

The second method of implementing hreflang is through HTTP headers. HTTP headers are for all your PDFs and other non-HTML content you might want to optimize. Link elements work nicely for HTML documents, but not for other types of content as you can’t include them. That’s where HTTP headers come in. They should look like this:

Link: <http://es.example.com/document.pdf>; 
rel="alternate"; hreflang="es", 
<http://en.example.com/document.pdf>; 
rel="alternate"; hreflang="en", 
<http://de.example.com/document.pdf>; 
rel="alternate"; hreflang="de"

The problem with having a lot of HTTP headers is similar to the problem with link elements in your <head>: it adds a lot of overhead to every request.

3. An XML sitemap hreflang implementation

The third option to implement hreflang is using XML sitemap markup. It uses the xhtml:link attribute in XML sitemaps to add the annotation to every URL. It works very much in the same way as you would in a page’s <head> with <link> elements. If you thought link elements were verbose, the XML sitemap implementation is even worse. This is the markup needed for just one URL with two other languages:

<url>
  <loc>http://www.example.com/uk/</loc> 
  <xhtml:link rel="alternate" hreflang="en" 
 href="http://www.example.com/" /> 
  <xhtml:link rel="alternate" hreflang="en-au" 
 href="http://www.example.com/au/" /> 
  <xhtml:link rel="alternate" hreflang="en-gb" 
 href="http://www.example.com/uk/" />
</url>

You can see that the third link is self-referencing, specifying that this particular URL is meant for en-gb, and that the other two links specify the alternate languages. Now, both other URLs would need to be in the sitemap too, which looks like this:

<url>
  <loc>http://www.example.com/</loc> 
  <xhtml:link rel="alternate" hreflang="en" 
 href="http://www.example.com/" /> 
  <xhtml:link rel="alternate" hreflang="en-au" 
 href="http://www.example.com/au/" /> 
  <xhtml:link rel="alternate" hreflang="en-gb" 
 href="http://www.example.com/uk/" />
</url>
<url>
  <loc>http://www.example.com/au/</loc> 
  <xhtml:link rel="alternate" hreflang="en" 
 href="http://www.example.com/" /> 
  <xhtml:link rel="alternate" hreflang="en-au" 
 href="http://www.example.com/au/" /> 
  <xhtml:link rel="alternate" hreflang="en-gb" 
 href="http://www.example.com/uk/" />
</url>

As you can see, basically we’re only changing the URLs within the <loc> element, as everything else should be the same. With this method, each URL has a self-referencing hreflang attribute, and return links to the appropriate other URLs.

XML sitemap markup like this is very verbose: you need a lot of output to do this for a lot of URLs. The benefit of an XML sitemap implementation is simple: your normal users won’t be bothered with this markup. You don’t end up adding extra page weight and it doesn’t require a lot of database calls on page load to generate this markup.

Another benefit of adding hreflang through the XML sitemap is that it’s usually a lot easier to change an XML sitemap than to change all the pages on a site. There’s no need to go through large approval processes and maybe you can even get direct access to the XML sitemap file.

Other technical aspects of an hreflang implementation

Once you’ve decided your implementation method, there are a couple of other technical considerations you should know about before you start implementing hreflang.

hreflang x-default

x-default is a special hreflang attribute value that specifies where a user should be sent if none of the languages you’ve specified in your other hreflang links match their browser settings. In a link element it looks like this:

<link rel="alternate" href="http://example.com/" 
  hreflang="x-default" />

When it was introduced, it was explained as being for “international landing pages”, i.e. pages where you redirect users based on their location. However, it can basically be described as the final “catch-all” of all the hreflang statements. It’s where users will be sent if their location and language don’t match anything else.

In the German example we mentioned above, a user searching in English still wouldn’t have a URL that fits them. That’s one of the cases where x-default comes into play. You’d add a fourth link to the markup, and end up with these 4:

  • de
  • de-at
  • de-ch
  • x-default

In this case, the x-default link would point to the same URL as the de one. We wouldn’t advise you to remove the de link though, even though technically that would create exactly the same result. In the long run, it’s usually better to have both as it specifies the language of the de page – and it makes the code easier to read.
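Put together as link elements (the example.com URLs are placeholders), the German set with its x-default could look like this:

<link rel="alternate" href="http://example.com/de/" 
  hreflang="de" />
<link rel="alternate" href="http://example.com/de-at/" 
  hreflang="de-at" />
<link rel="alternate" href="http://example.com/de-ch/" 
  hreflang="de-ch" />
<link rel="alternate" href="http://example.com/de/" 
  hreflang="x-default" />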

hreflang and rel=canonical

rel="canonical"

If you don’t know what rel=”canonical” is, read this article!

rel=”alternate” hreflang=”x” markup and rel=”canonical” can and should be used together. Every language should have a rel=”canonical” link pointing to itself. In the first example, this would look like this, assuming that we’re on the example.com homepage:

<link rel="canonical" href="http://example.com/">
<link rel="alternate" href="http://example.com/" 
  hreflang="en" />
<link rel="alternate" href="http://example.com/en-gb/" 
  hreflang="en-gb" />
<link rel="alternate" href="http://example.com/en-au/" 
  hreflang="en-au" />

If we were on the en-gb page, only the canonical would change:

<link rel="canonical" href="http://example.com/en-gb/">
<link rel="alternate" href="http://example.com/" 
  hreflang="en" />
<link rel="alternate" href="http://example.com/en-gb/" 
  hreflang="en-gb" />
<link rel="alternate" href="http://example.com/en-au/" 
  hreflang="en-au" />

Don’t make the mistake of setting the canonical on the en-gb page to http://example.com/, as this breaks the implementation. It’s very important that the hreflang links point to the canonical version of each URL, because these systems should work hand in hand!

Useful tools when implementing hreflang

If you’ve come this far, you’ll probably be thinking “wow, this is hard”! I know – I thought that when I first started to learn about it. Luckily, there are quite a few tools available if you dare to start implementing hreflang.

hreflang tag generator

The hreflang tags generator tool

Aleyda Solis, who has also written quite a lot about this topic, has created a very useful hreflang tag generator that helps you generate link elements. Even when you’re not using a link element implementation, this can be useful to create some example code.

hreflang XML sitemap generator

The Media Flow have created an hreflang XML sitemap generator. Just feed it a CSV with URLs per language and it creates an XML sitemap. This is a great first step when you decide to take the sitemap route.

The CSV file you feed this XML sitemap generator needs a column for each language. If you want to add an x-default URL to it as well, just create a column called x-default.

hreflang tag validator

hreflang tag validator

Once you’ve added markup to your pages, you’ll want to validate it. If you choose to go the link element in the <head> route, you’re in luck, as there are a few validator tools out there. The best one we could find is flang, by DejanSEO.

Unfortunately, we haven’t found a validator for XML sitemaps yet.

Making sure hreflang keeps working: process

Once you’ve created a working hreflang setup, you need to set up maintenance processes. It’s probably also a good idea to regularly audit your implementation to make sure it’s still set up correctly.

Make sure that people in your company who deal with content on your site know about hreflang so that they won’t do things that break your implementation. Two things are very important:

  1. When a page is deleted, check whether its counterparts are updated.
  2. When a page is redirected, change the hreflang URLs on its counterparts.

If you do that and audit regularly, you shouldn’t run into any issues.

Conclusion

Setting up hreflang is a cumbersome process. It’s a tough standard with a lot of specific things you should know and deal with. This guide will be updated as new things are introduced around this specification and best practices evolve, so check back when you’re working on your implementation again!

Read more: rel=canonical: what is it and how (not) to use it »

Don’t block CSS and JS files

Back in 2015, Google Search Console started actively warning webmasters not to block CSS and JS files. In 2014, we had already told you the same thing: don’t block CSS and JS files. We feel the need to repeat this message now. We’re currently working on the websites of our first Yoast SEO Care customers, and this is obviously something we’ll look into for them as well. In this post, we’ll explain why you shouldn’t block these specific files from Googlebot.

Why you shouldn’t block CSS and JS files

You shouldn’t block CSS and JS files because that way, you’re preventing Google from checking whether your website works properly. If you block CSS and JS files in your robots.txt file, Google can’t render your website as intended. As a result, Google won’t fully understand your website, which might even result in lower rankings.

I think this aligns perfectly with the general assumption that Google has gotten more and more ‘human’. Google simply wants to see your website like a human visitor would, so it can distinguish the main elements from the ‘extras’. Google wants to know if JavaScript is enhancing the user experience or ruining it.

Test and fix

Google guides webmasters in this, for instance in the blocked resources check in Google Search Console:

Search Console - Blocked Resources example | Don't block CSS and JS files

Besides that, Google Search Console allows you to test any file against your robots.txt settings at Crawl > Robots.txt tester:

Search Console robots.txt tester | Don't block CSS and JS files

The tester will tell you what file is and isn’t allowed according to your robots.txt file. More on these crawl tools in Google Search Console here.

Unblocking these blocked resources basically comes down to changing your robots.txt file. You need to set that file up in such a way that it no longer disallows Google from accessing your site’s CSS and JS files. If you’re on WordPress and use Yoast SEO, this can be done directly in our Yoast SEO plugin.

WordPress and blocking CSS and JS files in robots.txt

To be honest, we don’t think you should block anything in your robots.txt file unless it’s for a very specific reason. That means you have to know what you’re doing. In WordPress, you can go without blocking anything in most cases. We frequently see /wp-admin/ disallowed in robots.txt files, but this will, in most cases, also prevent Google from reaching some files. There is no need to disallow that directory, as Joost explained in this post.
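If you decide to keep a /wp-admin/ disallow anyway, at least make an exception for admin-ajax.php, as shown in the robots.txt guide above, so Google can still reach the resources that depend on it:

User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php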

We’ll say it again

We’ve said it before and we’ll say it again: don’t block Googlebot from accessing your CSS and JS files. These files allow Google to decently render your website and get an idea of what it looks like. If they don’t know what it looks like, they won’t trust it, which won’t help your rankings.

Read more: ‘robots.txt: the ultimate guide’ »