WordPress robots.txt: Best-practice example for SEO

Your robots.txt file is a powerful tool when you’re working on a website’s SEO – but it should be handled with care. It allows you to deny search engines access to different files and folders, but often that’s not the best way to optimize your site. Here, we’ll explain how we think webmasters should use their robots.txt file, and propose a ‘best practice’ approach suitable for most websites.

You’ll find a robots.txt example that works for the vast majority of WordPress websites further down this page. If you want to know more about how your robots.txt file works, you can read our ultimate guide to robots.txt.

What does “best practice” look like?

Search engines continually improve the way in which they crawl the web and index content. That means that what used to be best practice a few years ago may no longer work, or may even harm your site.

Today, best practice means relying on your robots.txt file as little as possible. In fact, it’s only really necessary to block URLs in your robots.txt file when you have complex technical challenges (e.g., a large eCommerce website with faceted navigation), or when there’s no other option.

Blocking URLs via robots.txt is a ‘brute force’ approach, and can cause more problems than it solves.

For most WordPress sites, the following example is best practice:

# This space intentionally left blank
# If you want to learn about why our robots.txt looks like this, read this post: https://yoa.st/robots-txt
User-agent: *

We even use this approach in our own robots.txt file.

What does this code do?

  • The User-agent: * instruction states that any following instructions apply to all crawlers.
  • Because we don’t provide any further instructions, we’re saying “all crawlers can freely crawl this site without restriction”.
  • We also provide some information for humans looking at the file (linking to this very page), so that they understand why the file is ‘empty’.

If you have to disallow URLs

If you want to prevent search engines from crawling or indexing certain parts of your WordPress site, it’s almost always better to do so by adding meta robots tags or robots HTTP headers.
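
For instance, a page you want kept out of the search results, while still letting crawlers follow its links, could carry a meta robots tag like the one below in its <head>. This is a generic illustration, not output from any particular plugin:

<meta name="robots" content="noindex, follow" />

For non-HTML files such as PDFs, the same directive can be sent as an HTTP response header instead:

X-Robots-Tag: noindex, follow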

Our ultimate guide to meta robots tags explains how you can manage crawling and indexing ‘the right way’, and our Yoast SEO plugin provides the tools to help you implement those tags on your pages.

If your site has crawling or indexing challenges that can’t be fixed via meta robots tags or HTTP headers, or if you need to prevent crawler access for other reasons, you should read our ultimate guide to robots.txt.

Note that WordPress and Yoast SEO already automatically prevent indexing of some sensitive files and URLs, like your WordPress admin area (via an X-Robots-Tag HTTP header).

Why is this ‘minimalism’ best practice?

Robots.txt creates dead ends

Before you can compete for visibility in the search results, search engines need to discover, crawl and index your pages. If you’ve blocked certain URLs via robots.txt, search engines can no longer crawl through those pages to discover others. That might mean that key pages don’t get discovered.

Robots.txt denies links their value

One of the basic rules of SEO is that links from other pages can influence your performance. If a URL is blocked, search engines won’t crawl it, and they might not pass any ‘link value’ pointing at that URL on to the blocked page, or through it to the other pages on your site.

Google fully renders your site

People used to block access to CSS and JavaScript files in order to keep search engines focused on those all-important content pages.

Nowadays, Google fetches all of your styling and JavaScript and renders your pages completely. Understanding your page’s layout and presentation is a key part of how it evaluates quality. So Google doesn’t like it at all when you deny it access to your CSS or JavaScript files.

The previous best practice of blocking access to your wp-includes directory and your plugins directory via robots.txt is no longer valid, which is why we worked with WordPress to remove the default disallow rule for wp-includes in version 4.0.
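
If your robots.txt still contains leftovers from that era, they’ll look something like the rules below, and you can safely remove them:

Disallow: /wp-includes/
Disallow: /wp-content/plugins/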

Many WordPress themes also use asynchronous JavaScript requests – so-called AJAX – to add content to web pages. WordPress used to block Google from this by default, but we fixed this in WordPress 4.4.

You (usually) don’t need to link to your sitemap

The robots.txt standard supports adding a link to your XML sitemap(s) to the file. This helps search engines to discover the location and contents of your site.
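
Such a reference is just a single line pointing at the sitemap’s full URL; the location below is only an example:

Sitemap: https://www.example.com/sitemap_index.xml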

We’ve always felt that this was redundant; you should already be adding your sitemap to your Google Search Console and Bing Webmaster Tools accounts in order to access analytics and performance data. If you’ve done that, then you don’t need the reference in your robots.txt file.

Read more: Preventing your site from being indexed: the right way »

How to cloak your affiliate links

We regularly consult for sites that monetize, in part, with affiliate links. We usually advise people to redirect affiliate links. In the past, we noticed that there wasn’t a proper script available online that could handle this for us, so we created one to tackle this problem. In this post, I explain how you can get your hands on it and how you can get it running on your website.

Why should I cloak my affiliate links?

A quick online search will turn up tons of reasons why you should redirect your affiliate links. The “historical” reason is hiding from search engines that you’re an affiliate. It would be naive to think that search engines don’t understand what’s happening, but nevertheless this seems like a valid reason.

There are also a few more advantages to cloaking your affiliate links, such as:

  1. Ease of management
    Sometimes you might need to change your affiliate links. If said links are spread out across your blog, this could become quite a time-intensive task. By centralizing the affiliate links, you have one location to manage all of them.
  2. Prevents leaking PageRank to advertisers
    Affiliate links are ads and should be nofollowed or otherwise altered to prevent leaking PageRank to the advertiser. Instead of having to do this manually for every individual affiliate link, you can do this in a single location without much hassle (see the example just after this list). This also prevents the possibility of forgetting to add nofollow to one of the links.
  3. “Clean” links
    Different affiliate programs tend to use different permalink structures. Some might have relatively ‘clean’ links, whereas others tend to add a lot of gibberish. Using the redirect script can help you deal with this issue because the cloaked URL will always follow the same structure. This makes it a lot clearer to the user where the link is taking them!
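
To illustrate points 2 and 3: once your links are cloaked, a link in your content can look as simple as the example below. The /out/yoast path matches the setup described later in this post, and the rel="nofollow" keeps link value from leaking to the advertiser.

<a href="/out/yoast" rel="nofollow">Check out this product</a>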

Cloaking affiliate links: the how-to

The basic process of cloaking affiliate links is simple:

  1. Create a folder from which you’ll serve your redirects. At Yoast we use /out/.
  2. Block the /out/ folder in your robots.txt file by adding:
    Disallow: /out/
  3. Use a script in your redirect folder to redirect to your affiliate URLs.

Step 2 ensures search engines won’t crawl the redirect URLs, but we’ll add some extra security measures in our script to prevent accidental indexation of our affiliate links. Step 3 is as easy as manually adding each redirect to your redirect directory’s .htaccess file, assuming you’re running your website on an Apache-based server. Alternatively, you can use the script we produced to make it easier on yourself. The added bonus of this script is that it also works on servers running Nginx!
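
If you do go the manual route, each redirect is a single line in the /out/ directory’s .htaccess file. A minimal sketch, with a made-up destination URL:

Redirect 302 /out/yoast https://www.example.com/affiliate-destination

The first path is the cloaked URL on your own site; the second is the advertiser’s affiliate URL you’re redirecting to.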

Affiliate link redirect script

The script we created consists of three files, one of which is optional: an index.php file, a redirects.txt file and, to finish it all off, a .htaccess file to prettify your URLs.

Index.php

This file contains the logic that handles the actual redirection by performing a 302 redirect. Additionally, it sends an X-Robots-Tag header along to ensure that search engines which detect this header obey the noindex, nofollow rules we pass along in it. We do this as an extra security measure in case you forget to exclude the affiliate links in your robots.txt.
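
The sketch below shows how such a script could work. It isn’t the exact code from our download, but it follows the same idea, assuming a redirects.txt file in the same directory (its format is described in the next section) and an id query parameter that names the redirect:

<?php
// Minimal affiliate redirect sketch: look up the requested key in redirects.txt,
// send an X-Robots-Tag header, and issue a 302 redirect to the destination URL.

// Read the comma-separated redirects file into a key => URL map.
$redirects = array();
foreach ( (array) file( __DIR__ . '/redirects.txt', FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES ) as $line ) {
    list( $key, $url ) = array_pad( explode( ',', $line, 2 ), 2, '' );
    $redirects[ trim( $key ) ] = trim( $url );
}

// Determine which redirect was requested; fall back to the 'default' entry.
$id = isset( $_GET['id'] ) ? $_GET['id'] : 'default';
if ( ! isset( $redirects[ $id ] ) ) {
    $id = 'default';
}

// Ask search engines not to index or follow this URL, then redirect.
header( 'X-Robots-Tag: noindex, nofollow', true );
header( 'Location: ' . $redirects[ $id ], true, 302 );
exit;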

Redirects.txt

The redirects.txt file is a comma-separated file that contains a list of names and destination URLs like so:

yoast,https://yoast.com

Note that the file should always contain the following line at the very top to ensure people don’t attempt to redirect themselves to a non-existing URL:

default,http://example.com

Just change example.com to your own domain and you’re ready to go!

.htaccess

If you only install the above two files, you’ll already have enough in place to get things running. However, we advise you to prettify the URLs, because this dramatically improves readability. Without prettifying your URLs, you’ll end up with something like /out/?id=yoast instead of /out/yoast.

Prettifying can be achieved by adding a .htaccess file to the mix. This small file also helps ensure people can’t access your redirects.txt file to take a peek and see what affiliate links are available.
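
A .htaccess file along these lines takes care of both, assuming the index.php and redirects.txt setup described above and Apache 2.4 with mod_rewrite enabled; treat it as a sketch rather than the exact file from our download:

# Deny direct access to the redirects file.
<Files "redirects.txt">
    Require all denied
</Files>

# Rewrite pretty URLs such as /out/yoast to /out/index.php?id=yoast.
RewriteEngine On
RewriteCond %{REQUEST_FILENAME} !-f
RewriteRule ^([a-zA-Z0-9_-]+)/?$ index.php?id=$1 [L]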

What about plugins?

In the past we’ve received questions about using WordPress plugins to tackle this cloaking issue. Despite there being a lot of valid options, they all share one small caveat: speed. Because these plugins depend on WordPress’ core code, they need to wait for it to be fully booted before they can execute. This can easily add a second or two to the total loading and redirecting time if you’re on a slow server. Our non-plugin solution is faster because it doesn’t depend on WordPress to run.

Ultimately, the best option depends on your needs. If you want to collect statistics on your affiliate links, you might be better off with a plugin. Otherwise, just use our script to keep things fast.

The files

If you’re interested in running this nifty script on your own website, head on over to GitHub. Feeling adventurous? You can find the source code here. People running Nginx can find sample code in this gist to see how to make it work for them.

Read more: ‘Playing with the X-Robots-Tag HTTP header’ »