Yoast internal linking: the making of

A few weeks ago, we added Yoast internal linking to Yoast SEO Premium for English. We released the same feature for German earlier this week. In this post, I’ll explain how the previously released Insights feature laid the groundwork for internal linking, how we compose the list of linking suggestions, and why Yoast internal linking is currently only available for a limited set of languages.

So what does the internal linking tool do? While working on your post, our internal linking tool will give you suggestions on which posts you could consider linking to because they are about related topics. Linking to these posts will help you create a better site structure.

Insights

To know which posts we should show in the Yoast internal linking meta box, we first need to find out what all your posts are about. For this, we use the data we’ve already gathered for the Insights box, which you’ll find beneath the content analysis:

[Image: Insights in Yoast SEO Premium]

But how do we get to this list of five words and word combinations? Let’s take a look at the steps we take when we analyze a post for its most prominent words.

Step 1: Getting all relevant single words

First, we want to know which 100 relevant single words are used most frequently in the post. We therefore start by making a list of all the words in the text. Next, we remove words like ‘the’, ‘you’ and ‘to’ from this list. Articles, pronouns, prepositions and other function words are simply too widely used to be truly relevant to a text. If we didn’t filter out words like these, all posts would end up with roughly the same prominent words. Once we’ve removed all function words, we save the 100 most frequent single words and move on to the word combinations.
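To make this step concrete, here is a minimal Python sketch. It is not the code that ships with Yoast SEO: the tokenizer, the heavily abbreviated function word list and the function names are all assumptions made for illustration.

```python
import re
from collections import Counter

# Hypothetical, heavily abbreviated list; a real function word list is far longer.
FUNCTION_WORDS = {"the", "you", "to", "a", "an", "and", "of", "in", "is", "your", "for"}


def tokenize(text):
    """Lowercase the text and split it into words (a deliberately simple tokenizer)."""
    return re.findall(r"[a-z0-9']+", text.lower())


def top_single_words(text, limit=100):
    """Return up to `limit` (word, count) pairs, most frequent first, without function words."""
    words = [word for word in tokenize(text) if word not in FUNCTION_WORDS]
    return Counter(words).most_common(limit)
```

Calling top_single_words on a post's text gives the list of frequent single words that the later steps build on.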

Step 2: Getting all relevant word combinations

Combinations of two or more words are often more relevant and information-rich than single words, because they are more specific. That is why we also look for the most relevant two- to five-word combinations. We filter these combinations as well, because combinations like ‘headlines to be’ and ‘to rank and your’ are useless. We only want to keep meaningful combinations like ‘optimize your site structure’ and ‘writing clickbait titles’.
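The sketch below shows one way such combinations could be collected and filtered in Python. The filtering rule used here (drop a combination if it starts or ends with a function word) is an assumption that happens to match the examples above; the actual filters are more elaborate and language-specific. The word list and function names are hypothetical as well.

```python
from collections import Counter

# Hypothetical, heavily abbreviated function word list.
FUNCTION_WORDS = {"the", "you", "to", "be", "and", "your", "a", "of"}


def word_combinations(words, min_len=2, max_len=5):
    """Count every run of min_len to max_len consecutive words."""
    counts = Counter()
    for n in range(min_len, max_len + 1):
        for start in range(len(words) - n + 1):
            counts[tuple(words[start:start + n])] += 1
    return counts


def is_meaningful(combination):
    """Assumed rule: reject combinations that start or end with a function word."""
    return (combination[0] not in FUNCTION_WORDS
            and combination[-1] not in FUNCTION_WORDS)
```

With this rule, ‘headlines to be’ and ‘to rank and your’ are rejected, while ‘optimize your site structure’ is kept even though it contains a function word in the middle.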

Step 3: Filtering on word density

Once we’ve retrieved and filtered all one- to five-word combinations, we filter out everything with a word density of over 0.03. This means we remove all combinations from the list that make up more than 3% of the entire text. The rationale behind this is that words that are too frequent are seldom genuinely relevant, because they tend to be non-specific. This also serves as an extra safety net to catch any function words that we might have forgotten to remove during the previous steps.
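As a rough Python illustration of this filter, assume that density is simply the number of occurrences divided by the total number of words in the text. That definition and the function name are assumptions; only the 0.03 threshold comes from the description above.

```python
def filter_by_density(counts, total_words, max_density=0.03):
    """Keep only the entries whose density (occurrences / total words) is at most max_density."""
    if total_words == 0:
        return {}
    return {
        combination: occurrences
        for combination, occurrences in counts.items()
        if occurrences / total_words <= max_density
    }
```

In a 1,000-word text, a word that occurs 40 times has a density of 0.04 and is dropped, while a combination that occurs 12 times (0.012) stays on the list.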

Step 4: Calculating relevance scores

The final step is calculating which words and word combinations are most relevant to the post. Based on trial and error, we came up with a formula that does just this. It uses the frequency and length of each word combination, as well as the proportion of relevant words it contains.

Length bonus

We start by determining the length bonus. As shown in the table below, the longer a combination is, the higher the length bonus it receives. This means longer, more specific word combinations will eventually get a higher relevance score than shorter, less specific combinations.

Word combination length    Length bonus
-----------------------    ------------
Single word                0
Two-word combination       3
Three-word combination     7
Four-word combination      12
Five-word combination      15

Relevant word proportion

We also calculate what proportion of the words in each combination is on the list of the 100 most frequent words, which we drew up during Step 1. For example, if one word of a four-word combination is also among the 100 most frequent words, the calculated proportion would be 0.25. The idea behind this is that the more relevant words a combination contains, the more relevant the combination probably is.

Multiplier

Next, we calculate the so-called multiplier using the following formula: 1 + (relevant word proportion * length bonus). For a four-word combination with a relevant word proportion of 0.25, this would result in a multiplier of 1 + (0.25 * 12) = 4.
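Expressed in Python, the length bonus, the relevant word proportion and the multiplier could look like the sketch below. The bonus values and the formula are taken from the text; the function names and the choice to represent a combination as a tuple of words are assumptions.

```python
# Length bonus per combination length, as listed in the table above.
LENGTH_BONUS = {1: 0, 2: 3, 3: 7, 4: 12, 5: 15}


def relevant_word_proportion(combination, top_words):
    """Share of the words in `combination` that are also in `top_words`."""
    relevant = sum(1 for word in combination if word in top_words)
    return relevant / len(combination)


def multiplier(combination, top_words):
    """1 + (relevant word proportion * length bonus)."""
    proportion = relevant_word_proportion(combination, top_words)
    return 1 + proportion * LENGTH_BONUS[len(combination)]
```

For a four-word combination with exactly one word in the top 100, multiplier returns 4.0, matching the worked example. A single word always gets a multiplier of 1, because its length bonus is 0.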

Relevance score

Finally, we calculate the actual relevance score by multiplying the number of occurrences of each word combination by its multiplier. If the four-word combination from the example above had a frequency of 3, its relevance score would be 3 * 4 = 12. Once we’ve calculated all relevance scores, we sort the words and word combinations from the highest to the lowest relevance. To keep the Insights box clear of clutter, we only show the top 5. However, we save a maximum of 100 words and word combinations for further use.
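The scoring and ranking described here can be sketched in a few lines of Python. The input format, a list of (combination, occurrences, multiplier) tuples, and the function name are assumptions made for this illustration.

```python
def rank_by_relevance(candidates, shown=5, saved=100):
    """Score each candidate as occurrences * multiplier and sort from high to low.

    `candidates` is an iterable of (combination, occurrences, multiplier) tuples.
    Returns two lists of (combination, score): the entries to show and the entries to save.
    """
    scored = [(combination, occurrences * mult)
              for combination, occurrences, mult in candidates]
    scored.sort(key=lambda item: item[1], reverse=True)
    saved_items = scored[:saved]
    return saved_items[:shown], saved_items
```

A four-word combination that occurs 3 times with a multiplier of 4 gets a score of 12, so it outranks a single word that occurs 10 times with a multiplier of 1.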

Yoast internal linking

Once we have collected the most prominent words for all your posts, it’s time to compare them. To do this, we take the top 20 prominent words of each post. However, for the sake of simplicity, I will illustrate the process with only five prominent words per post.

Imagine you’re writing a post about Twitter Analytics. You’ve also written posts about Twitter Cards, homepage SEO and Instagram Analytics. You can find the top 5 prominent words from these posts in the table below.

Twitter Analytics              Twitter cards      homepage SEO              Instagram Analytics
---------------------------    ---------------    ----------------------    -------------------
Twitter Analytics              Twitter cards      homepage SEO              Instagram Analytics
Twitter                        Twitter            business name or brand    Instagram
analytics                      Twitter account    homepage                  followers
Twitter analytics dashboard    account            optimize your homepage    analytics
Twitter cards                  data               site name                 engagement rate

The more overlapping prominent words a post has with the current post, the higher its position will be in the list. Because the post about Instagram Analytics shares the prominent word ‘analytics’ with your post about Twitter Analytics, that post will show up in the linking suggestions. However, the posts about Twitter Analytics and Twitter Cards have two overlapping prominent words: ‘Twitter cards’ and ‘Twitter’. As a result, the post about Twitter Cards will end up higher in the list. Lastly, the post about homepage SEO doesn’t have any prominent words in common with the post about Twitter Analytics. For that reason, we won’t suggest it to you.

We’ve decided to limit the number of suggested posts to twenty, because we don’t want to overwhelm you. Only the twenty posts that share the most prominent words with your post will be shown in the meta box. Check out what the result looks like in this video!
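The comparison described above can be sketched in Python using the example table. This is an illustration only: the function names are made up, and comparing prominent words case-insensitively is an assumption.

```python
def normalize(words):
    """Lowercase prominent words so the comparison ignores capitalization (an assumption)."""
    return {word.lower() for word in words}


def suggest_links(current_words, other_posts, limit=20):
    """Return (title, shared word count) pairs, most overlap first, skipping posts without overlap."""
    current = normalize(current_words)
    overlaps = []
    for title, words in other_posts.items():
        shared = len(current & normalize(words))
        if shared > 0:
            overlaps.append((title, shared))
    overlaps.sort(key=lambda item: item[1], reverse=True)
    return overlaps[:limit]


# Prominent words from the example table.
twitter_analytics = ["Twitter Analytics", "Twitter", "analytics",
                     "Twitter analytics dashboard", "Twitter cards"]
other_posts = {
    "Twitter Cards": ["Twitter cards", "Twitter", "Twitter account", "account", "data"],
    "Homepage SEO": ["homepage SEO", "business name or brand", "homepage",
                     "optimize your homepage", "site name"],
    "Instagram Analytics": ["Instagram Analytics", "Instagram", "followers",
                            "analytics", "engagement rate"],
}
suggestions = suggest_links(twitter_analytics, other_posts)
```

Here, suggestions lists the Twitter Cards post first with two shared words, followed by the Instagram Analytics post with one. The homepage SEO post is left out because it shares no prominent words with the current post.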

Language support

Now that we’ve built the above framework, we face the time-consuming task of making the linking suggestions available for languages other than English and German. Not only do we have to compose lists of function words for each individual language, but we also need to adjust the filtering for each of them. This has to do with differences in word order. In English, for example, one describes an action with a verb followed by an object: eating cookies. However, in German, the object comes before the verb: Kekse essen (literally: cookies eat). As a result, we want to filter out English word combinations ending with a verb (he eats), but German combinations beginning with a verb (isst Kekse, literally: eats cookies).
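To show why the filters cannot simply be copied from one language to another, here is a small Python sketch of a position-dependent verb filter. The tiny verb lists, the language codes and the function name are hypothetical; real filters need complete word lists and more rules than this.

```python
# Hypothetical, tiny verb lists for illustration only.
VERBS = {
    "en": {"eat", "eats"},
    "de": {"essen", "isst"},
}

# Position at which a verb makes a combination unusable:
# the last word in English, the first word in German.
FILTERED_VERB_POSITION = {"en": -1, "de": 0}


def keep_combination(combination, language):
    """Return False if the combination has a verb at the filtered position for this language."""
    position = FILTERED_VERB_POSITION[language]
    return combination[position] not in VERBS[language]
```

With these lists, ‘he eats’ is dropped for English and ‘isst Kekse’ is dropped for German, while ‘Kekse essen’ is kept. Supporting a new language means supplying both its word lists and its own position rules.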

The future of link suggestions

We’re happy to announce that we’ve released internal linking for German. But, perhaps more importantly, we’d also like to let you know that you can help make Yoast internal linking available for your own language! Please contact us if you’d like to help.



Read more: ‘Why you should use Yoast internal linking’ »