Content Analysis with the WordPress SEO plugin

We’ve been rather busy with the WordPress SEO plugin the last few days. We did a release yesterday and a quick follow-up today to fix a few collisions with other plugins. There are loads of cool small fixes in there, but one in particular is worth highlighting, as it’s something other plugin developers might want to pick up on: a small but important change to our content analysis functionality.

For quite a while now, the WordPress SEO plugin has had a page analysis function baked in. The name is misleading, which is why I’ll be changing it soon, as it’s actually not page analysis, but content analysis. If you give it a focus keyword to test for, it analyses the content of your post and gives you hints and tips on how to improve it.

Every once in a while, we’ll get bug reports, on GitHub or through email, telling us that we’re wrong, and that we should analyse the entire page when doing the content analysis. I disagree, which is why we’re not doing it. Let me tell you why I disagree first.

Web Page Segmentation and Content Analysis

Search engines have been able to analyse the content of pages on a block level for quite a while now. Going into the specifics would take too much time here, but if you’re interested, read this post by Bill Slawski from 2009 or even this one, about a Google patent from 2006. Basically, search engines are able to tell what the content bit of a page is, what the sidebar is, what the footer is, etc. Using that segmentation, they judge your page by judging just the content section of it.

Building block-level recognition like that into my content analysis function would be… undoable. It’s also unnecessary: we already know what the content is, so we can just take that and ignore all the other bits. Oh, and I’m not even halfway smart enough to do the kind of segmentation search engines do and keep your WordPress site running smoothly.

So the content analysis just fetches the post’s or page’s content and runs its analysis on that. It’s clean, it’s simple and it’s rather fast.

The Issue with Focussing on Post Content Analysis

There’s one issue with this approach. The issue is that WordPress is being used more and more as a CMS. People are adding different blocks of content to pages in more and more ways. Plugins like Pods and Advanced Custom Fields are allowing people to be more flexible with their content blocks. We had to come up with something for that.

Another issue was that we didn’t parse shortcodes when doing the content analysis, so we didn’t recognise galleries correctly, whether the native gallery or galleries added with, for instance, Next Gen Gallery. This meant we didn’t properly recognise all the images in a post and thus couldn’t output them in XML sitemaps and OpenGraph tags.

Now you might remember from installing the plugin, if you’re a user, that we ask permission to anonymously track data about your site. We collect that data specifically for these kinds of problems. Through this tracking database, which currently tracks about 650,000 sites, we looked at how big this particular issue was. Of the sites that run our WordPress SEO plugin, which is about half of the sites we track, 10% also run Next Gen Gallery. Pods and Advanced Custom Fields aren’t as popular, but they are both growing rapidly. So it’s a serious and growing problem. Time to fix it.

The Solution

Yesterday, in 1.4.14, we had a first patch that tried to parse shortcodes to discover images for use in our OpenGraph tags. The results were painful. Apparently, loads of plugin developers don’t really understand how a shortcode should work according to its API, so it broke, on loads of sites, horribly. Several plugins suddenly failed, simply because we were doing a do_shortcode outside of the main body and the shortcodes were echoing instead of returning their content or doing rather ugly things to the post_content attribute of the post global. I have to say: that shouldn’t happen. But it did.
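For plugin developers wondering what the well-behaved version looks like: a shortcode callback should build its output as a string and return it, so it behaves the same wherever do_shortcode is called. Here is a minimal sketch, assuming a hypothetical my_gallery shortcode; the tag, the function name and the markup are made up for illustration.

```php
<?php
// Hypothetical shortcode: the tag "my_gallery" and the function name are
// invented for this example only.
function my_gallery_shortcode( $atts ) {
	// Merge user-supplied attributes with defaults.
	$atts = shortcode_atts( array( 'title' => 'Gallery' ), $atts );

	// Build the markup as a string...
	$output = '<div class="my-gallery">' . esc_html( $atts['title'] ) . '</div>';

	// ...and return it. Don't echo, and don't touch the global $post.
	return $output;
}
add_shortcode( 'my_gallery', 'my_gallery_shortcode' );
```

Because the callback only returns a string and has no side effects, it is safe to run outside of the main body of the page as well.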

So we released 1.4.15 today, which reverted that code. And now we’re left with only one option: providing plugin developers out there with a simple filter. This filter is called wpseo_pre_analysis_post_content and takes one argument: a string containing the post’s content. It’s used in several spots within the WordPress SEO plugin, with more to come, and it allows plugin developers to add their custom fields’ content to the content the plugin analyses by just adding on to that string.
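Hooking into the filter could look something like the sketch below. The function name and the my_extra_content meta key are hypothetical, and it assumes the current post is available through get_the_ID() when the filter runs, so adjust it to however your plugin stores its content.

```php
<?php
// Hypothetical callback: the function name and the 'my_extra_content' meta key
// are assumptions for illustration, not part of WordPress SEO.
function my_plugin_add_custom_field_content( $content ) {
	// Assumes the current post is set up when the filter runs.
	$post_id = get_the_ID();
	if ( ! $post_id ) {
		return $content;
	}

	// Fetch the custom field as a single value.
	$extra = get_post_meta( $post_id, 'my_extra_content', true );

	// Append it to the string the plugin analyses.
	if ( is_string( $extra ) && '' !== $extra ) {
		$content .= ' ' . $extra;
	}

	// Always return the content, changed or not.
	return $content;
}
add_filter( 'wpseo_pre_analysis_post_content', 'my_plugin_add_custom_field_content' );
```

Note that the callback returns the original string untouched when there is nothing to add, so the analysis keeps working for posts without that field.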

It’s a simple enough change for us to make, but it opens up a world of possibilities. I hope people will use it and I’d love for you to tell us in the comments if you do!

This post first appeared on Yoast. Whoopity Doo!