As impressive as machine learning and algorithm-based intelligence can be, they often lack something that comes naturally to humans: common sense.
It’s common knowledge that putting the same content on multiple pages produces duplicate content. But what if you create pages about similar things, with differences that matter? Algorithms may flag them as duplicates, even though humans have no problem telling such pages apart.
How does this happen? How can you spot issues? And what can you do about it?
The danger of duplicate content
Duplicate content interferes with your ability to make your site visible to search users: it dilutes ranking signals across near-identical URLs, wastes crawl budget, and can lead Google to rank a version of the page you didn’t choose.
How machines identify duplicate content
Google uses algorithms to determine whether two pages or parts of pages are duplicate content, which Google defines as content that is “appreciably similar”.
Google’s similarity detection is based on their patented Simhash algorithm, which analyzes blocks of content on a web page. It then calculates a unique identifier for each block, and composes a hash, or “fingerprint”, for each page.
Because the number of webpages is colossal, scalability is key. Currently, Simhash is the only feasible method for finding duplicate content at scale.
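The idea can be illustrated with a minimal Simhash sketch in Python (a toy version of Charikar’s algorithm, not Google’s production implementation): each token votes on every bit of the fingerprint, so pages that share most of their content end up with fingerprints that differ in only a few bits.

```python
import hashlib

def simhash(tokens, bits=64):
    # Accumulate a per-bit vote vector across all token hashes.
    v = [0] * bits
    for token in tokens:
        # Hash each token to a fixed-width integer (md5 here for simplicity).
        h = int.from_bytes(hashlib.md5(token.encode("utf-8")).digest()[:8], "big")
        for i in range(bits):
            v[i] += 1 if (h >> i) & 1 else -1
    # Each fingerprint bit is the sign of the accumulated vote.
    fp = 0
    for i in range(bits):
        if v[i] > 0:
            fp |= 1 << i
    return fp

def hamming_distance(a, b):
    # Number of differing bits between two fingerprints.
    return bin(a ^ b).count("1")
```

Two pages that differ by a single word produce fingerprints only a few bits apart, while unrelated pages differ in roughly half their bits.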
Simhash fingerprints are compact, comparable values: the difference between any two fingerprints can be measured algorithmically and expressed as a percentage. To reduce the cost of evaluating every single pair of pages, Google employs scalability techniques such as grouping fingerprints into buckets so that only likely near-duplicates are compared directly.
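One standard scheme from the near-duplicate detection literature (whether Google uses exactly this in production is an assumption) splits each 64-bit fingerprint into four 16-bit bands: by the pigeonhole principle, two fingerprints within Hamming distance 3 must agree exactly on at least one band, so only fingerprints that collide in some band need a full comparison.

```python
from collections import defaultdict
from itertools import combinations

def candidate_pairs(fingerprints, bands=4, bits=64):
    # Bucket each fingerprint by the exact value of each of its bands.
    width = bits // bands
    mask = (1 << width) - 1
    buckets = defaultdict(set)
    for fp in fingerprints:
        for b in range(bands):
            buckets[(b, (fp >> (b * width)) & mask)].add(fp)
    # Only fingerprints that collide in some band become candidate pairs.
    pairs = set()
    for group in buckets.values():
        for a, b in combinations(sorted(group), 2):
            pairs.add((a, b))
    return pairs
```

The full pairwise comparison then runs only on the candidate pairs, which keeps the cost far below comparing every page with every other page.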
Finally, Google uses a weighted similarity rate that excludes certain blocks of identical content (boilerplate such as the header, navigation, sidebars, footer, and disclaimers). It also takes into account the subject of the page, using n-gram analysis to determine which words occur most frequently on the page and, in the context of the site, are most important.
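As a sketch of what the counting step of n-gram analysis looks like (how Google weights the results is not public), you can tally the most frequent word sequences on a page:

```python
from collections import Counter

def top_ngrams(text, n=2, k=5):
    # Lowercase, split on whitespace, and slide an n-word window over the text.
    words = text.lower().split()
    grams = zip(*(words[i:] for i in range(n)))
    # The most frequent n-grams hint at what the page is about.
    return Counter(grams).most_common(k)
```

For example, `top_ngrams("red shoes for sale buy red shoes online", n=2, k=1)` returns the bigram `("red", "shoes")` with a count of 2.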
Analyzing duplicate content with Simhash
We’ll be looking at a map of content clusters flagged as similar using Simhash. This chart from OnCrawl overlays an analysis of your duplicate content strategy on clusters of duplicate content.
OnCrawl’s content analysis also includes similarity ratios, content clusters, and n-gram analysis. OnCrawl is also working on an experimental heatmap indicating similarity per content block that can be overlaid on a webpage.
Validating clusters with canonicals
Using canonical URLs to indicate the main page in a group of similar pages is a way of intentionally clustering pages. Ideally, the clusters created by canonicals and those established by Simhash should be identical.
When this isn’t the case, it’s often because there is no canonical policy in place on your website, or because there are conflicts between your canonical strategy and the methods Google uses to group similar content.
Perhaps your site’s clusters don’t look like the ones above. You’ve already followed best practices for duplicate content: URLs that contain the same content (printable or mobile versions, alternate URLs generated by a CMS, and so on) declare the correct canonical URL.
Filter out the duplicate content that is correctly handled by your canonical strategy. The remaining non-canonicalized URLs are pages you want to rank.
URLs that still appear in clusters based on Simhash and semantic analysis are URLs you and Google disagree on.
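A quick way to surface these disagreements (the data structures here are hypothetical; a crawler such as OnCrawl would supply the real canonical and cluster data) is to check each Simhash cluster for URLs that declare different canonical targets:

```python
def disagreements(canonical_of, simhash_clusters):
    # canonical_of: maps each URL to its declared canonical (itself if self-canonical).
    # simhash_clusters: iterable of sets of URLs grouped as similar by Simhash.
    flagged = []
    for cluster in simhash_clusters:
        # URLs that all point to one canonical are correctly handled duplicates.
        targets = {canonical_of.get(url, url) for url in cluster}
        if len(targets) > 1:
            # More than one canonical target in a cluster: pages you consider
            # distinct, but that the similarity analysis groups together.
            flagged.append((cluster, targets))
    return flagged
```

Clusters returned by this check are exactly the URLs where your view and Google’s likely diverge.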
Solving duplicate content problems for unique content
There’s no satisfying trick to correct a machine’s view of unique pages that appear to be duplicates: we can’t change how Google identifies duplicate content. However, there are still solutions to align your perception of unique content with Google’s, while still ranking for the keywords you need.
Here are five strategies to adapt to your site.
Resolve edge cases
Start by looking at the edge cases: clusters with very low or very high similarity rates. Very high rates usually indicate true duplicates that can be canonicalized, redirected, or removed; very low rates are often false positives you can safely set aside.
Reduce the number of facets
If your duplicate pages are related to facets, you may have an indexing issue. Maintain the facets that already rank, and limit the number of facets you allow Google to index.
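If facets appear in the URL as query parameters (an assumption; some sites encode them in the path instead), a simple check like this can flag URLs that combine too many facets to be worth indexing:

```python
from urllib.parse import urlparse, parse_qs

def facet_depth(url):
    # Count distinct filter parameters on a faceted URL.
    return len(parse_qs(urlparse(url).query))

def should_index(url, max_facets=1):
    # Hypothetical policy: only index pages with at most one facet applied.
    return facet_depth(url) <= max_facets
```

A rule like this, tuned to keep the facets that already rank, gives you a consistent basis for deciding which faceted URLs to open to indexing.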
Make pages (more) unique
Remember: minor differences in content create minor differences in Simhash fingerprints. You need to make significant changes to the content on the page rather than small adjustments.
Enrich the page content: add substantial, page-specific text rather than reshuffling elements the pages already share.
Create ranking reference pages
If enriching your pages isn’t possible or appropriate, consider creating a single reference page that ranks in place of all the “duplicate” pages. This strategy uses the same principle as content hubs to promote a main page for multiple keywords. It’s particularly useful when you have multiple versions of a product that you need to maintain as separate pages.
This strategy can be used to create pages targeting a need or a seasonal opportunity. It can improve families of pages by providing stronger semantics and rankings.
It can also benefit classifieds websites, job offer sites, and other sites with many, often-similar listings. Reference pages should group listings by a single characteristic; location (city) is often used successfully.
In practice, point each “duplicate” page’s canonical declaration at the reference page and link to the reference page from those pages. Strengthened by links from the “duplicate” pages, canonical declarations, and combined content, reference pages are easier to rank.
Combine your pages
Do you keep trying to enrich pages with the same content? Can’t explain why you want to keep them all? It may be time to combine them.
If you decide to combine your pages into one, merge the content into the strongest URL and permanently redirect (301) the other URLs to it, so that their signals are consolidated on the combined page.
The future of duplicate content
Google’s ability to understand the content of a page is constantly evolving. As Google becomes more precise at identifying boilerplate and differentiating between page intents, unique content misidentified as duplicate should eventually become a thing of the past.
Until then, understanding why your content looks like duplicates to Google, and adapting it to convince Google otherwise, are the keys to successful SEO for similar pages.
About The Author
OnCrawl is an award-winning technical SEO platform that helps you make smarter SEO decisions. OnCrawl combines your content, log files and search data at scale so that you can open Google’s black box and build an SEO strategy with confidence. Backed by an SEO crawler, a log analyzer and third-party integrations, OnCrawl currently works with over 800 clients in 66 countries including e-commerce websites, online publishers and agencies. OnCrawl produces actionable dashboards and reports to support your entire search engine optimization process and helps you improve your positions, traffic and revenues. Learn more about us at