Don’t be Joey Donner.
Out of the middle of a crowded high school hallway, Joey Donner appears before Bianca (who has been seriously crushing on him for about 15 movie minutes of 10 Things I Hate About You), pulls out two nearly identical photos, and forces her to choose which she prefers.
Joey: [holding up headshots] “Which one do you like better?”
Bianca: “Umm, I think I like the white shirt better.”
Joey: “Yeah, it’s-it’s more…”
Bianca: “Pensive?”
Joey: “Damn, I was going for thoughtful.”
Like Bianca, search engines must make choices – black tee or white tee, to rank or not to rank (#ShakespearePuns!). And according to Introduction to Information Retrieval (chapter 19), “by some estimates, as many as 40% of the pages on the Web are duplicates of other pages” – which adds up to a tremendous amount of wasted storage and overhead for little return (#AintNoBotGotTimeForThat)!
On the surface, the solution is simple: Don’t be Joey Donner, making search engine bots pick between two identical results. However, a deeper look at Joey’s psychological state reveals that he doesn’t know he’s being redundant. He doesn’t realize that he’s presenting Bianca with the same photo and a sticky situation. He is simply unaware. Similarly, duplicate content can sprout from a multitude of unexpected avenues, and webmasters must be vigilant to ensure that it does not interfere with the experience of users or bots. We must be mindful and purposeful about not being another Joey Donner.
What is duplicate content?
As outlined in Google Webmaster Guidelines, duplicate content is “substantive blocks of content within or across domains that either completely match other content or are appreciably similar.”
It’s important to understand that pages that appear virtually the same to a user should not affect their site experience; however, pages with highly similar content will affect a search engine bot’s evaluation of those pages.
Why should webmasters care about duplicate content?
Duplicate content causes a few issues, primarily around rankings, due to the confusing signals that are sent to the search engine. These issues include:
- Indexing Challenges: Search engines don’t know which version(s) to include/exclude from their indices.
- Lower Link Impact: As different internet denizens across the web link to different versions of the same page, the link equity is spread across those multiple versions.
- Internal Competition: When content is very closely related, search engines struggle to decide which version of the page to rank for a query (because they look so similar, how’s a bot to know?!).
- Poor Crawl Bandwidth: Forcing search engines to crawl pages that don’t add value eats away at a site’s crawl bandwidth and can be a huge detriment for larger sites.
- Note: The non-canonical version can (and will also likely) be crawled, so internal linking to the end-state version of the URL is still of utmost importance!
What counts as duplicate content?
Duplicate content is often created unintentionally (most times we don’t aim to be Joey Donner).
Below is a list of common sources that duplicate content may unintentionally arise from. It is important to note that although all of these elements should be checked, they may not be causing issues (prioritizing top duplicate content challenges is vital).
Common sources of duplication:
- Repeated pages (example: sizing pages for every product with same specifications, paid search landing pages with repeated copy)
- Indexed staging sites
- URLs with different protocol (HTTP vs. HTTPS URLs)
- URLs on different subdomains (e.g., www vs. non-www)
- URLs with different case characters
- URLs with different file extensions
- Trailing slash or no slash (Google Webmaster Blog on / or no-/)
- Index pages (/index.html, /index.asp, etc.)
- URLs with parameters (e.g., click tracking, navigation filtering, session IDs, analytics code, etc.)
- Printer-friendly versions
- Doorway pages
- Poorly executed inventory control
- Syndicated content
- Press releases across domains
- Republishing content across domains
- Plagiarism across domains
- Sharing large snippets of content across domains
- Localized content (pages without appropriate hreflang annotations, especially in the same language)
- Thin content appearing as duplicate
- Includes: Template or boilerplate content
- Pages for images-only
- Indexable internal site-search results
- Note: Paginated series aren’t technically duplicate content. Google should index them with lower priority. Check out TechnicalSEO.com’s Pagination rel=”prev/next” Testing Tool.
- Separate Mobile URL Configuration
- With the mobile-first index rolling out, it is vital that the mobile experience mirrors the desktop experience. Separate mobile URLs (e.g., m-dots) should be canonicalized to the desktop version but should still contain the same content. If mobile UX teams are worried about the experience on smaller screens, Google suggests leveraging expandable content (e.g., “Read More” buttons, hidden tabs, accordions, etc.).
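Several of the URL-based sources above (protocol, www vs. non-www, letter case, trailing slashes, index pages, tracking parameters) can be spotted by collapsing each URL into a comparison key and grouping the matches. Here is a minimal Python sketch of that idea. The parameter list, the index file names, and the normalization rules are all assumptions for illustration; in particular, lowercasing paths is only safe if your server treats case variants as the same page.

```python
from collections import defaultdict
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical parameters that track visits without changing page content.
TRACKING_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "gclid", "sessionid"}
# Hypothetical default document names served for a directory.
INDEX_FILES = {"index.html", "index.htm", "index.asp", "index.php"}


def normalize(url: str) -> str:
    """Collapse common URL variants into one comparison key (assumed rules)."""
    parts = urlsplit(url)
    host = parts.netloc.lower()
    if host.startswith("www."):
        host = host[4:]  # treat www and non-www as the same host
    path = parts.path.lower()  # assumption: paths are case-insensitive
    head, _, last = path.rpartition("/")
    if last in INDEX_FILES:
        path = head + "/"  # /shoes/index.html -> /shoes/
    if len(path) > 1 and path.endswith("/"):
        path = path.rstrip("/")  # ignore trailing slashes except on the root
    if not path:
        path = "/"
    kept = sorted(
        (key, value)
        for key, value in parse_qsl(parts.query)
        if key.lower() not in TRACKING_PARAMS
    )
    # Compare everything as https and drop fragments.
    return urlunsplit(("https", host, path, urlencode(kept), ""))


def group_duplicates(urls):
    """Return {normalized key: [original URLs]} for keys seen more than once."""
    groups = defaultdict(list)
    for url in urls:
        groups[normalize(url)].append(url)
    return {key: members for key, members in groups.items() if len(members) > 1}


crawled = [
    "http://www.example.com/Shoes/",
    "https://example.com/shoes",
    "https://example.com/shoes/index.html?utm_source=mail",
    "https://example.com/hats",
]
for key, members in group_duplicates(crawled).items():
    print(key, "<-", members)
```

A key that maps to more than one crawled URL is only a candidate. Whether it is a real problem still depends on redirects, canonicals, and what the server returns for each variant.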
Does duplicate content rank?
When dealing with cross-domain duplicate content, there is an “auction” as to the winner (making duplicate content within the SERPs, hypothetically, a winner-take-all situation). Gary Illyes, better known as the House Elf and Chief of Sunshine and Happiness at Google, mentioned that the auction occurs during indexation, before the content gets into the database, and it’s fairly permanent (so once you’ve won, you’re supposedly going to have an edge). This means that the first to publish content should theoretically be considered the winner.
This, however, is not to say that duplicate content (whether on the same domain or across domains) will not rank. There are cases where Google determines that the original site is less suited to answer a query and selects a competing site to rank instead.
Rankings depend on the nature of the query, the content available on the web to answer said query, your site’s semantic relevance to the topic, and its authority within the space (i.e., duplicate content is more likely to rank for more specific, related queries or queries that have a low amount of related content).
Should duplicate content rank?
Theoretically, no. If content doesn’t add value to the users within the SERPs, it shouldn’t rank.
Should I be worried if my site must have duplicated content?
Focus on what’s best for the user. A grounding question — Is this answering your user’s questions in a way that is meaningful to your site’s overall experience?
If a site must have duplicate content (whether for political, legal, or time constraint reasons) and it cannot be consolidated, signal to search engine bots how they should proceed with one of the following methods — canonical tags, noindex/nofollow meta robots tags, or block within the robots.txt.
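To check which of these signals a page actually sends, it can help to read them straight out of the HTML. The following is a minimal Python sketch using only the standard library. The class name, function name, and sample markup are hypothetical, and it only inspects the canonical link and the meta robots tag (not HTTP headers or robots.txt).

```python
from html.parser import HTMLParser


class IndexingSignals(HTMLParser):
    """Collects the canonical URL and meta robots directives from a page."""

    def __init__(self):
        super().__init__()
        self.canonical = None
        self.robots = set()

    def handle_starttag(self, tag, attrs):
        # Attribute values can be None (e.g. bare attributes), so default to "".
        attr = {name: (value or "") for name, value in attrs}
        if tag == "link" and "canonical" in attr.get("rel", "").lower().split():
            self.canonical = attr.get("href")
        elif tag == "meta" and attr.get("name", "").lower() == "robots":
            directives = attr.get("content", "").lower().split(",")
            self.robots.update(d.strip() for d in directives if d.strip())


def read_signals(html: str) -> IndexingSignals:
    """Parse an HTML document and return the signals found in it."""
    parser = IndexingSignals()
    parser.feed(html)
    return parser


# Hypothetical page that points to a canonical version and asks not to be indexed.
page = """<html><head>
<link rel="canonical" href="https://example.com/shoes">
<meta name="robots" content="noindex, follow">
</head><body></body></html>"""

signals = read_signals(page)
print(signals.canonical)       # https://example.com/shoes
print(sorted(signals.robots))  # ['follow', 'noindex']
```

Running this across a group of near-identical URLs shows quickly whether they agree on one canonical version or send mixed messages.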
It’s also important to note that duplicate content, in and of itself, does NOT merit a penalty, according to Google’s John Mueller in the October 2015 Google Hangout (note: this does not include scraper sites, spam, spun content, or doorway pages).
How does one go about identifying duplicate content?
OnCrawl – I’d be remiss if I didn’t start with OnCrawl’s duplicate content visualization, because they’re the baddest in the biz (and by this I mean the best). One of my favorite aspects is how OnCrawl evaluates the duplicate content against canonicals. If the content is within a specific canonical cluster/group, then the issues can typically be dismissed as resolved. Their reports also take this one step further and can show data segmented by subfolder. This can help to identify specific areas with duplicate content issues.
Plagiarism Tools – Thank your high school teachers and college professors for creating some of the most useful tools for evaluating duplicate content. While trying to identify haphazard students, they managed to create useful tools that apply to online duplicate content (providing percentages of similarity). A+!
Google Searches – Leverage quotes and search operators to find duplicate content potentially within your site and across the web. If Google can’t find it, then the issue can likely be dismissed.
- Direct quotes in Google
- Searching via site: searches
- site:domain.com inurl:www
- inurl:product id or category id
Keyword Density Tools – When comparing content across pages use density checker visuals to identify topicality of a page. If the topic of the page isn’t coming through within the densities, the writing should be reviewed for clarity.
Google Search Console – Google Search Console offers countless useful tools to support webmasters. Chief among the duplicate content tools is Google’s URL Parameters report, which is designed to help Google crawl one’s site more efficiently by signaling how to handle URL parameters.
TechnicalSEO.com’s Mobile-First Index Prep Tool – If you have a separate mobile configuration this tool is a good place to start a mobile/desktop parity audit to identify any discrepancies.
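For a feel of how a “percentage of similarity” can be computed, here is a minimal Python sketch based on word shingling (a concept covered in chapter 19 of Introduction to Information Retrieval) with Jaccard similarity as the score. It is an illustration under stated assumptions, not how any particular tool or search engine does it: the shingle size of three words and the function names are arbitrary choices.

```python
import re


def shingles(text: str, k: int = 3) -> set:
    """Return the set of k-word shingles (overlapping word windows) in text."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < k:
        # Texts shorter than k words become a single shingle (or none if empty).
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def similarity(a: str, b: str, k: int = 3) -> float:
    """Jaccard similarity of two texts' shingle sets, from 0.0 to 1.0."""
    set_a, set_b = shingles(a, k), shingles(b, k)
    union = set_a | set_b
    if not union:
        return 0.0  # nothing to compare
    return len(set_a & set_b) / len(union)


# Hypothetical product blurbs that differ by one word.
page_a = "Classic white tee made from soft organic cotton"
page_b = "Classic black tee made from soft organic cotton"
print(f"{similarity(page_a, page_b):.0%} similar")
```

Changing a single word alters every shingle that contains it, so short texts score lower than you might expect. Larger documents and a threshold tuned on pages you already know are duplicates give more useful numbers.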
Solving for duplicate content
Solutions for duplicate content are highly contingent upon the cause; however, here are some tips and tricks. Duplicate content resolution requires a beautiful balance between technical SEO and content strategy.
- Know your user journey.
Understanding where users are in the marketing funnel, what content they’re interacting with, and why they’re interacting with it ultimately can help shape your site’s overarching information architecture and the content, creating experiences with purpose. See sample content mapping outlines below (print and fill in!).
- Create a strong hierarchical URL taxonomy and information architecture that facilitates this. If you have a ton of similar topics, make sure you have a clear keyword alignment map.
At the core, you want to make sure that you aren’t cannibalizing your own traffic. There’s no need to fight against oneself.
- When identifying duplicate content, it is vital to prioritize duplicate content issues that are affecting performance (and integrate this into the overall organic search strategy).
- If the pages are 100 percent duplicate and one version doesn’t need to be live, pick one and consolidate with a 301 HTTP status redirect.
- Based on your user journey – Make sure all content on pages that should be indexed is indexable.
- As an illustrative example, my team once identified an issue where comments from Facebook (which were a highlight of this website’s product pages) were not being indexed. Resolving Facebook comments not being crawled and indexed would turn the pages from thin content into unique forums relating to the product.
- Based on your user journey – Leverage HTML tags, robots.txt, and appropriate HTTP status codes to indicate what search engines should do with a particular piece of content.
- <link> Tag Attributes to Indicate Document Relationships:
- Canonical Tags: I’m the same as this version! (also may help marginally with crawl bandwidth)
- Rel=”prev/next”: I’m part of a paginated series.
- Blocking URLs:
- Meta robots=”noindex”: Don’t index me.
- Note: Don’t just noindex pages with link authority. Be strategic with the link authority.
- Meta robots=”nofollow”: Don’t follow my links.
- Robots.txt disallow: You’re not allowed to be here.
- 403 HTTP Status Code: No bots allowed.
- Server Password Protection: Give me a password to get in.
- Move forward strategically → Consolidate, create, delete, and optimize
- Consolidate – Consolidate duplicate content where applicable with 301 redirects and canonicalization (when both experiences must remain live).
- Optimize – Can you have a unique perspective? Can you target or align keywords better? How can you reframe this content to be unique and valuable?
- Create – Occasionally it makes sense to break out content and create a separate, universal experience.
- Delete – Pruning content can support crawl bandwidth, as search engines won’t be forced into crawling the same content repetitively that does not add additional value to your site.
- If your content is stolen, there are two primary avenues to pursue:
- 1. Request a canonical tag pointing back to your site.
- 2. File a DMCA request with Google.
- Sit back and enjoy the amazing rom-com that is 10 Things I Hate About You!
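Before leaning on the robots.txt disallow option from the list above, it is worth confirming which URLs a given set of rules really blocks. This minimal Python sketch uses the standard library’s urllib.robotparser. The rules and URLs are hypothetical, and the rules are parsed from a string rather than fetched from a live site. Keep in mind that a crawler blocked by robots.txt can’t fetch the page at all, so it won’t see a meta robots noindex on that page either.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt rules blocking internal search and printer-friendly pages.
ROBOTS_TXT = """\
User-agent: *
Disallow: /search
Disallow: /print/
"""


def blocked_urls(robots_txt: str, urls, user_agent: str = "*"):
    """Return the URLs that the given rules disallow for user_agent."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [url for url in urls if not parser.can_fetch(user_agent, url)]


candidates = [
    "https://example.com/shoes",
    "https://example.com/search?q=shoes",
    "https://example.com/print/shoes",
]
for url in blocked_urls(ROBOTS_TXT, candidates):
    print("blocked:", url)
```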
Content Mapping Strategy Templates
Illustrative User Journey
Print out and label a top user journey to ground yourself in the example. Label each type of content the user could interact with in their journey and the estimated duration of each step. Note that there may be additional steps and that the path may not always be linear. Add arrows and expand; the goal is to ground oneself in a basic example before diving into complex/involved mappings.
Simple Marketing Funnel Content Map
Print out and write out goals, types of content, common psychological traits, content location, and what it would take to push the user to the next stage in their journey. The idea is to identify when users are interacting with certain content, what’s going through their minds, and how to move them along in their journey.
Content Prioritization Matrix
Print out and map points with type of content available, mapping out by vital binary points. Once you’ve mapped out all of your content, step back and see if there are any areas missing. Leverage this matrix to prioritize the most important content, whether it’s by conversion potential or by need.
Retail Content Matrix:
Start with mapping informational to transactional intent on the y-axis. Non-brand and brand as more relevant.
Service Line Content Matrix:
Start with users that are more proactive versus reactive on the y-axis. Then transition to conversion potential. For services, this may look like “Seeks Expert” versus “DIY”.
Google’s Documentation on Duplicate Content
Introduction to Information Retrieval (chapter 19) – Stanford’s introductory textbook on search engine theory. This chapter covers the theory behind how search engineers might resolve duplicate content issues, including concepts such as fingerprinting and shingling. (pdf version, buy book on Amazon)
Duplicate Content Advice from Google – Hobo Web sifted through Google Hangout notes, Twitter comments, and Google documentation, painting a picture of Google’s position on duplicate content.