Duplicate Content on Google: The Complete Guide
Duplicate content is one of the most widespread problems in SEO and, curiously, one of the least discussed. We often talk about how to optimize a page for search engines or how to earn links, but we rarely mention the nuisance of duplicate content.
The point is that if your site is flooded with repeated pages, search engines will struggle to give it the visibility it deserves. That is why it is crucial to minimize the problem. In this post I explain everything you need to know about duplicate content:
What it is, where to find it and how to get rid of it.
- What is duplicate content
- Consequences of Duplicate Content
- Does Google penalize duplicate content?
- Causes of duplicate content
- Causes when off-site
- How to Detect Duplicate Content
- Detect it when off-site
- How to get rid of duplicate content
- How to get rid of it when off-site
- General Tips for Preventing Duplicate Content
What is duplicate content
Duplicate content is any text repeated across more than one URL, whether internal or external. It is what happens when your site generates multiple copies of the same page, or when a spammer copies one of your articles.
It may seem that duplicate content is of little importance, but the truth is that it poses a serious problem. Google users expect to find varied results, not the same content over and over again. For that reason, the search engine filters out copies of the same content so that they do not appear.
Of course, the filtering process happens behind the scenes. You are never told that your pages have been marked as duplicates, and you may even believe that all your pages are unique and attracting visitors when the reality is very different. But if your site generates duplicate content, you are losing the chance to appear in search results.
Consequences of Duplicate Content
Now that you know why it is so important to avoid duplicate content, you should know the problems it can cause your site. Some of the most important are:
Incorrect pages – Having different pages for the same content means leaving the choice of the correct page to the search engine. This is not a good idea, since it may choose a version other than the one you want.
Worse visibility – As a consequence of the above, the search engine may end up showing a weaker copy, and therefore rank it worse than the right version would have ranked.
Poor indexing – The indexing of your pages may suffer because the search engine spends its time crawling duplicate pages. If duplicate content makes up a significant portion of the site, the search engine will visit the important pages less frequently.
Wasted links – Duplicate pages can receive links and dilute the strength of your content, since all those links could (and should) be joining forces on a single page.
Misattribution – The search engine may decide that your content originated on another domain and exclude your pages from its results. It's harsh, but it happens.
Does Google penalize duplicate content?
Google rejects duplicate content but does not penalize it. What it does is filter it out of its results, which is punishment enough. However, sites that systematically copy and/or rewrite other people's content are penalized. The famous Panda algorithm was designed for that mission.
Causes of duplicate content
When people think of duplicate content, the first thing that comes to mind is the image of a spammer landing on your site, copying a few articles and pasting them into another domain. It rarely happens like this. The biggest source of duplicate content is usually the site itself, no matter how optimized it is. As you will now see, there are many reasons why you might be inundated with copies without knowing it:
Non-canonical domain – Your site can respond both on the subdomain that begins with “www.” and on the domain without this prefix. The preferred version is the canonical one, and failing to set it correctly leaves your site duplicated across both variants.
Secure pages – Similar to the canonical domain: if your site uses SSL encryption, you can end up with an exact copy in the secure version (the one that starts with HTTPS).
Session IDs – Many sites handle user sessions by appending a code to the end of each page's URL. These parameters, different for each user session, can make the search engine believe the pages are separate, although in reality they are the same.
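As a purely hypothetical illustration (the domain and the sessionid parameter are made up), all three of these URLs would serve exactly the same page, yet a search engine may treat them as three distinct ones:

```
https://www.example.com/catalog/shoes
https://www.example.com/catalog/shoes?sessionid=a1b2c3
https://www.example.com/catalog/shoes?sessionid=x9y8z7
```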
Dynamic content – Some sites add parameters to URLs to control the content displayed to the user. As with session IDs, search engines can interpret these pages as copies.
Archives – A typical problem on blogs is that the same content appears on different pages, for example in category and tag archives.
Pagination – Any site that uses pagination may have this problem, especially if the pages in the series share the same title and description.
Mobile version – If the smartphone version of your pages is served from a separate URL (e.g. a subdomain or subdirectory) and is not configured correctly, the search engine may have difficulty recognizing it as a parallel version.
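For separate mobile URLs, Google's documented convention is a pair of annotations: the desktop page declares its mobile counterpart with rel="alternate", and the mobile page points back with rel="canonical". A minimal sketch, with placeholder domains and path:

```html
<!-- On the desktop page, https://www.example.com/page : -->
<link rel="alternate"
      media="only screen and (max-width: 640px)"
      href="https://m.example.com/page">

<!-- On the mobile page, https://m.example.com/page : -->
<link rel="canonical" href="https://www.example.com/page">
```

With this pairing in place, the search engine understands that the two URLs are one document in two presentations, not duplicates.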
Causes when off-site
In this case, the most common reasons are:
Syndication – This consists of distributing your content to other sites to attract traffic, for example via RSS. The problem arises when those sites publish a complete copy of your content instead of a snippet.
Localization – To target several countries, you may have used the same (or nearly the same) content across several domains at once, such as .es and .mx.
CDN – Using a content delivery network (CDN) involves replicating part of your site's content (mainly static files) to the servers in the network. This can be a problem if appropriate measures are not taken.
Scraping – Scrapers use automated software to copy your pages and publish them on another domain.
Plagiarism – Occasionally, someone may copy a text from your site and publish it on theirs. Sometimes this is intentional, sometimes not (the mentality that everything on the Internet is free and can be used at will).
How to Detect Duplicate Content
Google identifies duplicate content when it encounters titles, descriptions, headings and sections that are identical or very similar. So to check for duplicate content, that is where you should start.
Here are the most effective methods:
Google Webmaster Tools – If you have signed up for Google Webmaster Tools, this is definitely the best starting point. Go to Search Appearance > HTML Improvements and pay attention to duplicate title tags and duplicate meta descriptions. The report tells you how many copies exist and on which pages they were found so you can correct them.
Site command – A very effective method, although it requires a lot of work. It consists of searching within your site for certain words or key phrases, such as products in the case of an online store (site:example.com “this is a product of the store”). In the results you can see whether there are pages with duplicate titles and descriptions. This method also tells you whether certain pages have been moved to the supplemental index: if so, you will see a message on the last results page offering to repeat the search with the omitted results included. This is a symptom of duplicate content.
Screaming Frog – This powerful tool allows you to crawl your site looking for duplicate content, among other things. The tabs that interest you are URI, Page Titles, Meta Description and H1, using the Duplicate filter.
Google Analytics – You can also find duplicate pages in Google Analytics using the Behavior> Site Content> Landing Pages report. The key is to look for suspicious URLs and pages that receive less organic traffic than they should.
Web Analysis Tools – There are tools capable of identifying duplicate content, as well as broken links, non-indexed pages and other problems that are difficult to detect with the naked eye, such as Siteliner, SEMrush, Advanced Web Ranking or Moz.
Detect it when off-site
If the copies are outside your site, you can check for duplicate content using tools such as:
Copyscape, in both its free and premium versions
How to get rid of duplicate content
It is clear that search engines do not like duplicate content because it leads to a poor user experience. So if you have duplicate content on your site, you should do everything possible to eliminate it.
Here are the main options for dealing with the problem:
Use rel=canonical – The “rel=canonical” tag was devised precisely to address this problem, so it is the best solution. It consists of a line of code inside the <head> section of the page that tells the search engine which version is the right (canonical) one. It can also be included in the page's HTTP header.
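A minimal sketch of the tag (the URL is a placeholder): it goes in the <head> of every duplicate variant, pointing at the preferred URL:

```html
<!-- Inside the <head> of each duplicate variant: -->
<link rel="canonical" href="https://www.example.com/original-page/">
```

For the HTTP header form, useful for non-HTML resources such as PDFs, the server sends a Link header instead: `Link: <https://www.example.com/original-page.pdf>; rel="canonical"`.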
Create 301 redirects – Recommended when you cannot use the canonical tag, when you move content from one page to another, or when you set the canonical domain. On Apache, 301 redirects are directives included in the .htaccess file.
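As an illustrative sketch for Apache with mod_rewrite enabled (the domain and paths are placeholders), this is the classic way to force the “www” variant as the canonical domain and to redirect a moved page:

```apache
# .htaccess — requires mod_rewrite
RewriteEngine On

# Send the bare domain to the canonical www version with a 301
RewriteCond %{HTTP_HOST} ^example\.com$ [NC]
RewriteRule ^(.*)$ https://www.example.com/$1 [R=301,L]

# Permanently redirect a page that has moved
Redirect 301 /old-page/ https://www.example.com/new-page/
```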
Deny access to robots – To prevent search engines from crawling duplicate pages, you can use the meta robots tag or the robots.txt file.
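A sketch of the robots.txt approach, assuming the duplicates live under a hypothetical /print/ directory:

```
# robots.txt, at the root of the domain:
# ask all crawlers to stay out of the duplicate /print/ copies
User-agent: *
Disallow: /print/
```

Alternatively, the duplicate page itself can carry `<meta name="robots" content="noindex, follow">` in its <head>, which lets crawlers visit the page but keeps it out of the index.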
Manage URL parameters – If duplicate content is triggered by parameters, you can tell Google which ones to ignore using the Crawl > URL Parameters tool in Webmaster Tools.
Use Schema.org – The search engine can use structured data to resolve the confusion between duplicate pages.
Merge pages or rewrite content – Both are sensible solutions when several pages of your site show very similar or identical content.
How to get rid of it when off-site
If your content has been copied by a third party, you have the following options:
Request its removal – Send a polite message by email or contact form asking for the content to be deleted. If there is no way to get in touch, use the WHOIS email address, which can be obtained through a tool such as Whois.net. If they are unwilling to get rid of the content, ask them at least to add a link to the page they copied from you. This will help the search engine identify the original source.
Request removal from Google – If communication does not work, you can ask the search engine to remove the offending page from its results. To do so, submit a request based on the US Digital Millennium Copyright Act (DMCA).
Help improve plagiarism detection – Apart from the above, you can submit your case to Google to be used as an example for improving its algorithms. Keep in mind, however, that this is not processed as a spam or copyright report.
General Tips for Preventing Duplicate Content
- Never use the same title/description on more than one page
- The text of each page should be unique, both within your site and across the whole web
- Include only the canonical version of each page in your sitemaps
- When you quote another site, always include a link to the original
- When you copy a whole page, ask permission first, include a link to the source, and deny search engines access to the page using robots (you can also use a cross-domain rel=canonical)