What is duplicate content?
According to Google’s own Webmaster Central Blog:
Duplicate content generally refers to substantive blocks of content within or across domains that either completely match other content or are appreciably similar. Most of the time when we see this, it’s unintentional or at least not malicious in origin: forums that generate both regular and stripped-down mobile-targeted pages, store items shown (and — worse yet — linked) via multiple distinct URLs, and so on. In some cases, content is duplicated across domains in an attempt to manipulate search engine rankings or garner more traffic via popular or long-tail queries.
Company “ACME” wants several entry points for the same website, so they register the following domains: acme.com, acme.be, acme.org, acme.eu, etc. They then point all of those domains to the same website (content). After a while the pages get crawled by the search engines. Guess what happens?
The search engine crawls acme.be and finds an article about the product “acme generic”. It indexes it and continues. Later it crawls acme.com and finds the same article about “acme generic”. It then flags the article at acme.com as a duplicate, penalizing it in the search results.
Why do the Search Engines do this?
To keep search results useful. A lot of (blackhat?!?) webmasters republish other sites’ RSS feeds as their own content. Someone browsing the web for information about “acme generic” should only get unique, relevant search results.
What should I do then?
Tips & tricks from Google:
- Block appropriately: Rather than letting our algorithms determine the “best” version of a document, you may wish to help guide us to your preferred version. For instance, if you don’t want us to index the printer versions of your site’s articles, disallow those directories or make use of regular expressions in your robots.txt file.
- Use 301s: If you have restructured your site, use 301 redirects (“RedirectPermanent”) in your .htaccess file to smartly redirect users, the Googlebot, and other spiders.
- Be consistent: Endeavor to keep your internal linking consistent; don’t link to /page/ and /page and /page/index.htm.
- Use TLDs: To help us serve the most appropriate version of a document, use top level domains whenever possible to handle country-specific content. We’re more likely to know that .de indicates Germany-focused content, for instance, than /de or de.example.com.
- Syndicate carefully: If you syndicate your content on other sites, make sure they include a link back to the original article on each syndicated article. Even with that, note that we’ll always show the (unblocked) version we think is most appropriate for users in each given search, which may or may not be the version you’d prefer.
- Use the preferred domain feature of webmaster tools: If other sites link to yours using both the www and non-www version of your URLs, you can let us know which way you prefer your site to be indexed.
- Minimize boilerplate repetition: For instance, instead of including lengthy copyright text on the bottom of every page, include a very brief summary and then link to a page with more details.
- Avoid publishing stubs: Users don’t like seeing “empty” pages, so avoid placeholders where possible. This means not publishing (or at least blocking) pages with zero reviews, no real estate listings, etc., so users (and bots) aren’t subjected to a zillion instances of “Below you’ll find a superb list of all the great rental opportunities in [insert cityname]…” with no actual listings.
- Understand your CMS: Make sure you’re familiar with how content is displayed on your Web site, particularly if it includes a blog, a forum, or related system that often shows the same content in multiple formats.
- Don’t worry be happy: Don’t fret too much about sites that scrape (misappropriate and republish) your content. Though annoying, it’s highly unlikely that such sites can negatively impact your site’s presence in Google. If you do spot a case that’s particularly frustrating, you are welcome to file a DMCA request to claim ownership of the content and have us deal with the rogue site.
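Two of the tips above (blocking printer versions and using 301s) can be sketched in config form. This is a minimal example assuming an Apache server and a hypothetical /print/ directory for printer-friendly pages; adjust the paths to your own site layout:

```apacheconf
# robots.txt — keep the printer-friendly copies out of the index
User-agent: *
Disallow: /print/

# .htaccess — after restructuring, permanently redirect the old URL
# (301 = "RedirectPermanent"; users and spiders land on the new page)
Redirect 301 /old-section/article.htm /articles/article/
```

Note that robots.txt and .htaccess are two separate files; the snippet just shows them side by side for brevity.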
The playbook for the above scenario
- Set up a 301 redirect from the aliases: acme.be, acme.org & acme.eu all point to acme.com
- Provide subdomains for the local branch offices: be.acme.com, emea.acme.com, etc.
- Configure your website so that it publishes a sitemap.
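The playbook above can be sketched as Apache configuration. This is only an illustration for the hypothetical ACME scenario, assuming Apache with mod_alias enabled:

```apacheconf
# Virtual host that answers for the alias domains only
<VirtualHost *:80>
    ServerName acme.be
    ServerAlias acme.org acme.eu
    # 301 every request to the canonical domain, preserving the path
    Redirect permanent / http://acme.com/
</VirtualHost>

# robots.txt on acme.com — point crawlers at the published sitemap
Sitemap: http://acme.com/sitemap.xml
```

With this in place, a crawler hitting acme.be/product/acme-generic is sent to acme.com/product/acme-generic with a permanent redirect, so only the canonical URL gets indexed.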
I hope this explained some things about duplicate content…