The WPKube Guide to Content Scraping in WordPress

Content scraping is essentially the act of copying the content from one site and publishing it on another. If you are publishing content online then there is a good chance that you have been a victim of content scraping at some point.

Content scraping is usually carried out in one of two ways. One popular method is to use a content scraping bot that searches the internet for relevant content, copies it, and publishes it on another website. Another approach is to manually search for content, copy it and then publish it elsewhere.

Either way, for the victim of content scraping the end result is the same: their content ends up published elsewhere without permission and usually without credit to the original author.

As Google and other search engines reportedly don’t like to list the same piece of content in their index more than once, if your content gets scraped, you run the risk of not being listed in the search engine results pages, despite the content rightfully belonging to you. Not only does someone else take the credit for your hard work, they also stand a good chance of taking the readers and visitors that would’ve made their way to your site via a Google search.

Why Do People Scrape Content?

At the most superficial level, the main reason for carrying out content scraping is to add content to a site with minimal effort. By using an automated content scraping service, unscrupulous webmasters can build out a site with thousands of pages in a very short space of time.

One of the reasons why they might do this is to effortlessly create a site that gets lots of traffic via the search engines. As in most cases traffic equals money, there is a good incentive to attempt this. The traffic to the site can then be monetized by building an email list for promoting products, displaying pay-per-click ads from networks like Google AdSense, or advertising products through an affiliate program such as Amazon Associates.

Another reason why people might scrape content is to claim credit for other people’s work, an act which is also known as plagiarism. While scraping for money might take place on a massive scale, with content copied from multiple sites on a daily basis, scraping for credit tends to involve a more selective approach.

Individuals and small businesses have been known to selectively scrape content on a manual basis, cherry-picking the best articles from a site as they find them, in order to boost their credibility and appear to be an expert on a particular topic. Appropriating other people’s content for portfolios is a common example of content scraping, where the content can then be used to gain clients and work. This content could take the form of images, written content or any other type that can be published or distributed online.

How to Check if You are a Victim

Many victims of content scraping are blissfully unaware of the fact. However, if you use WordPress, your chances of discovering it are greatly increased.

By making use of the WordPress pingback and trackback functionality, you will get a notification when someone publishes content that links back to your site. This only happens if the content they scrape contains links to your site, which is another good reason to interlink your content. While interlinking won’t stop scraping from happening, it can be a good way to be notified after the fact.

However, it’s best to ensure your installation of WordPress isn’t set up to publish these trackbacks on your site, as you would be publishing a link to the offending site. To find out how to disable publishing your trackbacks and pingbacks on your WordPress site, read our post on How to Deal with Trackbacks and Pingbacks in WordPress.

Another option is to use Google or another search engine to search for your content online. By pasting the title of your post, or a whole sentence, into the search engine surrounded by quotes, such as “WPKube Guide to Content Scraping”, you can view all the indexed pages that contain that exact phrase. As long as the phrase you search for is fairly distinctive, any results returned are worth investigating to see if your content has been scraped.
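
If you want to check several posts regularly, building the quoted search link can be scripted. The following is only a minimal sketch in Python: the function name `exact_phrase_search_url` is made up for this illustration, and it assumes the standard Google search URL format with the query passed in the `q` parameter. It only builds the link for you to open in a browser; it does not fetch or parse any results.

```python
from urllib.parse import quote_plus


def exact_phrase_search_url(phrase, base="https://www.google.com/search?q="):
    """Return a search URL that looks for `phrase` as an exact match.

    `base` is an assumption: swap in another search engine's query URL if needed.
    """
    # Remove any quotes already around the phrase, then wrap it in straight
    # double quotes, which asks the search engine for an exact-phrase match.
    quoted = '"' + phrase.strip().strip('"“”') + '"'
    # URL-encode the quoted phrase so it is safe to use in a link.
    return base + quote_plus(quoted)


print(exact_phrase_search_url("WPKube Guide to Content Scraping"))
```

Using a full sentence from the body of a post, rather than just the title, tends to produce fewer unrelated matches.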

How to Prevent Content Scraping

There isn’t much you can do to prevent content scraping from taking place. However, there are some anti-content scraping WordPress plugins available, as well as commercial services that you can sign up for, to help dissuade scrapers from targeting your site. Some plugins work to make sure that once your content has been scraped, you still get credit for it when it is republished elsewhere. Some plugins to consider include:

Anti Feed-Scraper Message: this free plugin adds some text and a link to each of your posts in your RSS feed, where the bot is likely to be sourcing your content from, attributing the author and a link back to your site.
Copyright Proof: this plugin works with the Digiprove service to ensure that there is a record of your site being the rightful owner of the content you create and publish.
WordPress Data Guard: block the IP addresses of those you suspect are stealing your content, preventing them from accessing your site.
DMCA Protection Badge: this free plugin allows you to easily insert anti-scraping badges on your site that might help dissuade scrapers from targeting it, although it’s no guarantee.
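
To make the feed attribution idea from the first plugin in the list more concrete, here is a rough sketch in Python of what appending a credit line to each feed item could look like. This is not the plugin’s actual code (a real WordPress plugin would be written in PHP), and the function name `add_feed_attribution` and its parameters are hypothetical, chosen only for illustration.

```python
from html import escape


def add_feed_attribution(content_html, post_title, post_url, author):
    """Append an attribution paragraph to the HTML of one feed item.

    Hypothetical helper: it shows the idea of crediting the source inside
    the feed, so a bot that republishes the feed also republishes the credit.
    """
    # Escape the values so titles or names containing &, < or " stay valid HTML.
    footer = '<p>"{title}" by {author} was originally published at <a href="{url}">{url}</a>.</p>'.format(
        title=escape(post_title),
        author=escape(author),
        url=escape(post_url),
    )
    # Keep the original content untouched and add the credit after it.
    return content_html + "\n" + footer


print(add_feed_attribution(
    "<p>Post body goes here.</p>",
    "The WPKube Guide to Content Scraping in WordPress",
    "https://example.com/content-scraping/",  # placeholder URL
    "Example Author",  # placeholder author name
))
```

Because the credit is part of the feed content itself, it only survives if the scraper copies the feed verbatim; a scraper that strips links will remove it.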

Once it has taken place, you do have a couple of limited options. One such option is to invoke a DMCA takedown. A takedown service works in line with the Digital Millennium Copyright Act and, for a fee, will attempt to get your stolen content taken down. However, other than getting in touch with the website owner or their host and stating your case, there aren’t really any other options.

Content Curating vs. Content Scraping

Content curating is a popular method of publishing that, if done incorrectly, could see you inadvertently becoming a content scraper. Content curation can be described as the practice of sharing content with others. This can be in the form of a Tweet or a top 10 list of must-read articles on your blog.

Some lists of curated content feature an excerpt from the source material along with a link back to the original site. While this is in most cases acceptable, it is essential that you properly attribute the author and the original source. Good content curation sees the curator adding value for the reader in some way, such as by highlighting a key point or giving their take on the topic.

Conclusion

Content scraping will continue to take place for as long as the efforts of those doing it are rewarded. Until Google and the other major search engines become sophisticated enough to determine what the original source of an article was, and not list the unauthorised publisher prominently in their listings, sites with stolen content will continue to thrive.

There are steps you can take to minimise the chances of it happening to you, and to ensure your stolen content is still attributed to you in some way. At the end of the day, though, the fate of your content is out of your hands.

When it happens to you, the best approach is to remember the saying that imitation is the sincerest form of flattery and then get back to creating the best content you can. By building a community around your site and making a name for yourself in your niche, you can ensure that you benefit from creating great content, even if others try to piggyback on your efforts and dishonestly gain from your hard work.

Images: Lego / Crime Scene