Best practices for web scraping

At Zyte, we care about ensuring that our services respect the rights of the websites and companies whose data we scrape.

We hear a lot that scraping is a legal grey area, but the truth is scraping itself isn’t illegal. It’s the manner in which you scrape and what you scrape that falls into the grey area.

In this article, we’ll give you a set of guidelines to follow when scraping the web so you know when you need to be cautious about the manner and type of data you scrape.

Disclaimer: We are not your lawyer, and the recommendations in this guide do not constitute legal advice. Our Head of Legal is a lawyer, but she’s not your lawyer, so none of her opinions or recommendations in this guide constitute legal advice from her to you. The commentary and recommendations outlined below are based on Zyte’s experience helping our clients (from startups to Fortune 100 companies) maintain legal compliance whilst scraping 7 billion web pages per month. If you want assistance with your specific situation, you should consult a lawyer.

Don't be a burden

The first rule of scraping the web is: do not harm the website. The second rule of web crawling is: do NOT harm the website.

This means that the volume and frequency of queries you make should not burden the website’s servers or interfere with the website’s normal operations.

You can accomplish this in a number of ways:

  1. Limit the number of concurrent requests to the same website from a single IP.
  2. Wait between requests for at least as long as the crawl-delay directive in the website’s robots.txt file specifies.
  3. Where possible, schedule your crawls for the website’s off-peak hours.
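
The first two points can be sketched in a few lines of Python using the standard library's urllib.robotparser. This is only a minimal illustration, not a description of how Zyte's own crawlers work: the user agent name, the default delay, and the sample robots.txt content are all assumptions made up for the example, and a real crawler would download the robots.txt of the site it is visiting.

```python
import time
from urllib.robotparser import RobotFileParser

# Hypothetical values for illustration; none of these come from the article.
USER_AGENT = "example-crawler"
DEFAULT_DELAY_SECONDS = 1.0

# Stand-in robots.txt content; a real crawler would download the site's own file.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 5
Disallow: /private/
"""


def load_rules(robots_txt):
    """Parse robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def delay_for(parser, user_agent=USER_AGENT):
    """Return the site's crawl-delay, or a conservative default if none is set."""
    delay = parser.crawl_delay(user_agent)
    return float(delay) if delay is not None else DEFAULT_DELAY_SECONDS


class Throttle:
    """Keeps at least `delay` seconds between consecutive requests to one site."""

    def __init__(self, delay, clock=time.monotonic, sleep=time.sleep):
        self.delay = delay
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous request, if any

    def wait(self):
        """Block until the next request to this site is allowed."""
        now = self._clock()
        if self._last is not None:
            remaining = self.delay - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
        self._last = self._clock()
```

Calling wait() before every request from a single worker keeps requests to one site sequential and spaced out, which covers both the concurrency limit and the delay. Running crawls at off-peak hours is a scheduling decision and is left out of the sketch.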

A crucial aspect of this rule is giving the administrators of the websites you scrape an easy way to contact you. At Zyte we accomplish this by making an abuse report available on our website. If you ever receive an abuse report from a website you are scraping, you should either stop scraping the site or scale back your scraping to resolve the problem reported.

Don't violate copyright

When scraping a website you should always consider whether the web data you are planning to extract is copyrighted.

Copyright is defined as the exclusive legal right over a physical piece of work — like an article, picture, movie, etc. It basically means, if you create it, you own it. In order to be copyrightable, the work needs to be original and tangible.

The common types of material on the web that might be copyrighted are:

  • Articles
  • Videos
  • Pictures
  • Stories
  • Music
  • Databases

As a result, copyright is very relevant to scraping because much of the data on the internet (like articles and videos) is copyrighted work.

However, in some situations exceptions can apply to all or part of the data, enabling it to be legally scraped without infringing on the owner's copyright.

Fair use: 

Fair use is an exception that permits limited use of copyrighted material. Typically, fair use includes categories such as criticism/parody, comment, news reporting, teaching, scholarship, and research. One example of fair use is the publishing of short snippets of articles with links, which is generally okay under the fair use exception due to the transformative and limited nature of the use.

The factors commonly used to determine if the fair use exception applies are:

  1. the purpose and character of your use (i.e., is it transformative in some way);
  2. the nature of the work (i.e., fact vs. fiction or published vs. unpublished);
  3. the amount taken (the less you copy, the better); and
  4. the effect upon the potential market, meaning the extent to which your use may deprive the owner of income or a potential market opportunity.

Transformative use: 

One factor in determining fair use is whether the usage is transformative. Instead of distributing and storing exact duplicates or lengthy portions of the crawled website, transform the content and the use of the content in some way so that you are not violating copyright.

Facts: 

The facts within copyrighted material are often not covered by copyright law, so if you limit what you scrape to purely factual matters (e.g., product names and prices), then it is acceptable to scrape.
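
One way to put this into practice is to filter each scraped item down to an allowlist of factual fields before storing it. The sketch below is a minimal illustration only: the field names are hypothetical, and deciding which fields actually count as facts for a given site is a judgment the code cannot make for you.

```python
# Hypothetical field names; which fields count as facts depends on the site
# and on your use case.
FACTUAL_FIELDS = {"name", "price", "currency", "sku"}


def keep_factual_fields(item):
    """Return a copy of a scraped item containing only allowlisted fields."""
    return {key: value for key, value in item.items() if key in FACTUAL_FIELDS}
```

An allowlist is used rather than a blocklist so that any new field a site adds is dropped by default until someone has reviewed it.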

Note that different countries have different exceptions to copyright law, and you should always ensure that an exception applies in the jurisdiction in which you’re operating.

Don't breach GDPR

The introduction of GDPR completely changed how you can scrape the personal data of EU residents (and sometimes of people outside the EU as well). For a deeper explanation of how GDPR affects web scrapers, be sure to check out our Web Scrapers Guide to GDPR.

However, in this section, we will briefly outline the best practices when it comes to scraping personal data. Personal data is any data that can identify an individual person:

  • Name
  • Email
  • Phone number
  • Address
  • User name
  • IP address
  • Bank or credit card info
  • Medical data
  • Biometric data

Unless you have a “lawful reason” to scrape and store this data, you will be in breach of GDPR if any of the scraped data belongs to EU residents. In the case of web scraping, the most common lawful reasons are legitimate interest and consent.
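
As a rough illustration of that rule, a pipeline can drop personal-data fields by default and keep them only when a lawful reason has been recorded for the dataset. This is a minimal sketch under stated assumptions: the field names and the lawful_reason parameter are hypothetical, and matching on key names will not catch personal data embedded in free text.

```python
# Hypothetical key names based on the list above; real records will differ.
PERSONAL_DATA_FIELDS = {
    "name", "email", "phone_number", "address", "username",
    "ip_address", "bank_details", "medical_data", "biometric_data",
}


def prepare_for_storage(record, lawful_reason=None):
    """Drop personal-data fields unless a lawful reason has been recorded."""
    if lawful_reason:
        # Keep everything; the caller has documented why this is allowed.
        return dict(record)
    return {
        key: value
        for key, value in record.items()
        if key not in PERSONAL_DATA_FIELDS
    }
```

Passing a lawful reason here only records the caller's claim; whether that claim holds is a legal question, not something the code can verify.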

Consent

For consent to be your lawful reason to scrape a person's data, you need to have that person's explicit consent to scrape, store and use their data in the way you intend. This means that you or a third party must have been in direct contact with the person and they must have agreed to terms that allow you to scrape their data.

An example of this would be companies like Mint.com, where users give Mint consent to log into their online banking accounts and retrieve their banking transactions so that they can be tracked and displayed in a more user-friendly format on Mint.com.

Legitimate interest

For most companies, it will be very difficult to demonstrate a legitimate interest in scraping someone’s personal data.

In most cases, only governments, law enforcement agencies, etc. will have what would be deemed a legitimate interest in scraping the personal data of their citizens, as they will typically be scraping people’s personal data for the public good.

Beware of login and website terms and conditions

When you log in and/or explicitly agree to a website's terms and conditions, you are entering into a contract with the website owner, thereby agreeing to their rules regarding web scraping. These rules can explicitly state that you aren’t allowed to scrape any data on the website.

This means that you need to carefully review the terms and conditions you are agreeing to if your spiders have to log in to scrape data, as they could stipulate that you're not allowed to scrape their data. You should always honor the terms of any contract you enter into, including website terms and conditions and privacy policies.


Learn more about web scraping

Here at Zyte, we have been in the web scraping industry for 12 years. We have helped extract web data for more than 1,000 clients, ranging from government agencies and Fortune 100 companies to early-stage startups and individuals. During this time we have gained a tremendous amount of experience and expertise in web data extraction.
