When scraping a website you should always consider whether the web data you are planning to extract is copyrighted.
Copyright is the exclusive legal right over an original creative work, such as an article, a photograph, or a film. In essence, if you create it, you own it. To be copyrightable, a work must be original and fixed in a tangible form.
The most common types of potentially copyrighted material on the web are articles, images, videos, music, and databases.
As a result, copyright is highly relevant to web scraping, because much of the data on the internet (such as articles and videos) is copyrighted work.
However, in some situations an exception can apply to all or part of the data, allowing it to be scraped legally without infringing the owner's copyright.
Note that copyright exceptions vary by country, so you should always make sure an exception applies in the jurisdiction in which you're operating.
The introduction of the GDPR fundamentally changed how you can scrape the personal data of EU citizens (and sometimes non-EU citizens as well). For a deeper explanation of how the GDPR affects web scrapers, be sure to check out our Web Scrapers Guide to GDPR.
However, in this section we will briefly outline the best practices when it comes to scraping personal data. Personal data is any data that can identify an individual person, such as names, email addresses, phone numbers, physical addresses, and IP addresses.
Unless you have a lawful basis to scrape and store this data, you will be in breach of the GDPR if any of the scraped data belongs to EU residents. In the case of web scraping, the most common lawful bases are legitimate interest and consent.
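One practical safeguard is to drop personal-data fields from scraped records at processing time unless a lawful basis has been recorded for them. A minimal sketch of this idea, where the field names and the `lawful_basis` values are illustrative assumptions rather than part of any library or legal standard:

```python
# Illustrative GDPR safeguard: strip personal-data fields from a scraped
# record unless a lawful basis (e.g. consent) has been recorded for it.
PERSONAL_FIELDS = {"name", "email", "phone", "address", "ip_address"}

def strip_personal_data(record, lawful_basis=None):
    """Return a copy of the record, with personal-data fields removed
    when no lawful basis applies."""
    if lawful_basis in ("consent", "legitimate_interest"):
        return dict(record)
    return {k: v for k, v in record.items() if k not in PERSONAL_FIELDS}

item = {"title": "A blog post", "email": "jane@example.com"}
clean = strip_personal_data(item)  # no lawful basis recorded
# → {"title": "A blog post"}
```

In a Scrapy project, a function like this would naturally live in an item pipeline, so every scraped item passes through it before storage.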
It is always best practice to identify yourself whenever possible by putting contact details in your crawler's headers. This is especially important when using data center or other third-party IPs: make sure that anyone affected can send an abuse report or cease-and-desist notice back to you. Otherwise, sysadmins will have to dig through their logs to track down the offending IPs.
Be nice to the friendly sysadmins in your life and identify your crawler. In Scrapy, this can be done with the USER_AGENT setting, where you can share your crawler name, company name, and a contact email:
USER_AGENT = 'MyCompany-MyCrawler (email@example.com)'
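The same identification principle applies outside Scrapy too. A minimal sketch using only Python's standard library, reusing the illustrative user-agent string from above (the URL is also just a placeholder):

```python
from urllib.request import Request

# Identifying user-agent string (illustrative values, as above).
USER_AGENT = "MyCompany-MyCrawler (email@example.com)"

def build_request(url):
    # Attach the identifying User-Agent header to every request,
    # so sysadmins know who is crawling and how to reach them.
    return Request(url, headers={"User-Agent": USER_AGENT})

req = build_request("https://example.com/")
# urllib normalizes header names, so the stored key is "User-agent".
assert req.get_header("User-agent") == USER_AGENT
```

The request can then be passed to `urllib.request.urlopen` as usual; the point is simply that no request leaves your crawler without the identifying header.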
Then, on your website, provide an easy-to-use contact form or abuse-report page where a sysadmin can let you know about any issue with your web scraping.
When you log in to a website and/or explicitly agree to its terms and conditions, you are entering into a contract with the website owner, thereby agreeing to their rules regarding web scraping, which may explicitly prohibit scraping any data from the site.
This means you need to carefully review any terms and conditions you agree to when your spiders have to log in to scrape data. You should always honor the terms of any contract you enter into, including website terms and conditions and privacy policies.