Scraping the web for publicly available data has become increasingly popular in this age of machine learning and big data.
However, if you search for “how to build a web scraper in Python,” you will find many different answers about the best way to develop a Python web scraping project.
To help clear up some of the confusion, this guide compares the four most common open source Python libraries and frameworks used for web scraping, so you can decide which option is best for your project.
Some of these are libraries that can solve a specific part of the web scraping process. However, other solutions, like Scrapy, are complete web scraping frameworks designed explicitly for the job of scraping the web.
Requests is a Python library designed to simplify the process of making HTTP requests. This is highly valuable for web scraping because the first step in any web scraping workflow is to send an HTTP request to the website’s server to retrieve the data displayed on the target web page.
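As a rough illustration of that first step, here is a minimal sketch using Requests. The URL, the `page` query parameter and the User-Agent string are placeholders chosen for this example, not values from any real site.

```python
import requests

# Hypothetical target URL and User-Agent, used only for illustration.
BASE_URL = "https://example.com/products"
HEADERS = {"User-Agent": "my-scraper/0.1"}


def preview_url(page):
    """Return the URL Requests would call, without touching the network."""
    return requests.Request("GET", BASE_URL, params={"page": page}).prepare().url


def fetch_page(page, timeout=10.0):
    """Send the GET request and return the raw HTML of the page."""
    response = requests.get(
        BASE_URL, params={"page": page}, headers=HEADERS, timeout=timeout
    )
    response.raise_for_status()  # raise an exception on 4xx/5xx responses
    return response.text
```

Calling `fetch_page(1)` returns the page body as a string, which is the input the parsing step needs.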
Out of the box, Python 2 came with two built-in modules, urllib and urllib2, designed to handle HTTP requests; in Python 3 they were merged into the urllib package. However, most developers prefer the Requests library because urllib and urllib2 often had to be used together, the documentation can be confusing, and even a simple HTTP request can require a lot of code.
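For comparison, a simple GET request using only the standard library in Python 3 might look like the sketch below. The URL and header value are again placeholders. Note that the query string has to be encoded and the response bytes decoded by hand.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen

BASE_URL = "https://example.com/products"  # hypothetical URL


def build_request(page):
    """Assemble the request object; the query string must be encoded manually."""
    query = urlencode({"page": page})
    return Request(f"{BASE_URL}?{query}", headers={"User-Agent": "my-scraper/0.1"})


def fetch_page(page, timeout=10.0):
    """Send the request and decode the response bytes into text."""
    with urlopen(build_request(page), timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset)
```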
Using the Requests library is good for the first part of the web scraping process (retrieving the web page data). However, to build a fully functioning web scraping spider, you’ll need to write your own scheduling and parallelization logic and use other Python libraries such as BeautifulSoup for the remaining steps, which leads us nicely into the next web scraping library we’ll discuss.
Unlike Requests, BeautifulSoup is a Python library designed to parse data, i.e., to extract data from HTML or XML documents.
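To give a sense of what writing your own parallelization logic involves, here is one possible sketch built on a thread pool from the standard library. The `fetch_all` helper and its `fetch` parameter are names invented for this example; `fetch` can be any function that takes a URL and returns the page body, such as a thin wrapper around `requests.get`.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def fetch_all(urls, fetch, max_workers=4):
    """Download several URLs on a small thread pool.

    Returns a dict mapping each URL to its body, or to an error message
    if that particular download failed.
    """
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Submit every download up front and remember which future is which.
        futures = {pool.submit(fetch, url): url for url in urls}
        for future in as_completed(futures):
            url = futures[future]
            try:
                results[url] = future.result()
            except Exception as exc:  # one failed page should not stop the rest
                results[url] = f"ERROR: {exc}"
    return results
```

Even this small version leaves out retries, rate limiting and duplicate filtering, which a real spider would also have to handle.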
Because BeautifulSoup can only parse the data and can’t retrieve the web pages themselves, it is often used with the Requests library. In situations like these, Requests will make the HTTP request to the website to retrieve the web page, and once it has been returned, BeautifulSoup can be used to parse the target data from the HTML page.
One of the big advantages of using BeautifulSoup is its simplicity and ability to automate some of the recurring parts of parsing data during web scraping. With only a few lines of code, you can configure BeautifulSoup to navigate an entire parsed document and find all instances of the data you want (e.g., find all links in a document), and it can automatically detect the document’s character encoding so that special characters are handled correctly.
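The “find all links” case can be sketched as follows. The HTML snippet is made up for this example and stands in for a page that has already been downloaded; `html.parser` is Python’s built-in parser, so no extra parser package is assumed.

```python
from typing import List

from bs4 import BeautifulSoup

# Hypothetical HTML standing in for a page that was already downloaded.
HTML = """
<html><body>
  <h1>Products</h1>
  <a href="/item/1">First item</a>
  <a href="/item/2">Second item</a>
  <a>Anchor without a link</a>
</body></html>
"""


def extract_links(html: str) -> List[str]:
    """Return the href of every <a> tag that actually has one."""
    soup = BeautifulSoup(html, "html.parser")
    return [a["href"] for a in soup.find_all("a", href=True)]


def extract_heading(html: str) -> str:
    """Return the text of the first <h1> on the page."""
    soup = BeautifulSoup(html, "html.parser")
    return soup.h1.get_text(strip=True)
```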
Selenium is another library that can be useful when scraping the web. Unlike the other libraries, Selenium wasn’t originally designed for web scraping. First and foremost, Selenium is a web driver designed to render web pages like your web browser would for the purpose of automated testing of web applications.
We have discussed each of the main Python libraries used when scraping the web. As you can see, each of them has been designed to accomplish one aspect of the web scraping process, so you have to combine multiple libraries to build a fully functioning web scraping spider.
However, there is an easier approach: use a purpose-built web scraping framework such as Scrapy, which includes all the core components needed to build a web scraper out of the box and has a huge range of plugins designed to deal with edge cases.
Scrapy is an open source Python framework built specifically for web scraping by Scrapinghub co-founders Pablo Hoffman and Shane Evans. You might be asking yourself, “What does that mean?”
It means that Scrapy is a fully fledged web scraping solution that takes a lot of the work out of building and configuring your spiders, and best of all, it seamlessly deals with edge cases that you probably haven’t thought of yet.
Within minutes of installing the framework, you can have a fully functioning spider scraping the web. Out of the box, Scrapy spiders are designed to download HTML, parse and process the data and save it in either CSV, JSON or XML file formats.
There is also a wide range of built-in extensions and middlewares designed for handling cookies and sessions as well as HTTP features like compression, authentication, caching, user-agents, robots.txt and crawl depth restriction. Scrapy is also easy to extend: you can add custom middlewares or pipelines to your web scraping projects to get the specific functionality you require.
One of the biggest advantages of using the Scrapy framework is that it is built on Twisted, an asynchronous networking library. This means that Scrapy spiders don’t have to wait to make requests one at a time. Instead, they can have multiple HTTP requests in flight concurrently and parse the data as it is returned by the server. This significantly increases the speed and efficiency of a web scraping spider.
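The effect of asynchronous requests can be illustrated with a small simulation. Note that this sketch uses Python’s standard `asyncio` module rather than Twisted, and `fake_fetch` only sleeps to imitate waiting on a server, so it shows the general idea rather than how Scrapy is implemented internally.

```python
import asyncio
import time


async def fake_fetch(url, delay=0.2):
    """Pretend to download a page; the sleep stands in for network latency."""
    await asyncio.sleep(delay)
    return f"body of {url}"


async def crawl(urls):
    """Start all downloads at once and wait for them together."""
    return await asyncio.gather(*(fake_fetch(url) for url in urls))


def timed_crawl(urls):
    """Run the crawl and report how long it took in seconds."""
    start = time.perf_counter()
    bodies = asyncio.run(crawl(urls))
    return bodies, time.perf_counter() - start
```

With five URLs and a 0.2 second delay each, fetching one at a time would take about a second, while the concurrent version finishes in roughly the time of a single request because the waits overlap.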
The learning curve for Scrapy is a bit steeper than, for example, learning how to use BeautifulSoup. However, the Scrapy project has excellent documentation and an extremely active ecosystem of developers on GitHub and Stack Overflow who regularly release new plugins and can help you troubleshoot any issues you run into.
If you’d like to build your first Scrapy spider, then be sure to check out the Learn Scrapy tutorials.
To help you understand the differences between the different web scraping libraries and frameworks, we’ve created a simple comparison table.
OK, we’ve talked about some of the most popular Python libraries and frameworks for web scraping, but which one is best for your particular project?
There is no one-size-fits-all answer, as it really depends on the scale and scope of your web scraping project.
However, as a general recommendation, we’d give this advice: