Web Scraping Frameworks

Scraping the web for publicly available web data is becoming popular in this age of machine learning and big data.

However, if you search “how to build a web scraper in Python,” you will get numerous competing answers about the best way to develop a Python web scraping project.

To clear up some of that confusion, this guide compares the four most common open source Python libraries and frameworks used for web scraping so you can decide which option is best for your project.

  • Requests
  • BeautifulSoup
  • Selenium
  • Scrapy

Some of these are libraries that can solve a specific part of the web scraping process. However, other solutions, like Scrapy, are complete web scraping frameworks designed explicitly for the job of scraping the web.

Requests

Requests is a Python library designed to simplify the process of making HTTP requests. This is highly valuable for web scraping because the first step in any web scraping workflow is to send an HTTP request to the website’s server to retrieve the data displayed on the target web page.
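As a minimal sketch of that first step, the function below fetches a page with Requests. The function name `fetch_html`, the `timeout` default and the optional `session` parameter are illustrative assumptions rather than anything the library requires.

```python
import requests


def fetch_html(url, session=None, timeout=10):
    """Return the body of `url` as text, raising on HTTP error statuses."""
    # Reusing a Session keeps connections and cookies between calls;
    # create a throwaway one when the caller does not supply it.
    if session is None:
        session = requests.Session()
    response = session.get(url, timeout=timeout)
    response.raise_for_status()  # turn 4xx/5xx responses into exceptions
    return response.text
```

Calling `fetch_html("https://example.com/")` (a placeholder URL) would return that page’s HTML as a string, ready to be handed to a parser.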

Out of the box, Python 2 came with two built-in modules, urllib and urllib2, for handling HTTP requests (Python 3 merged them into a single urllib package). However, most developers prefer the Requests library because urllib and urllib2 often had to be used together, their documentation can be confusing, and even a simple HTTP request can take a lot of code.

Requests is a good fit for the first part of the web scraping process (retrieving the web page data). However, to build a fully functioning web scraping spider, you’ll need to write your own scheduling and parallelization logic and use other Python libraries, such as BeautifulSoup, for the remaining steps such as parsing.
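To give a sense of what “your own parallelization logic” can look like, here is a rough sketch that uses a thread pool from the standard library. The helper name `fetch_all` and the `max_workers` default are assumptions for illustration, and `fetch` stands for any callable that takes a URL, such as a small wrapper around Requests.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_all(urls, fetch, max_workers=8):
    """Apply `fetch` to every URL in parallel and return {url: result}."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so zip pairs each URL with its page.
        return dict(zip(urls, pool.map(fetch, urls)))


# Stand-in fetcher (str.upper) so the sketch runs without network access.
pages = fetch_all(["https://example.com/a", "https://example.com/b"], str.upper)
print(pages)
```

This covers only the parallel part. Scheduling concerns such as retries, politeness delays and duplicate filtering are left out and would still have to be written by hand.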

Beautiful Soup

Unlike Requests, BeautifulSoup is a Python library designed to parse data, i.e., to extract data from HTML or XML documents.

Because BeautifulSoup can only parse the data and can’t retrieve the web pages themselves, it is often used with the Requests library. In situations like these, Requests will make the HTTP request to the website to retrieve the web page, and once it has been returned, BeautifulSoup can be used to parse the target data from the HTML page.

One of the big advantages of BeautifulSoup is its simplicity and its ability to automate some of the recurring parts of parsing data during web scraping. With only a few lines of code, you can have BeautifulSoup navigate an entire parsed document and find all instances of the data you want (e.g., all links in a document), and it automatically detects the document’s encoding so that special characters are handled correctly.
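The “find all links” case mentioned above might look roughly like this. The HTML snippet and the `extract_links` helper are made up for illustration; in practice the HTML would come from an HTTP response.

```python
from bs4 import BeautifulSoup

# Made-up page content standing in for a downloaded document.
HTML = """
<html><body>
  <h1>Books</h1>
  <a href="/page/1">First</a>
  <a href="/page/2">Second</a>
  <a>No target</a>
</body></html>
"""


def extract_links(html):
    """Return the href of every <a> tag that has one."""
    soup = BeautifulSoup(html, "html.parser")  # parser bundled with Python
    # href=True skips anchors without an href attribute.
    return [a["href"] for a in soup.find_all("a", href=True)]


links = extract_links(HTML)
print(links)
```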

Selenium

Selenium is another library that can be useful when scraping the web. Unlike the other libraries, Selenium wasn’t originally designed for web scraping. First and foremost, Selenium is a web driver designed to render web pages like your web browser would for the purpose of automated testing of web applications.

This functionality is useful for web scraping because many modern web pages make extensive use of JavaScript to populate the page dynamically. Most ordinary web scraping spiders don’t execute this JavaScript, which prevents them from extracting all the data available on the page.

In contrast, when a spider built using Selenium visits a page, it first executes the JavaScript on the page and only then hands the rendered page to the parser. The advantage of this approach is that it lets you scrape data that isn’t available without JavaScript or a full browser. However, the scraping process is much slower than a simple HTTP request to the web server because the spider executes all the scripts present on the page.
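A hedged sketch of that render-then-parse flow, assuming Selenium 4 and a local Chrome installation. The helper names `make_headless_chrome` and `rendered_html` and the URL in the usage comment are hypothetical.

```python
def make_headless_chrome():
    """Start a headless Chrome session (assumes Chrome and Selenium 4)."""
    # Imported here so the helper below can be exercised without a browser.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options

    options = Options()
    options.add_argument("--headless=new")
    return webdriver.Chrome(options=options)


def rendered_html(driver, url):
    """Load `url` in the browser and return the DOM after scripts have run."""
    driver.get(url)            # waits for the page load to complete
    return driver.page_source  # serialized DOM, including JS-inserted nodes


# Usage (not executed here, because it launches a real browser):
#   driver = make_headless_chrome()
#   try:
#       html = rendered_html(driver, "https://example.com/")
#   finally:
#       driver.quit()
```

The returned HTML can then be passed to a parser such as BeautifulSoup. Content that a page loads through later asynchronous calls may need an explicit wait before `page_source` is read.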

If speed isn’t a big concern and the scale of the scraping isn’t huge, then using Selenium will work, but it’s not ideal. If speed does matter or you plan to scrape at web scale, then executing the JavaScript on every page you visit is completely impractical, and you’ll need a much more sophisticated approach.

We have now discussed each of the main Python libraries used when scraping the web. As you can see, each is designed to accomplish one aspect of the web scraping process, so you have to combine multiple libraries to build a fully functioning web scraping spider.

However, there is an easier approach: use a purpose-built web scraping framework such as Scrapy, which includes all the core components needed to build a web scraper out of the box and has a huge range of plugins designed to deal with edge cases.

Scrapy

Scrapy is an open source Python framework built specifically for web scraping by Scrapinghub co-founders Pablo Hoffman and Shane Evans. You might be asking yourself, “What does that mean?”

It means that Scrapy is a fully fledged web scraping solution that takes a lot of the work out of building and configuring your spiders, and best of all, it seamlessly deals with edge cases that you probably haven’t thought of yet.

Within minutes of installing the framework, you can have a fully functioning spider scraping the web. Out of the box, Scrapy spiders are designed to download HTML, parse and process the data, and save it in CSV, JSON or XML format.

There is also a wide range of built-in extensions and middlewares for handling cookies and sessions as well as HTTP features like compression, authentication, caching, user-agents, robots.txt and crawl depth restriction. Scrapy is also very easy to extend: you can add custom middlewares or pipelines to your web scraping projects to get the specific functionality you require.

One of the biggest advantages of the Scrapy framework is that it is built on Twisted, an asynchronous networking library. This means that Scrapy spiders don’t have to wait for each request to finish before sending the next one. Instead, they can have multiple HTTP requests in flight at once and parse the data as it is returned by the server, which significantly increases the speed and efficiency of a web scraping spider.
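The effect of not waiting on requests one at a time can be illustrated with a small simulation. Note that this sketch uses `asyncio` from the standard library rather than Twisted, and that `fake_fetch` only sleeps to imitate network latency, so it shows the general idea rather than how Scrapy is implemented.

```python
import asyncio
import time


async def fake_fetch(url, delay):
    """Pretend to download `url`; the sleep stands in for network latency."""
    await asyncio.sleep(delay)
    return f"<html>{url}</html>"


async def crawl(urls, delay=0.1):
    # gather() starts every fetch before awaiting any, so the waits overlap.
    return await asyncio.gather(*(fake_fetch(u, delay) for u in urls))


def timed_crawl(urls, delay=0.1):
    start = time.perf_counter()
    pages = asyncio.run(crawl(urls, delay))
    return pages, time.perf_counter() - start


URLS = [f"https://example.com/page/{n}" for n in range(5)]
pages, elapsed = timed_crawl(URLS)
print(len(pages), f"{elapsed:.2f}s")
```

Five simulated requests of 0.1 seconds each finish in roughly the time of one, because the waits overlap instead of adding up.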

One small drawback about Scrapy is that it doesn’t handle JavaScript straight out of the box like Selenium. However, the team at Scrapinghub has created Splash, an easy-to-integrate, lightweight, scriptable headless browser specifically designed for web scraping.

The learning curve for Scrapy is a bit steeper than, for example, learning how to use BeautifulSoup. However, the Scrapy project has excellent documentation and an extremely active ecosystem of developers on GitHub and StackOverflow who are always releasing new plugins and helping you troubleshoot any issues you are having.

If you’d like to build your first Scrapy spider, then be sure to check out the Learn Scrapy tutorials.

Web scraping libraries and frameworks compared

To help you understand the differences between the different web scraping libraries and frameworks, we’ve created a simple comparison table.

| | Scrapy | Requests | Beautiful Soup | Selenium |
|---|---|---|---|---|
| What is it? | Web scraping framework | Library | Library | Library |
| Purpose | Complete web scraping solution | Simplifies making HTTP requests | Data parser | Scriptable web browser to render JavaScript |
| Ideal use case | Development of recurring or large-scale web scraping projects | Simple non-recurring web scraping tasks | Simple non-recurring web scraping tasks | Small-scale web scraping of JavaScript-heavy websites |
| Built-in data storage support | JSON, JSON lines, XML, CSV | Need to develop your own | Need to develop your own | Customizable |
| Available selectors | CSS & XPath | N/A | CSS | CSS & XPath |
| Asynchronous | Yes | No | No | No |
| JavaScript support | Yes, via Splash library | N/A | No | Yes |
| Documentation | Excellent | Excellent | Excellent | Good |
| Learning curve | Easy | Very easy | Very easy | Easy |
| Ecosystem | Large ecosystem of developers contributing projects and support on GitHub and StackOverflow | Few related projects or plugins | Few related projects or plugins | Few related projects or plugins |
| GitHub stars | 32,690 | 34,727 | - | 14,262 |

Which approach is best for your web scraping project?

OK, we’ve talked about some of the most popular python libraries and frameworks for web scraping, but which one is best for your particular project?

There is no one-size-fits-all answer, as it really depends on the scale and scope of your web scraping project.

However, as a general recommendation, we’d give this advice:

Small once-off web scraping tasks (Up to 1,000 pages)

If you only need to scrape a small amount of data for a once-off project, then a combination of BeautifulSoup and Requests (plus Selenium if you need to render JavaScript) can be the quickest way to get the data you need if you don’t already have experience with Scrapy. However, if there is a possibility that this scraper will need to grow or that you’ll need to write more spiders in the future, you are better off going with Scrapy.

Recurring or large web scraping projects

However, if your web scraping needs are anything more extensive than a once-off easy data extraction task, then you should seriously consider using the Scrapy framework. Scrapy has been designed as the complete solution for web scraping (and is still being further improved), so it's the best option if you want to build a powerful and flexible web crawler.


© 2010 - 2019 Scrapinghub
