Scrapy Video Tutorials

Free Scrapy Tutorials To Learn Web Scraping


A Little About The Scrapy Tutorial Course

With the ever-growing amount of data spread across the web, the need to gather and structure that data is also increasing day by day. This is exactly where web scraping comes into play.

In this quick Scrapy Tutorial Video Course, you'll learn everything you need to get started with web scraping using Python and Scrapy. Among other things, you'll learn how to:
Extract data from the web using CSS selectors
Follow pagination buttons with a spider
Handle websites that use infinite scrolling
Authenticate your spider on a website
Deploy and run your spiders in the cloud
Become confident in web scraping

1. Getting Started with Web Scraping

This video covers the basics of web scraping using your web browser, the Scrapy shell, and CSS selectors.

Topics:
How to identify the data via the browser's "inspect element" tool
How to build CSS selectors using Scrapy Shell

Further Reading:
Scrapy Tutorial
The 30 CSS selectors you must memorize
An interactive tutorial on CSS selectors


$ scrapy shell http://quotes.toscrape.com/random
>>> response.css('small.author::text').extract_first()  # the author's name
>>> response.css('span.text::text').extract_first()     # the quote text
>>> response.css('a.tag::text').extract()               # all the tags, as a list

2. Creating your First Scrapy Spider

This video shows how to create a Scrapy spider using the selectors built in the previous video.

Topics:
The anatomy of a Scrapy spider
How to run a spider

Further Reading:
Scrapy Tutorial
Scrapy CLI tool commands


# -*- coding: utf-8 -*-
import scrapy


class SingleQuoteSpider(scrapy.Spider):
    name = "single-quote"
    allowed_domains = ["toscrape.com"]
    start_urls = ['http://quotes.toscrape.com/random']

    def parse(self, response):
        self.log('I just visited: ' + response.url)
        # each quote on the page lives inside a div.quote element
        for quote in response.css('div.quote'):
            item = {
                'author_name': quote.css('small.author::text').extract_first(),
                'text': quote.css('span.text::text').extract_first(),
                'tags': quote.css('a.tag::text').extract(),
            }
            yield item
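
To run the spider from the "How to run a spider" topic above: assuming the code is saved as single_quote_spider.py (the filename is arbitrary), you can execute it with scrapy runspider and export the scraped items to a file:

$ scrapy runspider single_quote_spider.py -o items.json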

3. Scraping Multiple Items per Page

This video shows how to extract many items from a single page. This is a very common pattern on e-commerce sites, forums, and similar listing pages.

Topics:
How to iterate over page elements
How to extract data from repeating elements

Further Reading:
Scrapy Tutorial


# -*- coding: utf-8 -*-
import scrapy


class MultipleQuotesSpider(scrapy.Spider):
    name = "multiple-quotes"
    allowed_domains = ["toscrape.com"]
    start_urls = ['http://quotes.toscrape.com']

    def parse(self, response):
        self.log('I just visited: ' + response.url)
        for quote in response.css('div.quote'):
            item = {
                'author_name': quote.css('small.author::text').extract_first(),
                'text': quote.css('span.text::text').extract_first(),
                'tags': quote.css('a.tag::text').extract(),
            }
            yield item

4. Following Pagination Links

This video shows how to build a spider with the ability to jump from one page to another.

Topics:
How to find links in a page
How to create requests to other pages

Further Reading:
Scrapy Tutorial
CrawlSpider - a generic spider to crawl based on rules
SitemapSpider - a generic spider to crawl from sitemaps


# -*- coding: utf-8 -*-
import scrapy


class MultipleQuotesPaginationSpider(scrapy.Spider):
    name = "multiple-quotes-pagination"
    allowed_domains = ["toscrape.com"]
    start_urls = ['http://quotes.toscrape.com']

    def parse(self, response):
        self.log('I just visited: ' + response.url)
        for quote in response.css('div.quote'):
            item = {
                'author_name': quote.css('small.author::text').extract_first(),
                'text': quote.css('span.text::text').extract_first(),
                'tags': quote.css('a.tag::text').extract(),
            }
            yield item
        # follow pagination link
        next_page_url = response.css('li.next > a::attr(href)').extract_first()
        if next_page_url:
            next_page_url = response.urljoin(next_page_url)
            yield scrapy.Request(url=next_page_url, callback=self.parse)
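
As a side note: if you are on Scrapy 1.4 or later, response.follow accepts relative URLs directly, so the urljoin step can be dropped. A minimal sketch of the same pagination logic:

        # follow pagination link; response.follow resolves relative URLs for us
        next_page_url = response.css('li.next > a::attr(href)').extract_first()
        if next_page_url:
            yield response.follow(next_page_url, callback=self.parse)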

5. Scraping Details Pages from Lists

This video shows how to scrape websites structured like e-commerce sites: listing pages link to detail pages, and the spider has to visit each detail page to get the data it needs. Note that Scrapy's built-in duplicate request filter keeps author pages that are linked from several quotes from being scraped twice.

Topics:
Dealing with multiple pages with different formats
Multiple callbacks per spider

Further Reading:
Scrapy Tutorial


# -*- coding: utf-8 -*-
import scrapy


class AuthorsSpider(scrapy.Spider):
    name = "authors"
    start_urls = ['http://quotes.toscrape.com']

    def parse(self, response):
        urls = response.css('div.quote > span > a::attr(href)').extract()
        for url in urls:
            url = response.urljoin(url)
            yield scrapy.Request(url=url, callback=self.parse_details)

        # follow pagination link
        next_page_url = response.css('li.next > a::attr(href)').extract_first()
        if next_page_url:
            next_page_url = response.urljoin(next_page_url)
            yield scrapy.Request(url=next_page_url, callback=self.parse)

    def parse_details(self, response):
        yield {
            'name': response.css('h3.author-title::text').extract_first(),
            'birth_date': response.css('span.author-born-date::text').extract_first(),
        }
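
If you are on Scrapy 2.0 or later, the two request-generating blocks in parse can be written more compactly with response.follow_all; a sketch:

    def parse(self, response):
        # follow every author detail link, then the pagination link
        yield from response.follow_all(css='div.quote > span > a', callback=self.parse_details)
        yield from response.follow_all(css='li.next > a', callback=self.parse)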

6. Scraping Infinite Scrolling Pages

This video shows how to find and use the underlying APIs that power AJAX-based infinite scrolling in web pages.

Topics:
Inspecting network requests in your browser
Reverse engineering network requests
Extracting data from a JSON-based HTTP API

Further Reading:
Scraping Infinite Scrolling Pages (blog post)
Chrome DevTools - Networking and the Console
Web Scraping - Discovering Hidden APIs


# -*- coding: utf-8 -*-
import json
import scrapy


class QuotesInfiniteScrollSpider(scrapy.Spider):
    name = "quotes-infinite-scroll"
    api_url = 'http://quotes.toscrape.com/api/quotes?page={}'
    start_urls = [api_url.format(1)]

    def parse(self, response):
        data = json.loads(response.text)
        for quote in data['quotes']:
            yield {
                'author_name': quote['author']['name'],
                'text': quote['text'],
                'tags': quote['tags'],
            }
        if data['has_next']:
            next_page = data['page'] + 1
            yield scrapy.Request(url=self.api_url.format(next_page), callback=self.parse)
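
On Scrapy 2.2 or later you can also drop the json import and use the response's built-in shortcut:

        data = response.json()  # equivalent to json.loads(response.text)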

7. Submitting Forms in your Spiders

This video shows how to scrape pages that require submitting POST requests, such as login forms.

Topics:
Submitting POST requests with Scrapy
Handling validation tokens
Authenticating on a website

Further Reading:
FormRequest documentation
Understanding CSRF
Scraping websites based on ViewStates


# -*- coding: utf-8 -*-
import scrapy


class QuotesLoginSpider(scrapy.Spider):
    name = 'quotes-login'
    login_url = 'http://quotes.toscrape.com/login'
    start_urls = [login_url]

    def parse(self, response):
        # extract the csrf token value
        token = response.css('input[name="csrf_token"]::attr(value)').extract_first()
        # create a python dictionary with the form values
        data = {
            'csrf_token': token,
            'username': 'abc',
            'password': 'abc',
        }
        # submit a POST request to it
        yield scrapy.FormRequest(url=self.login_url, formdata=data, callback=self.parse_quotes)

    def parse_quotes(self, response):
        """Parse the main page after the spider is logged in"""
        for q in response.css('div.quote'):
            yield {
                'author_name': q.css('small.author::text').extract_first(),
                'author_url': q.css(
                    'small.author ~ a[href*="goodreads.com"]::attr(href)'
                ).extract_first()
            }
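
Scrapy also ships with FormRequest.from_response, which pre-fills the form fields found in the response (including hidden inputs such as csrf_token), so the token extraction above can be delegated to Scrapy. A sketch of the same parse method:

    def parse(self, response):
        # from_response pre-populates hidden inputs (csrf_token included),
        # so only the visible fields need to be supplied
        yield scrapy.FormRequest.from_response(
            response,
            formdata={'username': 'abc', 'password': 'abc'},
            callback=self.parse_quotes,
        )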

8. Scraping JS pages with Splash

This video shows how to scrape JavaScript-based websites using Scrapy and Splash.

Topics:
How to identify JavaScript-based pages
How to run Splash
How to integrate your Scrapy spiders with Splash

Further Reading:
Splash docs
The scrapy-splash plugin
Handling JavaScript in Scrapy with Splash
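
A common way to run Splash locally is via its Docker image, alongside installing the scrapy-splash plugin (this assumes Docker is already installed):

$ pip install scrapy-splash
$ docker run -p 8050:8050 scrapinghub/splash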


import scrapy
from scrapy_splash import SplashRequest


class QuotesJSSpider(scrapy.Spider):
    name = 'quotesjs'
    # all these settings can be put in your project's settings.py file
    custom_settings = {
        'SPLASH_URL': 'http://localhost:8050',
        'DOWNLOADER_MIDDLEWARES': {
            'scrapy_splash.SplashCookiesMiddleware': 723,
            'scrapy_splash.SplashMiddleware': 725,
            'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
        },
        'SPIDER_MIDDLEWARES': {
            'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
        },
        'DUPEFILTER_CLASS': 'scrapy_splash.SplashAwareDupeFilter',
    }

    def start_requests(self):
        yield SplashRequest(
            url='http://quotes.toscrape.com/js',
            callback=self.parse,
        )

    def parse(self, response):
        for quote in response.css("div.quote"):
            yield {
                'text': quote.css("span.text::text").extract_first(),
                'author': quote.css("small.author::text").extract_first(),
                'tags': quote.css("div.tags > a.tag::text").extract(),
            }
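
If the page needs extra time to render, SplashRequest accepts rendering arguments such as wait; for example:

    def start_requests(self):
        yield SplashRequest(
            url='http://quotes.toscrape.com/js',
            callback=self.parse,
            args={'wait': 2},  # give the page 2 seconds to render
        )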

9. Run your Spiders in the Cloud

This video shows how you can deploy, run, and manage your crawlers in the cloud using Scrapy Cloud.

Topics:
Using shub, the Scrapinghub command-line client
Fetching the scraped data from the Cloud
Scrapy Cloud features

Further Reading:
Shub documentation
Scrapinghub Python client library
Dependencies in Scrapy Cloud Projects


$ pip install shub        # install the Scrapinghub command-line client
$ shub login              # authenticate with your Scrapinghub API key
$ shub deploy             # deploy the project in the current directory
$ shub schedule quotes    # schedule a run of the "quotes" spider
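
To fetch the scraped data programmatically, you can use the Scrapinghub Python client library (pip install scrapinghub). A minimal sketch, with the API key and job ID as placeholders:

from scrapinghub import ScrapinghubClient

client = ScrapinghubClient('YOUR_API_KEY')
job = client.get_job('123456/1/8')  # <project id>/<spider number>/<job number>
for item in job.items.iter():
    print(item)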
