Scrapy Cloud is a battle-tested platform for running web crawlers (a.k.a. spiders). Your spiders run in the cloud and scale on demand, from thousands to billions of pages. Think of it as a Heroku for web crawling.
Write your spiders using Scrapy, the most powerful open source web crawling framework
Or with a point-and-click tool (Portia), which is also open source and extensible
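As a rough sketch of what that looks like in code, here is a minimal Scrapy spider against Scrapinghub's demo site quotes.toscrape.com (the selectors and field names are illustrative placeholders):

```python
import scrapy


class QuotesSpider(scrapy.Spider):
    """Minimal spider: scrape a listing page and follow pagination."""
    name = "quotes"
    start_urls = ["http://quotes.toscrape.com/"]  # demo site for illustration

    def parse(self, response):
        # Yield one item per quote block on the page.
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
            }
        # Follow the "next page" link, if there is one.
        next_page = response.css("li.next a::attr(href)").get()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)
```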
Deploy with a single command or the push of a button. No servers involved
Unearth actionable insights. We’re able to filter, normalize, augment, analyze, and aggregate your data.
Watch your spiders as they run and scrape data, then compare and annotate the scraped data
Run heavy jobs with more memory and lighter jobs with more concurrency
Download your data in JSON, CSV, or XML formats
You can choose to share the data by publishing a dataset
Use the API to integrate Scrapinghub data into your own apps
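For instance, with the open-source python-scrapinghub client you can schedule a run and stream the scraped items straight into your own application. The API key, project ID, and spider name below are placeholders:

```python
from scrapinghub import ScrapinghubClient

# Placeholders: substitute your own API key and project ID.
client = ScrapinghubClient("YOUR_API_KEY")
project = client.get_project(123456)

# Schedule a run of a spider already deployed to the project.
job = project.jobs.run("quotes")
print("Scheduled job:", job.key)

# Once the job has finished, iterate over its scraped items.
for item in job.items.iter():
    print(item["author"], "-", item["text"])
```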
One of Scrapy Cloud's key features is its elastic capacity. You can purchase capacity units (essentially 1 GB of RAM each) when you need to scale up, and distribute them intelligently: heavy spiders get more units to run, lighter spiders fewer. This way, capacity is matched to the size of your spiders.
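A hedged sketch of that distribution using the same client (the spider names are hypothetical, and we assume the run call accepts a `units` argument reserving that many capacity units for the job):

```python
from scrapinghub import ScrapinghubClient

client = ScrapinghubClient("YOUR_API_KEY")
project = client.get_project(123456)

# A heavy crawl gets more memory by reserving extra capacity units...
project.jobs.run("full_site_crawl", units=3)

# ...while lightweight spiders take one unit each, leaving room
# to run several of them concurrently.
for spider in ("prices_daily", "stock_check", "reviews"):
    project.jobs.run(spider, units=1)
```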
Scrapy is really pleasant to work with. It hides most of the complexity of web crawling, letting you focus on the primary work of data extraction. Scrapinghub provides a simple way to run your crawls and browse results, which is especially useful for larger projects with multiple developers.
I love that Scrapy Cloud does not force vendor lock-in, unlike other scraping and crawling services. Your investment in developing the right scraping logic is not stuck in some proprietary format or jailed behind a user-friendly interface. With Scrapy Cloud, the scraping logic is standard Python code calling the open-source Scrapy library. You retain the freedom to run that Python code on your own computers or someone else's servers.