The Ultimate Guide to Proxies for Web Scraping

If you are at all serious about web scraping, you’ll have quickly realised that proxy management is a critical component of any web scraping project.

When scraping the web at any reasonable scale, using proxies is an absolute must. However, it is common for managing and troubleshooting proxy issues to consume more time than building and maintaining the spiders themselves.

In this guide, we will break down the differences between the main proxy options and give you the information you need to consider when picking a proxy solution for your project or business.

What are proxies and why do you need them when web scraping?

Before we discuss what a proxy is, we first need to understand what an IP address is and how it works.

An IP address is a numerical address assigned to every device that connects to an Internet Protocol network like the internet, giving each device a unique identity. Most IP addresses look like this:

207.148.1.212

A proxy is a 3rd party server that enables you to route your request through it and use its IP address in the process. When using a proxy, the website you are making the request to no longer sees your IP address but the IP address of the proxy, giving you the ability to scrape the web anonymously if you choose.

Currently, the world is transitioning from IPv4 to a newer standard called IPv6. This newer version will allow for the creation of more IP addresses. However, in the proxy business IPv6 is still not widely used, so most proxy IPs still use the IPv4 standard.
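
As a small illustration, Python's standard ipaddress module can tell the two address families apart. The IPv4 address is the example shown earlier; the IPv6 address is an assumption taken from the range reserved for documentation, and the helper name ip_version is ours.

```python
import ipaddress


def ip_version(address: str) -> int:
    """Return 4 or 6 depending on the address family of the given IP string."""
    return ipaddress.ip_address(address).version


# The second address is a documentation-only IPv6 example, not a real proxy.
for addr in ("207.148.1.212", "2001:db8::1"):
    print(addr, "is IPv%d" % ip_version(addr))
```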

When scraping a website, we recommend that you use a 3rd party proxy and set your company name as the user agent so the website owner can contact you if your scraping is overburdening their servers or if they would like you to stop scraping the data displayed on their website.
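
A minimal sketch of this setup, using only Python's standard urllib.request module, might look like the following. The proxy address, the user agent string and the helper name make_proxy_opener are all placeholders we made up for illustration; substitute your own proxy and contact details.

```python
import urllib.request

# Hypothetical values: replace with your own proxy address and contact details.
PROXY_URL = "http://203.0.113.10:8080"
USER_AGENT = "ExampleCompanyBot/1.0 (+mailto:scraping@example.com)"


def make_proxy_opener(proxy_url: str, user_agent: str) -> urllib.request.OpenerDirector:
    """Create an opener that routes HTTP and HTTPS requests through one proxy."""
    proxy_handler = urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
    opener = urllib.request.build_opener(proxy_handler)
    # Replace the default "Python-urllib/x.y" header with an identifying user agent.
    opener.addheaders = [("User-Agent", user_agent)]
    return opener


opener = make_proxy_opener(PROXY_URL, USER_AGENT)
print(opener.addheaders)
```

With a working proxy in place, opener.open("http://example.com/", timeout=10) would send the request through it, so the target website sees the proxy's IP address and your identifying user agent.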

There are a number of reasons why proxies are important for web scraping:

  1. Using a proxy (especially a pool of proxies - more on this later) allows you to crawl a website much more reliably, significantly reducing the chances that your spider will get banned or blocked.
  2. Using a proxy enables you to make your request from a specific geographical region or device (mobile IPs for example), which enables you to see the specific content that the website displays for that given location or device. This is extremely valuable when scraping product data from online retailers.
  3. Using a proxy pool allows you to make a higher volume of requests to a target website without being banned.
  4. Using a proxy allows you to get around blanket IP bans some websites impose. Example: it is common for websites to block requests from AWS because there is a track record of some malicious actors overloading websites with large volumes of requests using AWS servers. 
  5. Using proxies enables you to make unlimited concurrent sessions to the same or different websites.

Why use a proxy pool?

Ok, we now know what proxies are, but how do you use them as part of your web scraping?

Just as using only your own IP address to scrape a website limits you, using only one proxy will reduce your crawling reliability, geotargeting options, and the number of concurrent requests you can make.

As a result, you need to build a pool of proxies that you can route your requests through, splitting the traffic over a large number of proxies.
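
The simplest way to split traffic over a pool is round-robin rotation. The sketch below is one possible approach, not a complete solution: the class name RoundRobinPool, the proxy addresses and the URLs are all hypothetical.

```python
import itertools


class RoundRobinPool:
    """Hand out proxies in a fixed repeating order so traffic is spread evenly."""

    def __init__(self, proxies):
        proxies = list(proxies)
        if not proxies:
            raise ValueError("the pool needs at least one proxy")
        self._cycle = itertools.cycle(proxies)

    def next_proxy(self) -> str:
        """Return the next proxy in the rotation."""
        return next(self._cycle)


# Placeholder proxy addresses from a documentation-only IP range.
pool = RoundRobinPool([
    "http://203.0.113.10:8080",
    "http://203.0.113.11:8080",
    "http://203.0.113.12:8080",
])

for url in ["http://example.com/page/%d" % n for n in range(1, 6)]:
    print(url, "via", pool.next_proxy())
```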

The size of your proxy pool will depend on a number of factors:

  1. The number of requests you will be making per hour.
  2. The target websites - larger websites with more sophisticated anti-bot countermeasures will require a larger proxy pool.
  3. The type of IPs you are using as proxies - datacenter, residential or mobile IPs.
  4. The quality of the IPs you are using as proxies - are they public proxies, shared or private dedicated proxies? Are they datacenter, residential or mobile IPs? (data center IPs are typically lower quality than residential IPs and mobile IPs, but are often more stable than residential/mobile IPs due to the nature of the network).
  5. The sophistication of your proxy management system - proxy rotation, throttling, session management, etc.

All five of these factors have a big impact on the effectiveness of your proxy pool. If you don’t properly configure your pool of proxies for your specific web scraping project you can often find that your proxies are being blocked and you’re no longer able to access the target website.
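
These factors do not reduce to a single formula, but the first one does give a rough lower bound: divide your hourly request volume by the number of requests you consider safe to send from one IP per hour. The sketch below assumes you have already chosen such a per-IP limit for your target website. The function name and the numbers are purely illustrative, and the other four factors will usually push the real figure higher.

```python
import math


def estimate_pool_size(requests_per_hour: int, safe_requests_per_ip_per_hour: int) -> int:
    """Rough lower bound on pool size so that no single IP exceeds its safe hourly rate."""
    if requests_per_hour < 0 or safe_requests_per_ip_per_hour <= 0:
        raise ValueError("request volume must be >= 0 and the per-IP limit must be > 0")
    return math.ceil(requests_per_hour / safe_requests_per_ip_per_hour)


# Illustrative numbers only: 10,000 requests per hour, at most 500 per IP per hour.
print(estimate_pool_size(10_000, 500))  # 20
```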

In the next section we will look at the different types of IPs you can use as proxies.

What are your proxy options?

If you’ve done any level of research into your proxy options you will have probably realised that this can be a confusing topic. Every proxy provider is shouting from the rafters that they have the best proxy IPs on the web, with very little explanation as to why, making it very hard to assess which is the best proxy solution for your particular project.

So in this section of the guide we will break down the key differences between the available proxy solutions and help you decide which solution is best for your needs. First, let’s talk about the fundamentals of proxies - the underlying IPs.

As mentioned already, a proxy is just a 3rd party IP address that you can route your request through. However, there are 3 main types of IPs to choose from, each with its own pros and cons.

Datacenter IPs

Datacenter IPs are the most common type of proxy IP. They are the IPs of servers housed in data centers. These IPs are the most commonplace and the cheapest to buy. With the right proxy management solution you can build a very robust web crawling solution for your business.

Residential IPs

Residential IPs are the IPs of private residences, enabling you to route your request through a residential network. As residential IPs are harder to obtain, they are also much more expensive. In a lot of situations they are overkill as you could easily achieve the same results with cheaper data center IPs. They also raise legal/consent issues due to the fact you are using a person’s personal network to scrape the web.

Mobile IPs

Mobile IPs are the IPs of private mobile devices. As you can imagine, acquiring the IPs of mobile devices is quite difficult so they are very expensive. For most web scraping projects mobile IPs are overkill unless you want to only scrape the results shown to mobile users. But more significantly they raise even trickier legal/consent issues as oftentimes the device owner isn't fully aware that you are using their GSM network for web scraping.

Our recommendation is to go with data center IPs and put in place a robust proxy management solution. In the vast majority of cases, this approach will generate the best results for the lowest cost. With proper proxy management, data center IPs give similar results to residential or mobile IPs without the legal concerns and at a fraction of the cost.

Public, shared or dedicated proxies

The other consideration we need to discuss is whether you should use public, shared or dedicated proxies.

As a general rule, you should always stay well clear of public proxies, or "open proxies". Not only are these proxies of very low quality, they can also be very dangerous. These proxies are open for anyone to use, so they quickly get used to slam websites with huge amounts of dubious requests, inevitably resulting in them getting blacklisted and blocked by websites very quickly. What makes them even worse is that these proxies are often infected with malware and other viruses. As a result, when using a public proxy you run the risk of spreading any malware that is present, infecting your own machines and even making public your web scraping activities if you haven't properly configured your security (SSL certs, etc.).

The decision between shared or dedicated proxies is a bit more intricate. Depending on the size of your project, your need for performance and your budget, using a service where you pay for access to a shared pool of IPs might be the right option for you. However, if you have a larger budget and performance is a high priority for you, then paying for a dedicated pool of proxies might be the better option.

Ok, by now you should have a good idea of what proxies are and what the pros and cons are of the different types of IPs you can use in your proxy pool. However, picking the right type of proxy is only part of the battle. The really tricky part is managing your pool of proxies so they don’t get banned.

How to manage your proxy pool

If you are planning on scraping at any reasonable scale, just purchasing a pool of proxies and routing your requests through them likely won’t be sustainable long term. Your proxies will inevitably get banned and stop returning high quality data.

Here are some of the main challenges that you will face when managing your proxy pool:

  • Identify Bans - Your proxy solution needs to be able to detect numerous types of bans so that you can troubleshoot and fix the underlying problem - i.e. captchas, redirects, blocks, ghosting, etc.
  • Retry Errors - If your proxies experience any errors, bans, timeouts, etc. your solution needs to be able to retry the request with different proxies.
  • User Agents - Managing user agents is crucial to having a healthy crawl.
  • Control Proxies - Some scraping projects require you to keep a session with the same proxy, so you’ll need to configure your proxy pool to allow for this.
  • Add Delays - Randomize delays and apply good throttling to help cloak the fact that you are scraping.
  • Geographical Targeting - Sometimes you’ll need to be able to configure your pool so that only some proxies will be used on certain websites.
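
To make the first two challenges more concrete, here is a minimal sketch of ban identification combined with retrying through different proxies. The status codes and page markers are assumptions chosen for illustration (real websites signal bans in many other ways, including redirects and ghosting, which this sketch does not cover). The fetch argument is a hypothetical callable you would supply yourself, and a fake one is used here so the example runs without a network.

```python
import random

BAN_STATUS_CODES = {403, 429, 503}          # assumption: codes often used for blocks and rate limits
BAN_MARKERS = ("captcha", "access denied")  # assumption: illustrative markers found in ban pages


def looks_banned(status: int, body: str) -> bool:
    """Heuristically decide whether a response is a ban rather than real content."""
    if status in BAN_STATUS_CODES:
        return True
    text = body.lower()
    return any(marker in text for marker in BAN_MARKERS)


def fetch_with_retries(url, proxies, fetch, max_attempts=3):
    """Try the request through up to max_attempts distinct proxies.

    fetch is any callable (url, proxy) -> (status, body) that raises OSError
    on timeouts and connection errors. Returns (proxy, body) for the first
    clean response, or raises RuntimeError if every attempt fails.
    """
    proxies = list(proxies)
    candidates = random.sample(proxies, k=min(max_attempts, len(proxies)))
    for proxy in candidates:
        try:
            status, body = fetch(url, proxy)
        except OSError:
            continue  # network error or timeout: move on to a different proxy
        if not looks_banned(status, body):
            return proxy, body
    raise RuntimeError("all attempted proxies failed or were banned for " + url)


def fake_fetch(url, proxy):
    """Stand-in for a real HTTP call, so the sketch runs offline."""
    if proxy == "proxy-banned":
        return 403, "Forbidden"
    if proxy == "proxy-slow":
        raise TimeoutError("timed out")
    return 200, "<html>product data</html>"


print(fetch_with_retries("http://example.com/", ["proxy-banned", "proxy-slow", "proxy-good"], fake_fetch))
```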
 
Managing a pool of 5-10 proxies is ok, but when you have 100s or 1,000s it can get messy fast. To overcome these challenges you have three core solutions: Do It Yourself, Proxy Rotators and Done For You Solutions.

Do it yourself

In this situation you purchase a pool of shared or dedicated proxies, then build and tweak a proxy management solution yourself to overcome all the challenges you run into. This can be the cheapest option, but can be the most wasteful in terms of time and resources. Often it is best to only take this option if you have a dedicated web scraping team who have the bandwidth to manage your proxy pool, or if you have zero budget and can’t afford anything better.
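
To give a feel for what building it yourself involves, here is a minimal sketch of a pool that supports two of the challenges listed earlier: geographical targeting and keeping a session on the same proxy. The class names, field names and proxy addresses are all hypothetical, and a real solution would also need ban detection, retries and throttling.

```python
import random
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Proxy:
    url: str
    country: str  # country code of the proxy's exit IP, e.g. "US"


@dataclass
class ProxyPool:
    proxies: list
    _sessions: dict = field(default_factory=dict, init=False)  # session id -> Proxy

    def pick(self, country=None, session_id=None) -> Proxy:
        """Pick a proxy, optionally limited to one country and pinned to a session."""
        # An existing session always keeps its original proxy.
        if session_id is not None and session_id in self._sessions:
            return self._sessions[session_id]
        candidates = [p for p in self.proxies if country is None or p.country == country]
        if not candidates:
            raise LookupError("no proxy available for country %r" % country)
        chosen = random.choice(candidates)
        if session_id is not None:
            self._sessions[session_id] = chosen
        return chosen

    def end_session(self, session_id) -> None:
        """Forget a session so its id can be assigned a fresh proxy later."""
        self._sessions.pop(session_id, None)


# Placeholder proxies from a documentation-only IP range.
pool = ProxyPool([
    Proxy("http://203.0.113.10:8080", "US"),
    Proxy("http://203.0.113.11:8080", "DE"),
    Proxy("http://203.0.113.12:8080", "DE"),
])

print(pool.pick(country="US").url)
print(pool.pick(country="DE", session_id="checkout-1").url)
print(pool.pick(session_id="checkout-1").url)  # same proxy as the previous line
```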

Proxy rotators

The middle of the park solution is to purchase your proxies from a provider that also provides proxy rotation and geographical targeting. In this situation, the solution will take care of the more basic proxy management issues, leaving you to develop and manage session management, throttling, ban identification logic, etc.

Done for you

The final solution is to completely outsource your proxy management. Solutions such as Crawlera are designed as smart downloaders, where your spiders just have to make a request to its API and it will return the data you require, managing all the proxy rotation, throttling, blacklists, session management, etc. under the hood so you don’t have to.

Each one of these approaches has its own pros and cons, so the best solution will depend on your specific priorities and constraints.

How to Pick The Best Proxy Solution For Your Project?

Deciding on an approach to building and managing your proxy pool can be a headache. In this section we will outline some of the questions you need to be asking yourself when picking the best proxy solution for your needs:

  1. What’s your budget? If you have a very limited or virtually non-existent budget then managing your own proxy pool is going to be the cheapest option. However, if you have even a small budget of $20 per month then you should seriously consider outsourcing your proxy management to a dedicated solution that manages everything.
  2. What is your #1 priority? If learning about proxies and everything web scraping is your #1 priority then buying your own pool of proxies and managing them yourself is probably your best option. However, if your #1 priority is getting the web data you need and achieving maximum performance from your web scraping, as is the case for most companies, then it is nearly always better to outsource your proxy management to a done for you solution. Or at the very least, use a proxy rotator.
  3. What is your technical skill level and your available resources? To be able to manage your own proxy pool for a reasonable size web scraping project you will need at least a basic level of software development expertise and the bandwidth to build and maintain your spider’s proxy management logic. If you don’t have this expertise or don’t have the bandwidth to devote engineering resources to it then you are often better off either using a proxy rotator and building your own proxy management infrastructure or using a done for you proxy management solution.

Your answers to these questions will quickly help you decide which approach to proxy management best suits your needs.

Build in-house or done for you solutions?

As outlined above, if you are more focused on learning everything about web scraping from the ground up or have a very tight budget then buying access to a shared pool of IPs and managing the proxy management logic yourself is probably your best option.

However, if your focus is on getting the web data you need with little to no hassle or maximising your web scraping performance then you should really look into either using a proxy rotator and building the other management infrastructure in-house, or using a done for you proxy management solution.

Proxy rotator

As we discussed, if you want to go it alone then at the very least you should use a proxy provider that offers proxy rotation as a service. This will remove the first layer of managing your proxies. However, you will still have to implement your own session management, request throttling, IP blacklisting and ban identification logic.
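
As a rough sketch of two of the pieces you would still own, the example below shows request throttling with randomized delays and IP blacklisting with a cooldown period. The class names, the delay range and the cooldown length are assumptions made for illustration, not recommended values. The clock and sleep functions are injectable so the logic can be exercised without real waiting.

```python
import random
import time


class Throttle:
    """Enforce a randomized gap between consecutive requests to the same domain."""

    def __init__(self, min_delay=1.0, max_delay=3.0, clock=time.monotonic, sleep=time.sleep):
        self.min_delay = min_delay
        self.max_delay = max_delay
        self._clock = clock
        self._sleep = sleep
        self._next_allowed = {}  # domain -> earliest time the next request may start

    def wait(self, domain: str) -> float:
        """Sleep until the domain may be requested again; return the seconds waited."""
        now = self._clock()
        ready_at = self._next_allowed.get(domain, now)
        waited = max(0.0, ready_at - now)
        if waited > 0:
            self._sleep(waited)
        # Schedule the following request a random delay after this one starts.
        delay = random.uniform(self.min_delay, self.max_delay)
        self._next_allowed[domain] = max(now, ready_at) + delay
        return waited


class Blacklist:
    """Remember which proxy is banned on which domain, for a cooldown period."""

    def __init__(self, cooldown=600.0, clock=time.monotonic):
        self.cooldown = cooldown
        self._clock = clock
        self._banned_until = {}  # (proxy, domain) -> time the ban expires

    def ban(self, proxy: str, domain: str) -> None:
        self._banned_until[(proxy, domain)] = self._clock() + self.cooldown

    def is_banned(self, proxy: str, domain: str) -> bool:
        return self._clock() < self._banned_until.get((proxy, domain), float("-inf"))


blacklist = Blacklist(cooldown=600.0)
blacklist.ban("http://203.0.113.10:8080", "example.com")
print(blacklist.is_banned("http://203.0.113.10:8080", "example.com"))  # True
print(blacklist.is_banned("http://203.0.113.11:8080", "example.com"))  # False
```

In a spider, you would call wait() with the target domain before each request and skip any proxy for which is_banned() returns True.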

Done for you

The other approach is to use intelligent algorithms to automatically manage your proxies for you. With this approach, instead of having to rely on very expensive residential and mobile IPs to get clean data, purpose-built proxy management solutions are able to manage the rotation, throttling, and selection of data center IPs so that they return consistent clean data, only using expensive IPs when there is no other option. Here your best option is a solution like Crawlera, the smart downloader developed by Scrapinghub.

Crawlera

The World’s Smartest Proxy Network

With Crawlera, instead of having to manage a pool of IPs your spiders just send a request to Crawlera's single endpoint API to retrieve the desired data. Crawlera manages a massive pool of proxies, carefully rotating, throttling, blacklisting and selecting the optimal IPs to use for any individual request to give the optimal results at the lowest cost, completely removing the hassle of managing IPs. Users are able to focus on the data, not proxies.

The huge advantage of this approach is that it is extremely scalable. Crawlera can scale from a few hundred requests per day to hundreds of thousands of requests per day without any additional workload on your part. Better yet, with Crawlera you only pay for successful requests that return your desired data, not IPs or the amount of bandwidth you use.


© 2010 - 2019 Scrapinghub
