Before we discuss what a proxy is, we first need to understand what an IP address is and how it works.
An IP address is a numerical address assigned to every device that connects to an Internet Protocol network like the internet, giving each device a unique identity. Most IP addresses in use today follow the IPv4 format: four numbers between 0 and 255, separated by dots.
A proxy is a third-party server that lets you route your requests through it and use its IP address in the process. When using a proxy, the website you are making the request to no longer sees your IP address but the IP address of the proxy, giving you the ability to scrape the web anonymously if you choose.
Currently, the world is transitioning from IPv4 to a newer standard called IPv6. This newer version will allow for the creation of more IP addresses. However, in the proxy business IPv6 is still not widely used, so most proxy IPs still use the IPv4 standard.
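To make the difference between the two standards concrete, here is a minimal sketch using Python's standard `ipaddress` module. The helper name `describe_ip` is our own, and the sample addresses come from ranges reserved for documentation, not from real proxies.

```python
import ipaddress


def describe_ip(text: str) -> str:
    """Return a short description of an IP address string."""
    # ip_address() raises ValueError if the text is not a valid address.
    addr = ipaddress.ip_address(text)
    # max_prefixlen is the address width in bits: 32 for IPv4, 128 for IPv6.
    return f"{addr} is IPv{addr.version} ({addr.max_prefixlen}-bit address)"


# Illustrative addresses from documentation-only ranges.
print(describe_ip("192.0.2.1"))
print(describe_ip("2001:db8::1"))

# The wider address is why IPv6 allows so many more IPs.
print(f"IPv4 address space: {2 ** 32:,} addresses")
```

The 32-bit width caps IPv4 at roughly 4.3 billion addresses, which is the shortage that IPv6 is meant to relieve.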
When scraping a website, we recommend that you use a 3rd party proxy and set your company name as the user agent so the website owner can contact you if your scraping is overburdening their servers or if they would like you to stop scraping the data displayed on their website.
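As a rough illustration of this recommendation, the sketch below routes requests through a proxy and identifies the scraper in the user agent, using only Python's standard `urllib`. The proxy endpoint, the user agent string, and the helper name `make_opener` are all placeholders we made up for the example; substitute the details your proxy provider gives you.

```python
import urllib.request

# Hypothetical values: replace with your provider's endpoint and your own contact details.
PROXY_URL = "http://proxy.example.com:8000"
USER_AGENT = "ExampleCompanyBot/1.0 (+https://example.com/contact)"


def make_opener(proxy_url: str, user_agent: str) -> urllib.request.OpenerDirector:
    """Build an opener that sends HTTP and HTTPS traffic through one proxy."""
    proxy_handler = urllib.request.ProxyHandler(
        {"http": proxy_url, "https": proxy_url}
    )
    opener = urllib.request.build_opener(proxy_handler)
    # Replace the default "Python-urllib/x.y" user agent with an identifiable one.
    opener.addheaders = [("User-Agent", user_agent)]
    return opener


opener = make_opener(PROXY_URL, USER_AGENT)
# Uncomment to perform a real request through the proxy:
# with opener.open("https://example.com/", timeout=10) as response:
#     html = response.read()
```

With this setup the target website sees the proxy's IP address rather than yours, while the user agent still tells the site owner who to contact.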
There are a number of reasons why proxies are important for web scraping:
Ok, we now know what proxies are, but how do you use them as part of your web scraping?
Just as scraping from only your own IP address limits you, using only one proxy to scrape a website will reduce your crawling reliability, geotargeting options, and the number of concurrent requests you can make.
As a result, you need to build a pool of proxies that you can route your requests through, splitting the traffic over a large number of proxies.
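One simple way to split traffic across a pool is round-robin rotation, where each request takes the next proxy in the list. The sketch below is a minimal version of that idea; the `ProxyPool` class, the proxy URLs, and the target URLs are illustrative assumptions, and real pools usually add more logic on top.

```python
import itertools


class ProxyPool:
    """Hand out proxies in round-robin order so traffic is spread evenly."""

    def __init__(self, proxies):
        proxies = list(proxies)
        if not proxies:
            raise ValueError("proxy pool must not be empty")
        # cycle() repeats the list forever: a, b, c, a, b, c, ...
        self._cycle = itertools.cycle(proxies)

    def next_proxy(self) -> str:
        return next(self._cycle)


# Hypothetical proxy endpoints and target URLs.
pool = ProxyPool([
    "http://proxy1.example.com:8000",
    "http://proxy2.example.com:8000",
    "http://proxy3.example.com:8000",
])
urls = [f"https://example.com/page/{n}" for n in range(1, 6)]

# Pair every URL with the proxy that should fetch it.
assignments = [(url, pool.next_proxy()) for url in urls]
for url, proxy in assignments:
    print(f"{url} via {proxy}")
```

Round-robin keeps the number of requests per proxy roughly equal, which is the simplest form of the traffic splitting described above.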
The size of your proxy pool will depend on a number of factors:
All five of these factors have a big impact on the effectiveness of your proxy pool. If you don’t properly configure your pool of proxies for your specific web scraping project you can often find that your proxies are being blocked and you’re no longer able to access the target website.
In the next section we will look at the different types of IPs you can use as proxies.
If you’ve done any level of research into your proxy options, you will probably have realised that this can be a confusing topic. Every proxy provider is shouting from the rafters that they have the best proxy IPs on the web, with very little explanation as to why, which makes it very hard to assess which proxy solution is best for your particular project.
So in this section of the guide we will break down the key differences between the available proxy solutions and help you decide which solution is best for your needs. First, let’s talk about the fundamentals of proxies: the underlying IPs.
As mentioned already, a proxy is just a third-party IP address that you can route your request through. However, there are 3 main types of IPs to choose from, each with its own pros and cons.
If you are planning on scraping at any reasonable scale, just purchasing a pool of proxies and routing your requests through them likely won’t be sustainable in the long term. Your proxies will inevitably get banned and stop returning high-quality data.
Here are some of the main challenges that you will face when managing your proxy pool:
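To give a feel for what managing a pool involves, here is a minimal sketch that rests a proxy after repeated failures and brings it back after a cooldown. Everything in it is an assumption made for illustration: the `ManagedProxyPool` class, the failure threshold, the cooldown length, and the idea that your own code decides what counts as a failure (for example a 403 or 429 response, or a captcha page).

```python
import time


class ManagedProxyPool:
    """Track failures per proxy and rest proxies that look banned."""

    def __init__(self, proxies, max_failures=3, cooldown_seconds=300.0,
                 clock=time.monotonic):
        self._failures = {proxy: 0 for proxy in proxies}
        self._resting_until = {}
        self._max_failures = max_failures
        self._cooldown = cooldown_seconds
        self._clock = clock  # injectable so the logic can be tested without waiting

    def available(self):
        """Return the proxies that are not currently resting."""
        now = self._clock()
        return [
            proxy for proxy in self._failures
            if self._resting_until.get(proxy, float("-inf")) <= now
        ]

    def report_success(self, proxy):
        # A good response resets the consecutive failure count.
        self._failures[proxy] = 0

    def report_failure(self, proxy):
        self._failures[proxy] += 1
        if self._failures[proxy] >= self._max_failures:
            # Too many failures in a row: rest the proxy for the cooldown period.
            self._resting_until[proxy] = self._clock() + self._cooldown
            self._failures[proxy] = 0


# Hypothetical endpoints; call report_failure() whenever a response looks like a ban.
managed = ManagedProxyPool(
    ["http://proxy1.example.com:8000", "http://proxy2.example.com:8000"]
)
managed.report_failure("http://proxy1.example.com:8000")
print(managed.available())
```

Resting a proxy instead of discarding it reflects the fact that many bans are temporary, although how long to wait depends heavily on the target website.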
Deciding on an approach to building and managing your proxy pool can be a headache. In this section we will outline some of the questions you need to be asking yourself when picking the best proxy solution for your needs:
Your answers to these questions will quickly help you decide which approach to proxy management best suits your needs.
By this stage, you should have a good idea of what proxies are and how to choose the best option for your web scraping project. However, there is one consideration that many people overlook when it comes to web scraping and proxies: the legal considerations.
The act of using a proxy IP to visit a website is legal; however, there are a couple of things you need to keep in mind to make sure you don’t stray into a grey area.
Having a robust proxy solution is akin to having a superpower, but it can also make you sloppy. With the ability to make a huge volume of requests to a website without the website being easily able to identify you, people can get greedy and overload a website's servers with too many requests, which is never the right thing to do.
If you are a web scraper you should always be respectful to the websites you scrape. No matter the scale or sophistication of your web scraping operation you should always comply with web scraping best practices (Web Scraping Best Practices Guide Coming Soon) to ensure your spiders are polite and cause no harm to the websites you are scraping. Additionally, if the website informs you (or informs the proxy provider) that your scraping is burdening their site or is unwanted, you should limit your requests or cease scraping, depending on the complaint received. So long as you play nice, it's much less likely you will run into any legal issues.
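One practical way to limit your requests is to enforce a minimum delay between them. The sketch below shows the idea with a small `Throttle` helper; the class name and the two-second interval are our own illustrative choices, and the right delay depends on the website you are scraping.

```python
import time


class Throttle:
    """Ensure at least `min_interval` seconds pass between requests."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self._min_interval = min_interval
        self._clock = clock    # injectable for testing
        self._sleep = sleep    # injectable for testing
        self._last = None      # time of the previous request, if any

    def wait(self):
        """Block until it is polite to send the next request."""
        now = self._clock()
        if self._last is not None:
            remaining = self._min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


# Illustrative setting: at most one request every 2 seconds.
throttle = Throttle(min_interval=2.0)
# for url in urls_to_scrape:   # hypothetical list of URLs
#     throttle.wait()
#     fetch(url)               # hypothetical fetch function
```

Calling `wait()` before every request keeps the request rate bounded no matter how many proxies are in the pool, so a large pool does not translate into a heavier load on the target site.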
As mentioned in our Web Scrapers Guide to GDPR, the other legal consideration when using residential or mobile IPs is whether you have the IP owner's explicit consent to use their IP for web scraping.
As GDPR treats IP addresses as personal data, you need to ensure that any EU residential IPs you use as proxies are GDPR compliant. This means that you need to ensure that the owner of that residential IP has given their explicit consent for their home or mobile IP to be used as a web scraping proxy.
If you own your own residential IPs then you will need to handle this consent yourself. However, if you are obtaining residential proxies from a 3rd party provider, then you need to ensure that they have obtained consent and are in compliance with GDPR prior to using the proxy for your web scraping project.