IP Rotation with Python
Accessing online services in an automated manner often leads to being blocked by those services. If requests come from IPs that are consecutive or lie within the same range, even the most primitive anti-scraping plugin can detect that you are a bot and block content scraping.
Since sites detect crawlers by examining their IP addresses, a workable strategy to avoid being blocked is to use a web scraper that operates in the cloud, so it does not run from your local IP address. Another method to get around IP blocks from targeted websites is IP rotation.
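To make the detection idea concrete, here is a minimal sketch of how a simple filter might flag clients whose addresses cluster in one range. The function name `flag_busy_subnets`, the /24 grouping, and the threshold are illustrative assumptions, not a description of any specific anti-scraping product.

```python
from collections import Counter
import ipaddress

def flag_busy_subnets(client_ips, prefix=24, threshold=3):
    """Return the subnets that sent at least `threshold` requests."""
    counts = Counter(
        # strict=False lets us pass a host address and get its enclosing network
        ipaddress.ip_network(f"{ip}/{prefix}", strict=False)
        for ip in client_ips
    )
    return {str(net) for net, n in counts.items() if n >= threshold}

# Made-up addresses from the documentation ranges
requests_seen = ["203.0.113.10", "203.0.113.11", "203.0.113.12", "198.51.100.7"]
print(flag_busy_subnets(requests_seen))  # {'203.0.113.0/24'}
```

Three requests from neighbouring addresses are enough to flag the whole range here, which is why spreading requests over unrelated IPs helps.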
We’ll introduce two IP rotation techniques here. The first relies on automating free proxies, while the second relies on the Amazon API Gateway service.
Automate Free Proxies
Free proxies don’t live long; some expire even before a scraping session is completed. To keep this from disrupting your session, write some code that automatically selects working IP addresses and refreshes the list of proxies. This will save you much time and help avoid frustration.
The Free Proxy List service maintains free proxies that are checked and updated every 10 minutes.
The following Python code shows a usage example, based on that service:
import requests
from itertools import cycle
from lxml.html import fromstring

src = 'https://api64.ipify.org'
response = requests.get('https://free-proxy-list.net/')
parser = fromstring(response.text)

# Collect "ip:port" strings for rows whose 7th column (HTTPS support) reads "yes"
proxies = set()
for row in parser.xpath('//tbody/tr'):
    if row.xpath('.//td[7][contains(text(),"yes")]'):
        proxy = ":".join([row.xpath('.//td[1]/text()')[0],
                          row.xpath('.//td[2]/text()')[0]])
        proxies.add(proxy)

if proxies:
    proxy_pool = cycle(proxies)
    # Try each proxy once, stopping at the first successful request
    for _ in range(len(proxies)):
        proxy = next(proxy_pool)
        try:
            # The timeout keeps a dead proxy from hanging the loop
            r = requests.get(src, proxies={"http": proxy, "https": proxy}, timeout=10)
            file_request_succeed = r.ok
            if file_request_succeed:
                print('Rotated IP %s succeed' % proxy)
                break
        except Exception as e:
            print('Rotated IP %s failed (%s)' % (proxy, str(e)))
The code above first fetches the proxy list from the Free Proxy List service and then tries the proxies one by one against the source URL (here https://api64.ipify.org, which returns the IP address the request came from). The loop stops at the first successful request.
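For longer sessions, the "select and refresh" idea can be wrapped in a small pool object. The following is only a sketch under stated assumptions: `ProxyPool`, `fetch_proxies`, and `mark_failed` are hypothetical names, and the 600-second default simply mirrors the 10-minute update interval mentioned earlier. `fetch_proxies` can be any callable that returns "ip:port" strings, for example a function wrapping the scraping code above.

```python
import time
from itertools import cycle

class ProxyPool:
    """Rotate over proxies, refreshing the list when it goes stale or runs dry."""

    def __init__(self, fetch_proxies, max_age_seconds=600, clock=time.monotonic):
        self._fetch = fetch_proxies      # hypothetical callable returning "ip:port" strings
        self._max_age = max_age_seconds  # assumption: matches the 10-minute update interval
        self._clock = clock
        self._proxies = set()
        self._fetched_at = None
        self._cycle = iter(())

    def _refresh(self):
        self._proxies = set(self._fetch())
        self._fetched_at = self._clock()
        self._cycle = cycle(sorted(self._proxies))

    def next_proxy(self):
        stale = (self._fetched_at is None
                 or self._clock() - self._fetched_at > self._max_age)
        if stale or not self._proxies:
            self._refresh()
        if not self._proxies:
            raise RuntimeError("no proxies available")
        return next(self._cycle)

    def mark_failed(self, proxy):
        """Drop a dead proxy and restart the rotation without it."""
        self._proxies.discard(proxy)
        self._cycle = cycle(sorted(self._proxies))

# Demo with made-up proxies instead of a live fetch
pool = ProxyPool(lambda: ["10.0.0.1:8080", "10.0.0.2:3128"])
first = pool.next_proxy()
pool.mark_failed(first)
print(pool.next_proxy())  # 10.0.0.2:3128
```

In a scraping loop you would call `next_proxy()` before each request and `mark_failed()` whenever a request through that proxy raises or times out.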
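Note that there are many other common ways to parse the free proxies from the HTML. One such approach is pandas’s read_html:
import requests
import pandas as pd
from io import StringIO

response = requests.get('https://free-proxy-list.net/')
# Wrap the HTML in StringIO: recent pandas versions deprecate passing a literal HTML string
df = pd.read_html(StringIO(response.text))[0]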
From here, it’s relatively easy to get the IP addresses and associated ports from df['IP Address'] and df['Port'].
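As a rough sketch of that last step, the helper below turns such a table into "ip:port" strings. The function name `proxies_from_table` is hypothetical, and the "Https" column name is an assumption about the table layout; the sample frame stands in for the real download and contains made-up values.

```python
import pandas as pd

def proxies_from_table(df, https_only=True):
    """Build "ip:port" strings from a proxy table."""
    if https_only:
        # Assumption: an "Https" column holding "yes"/"no"
        df = df[df["Https"].astype(str).str.lower() == "yes"]
    # int() guards against ports parsed as floats or NumPy integers
    return [f"{ip}:{int(port)}" for ip, port in zip(df["IP Address"], df["Port"])]

# Stand-in for the table returned by pd.read_html
sample = pd.DataFrame({
    "IP Address": ["203.0.113.5", "198.51.100.9"],
    "Port": [8080, 3128],
    "Https": ["yes", "no"],
})
print(proxies_from_table(sample))  # ['203.0.113.5:8080']
```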
Amazon API Gateway
Amazon API Gateway is an AWS service for creating, publishing, maintaining, monitoring, and securing REST, HTTP, and WebSocket APIs at any scale.
requests-ip-rotator is an open-source Python library that uses AWS API Gateway’s large IP pool as a proxy, generating a practically unlimited supply of IPs for web scraping and brute forcing. The library allows bypassing IP-based rate limits for sites and services. X-Forwarded-For headers are automatically randomized and applied unless you supply one yourself, because otherwise AWS would send the client’s true IP address in this header.
Amazon API Gateway sends its requests from any available IP, and since the AWS infrastructure is so large, the IP is almost guaranteed to be different each time. Note, however, that these requests can be easily identified and blocked, since they carry AWS-specific headers such as “X-Amzn-Trace-Id”.
The following Python code shows a usage example, based on that library:
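To illustrate the X-Forwarded-For behaviour described above, here is a rough sketch of the idea: add a random IPv4 address unless the caller already supplied the header. This is not the library's actual implementation; `random_forwarded_for` and `with_forwarded_for` are hypothetical helper names.

```python
import ipaddress
import random

def random_forwarded_for(rng=random):
    """Return a random IPv4 address as a string."""
    return str(ipaddress.IPv4Address(rng.getrandbits(32)))

def with_forwarded_for(headers=None, rng=random):
    """Copy `headers`, adding X-Forwarded-For only when the caller gave none."""
    merged = dict(headers or {})
    # Simplification: the lookup is case-sensitive, unlike real HTTP headers
    merged.setdefault("X-Forwarded-For", random_forwarded_for(rng))
    return merged

print(with_forwarded_for({"User-Agent": "demo"}))             # random address added
print(with_forwarded_for({"X-Forwarded-For": "203.0.113.1"}))  # caller's value kept
```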
import requests
from urllib.parse import urlparse
from requests_ip_rotator import ApiGateway, EXTRA_REGIONS

# Placeholders: replace with your own AWS credentials
aws_access_key_id = 'XXXXX'
aws_secret_access_key = 'XXXXX'

src = 'https://api64.ipify.org'
src_parsed = urlparse(src)
src_nopath = "%s://%s" % (src_parsed.scheme, src_parsed.netloc)

# Create gateways for the target site (scheme and host only)
gateway1 = ApiGateway(src_nopath,
                      regions=EXTRA_REGIONS,
                      access_key_id=aws_access_key_id,
                      access_key_secret=aws_secret_access_key)
gateway1.start(force=True)

# Route every request to the target site through the gateway
session1 = requests.Session()
session1.mount(src_nopath, gateway1)
try:
    r = session1.get(src, stream=True)
    file_request_succeed = r.ok
    if file_request_succeed:
        print('Rotated IP succeed')
except Exception as e:
    print('Rotated IP failed (%s)' % str(e))
finally:
    # Always delete the gateways to avoid dangling endpoints
    gateway1.shutdown()
The code above accesses the source URL (here https://api64.ipify.org, which returns the IP address the request came from) with a pseudo-randomly rotated IP.
Notes:
- The AmazonAPIGatewayAdministrator policy must be added to the permissions of the AWS IAM user whose credentials you use.
- Amazon API Gateway is free for the first million requests per region, which means that for most use cases this should be completely free.
- It’s important to shut down the ApiGateway resources once you have finished with them (done through the shutdown method), to prevent dangling public endpoints that can cause excess charges to your account.
- Additional credit for the requests-ip-rotator library also goes to RhinoSecurityLabs and ustayready.
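One way to make sure shutdown is never forgotten is to tie it to a with block. The sketch below is an illustration under assumptions: `managed_gateway` is a hypothetical helper that works with any object exposing start() and shutdown(), and `FakeGateway` is a stand-in so the example runs without AWS credentials.

```python
from contextlib import contextmanager

@contextmanager
def managed_gateway(gateway, **start_kwargs):
    """Start `gateway` and guarantee shutdown() runs afterwards."""
    gateway.start(**start_kwargs)
    try:
        yield gateway
    finally:
        # Runs even if the scraping code inside the with block raises
        gateway.shutdown()

class FakeGateway:
    """Stand-in for ApiGateway so the sketch runs without AWS."""
    def __init__(self):
        self.events = []
    def start(self, **kwargs):
        self.events.append("start")
    def shutdown(self):
        self.events.append("shutdown")

fake = FakeGateway()
try:
    with managed_gateway(fake, force=True):
        raise RuntimeError("scrape failed")
except RuntimeError:
    pass
print(fake.events)  # ['start', 'shutdown']
```

With a real gateway object, the session would be created and mounted inside the with block, and the endpoints would be removed when the block exits, even after an error.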