In this article we will walk through steps you can take to troubleshoot unexpected issues when using your proxies via Proxy Pilot. As a reminder, Proxy Pilot is a tool that relies on proper configuration by the end user: if you set bad headers or cookies, use poor-quality proxies, and so on, you will get poor results regardless.


At the core of web scraping lies a simple rule: if you cannot load a page in your browser using your home/work IP address, it is unlikely you will be able to scrape that page using software + a proxy source.


There are many ways to detect scraping software (see example1 and example2), so the more customization you add to loading a website (your software + proxies), the greater your footprint will be, and the easier it will be to detect you. 


If you do not wish to worry about such anti-scraping battles, please consider our API at https://scrapingrobot.com/api/. Our Scraping Robot API was built to solve this exact issue: it allows you to focus on your core business instead of fighting anti-scraping technologies.


If you wish to manage your own proxies, use your own developer resources, and pay for server compute power, then Proxy Pilot will help with (but not fully solve!) some of these common scraping issues for you.

 

Example of bad vs good scraping requests

Below you will find an example of a very bad scraping request to Amazon (or any site, really). Proxy Pilot's role is not to fix such bad requests - it is still up to the developer's code to send good requests and avoid being banned.

 

curl -s 'https://www.amazon.com/dp/B07HNW68ZC/' \
     -x 'PROXY_LOGIN:PROXY_PASS@PROXY_IP:PROXY_PORT' \
     -k --compressed -v

 

The reason the above code would result in a ban is not Proxy Pilot, or even your proxies, but rather that normal browser requests carry many more headers. Specifically, Amazon checks that the request has at least a 'User-Agent' header, so no matter which proxies you run this request through, it will most likely get blocked.

By simply adding a User-Agent header to your request, you can significantly decrease its ban rate:

 

curl -s 'https://www.amazon.com/dp/B07HNW68ZC/' \
     -H 'User-Agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.141 Safari/537.36' \
     -x 'PROXY_LOGIN:PROXY_PASS@PROXY_IP:PROXY_PORT' \
     -k --compressed -v

 

 

Tip #1:  Replicate your request in a browser to confirm it works there first

 

Description: As mentioned above, the best way to know whether your software is causing issues is to run your request via a browser on your local machine. Because your local machine uses a pure “residential” IP and a browser is not customized software, you should be able to load pages successfully this way. However, if you cannot load a page in your browser using the steps below, it means you are passing incorrect headers or cookies to the target URL and will need to debug on your side to find the proper headers/cookies.

 

Steps to troubleshoot:

  1. Open a Chrome incognito tab and make sure you clear the cookies
    Since most scraping software starts with no previous browsing history and no cookies, this is the best way to replicate how your software will behave

  2. Open the URL you're going to scrape by simply pasting it into the address bar
    E.g.: https://www.amazon.com/gp/product/B08F7PTF53/

  3. Make sure it loads as expected. If it fails, take this into consideration when designing your scraping software
    In some cases the site might ban you even at this step, simply because you have no previous browsing history (and no cookies). For example, the following URL will most likely not give you the expected results if you open it with no cookies:
    https://www.amazon.com/gp/aod/ajax/ref=dp_aod_ALL_mbc?asin=B08F7PTF53

  4. In Chrome DevTools, open the Network tab and check the first request (it will likely have the 'document' type). Verify that the URL of that request matches the one you just opened, then right-click it and choose 'Copy as cURL'
    The cURL command sent by the browser will now be in your clipboard and will look something like this:
    curl 'https://www.amazon.com/gp/product/B08F7PTF53/' \
      -H 'authority: www.amazon.com' \
      -H 'sec-ch-ua: " Not;A Brand";v="99", "Google Chrome";v="91", "Chromium";v="91"' \
      -H 'sec-ch-ua-mobile: ?0' \
      -H 'upgrade-insecure-requests: 1' \
      -H 'user-agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.77 Safari/537.36' \
      -H 'accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9' \
      -H 'sec-fetch-site: none' \
      -H 'sec-fetch-mode: navigate' \
      -H 'sec-fetch-user: ?1' \
      -H 'sec-fetch-dest: document' \
      -H 'accept-language: en-US,en;q=0.9' \
      --compressed

Note how many headers the browser sends even from incognito mode. With cookies, the request can easily be several times larger.

Note: cURL is provided by default with macOS and most Linux distributions, as well as the latest Windows 10 updates.


If you are running an older version of Windows, you can install curl from the official curl website.
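
To quickly confirm that curl is available on your machine, you can run the command below in a terminal; it prints the installed curl version along with its supported protocols and features:

curl --version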


Tip #2:  Replicate the same request from the browser via Proxy Pilot

Description: After step #4 in the previous tip you should have a complete cURL request with a realistic set of headers. You can replay it through Proxy Pilot by simply adding the following parameters to that cURL command:  -x 'PROXY_LOGIN:PROXY_PASS@PROXY_IP:PROXY_PORT' -v -k  (using the Proxy Pilot credentials provided to you earlier). This will send the same request via Proxy Pilot.

You might also consider adding the parameter -o test.html, which saves the response into a test.html file so you can open it in a browser and check its content to make sure everything is working properly.
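
For example, putting the pieces together (here with only two of the headers copied from the browser in Tip #1 for readability; in practice you would keep the full header set from step #4), the complete command might look something like this:

curl 'https://www.amazon.com/gp/product/B08F7PTF53/' \
     -H 'user-agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.77 Safari/537.36' \
     -H 'accept-language: en-US,en;q=0.9' \
     -x 'PROXY_LOGIN:PROXY_PASS@PROXY_IP:PROXY_PORT' \
     -v -k --compressed -o test.html

Replace PROXY_LOGIN, PROXY_PASS, PROXY_IP and PROXY_PORT with the Proxy Pilot credentials provided to you.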
If it returns proper content at this stage, it means Proxy Pilot is working fine and is taking care of managing proxies, retrying requests that get banned on a particular proxy, and so on.

If the request works directly (without routing it through Proxy Pilot via the -x flag) but stops working via Proxy Pilot, please inform us about that and let us know which cURL request you were sending.

Tip #3:  Replicate the same behavior via your software

Description: Once you’ve tested your request via the browser and via Proxy Pilot, you can apply the same approach in your own scraping software. Integration with Proxy Pilot is almost as simple as using regular proxies for data scraping. More details and code examples for different languages and frameworks can be found here: http://bit.ly/proxypilot


Please note that if, during integration, the same request that worked via cURL stops working from your software, the most likely reason is the set of headers. Many sites implement very sophisticated anti-scraping solutions that take into account not only cookies and user-agents, but also the specific order of headers, the compression algorithms advertised, and the market share of the claimed browser (e.g. Chrome v41 is rarely used nowadays, so sending it in the User-Agent would look suspicious to the target site).
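
One way to see exactly which request headers a working cURL command sends (so you can compare them, including their order, with what your software sends) is to inspect cURL's verbose output: the outgoing request line and headers are printed to stderr prefixed with '> '. A minimal sketch, reusing the same placeholder proxy credentials as above:

# curl prints the outgoing request headers to stderr with a '> ' prefix; grep keeps only those lines
curl -s 'https://www.amazon.com/gp/product/B08F7PTF53/' \
     -H 'user-agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.77 Safari/537.36' \
     -x 'PROXY_LOGIN:PROXY_PASS@PROXY_IP:PROXY_PORT' \
     -k -v -o /dev/null 2>&1 | grep '^> '

You can then log the headers your own software actually sends (for example, against a local test server) and compare the two lists side by side.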