How does Python Selenium bypass cloudflare to crawl web pages?

Cloudflare, Python, Selenium · 21 Jul 2023

Like many websites, Cloudflare also detects access to see if it is initiated by a Selenium bot. This detection mainly focuses on whether there are unique js variables, such as variables containing “selenium” and “webdriver”, or file variables containing “$cdc_” and “$wdc_”.

The detection mechanism of each driver may be different, the following solutions are mainly for chromedriver.

Use Undetected-chromedriver This is a very convenient package that can be installed directly through pip. Then initialize the driver like below, after that it works just like regular Selenium usage.

python Copy code

import undetected_chromedriver as uc
driver = uc.Chrome()
driver.get('https://nowsecure.nl')

Directly modify the chromedriver executable file You can change the key variable to any character that does not contain “cdc”.

javascript Copy code

/**
 * Returns the global object cache for the page.
 * @param {Document=} opt_doc The document whose cache to retrieve. Defaults to
 *     the current document.
 * @return {!Cache} The page's object cache.
 */
function getPageCache(opt_doc, opt_w3c) {
  var doc = opt_doc || document;
  var w3c = opt_w3c || false;
  // |key| is a long random string, unlikely to conflict with anything else.
  var key = '$cdc_asdjflasutopfhvcZLmcfl_';
  if (w3c) {
    if (!(key in doc))
      doc[key] = new CacheWithUUID();
    return doc[key];
  } else {
    if (!(key in doc))
      doc[key] = new Cache();
    return doc[key];
  }
}

There is not much difference between these two methods in essence. In fact, undetected-chromedriver will apply a patch when starting chromedriver, completing the steps of modifying the key.

python Copy code

def patch_exe(self):
    """
    Patches the ChromeDriver binary
    :return: False on failure, binary name on success
    """
    logger.info("patching driver executable %s" % self.executable_path)
 
    linect = 0
    replacement = self.gen_random_cdc() #这里修改了cdc的名称
    with io.open(self.executable_path, "r+b") as fh:
        for line in iter(lambda: fh.readline(), b""):
            if b"cdc_" in line:
                fh.seek(-len(line), 1)
                newline = re.sub(b"cdc_.{22}", replacement, line)
                fh.write(newline)
                linect += 1 

Using the ScrapingBypass API, you can easily bypass Cloudflare robot verification, even if you need to send 100,000 requests, you don’t have to worry about being identified as a scraper.
A ScrapingBypass API can break through all anti-anti-bot robot inspections, easily bypass Cloudflare, CAPTCHA verification, WAF, CC protection, and provide HTTP API and Proxy, including interface address, request parameters, return processing; and set Referer, browser UA and headless status and other browser fingerprint device features.