Bypass Cloudflare Anti-scraping with Cyotek WebCopy + Fiddler

😗Translated Content😗

This article was machine translated and has not been proofread by the author. The information it contains may be inaccurate. The author will do his best to come back (when he has time) and revise these articles. 🥰

For Chinese version of this article, see here.

Foreword

As a CDN provider, Cloudflare makes its presence felt most strongly (and most annoyingly) when its DDoS protection page pops up while you are browsing a website. Usually that is tolerable: at worst you get a page like the one below and wait a moment. But in certain network environments, or when doing certain things (such as the crawling we will do in this article), Cloudflare forces users to solve a graphical captcha, and it recently switched from Google’s reCaptcha to its own hCaptcha [^hcaptcha], reportedly because reCaptcha started costing money. After that change, the open source Cloudflare bypass solutions [^pycf] [^jscf] [^rucf] all went cold; the author of the NodeJS library [^jscf] stated outright that he was abandoning maintenance and set the project’s GitHub repository to archive mode.

/2020/scraping-cloudflare-with-webcopy/00.png

But just because the open source projects have surrendered doesn’t mean we can; the work still has to get done. Weighing the pros, cons, and characteristics of the various crawler tools, and considering that I am lazy and prefer tools with a GUI, I finally settled on Cyotek WebCopy plus Fiddler as the crawler solution. The former handles task scheduling; the latter handles injecting the key request parameters and the pre- and post-processing of requests and responses.

This article uses a Cloudflare-protected resource website as an example to demonstrate how to quickly set up a crawl task with Cyotek WebCopy, control the crawler’s request and response behavior with a Fiddler script, and mirror the site.

Technology selection

I know that if I were reading this article myself, after the paragraph above I would immediately grab the keyboard and start ranting: why not use xxxxx or xxxxx? Below I list the similar alternatives I know of, their advantages, and why they were ultimately rejected for the specific needs of this article.

A crawler needs two functional components. The first is task scheduling, which is responsible for scanning the site’s directory structure, triggering web requests, and organizing and storing the downloaded files. The second is request construction, by which I mean modifying the data before a request is sent and after a response is returned, via HTTP proxies, middleware, and similar mechanisms. For our target (a Cloudflare-protected website) there is extra work here: use a browser to solve the JS challenge and captcha, stuff the resulting authentication cookie into every request, and handle the error state when the cookie expires. As far as I know, no automated script can do this reliably any more, so we have to solve the captcha manually in the browser and feed the cookie back into the crawler in time. This manual step may need to be repeated 3–4 times over a full crawl.

Of course, some frameworks can do both, scrapy for example. But let’s take it step by step.

**Task scheduling component:**

Cyotek WebCopy (final choice)

  • Advantages:
    • GUI, which I like. The interface is clear and shows the number of successful downloads, the number of errors, and a progress bar.
    • There is a sitemap feature that previews the site map before the task starts and shows in real time how adjusting the crawler parameters affects the crawl scope.
    • Hooking up an HTTP proxy is easy (at least not as painful as in httrack), which makes it convenient to chain in other tools to modify the web requests.
  • Disadvantages:
    • No multithreading. I don’t know whether this is a configuration problem.
    • No support for resuming an interrupted crawl: after restarting a task it crawls everything from scratch. There is no caching mechanism like scrapy’s or httrack’s.

scrapy

  • Advantages: comprehensive; everything is built in, and it can do whatever you want.
  • Disadvantages:
    • You need to write code, and the workload is somewhere between a day and a week. Unnecessary.
    • It cannot solve the captcha, so the advantages of automation largely evaporate.

httrack

  • Pros: GUI, I like it.
  • Disadvantages:
    • The GUI only covers parameter configuration; there is no monitoring of the running process, and the log contains almost no information. You can’t see progress or errors, so naturally you can’t debug it.
    • Customizability is extremely poor and requests cannot be modified. I tried various ways to route it through an HTTP proxy on the back end, without success.
    • The resume function is buggy, and once it breaks it is hard to debug.

archivebox

  • At the time of writing, v0.4.3 is undergoing a large-scale refactor, and a PR [1] has been sitting for a year. I don’t want to stumble into bugs and end up fixing them for the maintainer; I don’t want to read an open source project’s code at all.

wget

  • Advantages: it has a timestamping cache feature [^wgett], so an interrupted crawl can be resumed.
  • Disadvantages: poor customizability and no URL filtering.

**Request construction component:**

Fiddler (final choice)

  • FiddlerScript serves as the request-control interface. It doesn’t offer many features, but it can intercept and modify requests according to conditions set in the script, which is enough.
  • It can install its CA certificate into the system certificate store with one click, which makes HTTPS interception easy; a shortcut the other tools here don’t offer.

scrapy

  • Requires writing code. Too much trouble.

burpsuite

  • Only supports Java 8. ~~Its developer seems to have moved on, so it won’t be updated.~~ I only have Java 11 on my machine and I’m not installing Java 8 just for it. It’s also not clear to me whether Burp even has a scripting feature.

mitmproxy

  • A tool that requires user interaction but has no GUI is useless to me. mitmweb is a toy.

WebCopy

  • Some request parameters (headers, UA, etc.) can be modified, but they cannot be changed dynamically after the task has started, which doesn’t meet our needs.

Crawler implementation

Both tools used in this article, Cyotek WebCopy and Fiddler, are free:

Cyotek WebCopy Downloads - Copy websites locally for offline browsing • Cyotek

https://www.cyotek.com/cyotek-webcopy/downloads

Download Fiddler Web Debugging Tool for Free by Telerik

https://www.telerik.com/download/fiddler

Get Cloudflare Authentication Cookies

For the crawler to work properly, we need to manually solve Cloudflare’s captcha and record the cookies returned by the server.

First visit the target page in a browser and follow the prompts to complete the hCaptcha. Cloudflare’s captcha is characterized by relatively few categories; the common ones are umbrellas, planes, boats, and bicycles, though many photos are taken from odd angles or show only part of the subject (for example, a picture containing nothing but a bicycle wheel). Still, the captcha is simpler in form than Google’s.

/2020/scraping-cloudflare-with-webcopy/01.png

After verification succeeds and you are on the page, open the browser developer tools, refresh the page, and in the Network tab click any request and copy its Cookie value and User-Agent value for later use. That completes this step.

/2020/scraping-cloudflare-with-webcopy/02.png

Make sure you are comfortable with this step; we may repeat it 3–4 times over the course of the whole crawl. Don’t close this page or the developer tools window yet; we will need them again later.

Crawler configuration

First create a new project in WebCopy. It’s not difficult; just follow the new-project wizard and click Next all the way through.

There are two points to note. First, open the project options from the menu bar.

  1. In the User Agent settings, set the crawler’s UA to be identical to the browser’s. You can find your own UA with any online webmaster tool or in the browser developer tools. **This step is important**, because Cloudflare ties the validity of the cookie to the UA.
  2. In the Query Strings settings, enable the option that strips query strings. The pages we crawl are file-listing pages with several sorting links at the top; I’m too lazy to write filter rules, and enabling this ensures we download only the listed files plus index.html, without miscellaneous link variants.

Next, open the proxy settings from the menu bar and set the proxy server to Fiddler’s local port (127.0.0.1:8888 by default).

Fiddler configuration

We need to use Fiddler to modify the content of the request sent by WebCopy to achieve two functions:

  1. Insert the cookie we obtained in the first step before the request is sent.
  2. Before the response is returned, check whether Cloudflare has returned an error; if so, trigger a breakpoint so we can intervene manually.

Open Fiddler and switch the right-hand pane to the FiddlerScript tab. By default Fiddler provides a code template; the code here is JScript.NET. We don’t need to worry about the details or read the documentation; we just follow the existing pattern.

Let’s first store the cookie we just copied: define a string variable at the very beginning of the Handlers class. Putting it at the top makes it easy to change when the cookie expires later on.

/2020/scraping-cloudflare-with-webcopy/03.png
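
The screenshot is not reproduced here, so as a rough sketch (the variable name and cookie value below are placeholders of my own, not the ones in the screenshot), the addition at the top of the Handlers class looks something like this:

```
class Handlers
{
    // Placeholder: paste the Cookie header value copied from the browser here.
    // Update this string whenever the Cloudflare clearance cookie expires.
    static var g_cfCookie: String = "__cfduid=...; cf_clearance=...";

    // ... the rest of Fiddler's default script template stays unchanged ...
}
```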

Next, scroll down to the OnBeforeRequest method in the class definition and insert our cookie modification at its beginning (this implements function 1).

/2020/scraping-cloudflare-with-webcopy/04.png
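
A minimal sketch of that addition, assuming the hostname below stands in for the real target site and using the g_cfCookie variable defined above:

```
static function OnBeforeRequest(oSession: Session) {
    // Placeholder hostname: replace it with the site actually being crawled.
    if (oSession.HostnameIs("files.example.com")) {
        // Overwrite the Cookie header with the value copied from the browser.
        oSession.oRequest["Cookie"] = g_cfCookie;
    }

    // ... the template's original OnBeforeRequest code continues below ...
}
```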

Scroll down further to the OnBeforeResponse method and check whether the HTTP status code is 503. If it is, set a special session flag [^fbreak]; when Fiddler sees this flag it hits a breakpoint and pops up a prompt (this implements function 2). The next section explains how to modify the returned data at that point.

/2020/scraping-cloudflare-with-webcopy/05.png
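
As a sketch, the check can be added near the end of OnBeforeResponse; x-breakresponse is the session flag Fiddler watches in order to pause a response at a breakpoint (the flag value here is just an arbitrary label):

```
static function OnBeforeResponse(oSession: Session) {
    // ... the template's original OnBeforeResponse code stays above ...

    // Cloudflare answers 503 once the clearance cookie has expired.
    if (oSession.responseCode == 503) {
        // Pausing the response here keeps WebCopy waiting while we
        // refresh the cookie and patch the response by hand.
        oSession["x-breakresponse"] = "cloudflare-challenge";
    }
}
```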

After all the modifications are done, click Save Script; don’t forget to click it.

Run and monitor crawler status

Next, click the big copy button at the top right of WebCopy to start the crawler. Couldn’t be more convenient!

Let the progress bar run; there’s no need to hover over it. In my experience, Cloudflare’s cookie expires when the progress bar is about a quarter of the way through. At that point we need to obtain a new cookie, and we have to be prompt about it; otherwise WebCopy’s request will time out and it will move on to the next resource on its own.

When the cookie expires, Fiddler sees a 503 response, triggers the breakpoint, and pops up a notification similar to the figure below.

/2020/scraping-cloudflare-with-webcopy/06.png

The failing request looks like the following. At this point it has been intercepted by Fiddler and WebCopy is blocked waiting; it does not yet know the request failed. If we fix the response up, WebCopy will never know anything happened.

/2020/scraping-cloudflare-with-webcopy/07.png

Go back to the browser window we opened earlier and press F5 to refresh the page; Cloudflare will ask us to complete hCaptcha again. Solve the captcha, then repeat the earlier steps: paste the updated cookie into the variable we defined in FiddlerScript and click Save Script.

Note that the previously intercepted request is still paused; its icon in the session list on the left looks like the figure below. Select it and press R to replay the intercepted request. If the cookie is good, the replay returns a normal 200 status code. We then need to copy this normal response into the intercepted request.

/2020/scraping-cloudflare-with-webcopy/08.png

Select the session of the successful replay and, in the right-hand pane, click through as shown below to get the entire raw HTTP response in the Raw tab. Copy all of it, then select the request that was intercepted and paste everything in.

/2020/scraping-cloudflare-with-webcopy/09.png

Once that’s done, click the green button to let the request continue, and WebCopy receives a normal response.

When the progress bar completes, WebCopy summarizes the results and lists every error the crawler encountered. The HTTP 500s turned out to be a server configuration problem; those files cannot be downloaded in a browser either. The timeouts below them happened because I was too slow while taking screenshots for this article. The few missing files can, of course, easily be downloaded by right-clicking and opening them in a browser.

/2020/scraping-cloudflare-with-webcopy/10.png

At this point, our crawling work is complete.

Conclusion

3.66 GB, 55 minutes of running time. It’s not that I wrote no code at all, but as you’ve seen, there is absolutely no need to write much. If you gave me two options,

  1. Scrapy: 55 minutes writing code, 10 minutes running.
  2. WebCopy: 10 minutes writing code, 55 minutes running.

Then I’d definitely choose the latter! Single-threaded is a bit slow, but the time passes quickly with Bilibili open on the side. How could that be worse than coding your hair off?

I just love GUIs, and I just don’t like writing code! Mouse clicks are productivity!

To wrap up, here’s the cover image [2]:

/2020/scraping-cloudflare-with-webcopy/11.png

References


  1. v0.4.3 (first Django release) by pirate · Pull Request #207 · pirate/ArchiveBox, https://github.com/pirate/ArchiveBox/pull/207 ↩︎

  2. Meme Templates - Imgflip, https://imgflip.com/memetemplates ↩︎