Presenting: A Tight Scrape

Methodological approaches to cybercrime research data collection in adversarial environments

Kieron Turk

Sergio Pastrana

Ben Collier



Last week, I had the privilege of presenting our paper at WACCO, the 2nd Workshop on Attackers and Cyber-Crime Operations. Today, I'll be presenting it here as well. This is my first academic paper and I couldn't be more proud of how it turned out. If you want to read the paper, it can be found in the IEEE digital library.

Background

So, what's all this about? Well, when doing research, we often need to gather large amounts of data from websites so that we can analyze it. Web scrapers are programs which do exactly that: they visit many pages of a website, grab the data we are interested in, and store it in a useful format for later. In some cases, the website owners may not want to allow scrapers to run on their site, and so they will put various defences in place to stop anyone from scraping it. We often have a need (and an ethical case!) to scrape these sites anyway, and so we need to bypass these defences. This process is known as adversarial scraping.

Our work analyzes the various defences encountered during adversarial scraping, based on our experience building our own scrapers. These include:

  • CrimeBB: a collection of 26 hacker forum scrapers spread across the surface and dark web
  • Chat channel scrapers: Gathering data from Discord and Telegram across several hundred individual communities
  • fget, a Firefox implementation of wget: a directory scraper specifically designed for adversarial environments.

The scraping process can be split into four distinct stages: accessing the site, navigating to the pages of interest, loading these pages and the content on them, and gathering data from these pages. The defences found during adversarial scraping generally target one specific stage of the scraping process, although this is not always the case.

I'm going to cover some key defences from each of the stages; if you're interested in seeing the full list in full detail, please have a read of our paper :)

Accessing the site

On many websites, certain functionality will be restricted to logged-in users. This may include specific parts of a forum or, in some cases, access to any part of the site. We need to register an account to get past this, so that our scraper is able to access the restricted content. Although there are many automation defences in place when registering an account, all of these can be bypassed easily by a human registering the account manually. The main issues when creating an account are therefore invite-only forums, where we must know someone on the forum to get access, and sites which require a payment to fully register the account. Payments can be especially problematic, as a scraper getting banned forces further payment to continue scraping.

Once we have an account, we need to log in. This can be done either automatically or manually. In the automated case, we find the username and password fields and type into them, complete any other information required by the login form, and then click submit. Some forms will monitor how the user fills out the form, and may make things harder if bot-like behaviours are detected - for example, submitting the form very shortly after loading the page, or filling out the fields too quickly. These can be worked around by adding a short pause between filling out each field and limiting the typing speed of the crawler. Furthermore, some elements of login forms can be hard to automate - for example, certain forms of two-factor authentication (2FA). If a code gets sent to my phone, but I have to pass it to a scraper running on a server, it can be difficult to automate login. Similarly, Captchas can be hard to work around, although there are solvers for certain types of Captcha and solving services for others. If you are not able (or willing) to work around these measures, then you will instead have to use manual login.
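As an illustration, here is a minimal sketch of automated login with human-like pacing, assuming a Selenium-driven Firefox scraper and a standard username/password form; the URL, field names and delay ranges are placeholders rather than values from the paper:

```python
import random
import time

from selenium import webdriver
from selenium.webdriver.common.by import By

def type_slowly(element, text, min_delay=0.05, max_delay=0.2):
    """Send one character at a time with a random pause, to avoid
    'filled out the fields too quickly' heuristics."""
    for char in text:
        element.send_keys(char)
        time.sleep(random.uniform(min_delay, max_delay))

driver = webdriver.Firefox()
driver.get("https://forum.example/login")                  # placeholder URL
time.sleep(random.uniform(2, 5))                           # don't submit immediately after load

username = driver.find_element(By.NAME, "username")        # placeholder field names
password = driver.find_element(By.NAME, "password")
type_slowly(username, "my_account")
time.sleep(random.uniform(0.5, 1.5))                       # pause between fields
type_slowly(password, "my_password")
time.sleep(random.uniform(1, 3))                           # pause before submitting
driver.find_element(By.CSS_SELECTOR, "button[type=submit]").click()
```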

The alternative to automated login is to have a human log in to the site manually, using the "remember me" option where available, and then copy the cookies for that logged-in session over to the scraper. The scraper can then access the site while logged in to this session. The problem here is session timeouts: the cookies will expire after some time period, and we will need to log in again. On surface web sites, we found the worst case to be about one week, in which case you can set a weekly reminder to log in again and give the cookies to the scrapers. However, on some dark net sites we found the timeout to be as short as one hour, in which case manual login becomes much less feasible. In those cases, the login forms were much easier to automate, so we used that approach instead.
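As a sketch of this hand-over, assuming the cookies have been exported from the browser into a simple name/value JSON file (the file name and URL are placeholders):

```python
import json

import requests

with open("session_cookies.json") as f:      # placeholder: cookies exported from the browser
    cookies = json.load(f)                   # e.g. {"session_id": "abc123", ...}

session = requests.Session()
session.cookies.update(cookies)

# Requests are now made as the logged-in user until the session expires,
# at which point a human must log in again and refresh the cookie file.
response = session.get("https://forum.example/members-only")   # placeholder URL
```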

Navigating the site

When navigating a website, the main issue for a scraper is avoiding being banned. This can mean an account being banned or, in more extreme cases, all traffic from the scraper's IP address being blocked, and it is usually triggered by the user "looking like a bot". When scraping our sites, we identified several behaviours that got our scrapers banned:

  • Rapid navigation: moving between pages faster than a human can, or not staying on any page. Human users will take time to process the navigation pages and find the next link to click, and will also stop on content pages to view the content instead of moving on immediately.
  • Predictable navigation: scrapers will often visit every post on a site in order, either in the order presented on navigation pages, or just by iterating through all post ID numbers.
  • Old content: we found that our scrapers got banned far more often on several forums when viewing content that was years old. We suspect this is because there is much less traffic to older content on these sites; the majority of users are looking at the incoming, new content on each part of the site, and so traffic to older pages may be flagged as suspicious. Scrapers have to operate much more slowly on older content to avoid raising suspicion (see the pacing sketch after this list).
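As a rough illustration of the pacing this calls for, here is a minimal sketch of randomised delays and a randomised visiting order; the delay ranges are illustrative, not the values used in the paper:

```python
import random
import time

def human_pause(is_content_page, is_old_content=False):
    """Sleep for a randomised, human-plausible interval before the next request."""
    if is_content_page:
        delay = random.uniform(10, 40)       # "reading" a thread
    else:
        delay = random.uniform(3, 10)        # scanning a navigation page
    if is_old_content:
        delay *= random.uniform(2, 4)        # crawl old content much more slowly
    time.sleep(delay)

def shuffled(urls):
    """Visit pages in a randomised order rather than strictly sequentially."""
    urls = list(urls)
    random.shuffle(urls)
    return urls
```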

As well as the measures we detected, we were able to find some discussion of our scrapers on the sites we were scraping. From these discussions, we can determine the following factors which humans use to identify bots on a site:

  • 24/7 access: humans need sleep. Robots do not. Humans also tend to do more with their lives than look at a single site all day.
  • Same account, different IPs: some sites show profiles of different accounts available to users of the site, including information such as the IP address accessing the account. Some users noticed that the same account was being accessed from multiple IP addresses, which was suspicious.
  • Lack of interaction: especially when combined with 24/7 access, most users of a forum will be interacting with the discussions they read, and so a user who never posts anything is more likely to be a bot.

The other side of scraper navigation concerns malicious links. Consider a link labelled "Account delete trap": no human is ever going to click on it, but a naive scraper which visits every link it sees will, and will remove its own access to the site in the process. There are other forms of these links, such as those hidden with display:none, so that humans cannot see them but bots can. These are designed to identify and trap bots, and some of the behaviours we see when following these are account deletion (as above) and redirect loops, where the crawler is sent into a navigation loop between two or more pages. Modern browsers such as Chrome and Firefox will detect this loop and prevent you from following it, so using a browser-based scraper is advised to avoid this.
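To illustrate, here is a sketch of how a browser-based scraper might filter out hidden or obviously dangerous links before following them, assuming Selenium; the keyword list is purely illustrative:

```python
from selenium.webdriver.common.by import By

DANGEROUS_WORDS = ("delete", "logout", "ban", "remove")    # illustrative only

def safe_links(driver):
    """Return only links a human could plausibly see and would plausibly click."""
    links = []
    for a in driver.find_elements(By.TAG_NAME, "a"):
        if not a.is_displayed():                 # skips display:none style hidden traps
            continue
        text = (a.text or "").lower()
        href = (a.get_attribute("href") or "").lower()
        if any(word in text or word in href for word in DANGEROUS_WORDS):
            continue                             # skip account-delete style traps
        links.append(href)
    return links
```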

In some cases, mostly with our directory scraper, we found more extreme "attacks on the browser", which attempt to crash the browser instance rather than just trap the scraper. These were found on pages you wouldn't expect a user to be interested in, such as database/cache/testing/, and while they could affect users of the site, they appear to be targeting crawlers and scrapers. Some of these attacks include: an infinitely recursing function, which slowly fills up the call stack and crashes the browser when it runs out of memory; a page which endlessly appends "lolwhy" to itself at an incredible rate, using up both the computer's processing power and memory until the page is too large and crashes; and a massive file which may be too large to load into memory and causes issues on certain browsers.
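One way to limit the damage from pages like these is to bound how long the browser may spend on any single page. A minimal sketch, assuming Selenium with Firefox and illustrative timeout values:

```python
from selenium import webdriver
from selenium.common.exceptions import TimeoutException

driver = webdriver.Firefox()
driver.set_page_load_timeout(30)   # give up on pages that never finish loading
driver.set_script_timeout(10)      # bound any scripts we inject ourselves

def fetch(url):
    """Load a page, abandoning it if it tries to hang or crash the browser."""
    try:
        driver.get(url)
        return driver.page_source
    except TimeoutException:
        return None                # log and skip the page rather than crash
```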

Page loading

There are two main issues faced when loading individual pages. First, we need to step back and realise that scraping generates a lot of traffic. We're forcing a server to respond to a large number of requests, and especially for small sites this can have the same effect as a DDoS attack. As a result, when loading pages we often encounter DDoS protections intended to reduce malicious traffic to the site. These can be commercial DDoS protection services like Cloudflare or Blazingfast, or a homebrewed version as seen on a small number of websites. Another protection against DDoS is rate limiting: the server may track the number of page requests coming from an account or IP over some period of time, and if there are too many it will stop responding to HTTP requests from that source. As an example, one of our scrapers found that going over about 50 page requests in a minute prevented us from accessing content for the next hour, being served 429 - Too Many Requests errors instead.
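A minimal sketch of backing off once the server starts returning 429s, assuming a requests-based scraper; the retry count and fallback wait are illustrative:

```python
import time

import requests

def get_with_backoff(session, url, max_retries=5):
    """Fetch a page, sleeping and retrying whenever the server rate-limits us."""
    for attempt in range(max_retries):
        response = session.get(url)
        if response.status_code != 429:
            return response
        # Respect Retry-After if present, otherwise back off for a long while.
        wait = int(response.headers.get("Retry-After", 3600))
        time.sleep(wait)
    raise RuntimeError(f"Still rate-limited after {max_retries} attempts: {url}")
```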

The other side of page loading is loading the content we are interested in. Not all of the content is available when we first view a page - we may have to wait for animations to complete or asynchronous content to load, or interact with the page by clicking buttons or scrolling to make other content appear.
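A minimal sketch of waiting for asynchronous content and triggering lazy loading, assuming a Selenium scraper; the div.post selector is a placeholder for whatever element holds the content of interest:

```python
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

def load_posts(driver, url):
    driver.get(url)
    # Wait up to 20 seconds for the post container to appear in the DOM.
    WebDriverWait(driver, 20).until(
        EC.presence_of_element_located((By.CSS_SELECTOR, "div.post"))   # placeholder selector
    )
    # Scroll to the bottom to trigger any lazy-loaded content.
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    return driver.find_elements(By.CSS_SELECTOR, "div.post")
```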

Data Gathering

There are lots of things which unintentionally make data gathering harder, but there are two defences which intentionally try to defeat scrapers. The first is obfuscation - the practice of replacing "sensitive" data with images, CSS sprites or other formats which cannot be easily scraped. If the scraper has access to OCR, this can be used to recover the data anyway. This defence is not commonly employed by site owners, as it causes usability issues for users with slow connections or screen readers, but some users of a site will use it themselves to stop content such as drug advertisements from being easily identified.
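As a sketch of the OCR route, assuming the Pillow and pytesseract libraries are available (the image URL is a placeholder):

```python
import io

import pytesseract
import requests
from PIL import Image

def ocr_image(session, image_url):
    """Download an image used to obfuscate text and run OCR over it."""
    raw = session.get(image_url).content
    image = Image.open(io.BytesIO(raw))
    return pytesseract.image_to_string(image).strip()
```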

The other main defence against data gathering is changing the layout of the page. This involves changing the underlying HTML structure, often while keeping the user interface visually the same. As most scrapers look for an exact path within the HTML to find the content they scrape, this is very effective at defeating existing scrapers. It does require a large amount of effort on the admin side, and so it is less commonly seen on smaller websites, but it is one of the most effective techniques we see in use against scrapers.
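To illustrate why, here is a sketch contrasting a brittle exact-path selector with a more robust one, assuming BeautifulSoup; the paths and class names are placeholders:

```python
from bs4 import BeautifulSoup

html = "..."   # page source fetched by the scraper
soup = BeautifulSoup(html, "html.parser")

# Brittle: breaks as soon as the admin wraps the post in one more <div>.
brittle = soup.select("body > div > div:nth-of-type(2) > table > tr > td > div.post")

# More robust: keys off semantic hints that tend to survive layout changes.
robust = soup.find_all("div", class_=lambda c: c and "post" in c)
```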

Discussion

So, what have we learned from all this? Well, the long and short of it is that many of the defences in use against scrapers are ineffective. They can either be bypassed with a small amount of manual effort, such as registering for a site, or they can be easily worked around, such as the group of defences bypassed entirely by using a browser-based scraper. Some of the defences are more effective, but the most impactful thing overall is the set of slowdowns faced during scraping. These come in several forms: we have to slow down to avoid looking like a bot and triggering rate limiting; we have to slow down further when scraping older content; we have to wait for content to load and for DDoS protection pages to complete; and so on. The many slowdowns can make it difficult to scrape all of the required content in a reasonable timeframe, although if the resources are available, multiple scrapers can be run in parallel to help.

We can identify some specific adversarial environments with different behaviours when scraping. The first of these is onion services: websites with the .onion top-level domain that can only be accessed over Tor. One of the expectations when running Tor, both as a security and privacy feature, is that Javascript is disabled in the browser. This prevents XSS attacks and lots of tracking features, but it also means that website owners have to build their sites without Javascript*, and so many of the defences against scraping either cannot work or have to be replaced with weaker versions. For example, Google's reCaptcha is the most common Captcha service, but because it makes use of Javascript** it is not found on darknet websites. Instead, older text Captchas are commonly used, and some sites use novel forms of Captcha which are generally easy to write a solver for.

* A common exception to this is to have a Javascript alert pop up to tell users to disable JS if they are seeing the alert.

** Additionally, this is because Google often gives much harder Captchas to Tor users, or outright refuses to let them solve one and errors instead.
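Returning to onion services: as a sketch, a scraper can reach .onion sites by routing its traffic through a local Tor client, assuming Tor's SOCKS proxy is listening on its default port 9050 and that requests was installed with SOCKS support (requests[socks]); the onion address is a placeholder:

```python
import requests

TOR_PROXIES = {
    "http": "socks5h://127.0.0.1:9050",    # socks5h: resolve hostnames through Tor
    "https": "socks5h://127.0.0.1:9050",
}

session = requests.Session()
session.proxies.update(TOR_PROXIES)
response = session.get("http://exampleonionaddress.onion/")   # placeholder address
```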

The other distinct environment is chat channels. The chat platforms themselves implement few, if any, technical defences against scrapers, and so it is much easier to create a scraper for them. On the other hand, each individual community on these platforms will have dedicated moderators and admins, and so scrapers are subject to a lot more human scrutiny and can expect intervention from humans, such as being banned from some of the communities. This is very noticeable when scraping a small community, where it is much harder to go unnoticed. Overall, one can expect to deal much more with the human element than the technical side when working with chat channels.
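As a sketch of how simple the technical side can be, here is one possible way (not necessarily the approach used in the paper) to collect messages from a Telegram channel using the Telethon library, with API credentials from my.telegram.org; all identifiers here are placeholders:

```python
from telethon.sync import TelegramClient

api_id = 12345                  # placeholder credentials
api_hash = "abcdef0123456789"
channel = "example_channel"     # placeholder channel name

# The first run prompts interactively for the phone number and login code.
with TelegramClient("scraper_session", api_id, api_hash) as client:
    for message in client.iter_messages(channel, limit=1000):
        # Store whatever fields the analysis needs; here just sender, date and text.
        print(message.sender_id, message.date, message.text)
```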

A final distinction to be made is based on the size of the sites. When working on hacker forums, we see that the larger forums are more likely to have their own dedicated moderators, and so scrapers may have to deal with humans more than they do on smaller sites. This is highlighted with banning: overall, we were only banned on three of the forums we scraped, and these were some of the much larger forums. This implies that larger sites put more focus on preventing scraping, whereas the smaller sites tend to implement fewer intentional defences against scraping. Indeed, many of the defences used by smaller sites tend to be focussed on stopping DDoS attacks, rather than scraping attempts.

This blog is but a summary of the points discussed in our paper; I would highly recommend reading the full paper to learn more about the defences in use, our countermeasures to them, and the ethical discussion around creating scrapers in adversarial environments. Thank you for reading!
