How Web Crawling Can Help Detect Clues in Big Data

5 February, 2019

Julien Nioche

In many ways, web crawling is the virtual equivalent of searching for a needle in a haystack. Between Google, Amazon, Microsoft and Facebook alone, for example, at least 1,200 petabytes of data are stored. This means locating specific data entails traversing a tremendous number of URLs – and achieving this quickly and efficiently is not an easy task.

When web crawlers are enlisted to assist law enforcement agencies (LEAs) in their search for perpetrators of online child sexual exploitation (CSE), the stakes are high and time is limited. Increasing the yield of leverageable data for LEAs could result in the prosecution of more sexual predators and, thereby, the safeguarding of more children. Conversely, failing to do so could mean that key leads are lost. Web crawlers operating in this space, therefore, must be smart in their approach in order to find the right data at the right time.

Beyond quantity

While crawling a greater volume of URLs will theoretically increase the chances of locating useful evidence, an optimal yield does not depend on volume alone. For example, if one were to crawl the whole web, volume statistics would artificially increase without necessarily homing in on any information of use.

This is particularly true when it comes to crawling images which could connect the covert and overt personas of sexual predators. By targeting specific sites in a vertical crawl, getting an optimal yield is far more likely.

Drilling into data

Even targeted crawling throws up its own challenges. Individual sites can, of course, be enormous in themselves, making the partitioning of data across several machines complex. In order to drill down into potentially useful images, then, you need nimble crawling software running on a cluster of machines.

StormCrawler – an open source library for developing low-latency, scalable web crawlers – is well suited to this use. And by coupling it with Elasticsearch – a multitenant-capable full-text search engine – data visibility is optimised. This allows LEAs to truly interrogate the results they have fetched.
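
To make the partitioning problem concrete, here is a minimal Python sketch of one common approach: assigning URLs to machines by hashing the hostname, so that all pages from one site are handled by the same worker. This is an illustration only, not StormCrawler's own code, and the names worker_for and partition are hypothetical.

```python
import hashlib
from collections import defaultdict
from urllib.parse import urlsplit


def worker_for(url: str, num_workers: int) -> int:
    """Map a URL to a worker index based on its hostname."""
    host = urlsplit(url).hostname or ""
    # Use a stable hash rather than Python's built-in hash(), which is
    # salted per process, so that every machine agrees on the assignment.
    digest = hashlib.sha1(host.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_workers


def partition(urls, num_workers: int):
    """Group URLs into one bucket per worker index."""
    buckets = defaultdict(list)
    for url in urls:
        buckets[worker_for(url, num_workers)].append(url)
    return dict(buckets)
```

Keeping a whole site on one worker makes it easier to control the request rate per server, but it also means a single very large site still lands on a single machine, which is part of what makes partitioning complex in practice.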

The implications of this are huge: by using CameraForensics’s crawling strategies, LEAs can ascertain the properties of photos on the open web and connect images with corresponding data on the dark web. This could provide the all-important link between the overt personas of predators and the covert ones they use to produce and distribute online CSE materials.

Online etiquette

Achieving optimal results through web crawling is not possible without a degree of nuance, however. Hitting a server with lots of different requests at the same time could lead to a crawler being blacklisted. This could hugely compromise a targeted approach and crucial evidence could fall through the cracks as a result.
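
One simple way to avoid hitting a server with many requests at once is to enforce a minimum delay between requests to the same host. The Python sketch below illustrates the idea; the class name HostThrottle and the one-second default are assumptions made for the example, not settings taken from any particular crawler.

```python
import time
from urllib.parse import urlsplit


class HostThrottle:
    """Enforce a minimum delay between requests to the same host."""

    def __init__(self, min_delay_seconds: float = 1.0, clock=time.monotonic):
        self.min_delay = min_delay_seconds
        self.clock = clock
        self.next_allowed = {}  # hostname -> earliest time of the next request

    def wait_time(self, url: str) -> float:
        """Return how many seconds to wait before fetching this URL."""
        host = urlsplit(url).hostname or ""
        now = self.clock()
        start = max(now, self.next_allowed.get(host, now))
        # Reserve this slot so the following request to the host is pushed back.
        self.next_allowed[host] = start + self.min_delay
        return start - now
```

A fetch loop could call time.sleep(throttle.wait_time(url)) before each request. Requests to different hosts are not delayed by one another, so overall throughput stays high while each individual server sees a steady, modest rate.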

“Politeness”, therefore, is an integral component of any web crawling strategy. It can be achieved by following robots.txt rules – essentially a set of directives which webmasters can use to tell crawlers how to behave on their site. Other mechanisms such as sitemaps allow websites to point the crawler to specific pages, which saves it from having to incrementally discover what it is looking for within a site.
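
As a rough illustration of both mechanisms, the Python sketch below reads a robots.txt file with the standard library's urllib.robotparser and extracts page URLs from a sitemap. The sample file contents, the example.com URLs and the user agent name example-crawler are all made up for the example, and the code needs Python 3.8 or later for site_maps().

```python
import xml.etree.ElementTree as ET
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt and sitemap contents, inlined so the sketch
# runs without any network access.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
Sitemap: https://example.com/sitemap.xml
"""

SAMPLE_SITEMAP = """\
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/gallery/1</loc></url>
  <url><loc>https://example.com/gallery/2</loc></url>
</urlset>
"""

SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
AGENT = "example-crawler"  # hypothetical user agent name


def load_rules(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text into a rule set that can be queried."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def sitemap_urls(xml_text: str) -> list:
    """Return the page URLs listed in a sitemap document."""
    root = ET.fromstring(xml_text)
    return [loc.text.strip()
            for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)
            if loc.text]


rules = load_rules(SAMPLE_ROBOTS_TXT)
allowed = rules.can_fetch(AGENT, "https://example.com/gallery/1")   # permitted path
blocked = rules.can_fetch(AGENT, "https://example.com/private/x")   # disallowed path
delay = rules.crawl_delay(AGENT)       # seconds requested between fetches
sitemaps = rules.site_maps()           # sitemap URLs declared in robots.txt
pages = sitemap_urls(SAMPLE_SITEMAP)   # pages the site asks crawlers to visit
```

In a real crawl, the robots.txt and sitemap would be fetched from the target site, and the crawl delay would feed into whatever rate limiting the crawler applies per host.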

As all of these mechanisms are built into StormCrawler to help it cope with volume and provide reliability, it is one of the most comprehensive options available for crawling in this space.

Unlocking global CSE networks through web crawling

In the field of CSE prosecution and prevention, web crawling is an increasingly popular technique. This is because, by finding the right image at the right time, LEAs have the power to unlock global child abuse networks. By doing so, they can bring more perpetrators of CSE to justice, helping to make the world a safer place for children.