StormCrawler demonstrates the power of open source

23 March, 2021

An interview with Julien Nioche

We sat down with our team member Julien Nioche, resident web crawling and big data expert and the developer of the open-source project ‘StormCrawler’. StormCrawler is a lightweight, scalable, and internationally used ‘collection of reusable components to build distributed web crawlers based on the streaming framework Apache Storm’.

Through StormCrawler, web data can be retrieved and processed for a multitude of use cases ranging from small to very big data. It is used by a global audience and forms the vital backbone of CameraForensics’ operations.

Below, we discuss how the project was formed and developed, the future of StormCrawler outside and within the CameraForensics platform, and the stunning power of open-source software.

Hi Julien, thanks for taking the time to talk to us. To begin, what inspired the StormCrawler project?

I have been using and contributing to various open-source projects since the early 2000s. In particular, I became involved in Apache Nutch, a scalable web crawler based on Apache Hadoop, one of the first software platforms for big data processing. I enjoyed customising and optimising it to my preferences, and would often do the same for my consulting clients – ensuring that their specific needs were being catered to.

Having been exposed to so many different uses of Apache Nutch, and knowing it very well, I was aware of its limitations and shortcomings. Around that time, scalable streaming platforms like Apache Storm were gaining momentum, and I decided to use Storm as a starting point for StormCrawler - hence its name!

The project was very well received, and I started seeing people using it in production pretty soon.

Not long after that, Matt Burns got in contact, wanting to find out more about StormCrawler after trying other web crawling solutions. Now, together with Elasticsearch, it enables CameraForensics’ advanced indexing, helping law enforcement agencies save time and resources.

Tell us more about the cutting-edge capabilities and use cases of StormCrawler.

Like other open source web crawlers, StormCrawler is freely available and its code can be not only seen but, more importantly, modified and customised by users.

What I wanted to achieve with StormCrawler was to have something not only scalable but also modern, nice to use, and lightweight. Users can have a web crawler capable of dealing with large quantities of data with just a handful of scripts and configuration files!

What are the benefits of building StormCrawler as open-source software?

One of the most exciting advantages of open source is that people can actively contribute and develop alongside us. I truly believe that StormCrawler is a direct product of invited collaboration and international participation. It’s a much larger project than one individual could create, and more contributions are happening all the time. Users can make changes based on their specific needs, and companies can directly modify it for bespoke use cases.

For CameraForensics, this has a wide range of benefits. Whenever we want additional capabilities or want to move in a different direction, we can use StormCrawler as a toolbox to get there. It’s also easy to change, because we have the capabilities in-house, which eliminates long periods of development, deliberation, and design time.

How does StormCrawler work with CameraForensics?

The crawler is integral to CameraForensics’ mission: using digital image forensics to help law enforcement solve complex crimes. Our daily operations revolve around crawling and analysing vast sets of digital imagery, with our database currently holding billions of images. As the database continues to expand and develop, it’s important to have a service like StormCrawler which is reliable, scalable, and multi-faceted. The crawler’s autonomous operation also works very much in our favour - allowing us to concentrate on what matters.

We also enjoy the fact that the crawler ‘acts politely’, which is to say that it isn’t too invasive. We could go further, fetching data through intrusion and forced access, but we feel that this goes against our core values and ethics.

Are you currently working on any new projects?

We are currently in the very early stages of a new and exciting project that aims to bring more of a focus onto how various crawlers interact with URLs.

All web crawlers have functions in common, such as the storage and access of the URLs they have already visited, ones that are left to visit, politeness settings, and so on. This is known as a URL frontier. I started a project recently under that name. It is funded through the NGI0 Discovery Fund and is aimed at defining an API that can be used across different crawlers regardless of their programming language, as well as providing a reference implementation.

Just like StormCrawler, I made this software open-source too and gave it a home at crawler-commons, another open-source effort I am a member of. It is, as usual, about not reinventing the wheel and getting better results through shared effort. So far, we’ve had excellent feedback, and when it’s released, it will change the crawling landscape quite a bit.
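To make the idea of a URL frontier more concrete, here is a minimal sketch in Python. It is purely illustrative and is not the API of the URL Frontier project or of StormCrawler: the class name `ToyFrontier`, its methods, and the `politeness_delay` parameter are all assumptions made up for this example. It only shows the three responsibilities mentioned above: remembering which URLs have been seen, holding the ones still to visit, and spacing out requests to the same host.

```python
import heapq
from urllib.parse import urlparse


class ToyFrontier:
    """Hypothetical in-memory frontier: seen URLs, pending URLs, per-host delay."""

    def __init__(self, politeness_delay=1.0):
        self.politeness_delay = politeness_delay  # seconds between fetches per host
        self.seen = set()        # every URL ever added
        self.queue = []          # heap of (earliest_fetch_time, sequence, url)
        self.next_allowed = {}   # host -> earliest time its next fetch may start
        self._seq = 0            # tie-breaker that preserves insertion order

    def add(self, url, now=0.0):
        """Queue a URL unless it is already known. Returns True if it was added."""
        if url in self.seen:
            return False
        self.seen.add(url)
        host = urlparse(url).netloc
        # Schedule the URL no earlier than the host's next free slot.
        due = max(now, self.next_allowed.get(host, now))
        self.next_allowed[host] = due + self.politeness_delay
        heapq.heappush(self.queue, (due, self._seq, url))
        self._seq += 1
        return True

    def next_url(self, now):
        """Return the next URL that is due at time `now`, or None if none is ready."""
        if self.queue and self.queue[0][0] <= now:
            return heapq.heappop(self.queue)[2]
        return None


if __name__ == "__main__":
    frontier = ToyFrontier(politeness_delay=2.0)
    frontier.add("https://example.com/a")
    frontier.add("https://example.com/b")  # same host, so it is scheduled later
    frontier.add("https://example.org/x")
    print(frontier.next_url(now=0.0))
    print(frontier.next_url(now=0.0))
    print(frontier.next_url(now=0.0))  # second example.com URL is not due yet
```

A real frontier would also need persistence, re-fetch scheduling, and a network-facing API so that crawlers written in different languages can share it; this sketch leaves all of that out.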

Are there any plans for future StormCrawler development?

The future of StormCrawler truly depends on how users interact with it, as well as what they demand. A new version will be released very soon, adding functionality based on the feedback we have received from our users.

Our longer-term goal for StormCrawler is to switch it to Apache Storm v2 - thus keeping in sync with the latest development of the underlying platform.

Try StormCrawler for yourself!

StormCrawler is free and open source. Anyone can use it, and we encourage you to pick it up and try it out. Open-source software continues to develop at an exciting rate, and we look forward to the future innovations and services that will be born out of collaboration, feedback, and exploration.

For more information, or to provide any feedback you may have, get started here. You can also get in touch with us at CameraForensics if you have any questions about our other tech-focused projects.
