Data privacy at CameraForensics: our responsibility and ethics
By Alan Gould
Horizontal scaling capabilities enable us to provide our users with the critical intelligence that they need to further their research and uncover valuable data.
This is fundamentally important to the work we do, and it demands serious consideration to prevent an issue in our user interface. With so much data and so many processes needing equal attention, understanding the best way of presenting data back seamlessly and efficiently is essential. Getting it wrong presents a real risk to the advancement of investigations.
In our blog below, we’re exploring what horizontal scaling is and why we do it here at CameraForensics, as well as how we navigate the interface issue and what it means for our users.
The more research that we do, and the more sites our crawlers index, the more our datasets – and workload – grow.
Trying to process this increasing load of data without modifying the underlying infrastructure would lead to big slowdowns. Also, for our ongoing R&D projects where we need to experiment and test new ideas on demand, this can quickly become counterproductive.
That’s where horizontal scaling capabilities come in. They enable us to scale our infrastructure to handle larger datasets for this important work.
Larger and more challenging datasets can be handled through either horizontal or vertical scaling. So what’s the difference?
When we discuss vertical scaling, we simply mean introducing more power and speed to current machines, such as upgrading CPUs or adding more RAM. This allows us to take on the additional requirements of larger datasets.
On the other hand, horizontal scaling spreads the workload across multiple machines of the same size. The machines don’t get any bigger, but they can deal with the immense workload without entering a resource deficit.
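As a rough illustration of spreading a workload across same-sized machines, the sketch below assigns each item to one of several workers by hashing its identifier. The function names and the use of hash-based partitioning are assumptions made for this example; they are not a description of how CameraForensics distributes its own workload.

```python
import hashlib


def assign_worker(item_id: str, worker_count: int) -> int:
    """Map an item to a worker index using a stable hash of its identifier."""
    digest = hashlib.sha256(item_id.encode("utf-8")).digest()
    # Use the first 8 bytes of the digest as an integer, then wrap it
    # into the range of available workers.
    return int.from_bytes(digest[:8], "big") % worker_count


def partition(item_ids, worker_count):
    """Split a list of item identifiers into one bucket per worker."""
    buckets = [[] for _ in range(worker_count)]
    for item_id in item_ids:
        buckets[assign_worker(item_id, worker_count)].append(item_id)
    return buckets


# Hypothetical workload: 1,000 image identifiers spread over 4 machines.
items = [f"image-{n}" for n in range(1000)]
for index, bucket in enumerate(partition(items, 4)):
    print(f"worker {index}: {len(bucket)} items")
```

Adding capacity here means raising the worker count rather than making any single machine bigger, which is the essence of scaling horizontally.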
There are also different types of horizontal scaling, such as fixed and auto-scaling. Fixed horizontal scaling is done manually, while auto-scaling will scale out independently and on demand by monitoring thresholds of resource use. When a high threshold has been breached, more resources are added, and when resource use falls below a low threshold, resources are removed.
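The threshold behaviour of auto-scaling can be sketched in a few lines. Everything in this example is an assumption for illustration: the threshold values, the machine limits, and the one-machine-at-a-time step are hypothetical, not our production settings.

```python
from dataclasses import dataclass


@dataclass
class ScalingPolicy:
    # All values are illustrative assumptions, not real settings.
    high_threshold: float = 0.80  # scale out above this average utilisation
    low_threshold: float = 0.30   # scale in below this average utilisation
    min_machines: int = 2
    max_machines: int = 20


def desired_machine_count(current: int, utilisation: float, policy: ScalingPolicy) -> int:
    """Return how many machines should run after one monitoring check."""
    if utilisation > policy.high_threshold:
        target = current + 1  # high threshold breached: add a machine
    elif utilisation < policy.low_threshold:
        target = current - 1  # below the low threshold: remove a machine
    else:
        target = current      # within bounds: leave the fleet unchanged
    # Never go below the minimum or above the maximum fleet size.
    return max(policy.min_machines, min(policy.max_machines, target))


policy = ScalingPolicy()
print(desired_machine_count(4, 0.92, policy))  # 5
print(desired_machine_count(4, 0.10, policy))  # 3
print(desired_machine_count(2, 0.10, policy))  # 2
```

Fixed horizontal scaling would correspond to a person choosing the machine count directly instead of running a check like this on a schedule.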
Core challenges come with scaling any system, like navigating the issue of updating a system with zero downtime. There also isn’t one formula for scaling that works for all datasets. For example, our sightings index uses one kind of data, while our PhotoDNA index uses data in a different way and requires data science techniques to scale.
And the list goes on!
Interestingly, one of the most significant issues that we had to address with scaling wasn’t to do with infrastructure, but with user experience.
Our users need to access critical intelligence as quickly as possible. However, when a simple search can return thousands of results from billions of datasets, getting to the important information quickly is a tricky UX problem to solve!
The question then was this: how can we streamline the experience so users only receive the most relevant and high-quality results? And if they did get back a large dataset, how could we help them find the needle in that haystack without becoming overwhelmed?
This is vital, as it ensures that victims can be safeguarded as soon as possible. So, navigating this challenge when designing the user interface was essential.
Learn more about our mission in our blog here: The importance of data-driven intelligence in driving positive change.
Back when our user interface focused on searching for individual pieces of data, a user came to us with a question: how can I take my entire case load, or multiple case loads, and compare it with your data?
Finding a solution was our top priority. But where to start?
First, we started by understanding the priorities of result insights – tracing an investigator’s usual search behaviour and flagging the results that they would be most interested in, marking some results as potentially more relevant than others.
For example, it’s not interesting to know that we have several million matches for Canon 7Ds. But it is interesting to know if we have matches for Canon 7Ds with specific serial numbers.
It’s not necessarily interesting to know that we have matches for the Picasa software tag, which is very common. But it would be interesting to know that a software tag is not commonly found and that we have only a small number of matches for it, because that indicates a needle found in the haystack.
Once we understood what made results more interesting than others, we could apply a scoring mechanism to give each result a percentage interest score.
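To make the idea of a percentage interest score concrete, here is a minimal sketch that scores a result purely on how rare its matches are. The logarithmic formula, the function name, and the example counts are all assumptions made for illustration; the actual scoring mechanism is not described here and may weigh other factors.

```python
import math


def interest_score(match_count: int, index_size: int) -> float:
    """Return a 0-100 score: rare matches score high, common ones score low."""
    if match_count <= 0 or index_size <= 1:
        return 0.0  # nothing matched, or no meaningful index to compare against
    # Rarity falls from 1.0 (a single match) to 0.0 (matches everything)
    # on a logarithmic scale, so millions of matches score low.
    rarity = 1.0 - math.log(match_count) / math.log(index_size)
    return round(100.0 * max(0.0, min(1.0, rarity)), 1)


INDEX_SIZE = 1_000_000_000  # hypothetical number of indexed records

# A very common software tag versus a rarely seen one (made-up counts).
common = interest_score(5_000_000, INDEX_SIZE)
rare = interest_score(3, INDEX_SIZE)
print(f"common tag: {common}%, rare tag: {rare}%")
```

Under these assumptions a handful of matches for an unusual tag ranks far above millions of matches for a common one, which mirrors the needle-in-the-haystack reasoning.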
Even after you’ve marked some data as potentially more interesting than others, you can’t make the ultimate decision on that – it’s down to the investigator to decide. As a result, the next challenge was how to present this mass of results in a manageable, overwhelm-free way.
To start, we broke down the mass of data into four main categories:
Then we presented a high-level view of that data, allowing the user to drill into what they found most interesting.
For example, the matching devices section might say at the highest level that out of the X number of devices identified in their data, we found matches for Y of them. Then, the user could drill into that to see which devices had matches, and what those matches were.
Once this is done, they could take a step further and be reminded of which of their images were associated with these devices. They could then examine the matches at the image-by-image level to find out what other pieces of data associated with those images also matched our data.
We finally resolved this challenge after multiple concepts and iterations of what is now our results page. As we move forward, we want to continue to improve user experience so our users can easily get the most out of their tools.
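The drill-down described above can be pictured as a nested summary: a headline count at the top, with per-device detail underneath. The sketch below assumes a simple shape for the data (device identifiers mapped to case images, and a list of index matches); these structures and names are hypothetical and chosen only to illustrate the idea.

```python
from collections import defaultdict


def summarise_device_matches(case_devices, matches):
    """Build a high-level summary with per-device detail underneath.

    case_devices: dict of device_id -> list of case image names (assumed shape)
    matches: list of (device_id, matched_record) pairs from the index (assumed shape)
    """
    by_device = defaultdict(list)
    for device_id, record in matches:
        if device_id in case_devices:  # ignore matches for unknown devices
            by_device[device_id].append(record)
    return {
        "devices_identified": len(case_devices),  # the "X" in the headline
        "devices_matched": len(by_device),        # the "Y" in the headline
        "details": {
            device_id: {
                "case_images": case_devices[device_id],
                "matches": records,
            }
            for device_id, records in by_device.items()
        },
    }


# Made-up case data for illustration.
case = {"serial-001": ["a.jpg", "b.jpg"], "serial-002": ["c.jpg"]}
found = [("serial-001", "record-17"), ("serial-001", "record-42")]

summary = summarise_device_matches(case, found)
print(f"Matches for {summary['devices_matched']} of "
      f"{summary['devices_identified']} devices")
```

The top-level counts give the overview, while the "details" entry holds what a user would see after drilling into a particular device.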
Learn more: Why are we committed to research and development.
When we turn to the future of our scaling capabilities, we hope to see innovation in two core areas: scaling processes and presentation.
The first development is to expand how users interact with search functions. By integrating natural language processing (NLP) features, users could freely explore our image database using authentic and natural human language. This may include phrases such as, “show me images at this postcode”, or “present all images that match this serial number.”
This would radically change the way that image forensics are performed and encourage a more experimental approach to research.
We also hope to see continuous development in faster and more accurate processing capabilities to reduce the friction of searching even further. This would give investigators the quickest research processes possible, bringing them closer to critical intelligence at any time.
At CameraForensics, we’re committed to helping provide advanced and relevant insights for our users, helping to drive positive change on a global level.
Horizontal scaling is a powerful tool in our arsenal, helping us to develop new processes, and analyse more data, without sacrificing efficiency. For our users, this means insights that are reliable, fast, and actionable.
To learn more about our ongoing R&D capabilities and our other areas of expertise, visit our Research & Development section.