Hash-Based File Content Identification Using Distributed Systems


Handling very large amounts of data is a serious problem in digital forensics. Since forensic investigators often have to analyze several terabytes of data within a single case, efficient and effective tools for automatic data identification and filtering are very important. A commonly used identification technique is to compute the cryptographic hash of a file and match it against whitelists and blacklists containing hashes of files with harmless or harmful/illegal content. However, such lists are never complete and miss the hashes of most existing files. Also, cryptographic hashes are easily defeated, e.g., when used to identify multimedia content. In this work we analyze different distributed systems available on the Internet regarding their suitability to support the identification of file content. We present a framework that supports automatic file content identification by searching these systems for file hashes and by collecting, aggregating, and presenting the search results. In our evaluation we were able to identify the content of about 26% of the files in a test set by using the file names found, which briefly describe the file content. Therefore, our framework can help to significantly reduce the workload of forensic investigators.
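The hash-list matching described above can be illustrated with a minimal sketch. It is not taken from the framework itself: the choice of SHA-256, the function names `sha256_of_file` and `classify`, and the in-memory sets standing in for white and black lists are all assumptions made for illustration.

```python
import hashlib


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, read in chunks.

    SHA-256 is an assumption; the abstract does not name a hash function.
    """
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large evidence files never have to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def classify(path, whitelist, blacklist):
    """Classify a file by looking up its hash in two hypothetical hash sets."""
    file_hash = sha256_of_file(path)
    if file_hash in blacklist:
        return "blacklisted"
    if file_hash in whitelist:
        return "whitelisted"
    # Neither list knows the file; this is the case the abstract targets.
    return "unknown"
```

Because a cryptographic hash changes completely when a single byte of the input changes, a re-encoded or slightly edited copy of a listed file falls into the "unknown" branch, which is the weakness noted above for multimedia content.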


This paper was published in TUbiblio.
