8 research outputs found

    An Efficient Similarity Digests Database Lookup – A Logarithmic Divide & Conquer Approach

    Get PDF
    Investigating seized devices in digital forensics is a challenging task due to the ever-increasing amount of data. Common procedures utilize automated file identification, which reduces the amount of data an investigator has to examine manually. In recent years the research field of approximate matching has emerged to detect similar data. However, if n denotes the number of similarity digests in a database, then the lookup for a single similarity digest has complexity O(n). This paper presents a concept to extend existing approximate matching algorithms that reduces the lookup complexity from O(n) to O(log(n)). Our proposed approach is based on the well-known divide-and-conquer paradigm and builds a Bloom filter-based tree data structure to enable an efficient lookup of similarity digests. Further, it is demonstrated that the presented technique is highly scalable, operating a trade-off between storage requirements and computational efficiency. We perform a theoretical assessment based on recently published results and reasonable magnitudes of input data, and show that the complexity reduction achieved by the proposed technique yields a 2^20-fold acceleration of lookup costs.
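
    The tree lookup is straightforward to sketch. Below is a minimal, illustrative Python version, assuming each similarity digest is reduced to a set of integer features that can be inserted into a Bloom filter; the class names, filter size (2^16 bits), number of hash functions, and the 0.5 match threshold are assumptions made for illustration, not parameters taken from the paper.

```python
import hashlib

class BloomFilter:
    """Tiny Bloom filter over integer features (sizes are illustrative)."""
    def __init__(self, m_bits=2**16, k=4):
        self.m, self.k, self.bits = m_bits, k, 0

    def _positions(self, feature):
        h = hashlib.sha256(str(feature).encode()).digest()
        return [int.from_bytes(h[4 * i:4 * i + 4], "big") % self.m
                for i in range(self.k)]

    def add(self, feature):
        for p in self._positions(feature):
            self.bits |= 1 << p

    def might_contain(self, feature):
        return all(self.bits >> p & 1 for p in self._positions(feature))

class Node:
    """Tree node whose filter summarises every digest in its subtree."""
    def __init__(self, digests):          # digests: list of (name, feature_set)
        self.filter = BloomFilter()
        for _, features in digests:
            for f in features:
                self.filter.add(f)
        if len(digests) == 1:             # leaf: a single similarity digest
            self.name, self.left, self.right = digests[0][0], None, None
        else:                             # inner node: split the set in half
            mid = len(digests) // 2
            self.name = None
            self.left, self.right = Node(digests[:mid]), Node(digests[mid:])

def lookup(node, features, threshold=0.5):
    """Descend only into subtrees whose filter matches enough features."""
    hits = sum(node.filter.might_contain(f) for f in features)
    if hits < threshold * max(1, len(features)):
        return []                         # prune this whole subtree
    if node.name is not None:
        return [node.name]
    return (lookup(node.left, features, threshold)
            + lookup(node.right, features, threshold))

db = [("fileA", {1, 2, 3, 4}), ("fileB", {3, 4, 5, 6}), ("fileC", {7, 8, 9})]
print(lookup(Node(db), {3, 4, 5}))  # ['fileA', 'fileB'], barring Bloom false positives
```

    When only a few subtrees match, a query visits O(log n) nodes instead of scanning all n digests, at the cost of storing one Bloom filter per inner node, which is the storage/computation trade-off the abstract describes.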

    Professor Frank Breitinger's Full Bibliography

    Get PDF

    Bytewise Approximate Matching: The Good, The Bad, and The Unknown

    Get PDF
    Hash functions are established and well known in digital forensics, where they are commonly used for proving integrity and for file identification (i.e., hashing all files on a seized device and comparing the fingerprints against a reference database). However, with respect to the latter operation, an active adversary can easily defeat this approach because traditional hashes are designed to be sensitive to any alteration of the input; the output changes significantly if a single bit is flipped. Therefore, researchers developed approximate matching, a newer and less prominent area conceived as a more robust counterpart to traditional hashing. Since its conception, the community has constructed numerous algorithms, extensions, and additional applications for this technology, and is still working on novel concepts to improve the status quo. In this survey article, we conduct a high-level review of the existing literature from a non-technical perspective and summarize the existing body of knowledge in approximate matching, with a special focus on bytewise algorithms. Our contribution allows researchers and practitioners to gain an overview of the state of the art of approximate matching so that they may understand the capabilities and challenges of the field. In short, we present the terminology, use cases, classification, requirements, testing methods, algorithms, applications, and a list of primary and secondary literature.
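
    The avalanche behaviour the abstract refers to is easy to reproduce. This short Python snippet (an illustration written for this summary, not code from the article) flips one input bit and shows that the resulting SHA-256 digests are incomparable, which is exactly the gap that bytewise approximate matching algorithms such as ssdeep, sdhash, and TLSH were designed to fill.

```python
import hashlib

data = bytearray(b"evidence file contents")
h1 = hashlib.sha256(data).hexdigest()

data[0] ^= 0x01  # flip a single bit of the first byte
h2 = hashlib.sha256(data).hexdigest()

print(h1)
print(h2)
print(h1 == h2)  # False: one flipped bit yields a completely different digest
```

    An approximate matching tool would instead report a high similarity score for these two nearly identical inputs.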

    A Study on Approximate Matching for Similarity Search: Techniques, Limitations, and Improvements for Digital Forensic Investigations

    Get PDF
    Advisor: Marco Aurélio Amaral Henriques. Doctoral thesis (doutorado), Universidade Estadual de Campinas, Faculdade de Engenharia Elétrica e de Computação. Digital forensics is a branch of Computer Science aiming at investigating and analyzing electronic devices in the search for crime evidence. With the rapid increase in data storage capacity, automated procedures are required to handle the massive volume of data available nowadays, especially in forensic investigations, in which time is a scarce resource.
One possible approach to make the process more efficient is the Known File Filter (KFF) technique, where a list of objects of interest is used to reduce/separate data for analysis. Holding a database of hashes of such objects, the examiner performs lookups for matches against the target device under investigation. However, due to limitations of cryptographic hash functions (their inability to detect similar objects), new methods have been designed based on Approximate Matching (AM). These functions are suitable candidates for this process because of their ability to identify similarity (at the bytewise level) in a very efficient way, by creating and comparing compact representations of objects (a.k.a. digests). In this work, we present approximate matching functions. We describe some of the best-known AM tools and present Similarity Digest Search Strategies (SDSS), capable of performing the similarity search (using AM) more efficiently, especially when dealing with large data sets. We perform a detailed analysis of current SDSS approaches and, given that current strategies only work for a few particular AM tools, we propose a new strategy based on a different tool that has good characteristics for forensic investigations. Furthermore, we address some limitations of current AM tools regarding the similarity detection process, where many matches flagged as similar are in fact false positives; the tools are usually misled by common blocks (pieces of data shared by many different objects). By removing such blocks from AM digests, we obtain significant improvements in the detection of similar data. We also present a detailed theoretical analysis of the detection capabilities of the sdhash AM tool and provide improvements to its comparison function, where our improved version produces a more precise similarity measure (score). Lastly, new applications of AM are presented and analyzed: one for fast file identification based on data samples and another for efficient fingerprint identification. We hope that practitioners in the forensics field and other related areas will benefit from our studies on AM when solving their problems. Doctorate in Computer Engineering (Doutor em Engenharia Elétrica); funding code 23038.007604/2014-69, CAPES.
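
    The common-block effect described above can be sketched in a few lines. The following Python fragment is a simplified illustration under assumed parameters (fixed 64-byte blocks, a Jaccard-style score, and a max_share cutoff); real AM tools such as sdhash derive features with content-defined chunking and use their own comparison functions.

```python
import hashlib
from collections import Counter

def block_hashes(data, block=64):
    """Feature set from fixed-size blocks (real tools chunk by content)."""
    return {hashlib.sha1(data[i:i + block]).hexdigest()
            for i in range(0, len(data), block)}

def common_blocks(corpus, max_share=0.5):
    """Blocks occurring in more than max_share of all objects are 'common'
    and therefore carry no identifying signal."""
    counts = Counter(h for data in corpus for h in block_hashes(data))
    return {h for h, c in counts.items() if c > max_share * len(corpus)}

def digest(data, noise):
    """A digest as the object's block set minus corpus-wide common blocks."""
    return block_hashes(data) - noise

def similarity(d1, d2):
    """Jaccard-style score in [0, 1]; AM tools define their own measures."""
    return len(d1 & d2) / max(1, len(d1 | d2))
```

    Dropping the shared features before comparison is what removes the false positives the abstract mentions: two unrelated files that happen to embed the same boilerplate no longer score as similar.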

    Automated Digital Forensic Triage: Rapid Detection of Anti-Forensic Tools

    Get PDF
    We live in the information age. Our world is interconnected by digital devices and electronic communication. As such, criminals are finding opportunities to exploit our information-rich electronic data. In 2014, the estimated annual cost of computer-related crime was more than 800 billion dollars. Examples include the theft of intellectual property, electronic fraud, identity theft and the distribution of illicit material. Digital forensics grew out of necessity to combat computer crime and involves the investigation and analysis of electronic data after a suspected criminal act. Challenges in digital forensics exist due to constant changes in technology. Investigation challenges include exponential growth in the number of cases and the size of targets; for example, forensic practitioners must analyse multi-terabyte cases comprised of numerous digital devices. A variety of applied challenges also exist, due to continual technological advancements; for example, anti-forensic tools, including the malicious use of encryption or data wiping tools, hinder digital investigations by hiding or removing the availability of evidence. In response, the objective of the research reported here was to automate the effective and efficient detection of anti-forensic tools. A design science research methodology was selected as it provides an applied research method to design, implement and evaluate an innovative Information Technology (IT) artifact to solve a specified problem. The research objective required that a system be designed and implemented to perform automated detection of digital artifacts (e.g., data files and Windows Registry entries) on a target data set. The goal of the system is to automatically determine if an anti-forensic tool is present, or absent, in order to prioritise additional in-depth investigation. The system performs rapid forensic triage, suitable for execution against multiple investigation targets, providing an analyst with high-level information regarding potential malicious anti-forensic tool usage. The system is divided into two main stages: 1) Design and implementation of a solution to automate creation of an application profile (application software reference set) of known unique digital artifacts; and 2) Digital artifact matching between the created reference set and a target data set. Two tools were designed and implemented: 1) A live differential analysis tool, named LiveDiff, to reverse engineer application software with a specific emphasis on digital forensic requirements; 2) A digital artifact matching framework, named Vestigium, to correlate digital artifact metadata and detect anti-forensic tool presence. In addition, a forensic data abstraction, named Application Profile XML (APXML), was designed to store and distribute digital artifact metadata. An associated Application Programming Interface (API), named apxml.py, was authored to provide automated processing of APXML documents. Together, the tools provided an automated triage system to detect anti-forensic tool presence on an investigation target. A two-phase approach was employed in order to assess the research products. The first phase of experimental testing involved demonstration in a controlled laboratory environment. First, the LiveDiff tool was used to create application profiles for three anti-forensic tools. The automated data collection and comparison procedure was more effective and efficient than previous approaches.
Two data reduction techniques were tested to remove irrelevant operating system noise: application profile intersection and dynamic blacklisting were found to be effective in this regard. Second, the profiles were used as input to Vestigium and automated digital artifact matching was performed against authored known data sets. The results established the desired system functionality and demonstration then led to refinements of the system, as per the cyclical nature of design science. The second phase of experimental testing involved evaluation using two additional data sets to establish effectiveness and efficiency in a real-world investigation scenario. First, a public data set was subjected to testing to provide research reproducibility, as well as to evaluate system effectiveness in a variety of complex detection scenarios. Results showed the ability to detect anti-forensic tools using a different version than that included in the application profile and on a different Windows operating system version. Both are scenarios where traditional hash set analysis fails. Furthermore, Vestigium was able to detect residual and deleted information, even after a tool had been uninstalled by the user. The efficiency of the system was determined and refinements made, resulting in an implementation that can meet forensic triage requirements. Second, a real-world data set was constructed using a collection of second-hand hard drives. The goal was to test the system using unpredictable and diverse data to provide more robust findings in an uncontrolled environment. The system detected one anti-forensic tool on the data set and processed all input data successfully without error, further validating system design and implementation. The key outcome of this research is the design and implementation of an automated system to detect anti-forensic tool presence on a target data set. Evaluation suggested the solution was both effective and efficient, adhering to forensic triage requirements. Furthermore, techniques not previously utilised in forensic analysis were designed and applied throughout the research: dynamic blacklisting and profile intersection removed irrelevant operating system noise from application profiles; metadata matching methods resulted in efficient digital artifact detection and path normalisation aided full path correlation in complex matching scenarios. The system was subjected to rigorous experimental testing on three data sets that comprised more than 10 terabytes of data. The ultimate outcome is a practically implemented solution that has been executed on hundreds of forensic disk images, thousands of Windows Registry hives, more than 10 million data files, and approximately 50 million Registry entries. The research has resulted in the design of a scalable triage system implemented as a set of computer forensic tools
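
    A minimal sketch of the two noise-reduction ideas as described in the abstract: assuming profiles are sets of artifact identifiers (e.g., a normalised path plus a hash), intersection across repeated profiling runs keeps only artifacts the tool reliably produces, and a dynamic blacklist built from idle baseline runs removes background OS activity. All function names and the triage threshold below are illustrative assumptions, not LiveDiff or Vestigium APIs.

```python
def profile_intersection(runs):
    """Keep only artifacts observed in every profiling run of the same
    tool; artifacts unique to one run are likely background OS activity."""
    profile = set(runs[0])
    for run in runs[1:]:
        profile &= set(run)
    return profile

def apply_blacklist(profile, baseline_runs):
    """Dynamic blacklist: anything an idle baseline system also produces
    is OS noise, not evidence of the profiled tool."""
    noise = set().union(*map(set, baseline_runs)) if baseline_runs else set()
    return profile - noise

def triage(target_artifacts, app_profile, threshold=0.3):
    """Flag a target when enough of the profile's artifacts are present."""
    hits = app_profile & set(target_artifacts)
    return len(hits) / max(1, len(app_profile)) >= threshold, hits
```

    Because matching is done on artifact metadata rather than exact file hashes, this style of triage can still fire on residual artifacts after a tool has been uninstalled, which is the scenario the evaluation highlights.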