
    An Automated Approach for Digital Forensic Analysis of Heterogeneous Big Data

    The major challenges with big data examination and analysis are volume, complex interdependence across content, and heterogeneity. The examination and analysis phases are considered essential to a digital forensics process. However, traditional forensic investigation techniques use one or more forensic tools to examine and analyse each resource individually. In addition, when multiple resources are included in one case, findings cannot be cross-correlated across them, which often leads to inefficiencies in processing and identifying evidence. Furthermore, most current forensic tools cannot cope with large volumes of data. This paper develops a novel framework for digital forensic analysis of heterogeneous big data. The framework focuses on three core issues: data volume, heterogeneous data, and the investigator's cognitive load in understanding the relationships between artefacts. The proposed approach uses metadata to address the data volume problem, semantic web ontologies to handle heterogeneous data sources, and artificial intelligence models to support the automated identification and correlation of artefacts, reducing the burden placed upon the investigator to understand the nature and relationships of the artefacts.
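    A minimal Python sketch of the metadata-driven triage idea described above (not the authors' framework): user-owned artefacts modified inside an assumed incident window are kept, so far fewer files need deep examination. The record fields "path", "owner" and "mtime" are illustrative assumptions.

        from datetime import datetime

        def triage(records, cutoff):
            """Keep user-owned artefacts modified on or after the cutoff date."""
            return [r for r in records
                    if r["owner"] != "system" and r["mtime"] >= cutoff]

        evidence = [
            {"path": "/Users/alice/notes.docx", "owner": "alice",
             "mtime": datetime(2024, 5, 2)},
            {"path": "/Windows/System32/cfg.dll", "owner": "system",
             "mtime": datetime(2020, 1, 1)},
        ]
        # Only the recently modified user document survives the filter.
        print(triage(evidence, cutoff=datetime(2024, 4, 15)))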

    Effects of the Factory Reset on Mobile Devices

    Mobile devices usually provide a “factory-reset” tool to erase user-specific data from the main secondary storage. In the first systematic evaluation of the effectiveness of factory resets, nine Apple iPhones, ten Android devices, and two BlackBerry devices were tested. Tests used the Cellebrite UME-36 Pro with the UFED Physical Analyzer, the Bulk Extractor open-source tool, and our own programs for extracting metadata, classifying file paths, and comparing them between images. Two phones were subjected to more detailed analysis. Results showed that many kinds of data were removed by the resets, but much user-specific configuration data was left behind. Android devices did poorly at removing user documents and media, and surprising user data was occasionally left on all devices, including photo images, audio, documents, phone numbers, email addresses, geolocation data, configuration data, and keys. A conclusion is that reset devices can still provide some useful information to a forensic investigation.
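    The pre/post-reset comparison can be pictured with a toy Python sketch (the study's own tooling is not reproduced here, and the file lists are invented): paths present in both the before and after images indicate data the reset failed to erase.

        # File paths extracted from a device image before and after a factory reset.
        pre_reset = {"/data/user/0/com.app/msg.db", "/DCIM/IMG_0001.JPG",
                     "/system/build.prop"}
        post_reset = {"/DCIM/IMG_0001.JPG", "/system/build.prop"}

        residual = pre_reset & post_reset           # paths that survived the reset
        user_residual = {p for p in residual        # crude user-data heuristic
                         if not p.startswith("/system/")}
        print(user_residual)                        # {'/DCIM/IMG_0001.JPG'}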

    Automating the harmonisation of heterogeneous data in digital forensics

    Digital forensics has become an increasingly important tool in the fight against cyber and computer-assisted crime. However, with an increasing range of technologies at people’s disposal, investigators find themselves having to process and analyse many systems (e.g. PC, laptop, tablet, Smartphone) in a single case. Unfortunately, current tools operate in an isolated manner, investigating systems and applications on an individual basis. The heterogeneity of the evidence places time constraints and additional cognitive load upon the investigator. Examples of heterogeneity include applications such as messaging (e.g. iMessenger, Viber, Snapchat and Whatsapp), web browsers (e.g. Firefox and Chrome) and file systems (e.g. NTFS, FAT, and HFS). Being able to analyse and investigate evidence from across devices and applications based upon categories would enable investigators to query all data at once. This paper proposes a novel algorithm for merging datasets through a ‘characterisation and harmonisation’ process. The characterisation process analyses the nature of the metadata and the harmonisation process merges the data. A series of experiments using real-life forensic datasets is conducted to evaluate the algorithm across five different categories of datasets (i.e. messaging, graphical files, file system, Internet history, and emails), each containing data from different applications across different devices (a total of 22 disparate datasets). The results showed that the algorithm is able to merge all fields successfully, with the exception of some binary-based data found within the messaging datasets (contained within Viber and SMS). The error occurred due to a lack of information for the characterisation process to make a useful determination. However, upon further analysis it was found that the error had a minimal impact on the subsequent merged data.
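    A hedged sketch of the ‘characterisation and harmonisation’ idea: each source dataset is characterised by mapping its application-specific column names onto a shared schema, after which the rows can be merged and queried together. The field mappings below are invented for illustration, not taken from the paper.

        # Map application-specific metadata fields onto a common schema.
        SCHEMA_MAP = {
            "viber": {"msg_text": "body", "msg_time": "timestamp", "contact": "party"},
            "sms":   {"content": "body",  "date": "timestamp",     "address": "party"},
        }

        def harmonise(source, rows):
            mapping = SCHEMA_MAP[source]
            return [{mapping[k]: v for k, v in row.items() if k in mapping}
                    for row in rows]

        merged = (harmonise("viber", [{"msg_text": "hi", "msg_time": 1, "contact": "A"}])
                  + harmonise("sms", [{"content": "yo", "date": 2, "address": "B"}]))
        # Both messages now share the fields body/timestamp/party
        # and can be queried at once.
        print(merged)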

    Evidence identification in heterogeneous data using clustering

    Digital forensics faces several challenges in examining and analyzing data due to the increasing range of technologies at people's disposal. Investigators find themselves having to process and analyze many systems manually (e.g. PC, laptop, Smartphone) in a single case. Unfortunately, current tools such as FTK and EnCase have a limited ability to automate the finding of evidence. As a result, a heavy burden is placed on the investigator to both find and analyze evidential artifacts in a heterogeneous environment. This paper proposes a clustering approach based on the Fuzzy C-Means (FCM) and K-means algorithms to identify evidential files and isolate non-related files based on their metadata. A series of experiments using heterogeneous real-life forensic cases is conducted to evaluate the approach. Within each case, various types of metadata categories were created based on file systems and applications. The results showed that clustering based on file systems gave the best results, grouping the evidential artifacts within only five clusters. The proportion of evidence captured across the five clusters was 100% using small configurations of both FCM and K-means, with less than 16% of the non-evidential artifacts included across all cases, representing an 84% reduction in the benign files that must be analyzed. In terms of the applications, the proportion of evidence was more than 97%, but the proportion of benign files was also relatively high with small configurations. However, with a large configuration, the proportion of benign files became very low, at less than 10%. Successfully prioritizing large proportions of evidence and reducing the volume of benign files to be analyzed reduces both the time taken and the cognitive load upon the investigator.
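    The K-means half of the approach can be sketched with scikit-learn (Fuzzy C-Means is not in scikit-learn, and the paper's metadata features are not specified, so the two features below, file size and age in days, are assumptions):

        import numpy as np
        from sklearn.cluster import KMeans
        from sklearn.preprocessing import StandardScaler

        # One row of metadata features per file: [size_bytes, age_days].
        X = np.array([[2048, 3],
                      [4096, 2],
                      [900000, 700],
                      [880000, 650],
                      [1500, 5]])
        labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(
            StandardScaler().fit_transform(X))
        print(labels)  # recently touched small files fall into one cluster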

    Automated Identification of Digital Evidence across Heterogeneous Data Resources

    Digital forensics has become an increasingly important tool in the fight against cyber and computer-assisted crime. However, with an increasing range of technologies at people’s disposal, investigators find themselves having to process and analyse many systems with large volumes of data (e.g., PCs, laptops, tablets, and smartphones) within a single case. Unfortunately, current digital forensic tools operate in an isolated manner, investigating systems and applications individually. The heterogeneity and volume of evidence place time constraints and a significant burden on investigators. Examples of heterogeneity include applications such as messaging (e.g., iMessenger, Viber, Snapchat, and WhatsApp), web browsers (e.g., Firefox and Google Chrome), and file systems (e.g., NTFS, FAT, and HFS). Being able to analyse and investigate evidence from across devices and applications in a universal and harmonized fashion would enable investigators to query all data at once. In addition, successfully prioritizing evidence and reducing the volume of data to be analysed reduces the time taken and the cognitive load on the investigator. This thesis focuses on the examination and analysis phases of the digital investigation process. It explores the feasibility of dealing with big and heterogeneous data sources in order to correlate the evidence from across these evidential sources in an automated way. Therefore, a novel approach was developed to solve the heterogeneity issues of big data using three developed algorithms: the harmonising, clustering, and automated identification of evidence (AIE) algorithms. The harmonisation algorithm seeks to provide an automated framework to merge similar datasets by characterising similar metadata categories and then harmonising them into a single dataset. This algorithm overcomes heterogeneity issues and makes examination and analysis easier by allowing evidential artefacts across devices and applications to be analysed and investigated based on shared categories, so that the data can be queried at once. Based on the merged datasets, the clustering algorithm is used to identify the evidential files and isolate the non-related files based on their metadata. Afterwards, the AIE algorithm seeks to identify the cluster holding the largest number of evidential artefacts by searching based on two methods: criminal profiling activities and information from the criminals themselves. Then, the related clusters are identified through timeline analysis and a search of associated artefacts of the files within the first cluster. A series of experiments using real-life forensic datasets was conducted to evaluate the algorithms across five different categories of datasets (i.e., messaging, graphical files, file system, internet history, and emails), each containing data from different applications across different devices. The results of the characterisation and harmonisation process show that the algorithm can merge all fields successfully, with the exception of some binary-based data found within the messaging datasets (contained within Viber and SMS). The error occurred because of a lack of information for the characterisation process to make a useful determination. However, on further analysis, it was found that the error had a minimal impact on the subsequent merged data. The results of the clustering process and the AIE algorithm showed that the two algorithms can work together to identify more than 92% of evidential files.
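    The AIE search step lends itself to a loose sketch (the actual algorithm from the thesis is not reproduced; the profiling terms and cluster contents are invented): each cluster is scored by how many of its artefacts match profiling keywords, and the densest cluster becomes the starting point for timeline analysis.

        PROFILE_TERMS = {"invoice", "bitcoin"}  # assumed criminal-profiling keywords

        clusters = {
            0: [{"name": "invoice_scan.pdf"}, {"name": "wallpaper.png"}],
            1: [{"name": "btc_bitcoin_keys.txt"}, {"name": "invoice_2021.xlsx"}],
        }

        def score(files):
            """Count artefacts whose name matches any profiling term."""
            return sum(any(t in f["name"] for t in PROFILE_TERMS) for f in files)

        best = max(clusters, key=lambda c: score(clusters[c]))
        print(best)  # cluster 1 holds the most matching artefacts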

    Constructing and classifying email networks from raw forensic images

    Email addresses extracted from secondary storage devices are important to a forensic analyst when conducting an investigation. They can provide insight into the user's social network and help identify other potential persons of interest. However, a large portion of the email addresses from any given device are artifacts of installed software and are of no interest to the analyst. We propose a method for discovering relevant email addresses by creating graphs consisting of extracted email addresses along with their byte-offset locations in storage. We compute certain global attributes of these graphs to construct feature vectors, which we use to classify graphs into useful and not useful categories. This process filters out the majority of uninteresting email addresses. We show that, using the network topological measures on the dataset tested, Naïve Bayes and SVM successfully identified 100% and 95.5%, respectively, of all graphs that contained useful email addresses, both with areas under the curve above 0.97 and F1 scores of 0.80 and 0.90 for Naïve Bayes and SVM, respectively. Our results show that using network science metrics as attributes to classify graphs of email addresses based on the graph's topology could be an effective and efficient tool for automatically delivering evidence to an analyst.
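    The pipeline described above (construct graphs, compute global topological features, classify) can be illustrated with networkx and scikit-learn; the random graphs and the particular features below are stand-ins rather than the paper's data or exact attributes.

        import networkx as nx
        from sklearn.naive_bayes import GaussianNB

        def features(g):
            """Summarise a graph by a few global topological measures."""
            return [g.number_of_nodes(), nx.density(g), nx.average_clustering(g)]

        # Stand-ins for graphs known to contain useful / useless email addresses.
        useful = nx.erdos_renyi_graph(20, 0.3, seed=1)
        useless = nx.erdos_renyi_graph(20, 0.05, seed=2)

        clf = GaussianNB().fit([features(useful), features(useless)], [1, 0])
        probe = nx.erdos_renyi_graph(20, 0.25, seed=3)
        print(clf.predict([features(probe)]))  # [1] -> likely contains useful addresses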

    Availability of Datasets for Digital Forensics–And What is Missing

    This paper targets two main goals. First, we want to provide an overview of available datasets that can be used by researchers and where to find them. Second, we want to stress the importance of sharing datasets to allow researchers to replicate results and improve the state of the art. To address the first goal, we analyzed 715 peer-reviewed research articles from 2010 to 2015 with a focus on and relevance to digital forensics to see what datasets are available, concentrating on three major aspects: (1) the origin of the dataset (e.g., real world vs. synthetic), (2) whether datasets were released by researchers, and (3) the types of datasets that exist. Additionally, we broadened our results to include the outcome of online searches. We also discuss what we think is missing. Overall, our results show that the majority of datasets are experiment-generated (56.4%), followed by real-world data (36.7%). On the other hand, 54.4% of the articles use existing datasets while the rest created their own. In the latter case, only 3.8% actually released their datasets. Finally, we conclude that there are many datasets out there for use, but finding them can be challenging.

    Multi-Stakeholder Case Prioritization in Digital Investigations

    This work examines the problem of case prioritization in digital investigations for better utilization of limited criminal investigation resources. Current methods of case prioritization, as well as prioritization methods observed in digital forensic investigation laboratories, are examined. Afterwards, a multi-stakeholder approach to case prioritization is presented that may help reduce reputational risk to digital forensic laboratories while improving resource allocation. A survey showing differing opinions of investigation priority between law enforcement and the public is presented and used in the development of a prioritization model. Finally, an example case is given to demonstrate the practicality of the proposed method.
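    One way to picture a multi-stakeholder prioritization model is a weighted sum over stakeholder preferences; this is only a toy sketch under assumed weights and factors, not the model developed in the paper.

        # Each stakeholder weights the case factors differently.
        WEIGHTS = {
            "law_enforcement": {"severity": 0.7, "public_concern": 0.3},
            "public":          {"severity": 0.4, "public_concern": 0.6},
        }

        def priority(case):
            """Average the stakeholders' weighted scores for a case."""
            return sum(sum(w[f] * case[f] for f in w)
                       for w in WEIGHTS.values()) / len(WEIGHTS)

        cases = [{"id": "A", "severity": 0.9, "public_concern": 0.2},
                 {"id": "B", "severity": 0.5, "public_concern": 0.9}]
        for c in sorted(cases, key=priority, reverse=True):
            print(c["id"], round(priority(c), 2))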

    Unlocking Environmental Narratives

    Understanding the role of humans in environmental change is one of the most pressing challenges of the 21st century. Environmental narratives – written texts with a focus on the environment – offer rich material capturing relationships between people and their surroundings. We take advantage of two key opportunities for their computational analysis: massive growth in the availability of digitised contemporary and historical sources, and parallel advances in the computational analysis of natural language. We open by introducing interdisciplinary research questions related to the environment and amenable to analysis through written sources. The reader is then introduced to potential collections of narratives, including newspapers, travel diaries, policy documents, scientific proposals and even fiction. We demonstrate the application of a range of approaches to analysing natural language computationally, introducing key ideas through worked examples and providing access to the sources analysed and the accompanying code. The second part of the book is centred around case studies, each applying computational analysis to some aspect of environmental narrative. Themes include the use of language to describe narratives about glaciers, urban gentrification, diversity and writing about nature, and the ways in which locations are conceptualised and described in nature writing. We close by reviewing the approaches taken and presenting an interdisciplinary research agenda for future work. The book is designed to be of interest to newcomers to the field as well as experienced researchers, and is set out so that it can be used as an accompanying text for graduate-level courses in, for example, geography, environmental history or the digital humanities.