
    Advancing Automation in Digital Forensic Investigations Using Machine Learning Forensics

    In recent years, most data, such as books, videos, pictures, medical records, and even human genetic information, have been moving toward digital formats. Laptops, tablets, smartphones, and wearable devices are the major sources of this digital transformation and are becoming a core part of our daily lives. As a result, we are becoming soft targets for various types of cybercrime. Digital forensic investigation provides a way to recover lost, deliberately deleted, or hidden files from a suspect's device. However, current manpower and government resources are not sufficient to investigate these cybercrimes, and existing digital investigation procedures and practices require extensive human interaction, which slows the process and prevents it from keeping pace with the rate at which digital crimes are committed. Machine learning (ML) is a branch of science that has grown out of the field of artificial intelligence (AI); rather than relying on explicit programming, it learns from data to exhibit human-like behaviour. Combining machine learning with automation at different stages of the digital investigation process has significant potential to aid digital investigators. This chapter surveys research on machine learning-based digital forensic investigation, identifies the gaps, and addresses the challenges and open issues in this field.

    Methodology for the Automated Metadata-Based Classification of Incriminating Digital Forensic Artefacts

    The ever-increasing volume of data encountered in digital forensic investigations is one of the most discussed challenges in the field. Usually, most of the file artefacts on seized devices are not pertinent to the investigation, and manually retrieving the suspicious files that are relevant is akin to finding a needle in a haystack. In this paper, a methodology for the automatic prioritisation of suspicious file artefacts (i.e., file artefacts that are pertinent to the investigation) is proposed to reduce the manual analysis effort required. The methodology is designed to work in a human-in-the-loop fashion: it predicts/recommends that an artefact is likely to be suspicious rather than producing a final analysis result. A supervised machine learning approach is employed, which leverages the recorded results of previously processed cases. The processes of feature extraction, dataset generation, training, and evaluation are presented, and a toolkit for data extraction from disk images is outlined, enabling the method to be integrated with the conventional investigation process and run in an automated fashion.
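
    The abstract does not fix a concrete model, so the following is a minimal sketch of the idea under stated assumptions: the metadata features (file size, path depth, extension/magic-byte mismatch) and the random forest are illustrative choices, not the paper's exact design.

```python
# Sketch of supervised artefact prioritisation. Feature names and the
# random-forest model are illustrative assumptions, not the paper's
# exact design.
from sklearn.ensemble import RandomForestClassifier

def extract_features(artefact: dict) -> list:
    """Map one file artefact's metadata to a numeric feature vector."""
    return [artefact["size_bytes"],
            artefact["path_depth"],                # directory nesting depth
            int(artefact["extension_mismatch"])]   # extension vs. magic bytes

# Labelled artefacts from previously processed cases (1 = suspicious).
labelled = [
    {"size_bytes": 2048,   "path_depth": 7, "extension_mismatch": True,  "label": 1},
    {"size_bytes": 512000, "path_depth": 2, "extension_mismatch": False, "label": 0},
]
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit([extract_features(a) for a in labelled], [a["label"] for a in labelled])

# Human-in-the-loop: rank new artefacts by predicted suspicion
# probability rather than emitting a final verdict.
new = [{"size_bytes": 4096, "path_depth": 6, "extension_mismatch": True}]
print(model.predict_proba([extract_features(a) for a in new])[:, 1])
```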

    File system modelling for digital triage: An inductive profiling approach

    Digital triage is the initial, rapid screening of electronic devices as a precursor to full forensic analysis. Triage has numerous benefits, including resource prioritisation, greater involvement of criminal investigators, and the rapid provision of initial outcomes. In traditional forensic science and criminology, certain behavioural attributes and character traits can be identified and used to construct a case profile that focuses an investigation and narrows down a list of suspects. This research introduces the Triage Modelling Tool (TMT), which uses a profiling approach to identify how offenders utilise and structure files through the creation of file system models; a rough sketch of the general idea follows below. Results from the TMT have proven extremely promising when compared with EnCase's similar built-in functionality, providing strong justification for future work in this area.
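
    The TMT's actual file system models are not described in enough detail here to reproduce; the sketch below only illustrates the general idea of summarising a seized file system as structural statistics and matching it against stored behavioural profiles. Every statistic and the nearest-profile matching rule are assumptions.

```python
# Illustrative sketch only: summarise a file system as structural
# statistics and match it against stored behavioural profiles. The
# statistics and the L1 nearest-profile rule are assumptions; the
# TMT's actual models are not reproduced here.
import os

def fs_profile(root: str) -> dict:
    depths, images, archives, total = [], 0, 0, 0
    for dirpath, _dirs, files in os.walk(root):
        depth = dirpath[len(root):].count(os.sep)
        for name in files:
            total += 1
            depths.append(depth)
            ext = os.path.splitext(name)[1].lower()
            images += ext in {".jpg", ".png", ".gif"}
            archives += ext in {".zip", ".rar", ".7z"}
    n = max(total, 1)
    return {"mean_depth": sum(depths) / n,
            "image_ratio": images / n,
            "archive_ratio": archives / n}

def closest_profile(candidate: dict, profiles: dict) -> str:
    """Name of the stored profile nearest to the candidate (L1 distance)."""
    dist = lambda p: sum(abs(candidate[k] - p[k]) for k in candidate)
    return min(profiles, key=lambda name: dist(profiles[name]))
```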

    A Practitioner Survey Exploring the Value of Forensic Tools, AI, Filtering, & Safer Presentation for Investigating Child Sexual Abuse Material

    Those investigating cases of Child Sexual Abuse Material (CSAM) risk experiencing trauma from prolonged exposure to illicit content, and research has shown that those working on such cases can experience psychological distress. As a result, there has been a growing effort to create and implement technologies that reduce exposure to CSAM. However, little work has gathered insight from the practitioners who use digital forensic tools and data science technologies regarding their functionality, effectiveness, accuracy, and importance. This study focused specifically on the value practitioners place on the tools and technologies they use to investigate CSAM cases. General findings indicated that implementing filtering technologies is considered more important than safe-viewing technologies; false positives are a greater concern than false negatives; resources such as time, personnel, and money remain a concern; and an improved workflow is highly desirable. Results also showed that practitioners are not well versed in data science and Artificial Intelligence (AI), which is alarming given that tools already implement these techniques and that practitioners face large amounts of data during investigations. Finally, the data showed that practitioners are generally not taking advantage of tools that implement data science techniques, and that their biggest needs are automated child nudity detection, age estimation, and skin tone detection.

    Harnessing Predictive Models for Assisting Network Forensic Investigations of DNS Tunnels

    In recent times, DNS tunneling techniques have been used for malicious purposes, yet network security mechanisms struggle to detect them. Network forensic analysis has proven effective but is slow and effort-intensive, as Network Forensic Analysis Tools struggle to deal with undocumented or new network tunneling techniques. In this paper, we present a machine learning approach, based on feature subsets of network traffic evidence, to aid forensic analysis by automating the inference of the protocols carried within DNS tunnels. We explore four network protocols, namely HTTP, HTTPS, FTP, and POP3. Three features are extracted from the DNS tunneled traffic: IP packet length, DNS query name entropy, and DNS query name length. We benchmark the performance of four classification models, i.e., decision trees, support vector machines, k-nearest neighbours, and neural networks, on a data set of DNS tunneled traffic. A classification accuracy of 95% is achieved, and the feature set reduces the original evidence data size by 74%. More importantly, our findings provide strong evidence that predictive machine learning techniques can identify the network protocols within DNS tunneled traffic in real time with high accuracy from a relatively small feature set, without infringing on privacy from the outset or having to collect complete DNS tunneling sessions.
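
    The three named features are concrete enough to sketch directly. The minimal illustration below computes IP packet length, query name entropy, and query name length per DNS packet and fits one of the four benchmarked model types (a decision tree); the packet representation and the toy data are assumptions.

```python
# Sketch of the three features named above, computed per DNS packet.
# The packet representation (dicts with 'length' and 'qname') and the
# decision-tree baseline are illustrative assumptions.
import math
from collections import Counter

from sklearn.tree import DecisionTreeClassifier

def qname_entropy(qname: str) -> float:
    """Shannon entropy (bits/char) of a DNS query name."""
    n = len(qname)
    return -sum(c / n * math.log2(c / n) for c in Counter(qname).values())

def features(pkt: dict) -> list:
    return [pkt["length"],                # IP packet length
            qname_entropy(pkt["qname"]),  # DNS query name entropy
            len(pkt["qname"])]            # DNS query name length

# Toy labelled traffic: the label is the protocol carried in the tunnel.
packets = [
    {"length": 512, "qname": "a9x1k2q8z3f7.t.example.com", "label": "HTTP"},
    {"length": 128, "qname": "mail.example.com",           "label": "POP3"},
]
clf = DecisionTreeClassifier(random_state=0)
clf.fit([features(p) for p in packets], [p["label"] for p in packets])
```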

    Text Categorization and Machine Learning Methods: Current State Of The Art

    In this information age, many documents are available in digital form and need to be classified by their text. To address this problem, researchers have focused on machine learning techniques: a general inductive process automatically builds a classifier by learning the characteristics of the categories from a set of pre-classified documents. The main benefit of this approach is that it avoids the manual definition of a classifier by domain experts, while offering good effectiveness, considerable savings in expert labour, and straightforward portability to different domains. This paper examines the main approaches to text categorization within the machine learning paradigm and presents the state of the art. Various issues pertaining to three different text similarity problems, namely semantic, conceptual, and contextual similarity, are also discussed.
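
    As a minimal concrete instance of this inductive process, the sketch below builds a text classifier purely from pre-classified example documents, with no hand-written rules; the toy corpus, TF-IDF features, and logistic regression are illustrative choices, not specific to the surveyed work.

```python
# Minimal instance of the inductive approach: a classifier is learned
# automatically from pre-classified documents, with no hand-written
# rules. Corpus, features, and model are illustrative toy choices.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

docs = ["invoice payment due",       "match ended in a draw",
        "quarterly earnings report", "league title race resumes"]
labels = ["finance", "sport", "finance", "sport"]

clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
clf.fit(docs, labels)
print(clf.predict(["quarterly payment report"]))  # -> ['finance']
```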

    DATA SCIENCE METHODS FOR STANDARDIZATION, SAFETY, AND QUALITY ASSURANCE IN RADIATION ONCOLOGY

    Radiation oncology is the field of medicine that deals with treating cancer patients through ionizing radiation. The clinical modality used to treat cancer patients in the radiation oncology domain is referred to as radiation therapy. Radiation therapy aims to deliver a precisely measured dose of irradiation to a defined tumor volume (the target) with as little damage as possible to the surrounding healthy tissue (organs-at-risk), resulting in eradication of the tumor, high quality of life, and prolonged survival. A typical radiotherapy process requires the use of different clinical systems at various stages of the workflow. The data generated at these stages are stored in unstructured and non-standard formats, which hinders the interoperability and interconnectivity of data, making it difficult to translate these datasets into knowledge that supports decision-making in routine clinical practice. In this dissertation, we present an enterprise-level informatics platform that can automatically extract and efficiently store clinical, treatment, imaging, and genomics data from radiation oncology patients. Additionally, we propose data science methods for data standardization, safety, and treatment quality analysis in radiation oncology. We demonstrate that our data standardization methods using word embeddings and machine learning are robust and highly generalizable on real-world clinical datasets collected from the nationwide radiation therapy centers administered by the US Veterans' Health Administration. We also present different heterogeneous data integration approaches to enhance the data standardization process. For patient safety, we analyze radiation oncology incident reports and propose an integrated natural language processing and machine learning pipeline to automate the incident triage and prioritization process, demonstrating that a deep learning based transfer learning approach helps in automated incident triage. Finally, we address treatment quality in terms of automated treatment planning in clinical decision support systems, showing that supervised machine learning methods can efficiently generate clinical hypotheses from radiation oncology treatment plans and demonstrating our framework's data analytics capability.
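
    As a rough illustration of the embedding-based standardization idea (not the dissertation's actual models), the sketch below maps a free-text structure name to the nearest entry in a standard nomenclature by vector similarity; the crude character-trigram "embedding" and the nomenclature list are placeholders.

```python
# Rough sketch of embedding-based name standardisation. embed() is a
# crude character-trigram hashing vectoriser standing in for the
# dissertation's trained word embeddings; STANDARD is a placeholder
# nomenclature list.
import numpy as np

def embed(text: str, dim: int = 64) -> np.ndarray:
    v = np.zeros(dim)
    t = text.lower()
    for i in range(len(t) - 2):
        v[hash(t[i:i + 3]) % dim] += 1.0   # bag of hashed trigrams
    norm = np.linalg.norm(v)
    return v / norm if norm else v

STANDARD = ["SpinalCord", "Parotid_L", "Parotid_R", "PTV"]

def standardise(raw_name: str) -> str:
    """Return the standard name most similar (cosine) to raw_name."""
    q = embed(raw_name)
    return max(STANDARD, key=lambda s: float(embed(s) @ q))

print(standardise("spinal cord"))  # -> 'SpinalCord'
```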

    Exploratory Analysis of Highly Heterogeneous Document Collections

    We present an effective multifaceted system for exploratory analysis of highly heterogeneous document collections. Our system is based on intelligently tagging individual documents in a purely automated fashion and exploiting these tags in a powerful faceted browsing framework. Tagging strategies employed include both unsupervised and supervised approaches based on machine learning and natural language processing. As one of our key tagging strategies, we introduce the KERA algorithm (Keyword Extraction for Reports and Articles). KERA extracts topic-representative terms from individual documents in a purely unsupervised fashion and proves significantly more effective than state-of-the-art methods. Finally, we evaluate our system on its ability to help users locate documents pertaining to military critical technologies buried deep in a large, heterogeneous sea of information. (KDD 2013: 19th ACM SIGKDD Conference on Knowledge Discovery and Data Mining.)
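
    KERA itself is not reproduced here; as a generic unsupervised stand-in, the sketch below tags each document with its highest-weighted TF-IDF terms, to show how automatically extracted tags could feed a faceted browser.

```python
# Generic unsupervised tagging stand-in (not the KERA algorithm):
# label each document with its top TF-IDF terms.
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["radar cross section reduction for stealth aircraft",
        "stealth coating materials and radar absorption",
        "budget report for fiscal year planning"]

vec = TfidfVectorizer(stop_words="english")
X = vec.fit_transform(docs)
terms = vec.get_feature_names_out()

for row in X.toarray():
    top = row.argsort()[::-1][:3]           # three highest-weighted terms
    print([terms[i] for i in top if row[i] > 0])
```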

    Tuberculosis Disease Detection through CXR Images based on Deep Neural Network Approach

    Tuberculosis (TB) is a disease that, if left untreated for an extended period, can ultimately be fatal. Early TB detection can be aided by deep learning ensembles. In previous work, ensemble classifiers were trained only on images that shared similar characteristics; yet for an ensemble to be useful, it must produce a diverse set of errors, which can be accomplished by using a number of different classifiers and/or features. In light of this, a new framework is constructed in this study for segmenting and identifying TB in human chest X-ray images. Chest X-ray images were gathered from traditional web databases and passed to Swin ResUnet3 for segmentation. The segmented chest X-rays are then given to the Multi-scale Attention-based DenseNet with Extreme Learning Machine (MAD-ELM) model in the detection stage to diagnose tuberculosis effectively. Because this variation of the proposed approach increased the diversity of the errors made by the base classifiers, it was able to detect tuberculosis more effectively. The proposed ensemble method achieved an accuracy of 94.2 percent, comparable to the results of past efforts.
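
    The published Swin ResUnet3 and MAD-ELM architectures are not reproduced here; the skeleton below only illustrates the two-stage data flow the abstract describes, with stand-in modules for both stages.

```python
# Skeleton of the two-stage pipeline described above. SegmenterStub and
# DetectorStub are placeholders for Swin ResUnet3 and MAD-ELM; only the
# segmentation -> detection data flow is shown.
import torch
import torch.nn as nn

class SegmenterStub(nn.Module):          # stands in for Swin ResUnet3
    def forward(self, x):                # x: (N, 1, H, W) chest X-rays
        return torch.sigmoid(x)          # (N, 1, H, W) soft lung mask

class DetectorStub(nn.Module):           # stands in for MAD-ELM
    def __init__(self):
        super().__init__()
        self.head = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                  nn.Linear(1, 2))  # 2 classes: TB / normal
    def forward(self, x):
        return self.head(x)

segmenter, detector = SegmenterStub(), DetectorStub()
xray = torch.randn(4, 1, 224, 224)       # a batch of chest X-ray images
masked = xray * segmenter(xray)          # stage 1: restrict to lung regions
logits = detector(masked)                # stage 2: TB vs. normal prediction
```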