
    Applicability of Latent Dirichlet Allocation to Multi-Disk Search

    Digital forensics practitioners face a continual increase in the volume of data they must analyze, which exacerbates the problem of finding relevant information in a noisy domain. Current technologies use keyword-based search to isolate relevant documents and minimize false positives with respect to investigative goals. Unfortunately, selecting appropriate keywords is a complex and challenging task. Latent Dirichlet Allocation (LDA) offers a possible way to relax keyword selection by returning topically similar documents. This research compares regular-expression search techniques and LDA using the Real Data Corpus (RDC). The RDC, a set of over 2,400 disks from real users, is first analyzed to craft effective tests. Three tests are executed, with the results indicating that, while LDA search should not be used as a replacement for regular-expression search, it does offer benefits. First, it is able to locate documents when few, if any, of the keywords exist within them. Second, it improves data browsing and deals with keyword ambiguity by segmenting the documents into topics.
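
    As an illustration of the comparison described above, the sketch below contrasts a regular-expression keyword search with LDA-based topical retrieval using gensim. The toy corpus, query, and similarity threshold are assumptions for demonstration, not the paper's actual setup on the RDC.

```python
# Sketch: compare regex keyword search with LDA topic-based retrieval.
# Toy corpus and query only; the paper used the Real Data Corpus.
import re
from gensim import corpora, models, similarities

docs = [
    "invoice payment wire transfer bank account",
    "holiday photos beach family vacation",
    "account statement transfer of funds to offshore bank",
    "meeting notes project schedule deadlines",
]

# Regex search: only documents containing the literal keyword match.
regex_hits = [i for i, d in enumerate(docs) if re.search(r"\bbank\b", d)]

# LDA search: documents whose topic mixture resembles the query's.
texts = [d.split() for d in docs]
dictionary = corpora.Dictionary(texts)
bow = [dictionary.doc2bow(t) for t in texts]
lda = models.LdaModel(bow, id2word=dictionary, num_topics=2,
                      passes=20, random_state=0)

index = similarities.MatrixSimilarity(lda[bow])
query = dictionary.doc2bow("wire funds account".split())  # no "bank" keyword
lda_hits = [i for i, s in enumerate(index[lda[query]]) if s > 0.5]

print("regex:", regex_hits)  # misses documents lacking the literal keyword
print("lda:  ", lda_hits)    # can surface topically related documents
```

    The point of the sketch is the failure mode the paper highlights: the regex query cannot reach documents that never contain the keyword, while a comparison in topic space can.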

    A fuzzy approach for measuring development of topics in patents using Latent Dirichlet Allocation

    Technological progress brings very rapid growth in patent publications, which increases the difficulty for domain experts of measuring the development of various topics, handling the linguistic terms used in evaluation, and understanding massive amounts of technological content. To overcome the limitations of keyword-ranking-style text-mining results in existing research, and at the same time deal with the vagueness of linguistic terms to assist thematic evaluation, this research proposes a fuzzy set-based topic development measurement (FTDM) approach to estimate and evaluate the topics hidden in a large volume of patent claims using Latent Dirichlet Allocation. In this study, latent semantic topics are first discovered from the patent corpus and measured by a temporal-weight matrix that reveals the importance of each topic in different years. For each topic, we then calculate a temporal-weight coefficient based on the matrix, which is associated with a set of linguistic terms describing the topic's development state over time. After choosing a suitable linguistic term set, fuzzy membership functions are created for each term. The temporal-weight coefficients are then transformed into membership vectors over the linguistic terms, which can be used to measure the development states of all topics directly and effectively. A case study using solar-cell-related patents shows the effectiveness of the proposed FTDM approach and its applicability for estimating hidden topics and measuring their corresponding development states efficiently.
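
    The abstract does not give the FTDM formulas, so the sketch below is only a rough numerical analogy: each topic's yearly weights are summarized by a least-squares trend coefficient, and triangular fuzzy membership functions map that coefficient onto linguistic terms. The weight matrix, the slope-based coefficient, and the term definitions are all illustrative assumptions.

```python
# Rough sketch of a fuzzy topic-development measurement in the spirit of FTDM.
# Weights, trend coefficient, and membership functions are assumptions.
import numpy as np

# Rows = topics, columns = years: e.g. average topic probability per year.
weights = np.array([
    [0.05, 0.08, 0.15, 0.22],   # topic 0: rising importance
    [0.20, 0.18, 0.17, 0.16],   # topic 1: slowly fading
])
years = np.arange(weights.shape[1])

def trend_coefficient(w):
    """Slope of a least-squares line through the yearly weights."""
    return np.polyfit(years, w, 1)[0]

def triangular(x, a, b, c):
    """Triangular fuzzy membership function peaking at b."""
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x < b else (c - x) / (c - b)

terms = {  # linguistic term -> (a, b, c) of its triangular function
    "declining": (-0.10, -0.05, 0.00),
    "stable":    (-0.03,  0.00, 0.03),
    "growing":   ( 0.00,  0.05, 0.10),
}

for t, w in enumerate(weights):
    coef = trend_coefficient(w)
    member = {k: round(triangular(coef, *abc), 2) for k, abc in terms.items()}
    print(f"topic {t}: coefficient={coef:+.3f}, membership={member}")
```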

    An Automated Approach for Digital Forensic Analysis of Heterogeneous Big Data

    The major challenges in big data examination and analysis are volume, complex interdependence across content, and heterogeneity. The examination and analysis phases are considered essential to a digital forensics process. However, traditional forensic investigation techniques use one or more forensic tools to examine and analyse each resource separately. In addition, when multiple resources are included in one case, the inability to cross-correlate findings often leads to inefficiencies in processing and identifying evidence. Furthermore, most current forensics tools cannot cope with large volumes of data. This paper develops a novel framework for digital forensic analysis of heterogeneous big data. The framework focuses on three core issues: data volume, heterogeneous data, and the investigator's cognitive load in understanding the relationships between artefacts. The proposed approach uses metadata to address the data-volume problem, semantic web ontologies to handle heterogeneous data sources, and artificial intelligence models to support the automated identification and correlation of artefacts, reducing the burden placed upon the investigator to understand the nature and relationships of the artefacts.
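
    As a toy illustration of the metadata-driven correlation the framework calls for (the record fields and the time-window rule below are assumptions, not the paper's design), artefacts from different sources can be reduced to a common metadata record and linked when their timestamps fall within a shared window:

```python
# Toy metadata-based cross-source correlation; the Artefact fields and the
# five-minute window are illustrative assumptions, not the paper's framework.
from dataclasses import dataclass
from datetime import datetime, timedelta
from itertools import combinations

@dataclass
class Artefact:
    source: str       # e.g. disk image, phone dump, network log
    path: str
    timestamp: datetime

artefacts = [
    Artefact("laptop", "C:/Users/a/report.docx", datetime(2016, 3, 1, 10, 2)),
    Artefact("phone",  "DCIM/IMG_0042.jpg",      datetime(2016, 3, 1, 10, 3)),
    Artefact("server", "/var/log/auth.log",      datetime(2016, 3, 4, 9, 0)),
]

# Link artefacts from different sources whose timestamps are close together,
# a crude stand-in for the AI-assisted correlation the paper proposes.
window = timedelta(minutes=5)
links = [(x, y) for x, y in combinations(artefacts, 2)
         if x.source != y.source and abs(x.timestamp - y.timestamp) <= window]

for x, y in links:
    print(f"correlated: {x.source}:{x.path} <-> {y.source}:{y.path}")
```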

    Modeling technological topic changes in patent claims

    Patent claims usually embody the most essential terms and the core technological scope that define the protection of an invention, which makes them the ideal resource for patent content and topic change analysis. However, manually conducting content analysis on massive numbers of technical terms is very time consuming and laborious. Even with the help of traditional text mining techniques, it is still difficult to model topic changes over time, because single keywords alone are usually too general or ambiguous to represent a concept. Moreover, the term frequencies used to define a topic cannot separate polysemous words that actually describe different themes. To address this issue, this research proposes a topic change identification approach based on Latent Dirichlet Allocation to model and analyze topic changes with minimal human intervention. After textual data cleaning, underlying semantic topics hidden in large archives of patent claims are revealed automatically. Concepts are defined by probability distributions over words instead of term frequencies, so that polysemy is accommodated. A case study using patents published by the United States Patent and Trademark Office (USPTO) from 2009 to 2013 with Australia as the assignee country demonstrates the validity of the proposed topic change identification approach. The experimental results show that the proposed approach can serve as an automatic tool providing machine-identified topic changes for more efficient and effective R&D management assistance.
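
    One simple way to realize the topic-change analysis described above, sketched with gensim under the assumption of a single LDA model over all claims and per-year averaging of topic proportions (the abstract does not specify the actual pipeline), is shown below with toy data:

```python
# Sketch: track how LDA topic proportions shift across publication years.
# The toy claims and the per-year averaging are illustrative assumptions.
from collections import defaultdict
from gensim import corpora, models

claims_by_year = {
    2009: ["solar cell silicon wafer efficiency", "photovoltaic panel module"],
    2013: ["battery storage grid inverter", "energy storage lithium cell"],
}

texts, years = [], []
for year, claims in claims_by_year.items():
    for claim in claims:
        texts.append(claim.split())
        years.append(year)

dictionary = corpora.Dictionary(texts)
bow = [dictionary.doc2bow(t) for t in texts]
lda = models.LdaModel(bow, id2word=dictionary, num_topics=2,
                      passes=30, random_state=1)

# Average topic probability per year; rising or falling averages flag change.
totals = defaultdict(lambda: [0.0] * lda.num_topics)
counts = defaultdict(int)
for doc, year in zip(bow, years):
    for topic, p in lda.get_document_topics(doc, minimum_probability=0.0):
        totals[year][topic] += p
    counts[year] += 1

for year in sorted(totals):
    print(year, [round(v / counts[year], 2) for v in totals[year]])
```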

    Towards Sound Forensic Arguments: Structured Argumentation Applied to Digital Forensics Practice

    Digital forensic practitioners increasingly face examinations that are complex in both nature and structure. Throughout the examination and analysis phases, the practitioner constantly draws logical inferences that will be reflected in the reporting of results. It is therefore important to expose how all the elements of an investigation fit together, to allow review and scrutiny and to help associated parties understand its components. This paper proposes the use of 'structured argumentation' as a valuable and flexible ingredient of the practitioner's thinking toolbox. It explores this approach using three case examples, which allow discussion of the benefits and application of structured argumentation in real-world contexts. We argue that, despite requiring a modest initial learning effort, structured argumentation is a practical method that promotes the accessibility of findings, facilitating communication between technical and legal parties, peer review, logical reconstruction, jury interpretation, and error detection.
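
    The paper does not publish a schema, but a structured argument lends itself to a small explicit data structure; the Toulmin-style fields below (claim, evidence, warrant, rebuttals) are a common convention adopted here as an assumption, not the authors' notation:

```python
# Minimal representation of a structured forensic argument. The Toulmin-style
# fields are a common convention and an assumption, not the paper's schema.
from dataclasses import dataclass, field

@dataclass
class Argument:
    claim: str                  # conclusion the practitioner asserts
    evidence: list[str]         # artefacts supporting the claim
    warrant: str                # why the evidence supports the claim
    rebuttals: list[str] = field(default_factory=list)  # counter-possibilities

arg = Argument(
    claim="User X downloaded the file deliberately",
    evidence=["browser history entry", "download folder MAC times"],
    warrant="direct navigation followed by file creation implies intent",
    rebuttals=["malware could have automated the download"],
)
print(arg)
```

    Making warrants and rebuttals explicit is what enables the peer review and error detection the paper argues for: a reviewer can challenge the warrant or add a rebuttal without re-reading the raw evidence.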

    Availability of Datasets for Digital Forensics – And What is Missing

    This paper targets two main goals. First, we provide an overview of datasets that researchers can use and where to find them. Second, we stress the importance of sharing datasets so that researchers can replicate results and improve the state of the art. For the first goal, we analyzed 715 peer-reviewed research articles from 2010 to 2015 with focus and relevance to digital forensics to see what datasets are available, concentrating on three major aspects: (1) the origin of the dataset (e.g., real-world vs. synthetic), (2) whether datasets were released by researchers, and (3) the types of datasets that exist. Additionally, we broadened our results to include the outcome of online searches. We also discuss what we think is missing. Overall, our results show that the majority of datasets are experiment-generated (56.4%), followed by real-world data (36.7%). On the other hand, 54.4% of the articles use existing datasets while the rest created their own; in the latter case, only 3.8% actually released their datasets. Finally, we conclude that many datasets are available for use, but finding them can be challenging.

    Image Annotation and Topic Extraction Using Super-Word Latent Dirichlet Allocation

    This research presents a multi-domain solution that uses text and images to iteratively improve automated information extraction. Stage I uses the local text surrounding an embedded image to provide clues that help rank-order possible image annotations. These annotations are forwarded to Stage II, where they are used as highly relevant super-words to improve the extraction of topics. The model probabilities from the super-words in Stage II are forwarded to Stage III, where they are used to refine the automated image annotation developed in Stage I. All stages demonstrate improvement over equivalent existing algorithms in the literature.
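
    The abstract does not define the super-word mechanism; one simple reading, used purely as an illustration here, is to up-weight the Stage I annotation terms in a document's bag-of-words before topic inference:

```python
# Illustrative sketch of the super-word idea: boost image-annotation terms in
# the bag-of-words before LDA. The weighting scheme is an assumption; the
# thesis defines its own model.
from gensim import corpora, models

docs = [
    "the engine diagram shows pistons and crankshaft rotation",
    "recipe for sourdough bread with flour water and salt",
]
annotations = {0: ["engine", "pistons"]}   # toy Stage I output for document 0
BOOST = 3                                  # extra weight given to super-words

texts = [d.split() for d in docs]
dictionary = corpora.Dictionary(texts)

bow = []
for i, tokens in enumerate(texts):
    counts = dictionary.doc2bow(tokens)
    boosted = {dictionary.token2id[w] for w in annotations.get(i, [])}
    # Super-words count BOOST times more, pulling topics toward image content.
    bow.append([(tid, c * BOOST if tid in boosted else c) for tid, c in counts])

lda = models.LdaModel(bow, id2word=dictionary, num_topics=2,
                      passes=30, random_state=0)
for t in range(2):
    print(lda.print_topic(t, topn=4))
```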

    Exploring Text Mining and Analytics for Applications in Public Security: An in-depth dive into a systematic literature review

    Text mining and related analytics have emerged as a technological approach to support human activities by extracting useful knowledge from texts in several formats. From a managerial point of view, they can help organizations in planning and decision-making, providing information that was not previously evident in textual materials produced internally or externally. In this context, within the public and governmental sphere, public security agencies are great beneficiaries of text mining tools in several respects, from applications in the criminal area to the collection of people's opinions and sentiments about the actions taken to promote their welfare. This article reports a systematic literature review focused on identifying the main areas of text mining application in public security, the most recurrent technological tools, and future research directions. The searches covered four major article databases (Scopus, Web of Science, IEEE Xplore, and ACM Digital Library), selecting 194 works published between 2014 and the first half of 2021 across journals, conferences, and book chapters. Several findings concerning the targets of the literature review are presented in the results of this article.

    Air Force Institute of Technology Research Report 2014

    This report summarizes the research activities of the Air Force Institute of Technology's Graduate School of Engineering and Management. It describes research interests and faculty expertise; lists student theses and dissertations; identifies research sponsors and contributions; and outlines the procedures for contacting the school. Included in the report are faculty publications, conference presentations, consultations, and funded research projects. Research was conducted in the areas of Aeronautical and Astronautical Engineering, Electrical Engineering and Electro-Optics, Computer Engineering and Computer Science, Systems Engineering and Management, Operational Sciences, Mathematics and Statistics, and Engineering Physics.