1,174 research outputs found

    Matching Possible Mitigations to Cyber Threats: A Document-Driven Decision Support Systems Approach

    Get PDF
    Despite more than a decade of heightened focus on cybersecurity, the threat continues. To address possible impacts, cyber threats must be addressed. Mitigation catalogs exist in practice today, but these do not map mitigations to the specific threats they counter. Currently, mitigations are manually selected by cybersecurity experts (CSE) who are in short supply. To reduce labor and improve repeatability, an automated approach is needed for matching mitigations to cyber threats. This research explores the application of supervised machine learning and text retrieval techniques to automate matching of relevant mitigations to cyber threats where both are expressed as text, resulting in a novel method that combines two techniques: support vector machine classification and latent semantic analysis. In five test cases, the approach demonstrates high recall for known relevant mitigation documents, bolstering confidence that potentially relevant mitigations will not be overlooked. It automatically excludes 97% of non-relevant mitigations, greatly reducing the CSE’s workload over purely manual matching

    Matching Possible Mitigations to Cyber Threats: A Document-Driven Decision Support Systems Approach

    Get PDF
    Cyber systems are ubiquitous in all aspects of society. At the same time, breaches to cyber systems continue to be front-page news (Calfas, 2018; Equifax, 2017) and, despite more than a decade of heightened focus on cybersecurity, the threat continues to evolve and grow, costing globally up to $575 billion annually (Center for Strategic and International Studies, 2014; Gosler & Von Thaer, 2013; Microsoft, 2016; Verizon, 2017). To address possible impacts due to cyber threats, information system (IS) stakeholders must assess the risks they face. Following a risk assessment, the next step is to determine mitigations to counter the threats that pose unacceptably high risks. The literature contains a robust collection of studies on optimizing mitigation selections, but they universally assume that the starting list of appropriate mitigations for specific threats exists from which to down-select. In current practice, producing this starting list is largely a manual process and it is challenging because it requires detailed cybersecurity knowledge from highly decentralized sources, is often deeply technical in nature, and is primarily described in textual form, leading to dependence on human experts to interpret the knowledge for each specific context. At the same time cybersecurity experts remain in short supply relative to the demand, while the delta between supply and demand continues to grow (Center for Cyber Safety and Education, 2017; Kauflin, 2017; Libicki, Senty, & Pollak, 2014). Thus, an approach is needed to help cybersecurity experts (CSE) cut through the volume of available mitigations to select those which are potentially viable to offset specific threats. This dissertation explores the application of machine learning and text retrieval techniques to automate matching of relevant mitigations to cyber threats, where both are expressed as unstructured or semi-structured English language text. Using the Design Science Research Methodology (Hevner & March, 2004; Peffers, Tuunanen, Rothenberger, & Chatterjee, 2007), we consider a number of possible designs for the matcher, ultimately selecting a supervised machine learning approach that combines two techniques: support vector machine classification and latent semantic analysis. The selected approach demonstrates high recall for mitigation documents in the relevant class, bolstering confidence that potentially viable mitigations will not be overlooked. It also has a strong ability to discern documents in the non-relevant class, allowing approximately 97% of non-relevant mitigations to be excluded automatically, greatly reducing the CSE’s workload over purely manual matching. A false v positive rate of up to 3% prevents totally automated mitigation selection and requires the CSE to reject a few false positives. This research contributes to theory a method for automatically mapping mitigations to threats when both are expressed as English language text documents. This artifact represents a novel machine learning approach to threat-mitigation mapping. The research also contributes an instantiation of the artifact for demonstration and evaluation. From a practical perspective the artifact benefits all threat-informed cyber risk assessment approaches, whether formal or ad hoc, by aiding decision-making for cybersecurity experts whose job it is to mitigate the identified cyber threats. In addition, an automated approach makes mitigation selection more repeatable, facilitates knowledge reuse, extends the reach of cybersecurity experts, and is extensible to accommodate the continued evolution of both cyber threats and mitigations. Moreover, the selection of mitigations applicable to each threat can serve as inputs into multifactor analyses of alternatives, both automated and manual, thereby bridging the gap between cyber risk assessment and final mitigation selection

    Information Retrieval Performance Enhancement Using The Average Standard Estimator And The Multi-criteria Decision Weighted Set

    Get PDF
    Information retrieval is much more challenging than traditional small document collection retrieval. The main difference is the importance of correlations between related concepts in complex data structures. These structures have been studied by several information retrieval systems. This research began by performing a comprehensive review and comparison of several techniques of matrix dimensionality estimation and their respective effects on enhancing retrieval performance using singular value decomposition and latent semantic analysis. Two novel techniques have been introduced in this research to enhance intrinsic dimensionality estimation, the Multi-criteria Decision Weighted model to estimate matrix intrinsic dimensionality for large document collections and the Average Standard Estimator (ASE) for estimating data intrinsic dimensionality based on the singular value decomposition (SVD). ASE estimates the level of significance for singular values resulting from the singular value decomposition. ASE assumes that those variables with deep relations have sufficient correlation and that only those relationships with high singular values are significant and should be maintained. Experimental results over all possible dimensions indicated that ASE improved matrix intrinsic dimensionality estimation by including the effect of both singular values magnitude of decrease and random noise distracters. Analysis based on selected performance measures indicates that for each document collection there is a region of lower dimensionalities associated with improved retrieval performance. However, there was clear disagreement between the various performance measures on the model associated with best performance. The introduction of the multi-weighted model and Analytical Hierarchy Processing (AHP) analysis helped in ranking dimensionality estimation techniques and facilitates satisfying overall model goals by leveraging contradicting constrains and satisfying information retrieval priorities. ASE provided the best estimate for MEDLINE intrinsic dimensionality among all other dimensionality estimation techniques, and further, ASE improved precision and relative relevance by 10.2% and 7.4% respectively. AHP analysis indicates that ASE and the weighted model ranked the best among other methods with 30.3% and 20.3% in satisfying overall model goals in MEDLINE and 22.6% and 25.1% for CRANFIELD. The weighted model improved MEDLINE relative relevance by 4.4%, while the scree plot, weighted model, and ASE provided better estimation of data intrinsic dimensionality for CRANFIELD collection than Kaiser-Guttman and Percentage of variance. ASE dimensionality estimation technique provided a better estimation of CISI intrinsic dimensionality than all other tested methods since all methods except ASE tend to underestimate CISI document collection intrinsic dimensionality. ASE improved CISI average relative relevance and average search length by 28.4% and 22.0% respectively. This research provided evidence supporting a system using a weighted multi-criteria performance evaluation technique resulting in better overall performance than a single criteria ranking model. Thus, the weighted multi-criteria model with dimensionality reduction provides a more efficient implementation for information retrieval than using a full rank model

    Information Retrieval: Recent Advances and Beyond

    Full text link
    In this paper, we provide a detailed overview of the models used for information retrieval in the first and second stages of the typical processing chain. We discuss the current state-of-the-art models, including methods based on terms, semantic retrieval, and neural. Additionally, we delve into the key topics related to the learning process of these models. This way, this survey offers a comprehensive understanding of the field and is of interest for for researchers and practitioners entering/working in the information retrieval domain

    Gene prioritization and clustering by multi-view text mining

    Get PDF
    <p>Abstract</p> <p>Background</p> <p>Text mining has become a useful tool for biologists trying to understand the genetics of diseases. In particular, it can help identify the most interesting candidate genes for a disease for further experimental analysis. Many text mining approaches have been introduced, but the effect of disease-gene identification varies in different text mining models. Thus, the idea of incorporating more text mining models may be beneficial to obtain more refined and accurate knowledge. However, how to effectively combine these models still remains a challenging question in machine learning. In particular, it is a non-trivial issue to guarantee that the integrated model performs better than the best individual model.</p> <p>Results</p> <p>We present a multi-view approach to retrieve biomedical knowledge using different controlled vocabularies. These controlled vocabularies are selected on the basis of nine well-known bio-ontologies and are applied to index the vast amounts of gene-based free-text information available in the MEDLINE repository. The text mining result specified by a vocabulary is considered as a view and the obtained multiple views are integrated by multi-source learning algorithms. We investigate the effect of integration in two fundamental computational disease gene identification tasks: gene prioritization and gene clustering. The performance of the proposed approach is systematically evaluated and compared on real benchmark data sets. In both tasks, the multi-view approach demonstrates significantly better performance than other comparing methods.</p> <p>Conclusions</p> <p>In practical research, the relevance of specific vocabulary pertaining to the task is usually unknown. In such case, multi-view text mining is a superior and promising strategy for text-based disease gene identification.</p

    Topic Modeling for Automatic Analysis of Natural Language: A Case Study in an Italian Customer Support Center

    Get PDF
    This paper focuses on the automatic analysis of conversation transcriptions in the call center of a customer care service. The goal is to recognize topics related to problems and complaints discussed in several dialogues between customers and agents. Our study aims to implement a framework able to automatically cluster conversation transcriptions into cohesive and well-separated groups based on the content of the data. The framework can alleviate the analyst selecting proper values for the analysis and the clustering processes. To pursue this goal, we consider a probabilistic model based on the latent Dirichlet allocation, which associates transcriptions with a mixture of topics in different proportions. A case study consisting of transcriptions in the Italian natural language, and collected in a customer support center of an energy supplier, is considered in the paper. Performance comparison of different inference techniques is discussed using the case study. The experimental results demonstrate the approach’s efficacy in clustering Italian conversation transcriptions. It also results in a practical tool to simplify the analytic process and off-load the parameter tuning from the end-user. According to recent works in the literature, this paper may be valuable for introducing latent Dirichlet allocation approaches in topic modeling for the Italian natural language

    Recommending Relevant Classes for Bug Reports Using Multi-Objective Search

    Full text link
    Developers may follow a tedious process to find the cause of a bug based on code reviews and reproducing the abnormal behavior. In this thesis, we propose an automated approach for finding and ranking potential classes with the respect to the probability of containing a bug based on a bug report description. Our approach finds a good balance between minimizing the number of recommended classes and maximizing the relevance of the proposed solution using a multi-objective optimization algorithm. The relevance of the recommended classes (solution) is estimated based on the use of the history of changes and bug-fixing, and the lexical similarity between the bug report description and the API documentation. We evaluated our system on 6 open source Java projects including more than 22,000 bug reports, using the version of the project before fixing the bug of many bug reports. The experimental results show that the search-based approach significantly outperforms three state-of-the-art methods in recommending relevant files for bug reports. In particular, our multi-objective approach is able to successfully locate the true buggy methods within the top 10 recommendations for over 87% of the bug reports.Master of ScienceSoftware Engineering, College of Engineering and Computer ScienceUniversity of Michigan-Dearbornhttp://deepblue.lib.umich.edu/bitstream/2027.42/136064/1/Recommending Relevant Classes for Bug Reports Using Multi-Objective Search.pdfDescription of Recommending Relevant Classes for Bug Reports Using Multi-Objective Search.pdf : Master of Science Thesi
    corecore