1,627 research outputs found

    Dos and Don'ts of Machine Learning in Computer Security

    With the growing processing power of computing systems and the increasing availability of massive datasets, machine learning algorithms have led to major breakthroughs in many areas. This development has influenced computer security, spawning a series of learning-based security systems for tasks such as malware detection, vulnerability discovery, and binary code analysis. Despite their great potential, machine learning methods in security are prone to subtle pitfalls that undermine performance and can render learning-based systems unsuitable for security tasks and practical deployment. In this paper, we take a critical look at this problem. First, we identify common pitfalls in the design, implementation, and evaluation of learning-based security systems. We conduct a study of 30 papers from top-tier security conferences within the past 10 years, confirming that these pitfalls are widespread in the current security literature. In an empirical analysis, we further demonstrate how individual pitfalls can lead to unrealistic performance estimates and interpretations, obstructing the understanding of the security problem at hand. As a remedy, we propose actionable recommendations to support researchers in avoiding or mitigating the pitfalls where possible. Furthermore, we identify open problems when applying machine learning in security and provide directions for further research. (Comment: to appear at the USENIX Security Symposium 2022.)
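    The pitfalls the paper catalogues are concrete enough to sketch in code. Below is a minimal, hypothetical illustration (not taken from the paper) of one classic evaluation pitfall, data snooping through random splits of time-ordered data, together with the temporal split that avoids it; all data, features, and labels are synthetic assumptions.

    ```python
    # Sketch of a data-snooping pitfall: randomly splitting time-ordered
    # data lets a detector train on "future" samples. Everything here is
    # synthetic and hypothetical, purely to show the two split styles.
    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(0)
    X = rng.normal(size=(5000, 20))    # hypothetical feature vectors;
    y = rng.integers(0, 2, size=5000)  # rows assumed ordered by time

    # Pitfall: a random split mixes past and future samples.
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                              random_state=0)

    # Remedy: a temporal split trains only on samples seen before a cutoff.
    cutoff = int(0.7 * len(X))
    clf = RandomForestClassifier(random_state=0).fit(X[:cutoff], y[:cutoff])
    print("temporal-split accuracy:", clf.score(X[cutoff:], y[cutoff:]))
    ```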

    Influence of features discretization on accuracy of random forest classifier for web user identification

    Web user identification based on linguistic or stylometric features helps to solve several tasks in computer forensics and cybersecurity, and can be used to prevent and investigate high-tech crimes as well as crimes in which a computer is used as a tool. In this paper we present research results on the influence of feature discretization on the accuracy of a Random Forest classifier. To evaluate this influence, we carried out a series of experiments on a text corpus containing Russian online texts of different genres and topics, using data sets with varying levels of class imbalance and varying amounts of training text per user. The experiments showed that discretizing the features improves identification accuracy on all data sets. We obtained positive results even for an extremely low number of online messages per user and for the maximum imbalance level.
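    As a rough sketch of the setup this abstract describes, the snippet below compares a plain Random Forest against one trained on quantile-discretized features using scikit-learn; the corpus, features, and discretizer settings (5 quantile bins) are illustrative assumptions, not the authors' exact configuration.

    ```python
    # Compare Random Forest accuracy with and without feature discretization.
    # The stylometric features and user labels are synthetic placeholders.
    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import KBinsDiscretizer

    rng = np.random.default_rng(42)
    X = rng.normal(size=(600, 50))     # hypothetical stylometric features
    y = rng.integers(0, 10, size=600)  # hypothetical user IDs (10 authors)

    baseline = RandomForestClassifier(n_estimators=200, random_state=0)
    discretized = make_pipeline(
        KBinsDiscretizer(n_bins=5, encode="ordinal", strategy="quantile"),
        RandomForestClassifier(n_estimators=200, random_state=0),
    )

    print("raw features:        ", cross_val_score(baseline, X, y, cv=5).mean())
    print("discretized features:", cross_val_score(discretized, X, y, cv=5).mean())
    ```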

    Too Trivial To Test? An Inverse View on Defect Prediction to Identify Methods with Low Fault Risk

    Background. Test resources are usually limited, and it is therefore often impossible to test an application completely before a release. To cope with scarce resources, development teams can apply defect prediction to identify fault-prone code regions. However, defect prediction tends to have low precision in cross-project prediction scenarios. Aims. We take an inverse view on defect prediction and aim to identify methods that can be deferred when testing because they contain hardly any faults, their code being "trivial". We expect the characteristics of such methods to be project-independent, so that our approach could improve cross-project predictions. Method. We compute code metrics and apply association rule mining to create rules for identifying methods with low fault risk. We conduct an empirical study to assess our approach with six Java open-source projects containing precise fault data at the method level. Results. Our results show that inverse defect prediction can identify approximately 32-44% of the methods of a project as having a low fault risk; on average, they are about six times less likely to contain a fault than other methods. In cross-project predictions with larger, more diversified training sets, identified methods are even eleven times less likely to contain a fault. Conclusions. Inverse defect prediction supports the efficient allocation of test resources by identifying methods that can be treated with lower priority in testing activities, and it is well applicable in cross-project prediction scenarios. (Comment: Submitted to PeerJ Computer Science.)
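    To make the "inverse" idea tangible, here is a minimal sketch under stated assumptions: a hand-written rule flags methods as low-fault-risk from simple code metrics. The metric names and thresholds are hypothetical; the paper derives such rules automatically via association rule mining rather than using fixed cutoffs.

    ```python
    # Flag "trivial" methods that can be deprioritized in testing.
    # Metrics and thresholds below are invented for illustration only.
    from dataclasses import dataclass

    @dataclass
    class MethodMetrics:
        name: str
        loc: int          # lines of code
        cyclomatic: int   # cyclomatic complexity
        max_nesting: int  # deepest nesting level

    def low_fault_risk(m: MethodMetrics) -> bool:
        # One illustrative rule: tiny, branch-free, flat methods.
        return m.loc <= 3 and m.cyclomatic <= 1 and m.max_nesting == 0

    methods = [
        MethodMetrics("getId", loc=1, cyclomatic=1, max_nesting=0),
        MethodMetrics("parseConfig", loc=42, cyclomatic=9, max_nesting=3),
    ]
    deferred = [m.name for m in methods if low_fault_risk(m)]
    print("candidates to test with lower priority:", deferred)
    ```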

    A novel application of machine learning and zero-shot classification methods for automated abstract screening in systematic reviews

    Zero-shot classification refers to assigning a label to a text (sentence, paragraph, or whole paper) without prior training, which is possible by teaching the system how to codify a question and find its answer in the text. In many domains, especially the health sciences, systematic reviews are evidence-based syntheses of information related to a specific topic. Producing them is demanding and time-consuming in terms of collecting, filtering, evaluating, and synthesising large volumes of literature, requiring significant effort from experts. One of the most demanding steps is abstract screening, in which scientists sift through the abstracts of candidate papers and include or exclude them based on pre-established criteria. This process is time-consuming and subjective and requires consensus between scientists, which may not always be possible. With recent advances in machine learning and deep learning research, especially in natural language processing, it becomes possible to automate or semi-automate this task. This paper proposes a novel application of traditional machine learning and zero-shot classification methods for automated abstract screening in systematic reviews. Extensive experiments were carried out using seven public datasets. Competitive results were obtained in terms of accuracy, precision, and recall across all datasets, which indicates that the burden and the risk of human error in the abstract screening process might be reduced.
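    A minimal sketch of what zero-shot abstract screening can look like, assuming the Hugging Face transformers library and an NLI-based model; the example abstract and candidate labels are invented stand-ins for a review's eligibility criteria, not the paper's datasets.

    ```python
    # Zero-shot screening: score an abstract against textual criteria
    # without any task-specific training. Labels below are hypothetical.
    from transformers import pipeline

    classifier = pipeline("zero-shot-classification",
                          model="facebook/bart-large-mnli")

    abstract = ("We conducted a randomized controlled trial of drug X "
                "in 240 adult patients with type 2 diabetes...")

    # Candidate labels encode the review's eligibility criteria as text.
    labels = ["randomized controlled trial in humans",
              "animal study",
              "review or editorial"]

    result = classifier(abstract, candidate_labels=labels)
    print(result["labels"][0], round(result["scores"][0], 3))
    ```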

    Classification Using Association Rules

    This research investigates the use of an unsupervised learning technique, association rules, to make class predictions. Using association rules for classification is a growing area of focus within data mining research, but the work to date has focused predominantly on balanced datasets or synthesized imbalanced datasets, and concerns have been raised that algorithms using association rules for classification do not perform well on imbalanced datasets. This research comprehensively evaluates the accuracy of a number of association rule classifiers in predicting home loan sales in an Irish retail banking context. The experiments test three associative classifier algorithms (CBA, CMAR, and SPARCCC) against two benchmark algorithms (conditional inference trees and random forests) on a naturally imbalanced dataset. The experiments implemented and evaluated show that the benchmark tree-based algorithms outperform the associative classifier models across a range of balanced accuracy measures. This research contributes to the growing body of work on extending association rules to make class predictions.
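    Since CBA, CMAR, and SPARCCC lack standard scikit-learn implementations, the sketch below shows only the benchmark side of such an experiment under assumed settings: a random forest on a synthetically imbalanced binary target, scored with the balanced accuracy measure the abstract mentions.

    ```python
    # Benchmark baseline on an imbalanced binary target (e.g., loan sale
    # vs. no sale). The 95/5 imbalance and all settings are assumptions.
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import balanced_accuracy_score
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=4000, n_features=20,
                               weights=[0.95, 0.05], random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y,
                                              random_state=0)

    rf = RandomForestClassifier(n_estimators=300, class_weight="balanced",
                                random_state=0).fit(X_tr, y_tr)
    print("balanced accuracy:",
          balanced_accuracy_score(y_te, rf.predict(X_te)))
    ```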

    Modeling Non-Standard Text Classification Tasks

    Text classification deals with discovering knowledge in texts and is used for extracting, filtering, or retrieving information in streams and collections. The discovery of knowledge is operationalized by modeling text classification tasks, which is mainly a human-driven engineering process. The outcome of this process, a text classification model, is used to inductively learn a text classification solution from a priori classified examples. The building blocks of modeling text classification tasks cover four aspects: (1) the way examples are represented, (2) the way examples are selected, (3) the way classifiers learn from examples, and (4) the way models are selected. This thesis proposes methods that improve the prediction quality of text classification solutions for unseen examples, especially for non-standard tasks where standard models do not fit. The original contributions are related to the aforementioned building blocks: (1) Several topic-orthogonal text representations are studied in the context of non-standard tasks and a new representation, namely co-stems, is introduced. (2) A new active learning strategy that goes beyond standard sampling is examined. (3) A new one-class ensemble for improving the effectiveness of one-class classification is proposed. (4) A new model selection framework to cope with subclass distribution shifts that occur in dynamic environments is introduced.
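    Contribution (3) can be sketched generically. The following is a hedged illustration, assuming a one-class ensemble built from bagged one-class SVMs with majority voting; the thesis's actual ensemble construction may differ.

    ```python
    # One-class ensemble sketch: train each member on a bootstrap sample
    # of the target class, then combine predictions by majority vote.
    # All data and parameters are synthetic assumptions.
    import numpy as np
    from sklearn.svm import OneClassSVM

    rng = np.random.default_rng(0)
    X_target = rng.normal(0, 1, size=(300, 10))  # hypothetical target class
    X_test = np.vstack([rng.normal(0, 1, size=(20, 10)),   # in-class
                        rng.normal(4, 1, size=(20, 10))])  # outliers

    members = []
    for seed in range(7):  # odd member count avoids voting ties
        idx = rng.integers(0, len(X_target), size=len(X_target))
        members.append(OneClassSVM(nu=0.1, gamma="scale").fit(X_target[idx]))

    # Majority vote: +1 = belongs to the target class, -1 = outlier.
    votes = np.stack([m.predict(X_test) for m in members])
    decision = np.sign(votes.sum(axis=0))
    print("accepted as target class:",
          int((decision == 1).sum()), "of", len(X_test))
    ```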

    Needle in a Haystack: Reducing the Costs of Annotating Rare-Class Instances in Imbalanced Datasets
