8 research outputs found

    In Search of netUnicorn: A Data-Collection Platform to Develop Generalizable ML Models for Network Security Problems

    The remarkable success of the use of machine learning-based solutions for network security problems has been impeded by the developed ML models' inability to maintain efficacy when used in different network environments exhibiting different network behaviors. This issue is commonly referred to as the generalizability problem of ML models. The community has recognized the critical role that training datasets play in this context and has developed various techniques to improve dataset curation to overcome this problem. Unfortunately, these methods are generally ill-suited or even counterproductive in the network security domain, where they often result in unrealistic or poor-quality datasets. To address this issue, we propose an augmented ML pipeline that leverages explainable ML tools to guide the network data collection in an iterative fashion. To ensure the data's realism and quality, we require that the new datasets should be endogenously collected in this iterative process, thus advocating for a gradual removal of data-related problems to improve model generalizability. To realize this capability, we develop a data-collection platform, netUnicorn, that takes inspiration from the classic "hourglass" model and is implemented as its "thin waist" to simplify data collection for different learning problems from diverse network environments. The proposed system decouples data-collection intents from the deployment mechanisms and disaggregates these high-level intents into smaller reusable, self-contained tasks. We demonstrate how netUnicorn simplifies collecting data for different learning problems from multiple network environments and how the proposed iterative data collection improves a model's generalizability.

    A systematic methodology to evaluating optimised machine learning based network intrusion detection systems

    A network intrusion detection system (NIDS) is essential for mitigating computer network attacks in various scenarios. However, the increasing complexity of computer networks and attacks makes classifying unseen or novel network traffic challenging. Supervised machine learning (ML) techniques used in a NIDS can be affected by different scenarios. Thus, dataset recency, size, and applicability are essential factors when selecting and tuning a machine learning classifier. This thesis explores developing and optimising several supervised ML algorithms with relatively new datasets constructed to depict real-world scenarios. The methodology includes empirical analyses of systematic ML-based NIDS for a near real-world network system to improve intrusion detection. The thesis is experiment-heavy for model assessment. Data preparation methods are explored, followed by feature engineering techniques. The model evaluation process involves three experiments testing against a validation, un-trained, and retrained set. They compare several traditional machine learning and deep learning classifiers to identify the best NIDS model. Results show that the focus on feature scaling, feature selection methods and ML algorithm hyper-parameter tuning per model is an essential optimisation component. Distance-based ML algorithms performed much better with quantile transformation, whilst the tree-based algorithms performed better without scaling. Permutation importance performs as a feature selection method compared to feature extraction using Principal Component Analysis (PCA) when applied against all ML algorithms explored. Random forests, Support Vector Machines and recurrent neural networks consistently achieved the best results, with high macro F1-scores of 90%, 81% and 73% for the CICIDS 2017 dataset, and 72%, 68% and 73% against the CICIDS 2018 dataset. Thesis (MSc) -- Faculty of Science, Computer Science, 202
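The abstract above reports macro F1-scores. As a minimal sketch of the standard definition (not the thesis's own code), the macro F1-score is the unweighted mean of the per-class F1-scores, so rare attack classes count as much as the dominant benign class. The function name `macro_f1` and the traffic labels below are hypothetical, chosen only for illustration.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1-scores (standard macro averaging)."""
    labels = sorted(set(y_true) | set(y_pred))
    per_class = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never appears.
        per_class.append(2 * tp / denom if denom else 0.0)
    return sum(per_class) / len(per_class)


# Hypothetical labels for six flows (illustrative only, not from the datasets).
y_true = ["benign", "benign", "benign", "dos", "dos", "probe"]
y_pred = ["benign", "benign", "dos", "dos", "dos", "benign"]
score = macro_f1(y_true, y_pred)
print(f"macro F1 = {score:.3f}")
```

In this toy example the missed "probe" class contributes an F1 of 0 and pulls the average down sharply, which is why macro averaging is a common choice for imbalanced intrusion-detection data.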

    Text Similarity Between Concepts Extracted from Source Code and Documentation

    Context: Constant evolution in software systems often results in its documentation losing sync with the content of the source code. The traceability research field has often helped in the past with the aim to recover links between code and documentation, when the two fell out of sync. Objective: The aim of this paper is to compare the concepts contained within the source code of a system with those extracted from its documentation, in order to detect how similar these two sets are. If vastly different, the difference between the two sets might indicate a considerable ageing of the documentation, and a need to update it. Methods: In this paper we reduce the source code of 50 software systems to a set of key terms, each containing the concepts of one of the systems sampled. At the same time, we reduce the documentation of each system to another set of key terms. We then use four different approaches for set comparison to detect how the sets are similar. Results: Using the well known Jaccard index as the benchmark for the comparisons, we have discovered that the cosine distance has excellent comparative powers, and depending on the pre-training of the machine learning model. In particular, the SpaCy and the FastText embeddings offer up to 80% and 90% similarity scores. Conclusion: For most of the sampled systems, the source code and the documentation tend to contain very similar concepts. Given the accuracy for one pre-trained model (e.g., FastText), it becomes also evident that a few systems show a measurable drift between the concepts contained in the documentation and in the source code.
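The Jaccard index used as the benchmark in the abstract above can be sketched as follows. This is a minimal illustration of the standard set-overlap formula (intersection size divided by union size), not the authors' implementation. The function name `jaccard_index` and the two term sets are hypothetical.

```python
def jaccard_index(a, b):
    """Size of the intersection divided by size of the union of two sets."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        # Convention assumed here: two empty sets are treated as identical.
        return 1.0
    return len(a & b) / len(union)


# Hypothetical key terms extracted from one system's code and documentation.
code_terms = {"parser", "token", "cache", "request"}
doc_terms = {"parser", "token", "install", "request", "license"}

similarity = jaccard_index(code_terms, doc_terms)
print(f"Jaccard index = {similarity:.2f}")
```

Because the Jaccard index only counts exact term matches, it cannot credit near-synonyms, which is one plausible reason to compare it against embedding-based measures such as cosine distance.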

    LIPIcs, Volume 251, ITCS 2023, Complete Volume


    Abstracts on Radio Direction Finding (1899 - 1995)

    The files on this record represent the various databases that originally composed the CD-ROM issue of "Abstracts on Radio Direction Finding" database, which is now part of the Dudley Knox Library's Abstracts and Selected Full Text Documents on Radio Direction Finding (1899 - 1995) Collection. (See Calhoun record https://calhoun.nps.edu/handle/10945/57364 for further information on this collection and the bibliography). Due to issues of technological obsolescence preventing current and future audiences from accessing the bibliography, DKL exported and converted into the three files on this record the various databases contained in the CD-ROM. The contents of these files are: 1) RDFA_CompleteBibliography_xls.zip [RDFA_CompleteBibliography.xls: Metadata for the complete bibliography, in Excel 97-2003 Workbook format; RDFA_Glossary.xls: Glossary of terms, in Excel 97-2003 Workbook format; RDFA_Biographies.xls: Biographies of leading figures, in Excel 97-2003 Workbook format]; 2) RDFA_CompleteBibliography_csv.zip [RDFA_CompleteBibliography.TXT: Metadata for the complete bibliography, in CSV format; RDFA_Glossary.TXT: Glossary of terms, in CSV format; RDFA_Biographies.TXT: Biographies of leading figures, in CSV format]; 3) RDFA_CompleteBibliography.pdf: A human-readable display of the bibliographic data, as a means of double-checking any possible deviations due to conversion.