
    NLP-Based Techniques for Cyber Threat Intelligence

    In the digital era, threat actors employ sophisticated techniques for which digital traces, often in the form of textual data, are available. Cyber Threat Intelligence (CTI) covers the solutions for data collection, processing, and analysis that are useful to understand a threat actor's targets and attack behavior. CTI currently plays an increasingly crucial role in identifying and mitigating threats and enabling proactive defense strategies. In this context, NLP, a branch of artificial intelligence, has emerged as a powerful tool for enhancing threat intelligence capabilities. This survey paper provides a comprehensive overview of NLP-based techniques applied in the context of threat intelligence. It begins by describing the foundational definitions and principles of CTI as a major tool for safeguarding digital assets. It then undertakes a thorough examination of NLP-based techniques for CTI data crawling from Web sources, CTI data analysis, relation extraction from cybersecurity data, CTI sharing and collaboration, and security threats to CTI. Finally, the challenges and limitations of NLP in threat intelligence are examined in depth, including data quality issues and ethical considerations. This survey draws a complete framework and serves as a valuable resource for security professionals and researchers seeking to understand state-of-the-art NLP-based threat intelligence techniques and their potential impact on cybersecurity.
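
    To make one of the surveyed techniques concrete, below is a minimal sketch of rule-based indicator-of-compromise (IOC) extraction from threat-report text, a common first step in CTI data analysis. The patterns and the sample report are illustrative assumptions, not the survey's own method; production pipelines typically combine such rules with trained NER models.

        import re

        # Illustrative IOC patterns (assumptions for this sketch); real CTI
        # pipelines use richer grammars and trained NER models.
        IOC_PATTERNS = {
            "ipv4":   re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
            "md5":    re.compile(r"\b[a-fA-F0-9]{32}\b"),
            "sha256": re.compile(r"\b[a-fA-F0-9]{64}\b"),
            # Naive: this also matches IP addresses, filtered out below.
            "domain": re.compile(r"\b[a-z0-9-]+(?:\.[a-z0-9-]+)+\b"),
        }

        def extract_iocs(report):
            """Return all IOC candidates found in free-text report prose."""
            found = {kind: set(pat.findall(report)) for kind, pat in IOC_PATTERNS.items()}
            found["domain"] -= found["ipv4"]  # drop IPs caught by the domain pattern
            return found

        report = ("Beacon traffic to 203.0.113.7 and evil-cdn.example was "
                  "observed before 44d88612fea8a8f36de82e1278abb02f executed.")
        print(extract_iocs(report))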

    Distribution of Terms Across Genres in the Annotated Lithuanian Cybersecurity Corpus

    The paper presents the results of a frequency distribution analysis of cybersecurity terms in a Lithuanian cybersecurity corpus composed of texts of different genres. The research focuses on the following aspects: the overall distribution of cybersecurity terms (their density and diversity) across genres, the distribution of English and English-Lithuanian terms and their usage patterns in Lithuanian sentences, and, finally, the most frequent cybersecurity terms and their thematic groups in each genre. The research was performed in several stages: compilation of a cybersecurity corpus and its subdivision into genre-specific subcorpora, manual annotation of cybersecurity terms, automatic lemmatisation of the annotated terms, and, finally, quantitative analysis of the distribution of the terms across the subcorpora. The results reveal similarities and differences in the use of cybersecurity terminology across genres, which are important to consider in order to get a complete picture of terminology usage trends in this domain.
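
    As a concrete illustration of the final quantitative stage, the short sketch below computes the two measures reported per genre: term density (annotated term tokens per 1,000 running words) and term diversity (distinct lemmas). The genre names, word counts, and lemma lists are invented placeholders, not the paper's data.

        # Genres, word counts, and lemma lists below are invented placeholders.
        subcorpora = {
            "news":     {"word_count": 12000,
                         "term_lemmas": ["virusas", "virusas", "slaptažodis"]},
            "academic": {"word_count": 20000,
                         "term_lemmas": ["šifravimas", "virusas", "šifravimas", "užkarda"]},
        }

        for genre, data in subcorpora.items():
            lemmas = data["term_lemmas"]
            density = 1000 * len(lemmas) / data["word_count"]   # terms per 1,000 words
            diversity = len(set(lemmas))                        # distinct lemmas
            print(f"{genre}: density = {density:.2f} terms per 1,000 words, "
                  f"diversity = {diversity} distinct lemmas")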

    Seed-Guided Fine-Grained Entity Typing in Science and Engineering Domains

    Accurately typing entity mentions from text segments is a fundamental task for various natural language processing applications. Many previous approaches rely on massive human-annotated data to perform entity typing. Nevertheless, collecting such data in highly specialized science and engineering domains (e.g., software engineering and security) can be time-consuming and costly, not to mention the domain gaps between training and inference data when the model needs to be applied to confidential datasets. In this paper, we study the task of seed-guided fine-grained entity typing in science and engineering domains, which takes the name and a few seed entities for each entity type as the only supervision and aims to classify new entity mentions into both seen and unseen types (i.e., those without seed entities). To solve this problem, we propose SEType, which first enriches the weak supervision by finding more entities for each seen type from an unlabeled corpus using the contextualized representations of pre-trained language models. It then matches the enriched entities to unlabeled text to get pseudo-labeled samples and trains a textual entailment model that can make inferences for both seen and unseen types. Extensive experiments on two datasets covering four domains demonstrate the effectiveness of SEType in comparison with various baselines. Comment: 9 pages; accepted to AAAI 2024 (code: https://github.com/yuzhimanhua/SEType).
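
    The schematic sketch below illustrates only the seed-guided intuition behind the enrichment step: represent each type by the centroid of its seed entities' embeddings and assign a new mention to the most similar type. It is not the SEType implementation from the linked repository, which additionally trains a textual entailment model; the toy 2-d vectors stand in for contextualized PLM representations.

        import numpy as np

        # Toy 2-d stand-ins for PLM embeddings; types and seeds are invented.
        seed_embeddings = {
            "programming_language": np.array([[0.9, 0.1], [0.8, 0.2]]),
            "vulnerability":        np.array([[0.1, 0.9], [0.2, 0.8]]),
        }
        # Each type is represented by the centroid of its seed embeddings.
        centroids = {t: vecs.mean(axis=0) for t, vecs in seed_embeddings.items()}

        def cosine(a, b):
            return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

        def predict_type(mention_vec):
            """Pick the type whose seed centroid is most similar to the mention."""
            return max(centroids, key=lambda t: cosine(mention_vec, centroids[t]))

        print(predict_type(np.array([0.85, 0.15])))  # -> programming_language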

    A New View on Classification of Software Vulnerability Mitigation Methods

    Software vulnerability mitigation is a well-known research area, and many methods have been proposed for it. Some papers try to classify these methods from different specific points of view. In this paper, we aggregate all proposed classifications and present a comprehensive classification of vulnerability mitigation methods. We define software vulnerability as a kind of software fault and arrange the classes of software vulnerability mitigation methods accordingly. In this paper, the software vulnerability mitigation methods are classified into vulnerability prevention, vulnerability tolerance, vulnerability removal, and vulnerability forecasting. We define each vulnerability mitigation method from our new point of view and indicate some methods for each class. Our general point of view helps to consider all of the proposed methods in this review. We also identify the fault mitigation methods that might be effective in mitigating software vulnerabilities but have not yet been applied in this area. Based on that, new directions are suggested for future research.
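
    The proposed four-class taxonomy lends itself to a simple data structure, as in the sketch below; the example methods listed under each class are illustrative assumptions, not taken from the paper.

        from enum import Enum

        # The paper's four top-level classes encoded as an enum.
        class MitigationClass(Enum):
            PREVENTION  = "vulnerability prevention"
            TOLERANCE   = "vulnerability tolerance"
            REMOVAL     = "vulnerability removal"
            FORECASTING = "vulnerability forecasting"

        # Example methods per class are illustrative guesses, not the paper's lists.
        EXAMPLE_METHODS = {
            MitigationClass.PREVENTION:  ["secure coding standards", "memory-safe languages"],
            MitigationClass.TOLERANCE:   ["sandboxing", "privilege separation"],
            MitigationClass.REMOVAL:     ["static analysis", "fuzzing", "patching"],
            MitigationClass.FORECASTING: ["vulnerability prediction models"],
        }

        for cls, methods in EXAMPLE_METHODS.items():
            print(f"{cls.value}: {', '.join(methods)}")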

    Automatically analysing large texts in a GIS environment:the Registrar General’s reports and cholera in the nineteenth century

    The aim of this article is to present new research showcasing how Geographic Information Systems, in combination with Natural Language Processing and Corpus Linguistics methods, can offer innovative avenues of research for analysing large textual collections in the Humanities, particularly in historical research. Using as examples parts of the collection of the Registrar General's Reports, which contain more than 200,000 pages of descriptions, census data, and vital statistics for the UK, we introduce newly developed automated textual tools and well-known spatial analyses, used in combination, to investigate a case study of the references made to cholera and other diseases in these historical sources and their relationship to place-names during Victorian times. The integration of such techniques has allowed us to explore, in an automatic way, this historical source containing millions of words, to examine the geographies depicted in it, and to identify textual and geographic patterns in the corpus.
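
    A minimal sketch of the kind of text-to-GIS step the article combines is shown below: finding sentences in which a disease term co-occurs with a gazetteer place name, yielding (place, disease) pairs that a GIS could then map. The tiny disease list, gazetteer, and sample text are invented; the article's actual tools (corpus annotation, geoparsing) are far more sophisticated.

        import re

        DISEASES  = {"cholera", "typhus", "smallpox"}
        GAZETTEER = {"London", "Liverpool", "Manchester"}  # would map to coordinates in a GIS

        def cooccurrences(text):
            """Return (place, disease) pairs that share a sentence."""
            pairs = []
            for sentence in re.split(r"(?<=[.!?])\s+", text):
                words = set(re.findall(r"[A-Za-z]+", sentence))
                for place in GAZETTEER & words:
                    for disease in DISEASES & {w.lower() for w in words}:
                        pairs.append((place, disease))
            return pairs

        text = "Cholera raged in London during 1849. Liverpool reported fewer deaths."
        print(cooccurrences(text))  # -> [('London', 'cholera')]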
