
    Knowledge extraction from a small corpus of unstructured safeguarding reports

    This paper presents results on the performance of a range of analysis tools for extracting entities and sentiments from a small corpus of unstructured safeguarding reports. We use sentiment analysis to identify strongly positive and strongly negative segments, in an attempt to attribute the extracted sentiment patterns to specific entities, and entity extraction to identify key entities. We evaluate tool performance against non-specialist human annotators. An initial study comparing inter-human agreement against inter-machine agreement shows higher overall scores from human annotators than from software tools. However, the degree of consensus between the human annotators for entity extraction is lower than expected, which suggests a need for trained annotators. For sentiment analysis, the annotators reached higher agreement when annotating descriptive sentences than reflective sentences, while the inter-tool agreement was similarly low for the two sentence types. The poor performance of the entity extraction and sentiment analysis approaches points to the need for domain-specific approaches for knowledge extraction on these kinds of documents. However, there is currently a lack of pre-existing ontologies in the safeguarding domain. Thus, our future focus is the development of such a domain-specific ontology.
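    The abstract compares inter-human against inter-machine agreement but does not publish its scoring code. A common choice for pairwise agreement on categorical labels is Cohen's kappa; the following pure-Python sketch (our illustration, not the authors' implementation) shows how such a score is computed from two annotators' label lists.

    ```python
    def cohen_kappa(a, b):
        """Cohen's kappa for two annotators labelling the same items."""
        assert len(a) == len(b)
        n = len(a)
        labels = set(a) | set(b)
        # observed agreement: fraction of items both annotators labelled identically
        po = sum(x == y for x, y in zip(a, b)) / n
        # expected chance agreement from each annotator's label marginals
        pe = sum((a.count(l) / n) * (b.count(l) / n) for l in labels)
        if pe == 1.0:
            return 1.0
        return (po - pe) / (1 - pe)
    ```

    With this statistic, "higher agreement for descriptive than reflective sentences" means the kappa computed over descriptive-sentence labels exceeds the one over reflective-sentence labels.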

    Extracting knowledge from complex unstructured corpora: Text classification and a case study on the safeguarding domain

    Advances in the internet and in data collection and sharing technologies have led to an increase in the amount of unstructured information in the form of news, articles, and social media. Additionally, many specialised domains such as medicine, law, and the social sciences use unstructured documents as a main platform for collecting, storing and sharing domain-specific knowledge. However, the manual processing of these documents is a resource-consuming and error-prone process. This is especially apparent when the volume of documents that need annotating constantly increases over time. Therefore, automated information extraction techniques have been widely used to efficiently analyse text and discover patterns. Specifically, text classification methods have become valuable in specialised domains for organising content, such as patient notes, and for enabling fast topic-based retrieval of information. However, many specialised domains suffer from a lack of data and from class imbalance because documents are hard to obtain. In addition, manual annotation must be performed by experts, which can be costly. This makes the application of supervised classification approaches a challenging task. In this thesis, we research methods for improving the performance of text classifiers in specialised domains with limited amounts of data and highly domain-specific terminology, where the annotation of documents is performed by domain experts. First, we study the applicability of traditional feature enhancement approaches that use publicly available resources for improving classifier performance in specialised domains. Then, we conduct extensive research into the suitability of existing classification algorithms and the importance of both domain- and task-specific data for few-shot classification, which helps identify classification strategies applicable to small datasets.
This gives the basis for the development of a methodology for improving a classifier's performance in few-shot settings using text-generation-based data augmentation techniques. Specifically, we aim to improve the quality of the generated data by using strategies for selecting class-representative samples from the original dataset to produce additional training instances. We perform extensive analysis, considering multiple strategies, datasets, and few-shot text classification settings. Our study uses a corpus of safeguarding reports as an exemplary case study of a specialised domain with a small volume of data. The safeguarding reports contain valuable information about learning experiences and reflections on tackling serious crimes involving children and vulnerable adults. They carry great potential to improve multiagency work and help develop better crime prevention strategies. However, the lack of centralised access and the constant growth of the collection make manual analysis of the reports unfeasible. Therefore, we collaborated with the Crime and Security Research Institute (CSRI) at Cardiff University on the creation of a Wales Safeguarding Repository (WSR) to provide centralised access to the safeguarding reports and means for automatic information extraction. The aim of the repository is to facilitate efficient searchability of the collection and thus help free up resources and assist practitioners from health and social care agencies in making faster and more accurate decisions. In particular, we apply methods identified in the thesis to support automated annotation of the documents using a thematic framework created by subject-matter experts. Our close work with domain experts throughout the thesis allowed us to incorporate expert knowledge into classification and augmentation techniques, which proved beneficial for the improvement of automated supervised methods for specialised domains.
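The thesis abstract mentions selecting class-representative samples to seed text-generation-based augmentation, without specifying the selection strategy. One plausible strategy (our sketch, not the author's published method) is to rank a class's samples by cosine similarity to the class's bag-of-words centroid and keep the top k as generation seeds:

```python
from collections import Counter
import math

def bow(text):
    """Bag-of-words vector as a token-count dictionary."""
    return Counter(text.lower().split())

def cosine(u, v):
    dot = sum(u[w] * v[w] for w in u if w in v)
    nu = math.sqrt(sum(c * c for c in u.values()))
    nv = math.sqrt(sum(c * c for c in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def representatives(texts, k):
    """Pick the k samples of one class closest to the class centroid."""
    vecs = [bow(t) for t in texts]
    centroid = Counter()
    for v in vecs:
        centroid.update(v)
    ranked = sorted(zip(texts, vecs),
                    key=lambda tv: cosine(tv[1], centroid), reverse=True)
    return [t for t, _ in ranked[:k]]
```

In a few-shot augmentation pipeline, the returned seeds would be handed to a text generator to produce additional training instances for that class; outliers far from the centroid are excluded so they do not steer generation off-topic.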

    NLP-Based Techniques for Cyber Threat Intelligence

    In the digital era, threat actors employ sophisticated techniques for which digital traces, often in the form of textual data, are available. Cyber Threat Intelligence (CTI) covers the solutions for data collection, processing, and analysis that are useful for understanding a threat actor's targets and attack behavior. Currently, CTI plays an increasingly crucial role in identifying and mitigating threats and enabling proactive defense strategies. In this context, natural language processing (NLP), a branch of artificial intelligence, has emerged as a powerful tool for enhancing threat intelligence capabilities. This survey paper provides a comprehensive overview of NLP-based techniques applied in the context of threat intelligence. It begins by describing the foundational definitions and principles of CTI as a major tool for safeguarding digital assets. It then undertakes a thorough examination of NLP-based techniques for CTI data crawling from Web sources, CTI data analysis, relation extraction from cybersecurity data, CTI sharing and collaboration, and security threats to CTI. Finally, the challenges and limitations of NLP in threat intelligence are exhaustively examined, including data quality issues and ethical considerations. This survey draws a complete framework and serves as a valuable resource for security professionals and researchers seeking to understand state-of-the-art NLP-based threat intelligence techniques and their potential impact on cybersecurity.
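    A recurring building block in the CTI data-analysis pipelines such surveys cover is extracting indicators of compromise (IoCs) from free text. The following regex-based sketch is our own minimal illustration of the idea (the pattern names and coverage are ours; production extractors use far richer grammars and validation):

    ```python
    import re

    # Toy patterns for three common IoC types; deliberately incomplete.
    IOC_PATTERNS = {
        "ipv4": r"\b(?:\d{1,3}\.){3}\d{1,3}\b",
        "sha256": r"\b[a-fA-F0-9]{64}\b",
        "domain": r"\b[a-z0-9-]+\.(?:com|net|org|io)\b",
    }

    def extract_iocs(text):
        """Return a dict mapping IoC type to the matches found in text."""
        found = {}
        for name, pattern in IOC_PATTERNS.items():
            hits = re.findall(pattern, text)
            if hits:
                found[name] = hits
        return found
    ```

    NLP techniques enter when plain patterns are insufficient, e.g. linking an extracted indicator to the threat actor or malware family discussed in the surrounding sentences (relation extraction).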

    Automatic preservation watch using information extraction on the Web: a case study on semantic extraction of natural language for digital preservation

    The ability to recognize when digital content is becoming endangered is essential for maintaining long-term, continuous and authentic access to digital assets. To achieve this ability, knowledge is needed about aspects of the world that might hinder the preservation of content. However, the processes of gathering, managing and reasoning on knowledge can become manually infeasible when the volume and heterogeneity of content increase, multiplying the aspects to monitor. Automation of these processes is possible [11,21], but its usefulness is limited by the data it is able to gather. Up to now, automatic digital preservation processes have been restricted to knowledge expressed in a machine-understandable language, ignoring a plethora of data expressed in natural language, such as the DPC Technology Watch Reports, which could greatly contribute to the completeness and freshness of data about aspects of the world related to digital preservation. This paper presents a real case scenario from the National Library of the Netherlands, where the monitoring of publishers and journals is needed. This knowledge is mostly represented in natural language on the publishers' Web sites and is therefore difficult to monitor automatically. In this paper, we demonstrate how we use information extraction technologies to find and extract machine-readable information on publishers and journals for ingestion into automatic digital preservation watch tools. We show that the results of automatic semantic extraction are a good complement to existing knowledge bases on publishers [9,20], finding newer and more complete data. We demonstrate the viability of the approach as an alternative or auxiliary method for automatically gathering information on preservation risks in digital content.
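    The paper's extractors turn natural-language statements on publisher Web sites into machine-readable records; as a rough illustration of that step (our toy pattern, not the paper's pipeline, which uses full semantic extraction), a single relation can be pulled out of running text like this:

    ```python
    import re

    # Toy pattern for one relation phrasing; real extractors handle many
    # paraphrases via NER and relation extraction, not a single regex.
    PATTERN = re.compile(
        r"(?P<journal>[A-Z][\w &]+) is published by (?P<publisher>[A-Z][\w &]+)")

    def extract_publishing_facts(text):
        """Return (journal, publisher) pairs found in the text."""
        return [(m.group("journal").strip(), m.group("publisher").strip())
                for m in PATTERN.finditer(text)]
    ```

    Records produced this way can then be ingested by a preservation-watch tool and compared against existing knowledge bases to spot newer or conflicting facts.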

    Unstructured data for cybersecurity and internal control

    This paper proposes a research framework for studying the connections, realized and potential, between unstructured data (UD) and cybersecurity and internal controls. In the framework, cybersecurity and internal control goals determine the tasks to be conducted. The task influences the types of UD to be accessed and the types of analysis to be done, which in turn influence the outcomes that can be achieved. Patterns in UD are relevant for cybersecurity and internal control, but UD poses unique challenges for its analysis and management. This paper discusses some of these challenges, including veracity, structuralizing, bias, and explainability.

    Novel Heuristic Recurrent Neural Network Framework to Handle Automatic Telugu Text Categorization from Handwritten Text Image

    In the near future, the digitization and processing of current paper documents will play an important role in the creation of a paperless environment. Deep learning techniques for handwriting recognition have been extensively studied by various researchers, and deep neural networks can now be trained quickly thanks to large datasets and algorithmic advances. Various methods for extracting text from handwritten manuscripts have been developed in the literature, including neural approaches such as convolutional neural networks (CNN), recurrent neural networks (RNN), and long short-term memory (LSTM) networks for extracting features from handwritten Telugu text images. To identify handwritten Telugu script automatically and efficiently, eliminating noise and other confounding features present in the text, this paper proposes a Novel Heuristic Advanced Neural Network based Telugu Text Categorization Model (NHANNTCM) based on a sequence-to-sequence feature extraction procedure. The proposed approach extracts features using an RNN and represents the Telugu text in sequence-to-sequence format; an advanced neural network then performs both encoding and decoding to identify and explore visual features from the input sequence. The classification accuracy rates for Telugu words, numerals, characters, sentences, and corresponding sentences were 99.66%, 93.63%, 91.36%, 99.05%, and 97.73% respectively. The experimental evaluation suggests applicability to tasks such as private information protection, security defense, and personal handwriting signature identification.
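    The abstract's core mechanism, an RNN that folds a sequence of per-character feature vectors into a fixed-size encoding that is then classified, can be sketched in a few lines of pure Python. This is our minimal illustration of a generic Elman-style RNN read-out, not the NHANNTCM architecture itself:

    ```python
    import math

    def rnn_encode(seq, Wx, Wh, b):
        """Run a minimal Elman RNN over a sequence of feature vectors,
        returning the final hidden state as a fixed-size encoding."""
        h = [0.0] * len(b)
        for x in seq:
            h = [math.tanh(sum(Wx[i][j] * x[j] for j in range(len(x))) +
                           sum(Wh[i][k] * h[k] for k in range(len(h))) + b[i])
                 for i in range(len(b))]
        return h

    def classify(h, Wo):
        """Linear read-out: score each class and return the argmax index."""
        scores = [sum(w * hi for w, hi in zip(row, h)) for row in Wo]
        return scores.index(max(scores))
    ```

    A real system would learn Wx, Wh, b and Wo by backpropagation over labelled Telugu samples; the paper additionally wraps such an encoder in an encoder-decoder (sequence-to-sequence) arrangement.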
