2,695 research outputs found

    Back Attention Knowledge Transfer for Low-Resource Named Entity Recognition

    Full text link
    In recent years, great success has been achieved in the field of natural language processing (NLP), thanks in part to the considerable amount of annotated resources. For named entity recognition (NER), most languages do not have such an abundance of labeled data as English, so the performances of those languages are relatively lower. To improve the performance, we propose a general approach called Back Attention Network (BAN). BAN uses a translation system to translate other language sentences into English and then applies a new mechanism named back attention knowledge transfer to obtain task-specific information from pre-trained high-resource languages NER model. This strategy can transfer high-layer features of well-trained model and enrich the semantic representations of the original language. Experiments on three different language datasets indicate that the proposed approach outperforms other state-of-the-art methods

    Linguistically-Informed Neural Architectures for Lexical, Syntactic and Semantic Tasks in Sanskrit

    Full text link
    The primary focus of this thesis is to make Sanskrit manuscripts more accessible to the end-users through natural language technologies. The morphological richness, compounding, free word orderliness, and low-resource nature of Sanskrit pose significant challenges for developing deep learning solutions. We identify four fundamental tasks, which are crucial for developing a robust NLP technology for Sanskrit: word segmentation, dependency parsing, compound type identification, and poetry analysis. The first task, Sanskrit Word Segmentation (SWS), is a fundamental text processing task for any other downstream applications. However, it is challenging due to the sandhi phenomenon that modifies characters at word boundaries. Similarly, the existing dependency parsing approaches struggle with morphologically rich and low-resource languages like Sanskrit. Compound type identification is also challenging for Sanskrit due to the context-sensitive semantic relation between components. All these challenges result in sub-optimal performance in NLP applications like question answering and machine translation. Finally, Sanskrit poetry has not been extensively studied in computational linguistics. While addressing these challenges, this thesis makes various contributions: (1) The thesis proposes linguistically-informed neural architectures for these tasks. (2) We showcase the interpretability and multilingual extension of the proposed systems. (3) Our proposed systems report state-of-the-art performance. (4) Finally, we present a neural toolkit named SanskritShala, a web-based application that provides real-time analysis of input for various NLP tasks. Overall, this thesis contributes to making Sanskrit manuscripts more accessible by developing robust NLP technology and releasing various resources, datasets, and web-based toolkit.Comment: Ph.D. dissertatio

    Ontology-based methodology for error detection in software design

    Get PDF
    Improving the quality of a software design with the goal of producing a high quality software product continues to grow in importance due to the costs that result from poorly designed software. It is commonly accepted that multiple design views are required in order to clearly specify the required functionality of software. There is universal agreement as to the importance of identifying inconsistencies early in the software design process, but the challenge is how to reconcile the representations of the diverse views to ensure consistency. To address the problem of inconsistencies that occur across multiple design views, this research introduces the Methodology for Objects to Agents (MOA). MOA utilizes a new ontology, the Ontology for Software Specification and Design (OSSD), as a common information model to integrate specification knowledge and design knowledge in order to facilitate the interoperability of formal requirements modeling tools and design tools, with the end goal of detecting inconsistency errors in a design. The methodology, which transforms designs represented using the Unified Modeling Language (UML) into representations written in formal agent-oriented modeling languages, integrates object-oriented concepts and agent-oriented concepts in order to take advantage of the benefits that both approaches can provide. The OSSD model is a hierarchical decomposition of software development concepts, including ontological constructs of objects, attributes, behavior, relations, states, transitions, goals, constraints, and plans. The methodology includes a consistency checking process that defines a consistency framework and an Inter-View Inconsistency Detection technique. MOA enhances software design quality by integrating multiple software design views, integrating object-oriented and agent-oriented concepts, and defining an error detection method that associates rules with ontological properties

    Evaluating Pre-training Objectives for Low-Resource Translation into Morphologically Rich Languages

    Get PDF
    The scarcity of parallel data is a major limitation for Neural Machine Translation (NMT) systems, in particular for translation into morphologically rich languages (MRLs). An important way to overcome the lack of parallel data is to leverage target monolingual data, which is typically more abundant and easier to collect. We evaluate a number of techniques to achieve this, ranging from back-translation to random token masking, on the challenging task of translating English into four typologically diverse MRLs, under low-resource settings. Additionally, we introduce Inflection Pre-Training (or PT-Inflect), a novelpre-training objective whereby the NMT system is pre-trained on the task of re-inflecting lemmatized target sentences before being trained on standard source-to-target language translation. We conduct our evaluation on four typologically diverse target MRLs, and find that PT-Inflect surpasses NMT systems trained only on parallel data. While PT-Inflect is outperformed by back-translation overall, combining the two techniques leads to gains in some of the evaluated language pairs

    Theory and Applications for Advanced Text Mining

    Get PDF
    Due to the growth of computer technologies and web technologies, we can easily collect and store large amounts of text data. We can believe that the data include useful knowledge. Text mining techniques have been studied aggressively in order to extract the knowledge from the data since late 1990s. Even if many important techniques have been developed, the text mining research field continues to expand for the needs arising from various application fields. This book is composed of 9 chapters introducing advanced text mining techniques. They are various techniques from relation extraction to under or less resourced language. I believe that this book will give new knowledge in the text mining field and help many readers open their new research fields

    NLP-Based Techniques for Cyber Threat Intelligence

    Full text link
    In the digital era, threat actors employ sophisticated techniques for which, often, digital traces in the form of textual data are available. Cyber Threat Intelligence~(CTI) is related to all the solutions inherent to data collection, processing, and analysis useful to understand a threat actor's targets and attack behavior. Currently, CTI is assuming an always more crucial role in identifying and mitigating threats and enabling proactive defense strategies. In this context, NLP, an artificial intelligence branch, has emerged as a powerful tool for enhancing threat intelligence capabilities. This survey paper provides a comprehensive overview of NLP-based techniques applied in the context of threat intelligence. It begins by describing the foundational definitions and principles of CTI as a major tool for safeguarding digital assets. It then undertakes a thorough examination of NLP-based techniques for CTI data crawling from Web sources, CTI data analysis, Relation Extraction from cybersecurity data, CTI sharing and collaboration, and security threats of CTI. Finally, the challenges and limitations of NLP in threat intelligence are exhaustively examined, including data quality issues and ethical considerations. This survey draws a complete framework and serves as a valuable resource for security professionals and researchers seeking to understand the state-of-the-art NLP-based threat intelligence techniques and their potential impact on cybersecurity
    • …
    corecore