Back Attention Knowledge Transfer for Low-Resource Named Entity Recognition

Authors: Chen Huanhuan, Liu Yong, Lyu Shengfei, Miao Chunyan, Sun Linghao, Yi Huixiong
Publication date: 18/06/2021

In recent years, great success has been achieved in the field of natural language processing (NLP), thanks in part to the considerable amount of annotated resources. For named entity recognition (NER), most languages do not have such an abundance of labeled data as English, so performance on those languages is relatively lower. To improve performance, we propose a general approach called Back Attention Network (BAN). BAN uses a translation system to translate sentences in other languages into English and then applies a new mechanism named back attention knowledge transfer to obtain task-specific information from a pre-trained high-resource-language NER model. This strategy can transfer high-layer features of a well-trained model and enrich the semantic representations of the original language. Experiments on datasets in three different languages indicate that the proposed approach outperforms other state-of-the-art methods.

arXiv.org e-Print Archive

Linguistically-Informed Neural Architectures for Lexical, Syntactic and Semantic Tasks in Sanskrit

Author: Sandhan Jivnesh
Publication date: 17/08/2023

The primary focus of this thesis is to make Sanskrit manuscripts more accessible to the end-users through natural language technologies. The morphological richness, compounding, free word order, and low-resource nature of Sanskrit pose significant challenges for developing deep learning solutions. We identify four fundamental tasks, which are crucial for developing a robust NLP technology for Sanskrit: word segmentation, dependency parsing, compound type identification, and poetry analysis. The first task, Sanskrit Word Segmentation (SWS), is a fundamental text processing task for any downstream application. However, it is challenging due to the sandhi phenomenon that modifies characters at word boundaries. Similarly, the existing dependency parsing approaches struggle with morphologically rich and low-resource languages like Sanskrit. Compound type identification is also challenging for Sanskrit due to the context-sensitive semantic relation between components. All these challenges result in sub-optimal performance in NLP applications like question answering and machine translation. Finally, Sanskrit poetry has not been extensively studied in computational linguistics. While addressing these challenges, this thesis makes various contributions: (1) The thesis proposes linguistically-informed neural architectures for these tasks. (2) We showcase the interpretability and multilingual extension of the proposed systems. (3) Our proposed systems report state-of-the-art performance. (4) Finally, we present a neural toolkit named SanskritShala, a web-based application that provides real-time analysis of input for various NLP tasks. Overall, this thesis contributes to making Sanskrit manuscripts more accessible by developing robust NLP technology and releasing various resources, datasets, and a web-based toolkit. Comment: Ph.D. dissertatio

arXiv.org e-Print Archive

Ontology-based methodology for error detection in software design

Author: Hoss Allyson M.
Publication venue: LSU Digital Commons
Publication date: 01/01/2006

Improving the quality of a software design with the goal of producing a high-quality software product continues to grow in importance due to the costs that result from poorly designed software. It is commonly accepted that multiple design views are required in order to clearly specify the required functionality of software. There is universal agreement as to the importance of identifying inconsistencies early in the software design process, but the challenge is how to reconcile the representations of the diverse views to ensure consistency. To address the problem of inconsistencies that occur across multiple design views, this research introduces the Methodology for Objects to Agents (MOA). MOA utilizes a new ontology, the Ontology for Software Specification and Design (OSSD), as a common information model to integrate specification knowledge and design knowledge in order to facilitate the interoperability of formal requirements modeling tools and design tools, with the end goal of detecting inconsistency errors in a design. The methodology, which transforms designs represented using the Unified Modeling Language (UML) into representations written in formal agent-oriented modeling languages, integrates object-oriented concepts and agent-oriented concepts in order to take advantage of the benefits that both approaches can provide. The OSSD model is a hierarchical decomposition of software development concepts, including ontological constructs of objects, attributes, behavior, relations, states, transitions, goals, constraints, and plans. The methodology includes a consistency checking process that defines a consistency framework and an Inter-View Inconsistency Detection technique. MOA enhances software design quality by integrating multiple software design views, integrating object-oriented and agent-oriented concepts, and defining an error detection method that associates rules with ontological properties.

Louisiana State University

Evaluating Pre-training Objectives for Low-Resource Translation into Morphologically Rich Languages

Authors: Bisazza Arianna, Dhar Prajit, van Noord Gertjan
Publication venue: European Language Resources Association (ELRA)
Publication date: 01/06/2022

The scarcity of parallel data is a major limitation for Neural Machine Translation (NMT) systems, in particular for translation into morphologically rich languages (MRLs). An important way to overcome the lack of parallel data is to leverage target monolingual data, which is typically more abundant and easier to collect. We evaluate a number of techniques to achieve this, ranging from back-translation to random token masking, on the challenging task of translating English into four typologically diverse MRLs, under low-resource settings. Additionally, we introduce Inflection Pre-Training (or PT-Inflect), a novel pre-training objective whereby the NMT system is pre-trained on the task of re-inflecting lemmatized target sentences before being trained on standard source-to-target language translation. We conduct our evaluation on four typologically diverse target MRLs, and find that PT-Inflect surpasses NMT systems trained only on parallel data. While PT-Inflect is outperformed by back-translation overall, combining the two techniques leads to gains in some of the evaluated language pairs.

Proceedings - University of Groningen

University of Groningen

ARTS repository - University of Groningen

Dissertations of the University of Groningen

Theory and Applications for Advanced Text Mining

Publication venue: IntechOpen
Publication date: 20/04/2021

Thanks to the growth of computer and web technologies, we can easily collect and store large amounts of text data. These data can be expected to contain useful knowledge. Text mining techniques have been studied intensively since the late 1990s in order to extract that knowledge. Although many important techniques have been developed, the text mining research field continues to expand to meet the needs arising from various application fields. This book is composed of 9 chapters introducing advanced text mining techniques, ranging from relation extraction to techniques for under-resourced or less-resourced languages. I believe that this book will give new knowledge in the text mining field and help many readers open their new research fields.

Directory of Open Access Books (DOAB)

NLP-Based Techniques for Cyber Threat Intelligence

Authors: A. Rafidha Rehiman K., Arazzi Marco, Arikkat Dincy R., Conti Mauro, Nicolazzo Serena, Nocera Antonino, P. Vinod
Publication date: 15/11/2023

In the digital era, threat actors employ sophisticated techniques for which digital traces in the form of textual data are often available. Cyber Threat Intelligence (CTI) covers all solutions for data collection, processing, and analysis that help to understand a threat actor's targets and attack behavior. Currently, CTI is assuming an increasingly crucial role in identifying and mitigating threats and enabling proactive defense strategies. In this context, NLP, a branch of artificial intelligence, has emerged as a powerful tool for enhancing threat intelligence capabilities. This survey paper provides a comprehensive overview of NLP-based techniques applied in the context of threat intelligence. It begins by describing the foundational definitions and principles of CTI as a major tool for safeguarding digital assets. It then undertakes a thorough examination of NLP-based techniques for CTI data crawling from Web sources, CTI data analysis, Relation Extraction from cybersecurity data, CTI sharing and collaboration, and security threats of CTI. Finally, the challenges and limitations of NLP in threat intelligence are exhaustively examined, including data quality issues and ethical considerations. This survey draws a complete framework and serves as a valuable resource for security professionals and researchers seeking to understand the state-of-the-art NLP-based threat intelligence techniques and their potential impact on cybersecurity.

arXiv.org e-Print Archive