3,530 research outputs found

    Chinese Named Entity Recognition Method for Domain-Specific Text

    Get PDF
    The Chinese named entity recognition (NER) is a critical task in natural language processing, aiming at identifying and classifying named entities in text. However, the specificity of domain texts and the lack of large-scale labelled datasets have led to the poor performance of NER methods trained on public domain corpora on domain texts. In this paper, a named entity recognition method incorporating sentence semantic information is proposed, mainly by adaptively incorporating sentence semantic information into character semantic information through an attention mechanism and a gating mechanism to enhance entity feature representation while attenuating the noise generated by irrelevant character information. In addition, to address the lack of large-scale labelled samples, we used data self-augmentation methods to expand the training samples. Furthermore, we introduced a Weighted Strategy considering that the low-quality samples generated by the data self-augmentation process can have a negative impact on the model. Experiments on the TCM prescriptions corpus showed that the F1 values of our method outperformed the comparison methods

    NLP and ML Methods for Pre-processing, Clustering and Classification of Technical Logbook Datasets

    Get PDF
    Technical logbooks are a challenging and under-explored text type in automated event identification. These texts are typically short and written in non-standard yet technical language, posing challenges to off-the-shelf NLP pipelines. These datasets typically represent a domain (a technical field such as automotive) and an application (e.g., maintenance). The granularity of issue types described in these datasets additionally leads to class imbalance, making it challenging for models to accurately predict which issue each logbook entry describes. In this research, we focus on the problem of technical issue pre-processing, clustering, and classification by considering logbook datasets from the automotive, aviation, and facility maintenance domains. We developed MaintNet, a collaborative open source library including logbook datasets from various domains and a pre-processing pipeline to clean unstructured datasets. Additionally, we adapted a feedback loop strategy from computer vision for handling extreme class imbalance, which resamples the training data based on its error in the prediction process. We further investigated the benefits of using transfer learning from sources within the same domain (but different applications), from within the same application (but different domains), and from all available data to improve the performance of the classification models. Finally, we evaluated several data augmentation approaches including synonym replacement, random swap, and random deletion to address the issue of data scarcity in technical logbooks

    Natural Language Processing: Emerging Neural Approaches and Applications

    Get PDF
    This Special Issue highlights the most recent research being carried out in the NLP field to discuss relative open issues, with a particular focus on both emerging approaches for language learning, understanding, production, and grounding interactively or autonomously from data in cognitive and neural systems, as well as on their potential or real applications in different domains

    Finding Answers to Definition Questions Using Web Knowledge Bases

    Get PDF
    PACLIC 23 / City University of Hong Kong / 3-5 December 200

    Theory and Applications for Advanced Text Mining

    Get PDF
    Due to the growth of computer technologies and web technologies, we can easily collect and store large amounts of text data. We can believe that the data include useful knowledge. Text mining techniques have been studied aggressively in order to extract the knowledge from the data since late 1990s. Even if many important techniques have been developed, the text mining research field continues to expand for the needs arising from various application fields. This book is composed of 9 chapters introducing advanced text mining techniques. They are various techniques from relation extraction to under or less resourced language. I believe that this book will give new knowledge in the text mining field and help many readers open their new research fields

    Ontology Enrichment from Free-text Clinical Documents: A Comparison of Alternative Approaches

    Get PDF
    While the biomedical informatics community widely acknowledges the utility of domain ontologies, there remain many barriers to their effective use. One important requirement of domain ontologies is that they achieve a high degree of coverage of the domain concepts and concept relationships. However, the development of these ontologies is typically a manual, time-consuming, and often error-prone process. Limited resources result in missing concepts and relationships, as well as difficulty in updating the ontology as domain knowledge changes. Methodologies developed in the fields of Natural Language Processing (NLP), Information Extraction (IE), Information Retrieval (IR), and Machine Learning (ML) provide techniques for automating the enrichment of ontology from free-text documents. In this dissertation, I extended these methodologies into biomedical ontology development. First, I reviewed existing methodologies and systems developed in the fields of NLP, IR, and IE, and discussed how existing methods can benefit the development of biomedical ontologies. This previously unconducted review was published in the Journal of Biomedical Informatics. Second, I compared the effectiveness of three methods from two different approaches, the symbolic (the Hearst method) and the statistical (the Church and Lin methods), using clinical free-text documents. Third, I developed a methodological framework for Ontology Learning (OL) evaluation and comparison. This framework permits evaluation of the two types of OL approaches that include three OL methods. The significance of this work is as follows: 1) The results from the comparative study showed the potential of these methods for biomedical ontology enrichment. For the two targeted domains (NCIT and RadLex), the Hearst method revealed an average of 21% and 11% new concept acceptance rates, respectively. The Lin method produced a 74% acceptance rate for NCIT; the Church method, 53%. As a result of this study (published in the Journal of Methods of Information in Medicine), many suggested candidates have been incorporated into the NCIT; 2) The evaluation framework is flexible and general enough that it can analyze the performance of ontology enrichment methods for many domains, thus expediting the process of automation and minimizing the likelihood that key concepts and relationships would be missed as domain knowledge evolves
    corecore