1,151 research outputs found

    Learning to Correct Noisy Labels for Fine-Grained Entity Typing via Co-Prediction Prompt Tuning

    Full text link
    Fine-grained entity typing (FET) is an essential task in natural language processing that aims to assign semantic types to entities in text. However, FET poses a major challenge known as the noise labeling problem, whereby current methods rely on estimating noise distribution to identify noisy labels but are confused by diverse noise distribution deviation. To address this limitation, we introduce Co-Prediction Prompt Tuning for noise correction in FET, which leverages multiple prediction results to identify and correct noisy labels. Specifically, we integrate prediction results to recall labeled labels and utilize a differentiated margin to identify inaccurate labels. Moreover, we design an optimization objective concerning divergent co-predictions during fine-tuning, ensuring that the model captures sufficient information and maintains robustness in noise identification. Experimental results on three widely-used FET datasets demonstrate that our noise correction approach significantly enhances the quality of various types of training samples, including those annotated using distant supervision, ChatGPT, and crowdsourcing.Comment: Accepted by Findings of EMNLP 2023, 11 page

    CDˆ2CR:Co-reference resolution across documents and domains

    Get PDF
    Cross-document co-reference resolution (CDCR) is the task of identifying and linking mentions to entities and concepts across many text documents. Current state-of-the-art models for this task assume that all documents are of the same type (e.g. news articles) or fall under the same theme. However, it is also desirable to perform CDCR across different domains (type or theme). A particular use case we focus on in this paper is the resolution of entities mentioned across scientific work and newspaper articles that discuss them. Identifying the same entities and corresponding concepts in both scientific articles and news can help scientists understand how their work is represented in mainstream media. We propose a new task and English language dataset for cross-document cross-domain co-reference resolution (CD2^2CR). The task aims to identify links between entities across heterogeneous document types. We show that in this cross-domain, cross-document setting, existing CDCR models do not perform well and we provide a baseline model that outperforms current state-of-the-art CDCR models on CD2^2CR. Our data set, annotation tool and guidelines as well as our model for cross-document cross-domain co-reference are all supplied as open access open source resources.Comment: 9 pages, 5 figures, accepted at EACL 202

    Resolving Regular Polysemy in Named Entities

    Full text link
    Word sense disambiguation primarily addresses the lexical ambiguity of common words based on a predefined sense inventory. Conversely, proper names are usually considered to denote an ad-hoc real-world referent. Once the reference is decided, the ambiguity is purportedly resolved. However, proper names also exhibit ambiguities through appellativization, i.e., they act like common words and may denote different aspects of their referents. We proposed to address the ambiguities of proper names through the light of regular polysemy, which we formalized as dot objects. This paper introduces a combined word sense disambiguation (WSD) model for disambiguating common words against Chinese Wordnet (CWN) and proper names as dot objects. The model leverages the flexibility of a gloss-based model architecture, which takes advantage of the glosses and example sentences of CWN. We show that the model achieves competitive results on both common and proper nouns, even on a relatively sparse sense dataset. Aside from being a performant WSD tool, the model further facilitates the future development of the lexical resource

    Entity Linking for the Biomedical Domain

    Get PDF
    Entity linking is the process of detecting mentions of different concepts in text documents and linking them to canonical entities in a target lexicon. However, one of the biggest issues in entity linking is the ambiguity in entity names. The ambiguity is an issue that many text mining tools have yet to address since different names can represent the same thing and every mention could indicate a different thing. For instance, search engines that rely on heuristic string matches frequently return irrelevant results, because they are unable to satisfactorily resolve ambiguity. Thus, resolving named entity ambiguity is a crucial step in entity linking. To solve the problem of ambiguity, this work proposes a heuristic method for entity recognition and entity linking over the biomedical knowledge graph concerning the semantic similarity of entities in the knowledge graph. Named entity recognition (NER), relation extraction (RE), and relationship linking make up a conventional entity linking (EL) system pipeline (RL). We have used the accuracy metric in this thesis. Therefore, for each identified relation or entity, the solution comprises identifying the correct one and matching it to its corresponding unique CUI in the knowledge base. Because KBs contain a substantial number of relations and entities, each with only one natural language label, the second phase is directly dependent on the accuracy of the first. The framework developed in this thesis enables the extraction of relations and entities from the text and their mapping to the associated CUI in the UMLS knowledge base. This approach derives a new representation of the knowledge base that lends it to the easy comparison. Our idea to select the best candidates is to build a graph of relations and determine the shortest path distance using a ranking approach. We test our suggested approach on two well-known benchmarks in the biomedical field and show that our method exceeds the search engine's top result and provides us with around 4% more accuracy. In general, when it comes to fine-tuning, we notice that entity linking contains subjective characteristics and modifications may be required depending on the task at hand. The performance of the framework is evaluated based on a Python implementation

    Framework for Enhanced Ontology Alignment using BERT-Based

    Get PDF
    This framework combines a few approaches to improve ontology alignment by using the data mining method with BERT. The method utilizes data mining techniques to identify the optimal characteristics for picking the data attributes of instances to match ontologies. Furthermore, this framework was developed to improve current precision and recall measures for ontology matching techniques. Since knowledge integration began, the main requirement for ontology alignment has always been syntactic and structural matching. This article presents a new approach that employs advanced methods like data mining and BERT embeddings to produce more expansive and contextually aware ontology alignment. The proposed system exploits contextual representation of BERT, semantic understanding, feature extraction, and pattern recognition through data mining techniques. The objective is to combine data-driven insights with semantic representation advantages to enhance accuracy and efficiency in the ontology alignment process. The evaluation conducted using annotated datasets as well as traditional approaches demonstrates how effective and adaptable, according to domains, our proposed framework is across several domains

    Efficient Methods for Knowledge Base Construction and Query

    Full text link
    Recently, knowledge bases have been widely used in search engines, question-answering systems, and many other applications. The abundant entity profiles and relational information in knowledge bases help the downstream applications learn more about the user queries. However, in automated knowledge base construction, ambiguity in data sources is one of the main challenges. Given a constructed knowledge base, it is hard to efficiently find entities of interest and extract their relatedness information from the knowledge base due to its large capacity. In this thesis, we adopt natural language processing tools, machine learning and graph/text query techniques to deal with such challenges. First, we introduce a machine-learning based framework for efficient entity linking to deal with the ambiguity issue in documents. For entity linking, deep-learning-based methods have outperformed traditional machine-learning-based ones but demand a large amount of data and have a high cost on the training time. We propose a lightweight, customisable and time-efficient method, which is based on traditional machine learning techniques. Our approach achieves comparable performances to the state-of-the-art deep learning-based ones while being significantly faster to train. Second, we adopt deep learning to deal with the Entity Resolution (ER) problem, which aims to reduce the data ambiguity in structural data sources. The existing BERT-based method has set new state-of-the-art performance on the ER task, but it suffers from the high computational cost due to the large cardinality to match. We propose to use Bert in a siamese network to encode the entities separately and adopt the blocking-matching scheme in a multi-task learning framework. The blocking module filters out candidate entity pairs that are unlikely to be matched, while the matching module uses an enhanced alignment network to decide if a pair is a match. Experiments show that our approach outperforms state-of-the-art models in both efficiency and effectiveness. Third, we proposed a flexible Query auto-completion (QAC) framework to support efficient error-tolerant QAC for entity queries in the knowledge base. Most existing works overlook the quality of the suggested completions, and the efficiency needs to be improved. Our framework is designed on the basis of a noisy channel model, which consists of a language model and an error model. Thus, many QAC ranking methods and spelling correction methods can be easily plugged into the framework. To address the efficiency issue, we devise a neighbourhood generation method accompanied by a trie index to quickly find candidates for the error model. The experiments show that our method improves the state of the art of error-tolerant QAC. Last but not least, we designed a visualisation system to facilitate efficient relatedness queries in a large-scale knowledge graph. Given a pair of entities, we aim to efficiently extract a succinct sub-graph to explain the relatedness of the pair of entities. Existing methods, either graph-based or list-based, all have some limitations when dealing with large complex graphs. We propose to use Bi-simulation to summarise the sub-graph, where semantically similar entities are combined. Our method exhibits the most prominent patterns while keeping them in an integrated graph

    The SourceData-NLP dataset: integrating curation into scientific publishing for training large language models

    Full text link
    Introduction: The scientific publishing landscape is expanding rapidly, creating challenges for researchers to stay up-to-date with the evolution of the literature. Natural Language Processing (NLP) has emerged as a potent approach to automating knowledge extraction from this vast amount of publications and preprints. Tasks such as Named-Entity Recognition (NER) and Named-Entity Linking (NEL), in conjunction with context-dependent semantic interpretation, offer promising and complementary approaches to extracting structured information and revealing key concepts. Results: We present the SourceData-NLP dataset produced through the routine curation of papers during the publication process. A unique feature of this dataset is its emphasis on the annotation of bioentities in figure legends. We annotate eight classes of biomedical entities (small molecules, gene products, subcellular components, cell lines, cell types, tissues, organisms, and diseases), their role in the experimental design, and the nature of the experimental method as an additional class. SourceData-NLP contains more than 620,000 annotated biomedical entities, curated from 18,689 figures in 3,223 papers in molecular and cell biology. We illustrate the dataset's usefulness by assessing BioLinkBERT and PubmedBERT, two transformers-based models, fine-tuned on the SourceData-NLP dataset for NER. We also introduce a novel context-dependent semantic task that infers whether an entity is the target of a controlled intervention or the object of measurement. Conclusions: SourceData-NLP's scale highlights the value of integrating curation into publishing. Models trained with SourceData-NLP will furthermore enable the development of tools able to extract causal hypotheses from the literature and assemble them into knowledge graphs

    A Survey on Arabic Named Entity Recognition: Past, Recent Advances, and Future Trends

    Full text link
    As more and more Arabic texts emerged on the Internet, extracting important information from these Arabic texts is especially useful. As a fundamental technology, Named entity recognition (NER) serves as the core component in information extraction technology, while also playing a critical role in many other Natural Language Processing (NLP) systems, such as question answering and knowledge graph building. In this paper, we provide a comprehensive review of the development of Arabic NER, especially the recent advances in deep learning and pre-trained language model. Specifically, we first introduce the background of Arabic NER, including the characteristics of Arabic and existing resources for Arabic NER. Then, we systematically review the development of Arabic NER methods. Traditional Arabic NER systems focus on feature engineering and designing domain-specific rules. In recent years, deep learning methods achieve significant progress by representing texts via continuous vector representations. With the growth of pre-trained language model, Arabic NER yields better performance. Finally, we conclude the method gap between Arabic NER and NER methods from other languages, which helps outline future directions for Arabic NER.Comment: Accepted by IEEE TKD
    • …
    corecore