68,038 research outputs found

    Find the Funding: Entity Linking with Incomplete Funding Knowledge Bases

    Full text link
    Automatic extraction of funding information from academic articles adds significant value to industry and research communities, such as tracking research outcomes by funding organizations, profiling researchers and universities based on the received funding, and supporting open access policies. Two major challenges of identifying and linking funding entities are: (i) sparse graph structure of the Knowledge Base (KB), which makes the commonly used graph-based entity linking approaches suboptimal for the funding domain, (ii) missing entities in KB, which (unlike recent zero-shot approaches) requires marking entity mentions without KB entries as NIL. We propose an entity linking model that can perform NIL prediction and overcome data scarcity issues in a time and data-efficient manner. Our model builds on a transformer-based mention detection and bi-encoder model to perform entity linking. We show that our model outperforms strong existing baselines.Comment: Accepted at COLING 202

    Mining and Leveraging Background Knowledge for Improving Named Entity Linking

    Get PDF
    Knowledge-rich Information Extraction (IE) methods aspire towards combining classical IE with background knowledge obtained from third-party resources. Linked Open Data repositories that encode billions of machine readable facts from sources such as Wikipedia play a pivotal role in this development. The recent growth of Linked Data adoption for Information Extraction tasks has shed light on many data quality issues in these data sources that seriously challenge their usefulness such as completeness, timeliness and semantic correctness. Information Extraction methods are, therefore, faced with problems such as name variance and type confusability. If multiple linked data sources are used in parallel, additional concerns regarding link stability and entity mappings emerge. This paper develops methods for integrating Linked Data into Named Entity Linking methods and addresses challenges in regard to mining knowledge from Linked Data, mitigating data quality issues, and adapting algorithms to leverage this knowledge. Finally, we apply these methods to Recognyze, a graph-based Named Entity Linking (NEL) system, and provide a comprehensive evaluation which compares its performance to other well-known NEL systems, demonstrating the impact of the suggested methods on its own entity linking performance

    EAL: A toolkit and dataset for entity-aspect linking

    Full text link
    We present a toolkit and dataset for entity-aspect linking. The tool takes as input a sentence and provides the most relevant aspect for each mentioned entity; it is implemented in Python and available as a script and via an online demo. It is accompanied by the first large dataset of entity-aspects, comprising more than 20,000 entities manually linked to the most relevant aspect, given a sentence as context. Each is expressed in structured manner as Open Information Extraction (OIE) triples (Subject, Relation, Object), having semantic information for polarity, modality, quantity and attributions

    Entity Linking to Wikipedia : Grounding entity mentions in natural language text using thematic context distance and collective search

    Get PDF
    This thesis proposes new methods for entity linking in natural language text that assigns entity mentions in unstructured natural language text to the semi-structured encyclopedia Wikipedia. Doing so, entity linking grounds a mention to an encyclopedic entry in Wikipedia and embeds it into this Linked-Open-Data hub. This enables a higher level view on single documents, provides hints for further reading and may be used to add details from other sources. Furthermore, enriching text documents with such links simultaneously resolves the ambiguity of entity names. This ambiguity is an unsolved challenge for many text mining applications: one entity may be designated by a multitude of names and every mention may denote a multitude of entities. Resolving the ambiguity of entity names is thus a crucial step for entity based retrieval, an open problem for most information retrieval and extraction tasks. For instance, search engines relying on heuristic string matches often retrieve irrelevant results as they can not satisfyingly resolve ambiguity. Moreover, there is a huge number of entity mentions that can not be linked to Wikipedia since albeit of its size, Wikipedia has a restricted coverage. Earlier and current work often ignored this and consequently all mentions of uncovered entities. Other approaches handle only entity mentions of specific types or are focussed on English as target language. Apart from such restrictions, no method achieves perfect linking performance. These are the tasks approached in this thesis. We introduce new methods for candidate entity retrieval and candidate entity consolidation, the key components to recall and precision, exploiting both the vast amount of structured and unstructured information stored in Wikipedia. First, we propose a new contextual similarity measure based on latent topic distributions inferred from unstructured natural language text. We show that this thematic distance between mention and candidate entity contexts yields a lower linking error rate than purely word based distances. Being language independent, this method enables high performance entity linking in previously neglected languages such as German and French. This approach is especially suitable, albeit not restricted to link person names, the class of mentions with highest ambiguity. We next propose a new candidate retrieval method to enable successful entity linking also for other entities that are not referenced canonically or exhibit the thematic coherence of persons. We introduce collective search that uses the structured information encoded in Wikipedia’s hyperlink graph to arrive at sets of strongly related candidate entities. This enables us to better handle synonymy, one of the hardest problems in entity linking and not thoroughly treated in previous work. We emphasize on general applicability and evaluate this method on a broad collection of benchmark corpora both in a supervised as well as in an unsupervised setting. We show that candidate enhancement through collective search increases linking performance on nearly all of these corpora and that our method is the most stable compared to other state-of-the-art approaches. Presenting the first unification of diverse performance measures, we also make a step forward to the comparability of entity linking methods. In conclusion, we provide state-of-the-art entity linking methods for nearly all of the current use cases. When it comes to fine-tuning, we note that entity linking has subjective aspects and adaptions may be necessary depending on the task at hand

    HarriGT:A Tool for Linking News to Science

    Get PDF
    Being able to reliably link scientific works to the newspaper articles that discuss them could provide a breakthrough in the way we rationalise and measure the impact of science on our society. Linking these articles is challenging because the language used in the two domains is very different, and the gathering of online resources to align the two is a substantial information retrieval endeavour. We present HarriGT, a semi-automated tool for building corpora of news articles linked to the scientific papers that they discuss. Our aim is to facilitate future development of information-retrieval tools for newspaper/scientific work citation linking. HarriGT retrieves newspaper articles from an archive containing 17 years of UK web content. It also integrates with 3 large external citation networks, leveraging named entity extraction, and document classification to surface relevant examples of scientific literature to the user. We also provide a tuned candidate ranking algorithm to highlight potential links between scientific papers and newspaper articles to the user, in order of likelihood. HarriGT is provided as an open source tool (http://harrigt.xyz).preprintPeer reviewe

    Distantly Supervised Web Relation Extraction for Knowledge Base Population

    Get PDF
    Extracting information from Web pages for populating large, cross-domain knowledge bases requires methods which are suitable across domains, do not require manual effort to adapt to new domains, are able to deal with noise, and integrate information extracted from different Web pages. Recent approaches have used existing knowledge bases to learn to extract information with promising results, one of those approaches being distant supervision. Distant supervision is an unsupervised method which uses background information from the Linking Open Data cloud to automatically label sentences with relations to create training data for relation classifiers. In this paper we propose the use of distant supervision for relation extraction from the Web. Although the method is promising, existing approaches are still not suitable for Web extraction as they suffer from three main issues: data sparsity, noise and lexical ambiguity. Our approach reduces the impact of data sparsity by making entity recognition tools more robust across domains and extracting relations across sentence boundaries using unsupervised co- reference resolution methods. We reduce the noise caused by lexical ambiguity by employing statistical methods to strategically select training data. To combine information extracted from multiple sources for populating knowledge bases we present and evaluate several information integration strategies and show that those benefit immensely from additional relation mentions extracted using co-reference resolution, increasing precision by 8%. We further show that strategically selecting training data can increase precision by a further 3%

    CESI: Canonicalizing Open Knowledge Bases using Embeddings and Side Information

    Full text link
    Open Information Extraction (OpenIE) methods extract (noun phrase, relation phrase, noun phrase) triples from text, resulting in the construction of large Open Knowledge Bases (Open KBs). The noun phrases (NPs) and relation phrases in such Open KBs are not canonicalized, leading to the storage of redundant and ambiguous facts. Recent research has posed canonicalization of Open KBs as clustering over manuallydefined feature spaces. Manual feature engineering is expensive and often sub-optimal. In order to overcome this challenge, we propose Canonicalization using Embeddings and Side Information (CESI) - a novel approach which performs canonicalization over learned embeddings of Open KBs. CESI extends recent advances in KB embedding by incorporating relevant NP and relation phrase side information in a principled manner. Through extensive experiments on multiple real-world datasets, we demonstrate CESI's effectiveness.Comment: Accepted at WWW 201
    corecore