243 research outputs found

    Named Entity Extraction for Knowledge Graphs: A Literature Overview

    An enormous amount of digital information is expressed as natural-language (NL) text that is not easily processable by computers. Knowledge Graphs (KG) offer a widely used format for representing information in computer-processable form. Natural Language Processing (NLP) is therefore needed for mining (or lifting) knowledge graphs from NL texts. A central part of the problem is to extract the named entities in the text. The paper presents an overview of recent advances in this area, covering: Named Entity Recognition (NER), Named Entity Disambiguation (NED), and Named Entity Linking (NEL). We comment that many approaches to NED and NEL are based on older approaches to NER and need to leverage the outputs of state-of-the-art NER systems. There is also a need for standard methods to evaluate and compare named-entity extraction approaches. We observe that NEL has recently moved from being stepwise and isolated into an integrated process along two dimensions: the first is that previously sequential steps are now being integrated into end-to-end processes, and the second is that entities that were previously analysed in isolation are now being lifted in each other's context. The current culmination of these trends is the deep-learning approaches that have recently reported promising results.

    Doctor of Philosophy

    Events are one important type of information throughout text. Event extraction is an information extraction (IE) task that involves identifying entities and objects (mainly noun phrases) that represent important roles in events of a particular type. However, the extraction performance of current event extraction systems is limited because they mainly consider local context (mostly isolated sentences) when making each extraction decision. My research aims to improve both coverage and accuracy of event extraction performance by explicitly identifying event contexts before extracting individual facts. First, I introduce new event extraction architectures that incorporate discourse information across a document to seek out and validate pieces of event descriptions within the document. TIER is a multilayered event extraction architecture that performs text analysis at multiple granularities to progressively "zoom in" on relevant event information. LINKER is a unified discourse-guided approach that includes a structured sentence classifier to sequentially read a story and determine which sentences contain event information based on both the local and preceding contexts. Experimental results on two distinct event domains show that compared to previous event extraction systems, TIER can find more event information while maintaining a good extraction accuracy, and LINKER can further improve extraction accuracy. Finding documents that describe a specific type of event is also highly challenging because of the wide variety and ambiguity of event expressions. In this dissertation, I present the multifaceted event recognition approach that uses event defining characteristics (facets), in addition to event expressions, to effectively resolve the complexity of event descriptions. I also present a novel bootstrapping algorithm to automatically learn event expressions as well as facets of events, which requires minimal human supervision. Experimental results show that the multifaceted event recognition approach can effectively identify documents that describe a particular type of event and make event extraction systems more precise.

    Neural information extraction from natural language text

    Natural language processing (NLP) deals with building computational techniques that allow computers to automatically analyze and meaningfully represent human language. With an exponential growth of data in this digital era, the advent of NLP-based systems has enabled us to easily access relevant information via a wide range of applications, such as web search engines, voice assistants, etc. To achieve this, decades of research have focused on techniques at the intersection of NLP and machine learning. In recent years, deep learning techniques have exploited the expressive power of Artificial Neural Networks (ANNs) and achieved state-of-the-art performance in a wide range of NLP tasks. One vital property of Deep Neural Networks (DNNs) is that they can automatically extract complex features from the input data and thus provide an alternative to the manual process of handcrafted feature engineering. Besides ANNs, Probabilistic Graphical Models (PGMs), a coupling of graph theory and probabilistic methods, have the ability to describe causal structure between random variables of the system and capture a principled notion of uncertainty. Given the characteristics of DNNs and PGMs, they are advantageously combined to build powerful neural models in order to understand the underlying complexity of data. Traditional machine learning based NLP systems employed shallow computational methods (e.g., SVM or logistic regression) and relied on handcrafting features, which is time-consuming, complex and often incomplete. However, deep learning and neural network based methods have recently shown superior results on various NLP tasks, such as machine translation, text classification, named-entity recognition, relation extraction, textual similarity, etc. These neural models can automatically extract an effective feature representation from training data. This dissertation focuses on two NLP tasks: relation extraction and topic modeling. 
The former aims at identifying semantic relationships between entities or nominals within a sentence or document. Successfully extracting the semantic relationships contributes greatly to building structured knowledge bases, useful in downstream NLP application areas of web search, question-answering, recommendation engines, etc. On the other hand, the task of topic modeling aims at understanding the thematic structures underlying a collection of documents. Topic modeling is a popular text-mining tool to automatically analyze a large collection of documents and understand topical semantics without actually reading them. In doing so, it generates word clusters (i.e., topics) and document representations useful in document understanding and information retrieval, respectively. Essentially, the tasks of relation extraction and topic modeling are built upon the quality of representations learned from text. In this dissertation, we have developed task-specific neural models for learning representations, coupled with relation extraction and topic modeling tasks in the realms of supervised and unsupervised machine learning paradigms, respectively. More specifically, we make the following contributions in developing neural models for NLP tasks: 1. Neural Relation Extraction: Firstly, we have proposed a novel recurrent neural network based architecture for table-filling in order to jointly perform entity and relation extraction within sentences. Then, we have further extended our scope of extracting relationships between entities across sentence boundaries, and presented a novel dependency-based neural network architecture. The two contributions lie in the supervised paradigm of machine learning. Moreover, we have contributed to building a robust relation extractor constrained by the lack of labeled data, where we have proposed a novel weakly-supervised bootstrapping technique. 
Given the contributions, we have further explored interpretability of the recurrent neural networks to explain their predictions for the relation extraction task. 2. Neural Topic Modeling: Besides the supervised neural architectures, we have also developed unsupervised neural models to learn meaningful document representations within topic modeling frameworks. Firstly, we have proposed a novel dynamic topic model that captures topics over time. Next, we have contributed to building static topic models without considering temporal dependencies, where we have presented neural topic modeling architectures that also exploit external knowledge, i.e., word embeddings to address data sparsity. Moreover, we have developed neural topic models that incorporate knowledge transfers using both the word embeddings and latent topics from many sources. Finally, we have shown how to improve neural topic modeling by introducing language structures (e.g., word ordering, local syntactic and semantic information, etc.) that deal with bag-of-words issues in traditional topic models. The class of proposed neural NLP models in this section is based on techniques at the intersection of PGMs, deep learning and ANNs. Here, the task of neural relation extraction employs neural networks to learn representations typically at the sentence level, without access to the broader document context. However, topic models have access to statistical information across documents. Therefore, we advantageously combine the two complementary learning paradigms in a neural composite model, consisting of a neural topic and a neural language model that enables us to jointly learn thematic structures in a document collection via the topic model, and word relations within a sentence via the language model. Overall, our research contributions in this dissertation extend NLP-based systems for relation extraction and topic modeling tasks with state-of-the-art performance.

    Biomedical relation extraction: from binary to complex

    Biomedical relation extraction aims to uncover high-quality relations from life science literature with high accuracy and efficiency. Early biomedical relation extraction tasks focused on capturing binary relations, such as protein-protein interactions, which are crucial for virtually every process in a living cell. Information about these interactions provides the foundations for new therapeutic approaches. In recent years, interest has shifted to the extraction of complex relations such as biomolecular events. While complex relations go beyond binary relations and involve more than two arguments, they might also take another relation as an argument. In the paper, we conduct a thorough survey of the research in biomedical relation extraction. We first present a general framework for biomedical relation extraction and then discuss the approaches proposed for binary and complex relation extraction, with a focus on the latter since it is a much more difficult task compared to binary relation extraction. Finally, we discuss the challenges we face with complex relation extraction and outline possible solutions and future directions.

    D4.1. Technologies and tools for corpus creation, normalization and annotation

    The objectives of the Corpus Acquisition and Annotation (CAA) subsystem are the acquisition and processing of monolingual and bilingual language resources (LRs) required in the PANACEA context. Therefore, the CAA subsystem includes: i) a Corpus Acquisition Component (CAC) for extracting monolingual and bilingual data from the web, ii) a component for cleanup and normalization (CNC) of these data and iii) a text processing component (TPC) which consists of NLP tools including modules for sentence splitting, POS tagging, lemmatization, parsing and named entity recognition.