41 research outputs found

    Linking named entities to Wikipedia

    Get PDF
    Natural language is fraught with problems of ambiguity, including name reference. A name in text can refer to multiple entities just as an entity can be known by different names. This thesis examines how a mention in text can be linked to an external knowledge base (KB), in our case, Wikipedia. The named entity linking (NEL) task requires systems to identify the KB entry, or Wikipedia article, that a mention refers to; or, if the KB does not contain the correct entry, return NIL. Entity linking systems can be complex and we present a framework for analysing their different components, which we use to analyse three seminal systems which are evaluated on a common dataset and we show the importance of precise search for linking. The Text Analysis Conference (TAC) is a major venue for NEL research. We report on our submissions to the entity linking shared task in 2010, 2011 and 2012. The information required to disambiguate entities is often found in the text, close to the mention. We explore apposition, a common way for authors to provide information about entities. We model syntactic and semantic restrictions with a joint model that achieves state-of-the-art apposition extraction performance. We generalise from apposition to examine local descriptions specified close to the mention. We add local description to our state-of-the-art linker by using patterns to extract the descriptions and matching against this restricted context. Not only does this make for a more precise match, we are also able to model failure to match. Local descriptions help disambiguate entities, further improving our state-of-the-art linker. The work in this thesis seeks to link textual entity mentions to knowledge bases. Linking is important for any task where external world knowledge is used and resolving ambiguity is fundamental to advancing research into these problems

    Sentence encoders for semantic textual similarity: A survey

    Get PDF
    The last decade has witnessed many accomplishments in the field of Natural Language Processing, especially in understanding the language semantics. Well-established machine learning models for generating word representation are available and has been proven useful. However, the existing techniques proposed for learning sentence level representations do not adequately capture the complexity of compositional semantics. Finding semantic similarity between sentences is a fundamental language understanding problem. In this project, we compare various machine learning models on their ability to capture the semantics of a sentence using Semantic Textual Similarity (STS) Task. We focus on models that exhibit state-of-the-art performance in Sem-Eval(2017) STS shared task. Also, we analyse the impact of models’ internal architectures on STS task performance. Out of all the models the we compared, Bi-LSTM RNN with max-pooling layer achieves the best performance in extracting a generic semantic representation and aids in better transfer learning when compared to hierarchical CNN.Natural Language Processinglanguage semanticsSemantic Textual Similarity (STS) TaskSem-Eval(2017) STSBi-LSTM RNNhierarchical CN

    Approaches for enriching and improving textual knowledge bases

    Get PDF
    [no abstract

    Text Summarization Across High and Low-Resource Settings

    Get PDF
    Natural language processing aims to build automated systems that can both understand and generate natural language textual data. As the amount of textual data available online has increased exponentially, so has the need for intelligence systems to comprehend and present it to the world. As a result, automatic text summarization, the process by which a text\u27s salient content is automatically distilled into a concise form, has become a necessary tool. Automatic text summarization approaches and applications vary based on the input summarized, which may constitute single or multiple documents of different genres. Furthermore, the desired output style may consist of a sentence or sub-sentential units chosen directly from the input in extractive summarization or a fusion and paraphrase of the input document in abstractive summarization. Despite differences in the above use-cases, specific themes, such as the role of large-scale data for training these models, the application of summarization models in real-world scenarios, and the need for adequately evaluating and comparing summaries, are common across these settings. This dissertation presents novel data and modeling techniques for deep neural network-based summarization models trained across high-resource (thousands of supervised training examples) and low-resource (zero to hundreds of supervised training examples) data settings and a comprehensive evaluation of the model and metric progress in the field. We examine both Recurrent Neural Network (RNN)-based and Transformer-based models to extract and generate summaries from the input. To facilitate the training of large-scale networks, we introduce datasets applicable for multi-document summarization (MDS) for pedagogical applications and for news summarization. While the high-resource settings allow models to advance state-of-the-art performance, the failure of such models to adapt to settings outside of that in which it was initially trained requires smarter use of labeled data and motivates work in low-resource summarization. To this end, we propose unsupervised learning techniques for both extractive summarization in question answering, abstractive summarization on distantly-supervised data for summarization of community question answering forums, and abstractive zero and few-shot summarization across several domains. To measure the progress made along these axes, we revisit the evaluation of current summarization models. In particular, this dissertation addresses the following research objectives: 1) High-resource Summarization. We introduce datasets for multi-document summarization, focusing on pedagogical applications for NLP, news summarization, and Wikipedia topic summarization. Large-scale datasets allow models to achieve state-of-the-art performance on these tasks compared to prior modeling techniques, and we introduce a novel model to reduce redundancy. However, we also examine how models trained on these large-scale datasets fare when applied to new settings, showing the need for more generalizable models. 2) Low-resource Summarization. While high-resource summarization improves model performance, for practical applications, data-efficient models are necessary. We propose a pipeline for creating synthetic training data for training extractive question-answering models, a form of query-based extractive summarization with short-phrase summaries. In other work, we propose an automatic pipeline for training a multi-document summarizer in answer summarization on community question-answering forums without labeled data. Finally, we push the boundaries of abstractive summarization model performance when little or no training data is available across several domains. 3) Automatic Summarization Evaluation. To understand the extent of progress made across recent modeling techniques and better understand the current evaluation protocols, we examine the current metrics used to compare summarization output quality across 12 metrics across 23 deep neural network models and propose better-motivated summarization evaluation guidelines as well as point to open problems in summarization evaluation

    Bootstrapping named entity resources for adaptive question answering systems

    Get PDF
    Los Sistemas de Búsqueda de Respuestas (SBR) amplían las capacidades de un buscador de información tradicional con la capacidad de encontrar respuestas precisas a las preguntas del usuario. El objetivo principal es facilitar el acceso a la información y disminuir el tiempo y el esfuerzo que el usuario debe emplear para encontrar una información concreta en una lista de documentos relevantes. En esta investigación se han abordado dos trabajos relacionados con los SBR. La primera parte presenta una arquitectura para SBR en castellano basada en la combinación y adaptación de diferentes técnicas de Recuperación y de Extracción de Información. Esta arquitectura está integrada por tres módulos principales que incluyen el análisis de la pregunta, la recuperación de pasajes relevantes y la extracción y selección de respuestas. En ella se ha prestado especial atención al tratamiento de las Entidades Nombradas puesto que, con frecuencia, son el tema de las preguntas o son buenas candidatas como respuestas. La propuesta se ha encarnado en el SBR del grupo MIRACLE que ha sido evaluado de forma independiente durante varias ediciones en la tarea compartida CLEF@QA, parte del foro de evaluación competitiva Cross-Language Evaluation Forum (CLEF). Se describen aquí las participaciones y los resultados obtenidos entre 2004 y 2007. El SBR de MIRACLE ha obtenido resultados moderados en el desempeño de la tarea con tasas de respuestas correctas entre el 20% y el 30%. Entre los resultados obtenidos destacan los de la tarea principal de 2005 y la tarea piloto de Búsqueda de Respuestas en tiempo real de 2006, RealTimeQA. Esta última tarea, además de requerir respuestas correctas incluía el tiempo de respuesta como un factor adicional en la evaluación. Estos resultados respaldan la validez de la arquitectura propuesta como una alternativa viable para los SBR sobre colecciones textuales y también corrobora resultados similares para el inglés y otras lenguas. Por otro lado, el análisis de los resultados a lo largo de las diferentes ediciones de CLEF así como la comparación con otros SBR apunta nuevos problemas y retos. Según nuestra experiencia, los sistemas de QA son más complicados de adaptar a otros dominios y lenguas que los sistemas de Recuperación de Información. Este problema viene heredado del uso de herramientas complejas de análisis de lenguaje como analizadores morfológicos, sintácticos y semánticos. Entre estos últimos se cuentan las herramientas para el Reconocimiento y Clasificación de Entidades Nombradas (NERC en inglés) así como para la Detección y Clasificación de Relaciones (RDC en inglés). Debido a la di cultad de adaptación del SBR a distintos dominios y colecciones, en la segunda parte de esta tesis se investiga una propuesta diferente basada en la adquisición de conocimiento mediante métodos de aprendizaje ligeramente supervisado. El objetivo de esta investigación es adquirir recursos semánticos útiles para las tareas de NERC y RDC usando colecciones de textos no anotados. Además, se trata de eliminar la dependencia de herramientas de análisis lingüístico con el fin de facilitar que las técnicas sean portables a diferentes dominios e idiomas. En primer lugar, se ha realizado un estudio de diferentes algoritmos para NERC y RDC de forma semisupervisada a partir de unos pocos ejemplos (bootstrapping). Este trabajo propone primero una arquitectura común y compara diferentes funciones que se han usado en la evaluación y selección de resultados intermedios, tanto instancias como patrones. La principal propuesta es un nuevo algoritmo que permite la adquisición simultánea e iterativa de instancias y patrones asociados a una relación. Incluye también la posibilidad de adquirir varias relaciones de forma simultánea y mediante el uso de la hipótesis de exclusividad obtener mejores resultados. Como característica distintiva el algoritmo explora la colección de textos con una estrategia basada en indización, que permite adquirir conocimiento de grandes colecciones. La estrategia de selección de candidatos y la evaluación se basan en la construcción de un grafo de instancias y patrones, que justifica nuestro método para la selección de candidatos. Este procedimiento es semejante al frente de exploración de una araña web y permite encontrar las instancias más parecidas a las semillas con las evidencias disponibles. Este algoritmo se ha implementado en el sistema SPINDEL y para su evaluación se ha comenzado con el caso concreto de la adquisición de recursos para las clases de Entidades Nombradas más comunes, Persona, Lugar y Organización. El objetivo es adquirir nombres asociados a cada una de las categorías así como patrones contextuales que permitan detectar menciones asociadas a una clase. Se presentan resultados para la adquisición de dos idiomas distintos, castellano e inglés, y para el castellano, en dos dominios diferentes, noticias y textos de una enciclopedia colaborativa, Wikipedia. En ambos casos el uso de herramientas de análisis lingüístico se ha limitado de acuerdo con el objetivo de avanzar hacia la independencia de idioma. Las listas adquiridas mediante bootstrapping parten de menos de 40 semillas por clase y obtienen del orden de 30.000 instancias de calidad variable. Además se obtienen listas de patrones indicativos asociados a cada clase de entidad. La evaluación indirecta confirma la utilidad de ambos recursos en la clasificación de Entidades Nombradas usando un enfoque simple basado únicamente en diccionarios. La mejor configuración obtiene para la clasificación en castellano una medida F de 67,17 y para inglés de 55,99. Además se confirma la utilidad de los patrones adquiridos que en ambos casos ayudan a mejorar la cobertura. El módulo requiere menor esfuerzo de desarrollo que los enfoques supervisados, si incluimos la necesidad de anotación, aunque su rendimiento es inferior por el momento. En definitiva, esta investigación constituye un primer paso hacia el desarrollo de aplicaciones semánticas como los SBR que requieran menos esfuerzo de adaptación a un dominio o lenguaje nuevo.-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------Question Answering (QA) systems add new capabilities to traditional search engines with the ability to find precise answers to user questions. Their objective is to enable easier information access by reducing the time and effort that the user requires to find a concrete information among a list of relevant documents. In this thesis we have carried out two works related with QA systems. The first part introduces an architecture for QA systems for Spanish which is based on the combination and adaptation of different techniques from Information Retrieval (IR) and Information Extraction (IE). This architecture is composed by three modules that include question analysis, relevant passage retrieval and answer extraction and selection. The appropriate processing of Named Entities (NE) has received special attention because of their importance as question themes and candidate answers. The proposed architecture has been implemented as part of the MIRACLE QA system. This system has taken part in independent evaluations like the CLEF@QA track in the Cross-Language Evaluation Forum (CLEF). Results from 2004 to 2007 campaigns as well as the details and the evolution of the system have been described in deep. The MIRACLE QA system has obtained moderate performance with a first answer accuracy ranging between 20% and 30%. Nevertheless, it is important to highlight the results obtained in the 2005 main QA task and the RealTimeQA pilot task in 2006. The last one included response time as an important additional variable of the evaluation. These results back the proposed architecture as an option for QA from textual collection and confirm similar findings obtained for English and other languages. On the other hand, the analysis of the results along evaluation campaigns and the comparison with other QA systems point problems with current systems and new challenges. According to our experience, it is more dificult to tailor QA systems to different domains and languages than IR systems. The problem is inherited by the use of complex language analysis tools like POS taggers, parsers and other semantic analyzers, like NE Recognition and Classification (NERC) and Relation Detection and Characterization (RDC) tools. The second part of this thesis tackles this problem and proposes a different approach to adapting QA systems for di erent languages and collections. The proposal focuses on acquiring knowledge for the semantic analyzers based on lightly supervised approaches. The goal is to obtain useful resources that help to perform NERC or RDC using as few annotated resources as possible. Besides, we try to avoid dependencies from other language analysis tools with the purpose that these methods apply to different languages and domains. First of all, we have study previous work on building NERC and RDC modules with few supervision, particularly bootstrapping methods. We propose a common framework for different bootstrapping systems that help to unify different evaluation functions for intermediate results. The main proposal is a new algorithm that is able to simultaneously acquire instances and patterns associated to a relation of interest. It also uses mutual exclusion among relations to reduce concept drift and achieve better results. A distinctive characteristic is that it uses a query based exploration strategy of the text collection which enables their use for larger collections. Candidate selection and evaluation are based on incrementally building a graph of instances and patterns which also justifies our evaluation function. The discovery approach is analogous to the front of exploration in a web crawler and it is able to find the most similar instances to the available seeds. This algorithm has been implemented in the SPINDEL system. We have selected for evaluation the task of acquiring resources for the most common NE classes, Person, Location and Organization. The objective is to acquire name instances that belong to any of the classes as well as contextual patterns that help to detect mentions of NE that belong to that class. We present results for the acquisition of resources from raw text from two different languages, Spanish and English. We also performed experiments for Spanish in two different collections, news and texts from a collaborative encyclopedia, Wikipedia. Both cases are tackled with limited language analysis tools and resources. With an initial list of 40 instance seeds, the bootstrapping process is able to acquire large name lists containing up to 30.000 instances with a variable quality. Besides, large lists of indicative patterns are obtained too. Our indirect evaluation confirms the utility of both resources to classify NE using a simple dictionary recognition approach. Best results for Spanish obtained a F-score of 67,17 and for English this value is 55,99. The module requires much less development effort than annotation for supervised algorithms although the performance is not in pair yet. This research is a first step towards the development of semantic applications like QA for a new language or domain with no annotated corpora that requires less adaptation effort

    Acquiring syntactic and semantic transformations in question answering

    Get PDF
    One and the same fact in natural language can be expressed in many different ways by using different words and/or a different syntax. This phenomenon, commonly called paraphrasing, is the main reason why Natural Language Processing (NLP) is such a challenging task. This becomes especially obvious in Question Answering (QA) where the task is to automatically answer a question posed in natural language, usually in a text collection also consisting of natural language texts. It cannot be assumed that an answer sentence to a question uses the same words as the question and that these words are combined in the same way by using the same syntactic rules. In this thesis we describe methods that can help to address this problem. Firstly we explore how lexical resources, i.e. FrameNet, PropBank and VerbNet can be used to recognize a wide range of syntactic realizations that an answer sentence to a given question can have. We find that our methods based on these resources work well for web-based Question Answering. However we identify two problems: 1) All three resources as of yet have significant coverage issues. 2) These resources are not suitable to identify answer sentences that show some form of indirect evidence. While the first problem hinders performance currently, it is not a theoretical problem that renders the approach unsuitable–it rather shows that more efforts have to be made to produce more complete resources. The second problem is more persistent. Many valid answer sentences–especially in small, journalistic corpora–do not provide direct evidence for a question, rather they strongly suggest an answer without logically implying it. Semantically motivated resources like FrameNet, PropBank and VerbNet can not easily be employed to recognize such forms of indirect evidence. In order to investigate ways of dealing with indirect evidence, we used Amazon’s Mechanical Turk to collect over 8,000 manually identified answer sentences from the AQUAINT corpus to the over 1,900 TREC questions from the 2002 to 2006 QA tracks. The pairs of answer sentences and their corresponding questions form the QASP corpus, which we released to the public in April 2008. In this dissertation, we use the QASP corpus to develop an approach to QA based on matching dependency relations between answer candidates and question constituents in the answer sentences. By acquiring knowledge about syntactic and semantic transformations from dependency relations in the QASP corpus, additional answer candidates can be identified that could not be linked to the question with our first approach

    Rapport : a fact-based question answering system for portuguese

    Get PDF
    Question answering is one of the longest-standing problems in natural language processing. Although natural language interfaces for computer systems can be considered more common these days, the same still does not happen regarding access to specific textual information. Any full text search engine can easily retrieve documents containing user specified or closely related terms, however it is typically unable to answer user questions with small passages or short answers. The problem with question answering is that text is hard to process, due to its syntactic structure and, to a higher degree, to its semantic contents. At the sentence level, although the syntactic aspects of natural language have well known rules, the size and complexity of a sentence may make it difficult to analyze its structure. Furthermore, semantic aspects are still arduous to address, with text ambiguity being one of the hardest tasks to handle. There is also the need to correctly process the question in order to define its target, and then select and process the answers found in a text. Additionally, the selected text that may yield the answer to a given question must be further processed in order to present just a passage instead of the full text. These issues take also longer to address in languages other than English, as is the case of Portuguese, that have a lot less people working on them. This work focuses on question answering for Portuguese. In other words, our field of interest is in the presentation of short answers, passages, and possibly full sentences, but not whole documents, to questions formulated using natural language. For that purpose, we have developed a system, RAPPORT, built upon the use of open information extraction techniques for extracting triples, so called facts, characterizing information on text files, and then storing and using them for answering user queries done in natural language. These facts, in the form of subject, predicate and object, alongside other metadata, constitute the basis of the answers presented by the system. Facts work both by storing short and direct information found in a text, typically entity related information, and by containing in themselves the answers to the questions already in the form of small passages. As for the results, although there is margin for improvement, they are a tangible proof of the adequacy of our approach and its different modules for storing information and retrieving answers in question answering systems. In the process, in addition to contributing with a new approach to question answering for Portuguese, and validating the application of open information extraction to question answering, we have developed a set of tools that has been used in other natural language processing related works, such as is the case of a lemmatizer, LEMPORT, which was built from scratch, and has a high accuracy. Many of these tools result from the improvement of those found in the Apache OpenNLP toolkit, by pre-processing their input, post-processing their output, or both, and by training models for use in those tools or other, such as MaltParser. Other tools include the creation of interfaces for other resources containing, for example, synonyms, hypernyms, hyponyms, or the creation of lists of, for instance, relations between verbs and agents, using rules
    corecore