39 research outputs found

    A Dependency Parsing Approach to Biomedical Text Mining

    Biomedical research is currently facing a new type of challenge: an excess of information, both in terms of raw data from experiments and in the number of scientific publications describing their results. Mirroring the focus on data mining techniques to address the issues of structured data, there has recently been great interest in the development and application of text mining techniques to make more effective use of the knowledge contained in biomedical scientific publications, accessible only in the form of natural human language. This thesis describes research done in the broader scope of projects aiming to develop methods, tools and techniques for text mining tasks in general and for the biomedical domain in particular. The work described here involves more specifically the goal of extracting information from statements concerning relations of biomedical entities, such as protein-protein interactions. The approach taken is one using full parsing (syntactic analysis of the entire structure of sentences) and machine learning, aiming to develop reliable methods that can further be generalized to apply also to other domains. The five papers at the core of this thesis describe research on a number of distinct but related topics in text mining. In the first of these studies, we assessed the applicability of two popular general English parsers to biomedical text mining and, finding their performance limited, identified several specific challenges to accurate parsing of domain text. In a follow-up study focusing on parsing issues related to specialized domain terminology, we evaluated three lexical adaptation methods. We found that the accurate resolution of unknown words can considerably improve parsing performance and introduced a domain-adapted parser that reduced the error rate of the original by 10% while also roughly halving parsing time.
To establish the relative merits of parsers that differ in the applied formalisms and the representation given to their syntactic analyses, we have also developed evaluation methodology, considering different approaches to establishing comparable dependency-based evaluation results. We introduced a methodology for creating highly accurate conversions between different parse representations, demonstrating the feasibility of unification of diverse syntactic schemes under a shared, application-oriented representation. In addition to allowing formalism-neutral evaluation, we argue that such unification can also increase the value of parsers for domain text mining. As a further step in this direction, we analysed the characteristics of publicly available biomedical corpora annotated for protein-protein interactions and created tools for converting them into a shared form, thus contributing also to the unification of text mining resources. The introduced unified corpora allowed us to perform a task-oriented comparative evaluation of biomedical text mining corpora. This evaluation established clear limits on the comparability of results for text mining methods evaluated on different resources, prompting further efforts toward standardization. To support this and other research, we have also designed and annotated BioInfer, the first domain corpus of its size combining annotation of syntax and biomedical entities with a detailed annotation of their relationships. The corpus represents a major design and development effort of the research group, with manual annotation that identifies over 6000 entities, 2500 relationships and 28,000 syntactic dependencies in 1100 sentences. In addition to combining these key annotations for a single set of sentences, BioInfer was also the first domain resource to introduce a representation of entity relations that is supported by ontologies and able to capture complex, structured relationships.
Part I of this thesis presents a summary of this research in the broader context of a text mining system, and Part II contains reprints of the five included publications.

    Modelling input texts: from Tree Kernels to Deep Learning

    One of the core questions when designing modern Natural Language Processing (NLP) systems is how to model input textual data such that the learning algorithm is provided with enough information to estimate accurate decision functions. The mainstream approach is to represent input objects as feature vectors where each value encodes some of their aspects, e.g., syntax, semantics, etc. Feature-based methods have demonstrated state-of-the-art results on various NLP tasks. However, designing good features is a highly empirical process that depends greatly on the task and requires a significant amount of domain expertise. Moreover, extracting features for complex NLP tasks often requires expensive pre-processing steps running a large number of linguistic tools while relying on external knowledge sources that are often not available or hard to get. Hence, this process is not cheap and often constitutes one of the major challenges when attempting a new task or adapting to a different language or domain. The problem of modelling input objects is even more acute in cases when the input examples are not just single objects but pairs of objects, such as in various learning to rank problems in Information Retrieval and Natural Language Processing. An alternative to feature-based methods is using kernels, which are essentially non-linear functions mapping input examples into some high-dimensional space, thus allowing for learning decision functions with higher discriminative power. Kernels implicitly generate a very large number of features, computing similarity between input examples in that implicit space. A well-designed kernel function can greatly reduce the effort needed to design a large set of manual features, often leading to superior results.
However, in recent years, the use of kernel methods in NLP has been greatly underestimated, primarily for the following reasons: (i) learning with kernels is slow, as it requires carrying out optimization in the dual space, leading to quadratic complexity; (ii) applying kernels to input objects encoded with vanilla structures, e.g., generated by syntactic parsers, often yields only minor improvements over carefully designed feature-based methods. In this thesis, we adopt the kernel learning approach for solving complex NLP tasks and primarily focus on solutions to the aforementioned problems posed by the use of kernels. In particular, we design novel learning algorithms for training Support Vector Machines with structural kernels, e.g., tree kernels, considerably speeding up the training over conventional SVM training methods. We show that using the training algorithms developed in this thesis allows for training tree kernel models on large-scale datasets containing millions of instances, which was not possible before. Next, we focus on the problem of designing input structures that are fed to tree kernel functions to automatically generate a large set of tree-fragment features. We demonstrate that previously used plain structures generated by syntactic parsers, e.g., syntactic or dependency trees, are often a poor choice, thus compromising the expressivity offered by a tree kernel learning framework. We propose several effective design patterns of the input tree structures for various NLP tasks ranging from sentiment analysis to answer passage reranking. The central idea is to inject additional semantic information relevant for the task directly into the tree nodes and let the expressive kernels generate rich feature spaces.
For the opinion mining tasks, the additional semantic information injected into tree nodes can be word polarity labels, while for more complex tasks of modelling text pairs the relational information about overlapping words in a pair appears to significantly improve the accuracy of the resulting models. Finally, we observe that both feature-based and kernel methods typically treat words as atomic units, where matching different yet semantically similar words is problematic. Conversely, the idea of distributional approaches to model words as vectors is much more effective in establishing a semantic match between words and phrases. While tree kernel functions do allow for a more flexible matching between phrases and sentences through matching their syntactic contexts, their representation cannot be tuned on the training set, as is possible with distributional approaches. Recently, deep learning approaches have been applied to generalize the distributional word matching problem to matching sentences, taking it one step further by learning the optimal sentence representations for a given task. Deep neural networks have already claimed state-of-the-art performance in many computer vision, speech recognition, and natural language tasks. Following this trend, this thesis also explores the virtue of deep learning architectures for modelling input texts and text pairs, where we build on some of the ideas to model input objects proposed within the tree kernel learning framework. In particular, we explore the idea of relational linking (proposed in the preceding chapters to encode text pairs using linguistic tree structures) to design a state-of-the-art deep learning architecture for modelling text pairs.
We compare the proposed deep learning models, which require even less manual intervention in the feature design process, with the previously described tree kernel methods, which already offer a very good trade-off between the feature-engineering effort and the expressivity of the resulting representation. Our deep learning models demonstrate state-of-the-art performance on recent benchmarks for Twitter Sentiment Analysis, Answer Sentence Selection and Microblog retrieval.

    Normalization and parsing algorithms for uncertain input


    Web Relation Extraction with Distant Supervision

    Being able to find relevant information about prominent entities quickly is the main reason to use a search engine. However, with large quantities of information on the World Wide Web, real-time search over billions of Web pages can waste resources and the end user’s time. One solution is to store the answers to frequently asked general knowledge queries, such as the albums released by a musical artist, in a more accessible format: a knowledge base. Knowledge bases can be created and maintained automatically by using information extraction methods, particularly methods to extract relations between proper names (named entities). One group of approaches that has become popular in recent years is distant supervision, which makes it possible to train relation extractors without text-bound annotation by instead taking known relations from a knowledge base and heuristically aligning them with a large textual corpus from an appropriate domain. This thesis focuses on researching distant supervision for the Web domain. A new setting for creating training and testing data for distant supervision from the Web with entity-specific search queries is introduced and the resulting corpus is published. Methods to recognise noisy training examples as well as methods to combine extractions based on statistics derived from the background knowledge base are researched. Using co-reference resolution methods to extract relations from sentences which do not contain a direct mention of the subject of the relation is also investigated. One bottleneck for distant supervision for Web data is identified to be named entity recognition and classification (NERC), since relation extraction methods rely on it for identifying relation arguments. Typically, existing pre-trained tools are used, which fail in diverse genres with non-standard language, such as the Web genre.
The thesis explores what can cause NERC methods to fail in diverse genres and quantifies different reasons for NERC failure. Finally, a novel method for NERC for relation extraction is proposed based on the idea of jointly training the named entity classifier and the relation extractor with imitation learning to reduce the reliance on external NERC tools. This thesis improves the state of the art in distant supervision for knowledge base population, and sheds light on and proposes solutions for issues arising in information extraction for domains that have not traditionally been studied.
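
As a rough illustration of the distant supervision idea in this abstract (using known knowledge-base relations to label text instead of text-bound annotation), here is a minimal Python sketch. The toy knowledge base, the example sentences, and the function name `label_sentences` are all hypothetical, and plain substring matching stands in for the named entity recognition a real system would need. It is not the method of the thesis.

```python
from typing import List, Tuple

# Hypothetical toy knowledge base: (subject, relation, object) triples.
KB = [
    ("The Beatles", "album", "Abbey Road"),
    ("The Beatles", "album", "Revolver"),
]

def label_sentences(sentences: List[str], kb=KB) -> List[Tuple[str, str, str, str]]:
    """Label a sentence with a relation when it mentions both the subject
    and the object of a known triple (the distant supervision heuristic).
    Returns (sentence, subject, relation, object) tuples."""
    examples = []
    for sentence in sentences:
        for subj, rel, obj in kb:
            # Assumption: substring match instead of real entity recognition.
            if subj in sentence and obj in sentence:
                examples.append((sentence, subj, rel, obj))
    return examples

sentences = [
    "Abbey Road was recorded by The Beatles in 1969.",
    # Matches the triple, but is about the street: a noisy training example.
    "The Beatles walked across Abbey Road for the cover photo.",
    # No direct mention of the subject, so the heuristic finds nothing.
    "Revolver is often cited by critics.",
]
training = label_sentences(sentences)
for sentence, subj, rel, obj in training:
    print(f"{rel}({subj}, {obj}) <- {sentence}")
```

The second sentence shows why recognising noisy training examples matters, and the third shows the kind of sentence that would need co-reference resolution to be usable.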

    Active duplicate detection with Bayesian nonparametric models

    Thesis (Ph. D.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2010. Cataloged from PDF version of thesis. Includes bibliographical references (p. 129-137). When multiple databases are merged, an essential step is identifying sets of records that refer to the same entity. Called duplicate detection, this task is typically tedious to perform manually, and so a variety of automated methods have been developed for partitioning a collection of records into coreference sets. This task is complicated by ambiguous or noisy field values, so systems are typically domain-specific and often fitted to a representative labeled training corpus. Once fitted, such systems can estimate a partition of a similar corpus without human intervention. While this approach has many applications, it is often infeasible to encode the appropriate domain knowledge a priori or to identify suitable training data. To address such cases, this thesis uses an active framework for duplicate detection, wherein the system initially estimates a partition of a test corpus without training, but is then allowed to query a human user about the coreference labeling of a portion of the corpus. The responses to these queries are used to guide the system in producing improved partition estimates and further queries of interest. This thesis describes a complete implementation of this framework with three technical contributions: a domain-independent Bayesian model expressing the relationship between the unobserved partition and the observed field values of a set of database records; a criterion for picking informative queries based on the mutual information between the response and the unobserved partition; and an algorithm for estimating a minimum-error partition under a Bayesian model through a reduction to the well-studied problem of correlation clustering.
It also presents experimental results demonstrating the effectiveness of this method in a variety of data domains. By Nicholas Elias Matsakis. Ph.D.
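
To make the query-selection criterion in this abstract more concrete, here is a minimal Python sketch under strong simplifying assumptions that are not taken from the thesis: a query asks whether two records are coreferent, the user's answer is noise-free, and the posterior over partitions is approximated by a handful of sampled partitions. Under those assumptions the answer is fully determined by the partition, so the mutual information between answer and partition reduces to the entropy of the answer. The names `most_informative_pair` and `samples` are hypothetical.

```python
import math
from itertools import combinations
from typing import Dict, List, Optional, Tuple

def binary_entropy(p: float) -> float:
    """Entropy in bits of a yes/no answer that is 'yes' with probability p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

def most_informative_pair(samples: List[Dict[str, int]]) -> Optional[Tuple[str, str]]:
    """Pick the record pair whose coreference answer has the highest entropy.

    samples: hypothetical posterior samples of partitions, each mapping a
    record id to a cluster label.
    """
    records = sorted(samples[0])
    best_pair, best_score = None, -1.0
    for a, b in combinations(records, 2):
        # Estimated probability that a and b refer to the same entity.
        p_same = sum(s[a] == s[b] for s in samples) / len(samples)
        score = binary_entropy(p_same)
        if score > best_score:
            best_pair, best_score = (a, b), score
    return best_pair

samples = [
    {"r1": 0, "r2": 0, "r3": 1},
    {"r1": 0, "r2": 0, "r3": 0},
    {"r1": 0, "r2": 0, "r3": 1},
    {"r1": 0, "r2": 1, "r3": 0},
]
query = most_informative_pair(samples)
print(query)
```

In this toy posterior the pair (`r1`, `r3`) is coreferent in exactly half of the samples, so asking about it is expected to be the most informative single question.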


    Proceedings of the Ninth International Workshop on Treebanks and Linguistic Theories. Editors: Markus Dickinson, Kaili Müürisep and Marco Passarotti. NEALT Proceedings Series, Vol. 9 (2010), 268 pages. © 2010 The editors and contributors. Published by Northern European Association for Language Technology (NEALT) http://omilia.uio.no/nealt . Electronically published at Tartu University Library (Estonia) http://hdl.handle.net/10062/15891

    Comparative Quality Estimation for Machine Translation. An Application of Artificial Intelligence on Language Technology using Machine Learning of Human Preferences

    In this thesis we focus on Comparative Quality Estimation, the automatic process of analysing two or more translations produced by a Machine Translation (MT) system and expressing a judgment about their comparison. We approach the problem from a supervised machine learning perspective, with the aim of learning from human preferences. As a result, we create the ranking mechanism, a pipeline that includes the necessary tasks for ordering several MT outputs of a given source sentence in terms of relative quality. Quality Estimation models are trained to statistically associate the judgments with some qualitative features. For this purpose, we design a broad set of features with a particular focus on the ones with a grammatical background. Through an iterative feature engineering process, we investigate several feature sets, settle on the ones that achieve the best performance, and proceed to linguistically intuitive observations about the contribution of individual features. Additionally, we employ several feature selection and machine learning methods to take advantage of these features. We suggest the usage of binary classifiers after decomposing the ranking into pairwise decisions. In order to reduce the number of uncertain decisions (ties), we weight the pairwise decisions with their classification probability. Through a set of experiments, we show that the ranking mechanism can learn and reproduce rankings that correlate with the ones given by humans. Most importantly, it can be successfully compared with state-of-the-art reference-aware metrics and other known ranking methods for several language pairs. We also apply this method to a hybrid MT system combination and show that it is able to improve the overall translation performance. Finally, we examine the correlation between common MT errors and decoding events of phrase-based statistical MT systems.
Through evidence from the decoding process, we identify some cases where long-distance grammatical phenomena cannot be captured properly. An additional outcome of this thesis is the open source software Qualitative, which implements the full pipeline of the ranking mechanism and the system combination task. It integrates a multitude of state-of-the-art natural language processing tools and can support the development of new models. Apart from its usage in experiment pipelines, it can serve as an application back-end for web applications in real-use scenarios.
In this doctoral thesis we focus on comparative quality estimation for machine translation as an automatic procedure for analysing two or more translations produced by machine translation systems and for judging their comparison. We approach the problem from the perspective of supervised machine learning, with the goal of learning from human preferences. As a result, we create a ranking mechanism: a pipeline comprising the steps necessary for ordering several machine translations of a given source sentence in terms of relative quality. Quality estimation models are trained so that comparative judgments are statistically associated with certain features. For this purpose, we design a broad range of features with a particular focus on those with a grammatical background. Using an iterative feature engineering process, we investigate several feature sets, identify those that achieve the best performance, and derive linguistically motivated observations about the contributions of individual features. In addition, we employ various machine learning and feature selection methods to take advantage of these features.
We propose the use of binary classifiers after decomposing the ranking into pairwise decisions. To reduce the number of unclear decisions (ties), we weight the pairwise decisions by their classification probability. Through a series of experiments, we show that the ranking mechanism can learn and reproduce rankings that agree with those given by humans. The most important finding is that the mechanism compares successfully with state-of-the-art reference-based metrics and other known ranking methods for several language pairs. We also apply this method to a hybrid combination of machine translation systems and show that it is able to improve overall translation performance. Finally, we examine the relationship between frequent machine translation errors and events that occur during the internal decoding process of phrase-based statistical machine translation systems. Using evidence from the decoding process, we identify some cases in which grammatical phenomena with long-distance dependencies cannot be captured properly. A further outcome of this work is the open-source software "Qualitative", which implements the full pipeline of the ranking mechanism and the system combination task. The software integrates a large number of state-of-the-art natural language processing tools and can support the development of new models. It can be used both in experiment pipelines and as an application back-end in real-use scenarios.
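
The pairwise decomposition described in this abstract can be sketched roughly as follows. This is a minimal illustration and not the Qualitative implementation: `prob_better` stands in for a trained binary classifier's probability that the first translation is better than the second, summing probabilities is only one plausible reading of "weighting", and the fixed probability table is invented for the example.

```python
from itertools import combinations
from typing import Callable, List

def rank_by_pairwise(items: List[str],
                     prob_better: Callable[[str, str], float],
                     weighted: bool = True) -> List[str]:
    """Rank items from pairwise decisions.

    prob_better(a, b) is the (hypothetical) classifier probability that a is
    better than b. With weighted=False each pair gives one hard vote to the
    winner, which can leave items tied. With weighted=True each item receives
    its share of the probability instead, which separates many such ties.
    """
    scores = {item: 0.0 for item in items}
    for a, b in combinations(items, 2):
        p = prob_better(a, b)
        if weighted:
            scores[a] += p
            scores[b] += 1.0 - p
        else:
            winner = a if p >= 0.5 else b
            scores[winner] += 1.0
    # sorted() is stable, so tied items keep their input order.
    return sorted(items, key=lambda it: scores[it], reverse=True)

# Invented probabilities for three MT outputs with cyclic preferences.
PROBS = {("A", "B"): 0.9, ("A", "C"): 0.45, ("B", "C"): 0.6}

def toy_prob(a: str, b: str) -> float:
    return PROBS[(a, b)]

items = ["A", "B", "C"]
hard_ranking = rank_by_pairwise(items, toy_prob, weighted=False)
soft_ranking = rank_by_pairwise(items, toy_prob, weighted=True)
print(hard_ranking, soft_ranking)
```

With hard votes every output wins exactly one comparison, so all three are tied and the result simply echoes the input order. With probability weights the confident decision for A over B outweighs the two near-even ones.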