554 research outputs found
From POS tagging to dependency parsing for biomedical event extraction
Background: Given the importance of relation or event extraction from
biomedical research publications to support knowledge capture and synthesis,
and the strong dependency of approaches to this information extraction task on
syntactic information, it is valuable to understand which approaches to
syntactic processing of biomedical text have the highest performance. Results:
We perform an empirical study comparing state-of-the-art traditional
feature-based and neural network-based models for two core natural language
processing tasks of part-of-speech (POS) tagging and dependency parsing on two
benchmark biomedical corpora, GENIA and CRAFT. To the best of our knowledge,
there is no recent work making such comparisons in the biomedical context;
specifically no detailed analysis of neural models on this data is available.
Experimental results show that in general, the neural models outperform the
feature-based models on two benchmark biomedical corpora GENIA and CRAFT. We
also perform a task-oriented evaluation to investigate the influences of these
models in a downstream application on biomedical event extraction, and show
that better intrinsic parsing performance does not always imply better
extrinsic event extraction performance. Conclusion: We have presented a
detailed empirical study comparing traditional feature-based and neural
network-based models for POS tagging and dependency parsing in the biomedical
context, and also investigated the influence of parser selection for a
biomedical event extraction downstream task. Availability of data and material:
We make the retrained models available at
https://github.com/datquocnguyen/BioPosDepComment: Accepted for publication in BMC Bioinformatic
Investigating Genotype-Phenotype relationship extraction from biomedical text
During the last decade biomedicine has developed at a tremendous pace. Every day a lot of biomedical papers are published and a large amount of new information is produced. To help enable automated and human interaction in the multitude of applications of this biomedical data, the need for Natural Language Processing systems to process the vast amount of new information is increasing. Our main purpose in this research project is to extract the relationships between genotypes and phenotypes mentioned in the biomedical publications. Such a system provides important and up-to-date data for database construction and updating, and even text summarization. To achieve this goal we had to solve three main problems: finding genotype names, finding phenotype names, and finally extracting phenotype--genotype interactions. We consider all these required modules in a comprehensive system and propose a promising solution for each of them taking into account available tools and resources.
BANNER, an open source biomedical named entity recognition system, which has achieved good results in detecting genotypes, has been used for the genotype name recognition task. We were the first group to start working on phenotype name recognition. We have developed two different systems (rule-based and machine-learning based) for extracting phenotype names from text. These systems incorporated the available knowledge from the Unified Medical Language System metathesaurus and the Human Phenotype Onotolgy (HPO). As there was no available annotated corpus for phenotype names, we created a valuable corpus with annotated phenotype names using information available in HPO and a self-training method which can be used for future research. To solve the final problem of this project i.e. , phenotype--genotype relationship extraction, a machine learning method has been proposed. As there was no corpus available for this task and it was not possible for us to annotate a sufficiently large corpus manually, a semi-automatic approach has been used to annotate a small corpus and a self-training method has been proposed to annotate more sentences and enlarge this corpus. A test set was manually annotated by an expert. In addition to having phenotype-genotype relationships annotated, the test set contains important comments about the nature of these relationships. The evaluation results related to each system demonstrate the significantly good performance of all the proposed methods
Recommended from our members
Machine Learning Models for Efficient and Robust Natural Language Processing
Natural language processing (NLP) has come of age. For example, semantic role labeling (SRL), which automatically annotates sentences with a labeled graph representing who did what to whom, has in the past ten years seen nearly 40% reduction in error, bringing it to useful accuracy. As a result, a myriad of practitioners now want to deploy NLP systems on billions of documents across many domains. However, state-of-the-art NLP systems are typically not optimized for cross-domain robustness nor computational efficiency. In this dissertation I develop machine learning methods to facilitate fast and robust inference across many common NLP tasks.
First, I describe paired learning and inference algorithms for dynamic feature selection which accelerate inference in linear classifiers, the heart of the fastest NLP models, by 5-10 times. I then present iterated dilated convolutional neural networks (ID-CNNs), a distinct combination of network structure, parameter sharing and training procedures that increase inference speed by 14-20 times with accuracy matching bidirectional LSTMs, the most accurate models for NLP sequence labeling. Finally, I describe linguistically-informed self-attention (LISA), a neural network model that combines multi-head self-attention with multi-task learning to facilitate improved generalization to new domains. We show that incorporating linguistic structure in this way leads to substantial improvements over the previous state-of-the-art (syntax-free) neural network models for SRL, especially when evaluating out-of-domain. I conclude with a brief discussion of potential future directions stemming from my thesis work
Recommended from our members
Learning with Joint Inference and Latent Linguistic Structure in Graphical Models
Constructing end-to-end NLP systems requires the processing of many types of linguistic information prior to solving the desired end task. A common approach to this problem is to construct a pipeline, one component for each task, with each system\u27s output becoming input for the next. This approach poses two problems. First, errors propagate, and, much like the childhood game of telephone , combining systems in this manner can lead to unintelligible outcomes. Second, each component task requires annotated training data to act as supervision for training the model. These annotations are often expensive and time-consuming to produce, may differ from each other in genre and style, and may not match the intended application.
In this dissertation we present a general framework for constructing and reasoning on joint graphical model formulations of NLP problems. Individual models are composed using weighted Boolean logic constraints, and inference is performed using belief propagation. The systems we develop are composed of two parts: one a representation of syntax, the other a desired end task (semantic role labeling, named entity recognition, or relation extraction). By modeling these problems jointly, both models are trained in a single, integrated process, with uncertainty propagated between them. This mitigates the accumulation of errors typical of pipelined approaches.
Additionally we propose a novel marginalization-based training method in which the error signal from end task annotations is used to guide the induction of a constrained latent syntactic representation. This allows training in the absence of syntactic training data, where the latent syntactic structure is instead optimized to best support the end task predictions. We find that across many NLP tasks this training method offers performance comparable to fully supervised training of each individual component, and in some instances improves upon it by learning latent structures which are more appropriate for the task
Automatic inference of causal reasoning chains from student essays
While there has been an increasing focus on higher-level thinking skills arising from the Common Core Standards, many high-school and middle-school students struggle to combine and integrate information from multiple sources when writing essays. Writing is an important learning skill, and there is increasing evidence that writing about a topic develops a deeper understanding in the student. However, grading essays is time consuming for teachers, resulting in an increasing focus on shallower forms of assessment that are easier to automate, such as multiple-choice tests. Existing essay grading software has attempted to ease this burden but relies on shallow lexico-syntactic features and is unable to understand the structure or validity of a student’s arguments or explanations. Without the ability to understand a student’s reasoning processes, it is impossible to write automated formative assessment systems to assist students with improving their thinking skills through essay writing.
In order to understand the arguments put forth in an explanatory essay in the science domain, we need a method of representing the causal structure of a piece of explanatory text. Psychologists use a representation called a causal model to represent a student\u27s understanding of an explanatory text. This consists of a number of core concepts, and a set of causal relations linking them into one or more causal chains, forming a causal model. In this thesis I present a novel system for automatically constructing causal models from student scientific essays using Natural Language Processing (NLP) techniques.
The problem was decomposed into 4 sub-problems - assigning essay concepts to words, detecting causal-relations between these concepts, resolving coreferences within each essay, and using the structure of the whole essay to reconstruct a causal model. Solutions to each of these sub-problems build upon the predictions from the solutions to earlier problems, forming a sequential pipeline of models. Designing a system in this way allows later models to correct for false positive predictions from downstream models. However, this also has the disadvantage that errors made in earlier models can propagate through the system, negatively impacting the upstream models, and limiting their accuracy. Producing robust solutions for the initial 2 sub problems, detecting concepts, and parsing causal relations between them, was critical in building a robust system.
A number of sequence labeling models were trained to classify the concepts associated with each word, with the most effective approach being a bidirectional recurrent neural network (RNN), a deep learning model commonly applied to word labeling problems. This is because the RNN used pre-trained word embeddings to better generalize to rarer words, and was able to use information from both ends of each sentence to infer a word\u27s concept. The concepts predicted by this model were then used to develop causal relation parsing models for detecting causal connections between these concepts. A shift-reduce dependency parsing model was trained using the SEARN algorithm and out-performed a number of other approaches by better utilizing the structure of the problem and directly optimizing the error metric used.
Two pre-trained coreference resolution systems were used to resolve coreferences within the essays. However a word tagging model trained to predict anaphors combined with a heuristic for determining the antecedent out-performed these two systems. Finally, a model was developed for parsing a causal model from an entire essay, utilizing the solutions to the three previous problems. A beam search algorithm was used to produce multiple parses for each sentence, which in turn were combined to generate multiple candidate causal models for each student essay. A reranking algorithm was then used to select the optimal causal model from all of the generated candidates.
An important contribution of this work is that it represents a system for parsing a complete causal model of a scientific essay from a student\u27s written answer. Existing systems have been developed to parse individual causal relations, but no existing system attempts to parse a sequence of linked causal relations forming a causal model from an explanatory scientific essay. It is hoped that this work can lead to the development of more robust essay grading software and formative assessment tools, and can be extended to build solutions for extracting causality from text in other domains. In addition, I also present 2 novel approaches for optimizing the micro-F1 score within the design of two of the algorithms studied: the dependency parser and the reranking algorithm. The dependency parser uses a custom cost function to estimate the impact of parsing mistakes on the overall micro-F1 score, while the reranking algorithm allows the micro-F1 score to be optimized by tuning the beam search parameter to balance recall and precision
- …