Neural Reranking for Named Entity Recognition
We propose a neural reranking system for named entity recognition (NER). The
basic idea is to leverage recurrent neural network models to learn
sentence-level patterns that involve named entity mentions. In particular,
given an output sentence produced by a baseline NER model, we replace all
entity mentions, such as \textit{Barack Obama}, with their entity types, such
as \textit{PER}. The resulting sentence patterns carry direct output
information, yet are less sparse than sentences with specific named entities. For example,
"PER was born in LOC" can be such a pattern. LSTM and CNN structures are
utilised for learning deep representations of such sentences for reranking.
Results show that our system can significantly improve NER accuracy over
two different baselines, giving the best reported results on a standard
benchmark.
Comment: Accepted as a regular paper by RANLP 201
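As a toy illustration of the pattern-extraction step described above (the function and data here are illustrative, not the authors' code), entity mentions in a baseline NER output are replaced by their type symbols to form dense sentence patterns:

```python
def to_pattern(tokens, entities):
    """Replace labeled entity spans with their entity-type symbols.

    tokens:   list of words in the sentence
    entities: list of (start, end, type) spans, end-exclusive
    """
    spans = {s: (e, t) for s, e, t in entities}
    pattern = []
    i = 0
    while i < len(tokens):
        if i in spans:
            end, etype = spans[i]
            pattern.append(etype)  # e.g. "PER", "LOC"
            i = end
        else:
            pattern.append(tokens[i])
            i += 1
    return pattern

tokens = ["Barack", "Obama", "was", "born", "in", "Hawaii"]
entities = [(0, 2, "PER"), (5, 6, "LOC")]
print(" ".join(to_pattern(tokens, entities)))  # PER was born in LOC
```

The resulting pattern string is what the LSTM/CNN rerankers would consume in place of the raw sentence.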
Neural Representations of Concepts and Texts for Biomedical Information Retrieval
Information retrieval (IR) methods are an indispensable tool in the current landscape of exponentially increasing textual data, especially on the Web. A typical IR task involves fetching and ranking a set of documents (from a large corpus) in terms of relevance to a user's query, which is often expressed as a short phrase. IR methods are the backbone of modern search engines, where additional system-level aspects including fault tolerance, scale, user interfaces, and session maintenance are also addressed. In addition to fetching documents, modern search systems may also identify snippets within the documents that are potentially most relevant to the input query. Furthermore, current systems may also maintain preprocessed structured knowledge derived from textual data as so-called knowledge graphs, so certain types of queries that are posed as questions can be parsed as such; a response can be an output of one or more named entities instead of a ranked list of documents (e.g., what diseases are associated with EGFR mutations?). This refined setup is often termed question answering (QA) in the IR and natural language processing (NLP) communities.
In biomedicine and healthcare, specialized corpora are often at play, including research articles by scientists, clinical notes generated by healthcare professionals, consumer forums for specific conditions (e.g., cancer survivors network), and clinical trial protocols (e.g., www.clinicaltrials.gov). Biomedical IR is specialized because the types of queries and the variations in the texts differ from those of general Web documents. For example, scientific articles are more formal with longer sentences, but clinical notes tend to have less grammatical conformity and are rife with abbreviations. There is also a mismatch between the vocabulary of consumers and the lingo of domain experts and professionals. Queries are also different and can range from simple phrases (e.g., COVID-19 symptoms) to more complex implicitly fielded queries (e.g., chemotherapy regimens for stage IV lung cancer patients with ALK mutations). Hence, developing methods for different configurations (corpus, query type, user type) needs more deliberate attention in biomedical IR.
Representations of documents and queries are at the core of IR methods, and retrieval methodology involves coming up with these representations and matching queries with documents based on them. Traditional IR systems follow the approach of keyword-based indexing of documents (the so-called inverted index) and matching query phrases against the document index. It is not difficult to see that this keyword-based matching ignores the semantics of texts (synonymy at the lexeme level and entailment at phrase/clause/sentence levels), and this has led to dimensionality reduction methods such as latent semantic indexing that generally have scale-related concerns; such methods also do not address similarity at the sentence level. Since the resurgence of neural network methods in NLP, the IR field has also moved to incorporate advances in neural networks into current IR methods.
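A minimal sketch of the traditional keyword-matching setup described above, showing why it ignores semantics (documents and queries are invented for illustration):

```python
# Build an inverted index: term -> set of document ids containing it.
from collections import defaultdict

docs = {
    1: "EGFR mutations are associated with lung cancer",
    2: "tumours of the lung respond to chemotherapy",
}

index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.lower().split():
        index[term].add(doc_id)

def match(query):
    """Return ids of docs containing every query keyword (boolean AND)."""
    hits = set(docs)
    for term in query.lower().split():
        hits &= index.get(term, set())
    return sorted(hits)

print(match("lung cancer"))   # [1] -- exact keyword overlap only
print(match("lung tumours"))  # [2] -- misses doc 1: "cancer" != "tumours"
```

The second query illustrates the synonymy gap: "cancer" and "tumours" never match at the keyword level, which is exactly the limitation that dense neural representations aim to close.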
This dissertation presents four specific methodological efforts toward improving biomedical IR. Neural methods always begin with dense embeddings for words and concepts to overcome the limitations of one-hot encoding in traditional NLP/IR. In the first effort, we present a new neural pre-training approach to jointly learn word and concept embeddings for downstream use in applications. In the second study, we present a joint neural model for two essential subtasks of information extraction (IE): named entity recognition (NER) and entity normalization (EN). Our method detects biomedical concept phrases in texts and links them to the corresponding semantic types and entity codes. These first two studies provide essential tools to model textual representations as compositions of both surface forms (lexical units) and high-level concepts with potential downstream use in QA. In the third effort, we present a document reranking model that can help surface documents that are likely to contain answers (e.g., factoids, lists) to a question in a QA task. The model is essentially a sentence-matching neural network that learns the relevance of a candidate answer sentence to the given question, parametrized with a bilinear map. In the fourth effort, we present another document reranking approach that is tailored for precision medicine use-cases. It combines neural query-document matching and faceted text summarization. The main distinction of this effort from previous efforts is to pivot from a query manipulation setup to transforming candidate documents into pseudo-queries via neural text summarization. Overall, our contributions constitute nontrivial advances in biomedical IR using neural representations of concepts and texts.
Bi-Encoders based Species Normalization -- Pairwise Sentence Learning to Rank
Motivation: Biomedical named-entity normalization involves connecting
biomedical entities with distinct database identifiers in order to facilitate
data integration across various fields of biology. Existing systems for
biomedical named entity normalization heavily rely on dictionaries, manually
created rules, and high-quality representative features such as lexical or
morphological characteristics. However, recent research has investigated the
use of neural network-based models to reduce dependence on dictionaries,
manually crafted rules, and features. Despite these advancements, the
performance of these models is still limited due to the lack of sufficiently
large training datasets. These models have a tendency to overfit small training
corpora and exhibit poor generalization when faced with previously unseen
entities, necessitating the redesign of rules and features. Contribution: We
present a novel deep learning approach for named entity normalization, treating
it as a pair-wise learning to rank problem. Our method utilizes the widely-used
information retrieval algorithm Best Matching 25 to generate candidate
concepts, followed by the application of Bidirectional Encoder Representations
from Transformers (BERT) to re-rank the candidate list. Notably, our approach
eliminates the need for feature-engineering or rule creation. We conduct
experiments on species entity types and evaluate our method against
state-of-the-art techniques using LINNAEUS and S800 biomedical corpora. Our
proposed approach surpasses existing methods in linking entities to the NCBI
taxonomy. To the best of our knowledge, there is no existing neural
network-based approach for species normalization in the literature.
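The candidate-generation stage can be sketched with a pure-Python BM25 scorer (a stand-in for the retrieval component, not the authors' implementation; the concept dictionary below is invented):

```python
import math
from collections import Counter

def bm25_scores(query_terms, corpus, k1=1.5, b=0.75):
    """Score each tokenized document in `corpus` against the query terms."""
    N = len(corpus)
    avgdl = sum(len(d) for d in corpus) / N
    # document frequency of each term
    df = Counter(t for d in corpus for t in set(d))
    scores = []
    for doc in corpus:
        tf = Counter(doc)
        score = 0.0
        for t in query_terms:
            if t not in tf:
                continue
            idf = math.log((N - df[t] + 0.5) / (df[t] + 0.5) + 1)
            denom = tf[t] + k1 * (1 - b + b * len(doc) / avgdl)
            score += idf * tf[t] * (k1 + 1) / denom
        scores.append(score)
    return scores

# Hypothetical tokenized concept names from a species dictionary
concepts = [["homo", "sapiens"], ["mus", "musculus"], ["rattus", "norvegicus"]]
scores = bm25_scores(["mus", "musculus"], concepts)
top = max(range(len(concepts)), key=scores.__getitem__)
print(top)  # 1 -- index of the top candidate handed to the BERT re-ranker
```

In the described system, the shortlist produced this way would then be reordered by the neural re-ranker rather than taken as the final answer.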
TrialMatch: A Transformer Architecture to Match Patients to Clinical Trials
Around 80% of clinical trials fail to meet their patient recruitment requirements, which not only hinders market growth but also delays patients' access to new and effective treatments. A possible approach is to use Electronic Health Records (EHRs) to help match patients to clinical trials. Past attempts at this exact goal were unsuccessful due to a lack of data. In 2021, the Text REtrieval Conference (TREC) introduced the Clinical Trials Track, where participants were challenged with retrieving relevant clinical trials given patient descriptions simulating admission notes. Using the track results as a baseline, we tackled the challenge with Information Retrieval (IR), implementing a document-ranking pipeline in which we explore different retrieval methods, how to filter clinical trials based on their criteria, and reranking with Transformer-based models. To tackle the problem, we explored models pre-trained on the biomedical domain, how to deal with long queries and documents through query expansion and passage selection, and how to distinguish an eligible clinical trial from an excluded one, using techniques such as Named Entity Recognition (NER) and Clinical Assertion. Our results led to the finding that current state-of-the-art Bidirectional Encoder Representations from Transformers (BERT) bi-encoders outperform cross-encoders on this task, while showing that sparse retrieval methods are capable of obtaining competitive outcomes; finally, we showed that the available demographic information can be used to improve the final result.
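The demographic filtering step mentioned above can be sketched as a simple eligibility check applied before neural reranking (field names and trial IDs here are hypothetical, not TREC data):

```python
# Toy eligibility filter: trials whose age/sex criteria contradict the
# patient note are excluded from the candidate list before reranking.
trials = [
    {"id": "NCT001", "min_age": 18, "max_age": 65, "sex": "all"},
    {"id": "NCT002", "min_age": 0,  "max_age": 12, "sex": "all"},
    {"id": "NCT003", "min_age": 18, "max_age": 99, "sex": "female"},
]

def eligible(trial, age, sex):
    """True if the patient's demographics satisfy the trial criteria."""
    if not (trial["min_age"] <= age <= trial["max_age"]):
        return False
    return trial["sex"] in ("all", sex)

patient = {"age": 34, "sex": "male"}
shortlist = [t["id"] for t in trials if eligible(t, **patient)]
print(shortlist)  # ['NCT001']
```

Real trial criteria are free text, so in practice the structured fields above would first have to be extracted with NER and assertion techniques like those the abstract describes.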
NASTyLinker: NIL-Aware Scalable Transformer-based Entity Linker
Entity Linking (EL) is the task of detecting mentions of entities in text and
disambiguating them to a reference knowledge base. Most prevalent EL approaches
assume that the reference knowledge base is complete. In practice, however, it
is necessary to deal with the case of linking to an entity that is not
contained in the knowledge base (NIL entity). Recent works have shown that,
instead of focusing only on affinities between mentions and entities,
inter-mention affinities can be used to represent NIL entities by
producing clusters of mentions. At the same time, inter-mention affinities can
help to substantially improve linking performance for known entities. With
NASTyLinker, we introduce an EL approach that is aware of NIL entities and
produces corresponding mention clusters while maintaining high linking
performance for known entities. The approach clusters mentions and entities
based on dense representations from Transformers and resolves conflicts (if
more than one entity is assigned to a cluster) by computing transitive
mention-entity affinities. We show the effectiveness and scalability of
NASTyLinker on NILK, a dataset that is explicitly constructed to evaluate EL
with respect to NIL entities. Further, we apply the presented approach to an
actual EL task, namely to knowledge graph population by linking entities in
Wikipedia listings, and provide an analysis of the outcome.
Comment: Preprint of a paper in the research track of the 20th Extended Semantic Web Conference (ESWC'23)
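A toy sketch of the clustering idea above, assuming precomputed high-affinity pairs (the paper derives affinities from Transformer representations; the graph here is invented for illustration): mentions and known entities are nodes, high-affinity pairs are edges, and a connected component with no entity node represents a NIL entity.

```python
from collections import defaultdict

def components(nodes, edges):
    """Connected components of an undirected affinity graph."""
    adj = defaultdict(set)
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    seen, comps = set(), []
    for n in nodes:
        if n in seen:
            continue
        stack, comp = [n], set()
        while stack:
            cur = stack.pop()
            if cur in seen:
                continue
            seen.add(cur)
            comp.add(cur)
            stack.extend(adj[cur] - seen)
        comps.append(comp)
    return comps

nodes = ["m1", "m2", "m3", "E:Paris"]          # mentions + one known entity
edges = [("m1", "m2"), ("m1", "E:Paris")]      # m3 has no high-affinity link
for comp in components(nodes, edges):
    entities = [n for n in comp if n.startswith("E:")]
    label = entities[0] if entities else "NIL"
    print(sorted(comp), "->", label)
```

NASTyLinker's conflict-resolution step handles the case this sketch omits: when a component contains more than one entity node, transitive mention-entity affinities decide which mentions go with which entity.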
Towards Automatic Extraction of Social Networks of Organizations in PubMed Abstracts
Social Network Analysis (SNA) of organizations can attract great interest
from government agencies and scientists for its ability to boost translational
research and accelerate the process of converting research to care. For SNA of
a particular disease area, we need to identify the key research groups in that
area by mining the affiliation information from PubMed. This involves not only
recognizing the organization names in the affiliation string, but also
resolving ambiguities to associate each article with a unique organization. We
present here a process of normalization that involves clustering based on local
sequence alignment metrics and local learning based on finding connected
components. We demonstrate the application of the method by analyzing
organizations involved in angiogenesis treatment, and demonstrating the
utility of the results for researchers in the pharmaceutical and biotechnology
industries or national funding agencies.
Comment: This paper has been withdrawn; First International Workshop on Graph Techniques for Biomedical Networks in Conjunction with IEEE International Conference on Bioinformatics and Biomedicine, Washington D.C., USA, Nov. 1-4, 2009; http://www.public.asu.edu/~sjonnal3/home/papers/IEEE%20BIBM%202009.pd
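The normalization process described above can be sketched with stdlib string similarity as a stand-in for local sequence alignment, followed by connected components over the similarity links (names and the threshold are illustrative):

```python
from difflib import SequenceMatcher

# Hypothetical affiliation-string variants of the same organizations
names = [
    "Mayo Clinic, Rochester MN",
    "Mayo Clinic Rochester, Minnesota",
    "Stanford University School of Medicine",
]

# Union-find over names joined by similarity above a threshold
parent = list(range(len(names)))
def find(i):
    while parent[i] != i:
        i = parent[i]
    return i

for i in range(len(names)):
    for j in range(i + 1, len(names)):
        sim = SequenceMatcher(None, names[i].lower(), names[j].lower()).ratio()
        if sim > 0.7:                     # illustrative threshold
            parent[find(j)] = find(i)

clusters = {}
for i, name in enumerate(names):
    clusters.setdefault(find(i), []).append(name)
print(list(clusters.values()))
```

Each resulting cluster stands for one normalized organization; the paper's pipeline uses local sequence alignment metrics rather than `difflib` ratios, but the connected-components grouping is the same idea.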
Exploring the boundaries: gene and protein identification in biomedical text
Background: Good automatic information extraction tools offer hope for automatic processing of the exploding biomedical literature, and successful named entity recognition is a key component for such tools.
Methods: We present a maximum-entropy based system incorporating a diverse set of features for identifying gene and protein names in biomedical abstracts.
Results: This system was entered in the BioCreative comparative evaluation and achieved a precision of 0.83 and recall of 0.84 in the “open” evaluation and a precision of 0.78 and recall of 0.85 in the “closed” evaluation.
Conclusions: Central contributions are rich use of features derived from the training data at multiple levels of granularity, a focus on correctly identifying entity boundaries, and the innovative use of several external knowledge sources including full MEDLINE abstracts and web searches.
Background: The explosion of information in the biomedical domain and particularly in genetics has highlighted the need for automated text information extraction techniques. MEDLINE, the primary research database serving the biomedical community, currently contains over 14 million abstracts, with 60,000 new abstracts appearing each month. There is also an impressive number of molecular biological databases covering a
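A sketch of the kind of orthographic features such a maximum-entropy tagger might consume (the feature names are hypothetical, not the system's actual feature set):

```python
import re

def token_features(token):
    """Orthographic cues that often signal gene/protein names."""
    return {
        "lower": token.lower(),
        "has_digit": any(c.isdigit() for c in token),
        "has_hyphen": "-" in token,
        "mixed_case": bool(re.search(r"[a-z][A-Z]", token)),
        "greek": token.lower() in {"alpha", "beta", "gamma", "kappa"},
        "suffix3": token[-3:],
    }

print(token_features("p53"))
print(token_features("NF-kappaB"))
```

Features like these, extracted per token, would be fed to the maximum-entropy classifier alongside contextual and external-knowledge features; the boundary-identification focus the abstract mentions would additionally require features over adjacent tokens.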
A Dependency Parsing Approach to Biomedical Text Mining
Biomedical research is currently facing a new type of challenge: an excess of information, both in terms of raw data from experiments and in the number of scientific publications describing their results. Mirroring the focus on data mining techniques to address the issues of structured data, there has recently been great interest in the development and application of text mining techniques to make more effective use of the knowledge contained in biomedical scientific publications, accessible only in the form of natural human language.
This thesis describes research done in the broader scope of projects aiming to develop methods, tools and techniques for text mining tasks in general and for the biomedical domain in particular. The work described here involves more specifically the goal of extracting information from statements concerning relations of biomedical entities, such as protein-protein interactions. The approach taken is one using full parsing—syntactic analysis of the entire structure of sentences—and machine learning, aiming to develop reliable methods that can further be generalized to apply also to other domains.
The five papers at the core of this thesis describe research on a number of distinct but related topics in text mining. In the first of these studies, we assessed the applicability of two popular general English parsers to biomedical text mining and, finding their performance limited, identified several specific challenges to accurate parsing of domain text. In a follow-up study focusing on parsing issues related to specialized domain terminology, we evaluated three lexical adaptation methods. We found that the accurate resolution of unknown words can considerably improve parsing performance and introduced a domain-adapted parser that reduced the error rate of the original by 10% while also roughly halving parsing time.
To establish the relative merits of parsers that differ in the applied formalisms and the representation given to their syntactic analyses, we have also developed evaluation methodology, considering different approaches to establishing comparable dependency-based evaluation results. We introduced a methodology for creating highly accurate conversions between different parse representations, demonstrating the feasibility of unification of diverse syntactic schemes under a shared, application-oriented representation. In addition to allowing formalism-neutral evaluation, we argue that such unification can also increase the value of parsers for domain text mining. As a further step in this direction, we analysed the characteristics of publicly available biomedical corpora annotated for protein-protein interactions and created tools for converting them into a shared form, thus contributing also to the unification of text mining resources. The introduced unified corpora allowed us to perform a task-oriented comparative evaluation of biomedical text mining corpora. This evaluation established clear limits on the comparability of results for text mining methods evaluated on different resources, prompting further efforts toward standardization.
To support this and other research, we have also designed and annotated BioInfer, the first domain corpus of its size combining annotation of syntax and biomedical entities with a detailed annotation of their relationships. The corpus represents a major design and development effort of the research group, with manual annotation that identifies over 6000 entities, 2500 relationships and 28,000 syntactic dependencies in 1100 sentences. In addition to combining these key annotations for a single set of sentences, BioInfer was also the first domain resource to introduce a representation of entity relations that is supported by ontologies and able to capture complex, structured relationships.
Part I of this thesis presents a summary of this research in the broader context of a text mining system, and Part II contains reprints of the five included publications.