3,984 research outputs found
Neural Representations of Concepts and Texts for Biomedical Information Retrieval
Information retrieval (IR) methods are an indispensable tool in the current landscape of exponentially increasing textual data, especially on the Web. A typical IR task involves fetching and ranking a set of documents (from a large corpus) in terms of relevance to a user's query, which is often expressed as a short phrase. IR methods are the backbone of modern search engines, where additional system-level aspects including fault tolerance, scale, user interfaces, and session maintenance are also addressed. In addition to fetching documents, modern search systems may also identify snippets within the documents that are potentially most relevant to the input query. Furthermore, current systems may also maintain preprocessed structured knowledge derived from textual data as so-called knowledge graphs, so certain types of queries that are posed as questions can be parsed as such; a response can be an output of one or more named entities instead of a ranked list of documents (e.g., what diseases are associated with EGFR mutations?). This refined setup is often termed question answering (QA) in the IR and natural language processing (NLP) communities.
In biomedicine and healthcare, specialized corpora are often at play, including research articles by scientists, clinical notes generated by healthcare professionals, consumer forums for specific conditions (e.g., cancer survivors network), and clinical trial protocols (e.g., www.clinicaltrials.gov). Biomedical IR is specialized because the types of queries and the variations in the texts differ from those of general Web documents. For example, scientific articles are more formal with longer sentences, whereas clinical notes tend to have less grammatical conformity and are rife with abbreviations. There is also a mismatch between the vocabulary of consumers and the lingo of domain experts and professionals. Queries are also different and can range from simple phrases (e.g., COVID-19 symptoms) to more complex implicitly fielded queries (e.g., chemotherapy regimens for stage IV lung cancer patients with ALK mutations). Hence, developing methods for different configurations (corpus, query type, user type) needs more deliberate attention in biomedical IR.
Representations of documents and queries are at the core of IR methods, and retrieval methodology involves coming up with these representations and matching queries with documents based on them. Traditional IR systems follow the approach of keyword-based indexing of documents (the so-called inverted index) and matching query phrases against the document index. It is not difficult to see that this keyword-based matching ignores the semantics of texts (synonymy at the lexeme level and entailment at phrase/clause/sentence levels). This has led to dimensionality reduction methods such as latent semantic indexing that generally have scale-related concerns; such methods also do not address similarity at the sentence level. Since the resurgence of neural network methods in NLP, the IR field has also moved to incorporate advances in neural networks into current IR methods.
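The limitation of keyword matching described above can be illustrated with a minimal sketch of an inverted index. The tokenizer, the function names, and the two toy documents are assumptions made for illustration only; they are not taken from the dissertation. The point is that a query for "heart attack" cannot reach a document that only says "myocardial infarction".

```python
from collections import defaultdict

PUNCT = ".,;:()?!"

def tokenize(text):
    """Lowercase, split on whitespace, and strip basic punctuation."""
    tokens = (tok.strip(PUNCT).lower() for tok in text.split())
    return [tok for tok in tokens if tok]

def build_inverted_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index

def search(index, query):
    """Return ids of documents containing every query term (boolean AND)."""
    terms = tokenize(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result

# Hypothetical toy corpus: two documents about the same clinical concept.
docs = {
    "d1": "Heart attack risk rises with age.",
    "d2": "Myocardial infarction is treated with aspirin.",
}
index = build_inverted_index(docs)
```

With this index, `search(index, "heart attack")` finds only `d1`, even though `d2` discusses the same condition under a synonym. That gap is what dense representations aim to close.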
This dissertation presents four specific methodological efforts toward improving biomedical IR. Neural methods always begin with dense embeddings for words and concepts to overcome the limitations of one-hot encoding in traditional NLP/IR. In the first effort, we present a new neural pre-training approach to jointly learn word and concept embeddings for downstream use in applications. In the second study, we present a joint neural model for two essential subtasks of information extraction (IE): named entity recognition (NER) and entity normalization (EN). Our method detects biomedical concept phrases in texts and links them to the corresponding semantic types and entity codes. These first two studies provide essential tools to model textual representations as compositions of both surface forms (lexical units) and high-level concepts with potential downstream use in QA. In the third effort, we present a document reranking model that can help surface documents that are likely to contain answers (e.g., factoids, lists) to a question in a QA task. The model is essentially a sentence-matching neural network that learns the relevance of a candidate answer sentence to the given question, parametrized with a bilinear map. In the fourth effort, we present another document reranking approach that is tailored for precision medicine use cases. It combines neural query-document matching and faceted text summarization. The main distinction of this effort from previous efforts is the pivot from a query manipulation setup to transforming candidate documents into pseudo-queries via neural text summarization. Overall, our contributions constitute nontrivial advances in biomedical IR using neural representations of concepts and texts.
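The abstract does not give the exact form of the bilinear map used for question-answer matching. A common form, shown here only as an assumed sketch, scores a question vector q and a candidate sentence vector a as sigmoid(q^T M a + bias), where M is a learned matrix. The function name, the sigmoid, and the bias term are all assumptions, and the vectors would in practice come from a neural sentence encoder rather than being written by hand.

```python
import math

def bilinear_score(q, M, a, bias=0.0):
    """Return sigmoid(q^T M a + bias) for plain-list vectors q, a and matrix M."""
    if len(M) != len(q) or any(len(row) != len(a) for row in M):
        raise ValueError("M must have shape len(q) x len(a)")
    # Compute M a first, then take the dot product with q.
    Ma = [sum(m_ij * a_j for m_ij, a_j in zip(row, a)) for row in M]
    raw = sum(q_i * v for q_i, v in zip(q, Ma)) + bias
    return 1.0 / (1.0 + math.exp(-raw))

# Hypothetical 2-dimensional example with M set to the identity matrix,
# in which case the raw score reduces to the dot product of q and a.
identity = [[1.0, 0.0], [0.0, 1.0]]
score = bilinear_score([1.0, 2.0], identity, [3.0, 4.0])
```

Because M need not be square or symmetric, this form lets the question and the answer sentence live in different embedding spaces, which is one reason bilinear scoring is often chosen over a plain dot product.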
Hybrid Query Expansion on Ontology Graph in Biomedical Information Retrieval
Nowadays, biomedical researchers publish thousands of papers and journals every day. Searching through biomedical literature to keep up with the state of the art is a task of increasing difficulty for many individual researchers. The continuously increasing amount of biomedical text data has resulted in high demand for an efficient and effective biomedical information retrieval (BIR) system. Though many existing information retrieval techniques can be directly applied in BIR, BIR distinguishes itself in the extensive use of biomedical terms and abbreviations, which present high ambiguity. First of all, we studied a fundamental yet simpler problem of word semantic similarity. We proposed a novel semantic word similarity algorithm and related tools called Weighted Edge Similarity Tools (WEST). WEST was motivated by our discovery that humans are more sensitive to the semantic difference due to categorization than to that due to generalization/specification. Unlike most existing methods, which model the semantic similarity of words based on either the depth of their Lowest Common Ancestor (LCA) or the traversal distance between the word pair in WordNet, WEST also considers the joint contribution of the weighted distance between two words and the weighted depth of their LCA in WordNet. Experiments show that the weighted-edge-based word similarity method achieves 83.5% accuracy against human judgments. The query expansion problem can be viewed as selecting the top k words that have the maximum accumulated similarity to a given word set. It has proved to be an effective method in BIR and has been studied for over two decades. However, most previous research focuses on only one controlled vocabulary: MeSH. In addition, early studies find that applying an ontology won't necessarily improve search performance.
In this dissertation, we propose a novel graph-based query expansion approach that takes advantage of the global information from multiple controlled vocabularies by building a biomedical ontology graph from selected vocabularies in the Metathesaurus. We apply the Personalized PageRank algorithm on the ontology graph to rank and identify top terms that are highly relevant to the original user query, yet not present in that query. Those new terms are reordered by a weighted scheme to prioritize specialized concepts. We multiply the final selected terms by a scaling factor to prevent query drift and append them to the original query in the search. Experiments show that our approach achieves a 17.7% improvement in 11-point average precision and recall over Lucene's default indexing and searching strategy and is 24.8% better than all the other strategies on average. Furthermore, we observe that expanding with specialized concepts rather than generalized concepts can substantially improve the recall-precision performance. We have also successfully transferred WEST from the underlying WordNet graph to the biomedical ontology graph constructed from multiple controlled vocabularies in the Metathesaurus. Experiments indicate that WEST further improves the recall-precision performance. Finally, we have developed a Graph-based Biomedical Search Engine (G-Bean) for retrieving and visualizing information from the literature using our proposed query expansion algorithm. G-Bean accepts any medical-related user query and processes it into an expanded medical query to search the MEDLINE database.
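The expansion pipeline described in this abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the toy graph, the damping value, the number of iterations, the `scale` weight, and both function names are hypothetical, and the weighted reordering that prioritizes specialized concepts is omitted because the abstract does not specify it.

```python
def personalized_pagerank(graph, seeds, damping=0.85, iterations=50):
    """Power iteration for PageRank with restarts to the seed nodes.

    graph: dict mapping node -> list of neighbour nodes (directed edges).
    seeds: nodes that receive the restart (teleport) probability.
    """
    nodes = set(graph) | {n for nbrs in graph.values() for n in nbrs}
    seeds = [s for s in seeds if s in nodes]
    if not seeds:
        raise ValueError("at least one seed must be in the graph")
    restart = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    rank = dict(restart)
    for _ in range(iterations):
        new_rank = {n: (1.0 - damping) * restart[n] for n in nodes}
        for node in nodes:
            nbrs = graph.get(node, [])
            if nbrs:
                share = damping * rank[node] / len(nbrs)
                for nbr in nbrs:
                    new_rank[nbr] += share
            else:
                # Dangling node: send its mass back to the seeds.
                for n in nodes:
                    new_rank[n] += damping * rank[node] * restart[n]
        rank = new_rank
    return rank

def expand_query(query_terms, graph, k=2, scale=0.5):
    """Return (term, weight) pairs: original terms at 1.0, k new terms at `scale`."""
    rank = personalized_pagerank(graph, query_terms)
    candidates = sorted(
        (n for n in rank if n not in query_terms),
        key=lambda n: (-rank[n], n),
    )
    return [(t, 1.0) for t in query_terms] + [(t, scale) for t in candidates[:k]]

# Hypothetical toy ontology graph; real edges would come from the Metathesaurus.
graph = {
    "lung cancer": ["nsclc", "carcinoma"],
    "nsclc": ["lung cancer", "egfr mutation"],
    "egfr mutation": ["nsclc"],
    "carcinoma": ["lung cancer", "neoplasm"],
    "neoplasm": ["carcinoma"],
    "diabetes": ["insulin"],
    "insulin": ["diabetes"],
}
expanded = expand_query(["lung cancer"], graph, k=2, scale=0.5)
```

In this toy graph the two expansion terms are the direct neighbours of "lung cancer", while the disconnected "diabetes" component receives no rank at all. Down-weighting the added terms with `scale` mirrors the scaling factor the abstract uses to limit query drift.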
TrialMatch: A Transformer Architecture to Match Patients to Clinical Trials
Around 80% of clinical trials fail to meet their patient recruitment requirements, which
not only hinders market growth but also delays patients' access to new and effective
treatments. A possible approach is to use Electronic Health Records (EHRs) to help
match patients to clinical trials. Past attempts at this goal were unsuccessful due to a
lack of data. In 2021, the Text REtrieval Conference (TREC) introduced the Clinical
Trials Track, where participants were challenged with retrieving relevant clinical trials
given patient descriptions simulating admission notes. Using the track results as a
baseline, we tackled the challenge with Information Retrieval (IR), implementing a
document-ranking pipeline in which we explore different retrieval methods, how to
filter the clinical trials based on their criteria, and reranking with Transformer-based
models. To tackle the problem, we explored models pre-trained on the biomedical
domain, how to deal with long queries and documents through query expansion and
passage selection, and how to distinguish an eligible clinical trial from an excluded
clinical trial, using techniques such as Named Entity Recognition (NER) and Clinical
Assertion. Our results led to the finding that current state-of-the-art Bidirectional
Encoder Representations from Transformers (BERT) bi-encoders outperform
cross-encoders on this task, while also showing that sparse retrieval methods are
capable of obtaining competitive results. Finally, we showed that the available
demographic information can be used to improve the final result.
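The abstract does not name the sparse retrieval method it evaluates. BM25 is a widely used sparse ranking function, so the sketch below uses it as an assumed stand-in to show what first-stage sparse retrieval over trial descriptions could look like. The parameter values `k1` and `b`, the pre-tokenized toy inputs, and the function name are illustrative assumptions.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each tokenized document against a tokenized query with Okapi BM25."""
    if not docs:
        return []
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in set(query):
            if term not in tf:
                continue
            # A common non-negative IDF variant (note the "1 +" inside the log).
            idf = math.log(1.0 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Length normalisation: longer-than-average documents are penalised.
            norm = tf[term] + k1 * (1.0 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1.0) / norm
        scores.append(score)
    return scores

# Hypothetical pre-tokenized trial descriptions and patient query.
trials = [
    ["egfr", "mutation", "lung", "cancer", "trial"],
    ["type", "2", "diabetes", "insulin", "trial"],
]
patient = ["lung", "cancer", "egfr"]
scores = bm25_scores(patient, trials)
```

In a pipeline like the one described, scores of this kind would only select candidate trials; eligibility filtering and Transformer-based reranking would follow as separate stages.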
GRAPHENE: A Precise Biomedical Literature Retrieval Engine with Graph Augmented Deep Learning and External Knowledge Empowerment
Effective biomedical literature retrieval (BLR) plays a central role in
precision medicine informatics. In this paper, we propose GRAPHENE, which is a
deep-learning-based framework for precise BLR. GRAPHENE consists of three main
modules: 1) graph-augmented document representation learning; 2) query
expansion and representation learning; and 3) learning to rank biomedical
articles. The graph-augmented document representation learning module
constructs a document-concept graph containing biomedical concept nodes and
document nodes so that global biomedical concepts from an external
knowledge source can be captured, and this graph is further connected to a BiLSTM so
both local and global topics can be explored. The query expansion and
representation learning module expands the query with abbreviations and
different names, and then builds a CNN-based model to convolve the expanded
query and obtain a vector representation for each query. The learning-to-rank module
minimizes a ranking loss between biomedical articles and the query to learn
the retrieval function. Experimental results on applying our system to TREC
Precision Medicine track data are provided to demonstrate its effectiveness.
Comment: CIKM 201