Learning Contextualized Document Representations for Healthcare Answer Retrieval
We present Contextual Discourse Vectors (CDV), a distributed document
representation for efficient answer retrieval from long healthcare documents.
Our approach is based on structured query tuples of entities and aspects from
free text and medical taxonomies. Our model leverages a dual encoder
architecture with hierarchical LSTM layers and multi-task training to encode
the position of clinical entities and aspects alongside the document discourse.
We use our continuous representations to resolve queries with short latency
using approximate nearest neighbor search on sentence level. We apply the CDV
model for retrieving coherent answer passages from nine English public health
resources from the Web, addressing both patients and medical professionals.
Because there is no end-to-end training data available for all application
scenarios, we train our model with self-supervised data from Wikipedia. We show
that our generalized model significantly outperforms several state-of-the-art
baselines for healthcare passage ranking and is able to adapt to heterogeneous
domains without additional fine-tuning.
Comment: The Web Conference 2020 (WWW '20)
Neural Representations of Concepts and Texts for Biomedical Information Retrieval
Information retrieval (IR) methods are an indispensable tool in the current landscape of exponentially increasing textual data, especially on the Web. A typical IR task involves fetching and ranking a set of documents (from a large corpus) in terms of relevance to a user's query, which is often expressed as a short phrase. IR methods are the backbone of modern search engines where additional system-level aspects including fault tolerance, scale, user interfaces, and session maintenance are also addressed. In addition to fetching documents, modern search systems may also identify snippets within the documents that are potentially most relevant to the input query. Furthermore, current systems may also maintain preprocessed structured knowledge derived from textual data as so-called knowledge graphs, so certain types of queries that are posed as questions can be parsed as such; a response can be an output of one or more named entities instead of a ranked list of documents (e.g., what diseases are associated with EGFR mutations?). This refined setup is often termed question answering (QA) in the IR and natural language processing (NLP) communities.
In biomedicine and healthcare, specialized corpora are often at play including research articles by scientists, clinical notes generated by healthcare professionals, consumer forums for specific conditions (e.g., cancer survivors network), and clinical trial protocols (e.g., www.clinicaltrials.gov). Biomedical IR is specialized because the types of queries and the variations in the texts differ from those of general Web documents. For example, scientific articles are more formal with longer sentences, but clinical notes tend to have less grammatical conformity and are rife with abbreviations. There is also a mismatch between the vocabulary of consumers and the lingo of domain experts and professionals. Queries are also different and can range from simple phrases (e.g., COVID-19 symptoms) to more complex implicitly fielded queries (e.g., chemotherapy regimens for stage IV lung cancer patients with ALK mutations). Hence, developing methods for different configurations (corpus, query type, user type) needs more deliberate attention in biomedical IR.
Representations of documents and queries are at the core of IR methods, and retrieval methodology involves coming up with these representations and matching queries with documents based on them. Traditional IR systems follow the approach of keyword-based indexing of documents (the so-called inverted index) and matching query phrases against the document index. It is not difficult to see that this keyword-based matching ignores the semantics of texts (synonymy at the lexeme level and entailment at phrase/clause/sentence levels), and this has led to dimensionality reduction methods such as latent semantic indexing that generally have scale-related concerns; such methods also do not address similarity at the sentence level. Since the resurgence of neural network methods in NLP, the IR field has also moved to incorporate advances in neural networks into current IR methods.
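As a rough illustration of the keyword-based inverted index described above, the following minimal sketch builds an index over two toy documents and runs boolean AND queries against it. The function names, the whitespace tokenizer, and the example documents are assumptions made for illustration and do not come from the dissertation.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each lowercase token to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():  # naive whitespace tokenizer (assumption)
            index[token].add(doc_id)
    return index


def search(index, query):
    """Return ids of documents that contain every query token (boolean AND)."""
    postings = [index.get(token, set()) for token in query.lower().split()]
    return set.intersection(*postings) if postings else set()


# Hypothetical toy corpus: both documents are about the same disease,
# but they share no surface vocabulary.
docs = {
    1: "chemotherapy regimens for lung cancer",
    2: "treatment of pulmonary carcinoma",
}
index = build_inverted_index(docs)

print(search(index, "lung cancer"))          # finds document 1 only
print(search(index, "pulmonary carcinoma"))  # finds document 2 only
```

Neither query retrieves the other document even though the two are near-synonymous, which is the kind of semantic gap that dense representations are meant to close.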
This dissertation presents four specific methodological efforts toward improving biomedical IR. Neural methods typically begin with dense embeddings for words and concepts to overcome the limitations of one-hot encoding in traditional NLP/IR. In the first effort, we present a new neural pre-training approach to jointly learn word and concept embeddings for downstream use in applications. In the second study, we present a joint neural model for two essential subtasks of information extraction (IE): named entity recognition (NER) and entity normalization (EN). Our method detects biomedical concept phrases in texts and links them to the corresponding semantic types and entity codes. These first two studies provide essential tools to model textual representations as compositions of both surface forms (lexical units) and high-level concepts with potential downstream use in QA. In the third effort, we present a document reranking model that can help surface documents that are likely to contain answers (e.g., factoids, lists) to a question in a QA task. The model is essentially a sentence matching neural network that learns the relevance of a candidate answer sentence to the given question, parametrized with a bilinear map. In the fourth effort, we present another document reranking approach that is tailored for precision medicine use cases. It combines neural query-document matching and faceted text summarization. The main distinction of this effort from previous efforts is the pivot from a query manipulation setup to transforming candidate documents into pseudo-queries via neural text summarization. Overall, our contributions constitute nontrivial advances in biomedical IR using neural representations of concepts and texts.
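The bilinear relevance score mentioned for the third effort can be sketched as s(q, a) = q^T W a, where q and a are vector encodings of the question and the candidate answer sentence and W is a learned matrix. The snippet below is only a sketch under assumptions: the abstract does not say how the vectors are produced or how the score is normalized, so the hand-written vectors, the fixed matrix, and the sigmoid squashing are all illustrative choices.

```python
import math


def bilinear_score(q, W, a):
    """Compute q^T W a for vectors q, a and matrix W given as nested lists."""
    Wa = [sum(w_ij * a_j for w_ij, a_j in zip(row, a)) for row in W]
    return sum(q_i * wa_i for q_i, wa_i in zip(q, Wa))


def relevance(q, W, a):
    """Squash the bilinear score into (0, 1) with a sigmoid (an assumption)."""
    return 1.0 / (1.0 + math.exp(-bilinear_score(q, W, a)))


# Hypothetical 2-dimensional encodings and a fixed matrix; in a trained model
# W would be learned and q, a would come from sentence encoders.
q = [1.0, 0.0]
a = [0.0, 1.0]
W = [[0.0, 2.0],
     [1.0, 0.0]]

print(bilinear_score(q, W, a))
print(relevance(q, W, a))
```

With W set to the identity matrix the score reduces to a plain dot product; a learned W lets the model relate different dimensions of the question and answer encodings.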
A Survey on Conversational Search and Applications in Biomedicine
This paper aims to provide a thorough rundown of Conversational Search
(ConvSearch), an approach to enhancing information retrieval in which
users engage in a dialogue for information-seeking tasks. In this survey,
we focus predominantly on the human interactive characteristics of
ConvSearch systems, highlighting the operations of the action modules, namely
the Retrieval system, Question-Answering, and Recommender system. We label
various ConvSearch research problems in knowledge bases, natural language
processing, and dialogue management systems along with the action modules. We
further categorize the ConvSearch framework and direct its application
toward the biomedical and healthcare fields for the utilization of clinical
social technology. Finally, we conclude by discussing the challenges and issues
of ConvSearch, particularly in biomedicine. Our main aim is to provide an
integrated and unified vision of the ConvSearch components from different
fields, which benefits the information-seeking process in healthcare systems.
GrabQC: Graph based Query Contextualization for automated ICD coding
Automated medical coding is a process of codifying clinical notes to
appropriate diagnosis and procedure codes automatically from the standard
taxonomies such as ICD (International Classification of Diseases) and CPT
(Current Procedural Terminology). The manual coding process involves the
identification of entities from the clinical notes followed by querying a
commercial or non-commercial medical codes Information Retrieval (IR) system
that follows the Centers for Medicare and Medicaid Services (CMS) guidelines. We
propose to automate this manual process by automatically constructing a query
for the IR system using the entities auto-extracted from the clinical notes. We
propose GrabQC, a Graph based Query
Contextualization method that automatically extracts queries from the
clinical text, contextualizes the queries using a Graph Neural Network (GNN)
model and obtains the ICD Codes using an external IR system. We also propose a
method for labelling the dataset for training the model. We perform experiments
on two datasets of clinical text in three different setups to assert the
effectiveness of our approach. The experimental results show that our proposed
method is better than the compared baselines in all three settings.
Comment: 25th Pacific-Asia Conference on Knowledge Discovery and Data Mining
(PAKDD 2021)
Generating Question-Answer Hierarchies
The process of knowledge acquisition can be viewed as a question-answer game
between a student and a teacher in which the student typically starts by asking
broad, open-ended questions before drilling down into specifics (Hintikka,
1981; Hakkarainen and Sintonen, 2002). This pedagogical perspective motivates a
new way of representing documents. In this paper, we present SQUASH
(Specificity-controlled Question-Answer Hierarchies), a novel and challenging
text generation task that converts an input document into a hierarchy of
question-answer pairs. Users can click on high-level questions (e.g., "Why did
Frodo leave the Fellowship?") to reveal related but more specific questions
(e.g., "Who did Frodo leave with?"). Using a question taxonomy loosely based on
Lehnert (1978), we classify questions in existing reading comprehension
datasets as either "general" or "specific". We then use these labels as input
to a pipelined system centered around a conditional neural language model. We
extensively evaluate the quality of the generated QA hierarchies through
crowdsourced experiments and report strong empirical results.
Comment: ACL camera ready + technical note on pipeline modifications for demo
(15 pages)