What Disease does this Patient Have? A Large-scale Open Domain Question Answering Dataset from Medical Exams
Open domain question answering (OpenQA) tasks have been recently attracting
more and more attention from the natural language processing (NLP) community.
In this work, we present the first free-form multiple-choice OpenQA dataset for
solving medical problems, MedQA, collected from the professional medical board
exams. It covers three languages: English, simplified Chinese, and traditional
Chinese, and contains 12,723, 34,251, and 14,123 questions for the three
languages, respectively. We implement both rule-based and popular neural
methods by sequentially combining a document retriever and a machine
comprehension model. Through experiments, we find that even the current best
method achieves only 36.7%, 42.0%, and 70.1% test accuracy on the
English, traditional Chinese, and simplified Chinese questions, respectively.
We expect MedQA to present great challenges to existing OpenQA systems and hope
that it can serve as a platform to promote much stronger OpenQA models from the
NLP community in the future.
Comment: Submitted to AAAI 202
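The retriever-plus-reader pipeline described above can be sketched in miniature. Everything here is an illustrative assumption, not the paper's actual models: a lexical-overlap retriever stands in for the document retriever, and a term-overlap scorer stands in for the machine comprehension model that picks a multiple-choice answer.

```python
# Hypothetical sketch of a sequential retriever + reader pipeline for
# multiple-choice OpenQA. The toy overlap scoring below is illustrative,
# not the paper's actual method.

def tokenize(text):
    return text.lower().split()

def retrieve(question, corpus):
    """Retriever: return the document sharing the most terms with the question."""
    q = set(tokenize(question))
    return max(corpus, key=lambda doc: len(q & set(tokenize(doc))))

def answer(question, options, corpus):
    """Reader: score each option by its overlap with the retrieved evidence."""
    evidence = set(tokenize(retrieve(question, corpus)))
    return max(options, key=lambda o: len(set(tokenize(o)) & evidence))

corpus = [
    "metformin is a first line treatment for type 2 diabetes",
    "amoxicillin treats bacterial infections such as otitis media",
]
choice = answer(
    "which drug is a first line treatment for type 2 diabetes",
    ["metformin", "amoxicillin"],
    corpus,
)
print(choice)  # metformin
```

A real system would replace both stages with trained neural models, but the control flow — retrieve evidence first, then comprehend it to pick an option — is the same.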
Dense Text Retrieval based on Pretrained Language Models: A Survey
Text retrieval is a long-standing research topic in information seeking,
where a system is required to return relevant information resources in
response to users' queries in natural language. From classic retrieval
methods to learning-based ranking functions, the underlying retrieval models
have continually evolved with ongoing technical innovation. To design effective
retrieval models, a key point lies in how to learn the text representation and
model the relevance matching. The recent success of pretrained language models
(PLMs) sheds light on developing more capable text retrieval approaches by
leveraging the excellent modeling capacity of PLMs. With powerful PLMs, we can
effectively learn the representations of queries and texts in the latent
representation space, and further construct the semantic matching function
between the dense vectors for relevance modeling. Such a retrieval approach is
referred to as dense retrieval, since it employs dense vectors (a.k.a.,
embeddings) to represent the texts. Considering the rapid progress on dense
retrieval, in this survey, we systematically review the recent advances on
PLM-based dense retrieval. Different from previous surveys on dense retrieval,
we take a new perspective to organize the related work by four major aspects,
including architecture, training, indexing and integration, and summarize the
mainstream techniques for each aspect. We thoroughly survey the literature, and
include 300+ related reference papers on dense retrieval. To support our
survey, we create a website providing useful resources, and release a code
repository and toolkit for implementing dense retrieval models. This survey
aims to provide a comprehensive, practical reference focused on the major
progress in dense text retrieval.
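The core mechanic the survey describes — encoding queries and texts into a shared vector space and ranking by vector similarity — can be sketched as follows. The bag-of-words `embed` below is a deterministic stand-in for a pretrained language model encoder; real dense retrievers use low-dimensional learned embeddings, so this is only a shape-of-the-computation sketch.

```python
# Minimal sketch of dense retrieval: encode query and documents as
# L2-normalised vectors, then rank documents by dot-product (cosine)
# similarity. The toy encoder stands in for a PLM.
import math

def embed(text, vocab):
    """Toy encoder: one dimension per vocabulary term, L2-normalised."""
    v = [0.0] * len(vocab)
    for tok in text.lower().split():
        if tok in vocab:
            v[vocab[tok]] += 1.0
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]

def search(query, docs, k=1):
    """Rank docs by similarity between query and document vectors."""
    vocab = {t: i for i, t in enumerate(
        sorted({w for d in docs for w in d.lower().split()}))}
    q = embed(query, vocab)
    scored = sorted(((sum(a * b for a, b in zip(q, embed(d, vocab))), d)
                     for d in docs), reverse=True)
    return [d for _, d in scored[:k]]

docs = [
    "dense retrieval with pretrained language models",
    "classic sparse retrieval with inverted indexes",
    "recipes for sourdough bread",
]
top = search("pretrained models for dense retrieval", docs, k=1)
print(top[0])
```

In production, document vectors are precomputed and stored in an approximate nearest-neighbour index, so only the query is encoded at search time.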
A Survey of Source Code Search: A 3-Dimensional Perspective
(Source) code search has attracted wide attention from software engineering
researchers because it can improve the productivity and quality of software development.
Given a functionality requirement usually described in a natural language
sentence, a code search system can retrieve code snippets that satisfy the
requirement from a large-scale code corpus, e.g., GitHub. To realize effective
and efficient code search, many techniques have been proposed successively.
These techniques improve code search performance mainly by optimizing three
core components, including query understanding component, code understanding
component, and query-code matching component. In this paper, we provide a
3-dimensional perspective survey for code search. Specifically, we categorize
existing code search studies into query-end optimization techniques, code-end
optimization techniques, and match-end optimization techniques according to the
specific components they optimize. Considering that each end can be optimized
independently and contributes to the code search performance, we treat each end
as a dimension. Therefore, this survey is 3-dimensional in nature, and it
provides a comprehensive summary of each dimension in detail. To understand the
research trends of the three dimensions in existing code search studies, we
systematically review 68 relevant studies. Different from existing code
search surveys that only focus on the query end or code end or introduce
various aspects shallowly (including codebase, evaluation metrics, modeling
technique, etc.), our survey provides a more nuanced analysis and review of the
evolution and development of the underlying techniques used in the three ends.
Based on a systematic review and summary of existing work, we outline several
open challenges and opportunities at the three ends that remain to be addressed
in future work.
Comment: Submitted to ACM Transactions on Software Engineering and Methodology
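The three ends of the survey's taxonomy can be illustrated with a toy pipeline: a query-end step that expands the query, a code-end step that normalises identifiers, and a match-end step that scores their overlap. All rules here (the synonym table, the camelCase splitter, the overlap score) are hypothetical assumptions for illustration, not techniques from any surveyed paper.

```python
# Toy 3-component code search: query understanding, code understanding,
# and query-code matching, each optimisable independently.
import re

SYNONYMS = {"read": ["load", "open"], "file": ["path"]}  # query end

def expand_query(query):
    """Query end: add (hypothetical) synonyms to the query terms."""
    terms = set(query.lower().split())
    for t in list(terms):
        terms.update(SYNONYMS.get(t, []))
    return terms

def code_terms(snippet):
    """Code end: extract identifiers and split camelCase into words."""
    out = set()
    for w in re.findall(r"[A-Za-z]+", snippet):
        out.update(p.lower() for p in re.findall(r"[A-Z]?[a-z]+|[A-Z]+", w))
    return out

def match_score(query, snippet):
    """Match end: fraction of expanded query terms found in the code."""
    q, c = expand_query(query), code_terms(snippet)
    return len(q & c) / max(len(q), 1)

snippets = [
    "def loadFile(path): return open(path).read()",
    "def addNumbers(a, b): return a + b",
]
best = max(snippets, key=lambda s: match_score("read file", s))
print(best)
```

Each component maps to one dimension of the survey: swapping in a better synonym model improves the query end, a better identifier splitter the code end, and a learned similarity the match end, without touching the other two.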
TrialMatch: A Transformer Architecture to Match Patients to Clinical Trials
Around 80% of clinical trials fail to meet their patient recruitment requirements, which
not only hinders market growth but also delays patients' access to new and effective
treatments. A possible approach is to use Electronic Health Records (EHRs) to help
match patients to clinical trials. Past attempts at this exact goal took place but,
due to a lack of data, were unsuccessful. In 2021, the Text REtrieval Conference
(TREC) introduced the Clinical Trials Track, in which participants were challenged to
retrieve relevant clinical trials given patient descriptions simulating admission
notes. Using the track results as a baseline, we tackled the challenge by resorting to
Information Retrieval (IR), implementing a document-ranking pipeline in which we
explore different retrieval methods, how to filter clinical trials based on their
criteria, and reranking with Transformer-based models. To tackle the problem, we
explored models pre-trained on the biomedical domain, how to deal with long queries
and documents through query expansion and passage selection, and how to distinguish
an eligible clinical trial from an excluded one, using techniques such as Named
Entity Recognition (NER) and Clinical Assertion. Our results led to the finding that
current state-of-the-art Bidirectional Encoder Representations from Transformers (BERT)
bi-encoders outperform cross-encoders on this task, while also showing that
sparse retrieval methods can obtain competitive results; finally, we showed
that the available demographic information can be used to improve
the final result.
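The demographic filtering the abstract ends on can be sketched concretely: candidate trials from a first-stage ranker are filtered against the patient's age and sex before reranking. The trial fields and the filtering rule below are illustrative assumptions, not the thesis's actual pipeline.

```python
# Hypothetical eligibility filter: keep only trials whose demographic
# criteria (age range, sex) admit the patient. Field names are assumed.

def demographic_filter(patient, trials):
    """Return the trials whose age range and sex criteria admit the patient."""
    eligible = []
    for t in trials:
        if not (t["min_age"] <= patient["age"] <= t["max_age"]):
            continue  # patient outside the trial's age range
        if t["sex"] not in ("all", patient["sex"]):
            continue  # trial restricted to the other sex
        eligible.append(t)
    return eligible

patient = {"age": 67, "sex": "female"}
trials = [
    {"id": "NCT001", "min_age": 18, "max_age": 65, "sex": "all"},
    {"id": "NCT002", "min_age": 50, "max_age": 90, "sex": "female"},
    {"id": "NCT003", "min_age": 40, "max_age": 80, "sex": "male"},
]
kept = demographic_filter(patient, trials)
print([t["id"] for t in kept])  # ['NCT002']
```

In a full system this cheap structured filter runs before the expensive Transformer reranker, shrinking the candidate set it must score.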
A Zero-Shot Monolingual Dual Stage Information Retrieval System for Spanish Biomedical Systematic Literature Reviews
The authors would like to thank members of the Childhood Obesity in Mexico (COMO) project for supporting this research.
Systematic Reviews (SRs) are foundational in healthcare for synthesising evidence to inform clinical practices. Traditionally skewed towards English-language databases, SRs often exclude significant research in other languages, leading to potential biases. This study addresses this gap by focusing on Spanish, a language notably underrepresented in SRs. We present a foundational zero-shot dual information retrieval (IR) baseline system, integrating traditional retrieval methods with pre-trained language models and cross-attention re-rankers for enhanced accuracy in Spanish biomedical literature retrieval. Utilising the LILACS database, known for its comprehensive coverage of Latin American and Caribbean biomedical literature, we evaluate the approach with three real-life case studies in Spanish SRs. The findings demonstrate the system's efficacy and underscore the importance of query formulation. This study contributes to the field of IR by promoting language inclusivity and supports the development of more comprehensive and globally representative healthcare guidelines.
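The dual-stage design described above can be sketched schematically: a cheap first-stage lexical scorer shortlists candidates over the whole collection, and a second-stage reranker reorders only that shortlist. The rerank stub below stands in for a cross-attention model; all scoring rules here are illustrative assumptions.

```python
# Schematic two-stage retrieval: lexical shortlist, then rerank.

def stage1_lexical(query, docs, k=2):
    """First stage: shortlist the k docs with the highest term overlap."""
    q = set(query.lower().split())
    ranked = sorted(docs, key=lambda d: len(q & set(d.lower().split())),
                    reverse=True)
    return ranked[:k]

def stage2_rerank(query, shortlist):
    """Second stage (stub for a cross-attention reranker): prefer docs
    where a query term appears earliest."""
    q = set(query.lower().split())
    def score(d):
        hits = [i for i, t in enumerate(d.lower().split()) if t in q]
        return -min(hits) if hits else float("-inf")
    return sorted(shortlist, key=score, reverse=True)

docs = [
    "revision sistematica de obesidad infantil",
    "obesidad infantil en poblacion escolar",
    "tratamiento de la diabetes tipo dos",
]
ranked = stage2_rerank("obesidad infantil", stage1_lexical("obesidad infantil", docs))
print(ranked[0])
```

The point of the split is cost: the cheap stage touches every document, while the expensive (here stubbed) reranker touches only the shortlist.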
Neural Methods for Effective, Efficient, and Exposure-Aware Information Retrieval
Neural networks with deep architectures have demonstrated significant
performance improvements in computer vision, speech recognition, and natural
language processing. The challenges in information retrieval (IR), however, are
different from these other application areas. A common form of IR involves
ranking of documents--or short passages--in response to keyword-based queries.
Effective IR systems must deal with the query-document vocabulary mismatch
problem by modeling relationships between different query and document terms
and how they indicate relevance. Models should also consider lexical matches
when the query contains rare terms--such as a person's name or a product model
number--not seen during training, and avoid retrieving semantically related
but irrelevant results. In many real-life IR tasks, retrieval involves
extremely large collections--such as the document index of a commercial Web
search engine--containing billions of documents. Efficient IR methods should
take advantage of specialized IR data structures, such as inverted index, to
efficiently retrieve from large collections. Given an information need, the IR
system also mediates how much exposure an information artifact receives by
deciding whether it should be displayed, and where it should be positioned,
among other results. Exposure-aware IR systems may optimize for additional
objectives, besides relevance, such as parity of exposure for retrieved items
and content publishers. In this thesis, we present novel neural architectures
and methods motivated by the specific needs and challenges of IR tasks.
Comment: PhD thesis, University College London (2020)
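The "specialized IR data structures, such as inverted index" mentioned above can be sketched minimally: each term maps to the IDs of the documents containing it, so answering a query touches only the matching postings lists rather than scanning the whole collection.

```python
# Minimal inverted index: term -> set of document IDs (postings).
from collections import defaultdict

def build_index(docs):
    """Map each term to the set of document IDs containing it."""
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def query_and(index, terms):
    """Conjunctive query: documents containing every term
    (intersection of postings lists)."""
    postings = [index.get(t.lower(), set()) for t in terms]
    return set.intersection(*postings) if postings else set()

docs = [
    "neural ranking of documents",
    "keyword queries over large document collections",
    "neural networks for speech",
]
index = build_index(docs)
print(sorted(query_and(index, ["neural"])))  # [0, 2]
```

At Web scale the same idea holds, with postings lists compressed and sorted so that intersections and top-k scoring skip most of the billions of documents.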
Towards Robust Text Retrieval with Progressive Learning
Retrieval augmentation has become an effective solution to empower large
language models (LLMs) with external and verified knowledge sources from the
database, which overcomes the limitations and hallucinations of LLMs in
handling up-to-date and domain-specific information. However, existing
embedding models for text retrieval usually have three non-negligible
limitations. First, the number and diversity of samples in a batch are too
restricted to supervise the modeling of textual nuances at scale. Second, a
high proportion of noise is detrimental to the semantic correctness and
consistency of the embeddings. Third, treating easy and difficult samples
equally causes sub-optimal convergence of the embeddings and poorer
generalization. In this paper, we propose PEG, progressively learned
embeddings for robust text retrieval. Specifically, we increase the number of
in-batch negative samples during training to 80,000 and, for each query,
extract five hard negatives. Concurrently, we incorporate a progressive
learning mechanism, enabling the model to dynamically modulate its attention
to samples throughout the entire training process. Additionally, PEG is
trained on more than 100 million samples, encompassing a wide range of
domains (e.g., finance,
medicine, and tourism) and covering various tasks (e.g., question-answering,
machine reading comprehension, and similarity matching). Extensive experiments
conducted on C-MTEB and DuReader demonstrate that PEG surpasses
state-of-the-art embeddings in retrieving true positives, highlighting its
significant potential for applications in LLMs. Our model is publicly available
at https://huggingface.co/TownsWu/PEG
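The in-batch-negative training signal referenced above is typically an InfoNCE-style contrastive loss: each query is scored against its positive passage and every other passage in the batch, and a softmax cross-entropy pushes the positive's similarity above the rest. The sketch below uses toy similarity values and a hypothetical temperature; real systems compute the row from encoder outputs over batches of thousands of passages.

```python
# InfoNCE-style loss over one query's similarities to all in-batch
# passages. Similarities and temperature are illustrative values.
import math

def info_nce_loss(sim_row, positive_idx, temperature=0.5):
    """Softmax cross-entropy: -log p(positive | all in-batch passages)."""
    logits = [s / temperature for s in sim_row]
    m = max(logits)  # subtract max for numerical stability
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_z - logits[positive_idx]

# similarities of one query to 4 in-batch passages; passage 0 is the positive
row = [0.9, 0.2, 0.1, 0.15]
loss = info_nce_loss(row, positive_idx=0)
print(round(loss, 4))
```

Enlarging the batch (here, to 80,000 negatives) adds more terms to the denominator, which is exactly what supervises finer-grained distinctions between near-duplicate passages.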