121 research outputs found
Neural Ranking Models with Weak Supervision
Despite the impressive improvements achieved by unsupervised deep neural
networks in computer vision and NLP tasks, such improvements have not yet been
observed in ranking for information retrieval. The reason may be the complexity
of the ranking problem, as it is not obvious how to learn from queries and
documents when no supervised signal is available. Hence, in this paper, we
propose to train a neural ranking model using weak supervision, where labels
are obtained automatically without human annotators or any external resources
(e.g., click data). To this aim, we use the output of an unsupervised ranking
model, such as BM25, as a weak supervision signal. We further train a set of
simple yet effective ranking models based on feed-forward neural networks. We
study their effectiveness under various learning scenarios (point-wise and
pair-wise models) and using different input representations (i.e., from
encoding query-document pairs into dense/sparse vectors to using word embedding
representation). We train our networks using tens of millions of training
instances and evaluate it on two standard collections: a homogeneous news
collection(Robust) and a heterogeneous large-scale web collection (ClueWeb).
Our experiments indicate that employing proper objective functions and letting
the networks to learn the input representation based on weakly supervised data
leads to impressive performance, with over 13% and 35% MAP improvements over
the BM25 model on the Robust and the ClueWeb collections. Our findings also
suggest that supervised neural ranking models can greatly benefit from
pre-training on large amounts of weakly labeled data that can be easily
obtained from unsupervised IR models.Comment: In proceedings of The 40th International ACM SIGIR Conference on
Research and Development in Information Retrieval (SIGIR2017
Relevance-based Word Embedding
Learning a high-dimensional dense representation for vocabulary terms, also
known as a word embedding, has recently attracted much attention in natural
language processing and information retrieval tasks. The embedding vectors are
typically learned based on term proximity in a large corpus. This means that
the objective in well-known word embedding algorithms, e.g., word2vec, is to
accurately predict adjacent word(s) for a given word or context. However, this
objective is not necessarily equivalent to the goal of many information
retrieval (IR) tasks. The primary objective in various IR tasks is to capture
relevance instead of term proximity, syntactic, or even semantic similarity.
This is the motivation for developing unsupervised relevance-based word
embedding models that learn word representations based on query-document
relevance information. In this paper, we propose two learning models with
different objective functions; one learns a relevance distribution over the
vocabulary set for each query, and the other classifies each term as belonging
to the relevant or non-relevant class for each query. To train our models, we
used over six million unique queries and the top ranked documents retrieved in
response to each query, which are assumed to be relevant to the query. We
extrinsically evaluate our learned word representation models using two IR
tasks: query expansion and query classification. Both query expansion
experiments on four TREC collections and query classification experiments on
the KDD Cup 2005 dataset suggest that the relevance-based word embedding models
significantly outperform state-of-the-art proximity-based embedding models,
such as word2vec and GloVe.Comment: to appear in the proceedings of The 40th International ACM SIGIR
Conference on Research and Development in Information Retrieval (SIGIR '17
Ranking Significant Discrepancies in Clinical Reports
Medical errors are a major public health concern and a leading cause of death
worldwide. Many healthcare centers and hospitals use reporting systems where
medical practitioners write a preliminary medical report and the report is
later reviewed, revised, and finalized by a more experienced physician. The
revisions range from stylistic to corrections of critical errors or
misinterpretations of the case. Due to the large quantity of reports written
daily, it is often difficult to manually and thoroughly review all the
finalized reports to find such errors and learn from them. To address this
challenge, we propose a novel ranking approach, consisting of textual and
ontological overlaps between the preliminary and final versions of reports. The
approach learns to rank the reports based on the degree of discrepancy between
the versions. This allows medical practitioners to easily identify and learn
from the reports in which their interpretation most substantially differed from
that of the attending physician (who finalized the report). This is a crucial
step towards uncovering potential errors and helping medical practitioners to
learn from such errors, thus improving patient-care in the long run. We
evaluate our model on a dataset of radiology reports and show that our approach
outperforms both previously-proposed approaches and more recent language models
by 4.5% to 15.4%.Comment: ECIR 2020 (short
Deeper Text Understanding for IR with Contextual Neural Language Modeling
Neural networks provide new possibilities to automatically learn complex
language patterns and query-document relations. Neural IR models have achieved
promising results in learning query-document relevance patterns, but few
explorations have been done on understanding the text content of a query or a
document. This paper studies leveraging a recently-proposed contextual neural
language model, BERT, to provide deeper text understanding for IR. Experimental
results demonstrate that the contextual text representations from BERT are more
effective than traditional word embeddings. Compared to bag-of-words retrieval
models, the contextual language model can better leverage language structures,
bringing large improvements on queries written in natural languages. Combining
the text understanding ability with search knowledge leads to an enhanced
pre-trained BERT model that can benefit related search tasks where training
data are limited.Comment: In proceedings of SIGIR 201
- …