MedCPT: Contrastive Pre-trained Transformers with Large-scale PubMed Search Logs for Zero-shot Biomedical Information Retrieval
Information retrieval (IR) is essential in biomedical knowledge acquisition
and clinical decision support. While recent progress has shown that language
model encoders perform better semantic retrieval, training such models requires
abundant query-article annotations that are difficult to obtain in biomedicine.
As a result, most biomedical IR systems only conduct lexical matching. In
response, we introduce MedCPT, a first-of-its-kind Contrastively Pre-trained
Transformer model for zero-shot semantic IR in biomedicine. To train MedCPT, we
collected an unprecedented scale of 255 million user click logs from PubMed.
With such data, we use contrastive learning to train a closely integrated
retriever and re-ranker pair. Experimental results show that
MedCPT sets new state-of-the-art performance on six biomedical IR tasks,
outperforming various baselines including much larger models such as
GPT-3-sized cpt-text-XL. In addition, MedCPT also generates better biomedical
article and sentence representations for semantic evaluations. As such, MedCPT
can be readily applied to various real-world biomedical IR tasks.
Comment: The MedCPT code and API are available at
https://github.com/ncbi/MedCP
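The abstract above does not spell out its training objective; a common choice for contrastive retriever training on click logs is an InfoNCE-style loss, where the clicked article is the positive and other articles in the batch are negatives. The sketch below is a minimal pure-Python illustration of that idea (the temperature value and plain-list vectors are assumptions for readability, not the paper's actual setup):

```python
import math

def info_nce_scores(query_vec, doc_vecs, temperature=0.07):
    """Softmax over temperature-scaled dot-product similarities
    between one query embedding and a list of candidate doc embeddings."""
    sims = [sum(q * d for q, d in zip(query_vec, doc_vec)) / temperature
            for doc_vec in doc_vecs]
    m = max(sims)                          # subtract max for numerical stability
    exps = [math.exp(s - m) for s in sims]
    total = sum(exps)
    return [e / total for e in exps]

def contrastive_loss(query_vec, doc_vecs, positive_index):
    """Negative log-likelihood of the clicked (positive) document;
    minimizing this pulls the query toward its clicked article."""
    probs = info_nce_scores(query_vec, doc_vecs)
    return -math.log(probs[positive_index])
```

When the query embedding points toward the clicked document, the loss is near zero; pointing it at an unclicked document makes the loss large, which is the gradient signal a retriever is trained on.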
A Hierarchical Recurrent Encoder-Decoder For Generative Context-Aware Query Suggestion
Users may strive to formulate an adequate textual query for their information
need. Search engines assist the users by presenting query suggestions. To
preserve the original search intent, suggestions should be context-aware and
account for the previous queries issued by the user. Achieving context
awareness is challenging due to data sparsity. We present a probabilistic
suggestion model that is able to account for sequences of previous queries of
arbitrary lengths. Our novel hierarchical recurrent encoder-decoder
architecture allows the model to be sensitive to the order of queries in the
context while avoiding data sparsity. Additionally, our model can suggest for
rare, or long-tail, queries. The produced suggestions are synthetic and are
sampled one word at a time, using computationally cheap decoding techniques.
This is in contrast to current synthetic suggestion models relying upon machine
learning pipelines and hand-engineered feature sets. Results show that it
outperforms existing context-aware approaches in a next query prediction
setting. In addition to query suggestion, our model is general enough to be
used in a variety of other applications.
Comment: To appear in Conference of Information Knowledge and Management
(CIKM) 201
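The key property described above, suggestions sampled one word at a time conditioned on the whole session, can be sketched as a small greedy decoding loop. The `next_word_model` callable below is a hypothetical stand-in for the paper's hierarchical recurrent encoder-decoder; the toy table-based model only demonstrates the control flow:

```python
def suggest_query(session_queries, next_word_model, max_len=10, end_token="</q>"):
    """Decode a synthetic suggestion word by word, conditioned on the
    session context (all previous queries) and the words emitted so far."""
    suggestion = []
    while len(suggestion) < max_len:
        word = next_word_model(session_queries, suggestion)
        if word == end_token:
            break
        suggestion.append(word)
    return " ".join(suggestion)

def toy_model(session_queries, prefix):
    """Stand-in for the learned decoder: maps a decoded prefix to the
    next word. A real model would also attend to session_queries."""
    table = {(): "cheap",
             ("cheap",): "flights",
             ("cheap", "flights"): "</q>"}
    return table[tuple(prefix)]
```

Because decoding is a cheap per-word loop rather than a lookup in a precomputed candidate set, the same machinery can emit suggestions for rare, long-tail contexts it never saw verbatim.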
LADER: Log-Augmented DEnse Retrieval for Biomedical Literature Search
Queries with similar information needs tend to have similar document clicks,
especially in biomedical literature search engines where queries are generally
short and top documents account for most of the total clicks. Motivated by
this, we present a novel architecture for biomedical literature search, namely
Log-Augmented DEnse Retrieval (LADER), which is a simple plug-in module that
augments a dense retriever with the click logs retrieved from similar training
queries. Specifically, LADER finds both similar documents and queries to the
given query by a dense retriever. Then, LADER scores relevant (clicked)
documents of similar queries weighted by their similarity to the input query.
The final document scores by LADER are the average of (1) the document
similarity scores from the dense retriever and (2) the aggregated document
scores from the click logs of similar queries. Despite its simplicity, LADER
achieves new state-of-the-art (SOTA) performance on TripClick, a recently
released benchmark for biomedical literature retrieval. On the frequent (HEAD)
queries, LADER largely outperforms the best retrieval model by 39% relative
NDCG@10 (0.338 vs. 0.243). LADER also achieves better performance on the less
frequent (TORSO) queries, with an 11% relative NDCG@10 improvement over the
previous SOTA (0.303 vs. 0.272). On the rare (TAIL) queries, where similar
queries are scarce, LADER still compares favorably to the previous SOTA method
(NDCG@10: 0.310 vs. 0.295). On all queries, LADER can improve the performance
of a dense retriever by 24%-37% relative NDCG@10 while not requiring additional
training, and further performance improvement is expected from more logs. Our
regression analysis shows that queries that are more frequent and that have
higher query-similarity entropy and lower document-similarity entropy tend to
benefit more from log augmentation.
Comment: SIGIR 202
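The scoring rule described above, averaging the dense-retriever score with click evidence from similar training queries weighted by query similarity, can be sketched directly. The 0.5/0.5 averaging and the normalization by total similarity mass are assumptions made to keep the sketch self-contained; the paper's exact weighting may differ:

```python
def lader_score(doc_dense_score, similar_queries, doc_id):
    """Combine (1) the dense retriever's similarity score for a document
    with (2) aggregated click-log evidence from similar training queries.

    similar_queries: list of (query_similarity, clicked_doc_ids) pairs
    retrieved for the input query by the same dense retriever.
    """
    # Sum the similarities of those similar queries that clicked this doc.
    click_mass = sum(sim for sim, clicked in similar_queries if doc_id in clicked)
    total_mass = sum(sim for sim, _ in similar_queries)
    log_score = click_mass / total_mass if total_mass else 0.0
    # Final score: plain average of dense score and log-based score.
    return 0.5 * (doc_dense_score + log_score)
```

Note that nothing here is trained: the module only re-weights existing retriever outputs with logged clicks, which is why it plugs into a dense retriever without additional training.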
Context-aware Deep Model for Entity Recommendation in Search Engine at Alibaba
Entity recommendation, providing search users with an improved experience via
assisting them in finding related entities for a given query, has become an
indispensable feature of today's search engines. Existing studies typically
consider only queries with explicit entities, and usually fail to handle
complex queries without entities, such as "what food is good for cold
weather", because their models cannot infer the underlying meaning of the
input text. In this work, we believe that contexts convey valuable evidence
that could facilitate the semantic modeling of queries, and take them into
consideration for entity recommendation. In order to better model the semantics
of queries and entities, we learn the representation of queries and entities
jointly with attentive deep neural networks. We evaluate our approach using
large-scale, real-world search logs from a widely used commercial Chinese
search engine. Our system has been deployed in the ShenMa Search Engine and is
accessible in Alibaba's UC Browser. Results from an online A/B test show that
the impression efficiency (click-through rate) increased by 5.1% and page views
increased by 5.5%.
Comment: CIKM2019 International Workshop on Entity Retrieval. arXiv admin
note: text overlap with arXiv:1511.08996 by other authors
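The attentive modeling of query semantics described above typically amounts to weighting context token vectors by their relevance to the query before pooling them into one representation. The sketch below shows that generic attentive-pooling pattern in plain Python; it is an illustration of the general technique, not the paper's specific architecture:

```python
import math

def attention_pool(word_vecs, context_vec):
    """Attentive pooling: softmax-weight each word vector by its
    dot-product similarity to a context vector, then sum."""
    scores = [sum(w * c for w, c in zip(wv, context_vec)) for wv in word_vecs]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]          # attention weights, sum to 1
    dim = len(word_vecs[0])
    return [sum(weights[i] * word_vecs[i][d] for i in range(len(word_vecs)))
            for d in range(dim)]
```

Words that match the context dominate the pooled vector, which is how context can disambiguate a query like "what food is good for cold weather" even though it contains no explicit entity.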