A Medical Literature Search System for Identifying Effective Treatments in Precision Medicine
The Precision Medicine Initiative states that treatments for a patient should
take into account not only the patient's disease, but his/her specific genetic
variation as well. The vast biomedical literature holds the potential for
physicians to identify effective treatment options for a cancer patient.
However, the complexity and ambiguity of medical terms can result in vocabulary
mismatch between the physician's query and the literature. The physician's
search intent (finding treatments instead of other types of studies) is
difficult to explicitly formulate in a query. Therefore, a simple ad hoc
retrieval approach will suffer from low recall and precision. In this paper, we
propose a new retrieval system that helps physicians identify effective
treatments in precision medicine. Given a cancer patient with a specific
disease, genetic variation, and demographic information, the system aims to
identify biomedical publications that report effective treatments. We approach
this goal from two directions. First, we expand the original disease and gene
terms using biomedical knowledge bases to improve recall of the initial
retrieval. We then improve precision by promoting treatment-related
publications to the top using a machine learning reranker trained on 2017 Text
Retrieval Conference Precision Medicine (PM) track corpus. Batch evaluation
results on 2018 PM track corpus show that the proposed approach effectively
improves both recall and precision, achieving performance comparable to the top
entries on the leaderboard of the 2018 PM track.
Comment: 32 pages
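The knowledge-base expansion step this abstract describes can be sketched as a simple synonym lookup. The synonym table below is an invented stand-in for the biomedical knowledge bases the paper uses; the entries are illustrative, not from the paper.

```python
# Hedged sketch of query expansion: augment disease/gene terms with synonyms
# from a small hand-made dictionary standing in for a biomedical knowledge base.
# The dictionary contents are illustrative assumptions.
SYNONYMS = {
    "melanoma": ["malignant melanoma", "skin cancer"],
    "BRAF": ["B-Raf", "BRAF V600E"],
}

def expand_query(terms):
    """Return the original terms plus any known synonyms, deduplicated in order."""
    expanded = []
    for term in terms:
        if term not in expanded:
            expanded.append(term)
        for syn in SYNONYMS.get(term, []):
            if syn not in expanded:
                expanded.append(syn)
    return expanded

print(expand_query(["melanoma", "BRAF"]))
```

Expanding the query this way trades a broader candidate pool (better recall) for noisier matches, which is why the abstract pairs it with a precision-oriented reranker.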
A Fast Deep Learning Model for Textual Relevance in Biomedical Information Retrieval
Publications in the life sciences are characterized by a large technical
vocabulary, with many lexical and semantic variations for expressing the same
concept. Towards addressing the problem of relevance in biomedical literature
search, we introduce a deep learning model for the relevance of a document's
text to a keyword style query. Limited by a relatively small amount of training
data, the model uses pre-trained word embeddings. With these, the model first
computes a variable-length Delta matrix between the query and document,
representing a difference between the two texts, which is then passed through a
deep convolution stage followed by a deep feed-forward network to compute a
relevance score. This results in a fast model suitable for use in an online
search engine. The model is robust and outperforms comparable state-of-the-art
deep learning approaches.
Comment: To appear in proceedings of WWW 201
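One plausible reading of the "Delta matrix" idea can be sketched with plain NumPy: for each document token, take the element-wise difference to its most similar query token's embedding, yielding a matrix whose length varies with the document. The exact construction in the paper may differ; this is only an illustrative sketch.

```python
import numpy as np

# Illustrative sketch of a query-document "Delta" matrix from word embeddings.
# Each row is the difference between a document token's vector and the vector
# of its most similar query token (by cosine similarity).
def delta_matrix(query_vecs, doc_vecs):
    q = query_vecs / np.linalg.norm(query_vecs, axis=1, keepdims=True)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    sims = d @ q.T                      # (doc_len, query_len) cosine similarities
    best = sims.argmax(axis=1)          # best-matching query token per doc token
    return doc_vecs - query_vecs[best]  # element-wise differences

rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 8))   # 3 query tokens, 8-dim embeddings
D = rng.normal(size=(5, 8))   # 5 document tokens
delta = delta_matrix(Q, D)
print(delta.shape)            # (5, 8): rows scale with document length
```

A matrix like this can then be fed to the convolutional and feed-forward stages the abstract describes to produce a scalar relevance score.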
Complementing Lexical Retrieval with Semantic Residual Embedding
This paper presents CLEAR, a retrieval model that seeks to complement
classical lexical exact-match models such as BM25 with semantic matching
signals from a neural embedding matching model. CLEAR explicitly trains the
neural embedding to encode language structures and semantics that lexical
retrieval fails to capture with a novel residual-based embedding learning
method. Empirical evaluations demonstrate the advantages of CLEAR over
state-of-the-art retrieval models, and that it can substantially improve the
end-to-end accuracy and efficiency of reranking pipelines.
Comment: ECIR 202
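The core hybrid-scoring idea, combining a lexical score with a semantic embedding match, can be sketched in a few lines. The interpolation weight and toy scores below are assumptions for illustration, not values from the paper.

```python
# Minimal sketch of lexical + semantic hybrid scoring: add an embedding
# dot product to a BM25-style lexical score. `lam` is an assumed weight.
def hybrid_score(lexical_score, query_emb, doc_emb, lam=0.5):
    semantic = sum(q * d for q, d in zip(query_emb, doc_emb))  # dot product
    return lexical_score + lam * semantic

# A document with weaker lexical overlap but a strong semantic match
# can outrank a lexically stronger one.
s_a = hybrid_score(2.0, [1.0, 0.0], [0.9, 0.1])   # strong semantic match
s_b = hybrid_score(2.3, [1.0, 0.0], [0.0, 1.0])   # no semantic match
print(s_a > s_b)
```

CLEAR's distinctive contribution is in how the embedding is trained (on the lexical model's residual errors) rather than in the combination formula itself.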
Twitter100k: A Real-world Dataset for Weakly Supervised Cross-Media Retrieval
This paper contributes a new large-scale dataset for weakly supervised
cross-media retrieval, named Twitter100k. Current datasets, such as Wikipedia,
NUS Wide and Flickr30k, have two major limitations. First, these datasets are
lacking in content diversity, i.e., only some pre-defined classes are covered.
Second, texts in these datasets are written in well-organized language, leading
to inconsistency with realistic applications. To overcome these drawbacks, the
proposed Twitter100k dataset is characterized by two aspects: 1) it has 100,000
image-text pairs randomly crawled from Twitter and thus has no constraint in
the image categories; 2) text in Twitter100k is written in informal language by
the users.
Since strongly supervised methods leverage the class labels that may be
missing in practice, this paper focuses on weakly supervised learning for
cross-media retrieval, in which only text-image pairs are exploited during
training. We extensively benchmark the performance of four subspace learning
methods and three variants of the Correspondence AutoEncoder, along with
various text features on Wikipedia, Flickr30k and Twitter100k. Novel insights
are provided. As a minor contribution, inspired by the characteristic of
Twitter100k, we propose an OCR-based cross-media retrieval method. In
experiments, we show that the proposed OCR-based method improves over the
baseline performance.
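The OCR-based idea can be sketched as treating text extracted from an image as an extra textual field and scoring text-image pairs by token overlap. The OCR output below is hand-written for illustration; a real pipeline would use an OCR engine.

```python
# Sketch: score a tweet against the text OCR'd from its paired image using
# Jaccard similarity over lowercased tokens. The "OCR output" is invented.
def token_overlap(text_a, text_b):
    a, b = set(text_a.lower().split()), set(text_b.lower().split())
    return len(a & b) / max(len(a | b), 1)  # Jaccard similarity

tweet = "happy birthday to my best friend"
ocr_of_image = "Happy Birthday banner"
print(round(token_overlap(tweet, ocr_of_image), 3))
```

This exploits the observation that tweets often quote text that also appears inside the attached image, a signal that purely visual features miss.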
Assessing Efficiency-Effectiveness Tradeoffs in Multi-Stage Retrieval Systems Without Using Relevance Judgments
Large-scale retrieval systems are often implemented as a cascading sequence
of phases -- a first filtering step, in which a large set of candidate
documents are extracted using a simple technique such as Boolean matching
and/or static document scores; and then one or more ranking steps, in which the
pool of documents retrieved by the filter is scored more precisely using dozens
or perhaps hundreds of different features. The documents returned to the user
are then taken from the head of the final ranked list. Here we examine methods
for measuring the quality of filtering and preliminary ranking stages, and show
how to use these measurements to tune the overall performance of the system.
Standard top-weighted metrics used for overall system evaluation are not
appropriate for assessing filtering stages, since the output is a set of
documents, rather than an ordered sequence of documents. Instead, we use an
approach in which a quality score is computed based on the discrepancy between
filtered and full evaluation. Unlike previous approaches, our methods do not
require relevance judgments, and thus can be used with virtually any query set.
We show that this quality score directly correlates with actual differences in
measured effectiveness when relevance judgments are available. Since the
quality score does not require relevance judgments, it can be used to identify
queries that perform particularly poorly for a given filter. Using these
methods, we explore a wide range of filtering options using thousands of
queries, categorize the relative merits of the different approaches, and
identify useful parameter combinations.
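A judgment-free quality score of this kind can be sketched by comparing the top-k results the ranker produces over the full collection with the top-k it produces over the filtered pool; high overlap means the filter lost little. Set intersection at depth k is a simplification of the discrepancy measures the paper studies.

```python
# Sketch of a relevance-judgment-free filter quality score: overlap between
# the ranker's top-k over the full collection and over the filtered pool.
def filter_quality(full_topk, filtered_topk):
    k = len(full_topk)
    return len(set(full_topk) & set(filtered_topk)) / k

full = ["d1", "d2", "d3", "d4", "d5"]       # top-5 without filtering
filtered = ["d1", "d3", "d2", "d9", "d5"]   # top-5 after the filter stage
print(filter_quality(full, filtered))       # 0.8
```

Per-query scores like this can flag queries for which a given filter performs particularly poorly, exactly the use case the abstract describes.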
Effective Image Retrieval via Multilinear Multi-index Fusion
Multi-index fusion has demonstrated impressive performances in retrieval task
by integrating different visual representations in a unified framework.
However, previous works mainly consider propagating similarities via neighbor
structure, ignoring the high order information among different visual
representations. In this paper, we propose a new multi-index fusion scheme for
image retrieval. By formulating this procedure as a multilinear based
optimization problem, the complementary information hidden in different indexes
can be explored more thoroughly. Specifically, we first build our multiple indexes
from various visual representations. Then a so-called index-specific functional
matrix, which aims to propagate similarities, is introduced for updating the
original index. The functional matrices are then optimized in a unified tensor
space to achieve a refinement, such that relevant images are pushed closer
together. The optimization problem can be efficiently solved by the augmented
Lagrangian method with theoretical convergence guarantee. Unlike the
traditional multi-index fusion scheme, our approach embeds the multi-index
subspace structure into the new indexes with a sparsity constraint, so it incurs
little additional memory consumption at the online query stage. Experimental
evaluation on three benchmark datasets reveals that the proposed approach
achieves state-of-the-art performance, i.e., an N-S score of 3.94 on UKBench,
and mAP of 94.1% on Holiday and 62.39% on Market-1501.
Comment: 12 pages
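The paper's tensor-space optimization is involved; the sketch below shows only the baseline notion of multi-index fusion it builds on: combining per-index similarity scores for the same candidate images. All scores and weights here are illustrative.

```python
# Sketch of basic multi-index (late) fusion: weighted sum of similarity
# scores from indexes built on different visual representations, then rank.
def fuse_scores(score_dicts, weights):
    fused = {}
    for scores, w in zip(score_dicts, weights):
        for doc, s in scores.items():
            fused[doc] = fused.get(doc, 0.0) + w * s
    return sorted(fused, key=fused.get, reverse=True)

sift_index = {"img1": 0.9, "img2": 0.4}              # local-feature index
cnn_index = {"img1": 0.2, "img2": 0.8, "img3": 0.5}  # CNN-feature index
print(fuse_scores([sift_index, cnn_index], [0.5, 0.5]))
```

The paper's contribution is to replace this fixed weighted sum with index-specific functional matrices optimized jointly in a tensor space, so the fusion also captures higher-order interactions between representations.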
Webly Supervised Joint Embedding for Cross-Modal Image-Text Retrieval
Cross-modal retrieval between visual data and natural language description
remains a long-standing challenge in multimedia. While recent image-text
retrieval methods offer great promise by learning deep representations aligned
across modalities, most of these methods are plagued by the issue of training
with small-scale datasets covering a limited number of images with ground-truth
sentences. Moreover, it is extremely expensive to create a larger dataset by
annotating millions of images with sentences and may lead to a biased model.
Inspired by the recent success of webly supervised learning in deep neural
networks, we capitalize on readily-available web images with noisy annotations
to learn robust image-text joint representation. Specifically, our main idea is
to leverage web images and corresponding tags, along with fully annotated
datasets, in training for learning the visual-semantic joint embedding. We
propose a two-stage approach for the task that can augment a typical supervised
pair-wise ranking loss based formulation with weakly-annotated web images to
learn a more robust visual-semantic embedding. Experiments on two standard
benchmark datasets demonstrate that our method achieves a significant
performance gain in image-text retrieval compared to state-of-the-art
approaches.
Comment: ACM Multimedia 201
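The pair-wise ranking loss the abstract refers to can be sketched as a standard margin (hinge/triplet) loss: a matching image-text pair should score higher than a mismatched pair by at least a margin. The margin and toy scores below are assumptions.

```python
# Sketch of a pair-wise ranking (hinge) loss over similarity scores:
# penalize cases where a mismatched pair comes within `margin` of a match.
def ranking_loss(pos_score, neg_score, margin=0.2):
    return max(0.0, margin - pos_score + neg_score)

print(ranking_loss(0.9, 0.3))             # 0.0  (already separated by > margin)
print(round(ranking_loss(0.5, 0.45), 2))  # 0.15 (gap smaller than margin)
```

In the webly supervised setting, the same loss is applied to noisy web image-tag pairs alongside fully annotated pairs, which is what makes the learned embedding more robust.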
Beyond Precision: A Study on Recall of Initial Retrieval with Neural Representations
Vocabulary mismatch is a central problem in information retrieval (IR), i.e.,
the relevant documents may not contain the same (symbolic) terms as the query.
Recently, neural representations have shown great success in capturing semantic
relatedness, leading to new possibilities to alleviate the vocabulary mismatch
problem in IR. However, most existing efforts in this direction have been
devoted to the re-ranking stage: leveraging neural representations to help
re-rank a set of candidate documents, which are typically obtained from an
initial retrieval stage based on some symbolic index and search scheme (e.g.,
BM25 over the inverted index). This naturally raises a question: if the
relevant documents have not been found in the initial retrieval stage due to
vocabulary mismatch, there would be no chance to re-rank them to the top
positions later. Therefore, in this paper, we study the problem of how to employ
neural representations to improve the recall of relevant documents in the
initial retrieval stage. Specifically, to meet the efficiency requirement of
the initial stage, we introduce a neural index for the neural representations
of documents, and propose two hybrid search schemes based on both neural and
symbolic indices, namely the parallel search scheme and the sequential search
scheme. Our experiments show that both hybrid index and search schemes can
improve the recall of the initial retrieval stage with small overhead.
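The "parallel search scheme" can be sketched as running a symbolic (term-based) retriever and a neural (embedding-based) retriever independently and unioning their candidate sets, so documents missed by term matching can still enter the pool. The retrievers below are stubs returning fixed document ids.

```python
# Sketch of the parallel hybrid search scheme: merge candidates from a
# symbolic index and a neural index, keeping first-seen order, no duplicates.
def symbolic_search(query, k=3):
    return ["d1", "d2", "d3"]   # stub: e.g. BM25 over an inverted index

def neural_search(query, k=3):
    return ["d3", "d7", "d8"]   # stub: e.g. nearest neighbours in embedding space

def parallel_search(query, k=3):
    seen, merged = set(), []
    for doc in symbolic_search(query, k) + neural_search(query, k):
        if doc not in seen:
            seen.add(doc)
            merged.append(doc)
    return merged

print(parallel_search("vocabulary mismatch"))
```

The sequential scheme the abstract also mentions would instead run the neural retriever only on queries (or pool gaps) where the symbolic stage appears to have failed, trading recall for lower cost.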
An Information Retrieval Approach to Short Text Conversation
Human computer conversation is regarded as one of the most difficult problems
in artificial intelligence. In this paper, we address one of its key
sub-problems, referred to as short text conversation, in which given a message
from human, the computer returns a reasonable response to the message. We
leverage the vast amount of short conversation data available on social media
to study the issue. We propose formalizing short text conversation as a search
problem at the first step, and employing state-of-the-art information retrieval
(IR) techniques to carry out the task. We investigate the significance as well
as the limitation of the IR approach. Our experiments demonstrate that the
retrieval-based model can make the system behave rather "intelligently", when
combined with a huge repository of conversation data from social media.
Comment: 21 pages, 4 figures
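The retrieval-based formulation can be sketched directly: given a new message, find the most similar past message in a (message, response) repository and return its stored response. Similarity here is plain token overlap, and the repository is a toy stand-in for the social-media data the abstract describes.

```python
# Sketch of retrieval-based short text conversation: respond with the
# stored reply of the most token-similar past message. Repository invented.
REPO = [
    ("how is the weather today", "sunny and warm here"),
    ("any plans for the weekend", "going hiking, want to join"),
]

def respond(message):
    def overlap(a, b):
        return len(set(a.split()) & set(b.split()))
    best_msg, best_resp = max(REPO, key=lambda pair: overlap(message, pair[0]))
    return best_resp

print(respond("what are your plans for the weekend"))
```

A production system would replace token overlap with the learned matching features the paper evaluates, but the retrieve-then-reply structure is the same.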
Learning to Rank Using Localized Geometric Mean Metrics
Many learning-to-rank (LtR) algorithms focus on the query-independent model, in
which the query and documents do not lie in the same feature space, and the
rankers rely on an ensemble of query-document pair features instead of the
similarity between the query instance and the documents. However, existing algorithms
do not consider local structures in query-document feature space, and are
fragile to irrelevant noise features. In this paper, we propose a novel
Riemannian metric learning algorithm to capture the local structures and
develop a robust LtR algorithm. First, we design a concept called the "ideal
candidate document" to introduce metric learning into the query-independent
model. Previous metric learning algorithms that aim to find an optimal metric
space are only suitable for the query-dependent model, in which the query
instance and documents belong to the same feature space and the similarity is
computed directly in the metric space. Then we extend the new and extremely fast
global Geometric Mean Metric Learning (GMML) algorithm to develop a localized
GMML, namely L-GMML. Based on the combination of locally learned metrics, we
employ the popular Normalized Discounted Cumulative Gain (NDCG) scorer and the
Weighted Approximate Rank Pairwise (WARP) loss to optimize the "ideal candidate
document" for each query's candidate set. Finally, we can quickly evaluate all
candidates via their similarity to the "ideal candidate document". By
leveraging the ability of metric learning
algorithms to describe the complex structural information, our approach gives
us a principled and efficient way to perform LtR tasks. The experiments on
real-world datasets demonstrate that our proposed L-GMML algorithm outperforms
the state-of-the-art metric-learning-to-rank methods and popular
query-independent LtR algorithms in both accuracy and computational
efficiency.
Comment: To appear in SIGIR'1
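The NDCG scorer the abstract employs is a standard metric, not paper-specific; a textbook implementation over graded relevance labels in ranked order looks like this.

```python
import math

# Standard NDCG@k: discounted cumulative gain of the ranked relevance
# labels, normalized by the gain of the ideal (descending) ordering.
def dcg(rels):
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(rels))

def ndcg(ranked_rels, k=None):
    rels = ranked_rels[:k] if k else ranked_rels
    ideal = sorted(ranked_rels, reverse=True)[:k] if k else sorted(ranked_rels, reverse=True)
    return dcg(rels) / dcg(ideal) if dcg(ideal) > 0 else 0.0

print(round(ndcg([3, 2, 3, 0, 1], k=3), 4))
```

In L-GMML this scorer (together with the WARP loss) drives the optimization of the "ideal candidate document" for each query's candidate set.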