Query Expansion with Locally-Trained Word Embeddings
Continuous space word embeddings have received a great deal of attention in
the natural language processing and machine learning communities for their
ability to model term similarity and other relationships. We study the use of
term relatedness in the context of query expansion for ad hoc information
retrieval. We demonstrate that word embeddings such as word2vec and GloVe, when
trained globally, underperform corpus and query specific embeddings for
retrieval tasks. These results suggest that other tasks benefiting from global
embeddings may also benefit from local embeddings.
Neural Methods for Effective, Efficient, and Exposure-Aware Information Retrieval
Neural networks with deep architectures have demonstrated significant
performance improvements in computer vision, speech recognition, and natural
language processing. The challenges in information retrieval (IR), however, are
different from these other application areas. A common form of IR involves
ranking of documents--or short passages--in response to keyword-based queries.
Effective IR systems must deal with the query-document vocabulary mismatch problem
by modeling relationships between different query and document terms and how
they indicate relevance. Models should also consider lexical matches when the
query contains rare terms--such as a person's name or a product model
number--not seen during training, and should avoid retrieving semantically related
but irrelevant results. In many real-life IR tasks, retrieval involves
extremely large collections--such as the document index of a commercial Web
search engine--containing billions of documents. Efficient IR methods should
take advantage of specialized IR data structures, such as the inverted index, to
efficiently retrieve from large collections. Given an information need, the IR
system also mediates how much exposure an information artifact receives by
deciding whether it should be displayed, and where it should be positioned,
among other results. Exposure-aware IR systems may optimize for additional
objectives, besides relevance, such as parity of exposure for retrieved items
and content publishers. In this thesis, we present novel neural architectures
and methods motivated by the specific needs and challenges of IR tasks.
Comment: PhD thesis, Univ College London (2020)
Recall, Robustness, and Lexicographic Evaluation
Researchers use recall to evaluate rankings across a variety of retrieval,
recommendation, and machine learning tasks. While there is a colloquial
interpretation of recall in set-based evaluation, the research community is far
from a principled understanding of recall metrics for rankings. The lack of
principled understanding of or motivation for recall has led some in the
retrieval community to question whether recall is useful as a measure at all. In
this light, we reflect on the measurement of recall in rankings from a formal
perspective. Our analysis is composed of three tenets: recall, robustness, and
lexicographic evaluation. First, we formally define 'recall-orientation' as
sensitivity to movement of the bottom-ranked relevant item. Second, we analyze
our concept of recall-orientation from the perspective of robustness with
respect to possible searchers and content providers. Finally, we extend this
conceptual and theoretical treatment of recall by developing a practical
preference-based evaluation method based on lexicographic comparison. Through
extensive empirical analysis across 17 TREC tracks, we establish that our new
evaluation method, lexirecall, is correlated with existing recall metrics and
exhibits substantially higher discriminative power and stability in the
presence of missing labels. Our conceptual, theoretical, and empirical analysis
substantially deepens our understanding of recall and motivates its adoption
through connections to robustness and fairness.
Comment: Under review