The Curse of Dense Low-Dimensional Information Retrieval for Large Index Sizes
Information retrieval using dense low-dimensional representations recently
became popular and has been shown to outperform traditional sparse
representations like BM25. However, no previous work has investigated how
dense representations perform with large index sizes. We show theoretically
and empirically that the performance of dense representations decreases more
quickly than that of sparse representations as the index size increases. In
extreme cases, this can even lead to a tipping point where, at a certain index
size, sparse representations outperform dense representations. We show that
this behavior is tightly connected to the number of dimensions of the
representations: the lower the dimension, the higher the chance of false
positives, i.e., returning irrelevant documents.
Comment: Published at ACL 202
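The link between dimensionality and false positives can be illustrated with a toy simulation (a sketch, not the paper's analysis): random unit vectors stand in for irrelevant document embeddings, a noisy copy of the query stands in for the relevant document, and we measure how often some distractor outscores the relevant document as the index grows. All parameters here are illustrative assumptions.

```python
import numpy as np

def false_positive_rate(dim, index_size, n_queries=200, noise=1.5, seed=0):
    """Estimate how often a random 'distractor' document outscores the
    relevant document under dot-product retrieval, as a function of
    embedding dimension and index size (toy model, random embeddings)."""
    rng = np.random.default_rng(seed)
    # Unit-norm query vectors.
    q = rng.normal(size=(n_queries, dim))
    q /= np.linalg.norm(q, axis=1, keepdims=True)
    # The relevant document: a noisy copy of its query.
    rel = q + noise * rng.normal(size=(n_queries, dim)) / np.sqrt(dim)
    rel /= np.linalg.norm(rel, axis=1, keepdims=True)
    rel_score = np.sum(q * rel, axis=1)
    # Random irrelevant documents sharing the same space.
    docs = rng.normal(size=(index_size, dim))
    docs /= np.linalg.norm(docs, axis=1, keepdims=True)
    best_distractor = (q @ docs.T).max(axis=1)
    return float(np.mean(best_distractor > rel_score))

for dim in (32, 256):
    # Rate grows with index size, and much faster at low dimension.
    print(dim, [false_positive_rate(dim, n) for n in (100, 10_000)])
```

In low dimensions, random vectors are more likely to be near-collinear with the query by chance, so the maximum distractor score climbs quickly with index size, which is the tipping-point mechanism the abstract describes.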
Multilingual Universal Sentence Encoder for Semantic Retrieval
We introduce two pre-trained, retrieval-focused multilingual sentence encoding
models, respectively based on the Transformer and CNN model architectures. The
models embed text from 16 languages into a single semantic space using a
multi-task trained dual-encoder that learns tied representations using
translation-based bridge tasks (Chidambaram et al., 2018). The models provide
performance that is competitive with the state-of-the-art on: semantic
retrieval (SR), translation pair bitext retrieval (BR) and retrieval question
answering (ReQA). On English transfer learning tasks, our sentence-level
embeddings approach, and in some cases exceed, the performance of monolingual,
English only, sentence embedding models. Our models are made available for
download on TensorFlow Hub.
Comment: 6 pages, 6 tables, 2 listings, and 1 figure
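The actual pretrained models are distributed on TensorFlow Hub; purely to illustrate the dual-encoder retrieval interface (encode both sides into one semantic space, score with a dot product), here is a minimal sketch in which a hypothetical character-trigram hashing "encoder" stands in for the pretrained networks.

```python
import hashlib
import numpy as np

def encode(text, dim=128):
    """Toy stand-in for a sentence encoder: hash character trigrams into a
    fixed-size vector and L2-normalize. The real models embed 16 languages
    into one space; this only illustrates the encode -> score interface."""
    vec = np.zeros(dim)
    padded = f"  {text.lower()}  "
    for i in range(len(padded) - 2):
        h = int(hashlib.md5(padded[i:i + 3].encode()).hexdigest(), 16)
        vec[h % dim] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm else vec

def retrieve(query, corpus, k=2):
    """Dual-encoder retrieval: score every document against the query with a
    dot product of unit vectors (cosine similarity), return the top k."""
    doc_vecs = np.stack([encode(d) for d in corpus])
    scores = doc_vecs @ encode(query)
    top = np.argsort(-scores)[:k]
    return [(corpus[i], float(scores[i])) for i in top]

corpus = [
    "The cat sat on the mat.",
    "Stock prices fell sharply today.",
    "A kitten is sleeping on a rug.",
]
print(retrieve("Where is the cat sitting?", corpus))
```

A real dual encoder would replace `encode` with the learned networks, but the retrieval side (pre-encode the corpus, rank by dot product) is the same shape.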
Distilling Knowledge from Reader to Retriever for Question Answering
The task of information retrieval is an important component of many natural
language processing systems, such as open domain question answering. While
traditional methods were based on hand-crafted features, continuous
representations based on neural networks have recently obtained competitive results.
A challenge of using such methods is to obtain supervised data to train the
retriever model, corresponding to pairs of query and support documents. In this
paper, we propose a technique to learn retriever models for downstream tasks,
inspired by knowledge distillation, and which does not require annotated pairs
of query and documents. Our approach leverages attention scores of a reader
model, used to solve the task based on retrieved documents, to obtain synthetic
labels for the retriever. We evaluate our method on question answering,
obtaining state-of-the-art results.
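A minimal sketch of the distillation idea, under loudly stated assumptions: the made-up numbers stand in for a reader's aggregated cross-attention mass over retrieved passages, and a single trainable query vector stands in for the retriever. The attention distribution becomes the synthetic relevance label, and the retriever is fitted to it by gradient descent on a KL objective.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def distill_retriever(reader_attention, passages, steps=2000, lr=0.05, seed=0):
    """Turn a reader's attention mass over retrieved passages into synthetic
    relevance labels, then fit a toy linear retriever (one query vector,
    scored by dot product) to match them under KL(target || retriever)."""
    target = softmax(np.asarray(reader_attention, dtype=float))
    rng = np.random.default_rng(seed)
    q = rng.normal(size=passages.shape[1])

    def kl(qv):
        pred = softmax(passages @ qv)
        return float(np.sum(target * (np.log(target) - np.log(pred))))

    initial = kl(q)
    for _ in range(steps):
        pred = softmax(passages @ q)
        q -= lr * passages.T @ (pred - target)  # analytic KL gradient
    return initial, kl(q)

rng = np.random.default_rng(1)
passages = rng.normal(size=(4, 8))   # toy passage embeddings (assumed)
attention = [2.0, 0.3, 1.2, 0.2]     # hypothetical reader attention mass
before, after = distill_retriever(attention, passages)
print(before, after)  # KL drops as the retriever absorbs the reader's signal
```

No query-document relevance pairs appear anywhere above: the supervision comes entirely from the reader's attention, which is the point of the approach.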