Is Cross-modal Information Retrieval Possible without Training?
Encoded representations from a pretrained deep learning model (e.g., BERT text embeddings, penultimate CNN layer activations of an image) convey a rich set of features beneficial for information retrieval. Embeddings for a particular modality of data occupy a high-dimensional space of their own, but they can be semantically aligned to another modality by a simple mapping, without training a deep neural net. In this paper, we take simple mappings, computed by least squares and by the singular value decomposition (SVD) solution to the Procrustes problem, to serve as a means of cross-modal information retrieval. That is, given information in one modality such as text, the mapping helps us locate a semantically equivalent data item in another modality such as image. Using off-the-shelf pretrained deep learning models, we have experimented with the aforementioned simple cross-modal mappings on text-to-image and image-to-text retrieval tasks. Despite their simplicity, our mappings perform reasonably well, reaching a highest accuracy of 77% recall@10, which is comparable to approaches requiring costly neural net training and fine-tuning. We have improved the simple mappings by applying contrastive learning to the pretrained models. Contrastive learning can be thought of as properly biasing the pretrained encoders to enhance the cross-modal mapping quality. We have further improved the performance with a multilayer perceptron with gating (gMLP), a simple neural architecture.
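As an illustration of the mappings described above, here is a minimal NumPy sketch, assuming paired embedding matrices X (source modality) and Y (target modality) whose rows are aligned by matching items and share the same dimensionality for the Procrustes case; the function names are ours, not the paper's.

```python
import numpy as np

def procrustes_map(X, Y):
    """Orthogonal Procrustes: find W minimizing ||XW - Y||_F s.t. W^T W = I.
    X, Y: (n, d) paired embeddings from the two modalities (same d assumed)."""
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return U @ Vt  # optimal orthogonal map via SVD

def least_squares_map(X, Y):
    """Unconstrained linear map fitted by ordinary least squares."""
    W, *_ = np.linalg.lstsq(X, Y, rcond=None)
    return W

def retrieve(query_vec, W, gallery):
    """Map a query embedding into the target space and rank gallery items
    by cosine similarity; returns gallery indices, best match first."""
    q = query_vec @ W
    sims = gallery @ q / (np.linalg.norm(gallery, axis=1) * np.linalg.norm(q) + 1e-9)
    return np.argsort(-sims)
```

Because the Procrustes map is constrained to be orthogonal, it preserves distances within the source space, whereas the least-squares map trades that rigidity for a closer fit to the paired targets.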
A Multi-Resolution Word Embedding for Document Retrieval from Large Unstructured Knowledge Bases
Deep language models that learn a hierarchical representation have proved to be a powerful tool for natural language processing, text mining, and information retrieval. However, representations that perform well for retrieval must capture semantic meaning at different levels of abstraction or context-scopes. In this paper, we propose a new method to generate multi-resolution word embeddings that represent documents at multiple resolutions in terms of context-scopes. To investigate its performance, we use the Stanford Question Answering Dataset (SQuAD) and the Question Answering by Search And Reading (QUASAR) dataset in an open-domain question-answering setting, where the first task is to find documents useful for answering a given question. To this end, we first compare the quality of various text-embedding methods for retrieval performance and give an extensive empirical comparison of various non-augmented base embeddings with and without the multi-resolution representation. We argue that multi-resolution word embeddings are consistently superior to their original counterparts, and that deep residual neural models specifically trained for retrieval purposes can yield further significant gains when used to augment those embeddings.
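The abstract does not spell out how the multiple context-scopes are combined, so the following is only a hypothetical NumPy sketch of the general idea: pool base word embeddings over windows of several widths and concatenate the summaries, so that a single document vector carries features at more than one resolution.

```python
import numpy as np

def multi_resolution_embedding(token_vecs, window_sizes=(1, 4, 16)):
    """Illustrative only: pool token embeddings over several context-window
    widths and concatenate one summary vector per resolution.
    token_vecs: (n_tokens, d) array of base word embeddings."""
    d = token_vecs.shape[1]
    parts = []
    for w in window_sizes:
        n = len(token_vecs) // w * w  # trim to a multiple of the window width
        if n == 0:
            # Document shorter than the window: fall back to a global mean.
            parts.append(token_vecs.mean(axis=0))
            continue
        # Mean-pool over non-overlapping windows, then average the window
        # vectors into one d-dim summary for this context-scope.
        windows = token_vecs[:n].reshape(-1, w, d).mean(axis=1)
        parts.append(windows.mean(axis=0))
    return np.concatenate(parts)  # (len(window_sizes) * d,)
```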
Neural Methods for Answer Passage Retrieval over Sparse Collections
Recent advances in machine learning have allowed information retrieval (IR) techniques to advance beyond the stage of handcrafting domain-specific features. Specifically, deep neural models incorporate varying levels of features to learn whether a document answers the information need of a query. However, these neural models rely on a large number of parameters to successfully learn a relation between a query and a relevant document. This reliance on many parameters, combined with optimization methods that make small updates, necessitates numerous samples for the neural model to converge on an effective relevance function. This presents a significant obstacle in IR, as relevance judgements are often sparse or noisy and come with a large class imbalance. This is especially true for short-text retrieval, where there is often only one relevant passage. The problem is exacerbated when training these artificial neural networks, as excessive negative sampling can result in poor performance. Thus, we propose approaching this task through multiple avenues and examining their effectiveness on a non-factoid question answering (QA) task.

We first propose learning local embeddings specific to the relevance information of the collection to improve the performance of an upstream neural model. In doing so, we find significantly improved results over standard pre-trained embeddings, despite developing the embeddings on a small collection that would not be sufficient for a full language model. Leveraging this local representation, and inspired by recent work in machine translation, we introduce a hybrid embedding-based model that incorporates pre-trained embeddings while dynamically constructing local representations from character embeddings. The hybrid approach relies on pre-trained embeddings to achieve an effective retrieval model, and continually adjusts its character-level abstraction to fit a local representation.

We next develop methods to adapt neural models to multiple IR collections, reducing the collection-specific training required and alleviating the need to retrain a neural model's parameters for a new subdomain of a collection. First, we propose an adversarial retrieval model that achieves state-of-the-art performance on out-of-subdomain queries while maintaining in-domain performance. Second, we establish an informed negative sampling approach using a reinforcement learning agent. The agent is trained to directly maximize the performance of a neural IR model on a predefined IR metric by choosing the ranking function from which to sample negative documents. This policy-based sampling exposes the neural model to more of a collection and results in a more consistent neural retrieval model over multiple training instances.

Lastly, we move towards a universal retrieval function. We initially introduce a probe-based inspection of neural relevance models through the lens of standard natural language processing tasks and establish that, while seemingly similar QA collections require the same basic abstract information, the final layers that determine relevance differ significantly. We then introduce Universal Retrieval Functions, a method to incorporate new collections using a library of previously trained linear relevance models and a common neural representation.
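As one concrete (and purely illustrative) reading of the informed negative sampling idea, the sketch below frames the agent as an epsilon-greedy bandit over candidate ranking functions, rewarded by the change in a retrieval metric of the downstream neural model. The ranker names, reward signal, and update rule here are our assumptions, not the dissertation's exact formulation.

```python
import numpy as np

class NegativeSamplerBandit:
    """Epsilon-greedy bandit choosing which ranking function to draw
    negative documents from when training a neural IR model."""

    def __init__(self, rankers, epsilon=0.1):
        self.rankers = rankers            # e.g., ["bm25", "tfidf", "random"]
        self.epsilon = epsilon
        self.counts = np.zeros(len(rankers))
        self.values = np.zeros(len(rankers))  # running mean reward per arm

    def choose(self):
        # Explore a random ranker with probability epsilon, else exploit.
        if np.random.rand() < self.epsilon:
            return np.random.randint(len(self.rankers))
        return int(np.argmax(self.values))

    def update(self, arm, reward):
        # reward: e.g., the observed change in MAP after a training step
        # using negatives sampled from self.rankers[arm].
        self.counts[arm] += 1
        self.values[arm] += (reward - self.values[arm]) / self.counts[arm]
```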
Learning semantic sentence representations from visually grounded language without lexical knowledge
Current approaches to learning semantic representations of sentences often use prior word-level knowledge. The current study aims to leverage visual information in order to capture sentence-level semantics without the need for word embeddings. We use a multimodal sentence encoder trained on a corpus of images with matching text captions to produce visually grounded sentence embeddings. Deep neural networks are trained to map the two modalities to a common embedding space such that, for an image, the corresponding caption can be retrieved, and vice versa. We show that our model achieves results comparable to the current state of the art on two popular image-caption retrieval benchmark data sets: MSCOCO and Flickr8k. We evaluate the semantic content of the resulting sentence embeddings using data from the Semantic Textual Similarity benchmark task and show that the multimodal embeddings correlate well with human semantic similarity judgements. The system achieves state-of-the-art results on several of these benchmarks, which shows that a system trained solely on multimodal data, without assuming any word representations, is able to capture sentence-level semantics. Importantly, this result shows that we do not need prior knowledge of lexical-level semantics in order to model sentence-level semantics. These findings demonstrate the importance of visual information in semantics.
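A common way to train such a two-branch encoder is a bidirectional max-margin ranking loss over a batch of matched image-caption pairs; the NumPy sketch below shows that objective. The abstract does not state the exact loss the authors use, so treat this as a standard formulation rather than their own.

```python
import numpy as np

def cosine_sim(A, B):
    """Pairwise cosine similarity between rows of A and rows of B."""
    A = A / (np.linalg.norm(A, axis=1, keepdims=True) + 1e-9)
    B = B / (np.linalg.norm(B, axis=1, keepdims=True) + 1e-9)
    return A @ B.T

def bidirectional_hinge_loss(img_emb, cap_emb, margin=0.2):
    """Bidirectional ranking objective for image-caption retrieval.
    Matching pairs sit on the diagonal of the similarity matrix; every
    off-diagonal entry serves as a negative for both directions."""
    S = cosine_sim(img_emb, cap_emb)                      # (n, n)
    pos = np.diag(S)                                      # true-pair scores
    cost_cap = np.maximum(0, margin + S - pos[:, None])   # image -> caption
    cost_img = np.maximum(0, margin + S - pos[None, :])   # caption -> image
    np.fill_diagonal(cost_cap, 0)
    np.fill_diagonal(cost_img, 0)
    return cost_cap.sum() + cost_img.sum()
```

Minimizing this loss pushes each true image-caption pair to score at least `margin` above every mismatched pair in both retrieval directions, which is what makes the shared embedding space usable for retrieval at test time.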