Zero-Shot Event Detection by Multimodal Distributional Semantic Embedding of Videos
We propose a new zero-shot Event Detection method by Multi-modal
Distributional Semantic embedding of videos. Our model embeds object and action
concepts as well as other available modalities from videos into a
distributional semantic space. To our knowledge, this is the first Zero-Shot
event detection model that is built on top of distributional semantics and
extends it in the following directions: (a) semantic embedding of multimodal
information in videos (with focus on the visual modalities), (b) automatically
determining relevance of concepts/attributes to a free text query, which could
be useful for other applications, and (c) retrieving videos by free text event
query (e.g., "changing a vehicle tire") based on their content. We embed videos
into a distributional semantic space and then measure the similarity between
videos and the event query in a free text form. We validated our method on the
large TRECVID MED (Multimedia Event Detection) challenge. Using only the event
title as a query, our method outperformed the state of the art, which uses big
descriptions, improving MAP from 12.6% to 13.5% and ROC-AUC from 0.73 to 0.83.
It is also an order of magnitude faster. Comment: To appear in AAAI 201
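As a rough illustration of the retrieval step described in the abstract above (embedding videos and a free-text query into one semantic space, then ranking by similarity), here is a minimal sketch. It is not the authors' implementation. The toy word vectors, the concept names, and the helpers `embed_query`, `embed_video` and `rank_videos` are hypothetical, and the confidence-weighted average of concept vectors is an assumption made for simplicity.

```python
import math

# Toy 3-d "distributional" word vectors, invented for illustration only.
WORD_VECS = {
    "changing": [0.50, 0.40, 0.10],
    "vehicle":  [0.80, 0.20, 0.10],
    "tire":     [0.90, 0.10, 0.00],
    "car":      [0.85, 0.15, 0.05],
    "wrench":   [0.70, 0.30, 0.00],
    "dog":      [0.00, 0.10, 0.90],
    "grass":    [0.10, 0.20, 0.80],
}
DIM = 3


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def embed_query(text):
    """Average the vectors of the query words found in the vocabulary."""
    words = [w for w in text.lower().split() if w in WORD_VECS]
    if not words:
        return [0.0] * DIM
    return [sum(WORD_VECS[w][i] for w in words) / len(words) for i in range(DIM)]


def embed_video(concept_scores):
    """Average concept vectors, weighted by (assumed) detector confidences."""
    known = {c: s for c, s in concept_scores.items() if c in WORD_VECS}
    total = sum(known.values())
    if total == 0:
        return [0.0] * DIM
    return [sum(s * WORD_VECS[c][i] for c, s in known.items()) / total
            for i in range(DIM)]


def rank_videos(query, videos):
    """Return (video name, similarity) pairs, most similar first."""
    q = embed_query(query)
    scored = [(name, cosine(q, embed_video(cs))) for name, cs in videos.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical detector outputs for two videos.
VIDEOS = {
    "v_dog":  {"dog": 0.8, "grass": 0.6},
    "v_tire": {"car": 0.9, "wrench": 0.7},
}
ranking = rank_videos("changing a vehicle tire", VIDEOS)
```

Here `ranking` places `v_tire` ahead of `v_dog`, because the query words and the car/wrench concepts sit close together in the toy space. No event-specific training examples are involved, which is the zero-shot aspect.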
Unsupervised Visual and Textual Information Fusion in Multimedia Retrieval - A Graph-based Point of View
Multimedia collections are more than ever growing in size and diversity.
Effective multimedia retrieval systems are thus critical to access these
datasets from the end-user perspective and in a scalable way. We are interested
in repositories of image/text multimedia objects and we study multimodal
information fusion techniques in the context of content based multimedia
information retrieval. We focus on graph based methods which have proven to
provide state-of-the-art performance. In particular, we examine two such
methods: cross-media similarities and random walk based scores. From a
theoretical viewpoint, we propose a unifying graph based framework which
encompasses the two aforementioned approaches. Our proposal allows us to
highlight the core features one should consider when using a graph based
technique for the combination of visual and textual information. We compare
cross-media and random walk based results using three different real-world
datasets. From a practical standpoint, our extended empirical analysis allows us
to provide insights and guidelines about the use of graph based methods for
multimodal information fusion in content based multimedia information
retrieval. Comment: An extended version of the paper: Visual and Textual Information
Fusion in Multimedia Retrieval using Semantic Filtering and Graph based
Methods, by J. Ah-Pine, G. Csurka and S. Clinchant, submitted to ACM
Transactions on Information System
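The abstract above names two graph-based fusion families. As a hedged sketch of the random-walk family only, the code below fuses a visual and a textual similarity matrix by a convex combination and scores items with a random walk with restart, computed by power iteration. This is not the paper's framework, which is not specified in this listing. The weight `alpha`, the restart probability, the toy matrices and all function names are assumptions.

```python
def row_normalize(matrix):
    """Turn a non-negative similarity matrix into a row-stochastic one."""
    out = []
    for row in matrix:
        s = sum(row)
        n = len(row)
        # Rows with no outgoing weight fall back to a uniform distribution.
        out.append([x / s for x in row] if s > 0 else [1.0 / n] * n)
    return out


def fuse(visual, textual, alpha=0.5):
    """Convex combination of two similarity matrices (an assumed fusion rule)."""
    n = len(visual)
    return [[alpha * visual[i][j] + (1 - alpha) * textual[i][j]
             for j in range(n)] for i in range(n)]


def random_walk_with_restart(sim, seed, restart=0.15, iters=100):
    """Power iteration for s = restart * seed + (1 - restart) * s P.

    `seed` must be a probability distribution over the items.
    """
    P = row_normalize(sim)
    n = len(P)
    scores = list(seed)
    for _ in range(iters):
        scores = [restart * seed[j]
                  + (1 - restart) * sum(scores[i] * P[i][j] for i in range(n))
                  for j in range(n)]
    return scores


# Toy data: items 0 and 1 look alike; items 1 and 2 share text; item 3 is
# only weakly linked to item 2.
visual = [[0.0, 0.9, 0.1, 0.0],
          [0.9, 0.0, 0.1, 0.0],
          [0.1, 0.1, 0.0, 0.1],
          [0.0, 0.0, 0.1, 0.0]]
textual = [[0.0, 0.1, 0.0, 0.0],
           [0.1, 0.0, 0.8, 0.0],
           [0.0, 0.8, 0.0, 0.1],
           [0.0, 0.0, 0.1, 0.0]]

scores = random_walk_with_restart(fuse(visual, textual), seed=[1.0, 0.0, 0.0, 0.0])
```

Starting the walk at item 0, item 2 scores higher than item 3 even though it is visually dissimilar to the seed, because the walk reaches it through the textual link from item 1. That is the kind of cross-modal propagation graph-based fusion is meant to capture.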
Unified Embedding and Metric Learning for Zero-Exemplar Event Detection
Event detection in unconstrained videos is conceived as a content-based video
retrieval with two modalities: textual and visual. Given a text describing a
novel event, the goal is to rank related videos accordingly. This task is
zero-exemplar: no video examples are given for the novel event.
Related works train a bank of concept detectors on external data sources.
These detectors predict confidence scores for test videos, which are ranked and
retrieved accordingly. In contrast, we learn a joint space in which the visual
and textual representations are embedded. The space casts a novel event as a
probability of pre-defined events. Also, it learns to measure the distance
between an event and its related videos.
Our model is trained end-to-end on publicly available EventNet. When applied
to TRECVID Multimedia Event Detection dataset, it outperforms the
state-of-the-art by a considerable margin. Comment: IEEE CVPR 201
One Embedder, Any Task: Instruction-Finetuned Text Embeddings
We introduce INSTRUCTOR, a new method for computing text embeddings given
task instructions: every text input is embedded together with instructions
explaining the use case (e.g., task and domain descriptions). Unlike encoders
from prior work that are more specialized, INSTRUCTOR is a single embedder that
can generate text embeddings tailored to different downstream tasks and
domains, without any further training. We first annotate instructions for 330
diverse tasks and train INSTRUCTOR on this multitask mixture with a contrastive
loss. We evaluate INSTRUCTOR on 70 embedding evaluation tasks (66 of which are
unseen during training), ranging from classification and information retrieval
to semantic textual similarity and text generation evaluation. INSTRUCTOR,
while having an order of magnitude fewer parameters than the previous best
model, achieves state-of-the-art performance, with an average improvement of
3.4% compared to the previous best results on the 70 diverse datasets. Our
analysis suggests that INSTRUCTOR is robust to changes in instructions, and
that instruction finetuning mitigates the challenge of training a single model
on diverse datasets. Our model, code, and data are available at
https://instructor-embedding.github.io. Comment: Accepted in ACL2023 Finding
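The abstract above describes embedding each input together with a task instruction and training with a contrastive loss. The sketch below shows both ideas in isolation. It is not the INSTRUCTOR code or API. The helper names, the plain string concatenation in `with_instruction`, the example instruction, the temperature value and the InfoNCE-style form of the loss are all assumptions, since the abstract does not give these details.

```python
import math


def with_instruction(instruction, text):
    """Prepend a task instruction to the input (assumed: simple concatenation)."""
    return f"{instruction} {text}"


def info_nce(sim_pos, sims_neg, temperature=0.05):
    """InfoNCE-style contrastive loss for one anchor.

    sim_pos  : similarity between the anchor and its positive example
    sims_neg : similarities between the anchor and negative examples
    Returns -log( exp(sim_pos/t) / sum over all candidates of exp(sim/t) ).
    """
    logits = [sim_pos / temperature] + [s / temperature for s in sims_neg]
    m = max(logits)  # subtract the max for numerical stability
    log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_denominator - logits[0]


# Hypothetical instruction; the real instruction templates are not given here.
model_input = with_instruction(
    "Represent the question for retrieving supporting documents:",
    "who wrote the origin of species",
)

# The loss is small when the positive is the most similar candidate and
# large when a negative is more similar than the positive.
loss_good = info_nce(0.9, [0.1, 0.2])
loss_bad = info_nce(0.2, [0.9, 0.1])
```

Because the instruction is part of the input, the same text can receive different embeddings for different tasks. The contrastive loss then pulls each instructed input toward its positive and away from its negatives.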
Living Knowledge
Diversity, especially manifested in language and knowledge, is a function of local goals, needs, competences, beliefs, culture, opinions and personal experience. The Living Knowledge project considers diversity as an asset rather than a problem. Within the project, foundational ideas that emerged from the synergic contribution of different disciplines, methodologies (with which many partners were previously unfamiliar) and technologies flowed into concrete diversity-aware applications such as the Future Predictor and the Media Content Analyser, providing users with better structured information while coping with Web scale complexities. The key notions of diversity, fact, opinion and bias have been defined in relation to three methodologies: Media Content Analysis (MCA), which operates from a social sciences perspective; Multimodal Genre Analysis (MGA), which operates from a semiotic perspective; and Facet Analysis (FA), which operates from a knowledge representation and organization perspective. A conceptual architecture that pulls all of them together has become the core of the tools for automatic extraction and the way they interact. In particular, the conceptual architecture has been implemented with the Media Content Analyser application. The scientific and technological results obtained are described in the following
ControlRetriever: Harnessing the Power of Instructions for Controllable Retrieval
Recent studies have shown that dense retrieval models, lacking dedicated
training data, struggle to perform well across diverse retrieval tasks, as
different retrieval tasks often entail distinct search intents. To address this
challenge, in this work we introduce ControlRetriever, a generic and efficient
approach with a parameter isolated architecture, capable of controlling dense
retrieval models to directly perform varied retrieval tasks, harnessing the
power of instructions that explicitly describe retrieval intents in natural
language. Leveraging the foundation of ControlNet, which has proven powerful in
text-to-image generation, ControlRetriever imbues different retrieval models
with the new capacity of controllable retrieval, all while being guided by
task-specific instructions. Furthermore, we propose a novel LLM guided
Instruction Synthesizing and Iterative Training strategy, which iteratively
tunes ControlRetriever based on extensive automatically-generated retrieval
data with diverse instructions by capitalizing on the advancement of large
language models. Extensive experiments show that in the BEIR benchmark, with
only natural language descriptions of specific retrieval intent for each task,
ControlRetriever, as a unified multi-task retrieval system without
task-specific tuning, significantly outperforms baseline methods designed with
task-specific retrievers and also achieves state-of-the-art zero-shot
performance
Learning to represent, categorise and rank in community question answering
The task of Question Answering (QA) is arguably one of the oldest tasks in Natural Language Processing, attracting high levels of interest from both industry and academia. However, most research has focused on factoid questions, e.g. Who is the president of Ireland? In contrast, research on answering non-factoid questions, such as manner, reason, difference and opinion questions, has been rather piecemeal.
This was largely due to the absence of available labelled data for the task. This is changing, however, with the growing popularity of Community Question Answering (CQA) websites, such as Quora, Yahoo! Answers and the Stack Exchange family of forums. These websites provide natural labelled data allowing us to apply machine learning techniques.
Most previous state-of-the-art approaches to the tasks of CQA-based question answering involved handcrafted features in combination with linear models. In this thesis we hypothesise that the use of handcrafted features can be avoided and the tasks can be approached with representation learning techniques, specifically deep learning.
In the first part of this thesis we give an overview of deep learning in natural language processing and empirically evaluate our hypothesis on the task of detecting semantically equivalent questions, i.e. predicting if two questions can be answered by the same answer.
In the second part of the thesis we address the task of answer ranking, i.e. determining how suitable an answer is for a given question. In order to determine the suitability of representation learning for the task of answer ranking, we provide a rigorous experimental evaluation of various neural architectures, based on feedforward, recurrent and convolutional neural networks, as well as their combinations.
This thesis shows that deep learning is a very suitable approach to CQA-based QA, achieving state-of-the-art results on the two tasks we addressed