57,378 research outputs found
Contextual Media Retrieval Using Natural Language Queries
The widespread integration of cameras in hand-held and head-worn devices as
well as the ability to share content online enables a large and diverse visual
capture of the world that millions of users build up collectively every day. We
envision these images as well as associated meta information, such as GPS
coordinates and timestamps, to form a collective visual memory that can be
queried while automatically taking the ever-changing context of mobile users
into account. As a first step towards this vision, in this work we present
Xplore-M-Ego: a novel media retrieval system that allows users to query a
dynamic database of images and videos using spatio-temporal natural language
queries. We evaluate our system using a new dataset of real user queries as
well as through a usability study. One key finding is that there is a
considerable amount of inter-user variability, for example in the resolution of
spatial relations in natural language utterances. We show that our retrieval
system can cope with this variability using personalisation through an online
learning-based retrieval formulation.
Comment: 8 pages, 9 figures, 1 table
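The personalised, online learning-based retrieval formulation described above can be sketched as a simple per-user preference update. This is an illustrative assumption, not the system's actual implementation: candidate features, the linear scorer, and the perceptron-style update are all hypothetical.

```python
import numpy as np

def score(weights, features):
    """Linear relevance score for one candidate image."""
    return float(np.dot(weights, features))

def online_update(weights, clicked, skipped, lr=0.1):
    """Perceptron-style update: move the user's weights toward the result
    they selected and away from a higher-ranked result they skipped."""
    if score(weights, clicked) <= score(weights, skipped):
        weights = weights + lr * (clicked - skipped)
    return weights

# Toy example with two illustrative features for the utterance
# "the cafe on the left": (agreement with egocentric "left",
# agreement with geographic "west").
w = np.zeros(2)
clicked = np.array([1.0, 0.0])   # this user resolves "left" egocentrically
skipped = np.array([0.0, 1.0])   # the geographic reading ranked higher but was skipped
for _ in range(3):
    w = online_update(w, clicked, skipped)
```

After a few interactions, the user's preferred resolution of the spatial relation outranks the alternative, which is the kind of inter-user variability the abstract reports.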
CATR: Combinatorial-Dependence Audio-Queried Transformer for Audio-Visual Video Segmentation
Audio-visual video segmentation (AVVS) aims to generate pixel-level maps of
sound-producing objects within image frames and ensure the maps faithfully
adhere to the given audio, such as identifying and segmenting a singing person
in a video. However, existing methods exhibit two limitations: 1) they address
video temporal features and audio-visual interactive features separately,
disregarding the inherent spatial-temporal dependence of combined audio and
video, and 2) they inadequately introduce audio constraints and object-level
information during the decoding stage, resulting in segmentation outcomes that
fail to comply with audio directives. To tackle these issues, we propose a
decoupled audio-video transformer that combines audio and video features from
their respective temporal and spatial dimensions, capturing their combined
dependence. To optimize memory consumption, we design a block, which, when
stacked, enables capturing audio-visual fine-grained combinatorial-dependence
in a memory-efficient manner. Additionally, we introduce audio-constrained
queries during the decoding phase. These queries contain rich object-level
information, ensuring the decoded mask adheres to the sounds. Experimental
results confirm our approach's effectiveness, with our framework achieving a
new SOTA performance on all three datasets using two backbones. The code is
available at \url{https://github.com/aspirinone/CATR.github.io}
Comment: accepted by ACM MM 202
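The audio-constrained decoder queries described above can be sketched as object queries cross-attending over audio-frame features, so the decoded masks are biased toward sound-producing objects. This is a minimal single-head sketch, not the authors' code; the shapes and the residual connection are simplifying assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def audio_constrain_queries(queries, audio):
    """Cross-attention: object queries (N, d) gather information from
    audio frame features (T, d); output keeps shape (N, d)."""
    d = queries.shape[-1]
    attn = softmax(queries @ audio.T / np.sqrt(d))  # (N, T) attention weights
    return queries + attn @ audio                   # residual connection

rng = np.random.default_rng(0)
q = rng.normal(size=(5, 16))    # 5 object queries
a = rng.normal(size=(10, 16))   # 10 audio frames
out = audio_constrain_queries(q, a)
```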
Multilevel Language and Vision Integration for Text-to-Clip Retrieval
We address the problem of text-based activity retrieval in video. Given a
sentence describing an activity, our task is to retrieve matching clips from an
untrimmed video. To capture the inherent structures present in both text and
video, we introduce a multilevel model that integrates vision and language
features earlier and more tightly than prior work. First, we inject text
features early on when generating clip proposals, to help eliminate unlikely
clips and thus speed up processing and boost performance. Second, to learn a
fine-grained similarity metric for retrieval, we use visual features to
modulate the processing of query sentences at the word level in a recurrent
neural network. A multi-task loss is also employed by adding query
re-generation as an auxiliary task. Our approach significantly outperforms
prior work on two challenging benchmarks: Charades-STA and ActivityNet
Captions.
Comment: AAAI 201
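The word-level visual modulation described above can be sketched as gating each word embedding by a projection of the clip's visual feature before it enters the recurrent encoder. The dimensions and the sigmoid gate here are our assumptions for illustration, not the paper's exact formulation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def modulate_words(word_embs, visual_feat, W):
    """word_embs: (T, d) word embeddings; visual_feat: (k,) pooled clip
    feature; W: (d, k) projection. Returns gated embeddings, shape (T, d)."""
    gate = sigmoid(W @ visual_feat)      # (d,) gate values in (0, 1)
    return word_embs * gate              # broadcast over all T time steps

rng = np.random.default_rng(1)
words = rng.normal(size=(7, 32))    # embeddings for a 7-word query sentence
visual = rng.normal(size=(64,))     # pooled visual feature of one clip
W = rng.normal(size=(32, 64)) * 0.1
gated = modulate_words(words, visual, W)
```

Because every gate value lies in (0, 1), the modulation can only attenuate word dimensions, letting the visual context down-weight words that are irrelevant to the clip.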
Interactive visual exploration of a large spatio-temporal dataset: Reflections on a geovisualization mashup
Exploratory visual analysis is useful for the preliminary investigation of large structured, multifaceted spatio-temporal datasets. This process requires the selection and aggregation of records by time, space and attribute, the ability to transform data and the flexibility to apply appropriate visual encodings and interactions. We propose an approach inspired by geographical 'mashups' in which freely-available functionality and data are loosely but flexibly combined using de facto exchange standards. Our case study combines MySQL, PHP and the LandSerf GIS to allow Google Earth to be used for visual synthesis and interaction with encodings described in KML. This approach is applied to the exploration of a log of 1.42 million requests made of a mobile directory service. Novel combinations of interaction and visual encoding are developed including spatial 'tag clouds', 'tag maps', 'data dials' and multi-scale density surfaces. Four aspects of the approach are informally evaluated: the visual encodings employed, their success in the visual exploration of the dataset, the specific tools used and the 'mashup' approach. Preliminary findings will be beneficial to others considering using mashups for visualization. The specific techniques developed may be more widely applied to offer insights into the structure of multifarious spatio-temporal data of the type explored here.
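The KML-based pipeline described above (database records rendered as encodings for Google Earth) can be sketched by emitting a Placemark for one aggregated log record. The original used MySQL and PHP; plain Python is used here, and the field names and example values are illustrative, not from the actual log.

```python
def placemark(name, lon, lat, count):
    """Return a minimal KML Placemark string for one aggregated location."""
    return (
        "<Placemark>"
        f"<name>{name}</name>"
        f"<description>{count} requests</description>"
        f"<Point><coordinates>{lon},{lat},0</coordinates></Point>"
        "</Placemark>"
    )

# A minimal KML document wrapping one hypothetical aggregated record.
doc = (
    '<?xml version="1.0" encoding="UTF-8"?>'
    '<kml xmlns="http://www.opengis.net/kml/2.2"><Document>'
    + placemark("directory requests: central London", -0.1, 51.52, 1342)
    + "</Document></kml>"
)
```

Richer encodings such as the tag maps and density surfaces would be built by varying the styling and geometry of such elements rather than the overall document structure.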
Visual Information Retrieval in Digital Libraries
The emergence of information highways and multimedia computing has resulted in redefining the concept of libraries. It is widely believed that in the next few years, a significant portion of information in libraries will be in the form of multimedia electronic documents. Many approaches are being proposed for storing, retrieving, assimilating, harvesting, and prospecting information from these multimedia documents. Digital libraries are expected to allow users to access information independent of the locations and types of data sources and will provide a unified picture of information. In this paper, we discuss requirements of these emerging information systems and present query methods and data models for these systems. Finally, we briefly present a few examples of approaches that provide a preview of how things will be done in the digital libraries in the near future.