Large-Scale Visual Relationship Understanding
Large scale visual understanding is challenging, as it requires a model to
handle the widely-spread and imbalanced distribution of <subject, relation,
object> triples. In real-world scenarios with large numbers of objects and
relations, some are seen very commonly while others are barely seen. We develop
a new relationship detection model that embeds objects and relations into two
vector spaces where both discriminative capability and semantic affinity are
preserved. We learn both a visual and a semantic module that map features from
the two modalities into a shared space, where matched pairs of features must discriminate against unmatched pairs while also maintaining close distances to semantically similar ones. Benefiting from this, our model achieves superior
performance even when the visual entity categories scale up to more than
80,000, with extremely skewed class distribution. We demonstrate the efficacy
of our model on a large and imbalanced benchmark based on Visual Genome that comprises 53,000+ objects and 29,000+ relations, a scale at which no previous work has ever been evaluated. We show the superiority of our model over carefully designed baselines on the original Visual Genome dataset with 80,000+ categories. We also show state-of-the-art performance on the VRD dataset and on the scene graph dataset, a subset of Visual Genome with 200 categories.
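As a rough illustration of the kind of joint embedding described above, the sketch below projects visual features and word embeddings into a shared space in which the matched label must out-score unmatched labels while staying close to semantically similar ones. All module names, dimensions, and the softmax-style loss are assumptions for illustration, not the authors' code.

```python
# Minimal sketch of a joint visual-semantic embedding (assumed design, not the paper's code).
import torch
import torch.nn as nn
import torch.nn.functional as F

class JointEmbedding(nn.Module):
    def __init__(self, visual_dim=2048, word_dim=300, embed_dim=512):
        super().__init__()
        # Visual module: maps region features into the shared space.
        self.visual_proj = nn.Sequential(
            nn.Linear(visual_dim, embed_dim), nn.ReLU(),
            nn.Linear(embed_dim, embed_dim))
        # Semantic module: maps label word embeddings into the same space.
        self.semantic_proj = nn.Linear(word_dim, embed_dim)

    def forward(self, visual_feats, word_embs):
        # L2-normalise both modalities so similarity is a cosine score.
        v = F.normalize(self.visual_proj(visual_feats), dim=-1)   # (B, D)
        s = F.normalize(self.semantic_proj(word_embs), dim=-1)    # (C, D)
        return v @ s.t()                                          # (B, C) similarities

def embedding_loss(scores, targets, temperature=0.1):
    # Softmax cross-entropy over all candidate labels: the matched label must
    # out-score unmatched ones, while semantically similar labels remain close
    # because they share word-embedding structure in the semantic space.
    return F.cross_entropy(scores / temperature, targets)
```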
Attend and Interact: Higher-Order Object Interactions for Video Understanding
Human actions often involve complex interactions across several inter-related
objects in the scene. However, existing approaches to fine-grained video
understanding or visual relationship detection often rely on a single object representation or on pairwise object relationships. Furthermore, learning interactions across multiple objects over hundreds of video frames is computationally infeasible, and performance may suffer because a large combinatorial space has to be modeled. In this paper, we propose to efficiently
learn higher-order interactions between arbitrary subgroups of objects for
fine-grained video understanding. We demonstrate that modeling object
interactions significantly improves accuracy for both action recognition and
video captioning, while reducing computation by more than a factor of three compared with modeling traditional pairwise relationships. The proposed method is validated on two
large-scale datasets: Kinetics and ActivityNet Captions. Our SINet and
SINet-Caption achieve state-of-the-art performances on both datasets even
though the videos are sampled at a maximum of 1 FPS. To the best of our
knowledge, this is the first work modeling object interactions on open domain
large-scale video datasets, and we additionally model higher-order object
interactions, which improve performance at low computational cost.
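The move beyond pairwise relationships can be pictured as attention over learned subgroups of objects: a few attention heads each summarize a weighted subset of the detected objects, and the summaries are combined into one frame-level interaction feature, so the cost grows linearly rather than quadratically with the number of objects. The snippet below is an illustrative sketch under those assumptions, not the published SINet architecture.

```python
# Illustrative sketch of higher-order object interactions via soft subgroup attention
# (head count, dimensions, and the MLP combiner are assumptions).
import torch
import torch.nn as nn

class HigherOrderInteraction(nn.Module):
    def __init__(self, obj_dim=1024, hidden=512, num_groups=4):
        super().__init__()
        self.score = nn.Linear(obj_dim, num_groups)        # attention logits per object
        self.combine = nn.Sequential(
            nn.Linear(num_groups * obj_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden))

    def forward(self, objs):                               # objs: (N_objects, obj_dim)
        attn = torch.softmax(self.score(objs), dim=0)      # (N, G): soft subgroup membership
        groups = attn.t() @ objs                           # (G, obj_dim): one summary per subgroup
        return self.combine(groups.flatten())              # (hidden,): frame-level interaction feature
```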
STUPD: A Synthetic Dataset for Spatial and Temporal Relation Reasoning
Understanding relations between objects is crucial for understanding the
semantics of a visual scene. It is also an essential step in order to bridge
visual and language models. However, current state-of-the-art computer vision
models still lack the ability to perform spatial reasoning well. Existing
datasets mostly cover a relatively small number of spatial relations, all of
which are static relations that do not intrinsically involve motion. In this
paper, we propose the Spatial and Temporal Understanding of Prepositions
Dataset (STUPD) -- a large-scale video dataset for understanding static and
dynamic spatial relationships derived from prepositions of the English
language. The dataset contains 150K visual depictions (videos and images),
consisting of 30 distinct spatial prepositional senses, in the form of object
interaction simulations generated synthetically using Unity3D. In addition to
spatial relations, we also propose 50K visual depictions across 10 temporal
relations, consisting of videos depicting event/time-point interactions. To our
knowledge, no dataset exists that represents temporal relations through visual
settings. In this dataset, we also provide 3D information about object
interactions such as frame-wise coordinates, and descriptions of the objects
used. The goal of this synthetic dataset is to help models perform better in
visual relationship detection in real-world settings. We demonstrate an
increase in the performance of various models on two real-world datasets (ImageNet-VidVRD and Spatial Senses) when pretrained on the STUPD dataset, compared with other pretraining datasets.
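Based only on the description above, a single STUPD annotation presumably ties a prepositional sense to two interacting objects and their frame-wise 3D coordinates. The hypothetical record below sketches such a structure; field names and types are assumptions, not the released schema.

```python
# Hypothetical STUPD-style annotation record, inferred from the abstract only.
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class StupdSample:
    clip_path: str                                        # rendered Unity3D video or image
    relation: str                                         # spatial prepositional sense or temporal relation
    subject: str                                          # description of the first object
    obj: str                                              # description of the second object
    subject_coords: List[Tuple[float, float, float]]      # per-frame 3D position of the subject
    object_coords: List[Tuple[float, float, float]]       # per-frame 3D position of the object
```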
Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding
Large language models have demonstrated impressive universal capabilities
across a wide range of open-ended tasks and have extended their utility to
encompass multimodal conversations. However, existing methods encounter
challenges in effectively handling both image and video understanding,
particularly with limited visual tokens. In this work, we introduce Chat-UniVi,
a unified vision-language model capable of comprehending and engaging in
conversations involving images and videos through a unified visual
representation. Specifically, we employ a set of dynamic visual tokens to
uniformly represent images and videos. This representation framework empowers
the model to efficiently utilize a limited number of visual tokens to
simultaneously capture the spatial details necessary for images and the
comprehensive temporal relationships required for videos. Moreover, we leverage
a multi-scale representation, enabling the model to perceive both high-level
semantic concepts and low-level visual details. Notably, Chat-UniVi is trained
on a mixed dataset containing both images and videos, allowing direct
application to tasks involving both mediums without requiring any
modifications. Extensive experimental results demonstrate that Chat-UniVi, as a
unified model, consistently outperforms even existing methods exclusively
designed for either images or videos.
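One simple way to picture the dynamic visual tokens is as clustering of patch features: many tokens from an image or from sampled video frames are merged into a small fixed budget by grouping similar tokens and averaging each group. The sketch below uses plain k-means for this merging; it is an assumption for illustration, not the released Chat-UniVi code.

```python
# Sketch of merging many patch tokens into a few "dynamic" tokens via k-means
# (the actual token-merging rule in the paper may differ).
import torch

def merge_tokens(tokens: torch.Tensor, num_clusters: int, iters: int = 10) -> torch.Tensor:
    """tokens: (N, D) patch features -> (num_clusters, D) dynamic visual tokens."""
    centers = tokens[torch.randperm(tokens.size(0))[:num_clusters]].clone()
    for _ in range(iters):
        dists = torch.cdist(tokens, centers)               # (N, K) distances to centres
        assign = dists.argmin(dim=1)                       # nearest centre per token
        for k in range(num_clusters):
            members = tokens[assign == k]
            if members.numel() > 0:
                centers[k] = members.mean(dim=0)            # centre = mean of merged tokens
    return centers
```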
Construction of a multi-scale spiking model of macaque visual cortex
Understanding the relationship between structure and dynamics of the mammalian cortex is a key challenge of neuroscience. So far, it has been tackled in two ways: by modeling neurons or small circuits in great detail, and through large-scale models representing each area with a small number of differential equations. To bridge the gap between these two approaches, we construct a spiking network model extending earlier work on the cortical microcircuit by Potjans & Diesmann (2014) to all 32 areas of the macaque visual cortex in the parcellation of Felleman & Van Essen (1991). The model takes into account specific neuronal densities and laminar thicknesses of the individual areas. The connectivity of the model combines recently updated binary tracing data from the CoCoMac database (Stephan et al., 2001) with quantitative tracing data providing connection densities (Markov et al., 2014a) and laminar connection patterns (Stephan et al., 2001; Markov et al., 2014b). We estimate missing data using structural regularities such as the exponential decay of connection densities with distance between areas (Ercsey-Ravasz et al., 2013) and a fit of laminar patterns versus logarithmic ratios of neuron densities. The model integrates a large body of knowledge on the structure of macaque visual cortex into a consistent framework that allows for progressive refinement.
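One of the structural regularities mentioned above, the exponential decay of connection densities with inter-areal distance (Ercsey-Ravasz et al., 2013), can be written as a one-line predictor for filling in missing connections. The decay constant in the example below is a placeholder, not a value fitted in the model.

```python
# Sketch of an exponential distance rule for estimating missing connection densities.
import numpy as np

def estimated_connection_density(distance_mm: np.ndarray, scale: float, decay_per_mm: float) -> np.ndarray:
    """Predict connection density between two areas as an exponential function
    of the wiring distance separating them."""
    return scale * np.exp(-decay_per_mm * distance_mm)

# Example: with an assumed decay constant of 0.19 / mm, a target 10 mm further away
# receives roughly exp(-1.9) ~= 0.15 times the connection density.
print(estimated_connection_density(np.array([10.0, 20.0]), scale=1.0, decay_per_mm=0.19))
```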