Every Moment Counts: Dense Detailed Labeling of Actions in Complex Videos
Every moment counts in action recognition. A comprehensive understanding of
human activity in video requires labeling every frame according to the actions
occurring, placing multiple labels densely over a video sequence. To study this
problem, we extend the existing THUMOS dataset and introduce MultiTHUMOS, a new
dataset of dense labels over unconstrained internet videos. Modeling multiple,
dense labels benefits from temporal relations within and across classes. We
define a novel variant of long short-term memory (LSTM) deep networks for
modeling these temporal relations via multiple input and output connections. We
show that this model improves action labeling accuracy and further enables
deeper understanding tasks ranging from structured retrieval to action
prediction.
Comment: To appear in IJC
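To make the dense, multi-label formulation concrete, here is a minimal PyTorch sketch of a per-frame multi-label tagger over precomputed frame features. It is not the LSTM variant described above (which adds multiple input and output connections); it only illustrates how dense labels, with several actions possibly active in the same frame, change the output layer and the loss. All dimensions and the class count are illustrative.

```python
import torch
import torch.nn as nn

class DenseActionLabeler(nn.Module):
    """Per-frame multi-label action tagger over precomputed frame features."""
    def __init__(self, feat_dim=2048, hidden_dim=512, num_classes=65):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, num_classes)

    def forward(self, frame_feats):              # frame_feats: (B, T, feat_dim)
        hidden, _ = self.lstm(frame_feats)       # (B, T, hidden_dim)
        return torch.sigmoid(self.head(hidden))  # per-frame, per-class probabilities

# Multi-label training: every frame may carry several simultaneous action labels,
# so the loss is binary cross-entropy per class rather than a single softmax.
model = DenseActionLabeler()
feats = torch.randn(2, 30, 2048)                   # 2 clips, 30 frames each
labels = torch.randint(0, 2, (2, 30, 65)).float()  # dense multi-label targets
loss = nn.BCELoss()(model(feats), labels)
```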
Generalizable Neural Fields as Partially Observed Neural Processes
Neural fields, which represent signals as a function parameterized by a
neural network, are a promising alternative to traditional discrete vector or
grid-based representations. Compared to discrete representations, neural
representations scale well with increasing resolution, are continuous, and
can be many-times differentiable. However, given a dataset of signals that we
would like to represent, having to optimize a separate neural field for each
signal is inefficient, and cannot capitalize on shared information or
structures among signals. Existing generalization methods view this as a
meta-learning problem and employ gradient-based meta-learning to learn an
initialization which is then fine-tuned with test-time optimization, or learn
hypernetworks to produce the weights of a neural field. We instead propose a
new paradigm that views the large-scale training of neural representations as a
part of a partially-observed neural process framework, and leverage neural
process algorithms to solve this task. We demonstrate that this approach
outperforms both state-of-the-art gradient-based meta-learning approaches and
hypernetwork approaches.
Comment: To appear ICCV 202
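As a rough illustration of the neural-process framing described above, the sketch below (a minimal conditional-neural-process-style model, not the paper's architecture) encodes a set of observed (coordinate, value) pairs into a permutation-invariant representation and decodes the signal value at arbitrary query coordinates, so a single network can serve many signals instead of optimizing one field per signal.

```python
import torch
import torch.nn as nn

class SignalNeuralProcess(nn.Module):
    """Encode partial observations of a signal; decode it at query coordinates."""
    def __init__(self, coord_dim=2, value_dim=3, latent_dim=128):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(coord_dim + value_dim, latent_dim), nn.ReLU(),
            nn.Linear(latent_dim, latent_dim))
        self.decoder = nn.Sequential(
            nn.Linear(coord_dim + latent_dim, latent_dim), nn.ReLU(),
            nn.Linear(latent_dim, value_dim))

    def forward(self, ctx_coords, ctx_values, query_coords):
        # ctx_coords: (B, Nc, coord_dim); ctx_values: (B, Nc, value_dim)
        ctx = torch.cat([ctx_coords, ctx_values], dim=-1)
        rep = self.encoder(ctx).mean(dim=1)              # order-invariant context summary
        rep = rep.unsqueeze(1).expand(-1, query_coords.shape[1], -1)
        return self.decoder(torch.cat([query_coords, rep], dim=-1))

# Example use: reconstruct an RGB image (value_dim=3) from a sparse subset of
# observed pixels, querying the decoder on a dense grid of 2D coordinates.
```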
Diffusion-HPC: Synthetic Data Generation for Human Mesh Recovery in Challenging Domains
Recent text-to-image generative models have exhibited remarkable abilities in
generating high-fidelity and photo-realistic images. However, despite the
visually impressive results, these models often struggle to preserve plausible
human structure in the generated images. As a result, while generative models
have shown promising results in aiding downstream image recognition tasks by
generating large volumes of synthetic data, they are not suitable for improving
downstream human pose perception and understanding. In this work, we propose a
Diffusion model with Human Pose Correction (Diffusion-HPC), a text-conditioned
method that generates photo-realistic images with plausible posed humans by
injecting prior knowledge about human body structure. Our generated images are
accompanied by 3D meshes that serve as ground truths for improving Human Mesh
Recovery tasks, where a shortage of 3D training data has long been an issue.
Furthermore, we show that Diffusion-HPC effectively improves the realism of
human generations under varying conditioning strategies.
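The described pipeline can be summarized as the control flow below. Every function here is a hypothetical placeholder (no released API is implied); the sketch only illustrates the generate / check / correct loop and the paired image-mesh output used to supplement Human Mesh Recovery training data.

```python
from dataclasses import dataclass

@dataclass
class HumanMesh:
    vertices: list        # 3D body-model vertices (e.g. an SMPL-style mesh)
    plausible: bool       # whether the recovered pose passes structural checks

def generate_image(prompt: str):                  # placeholder text-to-image call
    ...

def recover_mesh(image) -> HumanMesh:             # placeholder human-mesh recovery
    ...

def regenerate_with_body_prior(prompt, mesh):     # placeholder re-generation conditioned on the body
    ...

def synthesize_training_pair(prompt: str):
    """Return an (image, mesh) pair usable as pseudo-ground-truth for HMR training."""
    image = generate_image(prompt)
    mesh = recover_mesh(image)
    if not mesh.plausible:                        # inject the body-structure prior only when needed
        image = regenerate_with_body_prior(prompt, mesh)
        mesh = recover_mesh(image)
    return image, mesh
```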
AdaEmbed: Semi-supervised Domain Adaptation in the Embedding Space
Semi-supervised domain adaptation (SSDA) presents a critical hurdle in
computer vision, especially given the frequent scarcity of labeled data in
real-world settings. This scarcity often causes foundation models, trained on
extensive datasets, to underperform when applied to new domains. AdaEmbed, our
newly proposed methodology for SSDA, offers a promising solution to these
challenges. Leveraging the potential of unlabeled data, AdaEmbed facilitates
the transfer of knowledge from a labeled source domain to an unlabeled target
domain by learning a shared embedding space. By generating accurate and uniform
pseudo-labels based on the established embedding space, the model overcomes the
limitations of conventional SSDA, thus enhancing performance significantly. Our
method's effectiveness is validated through extensive experiments on benchmark
datasets such as DomainNet, Office-Home, and VisDA-C, where AdaEmbed
consistently outperforms all the baselines, setting a new state of the art for
SSDA. With its straightforward implementation and high data efficiency,
AdaEmbed stands out as a robust and pragmatic solution for real-world
scenarios, where labeled data is scarce. To foster further research and
application in this area, we are sharing the codebase of our unified framework
for semi-supervised domain adaptation.
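A minimal sketch of pseudo-labeling in a shared embedding space, in the spirit of the description above (this is not the released AdaEmbed code): class prototypes are computed from labeled embeddings, and each unlabeled target embedding receives the label of its nearest prototype only when the cosine similarity clears a threshold. The 0.8 threshold is an arbitrary illustrative value.

```python
import torch
import torch.nn.functional as F

def pseudo_label(labeled_emb, labels, unlabeled_emb, num_classes, threshold=0.8):
    """Assign nearest-prototype pseudo-labels to unlabeled embeddings.

    Assumes every class has at least one labeled example.
    """
    labeled_emb = F.normalize(labeled_emb, dim=1)
    unlabeled_emb = F.normalize(unlabeled_emb, dim=1)
    # One prototype per class: the mean of that class's normalized embeddings.
    prototypes = torch.stack([labeled_emb[labels == c].mean(dim=0)
                              for c in range(num_classes)])
    prototypes = F.normalize(prototypes, dim=1)
    similarity = unlabeled_emb @ prototypes.T     # cosine similarity to each prototype
    confidence, pseudo = similarity.max(dim=1)
    keep = confidence > threshold                 # discard low-confidence assignments
    return pseudo[keep], keep
```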
VideoAgent: Long-form Video Understanding with Large Language Model as Agent
Long-form video understanding represents a significant challenge within
computer vision, demanding a model capable of reasoning over long multi-modal
sequences. Motivated by the human cognitive process for long-form video
understanding, we emphasize interactive reasoning and planning over the ability
to process lengthy visual inputs. We introduce a novel agent-based system,
VideoAgent, that employs a large language model as a central agent to
iteratively identify and compile crucial information to answer a question, with
vision-language foundation models serving as tools to translate and retrieve
visual information. Evaluated on the challenging EgoSchema and NExT-QA
benchmarks, VideoAgent achieves 54.1% and 71.3% zero-shot accuracy with only
8.4 and 8.2 frames used on average. These results demonstrate the superior
effectiveness and efficiency of our method over current state-of-the-art
methods, highlighting the potential of agent-based approaches in advancing
long-form video understanding.
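The iterative loop described above can be sketched as follows; every helper is a hypothetical stub rather than a real API. The point is the control flow: start from a handful of frames, let the language model judge whether it already has enough evidence, and retrieve additional frames only when it does not.

```python
def sample_frames(video, n):                 # placeholder: uniform frame sampling
    step = max(len(video) // n, 1)
    return video[::step][:n]

def describe_frame(frame) -> str:            # placeholder: vision-language caption of one frame
    ...

def ask_llm(question, captions) -> dict:     # placeholder: returns {"confident", "answer", "query"}
    ...

def retrieve_frames(video, query):           # placeholder: text-to-frame retrieval
    ...

def answer_question(video, question, max_rounds=3):
    captions = [describe_frame(f) for f in sample_frames(video, 5)]
    for _ in range(max_rounds):
        decision = ask_llm(question, captions)
        if decision["confident"]:                        # enough evidence gathered
            return decision["answer"]
        extra = retrieve_frames(video, decision["query"])
        captions += [describe_frame(f) for f in extra]   # add only the frames the agent asked for
    return ask_llm(question, captions)["answer"]         # answer with whatever was collected
```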