4,496 research outputs found
Action Recognition in Videos: from Motion Capture Labs to the Web
This paper presents a survey of human action recognition approaches based on
visual data recorded from a single video camera. We propose an organizing
framework which puts in evidence the evolution of the area, with techniques
moving from heavily constrained motion capture scenarios towards more
challenging, realistic, "in the wild" videos. The proposed organization is
based on the representation used as input for the recognition task, emphasizing
the hypothesis assumed and thus, the constraints imposed on the type of video
that each technique is able to address. Expliciting the hypothesis and
constraints makes the framework particularly useful to select a method, given
an application. Another advantage of the proposed organization is that it
allows categorizing newest approaches seamlessly with traditional ones, while
providing an insightful perspective of the evolution of the action recognition
task up to now. That perspective is the basis for the discussion in the end of
the paper, where we also present the main open issues in the area.Comment: Preprint submitted to CVIU, survey paper, 46 pages, 2 figures, 4
table
ConceptGraphs: Open-Vocabulary 3D Scene Graphs for Perception and Planning
For robots to perform a wide variety of tasks, they require a 3D
representation of the world that is semantically rich, yet compact and
efficient for task-driven perception and planning. Recent approaches have
attempted to leverage features from large vision-language models to encode
semantics in 3D representations. However, these approaches tend to produce maps
with per-point feature vectors, which do not scale well in larger environments,
nor do they contain semantic spatial relationships between entities in the
environment, which are useful for downstream planning. In this work, we propose
ConceptGraphs, an open-vocabulary graph-structured representation for 3D
scenes. ConceptGraphs is built by leveraging 2D foundation models and fusing
their output to 3D by multi-view association. The resulting representations
generalize to novel semantic classes, without the need to collect large 3D
datasets or finetune models. We demonstrate the utility of this representation
through a number of downstream planning tasks that are specified through
abstract (language) prompts and require complex reasoning over spatial and
semantic concepts. (Project page: https://concept-graphs.github.io/ Explainer
video: https://youtu.be/mRhNkQwRYnc )Comment: Project page: https://concept-graphs.github.io/ Explainer video:
https://youtu.be/mRhNkQwRYn
Vision Language Models in Autonomous Driving and Intelligent Transportation Systems
The applications of Vision-Language Models (VLMs) in the fields of Autonomous
Driving (AD) and Intelligent Transportation Systems (ITS) have attracted
widespread attention due to their outstanding performance and the ability to
leverage Large Language Models (LLMs). By integrating language data, the
vehicles, and transportation systems are able to deeply understand real-world
environments, improving driving safety and efficiency. In this work, we present
a comprehensive survey of the advances in language models in this domain,
encompassing current models and datasets. Additionally, we explore the
potential applications and emerging research directions. Finally, we thoroughly
discuss the challenges and research gap. The paper aims to provide researchers
with the current work and future trends of VLMs in AD and ITS
- …