Causal video object segmentation from persistence of occlusions
On the far right, our algorithm correctly infers that the bag strap is in front of the woman’s arm, which is in front of her trunk, which is in front of the background.
Towards Structured Analysis of Broadcast Badminton Videos
Sports video data is recorded for nearly every major tournament but remains
archived and inaccessible to large-scale data mining and analytics. It can only
be viewed sequentially or manually tagged with higher-level labels, which is
time-consuming and prone to errors. In this work, we propose an end-to-end
framework for automatic attribute tagging and analysis of sports videos. We use
commonly available broadcast videos of matches and, unlike previous approaches,
do not rely on special camera setups or additional sensors.
Our focus is on Badminton as the sport of interest. We propose a method to
analyze a large corpus of badminton broadcast videos by segmenting the points
played, tracking and recognizing the players in each point and annotating their
respective badminton strokes. We evaluate performance on 10 Olympic matches
with 20 players and achieve 95.44% point segmentation accuracy, a 97.38% player
detection score (mAP@0.5), 97.98% player identification accuracy, and a stroke
segmentation edit score of 80.48%. We further show that the automatically
annotated videos alone enable gameplay analysis and inference by computing
interpretable metrics such as a player's reaction time, speed, and footwork
around the court.
Comment: 9 pages
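Metrics like reaction time and court speed fall out directly from per-frame player tracks. A minimal sketch of the idea (the function names, frame rate, and movement threshold below are illustrative assumptions, not the paper's implementation):

```python
import numpy as np

FPS = 25.0  # assumed broadcast frame rate

def speed_profile(positions):
    """Per-frame speed (m/s) from an (N, 2) array of court positions."""
    disp = np.diff(positions, axis=0)          # frame-to-frame displacement
    return np.linalg.norm(disp, axis=1) * FPS  # metres per second

def reaction_time(speeds, stroke_frame, thresh=0.5):
    """Seconds until the player exceeds `thresh` m/s after the
    opponent's stroke; None if the threshold is never reached."""
    for i in range(stroke_frame, len(speeds)):
        if speeds[i] > thresh:
            return (i - stroke_frame) / FPS
    return None

# Toy track: the player stands still, then moves after frame 11.
pos = np.zeros((20, 2))
pos[12:, 0] = np.cumsum(np.full(8, 0.1))  # 0.1 m per frame = 2.5 m/s
v = speed_profile(pos)
rt = reaction_time(v, stroke_frame=10)
```

Footwork coverage could be summarized the same way, e.g. as a 2D histogram of the tracked positions over the court.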
CHORUS: Learning Canonicalized 3D Human-Object Spatial Relations from Unbounded Synthesized Images
We present a method for teaching machines to understand and model the
underlying spatial common sense of diverse human-object interactions in 3D in a
self-supervised way. This is a challenging task, as there exist specific
manifolds of the interactions that can be considered human-like and natural,
but the human pose and the geometry of objects can vary even for similar
interactions. Such diversity makes annotating 3D interactions difficult and
hard to scale, which limits the potential to reason about them in a supervised
way. One way of learning the 3D spatial relationship between
humans and objects during interaction is by showing multiple 2D images captured
from different viewpoints when humans interact with the same type of objects.
The core idea of our method is to leverage a generative model that produces
high-quality 2D images from an arbitrary text prompt input as an "unbounded"
data generator with effective controllability and view diversity. Although
their image quality falls short of real images, we demonstrate that the
synthesized images are sufficient to learn the 3D human-object spatial
relations. We present multiple strategies to leverage the synthesized images,
including (1) the first method to leverage a generative image model for 3D
human-object spatial relation learning; (2) a framework to reason about the 3D
spatial relations from inconsistent 2D cues in a self-supervised manner via 3D
occupancy reasoning with pose canonicalization; (3) semantic clustering to
disambiguate different types of interactions with the same object types; and
(4) a novel metric to assess the quality of 3D spatial learning of interaction.
Comment: Accepted to ICCV 2023 (Oral Presentation). Project Page: https://jellyheadandrew.github.io/projects/choru
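The occupancy-reasoning step can be pictured as voxel voting: each canonicalized view projects a voxel grid into its image and votes for voxels whose projections land inside the 2D object mask. A toy sketch with made-up orthographic cameras (everything here is an illustrative assumption, not the CHORUS implementation):

```python
import numpy as np

def occupancy_from_views(voxels, cams, masks):
    """Fraction of views whose 2D mask covers each voxel's projection.

    voxels: (N, 3) voxel centres in the pose-canonicalized frame;
    cams:   list of 3x4 projection matrices; masks: binary H x W arrays.
    """
    votes = np.zeros(len(voxels))
    homog = np.hstack([voxels, np.ones((len(voxels), 1))])
    for P, mask in zip(cams, masks):
        uvw = homog @ P.T
        u = np.round(uvw[:, 0] / uvw[:, 2]).astype(int)
        v = np.round(uvw[:, 1] / uvw[:, 2]).astype(int)
        H, W = mask.shape
        inb = (u >= 0) & (u < W) & (v >= 0) & (v < H)
        hit = np.zeros(len(voxels), dtype=bool)
        hit[inb] = mask[v[inb], u[inb]]   # vote if projection is inside mask
        votes += hit
    return votes / len(cams)

# Toy setup: two orthographic views (front: x-y, top: x-z), offset by 2 px.
front = np.array([[1., 0, 0, 2], [0, 1, 0, 2], [0, 0, 0, 1]])
top   = np.array([[1., 0, 0, 2], [0, 0, 1, 2], [0, 0, 0, 1]])
m1 = np.zeros((10, 10), dtype=bool); m1[2, 2] = True             # sees only the origin voxel
m2 = np.zeros((10, 10), dtype=bool); m2[2, 2] = m2[7, 7] = True  # sees both voxels
occ = occupancy_from_views(np.array([[0., 0, 0], [5., 5, 5]]),
                           [front, top], [m1, m2])
```

Voxels that are consistently covered across the inconsistent 2D cues accumulate high occupancy, which is the self-supervised signal the abstract describes.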
Motion Segmentation from a Moving Monocular Camera
Identifying and segmenting moving objects from a moving monocular camera is
difficult when there is unknown camera motion, different types of object
motions and complex scene structures. To tackle these challenges, we take
advantage of two popular branches of monocular motion segmentation approaches:
point trajectory based and optical flow based methods, by synergistically
fusing these two highly complementary motion cues at object level. By doing
this, we are able to model various complex object motions in different scene
structures at once, which has not been achieved by existing methods. We first
obtain object-specific point trajectories and an optical flow mask for each
common object in the video by leveraging recent foundation models in object
recognition, segmentation and tracking. We then construct two robust affinity
matrices representing the pairwise object motion affinities throughout the
whole video using epipolar geometry and the motion information provided by
optical flow. Finally, co-regularized multi-view spectral clustering is used to
fuse the two affinity matrices and obtain the final clustering. Our method
shows state-of-the-art performance on the KT3DMoSeg dataset, which contains
complex motions and scene structures. Being able to identify moving objects
allows us to remove them for map building when using visual SLAM or SfM.
Comment: Accepted by IROS 2023 Workshop on Robotic Perception And Mapping: Frontier Vision and Learning Technique
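The fusion step can be sketched in plain NumPy. The paper uses co-regularized multi-view spectral clustering; the stand-in below simply sums the two normalized affinities and splits objects on the sign of the Fiedler vector, which captures the same idea for a two-motion scene (the matrices and the fusion rule are illustrative assumptions, not the paper's algorithm):

```python
import numpy as np

def fuse_and_cluster(A_epi, A_flow):
    """Two-way object clustering from two motion-affinity views.

    Simplified stand-in for co-regularized multi-view spectral
    clustering: sum the normalized affinities, form the symmetric
    normalized Laplacian, and split on the sign of the Fiedler
    (second-smallest) eigenvector.
    """
    A = A_epi / A_epi.max() + A_flow / A_flow.max()   # crude view fusion
    d = A.sum(axis=1)
    L = np.eye(len(A)) - A / np.sqrt(np.outer(d, d))  # I - D^-1/2 A D^-1/2
    _, vecs = np.linalg.eigh(L)                       # eigenvalues ascending
    return (vecs[:, 1] > 0).astype(int)               # sign of Fiedler vector

# Toy scene: objects 0,1 share one rigid motion; objects 2,3 share another.
A_epi  = np.array([[1., .9, .1, .1], [.9, 1., .1, .1],
                   [.1, .1, 1., .9], [.1, .1, .9, 1.]])
A_flow = np.array([[1., .8, .2, .1], [.8, 1., .1, .2],
                   [.2, .1, 1., .8], [.1, .2, .8, 1.]])
labels = fuse_and_cluster(A_epi, A_flow)
```

With more than two motions one would instead k-means the rows of the bottom-k eigenvector embedding, and true co-regularization alternates between the views rather than summing them once.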
Multigranularity Representations for Human Inter-Actions: Pose, Motion and Intention
Tracking people and their body pose in videos is a central problem in computer vision. Standard tracking representations reason about temporal coherence of detected people and body parts. They have difficulty tracking targets under partial occlusions or rare body poses, where detectors often fail, since the number of training examples is often too small to deal with the exponential variability of such configurations.
We propose tracking representations that track and segment people and their body pose in videos by exploiting information at multiple detection and segmentation granularities when available: whole body, parts, or point trajectories.
Detections and motion estimates provide contradictory information in case of false alarm detections or leaking motion affinities. We consolidate contradictory information via graph steering, an algorithm for simultaneous detection and co-clustering in a two-granularity graph of motion trajectories and detections, that corrects motion leakage between correctly detected objects, while being robust to false alarms or spatially inaccurate detections.
We first present a motion segmentation framework that exploits long range motion of point trajectories and large spatial support of image regions.
We show that the resulting video segments adapt to targets under partial occlusions and deformations.
Second, we augment motion-based representations with object detection for dealing with motion leakage. We demonstrate how to combine dense optical flow trajectory affinities with repulsions from confident detections to reach a global consensus of detection and tracking in crowded scenes.
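The attraction/repulsion idea above can be sketched as a signed affinity matrix: motion similarity supplies attractions, while trajectories claimed by different confident detections repel each other, overriding any leaked flow affinity. (The trajectory and box formats and the repulsion weight below are illustrative assumptions.)

```python
import numpy as np

def signed_affinity(traj_pos, flow_affinity, det_boxes, repulsion=-1.0):
    """Combine flow attractions with detection-based repulsions.

    traj_pos:      (N, 2) mean image position of each trajectory;
    flow_affinity: (N, N) dense optical-flow similarity in [0, 1];
    det_boxes:     list of confident boxes (x0, y0, x1, y1).
    """
    A = flow_affinity.astype(float).copy()
    owner = np.full(len(traj_pos), -1)  # which detection claims each trajectory
    for k, (x0, y0, x1, y1) in enumerate(det_boxes):
        inside = ((traj_pos[:, 0] >= x0) & (traj_pos[:, 0] <= x1) &
                  (traj_pos[:, 1] >= y0) & (traj_pos[:, 1] <= y1))
        owner[inside] = k
    claimed = owner >= 0
    diff = claimed[:, None] & claimed[None, :] & (owner[:, None] != owner[None, :])
    A[diff] = repulsion                 # repulsion beats leaked flow affinity
    return A

# Toy scene: trajectories 0,1 lie in one person's box, trajectory 2 in another's.
pos = np.array([[1., 1], [2., 2], [8., 8]])
flow = np.full((3, 3), 0.9)            # flow leaks across both people
A = signed_affinity(pos, flow, [(0, 0, 3, 3), (7, 7, 9, 9)])
```

Clustering such a signed matrix pulls co-moving trajectories together while keeping distinct detected people apart, which is the consensus the graph-steering formulation seeks.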
Third, we study human motion and pose estimation.
We segment hard-to-detect, fast-moving body limbs from their surrounding clutter and match them against pose exemplars to detect body pose under fast motion. We employ on-the-fly human body kinematics to improve tracking of body joints under wide deformations.
We use motion segmentability of body parts for re-ranking a set of body joint candidate trajectories and jointly infer multi-frame body pose and video segmentation.
We show empirically that such a multi-granularity tracking representation is worthwhile, obtaining significantly more accurate multi-object tracking and detailed body pose estimation on popular datasets.