A Causal And-Or Graph Model for Visibility Fluent Reasoning in Tracking Interacting Objects
Tracking humans that are interacting with other subjects or with the
environment remains unsolved in visual tracking, because the visibility of the
humans of interest in videos is unknown and may vary over time. In particular,
it is still difficult for state-of-the-art human trackers to recover complete
human trajectories in crowded scenes with frequent human interactions. In this
work, we consider the visibility status of a subject as a fluent variable,
whose change is mostly attributed to the subject's interactions with its
surroundings, e.g., crossing behind another object, entering a building, or
getting into a vehicle. We introduce a Causal And-Or Graph (C-AOG) to
represent the cause-effect relations between an object's visibility fluent and
its activities, and develop a probabilistic graph model to jointly reason
about visibility fluent changes (e.g., from visible to invisible) and track
humans in videos. We formulate this joint task as an iterative search for a
feasible causal graph structure, which enables fast search algorithms such as
dynamic programming. We apply the proposed method to challenging video
sequences to evaluate its ability to estimate the visibility fluent changes of
subjects and to track subjects of interest over time. Comparative results
demonstrate that our method outperforms alternative trackers and can recover
complete trajectories of humans in complicated scenarios with frequent human
interactions.
Comment: Accepted by CVPR 2018
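The abstract frames the joint task as a fast search over causal graph
structures, e.g., by dynamic programming. As a loose illustration of that idea
(not the paper's actual C-AOG formulation), the following minimal sketch runs
a Viterbi-style dynamic program over per-frame visibility fluents, where
`emission_cost` and `transition_cost` are hypothetical stand-ins for detector
evidence and causal transition penalties:

```python
# Illustrative only: a Viterbi-style search over visibility fluents.
# The three states and both cost functions are assumptions for this sketch.
STATES = ("visible", "occluded", "contained")

def viterbi_fluents(emission_cost, transition_cost, num_frames):
    """emission_cost(t, s): cost of fluent s at frame t (e.g., -log detector score).
    transition_cost(p, s): cost of switching from fluent p to s between frames."""
    best = {s: emission_cost(0, s) for s in STATES}
    back = []
    for t in range(1, num_frames):
        new_best, ptr = {}, {}
        for s in STATES:
            prev = min(STATES, key=lambda p: best[p] + transition_cost(p, s))
            ptr[s] = prev
            new_best[s] = best[prev] + transition_cost(prev, s) + emission_cost(t, s)
        back.append(ptr)
        best = new_best
    # Trace back the minimum-cost fluent sequence.
    last = min(STATES, key=best.get)
    path = [last]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return list(reversed(path))
```

In the paper's setting the states, costs, and graph structure are richer; the
sketch only shows why a feasible causal structure admits a fast,
polynomial-time search.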
Scene-centric Joint Parsing of Cross-view Videos
Cross-view video understanding is an important yet under-explored area in
computer vision. In this paper, we introduce a joint parsing framework that
integrates view-centric proposals into scene-centric parse graphs that
represent a coherent scene-centric understanding of cross-view scenes. Our key
observations are that overlapping fields of view embed rich appearance and
geometry correlations and that knowledge fragments corresponding to individual
vision tasks are governed by consistency constraints available in commonsense
knowledge. The proposed joint parsing framework represents such correlations
and constraints explicitly and generates semantic scene-centric parse graphs.
Quantitative experiments show that scene-centric predictions in the parse graph
outperform view-centric predictions.
Comment: Accepted by AAAI 2018
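As a rough sketch of what integrating view-centric proposals into a
scene-centric representation could look like in code (all fields and
thresholds below are assumptions for illustration, not the paper's model), one
might group proposals whose ground-plane projections and labels agree across
views, then fuse their attributes into scene-centric predictions:

```python
# Hypothetical sketch: cross-view grouping and attribute fusion.
import math
from dataclasses import dataclass

@dataclass
class Proposal:
    view_id: int
    label: str          # e.g., "person"
    ground_xy: tuple    # position projected onto a common ground plane
    attributes: dict    # e.g., {"carrying_bag": 0.9}

def merge_proposals(proposals, dist_thresh=0.5):
    """Greedily group proposals from different views that are close on the
    ground plane and share a label (a stand-in for the paper's appearance
    and geometry correlations)."""
    entities = []
    for p in proposals:
        for ent in entities:
            q = ent[0]
            close = math.dist(p.ground_xy, q.ground_xy) < dist_thresh
            if close and p.label == q.label and p.view_id != q.view_id:
                ent.append(p)
                break
        else:
            entities.append([p])
    return entities

def fuse_attributes(entity):
    """Scene-centric attribute = consensus over views (simple averaging here)."""
    keys = set().union(*(p.attributes for p in entity))
    return {k: sum(p.attributes.get(k, 0.0) for p in entity) / len(entity)
            for k in keys}
```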
Learning Pose Grammar to Encode Human Body Configuration for 3D Pose Estimation
In this paper, we propose a pose grammar to tackle the problem of 3D human
pose estimation. Our model directly takes 2D pose as input and learns a
generalized 2D-3D mapping function. The proposed model consists of a base
network that efficiently captures pose-aligned features and a hierarchy of
Bi-directional RNNs (BRNNs) on top that explicitly incorporates knowledge of
human body configuration (i.e., kinematics, symmetry, motor coordination). The
proposed model thus enforces high-level constraints over human poses. During
learning, we develop a pose sample simulator to augment training samples in
virtual camera views, which further improves the generalizability of our
model. We validate our method on public 3D human pose benchmarks and propose a
new evaluation protocol for the cross-view setting to verify the
generalization capability of different methods. We empirically observe that
most state-of-the-art methods encounter difficulty in this setting, while our
method handles such challenges well.
Comment: Accepted by AAAI 2018
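A minimal PyTorch sketch of the architectural pattern the abstract outlines, a
base 2D-to-3D network followed by a stack of bidirectional RNNs that refine
the pose, is given below; the layer sizes, joint ordering, and residual
refinement scheme are placeholders rather than the authors' configuration:

```python
# Illustrative sketch, not the paper's implementation.
import torch
import torch.nn as nn

class PoseGrammarSketch(nn.Module):
    def __init__(self, num_joints=17, hidden=64):
        super().__init__()
        # Base network: lift the flattened 2D pose to an initial 3D estimate.
        self.base = nn.Sequential(
            nn.Linear(num_joints * 2, 1024), nn.ReLU(),
            nn.Linear(1024, num_joints * 3),
        )
        # One BRNN per "grammar rule" (e.g., kinematics, symmetry,
        # motor coordination); each reads the pose as a joint sequence.
        self.brnns = nn.ModuleList([
            nn.GRU(3, hidden, bidirectional=True, batch_first=True)
            for _ in range(3)
        ])
        self.heads = nn.ModuleList([nn.Linear(2 * hidden, 3) for _ in range(3)])

    def forward(self, pose_2d):                      # (B, J, 2)
        b, j, _ = pose_2d.shape
        x = self.base(pose_2d.flatten(1)).view(b, j, 3)
        for brnn, head in zip(self.brnns, self.heads):
            h, _ = brnn(x)                           # (B, J, 2*hidden)
            x = x + head(h)                          # residual refinement
        return x                                     # (B, J, 3)
```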
VIVE3D: Viewpoint-Independent Video Editing using 3D-Aware GANs
We introduce VIVE3D, a novel approach that extends the capabilities of
image-based 3D GANs to video editing and is able to represent the input video
in an identity-preserving and temporally consistent way. We propose two new
building blocks. First, we introduce a novel GAN inversion technique
specifically tailored to 3D GANs by jointly embedding multiple frames and
optimizing for the camera parameters. Second, besides traditional semantic face
edits (e.g. for age and expression), we are the first to demonstrate edits
that show novel views of the head, enabled by the inherent properties of 3D
GANs and by our optical-flow-guided compositing technique, which combines the
edited head with the
background video. Our experiments demonstrate that VIVE3D generates
high-fidelity face edits at consistent quality from a range of camera
viewpoints which are composited with the original video in a temporally and
spatially consistent manner.
Comment: CVPR 2023. Project webpage and video available at
http://afruehstueck.github.io/vive3
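A hedged sketch of the joint-inversion idea, one shared identity latent plus
per-frame latent offsets and camera parameters optimized against all frames at
once, might look as follows; the `generator` call signature, latent dimension,
and 6-DoF camera parameterization are hypothetical stand-ins, not the actual
3D GAN API:

```python
# Illustrative sketch of multi-frame GAN inversion with camera optimization.
import torch

def invert_frames(generator, frames, steps=500, lr=1e-2):
    """frames: (T, 3, H, W) tensor of aligned face crops.
    generator(latents, cams) -> (T, 3, H, W) is an assumed interface."""
    t = frames.shape[0]
    w = torch.randn(1, 512, requires_grad=True)        # shared identity latent
    offsets = torch.zeros(t, 512, requires_grad=True)  # small per-frame deltas
    cams = torch.zeros(t, 6, requires_grad=True)       # per-frame camera pose
    opt = torch.optim.Adam([w, offsets, cams], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        recon = generator(w + offsets, cams)           # assumed generator API
        loss = torch.nn.functional.l1_loss(recon, frames) \
             + 1e-3 * offsets.pow(2).mean()            # keep one identity
        loss.backward()
        opt.step()
    return w, offsets, cams
```

Penalizing the per-frame offsets is one simple way to keep the reconstruction
identity-preserving while still letting expression and pose vary over time.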
On optical solutions to the Kadomtsev–Petviashvili equation with a local conformable derivative
Equations of this category complete our understanding of many phenomena around
us. In this paper, we study an integrable partial differential equation, the
Kadomtsev–Petviashvili equation with a local conformable derivative, which is
used to describe nonlinear motion. To solve the equation, its form must first
be converted from a partial differential equation into an equation with
ordinary derivatives through a suitable change of variables; the resulting
form is then the basis for determining the main solutions. All the solutions
reported here for this equation differ from previous findings in other papers.
All necessary calculations were performed with the symbolic computation
software Maple.
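For concreteness, the standard reduction of this type proceeds via the
conformable derivative and a traveling-wave ansatz; the wave parameters k, l,
c below are generic, and the paper's exact change of variables may differ:

```latex
% Conformable derivative of order \alpha \in (0,1]:
T_\alpha f(t) = \lim_{\varepsilon \to 0}
    \frac{f\!\left(t + \varepsilon\, t^{1-\alpha}\right) - f(t)}{\varepsilon},
\qquad T_\alpha f(t) = t^{1-\alpha}\,\frac{df}{dt}
    \ \text{ for differentiable } f.

% Traveling-wave change of variables reducing the PDE to an ODE:
u(x, y, t) = U(\xi), \qquad \xi = k x + l y - c\,\frac{t^{\alpha}}{\alpha},
% each conformable time derivative becomes an ordinary derivative in \xi,
% since T_\alpha\!\left( t^{\alpha}/\alpha \right) = 1.
```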
Snipper: A Spatiotemporal Transformer for Simultaneous Multi-Person 3D Pose Estimation, Tracking, and Forecasting on a Video Snippet
Multi-person pose understanding from RGB videos involves three complex tasks:
pose estimation, tracking and motion forecasting. Intuitively, accurate
multi-person pose estimation facilitates robust tracking, and robust tracking
builds crucial history for correct motion forecasting. Most existing works
either focus on a single task or employ multi-stage approaches to solve
multiple tasks separately, which tends to produce sub-optimal decisions at
each stage and fails to exploit correlations among the three tasks. In this
paper, we propose Snipper, a unified framework to perform multi-person 3D pose
estimation, tracking, and motion forecasting simultaneously in a single stage.
We propose an efficient yet powerful deformable attention mechanism to
aggregate spatiotemporal information from the video snippet. Building upon this
deformable attention, a video transformer is learned to encode the
spatiotemporal features from the multi-frame snippet and to decode informative
pose features for multi-person pose queries. Finally, these pose queries are
regressed to predict multi-person pose trajectories and future motions in a
single shot. In the experiments, we show the effectiveness of Snipper on three
challenging public datasets, where our generic model rivals specialized
state-of-the-art baselines for pose estimation, tracking, and forecasting.
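As an illustration of the deformable-attention ingredient (in the spirit of
the abstract, not Snipper's exact module), the sketch below has each pose
query predict a few sampling locations per frame, gathers features by bilinear
sampling, and mixes them with learned weights; all dimensions are placeholders:

```python
# Illustrative deformable attention over a multi-frame feature snippet.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DeformableSnippetAttention(nn.Module):
    def __init__(self, dim=256, num_points=4):
        super().__init__()
        self.num_points = num_points
        self.offsets = nn.Linear(dim, num_points * 2)  # (dx, dy) per point
        self.weights = nn.Linear(dim, num_points)
        self.proj = nn.Linear(dim, dim)

    def forward(self, queries, ref_xy, feats):
        # queries: (B, Q, C); ref_xy: (B, Q, 2) in [-1, 1]; feats: (B, C, T, H, W)
        b, q, c = queries.shape
        t = feats.shape[2]
        off = self.offsets(queries).view(b, q, self.num_points, 2).tanh() * 0.1
        loc = (ref_xy[:, :, None, :] + off).clamp(-1, 1)  # (B, Q, K, 2)
        w = self.weights(queries).softmax(-1)             # (B, Q, K)
        out = 0
        for f in range(t):  # bilinearly sample each frame at predicted points
            s = F.grid_sample(feats[:, :, f], loc, align_corners=False)
            out = out + (s * w[:, None]).sum(-1)          # (B, C, Q)
        return self.proj(out.transpose(1, 2) / t)         # (B, Q, C)
```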
Evaluation of passenger comfort with road field test multi-axis vibration
Producing objective vibration evaluations that are highly consistent with real
road ride comfort is challenging. The deficiencies of traditional evaluation
indices, which adopt an average operator, a maximum operator, or a cumulative
operator as their main logic for integrating vibration information, are
reported here through 19 designed road field tests in which the major
vibration information is distributed across all axes and over spacetime in
various patterns. A new evaluation index that combines the maximum and
cumulative operators is proposed to overcome these deficiencies, and an
interactive mechanism between the localized major vibrations is devised to
standardize the process of selecting vibration information distributed among
axes and over spacetime. The results show that the proposed road ride comfort
evaluation index is more consistent and accurate than the indices proposed in
ISO 2631-1 and can be applied more generally.
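To make the operator families concrete, the following illustrative Python
computes an average operator (r.m.s.), a maximum operator (MTVV), and a
cumulative operator (VDV) on an already frequency-weighted acceleration
signal, plus a toy combination of the last two; the ISO 2631-1 weighting
filters and the paper's actual combination scheme are omitted:

```python
# Illustrative only; assumes `a` is a frequency-weighted acceleration signal.
import numpy as np

def rms(a):                      # average operator
    return float(np.sqrt(np.mean(a ** 2)))

def mtvv(a, fs, window_s=1.0):   # maximum operator: peak of running r.m.s.
    n = int(fs * window_s)
    return max(rms(a[i:i + n]) for i in range(0, len(a) - n + 1))

def vdv(a, fs):                  # cumulative operator, units m/s^1.75
    return float((np.sum(a ** 4) / fs) ** 0.25)

def combined_index(a, fs, beta=0.5):
    """Toy mix of the maximum and cumulative operators; beta is a made-up
    weight, not the coefficient proposed in the paper."""
    return beta * mtvv(a, fs) + (1 - beta) * vdv(a, fs)
```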