PTP: Parallelized Tracking and Prediction with Graph Neural Networks and Diversity Sampling
Multi-object tracking (MOT) and trajectory prediction are two critical
components in modern 3D perception systems that require accurate modeling of
multi-agent interaction. We hypothesize that it is beneficial to unify both
tasks under one framework in order to learn a shared feature representation of
agent interaction. Furthermore, instead of performing tracking and prediction
sequentially, which can propagate errors from tracking to prediction, we propose
a parallelized framework that mitigates this issue. Our parallel
track-forecast framework also incorporates two novel computational units.
First, we use a feature interaction technique by introducing Graph Neural
Networks (GNNs) to capture the way in which agents interact with one another.
The GNN is able to improve discriminative feature learning for MOT association
and provide socially-aware contexts for trajectory prediction. Second, we use a
diversity sampling function to improve the quality and diversity of our
forecasted trajectories. The learned sampling function is trained to
efficiently extract a variety of outcomes from a generative trajectory
distribution and helps avoid the problem of generating duplicate trajectory
samples. We evaluate on the KITTI and nuScenes datasets, showing that our method
with socially-aware feature learning and diversity sampling achieves new
state-of-the-art performance on 3D MOT and trajectory prediction. Project
website: https://www.xinshuoweng.com/projects/PTP
Comment: Published in Robotics and Automation Letters (RA-L) 2021, with the
ICRA 2021 option. The first two authors contributed equally.
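As an illustration of the GNN feature interaction unit described above, here is a minimal PyTorch sketch (module names and dimensions are mine, not the authors' implementation) of one round of message passing over a fully-connected agent graph:

import torch
import torch.nn as nn

class AgentInteractionLayer(nn.Module):
    """One round of message passing over a fully-connected agent graph."""
    def __init__(self, dim: int):
        super().__init__()
        self.edge_mlp = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU())
        self.node_mlp = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU())

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (N, dim) per-agent features from the tracking backbone
        n = feats.size(0)
        src = feats.unsqueeze(1).expand(n, n, -1)   # sender features
        dst = feats.unsqueeze(0).expand(n, n, -1)   # receiver features
        msgs = self.edge_mlp(torch.cat([src, dst], dim=-1))  # (N, N, dim)
        # Aggregate messages from all other agents (mask out self-messages).
        mask = 1.0 - torch.eye(n).unsqueeze(-1)
        agg = (msgs * mask).sum(dim=0)              # (N, dim)
        return self.node_mlp(torch.cat([feats, agg], dim=-1))

# Usage: refine 5 agents' 64-d features with two interaction rounds.
layer = AgentInteractionLayer(64)
x = torch.randn(5, 64)
x = layer(layer(x))  # socially-aware features for association/prediction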
End-to-End 3D Multi-Object Tracking and Trajectory Forecasting
3D multi-object tracking (MOT) and trajectory forecasting are two critical
components in modern 3D perception systems. We hypothesize that it is
beneficial to unify both tasks under one framework to learn a shared feature
representation of agent interaction. To evaluate this hypothesis, we propose a
unified solution for 3D MOT and trajectory forecasting which also incorporates
two additional novel computational units. First, we employ a feature
interaction technique by introducing Graph Neural Networks (GNNs) to capture
the way in which multiple agents interact with one another. The GNN is able to
model complex hierarchical interactions, improve the discriminative feature
learning for MOT association, and provide socially-aware context for trajectory
forecasting. Second, we use a diversity sampling function to improve the
quality and diversity of our forecasted trajectories. The learned sampling
function is trained to efficiently extract a variety of outcomes from a
generative trajectory distribution and helps avoid the problem of generating
many duplicate trajectory samples. We show that our method achieves
state-of-the-art performance on the KITTI dataset. Our project website is at
http://www.xinshuoweng.com/projects/GNNTrkForecast.
Comment: Extended abstract. The first two authors contributed equally. arXiv
admin note: substantial text overlap with arXiv:2003.0784
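To illustrate the goal of the diversity sampling unit, the toy sketch below uses greedy farthest-point selection as a stand-in for the learned sampling function described in the abstract; the function name and distance metric are my assumptions:

import torch

def diverse_subset(samples: torch.Tensor, k: int) -> torch.Tensor:
    # samples: (S, T, 2) candidate trajectories from a generative model
    flat = samples.flatten(1)                       # (S, T*2)
    dist = torch.cdist(flat, flat)                  # pairwise distances
    chosen = [0]                                    # seed with the first sample
    for _ in range(k - 1):
        # Pick the sample farthest from everything chosen so far.
        d_to_chosen = dist[:, chosen].min(dim=1).values
        d_to_chosen[chosen] = -1.0                  # exclude already chosen
        chosen.append(int(d_to_chosen.argmax()))
    return samples[chosen]

candidates = torch.randn(100, 20, 2)  # 100 samples, 20 future steps, (x, y)
picks = diverse_subset(candidates, k=5)  # 5 mutually distinct forecasts

A learned sampler, as in the paper, would replace this greedy heuristic with a trained network, but the selection goal is the same: cover distinct outcomes rather than returning near-duplicates.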
Implicit Latent Variable Model for Scene-Consistent Motion Forecasting
In order to plan a safe maneuver, an autonomous vehicle must accurately
perceive its environment, and understand the interactions among traffic
participants. In this paper, we aim to learn scene-consistent motion forecasts
of complex urban traffic directly from sensor data. In particular, we propose
to characterize the joint distribution over future trajectories via an implicit
latent variable model. We model the scene as an interaction graph and employ
powerful graph neural networks to learn a distributed latent representation of
the scene. Coupled with a deterministic decoder, we obtain trajectory samples
that are consistent across traffic participants, achieving state-of-the-art
results in motion forecasting and interaction understanding. Last but not
least, we demonstrate that our motion forecasts result in safer and more
comfortable motion planning.
Comment: European Conference on Computer Vision (ECCV) 2020
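The key idea, one latent draw per scene decoded deterministically per actor, can be sketched as follows (a toy PyTorch model with dimensions and layer choices of my own, not the authors' architecture):

import torch
import torch.nn as nn

class SceneLatentForecaster(nn.Module):
    def __init__(self, feat_dim=32, latent_dim=16, horizon=10):
        super().__init__()
        self.to_mu = nn.Linear(feat_dim, latent_dim)
        self.to_logvar = nn.Linear(feat_dim, latent_dim)
        self.decoder = nn.Linear(feat_dim + latent_dim, horizon * 2)
        self.horizon = horizon

    def forward(self, actor_feats: torch.Tensor, n_samples: int):
        # actor_feats: (N, feat_dim), e.g. from a GNN scene encoder
        mu, logvar = self.to_mu(actor_feats), self.to_logvar(actor_feats)
        outs = []
        for _ in range(n_samples):
            # One joint latent draw for the whole scene...
            z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()
            # ...then a deterministic decode per actor, so all actors in a
            # sample are consistent with one another.
            traj = self.decoder(torch.cat([actor_feats, z], dim=-1))
            outs.append(traj.view(-1, self.horizon, 2))
        return torch.stack(outs)  # (n_samples, N, horizon, 2)

model = SceneLatentForecaster()
scene_samples = model(torch.randn(4, 32), n_samples=3)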
SceneGen: Learning to Generate Realistic Traffic Scenes
We consider the problem of generating realistic traffic scenes automatically.
Existing methods typically insert actors into the scene according to a set of
hand-crafted heuristics and are limited in their ability to model the true
complexity and diversity of real traffic scenes, thus inducing a content gap
between synthesized traffic scenes versus real ones. As a result, existing
simulators lack the fidelity necessary to train and test self-driving vehicles.
To address this limitation, we present SceneGen, a neural autoregressive model
of traffic scenes that eschews the need for rules and heuristics. In
particular, given the ego-vehicle state and a high-definition map of the
surrounding area, SceneGen inserts actors of various classes into the scene and
synthesizes their sizes, orientations, and velocities. On two large-scale
datasets, we demonstrate SceneGen's ability to faithfully model the distributions of
real traffic scenes. Moreover, we show that SceneGen coupled with sensor
simulation can be used to train perception models that generalize to the real
world.
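The autoregressive factorization can be pictured as a loop that inserts one actor at a time, each conditioned on the map and the actors placed so far; the sketch below is hypothetical (the class and attribute heads, stop token, and stand-in scene encoder are mine, not SceneGen's):

import torch
import torch.nn as nn

class ActorInserter(nn.Module):
    """Predicts the next actor's attributes from a scene summary."""
    def __init__(self, ctx_dim=64, n_classes=3):
        super().__init__()
        self.cls_head = nn.Linear(ctx_dim, n_classes + 1)  # +1 = "stop"
        self.attr_head = nn.Linear(ctx_dim, 6)  # x, y, length, width, heading, speed

    def forward(self, scene_ctx: torch.Tensor):
        cls_logits = self.cls_head(scene_ctx)
        cls = torch.distributions.Categorical(logits=cls_logits).sample()
        attrs = self.attr_head(scene_ctx)  # in practice, these would be sampled too
        return cls, attrs

def generate_scene(model, encode_scene, max_actors=20):
    actors = []  # starts from the ego state and map only
    for _ in range(max_actors):
        ctx = encode_scene(actors)      # summarize map + actors placed so far
        cls, attrs = model(ctx)
        if cls.item() == model.cls_head.out_features - 1:
            break                       # model chose to stop inserting
        actors.append((cls.item(), attrs))
    return actors

# Usage with a stand-in encoder (a real one would read the HD map):
model = ActorInserter()
scene = generate_scene(model, encode_scene=lambda actors: torch.randn(64))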
Generative Hybrid Representations for Activity Forecasting with No-Regret Learning
Automatically reasoning about future human behaviors is a difficult problem
but has significant practical applications to assistive systems. Part of this
difficulty stems from learning systems' inability to represent all kinds of
behaviors. Some behaviors, such as motion, are best described with continuous
representations, whereas others, such as picking up a cup, are best described
with discrete representations. Furthermore, human behavior is generally not
fixed: people can change their habits and routines. This suggests these systems
must be able to learn and adapt continuously. In this work, we develop an
efficient deep generative model to jointly forecast a person's future discrete
actions and continuous motions. On a large-scale egocentric dataset,
EPIC-KITCHENS, we observe that our method generates high-quality and diverse samples
while exhibiting better generalization than related generative models. Finally,
we propose a variant to continually learn our model from streaming data,
observe its practical effectiveness, and theoretically justify its learning
efficiency.
Comment: Oral presentation at CVPR 2020
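To make the hybrid representation concrete, here is a minimal sketch (my assumptions, not the paper's model) in which one shared encoder feeds a categorical head for discrete actions and a Gaussian head for continuous motion:

import torch
import torch.nn as nn

class HybridForecaster(nn.Module):
    def __init__(self, in_dim=128, n_actions=10, motion_dim=6):
        super().__init__()
        self.encoder = nn.GRU(in_dim, 64, batch_first=True)
        self.action_head = nn.Linear(64, n_actions)   # discrete head
        self.motion_mu = nn.Linear(64, motion_dim)    # continuous head
        self.motion_logstd = nn.Linear(64, motion_dim)

    def forward(self, history: torch.Tensor):
        # history: (B, T, in_dim) past observations
        _, h = self.encoder(history)
        h = h.squeeze(0)
        action = torch.distributions.Categorical(logits=self.action_head(h))
        motion = torch.distributions.Normal(
            self.motion_mu(h), self.motion_logstd(h).exp())
        return action, motion  # sample both for a joint hybrid forecast

model = HybridForecaster()
act_dist, mot_dist = model(torch.randn(2, 8, 128))
next_action, next_motion = act_dist.sample(), mot_dist.sample()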
We are More than Our Joints: Predicting how 3D Bodies Move
A key step towards understanding human behavior is the prediction of 3D human
motion. Successful solutions have many applications in human tracking, HCI, and
graphics. Most previous work focuses on predicting a time series of future 3D
joint locations given a sequence of 3D joints from the past. This Euclidean
formulation generally works better than predicting pose in terms of joint
rotations. Body joint locations, however, do not fully constrain 3D human pose,
leaving degrees of freedom undefined, making it hard to animate a realistic
human from only the joints. Note that the 3D joints can be viewed as a sparse
point cloud. Thus the problem of human motion prediction can be seen as point
cloud prediction. With this observation, we instead predict a sparse set of
locations on the body surface that correspond to motion capture markers. Given
such markers, we fit a parametric body model to recover the 3D shape and pose
of the person. These sparse surface markers also carry detailed information
about human movement that is not present in the joints, increasing the
naturalness of the predicted motions. Using the AMASS dataset, we train MOJO,
a novel variational autoencoder that generates motions from latent
frequencies. MOJO preserves the full temporal resolution of the input motion,
and sampling from the latent frequencies explicitly introduces high-frequency
components into the generated motion. We note that motion prediction methods
accumulate errors over time, resulting in joints or markers that diverge from
true human bodies. To address this, we fit SMPL-X to the predictions at each
time step, projecting the solution back onto the space of valid bodies. These
valid markers are then propagated in time. Experiments show that our method
produces state-of-the-art results and realistic 3D body animations. The code
for research purposes is at https://yz-cnsdqz.github.io/MOJO/MOJO.html
Comment: camera ready, CVPR 2021
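The recursive projection step (fit a body model to the raw predicted markers, then continue from the fitted markers) can be sketched generically; fit_body_model and markers_from_body below are hypothetical placeholders for an SMPL-X fitting routine and a marker regressor, not real APIs:

import torch

def rollout_with_projection(predict_step, fit_body_model, markers_from_body,
                            markers: torch.Tensor, n_steps: int):
    frames = []
    for _ in range(n_steps):
        raw = predict_step(markers)         # network's next-frame markers
        body = fit_body_model(raw)          # project onto valid bodies
        markers = markers_from_body(body)   # clean markers feed the next step
        frames.append(markers)
    return torch.stack(frames)

# Dummy stand-ins so the sketch runs end to end:
noisy_step = lambda m: m + 0.01 * torch.randn_like(m)
out = rollout_with_projection(noisy_step, lambda m: m, lambda b: b,
                              torch.randn(67, 3), n_steps=5)

Because the projection happens at every step, prediction errors cannot accumulate off the space of valid bodies, which is exactly the failure mode the abstract describes.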
Joint Object Detection and Multi-Object Tracking with Graph Neural Networks
Object detection and data association are critical components in multi-object
tracking (MOT) systems. Despite the fact that the two components are dependent
on each other, prior works often design detection and data association modules
separately and train them with separate objectives. As a result, one cannot
back-propagate the gradients and optimize the entire MOT system, which leads to
sub-optimal performance. To address this issue, recent works simultaneously
optimize detection and data association modules under a joint MOT framework,
which has shown improved performance in both modules. In this work, we propose
a new instance of joint MOT approach based on Graph Neural Networks (GNNs). The
key idea is that GNNs can model relations between variable-sized objects in
both the spatial and temporal domains, which is essential for learning
discriminative features for detection and data association. Through extensive
experiments on the MOT15/16/17/20 datasets, we demonstrate the effectiveness of
our GNN-based joint MOT approach and show state-of-the-art performance for both
detection and MOT tasks. Our code is available at:
https://github.com/yongxinw/GSDT
Comment: Published in International Conference on Robotics and Automation
(ICRA), 2021.
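One way to picture the shared-feature idea is a single graph module whose refined node features feed both a detection head and an association head; the PyTorch sketch below (all names and sizes my own) is illustrative, not GSDT's code:

import torch
import torch.nn as nn

class JointDetAssocHead(nn.Module):
    def __init__(self, dim=64):
        super().__init__()
        self.msg = nn.Linear(2 * dim, dim)
        self.det_head = nn.Linear(dim, 1)          # objectness refinement
        self.affinity = nn.Bilinear(dim, dim, 1)   # tracklet-detection score

    def forward(self, tracklets: torch.Tensor, detections: torch.Tensor):
        # tracklets: (M, dim) from frame t-1; detections: (N, dim) from frame t
        m, n = tracklets.size(0), detections.size(0)
        # Cross-frame message passing: each detection hears every tracklet.
        pairs = torch.cat([tracklets.unsqueeze(1).expand(m, n, -1),
                           detections.unsqueeze(0).expand(m, n, -1)], dim=-1)
        msgs = torch.relu(self.msg(pairs))          # (M, N, dim)
        detections = detections + msgs.mean(dim=0)  # refined detection nodes
        scores = self.det_head(detections)          # shared features -> detection
        aff = self.affinity(
            tracklets.unsqueeze(1).expand(m, n, -1).reshape(-1, 64),
            detections.unsqueeze(0).expand(m, n, -1).reshape(-1, 64))
        return scores, aff.view(m, n)               # affinity matrix for matching

head = JointDetAssocHead()
det_scores, affinity = head(torch.randn(3, 64), torch.randn(4, 64))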
AutoSelect: Automatic and Dynamic Detection Selection for 3D Multi-Object Tracking
3D multi-object tracking is an important component in robotic perception
systems such as self-driving vehicles. Recent work follows a
tracking-by-detection pipeline, which aims to match past tracklets with
detections in the current frame. To avoid matching with false positive
detections, prior work filters out detections with low confidence scores via a
threshold. However, finding a proper threshold is non-trivial and requires
extensive manual search via ablation studies. Moreover, the threshold is
sensitive to factors such as the target object category, so it must be re-tuned
whenever these factors change. To ease this process, we propose to
automatically select high-quality detections, removing the effort needed for
manual threshold search. Furthermore, prior work often uses a single threshold
per data sequence, which is sub-optimal for particular frames or objects.
Instead, we dynamically search for a threshold per frame or per object to
further boost performance. Through experiments on KITTI and nuScenes, we show
that our method filters out false positives while maintaining recall, achieving
new state-of-the-art performance and removing the need for manual threshold
tuning.
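As a toy illustration of per-frame dynamic thresholding (not the selection strategy of this paper), the snippet below adapts the cutoff to each frame's confidence distribution by splitting at the largest score gap:

import numpy as np

def dynamic_threshold(scores: np.ndarray) -> float:
    # scores: detection confidences for a single frame
    s = np.sort(scores)[::-1]          # sort descending
    if len(s) < 2:
        return 0.0
    gaps = s[:-1] - s[1:]
    cut = int(np.argmax(gaps))         # split at the biggest confidence gap
    return float((s[cut] + s[cut + 1]) / 2)

frame_scores = np.array([0.95, 0.91, 0.88, 0.35, 0.22])
thr = dynamic_threshold(frame_scores)  # ~0.615: keeps the top cluster
keep = frame_scores >= thr

A fixed global threshold of, say, 0.5 would behave identically here, but on a frame whose true positives all score around 0.4 it would discard everything, which is the sensitivity the abstract points out.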
GNN3DMOT: Graph Neural Network for 3D Multi-Object Tracking with Multi-Feature Learning
3D Multi-object tracking (MOT) is crucial to autonomous systems. Recent work
uses a standard tracking-by-detection pipeline, where feature extraction is
first performed independently for each object in order to compute an affinity
matrix. Then the affinity matrix is passed to the Hungarian algorithm for data
association. A key process of this standard pipeline is to learn discriminative
features for different objects in order to reduce confusion during data
association. In this work, we propose two techniques to improve the
discriminative feature learning for MOT: (1) instead of obtaining features for
each object independently, we propose a novel feature interaction mechanism by
introducing a Graph Neural Network. As a result, each object's feature is
informed by the features of other objects, so that it can move towards objects
with similar features (i.e., those likely sharing the same ID) and away from
objects with dissimilar features (i.e., those likely with different IDs),
leading to a more discriminative feature for each object; (2)
instead of obtaining features from either 2D or 3D space as in prior work, we
propose a novel joint feature extractor to learn appearance and motion features
from 2D and 3D space simultaneously. As features from different modalities
often carry complementary information, the joint feature can be more
discriminative than features from either individual modality. To ensure that the
joint feature extractor does not heavily rely on one modality, we also propose
an ensemble training paradigm. Through extensive evaluation, our proposed
method achieves state-of-the-art performance on KITTI and nuScenes 3D MOT
benchmarks. Our code will be made available at
https://github.com/xinshuoweng/GNN3DMOT
Comment: CVPR 2020. My website for all my research works:
http://www.xinshuoweng.com
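A simplified sketch of the two ideas follows: fuse 2D appearance and 3D motion features, and randomly drop a modality during training so the fused feature never over-relies on either. The dimensions and the dropout scheme are my assumptions, not the paper's ensemble training paradigm:

import torch
import torch.nn as nn

class JointFeatureExtractor(nn.Module):
    def __init__(self, dim=64):
        super().__init__()
        self.enc2d = nn.Linear(256, dim)   # e.g. image-crop appearance feature
        self.enc3d = nn.Linear(7, dim)     # e.g. 3D box + motion feature
        self.fuse = nn.Linear(2 * dim, dim)

    def forward(self, feat2d, feat3d):
        f2, f3 = self.enc2d(feat2d), self.enc3d(feat3d)
        if self.training:                   # modality dropout during training
            if torch.rand(1).item() < 0.25:
                f2 = torch.zeros_like(f2)   # hide the 2D branch
            elif torch.rand(1).item() < 0.25:
                f3 = torch.zeros_like(f3)   # hide the 3D branch
        return self.fuse(torch.cat([f2, f3], dim=-1))

extractor = JointFeatureExtractor().eval()
joint = extractor(torch.randn(5, 256), torch.randn(5, 7))  # (5, 64)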
Shared Cross-Modal Trajectory Prediction for Autonomous Driving
Predicting future trajectories of traffic agents in highly interactive
environments is an essential and challenging problem for the safe operation of
autonomous driving systems. Since self-driving vehicles
are equipped with various types of sensors (e.g., LiDAR scanner, RGB camera,
radar, etc.), we propose a Cross-Modal Embedding framework that aims to benefit
from the use of multiple input modalities. At training time, our model learns
to embed a set of complementary features in a shared latent space by jointly
optimizing the objective functions across different types of input data. At
test time, a single input modality (e.g., LiDAR data) is required to generate
predictions from the input perspective (i.e., in the LiDAR space), while taking
advantage of the model trained with multiple sensor modalities. An extensive
evaluation is conducted to show the efficacy of the proposed framework using
two benchmark driving datasets.
Comment: CVPR 2021 [Oral]
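The cross-modal embedding idea can be sketched as per-modality encoders trained to place paired samples close together in one shared space, so a single modality suffices at test time; the alignment loss and sizes below are my assumptions, not the paper's objective:

import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossModalEmbedding(nn.Module):
    def __init__(self, lidar_dim=512, image_dim=512, shared_dim=128):
        super().__init__()
        self.lidar_enc = nn.Linear(lidar_dim, shared_dim)
        self.image_enc = nn.Linear(image_dim, shared_dim)
        self.predictor = nn.Linear(shared_dim, 20)  # e.g. 10 future (x, y) steps

    def embedding_loss(self, lidar_feat, image_feat):
        # Pull paired embeddings of the same scene together.
        zl = F.normalize(self.lidar_enc(lidar_feat), dim=-1)
        zi = F.normalize(self.image_enc(image_feat), dim=-1)
        return (1 - (zl * zi).sum(-1)).mean()  # cosine alignment

    def forward(self, lidar_feat):
        # Test time: LiDAR alone, in a space shaped by both modalities.
        return self.predictor(self.lidar_enc(lidar_feat)).view(-1, 10, 2)

model = CrossModalEmbedding()
loss = model.embedding_loss(torch.randn(8, 512), torch.randn(8, 512))
preds = model(torch.randn(8, 512))  # (8, 10, 2)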