GRIP++: Enhanced Graph-based Interaction-aware Trajectory Prediction for Autonomous Driving
Despite advances in autonomous driving technology, the
safety of a self-driving car is still a challenging problem that has not been
well studied. Motion prediction is one of the core functions of an autonomous
driving car. Previously, we proposed a novel scheme called GRIP, which is
designed to predict trajectories for traffic agents around an autonomous car
efficiently. GRIP uses a graph to represent the interactions of close objects,
applies several graph convolutional blocks to extract features, and
subsequently uses an encoder-decoder long short-term memory (LSTM) model to
make predictions. Even though our experimental results show that GRIP improves
prediction accuracy over the state-of-the-art solution by 30%, GRIP still has
some limitations. GRIP uses a fixed graph to describe the relationships between
different traffic agents and hence may suffer performance degradation when
used in urban traffic scenarios. In this paper, we therefore
describe an improved scheme called GRIP++ where we use both fixed and dynamic
graphs for trajectory predictions of different types of traffic agents. Such an
improvement can help autonomous driving cars avoid many traffic accidents. Our
evaluations using a recently released urban traffic dataset, namely ApolloScape,
showed that GRIP++ achieves better prediction accuracy than state-of-the-art
schemes. GRIP++ ranked #1 on the leaderboard of the ApolloScape trajectory
competition in October 2019. In addition, GRIP++ runs 21.7 times faster than a
state-of-the-art scheme, CS-LSTM.
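As a rough illustration of the pipeline the abstract describes, a minimal PyTorch sketch follows: graph-convolutional blocks mix features across interacting agents through a normalized adjacency matrix (fixed or dynamically predicted), and an encoder-decoder LSTM rolls out future positions. All layer sizes, the input features, and the adjacency a_hat are illustrative assumptions, not the authors' exact design.

    import torch
    import torch.nn as nn

    class GraphConvBlock(nn.Module):
        # Mix features across interacting agents via a normalized
        # adjacency matrix a_hat (fixed or predicted per scene).
        def __init__(self, in_dim, out_dim):
            super().__init__()
            self.linear = nn.Linear(in_dim, out_dim)

        def forward(self, x, a_hat):
            # x: (batch, agents, in_dim); a_hat: (agents, agents)
            return torch.relu(self.linear(torch.einsum("ij,bjd->bid", a_hat, x)))

    class TrajPredictor(nn.Module):
        def __init__(self, feat_dim=64, hidden=128, horizon=25):
            super().__init__()
            self.gcn = GraphConvBlock(4, feat_dim)   # e.g. (x, y, vx, vy) inputs
            self.encoder = nn.LSTM(feat_dim, hidden, batch_first=True)
            self.decoder = nn.LSTM(2, hidden, batch_first=True)
            self.out = nn.Linear(hidden, 2)          # predicted (x, y) offset
            self.horizon = horizon

        def forward(self, hist, a_hat):
            # hist: (batch, time, agents, 4) observed agent states
            b, t, n, _ = hist.shape
            feats = torch.stack([self.gcn(hist[:, s], a_hat) for s in range(t)], 1)
            _, state = self.encoder(feats.permute(0, 2, 1, 3).reshape(b * n, t, -1))
            step, outs = hist.new_zeros(b * n, 1, 2), []
            for _ in range(self.horizon):            # autoregressive decoding
                dec, state = self.decoder(step, state)
                step = self.out(dec)
                outs.append(step)
            return torch.cat(outs, dim=1).view(b, n, self.horizon, 2)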
Context-Aware Synthesis and Placement of Object Instances
Learning to insert an object instance into an image in a semantically
coherent manner is a challenging and interesting problem. Solving it requires
(a) determining a location to place an object in the scene and (b) determining
its appearance at the location. Such an object insertion model can potentially
facilitate numerous image editing and scene parsing applications. In this
paper, we propose an end-to-end trainable neural network for the task of
inserting an object instance mask of a specified class into the semantic label
map of an image. Our network consists of two generative modules where one
determines where the inserted object mask should be (i.e., location and scale)
and the other determines what the object mask shape (and pose) should look
like. The two modules are connected together via a spatial transformation
network and jointly trained. We devise a learning procedure that leverages both
supervised and unsupervised data and show that our model can insert an object at
diverse locations with various appearances. We conduct extensive experimental
validations with comparisons to strong baselines to verify the effectiveness of
the proposed network.
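A minimal sketch of how the two modules could be coupled, assuming PyTorch: a "where" module regresses a 2x3 affine transform (location and scale) from the semantic label map, and a spatial transformer warps the generated object mask into the scene. Module names and sizes are hypothetical, and the stochastic (generative) parts of both modules are elided.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class WhereModule(nn.Module):
        # Regress affine placement parameters from the semantic label map.
        def __init__(self, ctx_dim=128):
            super().__init__()
            self.enc = nn.Sequential(
                nn.Conv2d(1, 16, 5, stride=4), nn.ReLU(),
                nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                nn.Linear(16, ctx_dim), nn.ReLU())
            self.theta = nn.Linear(ctx_dim, 6)   # 2x3 affine matrix

        def forward(self, label_map):
            # label_map: (batch, 1, H, W) semantic map as a float tensor
            return self.theta(self.enc(label_map)).view(-1, 2, 3)

    def place_mask(obj_mask, theta, out_size):
        # Spatial transformer: warp the "what" module's mask into the scene.
        grid = F.affine_grid(theta, out_size, align_corners=False)
        return F.grid_sample(obj_mask, grid, align_corners=False)

Because the spatial transformer is differentiable, gradients from a loss on the composited label map can flow back into both modules, which is what enables the joint training mentioned above.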
Segmenting the Future
Predicting the future is an important aspect of decision-making in robotics
or autonomous driving systems, which heavily rely upon visual scene
understanding. While prior work attempts to predict future video pixels,
anticipate activities or forecast future scene semantic segments from
segmentation of the preceding frames, methods that predict future semantic
segmentation solely from preceding RGB frames in a single end-to-end
trainable model do not exist. In this paper, we propose a temporal
encoder-decoder network architecture that encodes RGB frames from the past and
decodes the future semantic segmentation. The network is coupled with a new
knowledge distillation training framework specific for the forecasting task.
Our method, only seeing preceding video frames, implicitly models the scene
segments while simultaneously accounting for the object dynamics to infer the
future scene semantic segments. Our results on Cityscapes and ApolloScape
outperform the baseline and current state-of-the-art methods. Code is available
at https://github.com/eddyhkchiu/segmenting_the_future/
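One plausible reading of the distillation setup, sketched in PyTorch: at training time a conventional segmentation teacher sees the actual future frame, while the forecasting student, which sees only the past RGB frames, is trained to match the teacher's soft outputs as well as the ground-truth labels. The temperature T and mixing weight alpha are assumed hyperparameters.

    import torch.nn.functional as F

    def forecast_distill_loss(student_logits, teacher_logits, labels,
                              T=2.0, alpha=0.5):
        # student/teacher logits: (batch, classes, H, W); labels: (batch, H, W)
        soft = F.kl_div(F.log_softmax(student_logits / T, dim=1),
                        F.softmax(teacher_logits / T, dim=1),
                        reduction="batchmean") * (T * T)
        hard = F.cross_entropy(student_logits, labels)
        return alpha * soft + (1 - alpha) * hard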
Social Attention: Modeling Attention in Human Crowds
Robots that navigate through human crowds need to be able to plan safe,
efficient, and human-predictable trajectories. This is a particularly
challenging problem as it requires the robot to predict future human
trajectories within a crowd where everyone implicitly cooperates with each
other to avoid collisions. Previous approaches to human trajectory prediction
have modeled the interactions between humans as a function of proximity.
However, proximity is not always a good measure of importance: people in our
immediate vicinity moving in the same direction may matter less than people
further away who might collide with us in the future. In this work, we
propose Social Attention, a novel trajectory prediction model that captures the
relative importance of each person when navigating in the crowd, irrespective
of their proximity. We demonstrate the performance of our method against a
state-of-the-art approach on two publicly available crowd datasets and analyze
the trained attention model to gain a better understanding of which surrounding
agents humans attend to when navigating in a crowd.
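A minimal sketch of the core idea, written as generic scaled dot-product attention in PyTorch (not necessarily the paper's exact formulation): the weight given to each neighbor is learned from motion encodings, so a distant agent on a collision course can receive more attention than a nearby one moving in parallel.

    import torch
    import torch.nn as nn

    class AgentAttention(nn.Module):
        def __init__(self, hidden=64):
            super().__init__()
            self.q = nn.Linear(hidden, hidden)   # query from the ego agent
            self.k = nn.Linear(hidden, hidden)   # keys from the neighbors
            self.v = nn.Linear(hidden, hidden)   # values from the neighbors

        def forward(self, ego_h, neighbor_h):
            # ego_h: (batch, hidden); neighbor_h: (batch, agents, hidden)
            scores = torch.einsum("bd,bnd->bn", self.q(ego_h), self.k(neighbor_h))
            weights = torch.softmax(scores / ego_h.size(-1) ** 0.5, dim=-1)
            context = torch.einsum("bn,bnd->bd", weights, self.v(neighbor_h))
            return context, weights   # weights are the per-neighbor importances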
Geometric Image Synthesis
The task of generating natural images from 3D scenes has been a long-standing
goal in computer graphics. On the other hand, recent developments in deep
neural networks allow for trainable models that can produce natural-looking
images with little or no knowledge about the scene structure. While the
generated images often consist of realistic-looking local patterns, the overall
structure of the generated images is often inconsistent. In this work we
propose a trainable, geometry-aware image generation method that leverages
various types of scene information, including geometry and segmentation, to
create realistic-looking natural images that match the desired scene structure.
Our geometrically consistent image synthesis method is a deep neural network,
the Geometry to Image Synthesis (GIS) framework, which retains the
advantages of a trainable method, e.g., differentiability and adaptiveness,
but, at the same time, makes a step towards the generalizability, control and
quality output of modern graphics rendering engines. We utilize the GIS
framework to insert vehicles in outdoor driving scenes, as well as to generate
novel views of objects from the Linemod dataset. We qualitatively show that our
network is able to generalize beyond the training set to novel scene
geometries, object shapes and segmentations. Furthermore, we quantitatively
show that the GIS framework can be used to synthesize large amounts of training
data, which proves beneficial for training instance segmentation models.
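A minimal sketch of the conditioning idea, assuming PyTorch: geometry (e.g., depth and surface normals) and a one-hot segmentation are stacked into the generator input, so the synthesized image is forced to agree with the desired scene structure. The channel layout and the tiny generator are illustrative only; the actual GIS network is considerably deeper.

    import torch
    import torch.nn as nn

    def build_condition(depth, normals, seg_onehot):
        # depth: (B, 1, H, W); normals: (B, 3, H, W); seg_onehot: (B, K, H, W)
        return torch.cat([depth, normals, seg_onehot], dim=1)

    class GISGeneratorSketch(nn.Module):
        def __init__(self, in_ch, width=64):
            super().__init__()
            self.net = nn.Sequential(
                nn.Conv2d(in_ch, width, 3, padding=1), nn.ReLU(),
                nn.Conv2d(width, width, 3, padding=1), nn.ReLU(),
                nn.Conv2d(width, 3, 3, padding=1), nn.Tanh())  # RGB in [-1, 1]

        def forward(self, cond):
            return self.net(cond)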
End-to-End Tracking and Semantic Segmentation Using Recurrent Neural Networks
In this work we present a novel end-to-end framework for tracking and
classifying a robot's surroundings in complex, dynamic and only partially
observable real-world environments. The approach deploys a recurrent neural
network to filter an input stream of raw laser measurements in order to
directly infer object locations, along with their identity in both visible and
occluded areas. To achieve this, we first train the network using unsupervised
Deep Tracking, a recently proposed theoretical framework for end-to-end space
occupancy prediction. We show that by learning to track on a large amount of
unsupervised data, the network creates a rich internal representation of its
environment which we in turn exploit through the principle of inductive
transfer of knowledge to perform the task of its semantic classification. As a
result, we show that only a small amount of labelled data suffices to steer the
network towards mastering this additional task. Furthermore, we propose a novel
recurrent neural network architecture specifically tailored to tracking and
semantic classification in real-world robotics applications. We demonstrate the
tracking and classification performance of the method on real-world data
collected at a busy road junction. Our evaluation shows that the proposed
end-to-end framework compares favourably to a state-of-the-art, model-free
tracking solution and that it outperforms a conventional one-shot training
scheme for semantic classification.
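A minimal sketch of the two-task recurrent setup, assuming PyTorch: a GRU filters a stream of raw occupancy grids into a hidden belief state, from which one head predicts space occupancy (trainable without labels, as in Deep Tracking) and a second head predicts per-cell semantics. Grid size, channel counts, and head shapes are assumptions.

    import torch
    import torch.nn as nn

    class DeepTrackerSketch(nn.Module):
        def __init__(self, ch=16, grid=32, num_classes=3, hidden=256):
            super().__init__()
            self.grid, self.num_classes = grid, num_classes
            self.embed = nn.Conv2d(1, ch, 3, padding=1)
            self.gru = nn.GRUCell(ch * grid * grid, hidden)
            self.occ_head = nn.Linear(hidden, grid * grid)   # unsupervised task
            self.sem_head = nn.Linear(hidden, num_classes * grid * grid)

        def forward(self, scan, h):
            # scan: (batch, 1, grid, grid) raw occupancy at one time step
            z = torch.relu(self.embed(scan)).flatten(1)
            h = self.gru(z, h)                               # belief update
            occ = torch.sigmoid(self.occ_head(h)).view(-1, 1, self.grid, self.grid)
            sem = self.sem_head(h).view(-1, self.num_classes, self.grid, self.grid)
            return occ, sem, h

The inductive transfer described above would then amount to pretraining the embedding, GRU, and occupancy head on unlabelled sequences, and fitting only the semantic head (or fine-tuning lightly) on the small labelled set.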
Selective Distillation of Weakly Annotated GTD for Vision-based Slab Identification System
This paper proposes an algorithm for recognizing slab identification numbers
in factory scenes. In the development of a deep-learning based system, manual
labeling to make ground truth data (GTD) is an important but expensive task.
Furthermore, the quality of GTD is closely related to the performance of a
supervised learning algorithm. To reduce manual work in the labeling process,
we generated weakly annotated GTD by marking only character centroids. Whereas
bounding-boxes for characters require at least a drag-and-drop operation or two
clicks to annotate a character location, the weakly annotated GTD requires a
single click to record a character location. The main contribution of this
paper is selective distillation, which improves the quality of the weakly
annotated GTD. Because manual GTD are usually generated by many people, they may
contain personal bias or human error. To address this problem, the information
in manual GTD is integrated and refined by selective distillation. In the
process of selective distillation, a fully convolutional network is trained
using the weakly annotated GTD, and its prediction maps are selectively used to
revise locations and boundaries of semantic regions of characters in the
initial GTD. The modified GTD are used in the main training stage, and a
post-processing is conducted to retrieve text information. Experiments were
thoroughly conducted on actual industry data collected at a steelmaking factory
to demonstrate the effectiveness of the proposed method. Comment: 10 pages, 12 figures, submitted to a journal.
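A minimal NumPy sketch of the selective revision step, under the assumption that "selective" means trusting the trained FCN only where it is confident: pixels where the prediction confidence exceeds a threshold take the network's label, while uncertain pixels keep the initial centroid-derived annotation. The threshold value is an assumption.

    import numpy as np

    def selectively_revise(weak_gtd, pred_map, conf_thresh=0.9):
        # weak_gtd: (H, W) integer label map from centroid clicks
        # pred_map: (H, W, K) per-class softmax scores from the trained FCN
        pred_labels = pred_map.argmax(axis=-1)
        confidence = pred_map.max(axis=-1)
        revised = np.where(confidence >= conf_thresh, pred_labels, weak_gtd)
        return revised.astype(weak_gtd.dtype)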
Accurate Single Stage Detector Using Recurrent Rolling Convolution
Most of the recent successful methods in accurate object detection and
localization used some variants of R-CNN style two stage Convolutional Neural
Networks (CNN) where plausible regions were proposed in the first stage then
followed by a second stage for decision refinement. Despite the simplicity of
training and the efficiency in deployment, the single stage detection methods
have not been as competitive when evaluated on benchmarks that consider mAP at
high IoU thresholds. In this paper, we propose a novel single-stage end-to-end
trainable object detection network to overcome this limitation. We achieve
this by introducing the Recurrent Rolling Convolution (RRC) architecture over
multi-scale feature maps to construct object classifiers and bounding box
regressors which are "deep in context". We evaluated our method in the
challenging KITTI dataset which measures methods under IoU threshold of 0.7. We
showed that with RRC, a single reduced VGG-16 based model already significantly
outperformed all the previously published results. At the time this paper was
written, our models ranked first in KITTI car detection (the hard level),
first in cyclist detection, and second in pedestrian detection. These
results were not reached by the previous single stage methods. The code is
publicly available. Comment: CVPR 2017.
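A minimal sketch of the "rolling" aggregation, assuming PyTorch: in each iteration, every scale in the feature pyramid gathers context rolled down from its finer neighbor and rolled up from its coarser neighbor, and applying the same module repeatedly makes the aggregation recurrent. The layer choices are illustrative, not the paper's exact design.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class RollingStep(nn.Module):
        def __init__(self, ch):
            super().__init__()
            self.down = nn.Conv2d(ch, ch, 3, stride=2, padding=1)  # finer -> coarser
            self.up = nn.Conv2d(ch, ch, 1)                         # coarser -> finer
            self.fuse = nn.Conv2d(3 * ch, ch, 1)

        def forward(self, feats):
            # feats: list of (B, ch, H_i, W_i) maps, ordered fine -> coarse
            rolled = []
            for i, f in enumerate(feats):
                parts = [f]
                if i > 0:
                    parts.append(F.adaptive_avg_pool2d(self.down(feats[i - 1]),
                                                       f.shape[-2:]))
                if i < len(feats) - 1:
                    parts.append(F.interpolate(self.up(feats[i + 1]),
                                               size=f.shape[-2:]))
                while len(parts) < 3:          # pad at the pyramid ends
                    parts.append(torch.zeros_like(f))
                rolled.append(torch.relu(self.fuse(torch.cat(parts, dim=1))))
            return rolled

    # Shared weights across iterations make the rolling recurrent:
    # step = RollingStep(64); for _ in range(4): feats = step(feats)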
Predicting Vehicle Behaviors Over An Extended Horizon Using Behavior Interaction Network
Anticipating possible behaviors of traffic participants is an essential
capability of autonomous vehicles. Many behavior detection and maneuver
recognition methods only have a very limited prediction horizon that leaves
inadequate time and space for planning. To avoid unsatisfactory reactive
decisions, it is essential to account for long-term future rewards in planning, which
requires extending the prediction horizon. In this paper, we uncover that clues
to vehicle behaviors over an extended horizon can be found in vehicle
interaction, which makes it possible to anticipate the likelihood of a certain
behavior, even in the absence of any clear maneuver pattern. We adopt a
recurrent neural network (RNN) for observation encoding, and based on that, we
propose a novel vehicle behavior interaction network (VBIN) to capture the
vehicle interaction from the hidden states and connection feature of each
interaction pair. The output of our method is a probabilistic likelihood of
multiple behavior classes, which matches the multimodal and uncertain nature of
the distant future. A systematic comparison of our method against two
state-of-the-art methods and another two baseline methods on a publicly
available real highway dataset is provided, showing that our method has
superior accuracy and advanced capability for interaction modeling. Comment: 6+n pages. Accepted to the International Conference on Robotics and
Automation (ICRA) 2019. IEEE copyright.
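A minimal sketch of the interaction scoring, assuming PyTorch: a shared RNN encodes each vehicle's observation history, and an MLP scores each interaction pair from the two hidden states plus a relative "connection" feature (relative position and velocity are an assumption here), producing a distribution over behavior classes.

    import torch
    import torch.nn as nn

    class BehaviorInteractionSketch(nn.Module):
        def __init__(self, obs_dim=4, hidden=64, conn_dim=4, num_behaviors=3):
            super().__init__()
            self.rnn = nn.GRU(obs_dim, hidden, batch_first=True)  # shared encoder
            self.pair = nn.Sequential(
                nn.Linear(2 * hidden + conn_dim, hidden), nn.ReLU())
            self.cls = nn.Linear(hidden, num_behaviors)

        def forward(self, ego_obs, other_obs, conn):
            # ego_obs, other_obs: (batch, time, obs_dim); conn: (batch, conn_dim)
            _, h_ego = self.rnn(ego_obs)
            _, h_other = self.rnn(other_obs)
            pair = self.pair(torch.cat([h_ego[-1], h_other[-1], conn], dim=-1))
            return torch.softmax(self.cls(pair), dim=-1)  # behavior likelihoods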
Re-ranking Object Proposals for Object Detection in Automatic Driving
Object detection often suffers from a glut of useless proposals; selecting
high-quality proposals remains a great challenge. In this paper, we propose a
semantic, class-specific approach to re-rank object proposals, which can
consistently improve the recall performance even with fewer proposals. We first
extract features for each proposal including semantic segmentation, stereo
information, contextual information, CNN-based objectness and low-level cue,
and then score them using class-specific weights learnt by Structured SVM. The
advantages of the proposed model are twofold: 1) it can be easily merged into
existing generators at little computational cost, and 2) it can achieve a high
recall rate under strict criteria even when using fewer proposals. Experimental
evaluation on the KITTI benchmark demonstrates that our approach significantly
improves existing popular generators on recall performance. Moreover, in the
experiment conducted for object detection, even with 1,500 proposals, our
approach still achieves higher average precision (AP) than baselines with 5,000
proposals.
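A minimal NumPy sketch of the re-ranking step itself, assuming the per-proposal feature vectors are already extracted and the class-specific weights have been learnt (the structured-SVM training is elided):

    import numpy as np

    def rerank_proposals(features, class_weights, top_k=1500):
        # features: (num_proposals, num_features) per-proposal feature vectors
        # class_weights: (num_features,) weights learnt for one object class
        scores = features @ class_weights
        keep = np.argsort(-scores)[:top_k]   # highest-scoring proposals first
        return keep, scores[keep]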