Densely Supervised Grasp Detector (DSGD)
This paper presents Densely Supervised Grasp Detector (DSGD), a deep learning
framework that combines CNN structures with layer-wise feature fusion and
produces grasps and their confidence scores at different levels of the image
hierarchy (i.e., the global, region, and pixel levels). Specifically, at the
global-level, DSGD uses the entire image information to predict a grasp. At the
region-level, DSGD uses a region proposal network to identify salient regions
in the image and predicts a grasp for each salient region. At the pixel-level,
DSGD uses a fully convolutional network and predicts a grasp and its confidence
at every pixel. During inference, DSGD selects the most confident grasp as
the output. This selection from hierarchically generated grasp candidates
overcomes the limitations of the individual models. DSGD outperforms
state-of-the-art methods on the Cornell grasp dataset in terms of grasp
accuracy. Evaluation on a multi-object dataset and real-world robotic
grasping experiments show that DSGD produces highly stable grasps on a set of
unseen objects in new environments. It achieves 97% grasp detection accuracy
and a 90% robotic grasping success rate with real-time inference speed.
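The selection step is simple to picture: each hierarchy level emits grasp candidates with confidence scores, and inference keeps the single most confident one. A minimal sketch in Python, assuming a hypothetical (x, y, theta, w, h) grasp parameterisation and candidate lists; none of these names come from the paper:

```python
import numpy as np

# Minimal sketch of DSGD's inference-time selection. Each hierarchy level
# (global, region, pixel) yields (grasp_params, confidence) candidates;
# the most confident candidate becomes the output. The (x, y, theta, w, h)
# grasp parameterisation is an assumption, not the authors' exact format.

def select_grasp(global_grasp, region_grasps, pixel_grasps):
    """Return the highest-confidence grasp across all hierarchy levels."""
    candidates = [global_grasp, *region_grasps, *pixel_grasps]
    return max(candidates, key=lambda g: g[1])

# Toy example: here the pixel-level head happens to be most confident.
g_global = (np.array([120.0, 80.0, 0.3, 40.0, 20.0]), 0.71)
g_region = [(np.array([60.0, 50.0, 1.1, 30.0, 15.0]), 0.85)]
g_pixel = [(np.array([62.0, 52.0, 1.0, 28.0, 14.0]), 0.93)]
params, conf = select_grasp(g_global, g_region, g_pixel)
print(params, conf)  # pixel-level candidate wins with confidence 0.93
```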
Multi-Modal Trip Hazard Affordance Detection On Construction Sites
Trip hazards are a significant contributor to accidents on construction and
manufacturing sites, where over a third of Australian workplace injuries occur
[1]. Current safety inspections are labour intensive and limited by human
fallibility, making automation of trip hazard detection appealing from both a
safety and economic perspective. Trip hazards present an interesting challenge
to modern learning techniques because they are defined as much by affordance as
by object type; for example, wires on a table are not a trip hazard, but can
become one if lying on the ground. To address these challenges, we conduct a comprehensive
investigation into the performance characteristics of 11 different colour and
depth fusion approaches, including four fusion approaches and one non-fusion
approach, using colour and two types of depth images. Trained and tested on
over 600 labelled trip hazards across 4 floors and 2000 m² of an active
construction site, this approach was able to differentiate between identical
objects in different physical configurations (see Figure 1). Outperforming a colour-only
detector, our multi-modal trip detector fuses colour and depth information to
achieve a 4% absolute improvement in F1-score. These investigative results and
the extensive publicly available dataset move us one step closer to assistive
or fully automated safety inspection systems on construction sites.
Comment: 9 pages, 12 figures, 2 tables. Accepted to Robotics and Automation Letters (RA-L).
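Of the fusion strategies compared, the simplest to illustrate is input-level (early) fusion, where the depth map is stacked onto the colour channels before the network sees the image. A minimal sketch, assuming an HxW depth map and an HxWx3 colour image; the function name and normalisation are assumptions, not the paper's pipeline:

```python
import numpy as np

# Minimal sketch of input-level (early) colour-depth fusion: stack a depth
# channel onto the RGB channels so a single detector sees an RGB-D tensor.
# The paper compares several fusion strategies; this shows only the simplest.

def fuse_rgbd(rgb: np.ndarray, depth: np.ndarray) -> np.ndarray:
    """Concatenate an HxWx3 colour image with an HxW depth map into HxWx4."""
    d = (depth - depth.min()) / (depth.max() - depth.min() + 1e-8)  # -> [0, 1]
    return np.concatenate([rgb, d[..., None]], axis=-1)

rgb = np.random.rand(480, 640, 3).astype(np.float32)
depth = np.random.rand(480, 640).astype(np.float32)
print(fuse_rgbd(rgb, depth).shape)  # (480, 640, 4), fed to the detector
```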
Ordered Pooling of Optical Flow Sequences for Action Recognition
Training of Convolutional Neural Networks (CNNs) on long video sequences is
computationally expensive due to the substantial memory requirements and the
massive number of parameters that deep architectures demand. Early fusion of
video frames is thus a standard technique, in which several consecutive frames
are first agglomerated into a compact representation, and then fed into the CNN
as an input sample. For this purpose, a summarization approach that represents
a set of consecutive RGB frames by a single dynamic image capturing pixel
dynamics was recently proposed. In this paper, we introduce a novel ordered
representation of consecutive optical flow frames as an alternative and argue
that this representation captures the action dynamics more effectively than RGB
frames. We provide intuitions on why such a representation is better for action
recognition. We validate our claims on standard benchmark datasets and
demonstrate that using summaries of flow images leads to significant
improvements over RGB frames while achieving accuracy comparable to the
state-of-the-art on the UCF101 and HMDB datasets.
Comment: Accepted at WACV 2017.
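One way to picture this kind of ordered summary is approximate rank pooling as used for dynamic images, where frames are combined with a fixed order-dependent weighting (alpha_t = 2t - T - 1). Whether the paper's pooling matches these exact coefficients is an assumption; the sketch only conveys how a flow sequence collapses into one order-aware image:

```python
import numpy as np

# Sketch of order-aware temporal pooling in the spirit of dynamic images:
# collapse T flow fields into one summary frame using approximate
# rank-pooling weights alpha_t = 2t - T - 1, which grow with t so later
# frames dominate. Ordering information thus survives the pooling.

def ordered_pool(flow_frames: np.ndarray) -> np.ndarray:
    """Collapse a (T, H, W, 2) stack of flow fields into one (H, W, 2) image."""
    T = flow_frames.shape[0]
    t = np.arange(1, T + 1, dtype=np.float32)
    alpha = 2.0 * t - T - 1.0  # monotonically increasing weights
    return np.tensordot(alpha, flow_frames, axes=(0, 0))

flows = np.random.randn(10, 224, 224, 2).astype(np.float32)
print(ordered_pool(flows).shape)  # (224, 224, 2): one compact CNN input
```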
Long-Term Image Boundary Prediction
Boundary estimation in images and videos has been a very active topic of
research, and organizing visual information into boundaries and segments is
believed to be a cornerstone of visual perception. While prior work has
focused on estimating boundaries for observed frames, our work aims at
predicting boundaries of future unobserved frames. This requires our model to
learn about the fate of boundaries and corresponding motion patterns --
including a notion of "intuitive physics". We experiment on natural video
sequences along with synthetic sequences with deterministic physics-based and
agent-based motions. While not our primary goal, we also show that fusion
of RGB and boundary prediction leads to improved RGB predictions.
Comment: Accepted at the AAAI Conference on Artificial Intelligence (AAAI), 2018.
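Predicting boundaries of unobserved frames is naturally set up autoregressively: a model maps the last k boundary maps to the next one, and its output is fed back in to roll the forecast further into the future. A minimal sketch with a toy stand-in model; the sliding-window scheme, k, and all names are assumptions, not the authors' architecture:

```python
import numpy as np

# Sketch of autoregressive boundary prediction: a model maps the last k
# boundary maps to the next one, and each prediction is fed back in to
# extend the forecast horizon. The stand-in `model` is a placeholder for
# any learned predictor, e.g. a CNN.

def predict_boundaries(model, observed, horizon, k=4):
    """Roll a k-frame boundary predictor `horizon` steps past the video."""
    history = list(observed[-k:])
    future = []
    for _ in range(horizon):
        nxt = model(np.stack(history, axis=0))  # (k, H, W) -> (H, W)
        future.append(nxt)
        history = history[1:] + [nxt]  # slide the input window forward
    return future

toy_model = lambda h: h.mean(axis=0)  # placeholder for a learned model
observed = [np.random.rand(64, 64) for _ in range(4)]
print(len(predict_boundaries(toy_model, observed, horizon=3)))  # 3 future maps
```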