3,944 research outputs found
Eye Semantic Segmentation with a Lightweight Model
In this paper, we present a multi-class eye segmentation method that can run
within hardware limitations for real-time inference. Our approach includes three
major stages: obtaining a grayscale image from the input, segmenting three distinct
eye regions with a deep network, and removing incorrect areas with heuristic
filters. Our model is based on an encoder-decoder structure, with depthwise
convolution as the key operation for reducing the computational cost. We experiment
on OpenEDS, a large-scale dataset of eye images captured by a head-mounted display
with two synchronized eye-facing cameras. We achieved a mean intersection over
union (mIoU) of 94.85% with a model of size 0.4 megabytes. The source code is
available at https://github.com/th2l/Eye_VR_Segmentation.
Comment: To appear in ICCVW 2019. Pre-trained models and source code are
available at https://github.com/th2l/Eye_VR_Segmentation
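As a rough illustration of the design this abstract describes, below is a minimal PyTorch sketch of an encoder-decoder built from depthwise-separable convolutions. The layer widths, input resolution, and 4-class output (assumed: background plus three eye regions) are illustrative assumptions, not the authors' released model.

```python
import torch
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """3x3 depthwise conv followed by a 1x1 pointwise conv (cheap vs. full conv)."""
    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        self.depthwise = nn.Conv2d(in_ch, in_ch, 3, stride=stride,
                                   padding=1, groups=in_ch, bias=False)
        self.pointwise = nn.Conv2d(in_ch, out_ch, 1, bias=False)
        self.bn = nn.BatchNorm2d(out_ch)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.act(self.bn(self.pointwise(self.depthwise(x))))

class TinyEyeSegNet(nn.Module):
    """Grayscale input -> per-pixel logits over 4 assumed classes."""
    def __init__(self, num_classes=4):
        super().__init__()
        self.encoder = nn.Sequential(
            DepthwiseSeparableConv(1, 16, stride=2),
            DepthwiseSeparableConv(16, 32, stride=2),
            DepthwiseSeparableConv(32, 64, stride=2),
        )
        self.decoder = nn.Sequential(
            nn.Upsample(scale_factor=8, mode='bilinear', align_corners=False),
            DepthwiseSeparableConv(64, 32),
            nn.Conv2d(32, num_classes, 1),
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))

logits = TinyEyeSegNet()(torch.randn(1, 1, 160, 160))
print(logits.shape)  # torch.Size([1, 4, 160, 160])
```

The argmax over the class dimension would then be cleaned up by heuristic filters, per the abstract's third stage.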
Deep Learning for LiDAR Point Clouds in Autonomous Driving: A Review
Recently, the advancement of deep learning in discriminative feature learning
from 3D LiDAR data has led to rapid development in the field of autonomous
driving. However, the automated processing of uneven, unstructured, noisy, and
massive 3D point clouds is a challenging and tedious task. In this paper, we
provide a systematic review of existing compelling deep learning architectures
applied to LiDAR point clouds, detailing specific tasks in autonomous driving
such as segmentation, detection, and classification. Although several published
research papers focus on specific topics in computer vision for autonomous
vehicles, to date, no general survey on deep learning applied to LiDAR point
clouds for autonomous vehicles exists. Thus, the goal of this paper is to
narrow that gap. More than 140 key contributions from the past five years are
summarized in this survey, including milestone 3D deep architectures; remarkable
deep learning applications in 3D semantic segmentation, object detection, and
classification; specific datasets; evaluation metrics; and state-of-the-art
performance. Finally, we conclude with the remaining challenges and directions
for future research.
Comment: 21 pages, submitted to IEEE Transactions on Neural Networks and
Learning Systems
VarGNet: Variable Group Convolutional Neural Network for Efficient Embedded Computing
In this paper, we propose a novel network design mechanism for efficient
embedded computing. Inspired by the limited computing patterns of embedded
hardware, we propose to fix the number of channels per group in a group
convolution, instead of the existing practice of fixing the total number of
groups. The resulting network, named Variable Group Convolutional Network
(VarGNet), can be optimized more easily on the hardware side, thanks to the
more unified computing schemes among its layers. Extensive experiments on
various vision tasks, including classification, detection, pixel-wise parsing,
and face recognition, have demonstrated the practical value of our VarGNet.
Comment: Technical report
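The core idea of the abstract can be sketched in a few lines: keep the number of channels per group constant, so the group count grows with layer width and every group always sees the same-sized input slice. The constant of 8 channels per group below is an illustrative assumption, not necessarily the paper's value.

```python
import torch
import torch.nn as nn

CHANNELS_PER_GROUP = 8  # fixed per-group width; illustrative assumption

def variable_group_conv(in_ch, out_ch, kernel_size=3, stride=1):
    # The group count varies with layer width, instead of being fixed.
    groups = in_ch // CHANNELS_PER_GROUP
    return nn.Conv2d(in_ch, out_ch, kernel_size, stride=stride,
                     padding=kernel_size // 2, groups=groups, bias=False)

# A 64-channel layer gets 8 groups and a 128-channel layer gets 16 groups,
# but each group always processes an 8-channel slice, giving the hardware
# one uniform computing pattern across all layers.
y = variable_group_conv(64, 128)(torch.randn(1, 64, 32, 32))
print(y.shape)  # torch.Size([1, 128, 32, 32])
```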
Towards an Embodied Semantic Fovea: Semantic 3D scene reconstruction from ego-centric eye-tracker videos
Incorporating the physical environment is essential for a complete
understanding of human behavior in unconstrained everyday tasks. This is
especially important in ego-centric tasks, where obtaining three-dimensional
information is challenging and current 2D video analysis methods prove
insufficient. Here we demonstrate a proof-of-concept system that provides
real-time 3D mapping and semantic labeling of the local environment from an
ego-centric RGB-D video stream, combined with 3D gaze point estimation from
head-mounted eye-tracking glasses. We augment existing work in Semantic
Simultaneous Localization And Mapping (Semantic SLAM) with collected gaze
vectors. Our system can then find and track objects both inside and outside
the user's field of view in 3D, from multiple perspectives, with reasonable
accuracy. We validate our concept by producing a semantic map from images of
the NYUv2 dataset while simultaneously estimating gaze position and gaze
classes from recorded gaze data for the dataset images.
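One ingredient the abstract implies, lifting a 2D gaze estimate into the 3D map, can be sketched as a pinhole back-projection through the aligned depth frame. This is a minimal sketch under assumed intrinsics (the commonly used NYUv2 Kinect defaults); the identity pose is a placeholder, not the authors' calibration or pipeline.

```python
import numpy as np

def gaze_to_3d(u, v, depth_m, fx=525.0, fy=525.0, cx=319.5, cy=239.5):
    """Back-project a pixel gaze point (u, v) with metric depth into the
    camera frame using a pinhole model."""
    x = (u - cx) * depth_m / fx
    y = (v - cy) * depth_m / fy
    return np.array([x, y, depth_m])

# Example: gaze estimated at pixel (400, 260); the depth map reads 1.8 m there.
point_cam = gaze_to_3d(400, 260, 1.8)

# Register against the SLAM map with the current camera pose (R, t);
# an identity pose is used here purely as a placeholder.
R, t = np.eye(3), np.zeros(3)
point_world = R @ point_cam + t
print(point_world)
```

The resulting 3D gaze points can then be associated with labeled surfaces in the semantic map to decide which object is being fixated.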
Sensor Fusion for Joint 3D Object Detection and Semantic Segmentation
In this paper, we present an extension to LaserNet, an efficient and
state-of-the-art LiDAR-based 3D object detector. We propose a method for fusing
image data with the LiDAR data and show that this sensor fusion method improves
the detection performance of the model, especially at long ranges. The addition
of image data is straightforward and does not require image labels.
Furthermore, we expand the capabilities of the model to perform 3D semantic
segmentation in addition to 3D object detection. On a large benchmark dataset,
we demonstrate that our approach achieves state-of-the-art performance on both
object detection and semantic segmentation while maintaining a low runtime.
Comment: Accepted for publication at the CVPR Workshop on Autonomous Driving 2019
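A minimal sketch of the general fusion pattern the abstract describes: project each LiDAR point into the camera image, sample a CNN feature at that pixel, and concatenate it with the point's LiDAR features. The function name, projection matrix, and feature sizes are assumptions, not LaserNet's actual interface.

```python
import torch

def fuse_lidar_with_image(points_xyz, lidar_feats, image_feats, P):
    """points_xyz: (N, 3) LiDAR points; lidar_feats: (N, C_l);
    image_feats: (C_i, H, W) CNN feature map; P: (3, 4) projection matrix."""
    n = points_xyz.shape[0]
    homog = torch.cat([points_xyz, torch.ones(n, 1)], dim=1)  # (N, 4)
    uvw = homog @ P.T                                          # (N, 3)
    # Perspective divide, then clamp to the feature map bounds.
    u = (uvw[:, 0] / uvw[:, 2]).long().clamp(0, image_feats.shape[2] - 1)
    v = (uvw[:, 1] / uvw[:, 2]).long().clamp(0, image_feats.shape[1] - 1)
    sampled = image_feats[:, v, u].T                           # (N, C_i)
    return torch.cat([lidar_feats, sampled], dim=1)            # (N, C_l + C_i)

fused = fuse_lidar_with_image(
    torch.rand(1000, 3) * 10, torch.rand(1000, 4),
    torch.rand(64, 48, 160), torch.rand(3, 4))
print(fused.shape)  # torch.Size([1000, 68])
```

Because the image branch can be trained end-to-end through the 3D losses, this style of fusion needs no 2D image labels, consistent with the abstract's claim.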
Security of Facial Forensics Models Against Adversarial Attacks
Deep neural networks (DNNs) have been used in digital forensics to identify
fake facial images. We investigated several DNN-based forgery forensics models
(FFMs) to examine whether they are secure against adversarial attacks. We
experimentally demonstrated the existence of individual adversarial
perturbations (IAPs) and universal adversarial perturbations (UAPs) that can
lead a well-performing FFM to misbehave. Based on an iterative procedure,
gradient information is used to generate two kinds of IAPs that can be used to fabricate
classification and segmentation outputs. In contrast, UAPs are generated on the
basis of over-firing. We designed a new objective function that encourages
neurons to over-fire, which makes UAP generation feasible even without using
training data. Experiments demonstrated the transferability of UAPs across
unseen datasets and unseen FFMs. Moreover, we conducted a subjective assessment
of the imperceptibility of the adversarial perturbations, revealing that the
crafted UAPs are visually negligible. These findings provide a baseline for
evaluating the adversarial security of FFMs.
Comment: Accepted by ICIP 2020
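A minimal sketch of the data-free "over-firing" idea: a single perturbation is optimized so that, by itself, it drives feature activations high, with no training images involved. The surrogate network (a randomly initialized ResNet-18 stand-in for an FFM), layer choice, step size, and perturbation budget are all illustrative assumptions, not the paper's objective function.

```python
import torch
import torchvision.models as models

model = models.resnet18(weights=None).eval()  # stand-in for a forensics model
feature_extractor = torch.nn.Sequential(*list(model.children())[:-2])

epsilon = 10 / 255  # assumed L_inf budget keeping the UAP visually negligible
delta = ((torch.rand(1, 3, 224, 224) * 2 - 1) * epsilon).requires_grad_()
opt = torch.optim.Adam([delta], lr=5e-3)

for _ in range(100):
    opt.zero_grad()
    acts = feature_extractor(delta)
    loss = -acts.pow(2).mean()  # maximize activation energy ("over-firing")
    loss.backward()
    opt.step()
    with torch.no_grad():
        delta.clamp_(-epsilon, epsilon)  # project back onto the L_inf ball

uap = delta.detach()  # added to any input image at attack time
```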
Looking Fast and Slow: Memory-Guided Mobile Video Object Detection
With a single eye fixation lasting a fraction of a second, the human visual
system is capable of forming a rich representation of a complex environment,
reaching a holistic understanding which facilitates object recognition and
detection. This phenomenon is known as recognizing the "gist" of the scene and
is accomplished by relying on relevant prior knowledge. This paper addresses
the analogous question of whether using memory in computer vision systems can
not only improve the accuracy of object detection in video streams, but also
reduce the computation time. By interleaving conventional feature extractors
with extremely lightweight ones which only need to recognize the gist of the
scene, we show that minimal computation is required to produce accurate
detections when temporal memory is present. In addition, we show that the
memory contains enough information for deploying reinforcement learning
algorithms to learn an adaptive inference policy. Our model achieves
state-of-the-art performance among mobile methods on the ImageNet VID 2015
dataset, while running at speeds of up to 70+ FPS on a Pixel 3 phone.
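A minimal sketch of the interleaving idea: a large extractor runs only every K-th frame, a tiny "gist" extractor fills the gaps, and a recurrent memory carries detail across frames. All module sizes are assumptions, the 1x1 convolution stands in for the paper's recurrent memory, and the fixed period stands in for the adaptive policy learned with reinforcement learning.

```python
import torch
import torch.nn as nn

class InterleavedFeatures(nn.Module):
    def __init__(self, channels=64, period=4):
        super().__init__()
        self.period = period
        self.large = nn.Sequential(  # accurate but slow extractor
            nn.Conv2d(3, channels, 3, 2, 1), nn.ReLU(),
            nn.Conv2d(channels, channels, 3, 2, 1), nn.ReLU())
        self.small = nn.Sequential(  # cheap "gist" extractor
            nn.Conv2d(3, channels, 3, 4, 1), nn.ReLU())
        self.memory = nn.Conv2d(2 * channels, channels, 1)  # memory stand-in

    def forward(self, frames):
        state, outputs = None, []
        for t, frame in enumerate(frames):
            # Heavy features every `period` frames, gist features otherwise.
            feat = self.large(frame) if t % self.period == 0 else self.small(frame)
            state = feat if state is None else state
            state = self.memory(torch.cat([feat, state], dim=1))
            outputs.append(state)  # a detection head would consume this
        return outputs

video = [torch.randn(1, 3, 128, 128) for _ in range(8)]
feats = InterleavedFeatures()(video)
print(feats[0].shape)  # torch.Size([1, 64, 32, 32])
```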
Shallow and Deep Convolutional Networks for Saliency Prediction
The prediction of salient areas in images has been traditionally addressed
with hand-crafted features based on neuroscience principles. This paper,
however, addresses the problem with a completely data-driven approach by
training a convolutional neural network (convnet). The learning process is
formulated as the minimization of a loss function that measures the Euclidean
distance between the predicted saliency map and the provided ground truth. The
recent publication of large datasets for saliency prediction has provided enough
data to train end-to-end architectures that are both fast and accurate. Two
designs are proposed: a shallow convnet trained from scratch, and another,
deeper solution whose first three layers are adapted from a network trained
for classification. To the authors' knowledge, these are the first end-to-end
CNNs trained and tested for the purpose of saliency prediction.
Comment: Preprint of the paper accepted at the 2016 IEEE Conference on Computer
Vision and Pattern Recognition (CVPR). Source code and models available at
https://github.com/imatge-upc/saliency-2016-cvpr. Junting Pan and Kevin
McGuinness contributed equally to this work.
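A minimal sketch of the training setup the abstract describes: a small convnet regresses a saliency map and is trained by minimizing the Euclidean (MSE) distance to the ground-truth map. The architecture below is an illustrative stand-in, not the paper's exact shallow or deep design.

```python
import torch
import torch.nn as nn

saliency_net = nn.Sequential(
    nn.Conv2d(3, 32, 5, padding=2), nn.ReLU(),
    nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
    nn.Conv2d(64, 1, 1), nn.Sigmoid())  # single-channel saliency map in [0, 1]

criterion = nn.MSELoss()  # the Euclidean-distance loss from the abstract
optimizer = torch.optim.SGD(saliency_net.parameters(), lr=1e-3, momentum=0.9)

image = torch.rand(8, 3, 96, 96)         # dummy batch of RGB images
ground_truth = torch.rand(8, 1, 96, 96)  # dummy fixation/saliency maps

optimizer.zero_grad()
loss = criterion(saliency_net(image), ground_truth)
loss.backward()
optimizer.step()
```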
Reverse Attention for Salient Object Detection
Benefiting from the rapid development of deep learning techniques, salient
object detection has achieved remarkable progress recently. However, two major
challenges still hinder its application in embedded devices: low-resolution
output and heavy model weight. To this end, this paper presents an accurate yet
compact deep network for efficient salient object detection. More specifically,
given a coarse saliency prediction from the deepest layer, we first employ
residual learning to learn side-output residual features for saliency
refinement, which can be achieved with very limited convolutional parameters
while maintaining accuracy. Second, we further propose reverse attention to
guide this side-output residual learning in a top-down manner. By erasing the
currently predicted salient regions from the side-output features, the network
can eventually explore the missing object parts and details, which results in
high-resolution, accurate predictions. Experiments on six benchmark datasets
demonstrate that the proposed approach compares favorably against
state-of-the-art methods, with advantages in terms of simplicity,
efficiency (45 FPS), and model size (81 MB).
Comment: ECCV 2018
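A minimal sketch of the reverse-attention refinement step: the coarse prediction from a deeper stage is upsampled, inverted, and used to erase already-detected salient regions from a side-output feature, so the residual branch focuses on the missing parts. Tensor shapes and the residual block are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def reverse_attention_refine(side_feat, coarse_pred, residual_conv):
    """side_feat: (B, C, H, W) side-output feature; coarse_pred: (B, 1, h, w)
    saliency logits from a deeper stage; residual_conv: small conv block
    predicting a 1-channel residual."""
    up = F.interpolate(coarse_pred, size=side_feat.shape[2:],
                       mode='bilinear', align_corners=False)
    reverse = 1.0 - torch.sigmoid(up)      # attend to the NON-salient regions
    residual = residual_conv(side_feat * reverse)
    return up + residual                   # refined saliency logits

residual_conv = torch.nn.Conv2d(64, 1, 3, padding=1)
refined = reverse_attention_refine(torch.randn(2, 64, 56, 56),
                                   torch.randn(2, 1, 14, 14), residual_conv)
print(refined.shape)  # torch.Size([2, 1, 56, 56])
```

Applied stage by stage from the deepest layer upward, this erase-and-refine loop is what lets a very small decoder recover high-resolution detail.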
SS3D: Single Shot 3D Object Detector
The single-stage deep learning approach to 2D object detection was made popular
by the Single Shot MultiBox Detector (SSD) and has been heavily adopted in
several embedded applications. PointPillars is a state-of-the-art 3D object
detection algorithm that uses a Single Shot Detector adapted for 3D object
detection. The main downside of PointPillars is its two-stage approach: a
learned input representation based on fully connected layers, followed by the
Single Shot Detector for 3D detection. In this paper we present Single Shot 3D
Object Detection (SS3D), a single-stage 3D object detection algorithm that
combines a straightforward, statistically computed input representation with a
Single Shot Detector (based on PointPillars). Computing the input representation
is straightforward, does not involve learning, and does not incur much
computational cost. We also extend our method to stereo input and show that,
aided by additional semantic segmentation input, our method produces accuracy
similar to that of state-of-the-art stereo-based detectors. Achieving the
accuracy of two-stage detectors with a single-stage approach is important, as
single-stage approaches are simpler to implement in embedded, real-time
applications. With LiDAR as well as stereo input, our method outperforms
PointPillars. When using LiDAR input, our input representation improves the
AP3D of the Car class in the moderate category from 74.99 to 76.84. When using
stereo input, it improves the AP3D of the Car class in the moderate category
from 38.13 to 45.13. Our results are also better than those of other popular
3D object detectors such as AVOD and F-PointNet.
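A minimal sketch of a statistically computed, learning-free input representation of the kind the abstract contrasts with PointPillars' learned one: points are binned into a ground-plane grid and each pillar is filled with cheap statistics. The specific statistics, ranges, and grid size are assumptions for illustration, not the paper's exact representation.

```python
import numpy as np

def pillar_statistics(points, x_range=(0.0, 70.4), y_range=(-40.0, 40.0),
                      grid=(176, 200)):
    """points: (N, 4) array of (x, y, z, intensity) LiDAR returns.
    Returns a (3, H, W) bird's-eye-view tensor of per-pillar statistics."""
    H, W = grid
    xi = ((points[:, 0] - x_range[0]) / (x_range[1] - x_range[0]) * W).astype(int)
    yi = ((points[:, 1] - y_range[0]) / (y_range[1] - y_range[0]) * H).astype(int)
    keep = (xi >= 0) & (xi < W) & (yi >= 0) & (yi < H)
    xi, yi, pts = xi[keep], yi[keep], points[keep]

    count = np.zeros((H, W))
    max_z = np.full((H, W), -np.inf)
    sum_int = np.zeros((H, W))
    for r, c, p in zip(yi, xi, pts):
        count[r, c] += 1
        max_z[r, c] = max(max_z[r, c], p[2])
        sum_int[r, c] += p[3]
    max_z[count == 0] = 0.0
    mean_int = np.divide(sum_int, count,
                         out=np.zeros_like(sum_int), where=count > 0)
    return np.stack([count, max_z, mean_int])  # no learned layers involved

pts = np.random.rand(5000, 4) * [70.0, 80.0, 3.0, 1.0] - [0.0, 40.0, 1.0, 0.0]
print(pillar_statistics(pts).shape)  # (3, 176, 200)
```

Since nothing in this stage is learned, it adds almost no compute before the single-shot detector, which is the efficiency argument the abstract makes.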