Semantic-Aware Scene Recognition
Scene recognition is currently one of the most challenging research fields in
computer vision. This may be due to the ambiguity between classes: images of
several scene classes may share similar objects, which causes confusion among
them. The problem is aggravated when images of a particular scene class are
notably different. Convolutional Neural Networks (CNNs) have significantly
boosted performance in scene recognition, although it still lags behind other
recognition tasks (e.g., object or image recognition). In this paper, we
describe a novel approach for scene recognition based on an end-to-end
multi-modal CNN that combines image and context information by means of an
attention module. Context information, in the shape of semantic segmentation,
is used to gate features extracted from the RGB image by leveraging information
encoded in the semantic representation: the set of scene objects and stuff, and
their relative locations. This gating process reinforces the learning of
indicative scene content and enhances scene disambiguation by refocusing the
receptive fields of the CNN towards them. Experimental results on four publicly
available datasets show that the proposed approach outperforms every other
state-of-the-art method while significantly reducing the number of network
parameters. All the code and data used in this paper are available at
https://github.com/vpulab/Semantic-Aware-Scene-Recognition
Comment: Paper submitted for publication to Elsevier Pattern Recognition
journal
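As a rough sketch of the gating idea described in the abstract (not the authors' implementation; all shapes, weights, and function names here are illustrative), the attention module can be read as an element-wise modulation of RGB features by an attention map computed from semantic-branch features:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def semantic_gate(rgb_feat, sem_feat, w):
    """Gate RGB features with an attention map derived from semantic features.

    rgb_feat: (C, H, W) features from the RGB branch
    sem_feat: (S, H, W) features from the semantic-segmentation branch
    w:        (C, S) illustrative 1x1-convolution weights mapping semantic
              channels to one attention value per RGB channel
    Returns gated features of shape (C, H, W).
    """
    # A 1x1 convolution is a per-pixel matrix multiply over channels
    attn = sigmoid(np.einsum("cs,shw->chw", w, sem_feat))  # values in (0, 1)
    return rgb_feat * attn  # element-wise gating

rng = np.random.default_rng(0)
rgb = rng.standard_normal((8, 4, 4))   # toy RGB-branch features
sem = rng.standard_normal((3, 4, 4))   # toy semantic-branch features
w = rng.standard_normal((8, 3))
out = semantic_gate(rgb, sem, w)
assert out.shape == rgb.shape
```

Because the attention values lie in (0, 1), the gate can only attenuate features, which is what lets the semantic branch suppress content that is uninformative for the scene class.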
MaskFusion: Real-Time Recognition, Tracking and Reconstruction of Multiple Moving Objects
We present MaskFusion, a real-time, object-aware, semantic and dynamic RGB-D SLAM system that goes beyond traditional systems which output a purely geometric map of a static scene. MaskFusion recognizes, segments and assigns semantic class labels to different objects in the scene, while tracking and reconstructing them even when they move independently from the camera. As an RGB-D camera scans a cluttered scene, image-based instance-level semantic segmentation creates semantic object masks that enable real-time object recognition and the creation of an object-level representation for the world map. Unlike previous recognition-based SLAM systems, MaskFusion does not require known models of the objects it can recognize, and can deal with multiple independent motions. MaskFusion takes full advantage of using instance-level semantic segmentation to enable semantic labels to be fused into an object-aware map, unlike recent semantics-enabled SLAM systems that perform voxel-level semantic segmentation. We show augmented-reality applications that demonstrate the unique features of the map output by MaskFusion: instance-aware, semantic and dynamic. Code will be made available.
Recurrent Scene Parsing with Perspective Understanding in the Loop
Objects may appear at arbitrary scales in perspective images of a scene,
posing a challenge for recognition systems that process images at a fixed
resolution. We propose a depth-aware gating module that adaptively selects the
pooling field size in a convolutional network architecture according to the
object scale (inversely proportional to the depth) so that small details are
preserved for distant objects while larger receptive fields are used for those
nearby. The depth gating signal is provided by stereo disparity or estimated
directly from monocular input. We integrate this depth-aware gating into a
recurrent convolutional neural network to perform semantic segmentation. Our
recurrent module iteratively refines the segmentation results, leveraging the
depth and semantic predictions from the previous iterations.
Through extensive experiments on four popular large-scale RGB-D datasets, we
demonstrate this approach achieves competitive semantic segmentation
performance with a model which is substantially more compact. We carry out
extensive analysis of this architecture including variants that operate on
monocular RGB but use depth as side-information during training, unsupervised
gating as a generic attentional mechanism, and multi-resolution gating. We find
that gated pooling for joint semantic segmentation and depth yields
state-of-the-art results for quantitative monocular depth estimation.
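A minimal sketch of the depth-aware pooling idea (not the paper's architecture; the quantile binning and window sizes are illustrative assumptions): each pixel averages over a window whose size grows as the pixel gets nearer to the camera, since object scale is inversely proportional to depth.

```python
import numpy as np

def depth_aware_pool(feat, depth, pool_sizes=(1, 3, 5)):
    """Per-pixel average pooling whose window size is selected by depth.

    feat:  (H, W) single feature channel
    depth: (H, W) depth map; nearby pixels (small depth, large inverse depth)
           get larger pooling windows, distant pixels keep small windows
           so fine details are preserved
    """
    H, W = feat.shape
    inv = 1.0 / np.maximum(depth, 1e-6)
    # Quantize inverse depth into one pooling-size index per pixel (0, 1 or 2)
    bins = np.digitize(inv, np.quantile(inv, [1 / 3, 2 / 3]))
    out = np.empty_like(feat)
    for i in range(H):
        for j in range(W):
            k = pool_sizes[bins[i, j]] // 2  # half-window radius
            patch = feat[max(0, i - k):i + k + 1, max(0, j - k):j + k + 1]
            out[i, j] = patch.mean()
    return out

feat = np.ones((6, 6))
depth = np.linspace(0.5, 5.0, 36).reshape(6, 6)
out = depth_aware_pool(feat, depth)
```

In the paper this gating signal comes from stereo disparity or a monocular depth estimate and is learned end-to-end; the hard quantile cut here simply stands in for that learned selection.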
Attribute-aware Semantic Segmentation of Road Scenes for Understanding Pedestrian Orientations
Semantic segmentation is an interesting task for many deep learning researchers working on scene understanding. However, recognizing details about objects' attributes can be more informative and also helpful for better scene understanding in intelligent-vehicle use cases. This paper introduces a method for simultaneous semantic segmentation and pedestrian attribute recognition. A modified dataset is built on top of the Cityscapes dataset by adding attribute classes corresponding to pedestrian orientations. The proposed method extends the SegNet model and is trained using both the original and the attribute-enriched datasets. Experiments show that the proposed attribute-aware semantic segmentation approach slightly improves performance on the Cityscapes dataset, whose class set can be expanded in this way through additional training data.
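The dataset modification described above amounts to splitting the generic pedestrian class into orientation-specific classes appended to the label space. A tiny sketch of that remapping (the class id 11 and the four-orientation scheme are illustrative assumptions, not the paper's exact label definitions):

```python
import numpy as np

PERSON = 11  # hypothetical "person" class id in the base label set
ORIENTATIONS = {"front": 0, "back": 1, "left": 2, "right": 3}

def add_orientation_classes(label_map, person_orientation, num_classes=19):
    """Split the generic person class into orientation-specific classes.

    label_map:          (H, W) integer semantic labels
    person_orientation: (H, W) orientation index, valid where label == PERSON
    The new classes are appended after the original num_classes ids, so the
    rest of the label space is untouched.
    """
    out = label_map.copy()
    mask = label_map == PERSON
    out[mask] = num_classes + person_orientation[mask]
    return out

label = np.array([[11, 3], [11, 11]])
orient = np.array([[0, 0], [2, 3]])
remapped = add_orientation_classes(label, orient)
```

Training a segmentation model on the remapped labels then yields orientation predictions for free at the pixel level, which is the effect the abstract reports.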
Fusion of Multispectral Data Through Illumination-aware Deep Neural Networks for Pedestrian Detection
Multispectral pedestrian detection has received extensive attention in recent
years as a promising solution to facilitate robust human target detection for
around-the-clock applications (e.g. security surveillance and autonomous
driving). In this paper, we demonstrate illumination information encoded in
multispectral images can be utilized to significantly boost performance of
pedestrian detection. A novel illumination-aware weighting mechanism is
presented to accurately depict the illumination condition of a scene. Such illumination
information is incorporated into two-stream deep convolutional neural networks
to learn multispectral human-related features under different illumination
conditions (daytime and nighttime). Moreover, we utilize illumination
information together with multispectral data to generate more accurate semantic
segmentation, which is used to boost pedestrian detection accuracy. Putting all
of the pieces together, we present a powerful framework for multispectral
pedestrian detection based on multi-task learning of illumination-aware
pedestrian detection and semantic segmentation. Our proposed method is trained
end-to-end using a well-designed multi-task loss function and outperforms
state-of-the-art approaches on the KAIST multispectral pedestrian dataset.
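The illumination-aware weighting can be sketched as a convex blend of the two streams' outputs, where a scalar weight predicted by an illumination branch decides how much to trust the daytime-tuned features versus the nighttime-tuned ones (the function below is an illustrative reduction, not the paper's network):

```python
import numpy as np

def illumination_fuse(day_score, night_score, illum_w):
    """Blend day- and night-stream detection scores with an illumination weight.

    day_score, night_score: per-detection confidence arrays from the two
                            stream-specific sub-networks
    illum_w: scalar in [0, 1] from an illumination-estimation branch
             (1.0 ~ bright daytime scene, 0.0 ~ dark nighttime scene)
    """
    return illum_w * day_score + (1.0 - illum_w) * night_score

day = np.array([0.9, 0.2])
night = np.array([0.1, 0.8])
fused = illumination_fuse(day, night, 0.7)  # mostly trust the day stream
```

At illum_w = 1.0 the fused output reduces to the day stream and at 0.0 to the night stream, so the weight smoothly interpolates between the two specialized detectors.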