CrowdHuman: A Benchmark for Detecting Human in a Crowd
Human detection has witnessed impressive progress in recent years. However,
the occlusion issue of detecting humans in highly crowded environments is far
from solved. To make matters worse, crowd scenarios are still under-represented
in current human detection benchmarks. In this paper, we introduce a new
dataset, called CrowdHuman, to better evaluate detectors in crowd scenarios.
The CrowdHuman dataset is large, richly annotated, and highly diverse.
There are a large number of human instances across the train and validation
subsets, with many persons per image and various kinds of occlusions in the
dataset. Each human instance is annotated with a head bounding-box, human
visible-region bounding-box and human full-body bounding-box. Baseline
performance of state-of-the-art detection frameworks on CrowdHuman is
presented. The cross-dataset generalization results of the CrowdHuman dataset
demonstrate state-of-the-art performance on previous datasets, including
Caltech-USA, CityPersons, and Brainwash, without bells and whistles. We hope our
dataset will serve as a solid baseline and help promote future research in
human detection tasks.
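Concretely, the triple annotation described above could be represented per image as follows; a minimal sketch, with field names that are illustrative rather than the dataset's actual schema:

```python
# Hypothetical per-image record mirroring CrowdHuman's three annotation
# types; field names are illustrative, not the dataset's actual schema.
# Boxes are [x, y, w, h] in pixels.
annotation = {
    "image_id": "example_000001",
    "instances": [
        {
            "head_bbox":    [210, 95, 40, 45],
            "visible_bbox": [190, 90, 85, 160],
            "full_bbox":    [185, 85, 95, 280],  # may extend behind occluders
        },
    ],
}

def boxes_per_instance(record):
    # Check that every instance carries all three box types.
    required = {"head_bbox", "visible_bbox", "full_bbox"}
    return all(required <= set(inst) for inst in record["instances"])
```

The full-body box is annotated even where the body is occluded, which is what makes the visible-region box a distinct, informative signal.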
SSA-CNN: Semantic Self-Attention CNN for Pedestrian Detection
Pedestrian detection plays an important role in many applications such as
autonomous driving. We propose a method that explores semantic segmentation
results as self-attention cues to significantly improve the pedestrian
detection performance. Specifically, a multi-task network is designed to
jointly learn semantic segmentation and pedestrian detection from image
datasets with weak box-wise annotations. The semantic segmentation feature maps
are concatenated with the corresponding convolutional feature maps to provide more
discriminative features for pedestrian detection and pedestrian classification.
By jointly learning segmentation and detection, our proposed pedestrian
self-attention mechanism can effectively identify pedestrian regions and
suppress backgrounds. In addition, we propose to incorporate semantic attention
information from multi-scale layers into a deep convolutional neural network to
boost pedestrian detection. Experimental results show that the proposed method
achieves the best detection performance, with an MR of 6.27% on the Caltech
dataset, and obtains competitive performance on the CityPersons dataset while
maintaining high computational efficiency. Comment: wrong setting in CityPersons experiment
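The fusion described above is, at its core, a channel-wise concatenation of the two feature maps. A minimal NumPy sketch of that single step (shapes and the function name are assumptions, not the paper's code):

```python
import numpy as np

def fuse_semantic_attention(conv_feat, seg_feat):
    """Concatenate semantic-segmentation feature maps with the backbone's
    convolutional feature maps along the channel axis, so the detection
    head sees both as one tensor.

    conv_feat: (C1, H, W) backbone features
    seg_feat:  (C2, H, W) segmentation features at the same spatial scale
    """
    assert conv_feat.shape[1:] == seg_feat.shape[1:], "spatial sizes must match"
    return np.concatenate([conv_feat, seg_feat], axis=0)
```

The segmentation channels act as spatial cues: downstream convolutions can weight pedestrian-likely locations more heavily, which is the "self-attention" effect the abstract describes.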
Adaptive NMS: Refining Pedestrian Detection in a Crowd
Pedestrian detection in a crowd is a very challenging issue. This paper
addresses this problem by a novel Non-Maximum Suppression (NMS) algorithm to
better refine the bounding boxes given by detectors. The contributions are
threefold: (1) we propose adaptive-NMS, which applies a dynamic suppression
threshold to an instance, according to the target density; (2) we design an
efficient subnetwork to learn density scores, which can be conveniently
embedded into both single-stage and two-stage detectors; and (3) we achieve
state-of-the-art results on the CityPersons and CrowdHuman benchmarks. Comment: To appear at CVPR 2019 (Oral)
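The dynamic threshold in (1) can be sketched as a small modification of greedy NMS: the per-instance threshold is N_M = max(N_t, d_M), where d_M is the density at the kept box. Densities are assumed given here, whereas the paper learns them with a subnetwork:

```python
import numpy as np

def iou(box, boxes):
    # IoU of one box [x1, y1, x2, y2] against an (N, 4) array of boxes.
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = (box[2] - box[0]) * (box[3] - box[1])
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area + areas - inter)

def adaptive_nms(boxes, scores, densities, base_thresh=0.5):
    # Greedy NMS with a per-instance suppression threshold
    # N_M = max(N_t, d_M): in dense regions the threshold rises so that
    # genuinely overlapping pedestrians survive suppression.
    order = np.argsort(-scores)
    keep = []
    suppressed = np.zeros(len(boxes), dtype=bool)
    for i in order:
        if suppressed[i]:
            continue
        keep.append(int(i))
        thresh = max(base_thresh, densities[i])
        suppressed |= iou(boxes[i], boxes) > thresh
    return keep
```

In an isolated region the density is low and the rule reduces to standard NMS; in a crowd the raised threshold keeps true neighbours that plain NMS would discard.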
Occlusion-aware R-CNN: Detecting Pedestrians in a Crowd
Pedestrian detection in crowded scenes is a challenging problem since the
pedestrians often gather together and occlude each other. In this paper, we
propose a new occlusion-aware R-CNN (OR-CNN) to improve the detection accuracy
in the crowd. Specifically, we design a new aggregation loss to enforce
proposals to be close to and locate compactly around the corresponding objects.
Meanwhile, we use a new part occlusion-aware region of interest (PORoI) pooling
unit to replace the RoI pooling layer in order to integrate the prior structure
information of the human body with visibility prediction into the network to handle
occlusion. Our detector is trained in an end-to-end fashion, which achieves
state-of-the-art results on three pedestrian detection datasets, i.e.,
CityPersons, ETH, and INRIA, and performs on par with the state of the art on
Caltech. Comment: Accepted by ECCV 2018
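The aggregation loss described above pulls the proposals assigned to one pedestrian together. A simplified NumPy stand-in for that compactness idea (not the paper's exact formulation):

```python
import numpy as np

def aggregation_loss(proposals):
    """Compactness penalty on proposals assigned to the same ground-truth
    pedestrian: the mean L1 distance of proposal centres to their average
    centre. A simplified stand-in for the paper's aggregation loss, meant
    only to show the idea of pulling a group of proposals together.

    proposals: (N, 4) boxes as [x1, y1, x2, y2].
    """
    centres = 0.5 * (proposals[:, :2] + proposals[:, 2:])
    return np.abs(centres - centres.mean(axis=0)).mean()
```

Minimising such a term discourages proposals for one person from drifting onto a neighbouring, heavily-overlapping person, which is the failure mode in crowds.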
PSC-Net: Learning Part Spatial Co-occurrence for Occluded Pedestrian Detection
Detecting pedestrians, especially under heavy occlusions, is a challenging
computer vision problem with numerous real-world applications. This paper
introduces a novel approach, termed as PSC-Net, for occluded pedestrian
detection. The proposed PSC-Net contains a dedicated module that is designed to
explicitly capture both inter and intra-part co-occurrence information of
different pedestrian body parts through a Graph Convolutional Network (GCN).
Both inter and intra-part co-occurrence information contribute towards
improving the feature representation for handling varying levels of occlusion,
ranging from partial to severe occlusions. Our PSC-Net exploits the topological
structure of pedestrians and does not require part-based annotations or
additional visible bounding-box (VBB) information to learn part spatial
co-occurrence. Comprehensive experiments are performed on two challenging
datasets: CityPersons and Caltech datasets. The proposed PSC-Net achieves
state-of-the-art detection performance on both. On the heavily occluded
(HO) set of the CityPersons test set, our PSC-Net obtains an absolute gain
of 4.0% in terms of log-average miss rate over the state of the art with the
same backbone and input scale, and without using additional VBB supervision.
Further, PSC-Net improves the state of the art from 37.9 to 34.8 in terms of
log-average miss rate on the Caltech (HO) test set.
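The part co-occurrence module is built on a GCN. A minimal row-normalised graph-convolution step over part nodes, with the part graph, shapes, and weights as assumptions:

```python
import numpy as np

def gcn_layer(X, A, W):
    """One graph-convolution step over pedestrian part nodes, sketching
    how part co-occurrence can propagate features along the body's
    topology. Row-normalised propagation with self-loops:
    relu(D^-1 (A + I) X W).

    X: (P, F) per-part features
    A: (P, P) 0/1 part adjacency (the body's topological structure)
    W: (F, F_out) learnable weights
    """
    A_hat = A + np.eye(len(A))          # add self-loops
    D_inv = 1.0 / A_hat.sum(axis=1, keepdims=True)  # row normalisation
    return np.maximum(0.0, D_inv * (A_hat @ X @ W))
```

Because features flow along edges, a visible part (say, the head) can inform the representation of an occluded neighbouring part, which is how co-occurrence helps under occlusion.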
Mask-Guided Attention Network for Occluded Pedestrian Detection
Pedestrian detection relying on deep convolution neural networks has made
significant progress. Though promising results have been achieved on standard
pedestrians, the performance on heavily occluded pedestrians remains far from
satisfactory. The main culprits are intra-class occlusions involving other
pedestrians and inter-class occlusions caused by other objects, such as cars
and bicycles. These result in a multitude of occlusion patterns. We propose an
approach for occluded pedestrian detection with the following contributions.
First, we introduce a novel mask-guided attention network that fits naturally
into popular pedestrian detection pipelines. Our attention network emphasizes
visible pedestrian regions while suppressing occluded ones by modulating
full body features. Second, we empirically demonstrate that coarse-level
segmentation annotations provide a reasonable approximation of their dense
pixel-wise counterparts. Experiments are performed on CityPersons and Caltech
datasets. Our approach sets a new state-of-the-art on both datasets. Our
approach obtains an absolute gain of 9.5% in log-average miss rate, compared to
the best reported results on the heavily occluded (HO) pedestrian set of
CityPersons test set. Further, on the HO pedestrian set of Caltech dataset, our
method achieves an absolute gain of 5.0% in log-average miss rate, compared to
the best reported results. Code and models are available at:
https://github.com/Leotju/MGAN. Comment: Accepted at ICCV 2019
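The modulation step described above can be sketched in a few lines; the attention here is simply a sigmoid of assumed mask logits, whereas the paper learns the attention branch end-to-end:

```python
import numpy as np

def mask_guided_attention(body_feat, mask_logits):
    """Modulate full-body RoI features with a coarse visibility mask:
    a sigmoid turns mask logits into per-location attention that keeps
    visible regions and damps occluded ones. A sketch of the modulation
    step only.

    body_feat:   (C, H, W) full-body RoI features
    mask_logits: (H, W) visible-region prediction (higher = more visible)
    """
    attn = 1.0 / (1.0 + np.exp(-mask_logits))  # sigmoid
    return body_feat * attn[None, :, :]        # broadcast over channels
```

The coarse-level annotations the abstract mentions are enough precisely because the mask is only used as a soft multiplicative gate, not as a pixel-accurate segmentation target.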
Learning to Separate: Detecting Heavily-Occluded Objects in Urban Scenes
While visual object detection with deep learning has received much attention
in the past decade, cases when heavy intra-class occlusions occur have not been
studied thoroughly. In this work, we propose a Non-Maximum-Suppression (NMS)
algorithm that dramatically improves the detection recall while maintaining
high precision in scenes with heavy occlusions. Our NMS algorithm is derived
from a novel embedding mechanism, in which the semantic and geometric features
of the detected boxes are jointly exploited. The embedding makes it possible to
determine whether two heavily-overlapping boxes belong to the same object in
the physical world. Our approach is particularly useful for car detection and
pedestrian detection in urban scenes where occlusions often happen. We show the
effectiveness of our approach by creating a model called SG-Det (short for
Semantics and Geometry Detection) and testing SG-Det on two widely-adopted
datasets, KITTI and CityPersons for which it achieves state-of-the-art
performance. Comment: ECCV 2020
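The embedding mechanism can be sketched as greedy NMS with one extra test: a heavily-overlapping box is suppressed only if its embedding is also close to the kept box, i.e. the two boxes likely cover the same physical object. Embeddings and thresholds are assumed given here, whereas the paper learns them:

```python
import numpy as np

def iou(box, boxes):
    # IoU of one box [x1, y1, x2, y2] against an (N, 4) array of boxes.
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = (box[2] - box[0]) * (box[3] - box[1])
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area + areas - inter)

def embedding_nms(boxes, scores, embeddings, iou_thresh=0.5, dist_thresh=1.0):
    # Suppress a box only when it BOTH overlaps a kept box heavily AND
    # sits close to it in embedding space (same physical object).
    order = np.argsort(-scores)
    keep, suppressed = [], np.zeros(len(boxes), dtype=bool)
    for i in order:
        if suppressed[i]:
            continue
        keep.append(int(i))
        overlaps = iou(boxes[i], boxes)
        dists = np.linalg.norm(embeddings - embeddings[i], axis=1)
        suppressed |= (overlaps > iou_thresh) & (dists < dist_thresh)
    return keep
```

Two occluding pedestrians produce high IoU but distant embeddings, so both survive; duplicate detections of one person produce high IoU and near embeddings, so only one survives.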
NMS by Representative Region: Towards Crowded Pedestrian Detection by Proposal Pairing
Although significant progress has been made in pedestrian detection recently,
pedestrian detection in crowded scenes is still challenging. The heavy
occlusion between pedestrians imposes great challenges to the standard
Non-Maximum Suppression (NMS). A relatively low threshold of intersection over
union (IoU) leads to missing highly overlapped pedestrians, while a higher one
brings in plenty of false positives. To avoid such a dilemma, this paper
proposes a novel Representative Region NMS approach leveraging the less
occluded visible parts, effectively removing the redundant boxes without
bringing in many false positives. To acquire the visible parts, a novel
Paired-Box Model (PBM) is proposed to simultaneously predict the full and
visible boxes of a pedestrian. The full and visible boxes constitute a pair
serving as the sample unit of the model, thus guaranteeing a strong
correspondence between the two boxes throughout the detection pipeline.
Moreover, the pairing allows convenient feature integration of the two boxes
for better performance on both full and visible pedestrian detection tasks.
Experiments on the challenging CrowdHuman and CityPersons benchmarks
sufficiently validate the effectiveness of the proposed approach on pedestrian
detection in crowded situations. Comment: Accepted by CVPR 2020. The first two
authors contributed equally, and are listed in alphabetical order.
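The representative-region idea can be sketched as greedy NMS whose overlap test runs on the visible boxes of each pair rather than the full-body boxes; box pairs are assumed given here, whereas the paper predicts them jointly with the PBM:

```python
import numpy as np

def iou(box, boxes):
    # IoU of one box [x1, y1, x2, y2] against an (N, 4) array of boxes.
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = (box[2] - box[0]) * (box[3] - box[1])
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area + areas - inter)

def visible_region_nms(full_boxes, visible_boxes, scores, thresh=0.5):
    # Greedy NMS that measures overlap on the visible (less occluded)
    # boxes of each full/visible pair, sidestepping the IoU-threshold
    # dilemma on full-body boxes in crowds.
    order = np.argsort(-scores)
    keep, suppressed = [], np.zeros(len(full_boxes), dtype=bool)
    for i in order:
        if suppressed[i]:
            continue
        keep.append(int(i))
        suppressed |= iou(visible_boxes[i], visible_boxes) > thresh
    return keep
```

Two occluding pedestrians have heavily-overlapping full boxes but nearly disjoint visible regions, so a moderate threshold keeps both without admitting duplicates.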
Multiview Detection with Feature Perspective Transformation
Incorporating multiple camera views for detection alleviates the impact of
occlusions in crowded scenes. In a multiview system, we need to answer two
important questions when dealing with ambiguities that arise from occlusions.
First, how should we aggregate cues from the multiple views? Second, how should
we aggregate unreliable 2D and 3D spatial information that has been tainted by
occlusions? To address these questions, we propose a novel multiview detection
system, MVDet. For multiview aggregation, existing methods combine anchor box
features from the image plane, which potentially limits performance due to
inaccurate anchor box shapes and sizes. In contrast, we take an anchor-free
approach to aggregate multiview information by projecting feature maps onto the
ground plane (bird's eye view). To resolve any remaining spatial ambiguity, we
apply large kernel convolutions on the ground plane feature map and infer
locations from detection peaks. Our entire model is end-to-end learnable and
achieves 88.2% MODA on the standard Wildtrack dataset, outperforming the
state-of-the-art by 14.1%. We also provide a detailed analysis of MVDet on a
newly introduced synthetic dataset, MultiviewX, which allows us to control the
level of occlusion. Code and MultiviewX dataset are available at
https://github.com/hou-yz/MVDet
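The ground-plane aggregation can be sketched as a homography warp of an image-plane feature map onto a bird's-eye-view grid. This NumPy version uses nearest-neighbour sampling and assumes the homography is known from calibration, whereas MVDet performs the transform differentiably inside the network:

```python
import numpy as np

def warp_features_to_ground(feat, H_img_from_ground, grid_h, grid_w):
    """Project an image-plane feature map onto a ground-plane (bird's eye
    view) grid via a planar homography. Each ground cell is mapped to a
    pixel; out-of-image cells stay zero.

    feat: (C, Hf, Wf) image features
    H_img_from_ground: 3x3 homography mapping ground cells to pixels
    """
    C, Hf, Wf = feat.shape
    out = np.zeros((C, grid_h, grid_w), dtype=feat.dtype)
    ys, xs = np.mgrid[0:grid_h, 0:grid_w]
    pts = np.stack([xs.ravel(), ys.ravel(), np.ones(xs.size)])
    uvw = H_img_from_ground @ pts            # homogeneous pixel coords
    u = np.round(uvw[0] / uvw[2]).astype(int)
    v = np.round(uvw[1] / uvw[2]).astype(int)
    ok = (u >= 0) & (u < Wf) & (v >= 0) & (v < Hf)
    out[:, ys.ravel()[ok], xs.ravel()[ok]] = feat[:, v[ok], u[ok]]
    return out
```

Summing such warped maps over cameras puts all views into one coordinate frame, after which the large-kernel ground-plane convolutions can resolve the remaining spatial ambiguity.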
V2F-Net: Explicit Decomposition of Occluded Pedestrian Detection
Occlusion is very challenging in pedestrian detection. In this paper, we
propose a simple yet effective method named V2F-Net, which explicitly
decomposes occluded pedestrian detection into visible region detection and full
body estimation. V2F-Net consists of two sub-networks: Visible region Detection
Network (VDN) and Full body Estimation Network (FEN). VDN tries to localize
visible regions, and FEN estimates the full-body box on the basis of the visible
box. Moreover, to further improve the full-body estimation, we propose a
novel Embedding-based Part-aware Module (EPM). By supervising the visibility
for each part, the network is encouraged to extract features with essential
part information. We experimentally show the effectiveness of V2F-Net by
conducting several experiments on two challenging datasets. V2F-Net achieves
a 5.85% AP gain on CrowdHuman and a 2.24% MR-2 improvement on CityPersons
compared to an FPN baseline. Besides, the consistent gains on both one-stage and
two-stage detectors validate the generalizability of our method. Comment: 11 pages, 4 figures
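The visible-to-full estimation in FEN can be sketched with the standard box-delta parameterisation (dx, dy, dw, dh) relative to the visible box; the actual head may regress a different encoding:

```python
import numpy as np

def full_from_visible(visible_box, deltas):
    """Decode a full-body box from a visible box plus predicted offsets,
    using the standard box-delta parameterisation: centre shifts are
    scaled by the visible box size, and width/height grow exponentially.

    visible_box: [x1, y1, x2, y2]
    deltas:      [dx, dy, dw, dh] predicted by a regression head
    """
    x1, y1, x2, y2 = visible_box
    cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0
    w, h = x2 - x1, y2 - y1
    dx, dy, dw, dh = deltas
    ncx, ncy = cx + dx * w, cy + dy * h
    nw, nh = w * np.exp(dw), h * np.exp(dh)
    return [ncx - nw / 2, ncy - nh / 2, ncx + nw / 2, ncy + nh / 2]
```

Decomposing detection this way lets the visible branch do the (easier) localisation under occlusion, while the full-body branch only has to learn a conditional expansion.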