Bottom-up Object Detection by Grouping Extreme and Center Points
With the advent of deep learning, object detection drifted from a bottom-up
to a top-down recognition problem. State-of-the-art algorithms enumerate a
near-exhaustive list of object locations and classify each as object or not.
In this paper, we show that bottom-up approaches still perform competitively.
We detect four extreme points (top-most, left-most, bottom-most, right-most)
and one center point of objects using a standard keypoint estimation network.
We group the five keypoints into a bounding box if they are geometrically
aligned. Object detection is then a purely appearance-based keypoint estimation
problem, without region classification or implicit feature learning. The
proposed method performs on par with state-of-the-art region-based
detection methods, with a bounding box AP of 43.2% on COCO test-dev. In
addition, our estimated extreme points directly span a coarse octagonal mask,
with a COCO Mask AP of 18.9%, much better than the Mask AP of vanilla bounding
boxes. Extreme point guided segmentation further improves this to 34.6% Mask
AP.
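As a rough illustration of the grouping step described above (a sketch only, not the paper's implementation; the function name, threshold, and single-candidate interface are assumptions), the geometric test amounts to checking that the center of the box spanned by the four extreme points lands on a sufficiently confident location of the center heatmap:

```python
import numpy as np

def group_extreme_points(top, left, bottom, right, center_heatmap, center_thresh=0.1):
    """Accept a (top, left, bottom, right) combination only if the center of the
    box it spans coincides with a confident center-heatmap response."""
    x_min, x_max = left[0], right[0]
    y_min, y_max = top[1], bottom[1]
    if x_max <= x_min or y_max <= y_min:
        return None  # not a geometrically valid combination

    # Geometric center of the spanned box.
    cx = int(round((x_min + x_max) / 2.0))
    cy = int(round((y_min + y_max) / 2.0))

    if center_heatmap[cy, cx] >= center_thresh:
        score = float(center_heatmap[cy, cx])
        return (x_min, y_min, x_max, y_max, score)
    return None

# Tiny example: a single confident center response at (x=60, y=50).
heat = np.zeros((128, 128)); heat[50, 60] = 0.9
box = group_extreme_points(top=(60, 20), left=(30, 50), bottom=(58, 80),
                           right=(90, 52), center_heatmap=heat)
```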
Improving Object Detection from Scratch via Gated Feature Reuse
In this paper, we present a simple and parameter-efficient drop-in module for
one-stage object detectors like SSD when learning from scratch (i.e., without
pre-trained models). We call our module GFR (Gated Feature Reuse), which
exhibits two main advantages. First, we introduce a novel gate-controlled
prediction strategy enabled by Squeeze-and-Excitation to adaptively enhance or
attenuate supervision at different scales based on the input object size. As a
result, our model is more effective in detecting diverse sizes of objects.
Second, we propose a feature-pyramids structure to squeeze rich spatial and
semantic features into a single prediction layer, which strengthens feature
representation and reduces the number of parameters to learn. We apply the
proposed structure on DSOD and SSD detection frameworks, and evaluate the
performance on PASCAL VOC 2007, 2012 and COCO datasets. With fewer model
parameters, GFR-DSOD outperforms the baseline DSOD by 1.4%, 1.1%, 1.7% and
0.6%, respectively. GFR-SSD also outperforms the original SSD and SSD with
dense prediction by 3.6% and 2.8% on VOC 2007 dataset. Code is available at:
https://github.com/szq0214/GFR-DSOD (accepted at BMVC 2019).
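A minimal sketch of a Squeeze-and-Excitation style gate of the kind the gate-controlled prediction strategy builds on (the module name, layer sizes, and reduction ratio are assumptions; this is not the GFR module itself):

```python
import torch
import torch.nn as nn

class SEGate(nn.Module):
    """Minimal Squeeze-and-Excitation gate: learns a per-channel weight in [0, 1]
    that enhances or attenuates a prediction feature map."""

    def __init__(self, channels, reduction=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):
        # Squeeze: global average pooling over the spatial dimensions.
        b, c, _, _ = x.shape
        w = x.mean(dim=(2, 3))
        # Excitation: per-channel gate in [0, 1].
        w = self.fc(w).view(b, c, 1, 1)
        # Scale the feature map channel-wise.
        return x * w
```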
Mask Encoding for Single Shot Instance Segmentation
To date, instance segmentation is dominated by two-stage methods, as pioneered
by Mask R-CNN. In contrast, one-stage alternatives cannot compete with Mask
R-CNN in mask AP, mainly due to the difficulty of compactly representing masks,
making the design of one-stage methods very challenging. In this work, we
propose a simple single-shot instance segmentation framework, termed mask
encoding based instance segmentation (MEInst). Instead of predicting the
two-dimensional mask directly, MEInst distills it into a compact and
fixed-dimensional representation vector, which allows the instance segmentation
task to be incorporated into one-stage bounding-box detectors and results in a
simple yet efficient instance segmentation framework. The proposed one-stage
MEInst achieves 36.4% in mask AP with single-model (ResNeXt-101-FPN backbone)
and single-scale testing on the MS-COCO benchmark. We show that this much
simpler and more flexible one-stage instance segmentation method can also achieve
competitive performance. This framework can be easily adapted for other
instance-level recognition tasks. Code is available at:
https://git.io/AdelaiDet (accepted to CVPR 2020).
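The core idea of distilling a 2D mask into a compact, fixed-dimensional vector can be sketched with a linear (PCA-style) encoding; the 28x28 mask size and 60-dimensional code below are illustrative assumptions, not the paper's exact settings:

```python
import numpy as np
from sklearn.decomposition import PCA

# Toy stand-in for cropped, resized ground-truth masks.
masks = (np.random.rand(1000, 28, 28) > 0.5).astype(np.float32)
flat = masks.reshape(len(masks), -1)             # (N, 784)

pca = PCA(n_components=60).fit(flat)             # learn the linear mask encoding

codes = pca.transform(flat)                      # fixed-dimensional targets a
                                                 # box detector could regress
recon = pca.inverse_transform(codes)             # decode back to 2D masks
recon_masks = recon.reshape(-1, 28, 28) > 0.5    # binarize the reconstruction
```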
Learning from Noisy Anchors for One-stage Object Detection
State-of-the-art object detectors rely on regressing and classifying an
extensive list of possible anchors, which are divided into positive and
negative samples based on their intersection-over-union (IoU) with
corresponding ground-truth objects. Such a harsh split conditioned on IoU
results in binary labels that are potentially noisy and challenging for
training. In this paper, we propose to mitigate noise incurred by imperfect
label assignment such that the contributions of anchors are dynamically
determined by a carefully constructed cleanliness score associated with each
anchor. Exploring outputs from both regression and classification branches, the
cleanliness scores, estimated without incurring any additional computational
overhead, are used not only as soft labels to supervise the training of the
classification branch but also as sample re-weighting factors for improved
localization and classification accuracy. We conduct extensive experiments on
COCO, and demonstrate, among other things, that the proposed approach steadily
improves RetinaNet by ~2% with various backbones (CVPR 2020).
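A minimal sketch of what a per-anchor cleanliness score could look like, mixing localization accuracy and classification confidence; the mixing weight and the normalization below are assumptions, not the paper's formulation:

```python
import numpy as np

def cleanliness_scores(loc_iou, cls_conf, alpha=0.75):
    """Per-anchor cleanliness score combining localization accuracy (IoU of the
    regressed box with its ground truth) and classification confidence.
    The mixing weight alpha is an illustrative assumption."""
    return alpha * loc_iou + (1.0 - alpha) * cls_conf

# The scores can then serve both as soft classification labels and as
# per-anchor re-weighting factors for the losses.
loc_iou = np.array([0.82, 0.55, 0.31])
cls_conf = np.array([0.90, 0.40, 0.20])
soft_labels = cleanliness_scores(loc_iou, cls_conf)
loss_weights = soft_labels / soft_labels.mean()   # one simple normalization choice
```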
LightTrack: A Generic Framework for Online Top-Down Human Pose Tracking
In this paper, we propose a novel effective light-weight framework, called
LightTrack, for online human pose tracking. The proposed framework is designed
to be generic for top-down pose tracking and is faster than existing online and
offline methods. Single-person Pose Tracking (SPT) and Visual Object Tracking
(VOT) are incorporated into one unified functioning entity, easily implemented
by a replaceable single-person pose estimation module. Our framework unifies
single-person pose tracking with multi-person identity association and sheds
first light upon bridging keypoint tracking with object tracking. We also
propose a Siamese Graph Convolution Network (SGCN) for human pose matching as a
Re-ID module in our pose tracking system. In contrast to other Re-ID modules,
we use a graphical representation of human joints for matching. The
skeleton-based representation effectively captures human pose similarity and is
computationally inexpensive. It is robust to sudden camera shifts that
introduce human drifting. To the best of our knowledge, this is the first
paper to propose an online human pose tracking framework in a top-down fashion.
The proposed framework is general enough to fit other pose estimators and
candidate matching mechanisms. Our method outperforms other online methods
while maintaining a much higher frame rate, and is very competitive with our
offline state-of-the-art. We make the code publicly available at:
https://github.com/Guanghan/lighttrack.
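A toy graph-convolutional pose embedder in the spirit of the SGCN Re-ID module (the skeleton edges, layer sizes, and pooling are assumptions; this is a sketch, not the paper's network):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# COCO-style skeleton edges over 17 joints (an assumption for this sketch).
EDGES = [(0, 1), (0, 2), (1, 3), (2, 4), (5, 6), (5, 7), (7, 9), (6, 8),
         (8, 10), (5, 11), (6, 12), (11, 12), (11, 13), (13, 15), (12, 14), (14, 16)]

def normalized_adjacency(num_joints=17):
    a = torch.eye(num_joints)
    for i, j in EDGES:
        a[i, j] = a[j, i] = 1.0
    return a / a.sum(dim=1, keepdim=True)   # simple row normalization

class PoseEmbedder(nn.Module):
    """Minimal graph-convolutional embedder over the joint skeleton."""

    def __init__(self, in_dim=2, hid=64, out=128, num_joints=17):
        super().__init__()
        self.register_buffer("adj", normalized_adjacency(num_joints))
        self.gc1 = nn.Linear(in_dim, hid)
        self.gc2 = nn.Linear(hid, out)

    def forward(self, joints):                  # joints: (num_joints, 2) x, y
        x = F.relu(self.gc1(self.adj @ joints))
        x = self.gc2(self.adj @ x)
        return x.mean(dim=0)                    # pooled pose embedding

def pose_similarity(embedder, pose_a, pose_b):
    ea, eb = embedder(pose_a), embedder(pose_b)
    return F.cosine_similarity(ea, eb, dim=0)   # high score -> same identity

# Example: similarity between two noisy copies of the same pose.
pose_a = torch.rand(17, 2)
pose_b = pose_a + 0.01 * torch.randn(17, 2)
print(float(pose_similarity(PoseEmbedder(), pose_a, pose_b)))
```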
Vid2Game: Controllable Characters Extracted from Real-World Videos
We are given a video of a person performing a certain activity, from which we
extract a controllable model. The model generates novel image sequences of that
person, according to arbitrary user-defined control signals, typically marking
the displacement of the moving body. The generated video can have an arbitrary
background, and effectively capture both the dynamics and appearance of the
person.
The method is based on two networks. The first network maps a current pose,
and a single-instance control signal to the next pose. The second network maps
the current pose, the new pose, and a given background, to an output frame.
Both networks include multiple novelties that enable high-quality performance.
This is demonstrated on multiple characters extracted from various videos of
dancers and athletes.
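A toy sketch of the two-network decomposition and its rollout (all dimensions and architectures below are placeholders for illustration, not the paper's models):

```python
import torch
import torch.nn as nn

class Pose2Pose(nn.Module):
    """Stand-in for the first network: current pose + control signal -> next pose.
    The pose is a flat keypoint vector here; sizes are illustrative."""
    def __init__(self, pose_dim=34, ctrl_dim=2, hidden=128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(pose_dim + ctrl_dim, hidden),
                                 nn.ReLU(), nn.Linear(hidden, pose_dim))

    def forward(self, pose, control):
        return self.net(torch.cat([pose, control], dim=-1))

class Pose2Frame(nn.Module):
    """Stand-in for the second network: current pose, next pose, background -> frame
    (a toy decoder, purely for illustration)."""
    def __init__(self, pose_dim=34, height=64, width=64):
        super().__init__()
        self.to_image = nn.Linear(2 * pose_dim, 3 * height * width)
        self.h, self.w = height, width

    def forward(self, pose, next_pose, background):
        fg = self.to_image(torch.cat([pose, next_pose], dim=-1))
        fg = fg.view(-1, 3, self.h, self.w)
        return background + torch.tanh(fg)      # composite foreground onto background

# Rollout: feed control signals step by step, re-using the predicted pose.
pose2pose, pose2frame = Pose2Pose(), Pose2Frame()
pose = torch.zeros(1, 34)
background = torch.zeros(1, 3, 64, 64)
for control in [torch.tensor([[1.0, 0.0]]), torch.tensor([[0.0, 1.0]])]:
    next_pose = pose2pose(pose, control)
    frame = pose2frame(pose, next_pose, background)
    pose = next_pose
```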
End-to-End Wireframe Parsing
We present a conceptually simple yet effective algorithm to detect wireframes
in a given image. Compared to the previous methods which first predict an
intermediate heat map and then extract straight lines with heuristic
algorithms, our method is end-to-end trainable and can directly output a
vectorized wireframe that contains semantically meaningful and geometrically
salient junctions and lines. To better understand the quality of the outputs,
we propose a new metric for wireframe evaluation that penalizes overlapped line
segments and incorrect line connectivities. We conduct extensive experiments
and show that our method significantly outperforms the previous
state-of-the-art wireframe and line extraction algorithms. We hope our simple
approach can serve as a baseline for future wireframe parsing studies. Code
has been made publicly available at https://github.com/zhou13/lcnn.
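A small sketch of what a vectorized wireframe and an overlap-penalizing match might look like (the data layout, threshold, and greedy matching are assumptions; the paper's metric is defined differently):

```python
import numpy as np

# A vectorized wireframe: junction coordinates plus index pairs that say which
# junctions are connected by a line segment.
junctions = np.array([[10.0, 12.0], [80.0, 14.0], [78.0, 90.0]])
lines = np.array([[0, 1], [1, 2]])             # each row: (junction_i, junction_j)

def line_endpoint_error(pred_line, gt_line):
    """Symmetric mean endpoint distance between two segments given as 2x2 arrays."""
    d1 = np.linalg.norm(pred_line - gt_line, axis=1).mean()
    d2 = np.linalg.norm(pred_line - gt_line[::-1], axis=1).mean()
    return min(d1, d2)

def match_lines(pred, gt, thresh=5.0):
    """Greedy one-to-one matching: once a ground-truth segment is claimed it
    cannot be matched again, so duplicated or overlapping predictions are not
    rewarded (the spirit, not the letter, of the metric described above)."""
    used, tp = set(), 0
    for p in pred:
        errs = [(line_endpoint_error(p, g), k) for k, g in enumerate(gt) if k not in used]
        if errs:
            err, k = min(errs)
            if err <= thresh:
                used.add(k)
                tp += 1
    return tp  # number of correctly recovered segments
```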
Vertebra-Focused Landmark Detection for Scoliosis Assessment
Adolescent idiopathic scoliosis (AIS) is a lifetime disease that arises in
children. Accurate estimation of Cobb angles of the scoliosis is essential for
clinicians to make diagnosis and treatment decisions. The Cobb angles are
measured according to the vertebrae landmarks. Existing regression-based
methods for the vertebra landmark detection typically suffer from large dense
mapping parameters and inaccurate landmark localization. The segmentation-based
methods tend to predict connected or corrupted vertebra masks. In this paper,
we propose a novel vertebra-focused landmark detection method. Our model first
localizes the vertebra centers, based on which it then traces the four corner
landmarks of the vertebra through the learned corner offset. In this way, our
method is able to keep the order of the landmarks. The comparison results
demonstrate the merits of our method in both Cobb angle measurement and
landmark detection on low-contrast and ambiguous X-ray images. Code is
available at: https://github.com/yijingru/Vertebra-Landmark-Detection (accepted to ISBI 2020).
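The decoding step (corner landmarks = detected vertebra center + learned per-corner offsets) can be sketched as follows; the offset ordering convention is an assumption:

```python
import numpy as np

def corners_from_center(center, corner_offsets):
    """Recover the four corner landmarks of a vertebra by adding the learned
    per-corner offsets to the detected center. A fixed offset order (e.g.
    top-left, top-right, bottom-left, bottom-right) keeps the landmark order."""
    center = np.asarray(center, dtype=np.float32)             # (x, y)
    corner_offsets = np.asarray(corner_offsets, np.float32)   # (4, 2) offsets
    return center[None, :] + corner_offsets                   # (4, 2) corners

# Example: one vertebra center and its predicted offsets.
corners = corners_from_center((120.0, 260.0),
                              [(-15.0, -10.0), (15.0, -10.0),
                               (-15.0, 10.0), (15.0, 10.0)])
```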
RepPoints: Point Set Representation for Object Detection
Modern object detectors rely heavily on rectangular bounding boxes, such as
anchors, proposals and the final predictions, to represent objects at various
recognition stages. The bounding box is convenient to use but provides only a
coarse localization of objects and leads to a correspondingly coarse extraction
of object features. In this paper, we present RepPoints
(representative points), a new finer representation of objects as a set of
sample points useful for both localization and recognition. Given ground truth
localization and recognition targets for training, RepPoints learn to
automatically arrange themselves in a manner that bounds the spatial extent of
an object and indicates semantically significant local areas. They furthermore
do not require the use of anchors to sample a space of bounding boxes. We show
that an anchor-free object detector based on RepPoints can be as effective as
the state-of-the-art anchor-based detection methods, with 46.5 AP and 67.4 AP50
on the COCO test-dev detection benchmark, using ResNet-101 model.
Code is available at https://github.com/microsoft/RepPoints (ICCV 2019).
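One common way to turn a learned point set into a pseudo bounding box for the detection task is a simple min-max over the points (the paper also discusses moment-based alternatives); a minimal sketch:

```python
import numpy as np

def points_to_pseudo_box(points):
    """Convert a RepPoints-style point set into a bounding box by taking the
    min/max over the point coordinates."""
    points = np.asarray(points, dtype=np.float32)   # (n, 2) array of (x, y)
    x_min, y_min = points.min(axis=0)
    x_max, y_max = points.max(axis=0)
    return np.array([x_min, y_min, x_max, y_max])

# Example with a typical 9-point set.
pts = np.array([[30, 40], [55, 38], [80, 41], [28, 60], [54, 62],
                [82, 61], [31, 85], [56, 88], [79, 84]], dtype=np.float32)
box = points_to_pseudo_box(pts)   # array([28., 38., 82., 88.])
```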
SAIS: Single-stage Anchor-free Instance Segmentation
In this paper, we propose a simple yet efficient instance segmentation
approach based on a single-stage anchor-free detector, termed SAIS. In our
approach, the instance segmentation task consists of two parallel subtasks which
respectively predict the mask coefficients and the mask prototypes. Instance
masks are then generated by linearly combining the prototypes with the mask
coefficients. To enhance the quality of the instance masks, the information from
regression and classification is fused to predict the mask coefficients. In
addition, a center-aware target is designed to preserve the center coordination
of each instance, which achieves a stable improvement in instance segmentation.
Experiments on MS COCO show that SAIS achieves the performance of the existing
state-of-the-art single-stage methods with a much smaller memory footprint.
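The "linear combination of prototypes with coefficients" step can be sketched as below (the prototype count, resolution, and cropping are illustrative assumptions):

```python
import numpy as np

def assemble_instance_mask(prototypes, coefficients, box=None, thresh=0.5):
    """Build one instance mask as a linear combination of shared mask prototypes
    weighted by per-instance coefficients, passed through a sigmoid."""
    # prototypes: (k, H, W), coefficients: (k,)
    logits = np.tensordot(coefficients, prototypes, axes=1)   # (H, W)
    mask = 1.0 / (1.0 + np.exp(-logits))                      # sigmoid
    if box is not None:                                       # optionally crop to the box
        x1, y1, x2, y2 = box
        cropped = np.zeros_like(mask)
        cropped[y1:y2, x1:x2] = mask[y1:y2, x1:x2]
        mask = cropped
    return mask > thresh

prototypes = np.random.randn(32, 56, 56)        # k shared prototypes
coefficients = np.random.randn(32)              # one coefficient vector per detection
instance_mask = assemble_instance_mask(prototypes, coefficients, box=(10, 10, 40, 45))
```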
- …