Multi-modal Queried Object Detection in the Wild
We introduce MQ-Det, an efficient architecture and pre-training strategy
designed to utilize both textual descriptions, with their open-set
generalization, and visual exemplars, with their rich description granularity,
as category queries, namely Multi-modal Queried object Detection, for
real-world detection with open-vocabulary categories at varied granularity.
MQ-Det incorporates vision
queries into existing well-established language-queried-only detectors. A
plug-and-play gated class-scalable perceiver module upon the frozen detector is
proposed to augment category text with class-wise visual information. To
address the learning inertia problem brought by the frozen detector, a vision
conditioned masked language prediction strategy is proposed. MQ-Det's simple
yet effective architecture and training strategy are compatible with most
language-queried object detectors, thus yielding versatile applications.
Experimental results demonstrate that multi-modal queries largely boost
open-world detection. For instance, MQ-Det significantly improves the
state-of-the-art open-set detector GLIP by +7.8% zero-shot AP on the LVIS
benchmark and by an average of +6.3% AP across 13 few-shot downstream tasks,
using merely 3% of the pre-training time required by GLIP. Code is available at
https://github.com/YifanXu74/MQ-Det
Comment: Under review
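The abstract only describes the gated class-scalable perceiver at a high level. As an illustrative sketch, not the authors' implementation, one way to picture "augmenting category text with class-wise visual information" is to add a sigmoid-gated, attention-weighted summary of a class's visual exemplars to its text embedding, so a near-closed gate leaves the frozen text pathway intact (all function names and shapes here are assumptions):

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def gated_visual_augmentation(text_emb, exemplar_embs, gate_logit=-2.0):
    """Hypothetical sketch of gated multi-modal query fusion.

    text_emb: (d,) embedding of the category name (frozen text pathway).
    exemplar_embs: (k, d) embeddings of k visual exemplars for the class.
    gate_logit: scalar; a sigmoid gate near zero admits little visual
        information, so the frozen detector's behavior is preserved.
    """
    attn = softmax(exemplar_embs @ text_emb)   # (k,) relevance of each exemplar
    visual = attn @ exemplar_embs              # (d,) attended visual summary
    gate = 1.0 / (1.0 + np.exp(-gate_logit))   # scalar sigmoid gate
    return text_emb + gate * visual            # augmented category query

rng = np.random.default_rng(0)
t = rng.standard_normal(8)        # toy text embedding
v = rng.standard_normal((3, 8))   # three toy visual exemplars
aug = gated_visual_augmentation(t, v)
```

Because the fusion is additive and per-class, new categories can be queried simply by supplying their own exemplar set, which is one plausible reading of "class-scalable".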
Monocular SLAM Supported Object Recognition
In this work, we develop a monocular SLAM-aware object recognition system
that achieves considerably stronger recognition performance than classical
object recognition systems operating on a frame-by-frame basis. By
incorporating several key ideas, including multi-view
object proposals and efficient feature encoding methods, our proposed system is
able to detect and robustly recognize objects in its environment using a single
RGB camera in near-constant time. Through experiments, we illustrate the
utility of using such a system to effectively detect and recognize objects,
incorporating multiple object viewpoint detections into a unified prediction
hypothesis. The performance of the proposed recognition system is evaluated on
the UW RGB-D Dataset, showing strong recognition performance and scalable
run-time performance compared to current state-of-the-art recognition systems.
Comment: Accepted to appear at Robotics: Science and Systems 2015, Rome, Italy
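The abstract describes incorporating multiple object viewpoint detections into a unified prediction hypothesis without specifying the fusion rule. A minimal hedged sketch, assuming each SLAM-tracked object accumulates one class-probability vector per viewpoint, is to fuse views by averaging log-probabilities (a geometric mean), which rewards classes supported consistently across viewpoints:

```python
import numpy as np

def fuse_viewpoint_detections(per_view_probs):
    """Hypothetical multi-view fusion: geometric mean of per-view
    class probabilities, renormalized to sum to one.

    per_view_probs: (n_views, n_classes), each row a probability
    distribution from one viewpoint of the same tracked object.
    """
    logp = np.log(np.clip(per_view_probs, 1e-12, None))  # avoid log(0)
    fused = np.exp(logp.mean(axis=0))                    # geometric mean per class
    return fused / fused.sum()                           # renormalize

views = np.array([
    [0.6, 0.3, 0.1],   # viewpoint 1: class 0 likely
    [0.5, 0.4, 0.1],   # viewpoint 2
    [0.7, 0.2, 0.1],   # viewpoint 3
])
fused = fuse_viewpoint_detections(views)
label = int(fused.argmax())
```

Compared with taking a single frame's prediction, this kind of aggregation suppresses classes that score well in only one accidental viewpoint, which is the intuition behind SLAM-aware recognition outperforming frame-by-frame systems.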
Scalable Object Detection using Deep Neural Networks
Deep convolutional neural networks have recently achieved state-of-the-art
performance on a number of image recognition benchmarks, including the ImageNet
Large-Scale Visual Recognition Challenge (ILSVRC-2012). The winning model on
the localization sub-task was a network that predicts a single bounding box and
a confidence score for each object category in the image. Such a model captures
the whole-image context around the objects but cannot handle multiple instances
of the same object in the image without naively replicating the number of
outputs for each instance. In this work, we propose a saliency-inspired neural
network model for detection, which predicts a set of class-agnostic bounding
boxes along with a single score for each box, corresponding to its likelihood
of containing any object of interest. The model naturally handles a variable
number of instances for each class and allows for cross-class generalization at
the highest levels of the network. We are able to obtain competitive
recognition performance on VOC2007 and ILSVRC2012, while using only the top few
predicted locations in each image and a small number of neural network
evaluations.
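The key efficiency claim is that only "the top few predicted locations" need be classified. As an illustrative sketch (not the paper's network), selecting class-agnostic boxes by a single objectness score decouples localization cost from the number of classes, since the classifier is then run only on the k kept boxes:

```python
import numpy as np

def top_objectness_boxes(boxes, scores, k=3):
    """Keep the k class-agnostic boxes most likely to contain *any* object.

    boxes:  (n, 4) array of (x1, y1, x2, y2) coordinates.
    scores: (n,) objectness scores in [0, 1], one per box (no class labels).
    A downstream classifier labels only the k survivors, so the number of
    expensive network evaluations stays small regardless of class count.
    """
    order = np.argsort(-scores)[:k]   # indices of the k highest scores
    return boxes[order], scores[order]

boxes = np.array([[0, 0, 10, 10],
                  [5, 5, 20, 20],
                  [1, 1, 4, 4],
                  [8, 8, 30, 30]], dtype=float)
scores = np.array([0.9, 0.2, 0.75, 0.6])
kept_boxes, kept_scores = top_objectness_boxes(boxes, scores, k=2)
```

Because the scores are class-agnostic, a variable number of instances per class falls out naturally: two dogs simply occupy two of the kept boxes, with no per-class output replication.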