StarNet: towards Weakly Supervised Few-Shot Object Detection
Few-shot detection and classification have advanced significantly in recent
years. Yet, detection approaches require strong annotation (bounding boxes)
both for pre-training and for adaptation to novel classes, and classification
approaches rarely provide localization of objects in the scene. In this paper,
we introduce StarNet - a few-shot model featuring an end-to-end differentiable
non-parametric star-model detection and classification head. Through this head,
the backbone is meta-trained using only image-level labels to produce good
features for jointly localizing and classifying previously unseen categories of
few-shot test tasks, using a star-model that geometrically matches the
query and support images (to find corresponding object instances). Being a
few-shot detector, StarNet requires no bounding box annotations, neither
during pre-training nor for adaptation to novel classes. It can thus be
applied to the previously unexplored and challenging task of Weakly Supervised
Few-Shot Object Detection (WS-FSOD), where it attains significant improvements
over the baselines. In addition, StarNet shows significant gains on few-shot
classification benchmarks whose images are less tightly cropped around the
objects (where object localization is key).
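As a rough illustration of the star-model matching idea (a hypothetical sketch, not StarNet's actual differentiable head), the snippet below correlates every location of a query feature map with every location of a support feature map and lets similar pairs vote for a relative offset; a peak in the vote map indicates a geometrically consistent correspondence between object instances. Shapes and names are assumptions.

```python
import torch
import torch.nn.functional as F

def star_match(query_feats, support_feats):
    """Vote map over relative offsets between matched query/support cells.
    Inputs are feature maps of shape [C, H, W]; output is [2H-1, 2W-1]."""
    C, H, W = query_feats.shape
    q = F.normalize(query_feats.reshape(C, -1), dim=0)    # [C, HW]
    s = F.normalize(support_feats.reshape(C, -1), dim=0)  # [C, HW]
    sim = q.t() @ s                                       # [HW, HW] cosine similarity
    votes = torch.zeros(2 * H - 1, 2 * W - 1)
    for qi in range(H * W):
        qy, qx = divmod(qi, W)
        for si in range(H * W):
            sy, sx = divmod(si, W)
            # Each similar pair votes for its relative geometric offset.
            votes[qy - sy + H - 1, qx - sx + W - 1] += sim[qi, si].clamp(min=0)
    return votes  # a sharp peak suggests corresponding object instances
```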
Many-shot from Low-shot: Learning to Annotate using Mixed Supervision for Object Detection
Object detection has witnessed significant progress by relying on large,
manually annotated datasets. Annotating such datasets is highly time-consuming
and expensive, which motivates the development of weakly supervised and
few-shot object detection methods. However, these methods largely underperform
their strongly supervised counterparts, as weak training signals often result
in partial or oversized detections. Towards solving this problem, we introduce,
for the first time, an online annotation module (OAM) that learns to generate a
many-shot set of reliable annotations from a larger volume of weakly labelled
images. Our OAM can be jointly trained with
any fully supervised two-stage object detection method, providing additional
training annotations on the fly. This results in a fully end-to-end strategy
that only requires a low-shot set of fully annotated images. The integration of
the OAM with Fast(er) R-CNN improves their performance in mAP and AP50 on the
PASCAL VOC 2007 and MS-COCO benchmarks, and significantly outperforms competing
methods using mixed supervision.
Comment: Accepted at ECCV 2020. Camera-ready version and appendices.
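As a loose sketch of how such an online annotation module could plug into a two-stage detector's training loop; `detector.loss` and the `oam` callable below are illustrative assumptions, not the paper's interfaces.

```python
def train_step(detector, oam, strong_batch, weak_batch, optimizer):
    """One hypothetical mixed-supervision step: the OAM converts weakly
    labelled images into reliable pseudo box annotations on the fly."""
    images_s, boxes_s = strong_batch   # low-shot, fully annotated images
    images_w, labels_w = weak_batch    # larger pool, image-level labels only
    # The OAM returns only the subset of weak images it annotates reliably.
    kept_images, pseudo_boxes = oam(images_w, labels_w)
    loss = detector.loss(images_s, boxes_s)          # fully supervised loss
    if len(kept_images) > 0:
        loss = loss + detector.loss(kept_images, pseudo_boxes)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return float(loss)
```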
Did You Miss the Sign? A False Negative Alarm System for Traffic Sign Detectors
Object detection is an integral part of an autonomous vehicle for its
safety-critical and navigational purposes. Traffic signs as objects play a
vital role in guiding such systems. However, if the vehicle fails to locate a
critical sign, the result can be a catastrophic failure. In this paper, we propose
an approach to identify traffic signs that have been mistakenly discarded by
the object detector. The proposed method raises an alarm when it discovers a
failure by the object detector to detect a traffic sign. This approach can be
useful to evaluate the performance of the detector during the deployment phase.
We trained a single shot multi-box object detector to detect traffic signs and
used its internal features to train a separate false negative detector (FND).
During deployment, FND decides whether the traffic sign detector (TSD) has
missed a sign or not. We use precision and recall to measure the accuracy of
the FND on two different datasets. At 80% recall, the FND achieves 89.9%
precision on the Belgium Traffic Sign Detection dataset and 90.8% precision on
the German Traffic Sign Recognition Benchmark dataset. To the best of our
knowledge, our method is the first to tackle this critical aspect of false
negative detection in robotic vision. Such a fail-safe mechanism for object
detection can improve the engagement of robotic vision systems in our daily
life.
Comment: Submitted to the 2019 IEEE/RSJ International Conference on Intelligent
Robots and Systems (IROS 2019).
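A minimal sketch of the false-negative-detector idea: a small binary head over pooled internal features of the traffic sign detector, trained to predict whether a sign was missed in the frame. The feature dimension and architecture below are assumptions for illustration, not the paper's exact design.

```python
import torch
import torch.nn as nn

class FalseNegativeDetector(nn.Module):
    """Binary classifier over a detector's internal feature map [B, C, H, W];
    outputs the probability that the detector missed a traffic sign."""
    def __init__(self, feat_dim=512):
        super().__init__()
        self.head = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(feat_dim, 128), nn.ReLU(),
            nn.Linear(128, 1),
        )

    def forward(self, detector_feats):
        return torch.sigmoid(self.head(detector_feats)).squeeze(-1)  # [B]
```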
Grounded Language-Image Pre-training
This paper presents a grounded language-image pre-training (GLIP) model for
learning object-level, language-aware, and semantic-rich visual
representations. GLIP unifies object detection and phrase grounding for
pre-training. The unification brings two benefits: 1) it allows GLIP to learn
from both detection and grounding data to improve both tasks and bootstrap a
good grounding model; 2) GLIP can leverage massive image-text pairs by
generating grounding boxes in a self-training fashion, making the learned
representation semantic-rich. In our experiments, we pre-train GLIP on 27M
grounding examples, including 3M human-annotated and 24M web-crawled image-text
pairs. The learned representations demonstrate strong zero-shot and few-shot
transferability to various object-level recognition tasks. 1) When directly
evaluated on COCO and LVIS (without seeing any images in COCO during
pre-training), GLIP achieves 49.8 AP and 26.9 AP, respectively, surpassing many
supervised baselines. 2) After fine-tuning on COCO, GLIP achieves 60.8 AP on val
and 61.5 AP on test-dev, surpassing the prior SoTA. 3) When transferred to 13
downstream object detection tasks, a 1-shot GLIP rivals a fully supervised
Dynamic Head. Code is released at https://github.com/microsoft/GLIP.
Comment: CVPR 2022; updated visualizations; fixed hyper-parameters in Appendix C.
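The central reformulation, detection as phrase grounding, can be sketched as replacing a fixed classifier with alignment scores between region features and text embeddings, so novel phrases can be scored zero-shot. The shapes and temperature below are illustrative assumptions, not GLIP's exact formulation.

```python
import torch
import torch.nn.functional as F

def grounding_logits(region_feats, phrase_embeds, temperature=0.07):
    """Region-phrase alignment logits in place of classification logits.
    region_feats: [R, D] visual features; phrase_embeds: [P, D] text features."""
    r = F.normalize(region_feats, dim=-1)
    p = F.normalize(phrase_embeds, dim=-1)
    return (r @ p.t()) / temperature  # [R, P]: score of each region for each phrase
```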
Application of a One-Shot Learning Detector to Avoid Unwanted Blurring/Hiding of Objects in Video Conferencing
In the era of hybrid work, more and more people prefer online video meetings to face-to-face meetings. Since people may join meetings from home, many solutions, such as background blur or virtual backgrounds, exist to protect user privacy. However, if a user tries to show a particular object while background blur or a virtual background is enabled, the object may be treated as part of the background and lose visibility. This invention disclosure proposes using one-shot learning to let users specify the objects they want to remain visible in a video conference with background blur or a virtual background enabled, together with a post-processing step that helps any one-shot learning model track objects better in this video conferencing scenario. The user scenario used to explain the idea in this disclosure is showing an object in a video conference with background blur or a virtual background enabled, but the idea is not limited to this scenario; for example, it can also be used for auto-framing on any object users specify, rather than only on human faces. Although the scenario combines image segmentation, gesture detection, and one-shot learning models, we discuss only the one-shot learning model in detail, because it is the most critical part of the invention; the other two models are very common in existing applications.
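A minimal sketch of the compositing step implied by this scenario, assuming a person segmentation mask and a one-shot-detected object mask are already available; the function names and OpenCV-based blur are illustrative choices, not the disclosure's implementation.

```python
import cv2
import numpy as np

def composite_frame(frame, person_mask, object_mask, blur_ksize=31):
    """Blur the background but keep the person and the user-specified
    object sharp. Masks are float arrays in [0, 1] of shape [H, W]."""
    keep = np.clip(person_mask + object_mask, 0.0, 1.0)[..., None]
    blurred = cv2.GaussianBlur(frame, (blur_ksize, blur_ksize), 0)
    return (keep * frame + (1.0 - keep) * blurred).astype(frame.dtype)
```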
Zero-Shot In-Distribution Detection in Multi-Object Settings Using Vision-Language Foundation Models
Removing out-of-distribution (OOD) images from noisy images scraped from the
Internet is an important preprocessing for constructing datasets, which can be
addressed by zero-shot OOD detection with vision-language foundation models
such as CLIP. The existing zero-shot OOD detection setting does not consider the
realistic case where an image has both in-distribution (ID) objects and OOD
objects. However, it is important to identify such images as ID images when
collecting images of rare classes or of ethically inappropriate classes that
must not be missed. In this paper, we propose a novel problem setting called
in-distribution (ID) detection, where we identify images containing ID objects
as ID images, even if they contain OOD objects, and images lacking ID objects
as OOD images. To solve this problem, we present a new approach,
Global-Local Maximum Concept Matching (GL-MCM), based on both global and local
visual-text alignments of CLIP features, which can identify any image containing
ID objects
as ID images. Extensive experiments demonstrate that GL-MCM outperforms
comparison methods on both multi-object datasets and single-object ImageNet
benchmarks.
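A minimal sketch of a global-local maximum concept matching score in the spirit of GL-MCM, assuming precomputed CLIP global, local, and text features; the temperature and pooling below are assumptions rather than the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def gl_mcm_score(global_feat, local_feats, text_feats, temp=0.01):
    """ID-ness score: max softmax similarity of the global image feature [D]
    plus max softmax similarity over local patch features [L, D], both
    against ID-class text embeddings [K, D]."""
    g = F.normalize(global_feat, dim=-1)
    l = F.normalize(local_feats, dim=-1)
    t = F.normalize(text_feats, dim=-1)
    mcm_global = torch.softmax((g @ t.t()) / temp, dim=-1).max()
    mcm_local = torch.softmax((l @ t.t()) / temp, dim=-1).max()
    return mcm_global + mcm_local  # higher => image likely contains ID objects
```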