SSDA-YOLO: Semi-supervised Domain Adaptive YOLO for Cross-Domain Object Detection
Domain adaptive object detection (DAOD) aims to alleviate transfer
performance degradation caused by cross-domain discrepancy. However, most
existing DAOD methods rely on the outdated and computationally intensive
two-stage Faster R-CNN, which is rarely the first choice for industrial
applications. In this paper, we propose a novel semi-supervised domain adaptive
YOLO (SSDA-YOLO) method that improves cross-domain detection performance by
integrating the compact yet strong one-stage detector YOLOv5 with domain
adaptation. Specifically, we adapt the knowledge distillation framework with
the Mean Teacher model to assist the student model in obtaining instance-level
features of the unlabeled target domain. We also utilize scene style transfer
to cross-generate pseudo images across domains to remedy image-level
differences. In addition, an intuitive consistency loss is proposed
to further align cross-domain predictions. We evaluate SSDA-YOLO on public
benchmarks including PascalVOC, Clipart1k, Cityscapes, and Foggy Cityscapes.
Moreover, to verify its generalization, we conduct experiments on yawning
detection datasets collected from various real classrooms. The results show
considerable improvements of our method in these DAOD tasks, revealing both
the effectiveness of the proposed adaptive modules and the urgency of applying
more advanced detectors in DAOD. Our code is available at
\url{https://github.com/hnuzhy/SSDA-YOLO}.
Comment: submitted to CVI
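The Mean Teacher distillation and consistency loss described in the abstract can be sketched as follows. This is a minimal illustrative sketch, not the authors' code: the linear modules stand in for detector heads, and the MSE-based consistency loss is one common choice for aligning student and teacher predictions on unlabeled target-domain data.

```python
import torch
import torch.nn as nn

@torch.no_grad()
def ema_update(teacher: nn.Module, student: nn.Module, alpha: float = 0.999):
    """Exponential moving average of student weights into the teacher,
    as in the Mean Teacher framework."""
    for t_p, s_p in zip(teacher.parameters(), student.parameters()):
        t_p.mul_(alpha).add_(s_p, alpha=1.0 - alpha)

def consistency_loss(student_out, teacher_out):
    """Align student predictions with the (detached) teacher predictions
    on target-domain inputs; MSE is an illustrative choice."""
    return nn.functional.mse_loss(student_out, teacher_out.detach())

# Toy stand-ins for the student and teacher detector heads
student = nn.Linear(8, 4)
teacher = nn.Linear(8, 4)

x = torch.randn(2, 8)                 # unlabeled target-domain batch
loss = consistency_loss(student(x), teacher(x))
ema_update(teacher, student, alpha=0.99)
```

In practice the teacher is never updated by gradients, only by the EMA step, so it provides a smoothed, more stable target for the student.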
Pseudo-labels for Supervised Learning on Dynamic Vision Sensor Data, Applied to Object Detection under Ego-motion
In recent years, dynamic vision sensors (DVS), also known as event-based
cameras or neuromorphic sensors, have seen increased use due to various
advantages over conventional frame-based cameras. Operating on principles
inspired by the retina, their high temporal resolution overcomes motion blur,
their high dynamic range handles extreme illumination conditions, and their low
power consumption makes them ideal for embedded systems on platforms such as
drones and self-driving cars. However, event-based datasets are scarce, and
labels are
even rarer for tasks such as object detection. We transferred discriminative
knowledge from a state-of-the-art frame-based convolutional neural network
(CNN) to the event-based modality via intermediate pseudo-labels, which are
used as targets for supervised learning. We show, for the first time,
event-based car detection under ego-motion in a real environment at 100 frames
per second with a test average precision of 40.3% relative to our annotated
ground truth. The event-based car detector handles motion blur and poor
illumination conditions despite not being explicitly trained to do so, and even
complements frame-based CNN detectors, suggesting that it has learnt
generalized visual representations.
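The pseudo-labeling step described above amounts to filtering a frozen frame-based detector's confident outputs and reusing them as supervision targets for the event-based network. The sketch below is illustrative (the arrays and threshold are hypothetical, not the paper's pipeline):

```python
import numpy as np

def make_pseudo_labels(boxes, scores, score_thresh=0.5):
    """Keep only confident frame-based detections as pseudo-labels
    for supervised training of the event-based detector."""
    keep = scores > score_thresh
    return boxes[keep], scores[keep]

# Toy detections from a (hypothetical) frame-based CNN: [x1, y1, x2, y2]
boxes = np.array([[0.0, 0.0, 10.0, 10.0],
                  [5.0, 5.0, 20.0, 20.0]])
scores = np.array([0.9, 0.3])

pl_boxes, pl_scores = make_pseudo_labels(boxes, scores)  # keeps only the 0.9 box
```

The confidence threshold trades label noise against label coverage: a higher threshold yields fewer but cleaner pseudo-labels for the event-based student.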
Reverse Knowledge Distillation: Training a Large Model using a Small One for Retinal Image Matching on Limited Data
Retinal image matching plays a crucial role in monitoring disease progression
and treatment response. However, datasets with matched keypoints between
temporally separated pairs of images are too scarce to train
transformer-based models. We propose a novel approach based on reverse knowledge
distillation to train large models with limited data while preventing
overfitting. Firstly, we propose architectural modifications to a CNN-based
semi-supervised method called SuperRetina that help us improve its results on a
publicly available dataset. Then, we train a computationally heavier model
based on a vision transformer encoder using the lighter CNN-based model, which
is counter-intuitive in the field of knowledge distillation, where training
lighter models from heavier ones is the norm. Surprisingly, such
reverse knowledge distillation improves generalization even further. Our
experiments suggest that high-dimensional fitting in representation space may
prevent overfitting unlike training directly to match the final output. We also
provide a public dataset with annotations for retinal image keypoint detection
and matching to help the research community develop algorithms for retinal
image applications.
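The "reverse" direction of distillation described above, where a larger model is trained to match a smaller, already-trained one in representation space, can be sketched as follows. The modules are toy stand-ins, not SuperRetina or the authors' transformer; the point is that the loss targets intermediate features rather than the final output.

```python
import torch
import torch.nn as nn

# Small, pre-trained teacher (stand-in for the lighter CNN-based model)
small_teacher = nn.Sequential(nn.Conv2d(1, 8, 3, padding=1), nn.ReLU())

# Larger student (stand-in for the heavier transformer-based model),
# with a projection to the teacher's feature dimensionality
large_student = nn.Sequential(
    nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(),
    nn.Conv2d(32, 8, 3, padding=1),
)

opt = torch.optim.Adam(large_student.parameters(), lr=1e-3)
x = torch.randn(4, 1, 16, 16)

with torch.no_grad():
    target = small_teacher(x)   # fixed teacher features in representation space

for _ in range(5):
    opt.zero_grad()
    # Fit the student's features to the teacher's, not to final labels
    loss = nn.functional.mse_loss(large_student(x), target)
    loss.backward()
    opt.step()
```

Matching in the high-dimensional feature space acts as a regularizer on the over-parameterized student, which is the abstract's hypothesis for why overfitting is reduced on limited data.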
Semi-Supervised Domain Generalization for Object Detection via Language-Guided Feature Alignment
Existing domain adaptation (DA) and generalization (DG) methods in object
detection enforce feature alignment in the visual space but face challenges
like object appearance variability and scene complexity, which make it
difficult to distinguish between objects and achieve accurate detection. In
this paper, we are the first to address the problem of semi-supervised domain
generalization by exploring vision-language pre-training and enforcing feature
alignment through the language space. We employ a novel Cross-Domain
Descriptive Multi-Scale Learning (CDDMSL) aiming to maximize the agreement
between descriptions of an image presented with different domain-specific
characteristics in the embedding space. CDDMSL significantly outperforms
existing methods, achieving 11.7% and 7.5% improvement in DG and DA settings,
respectively. Comprehensive analysis and ablation studies confirm the
effectiveness of our method, positioning CDDMSL as a promising approach for
domain generalization in object detection tasks.
Comment: Accepted at BMVC 202
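The cross-domain agreement objective described above can be illustrated with a simple cosine-agreement loss between paired embeddings of the same content rendered with different domain styles. This is a hedged sketch of the general idea, not the CDDMSL implementation, and the embeddings here are random placeholders:

```python
import numpy as np

def agreement_loss(emb_a, emb_b):
    """1 minus mean cosine similarity between paired cross-domain
    embeddings; minimizing it pulls the pairs together."""
    a = emb_a / np.linalg.norm(emb_a, axis=-1, keepdims=True)
    b = emb_b / np.linalg.norm(emb_b, axis=-1, keepdims=True)
    return 1.0 - (a * b).sum(axis=-1).mean()

rng = np.random.default_rng(0)
src = rng.normal(size=(4, 64))   # e.g. description embeddings, source style
tgt = rng.normal(size=(4, 64))   # same content, target-domain style
loss = agreement_loss(src, tgt)
```

The loss lies in [0, 2]; driving it toward 0 maximizes agreement between the domain-specific views in the shared embedding space, which is the alignment goal the abstract states.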