188,829 research outputs found
VSSA-NET: Vertical Spatial Sequence Attention Network for Traffic Sign Detection
Although traffic sign detection has been studied for years and great progress
has been made with the rise of deep learning technique, there are still many
problems remaining to be addressed. For complicated real-world traffic scenes,
there are two main challenges. Firstly, traffic signs are usually small size
objects, which makes it more difficult to detect than large ones; Secondly, it
is hard to distinguish false targets which resemble real traffic signs in
complex street scenes without context information. To handle these problems, we
propose a novel end-to-end deep learning method for traffic sign detection in
complex environments. Our contributions are as follows: 1) We propose a
multi-resolution feature fusion network architecture which exploits densely
connected deconvolution layers with skip connections, and can learn more
effective features for the small size object; 2) We frame the traffic sign
detection as a spatial sequence classification and regression task, and propose
a vertical spatial sequence attention (VSSA) module to gain more context
information for better detection performance. To comprehensively evaluate the
proposed method, we do experiments on several traffic sign datasets as well as
the general object detection dataset and the results have shown the
effectiveness of our proposed method
On the Importance of Visual Context for Data Augmentation in Scene Understanding
Performing data augmentation for learning deep neural networks is known to be
important for training visual recognition systems. By artificially increasing
the number of training examples, it helps reducing overfitting and improves
generalization. While simple image transformations can already improve
predictive performance in most vision tasks, larger gains can be obtained by
leveraging task-specific prior knowledge. In this work, we consider object
detection, semantic and instance segmentation and augment the training images
by blending objects in existing scenes, using instance segmentation
annotations. We observe that randomly pasting objects on images hurts the
performance, unless the object is placed in the right context. To resolve this
issue, we propose an explicit context model by using a convolutional neural
network, which predicts whether an image region is suitable for placing a given
object or not. In our experiments, we show that our approach is able to improve
object detection, semantic and instance segmentation on the PASCAL VOC12 and
COCO datasets, with significant gains in a limited annotation scenario, i.e.
when only one category is annotated. We also show that the method is not
limited to datasets that come with expensive pixel-wise instance annotations
and can be used when only bounding boxes are available, by employing
weakly-supervised learning for instance masks approximation.Comment: Updated the experimental section. arXiv admin note: substantial text
overlap with arXiv:1807.0742
Does Visual Pretraining Help End-to-End Reasoning?
We aim to investigate whether end-to-end learning of visual reasoning can be
achieved with general-purpose neural networks, with the help of visual
pretraining. A positive result would refute the common belief that explicit
visual abstraction (e.g. object detection) is essential for compositional
generalization on visual reasoning, and confirm the feasibility of a neural
network "generalist" to solve visual recognition and reasoning tasks. We
propose a simple and general self-supervised framework which "compresses" each
video frame into a small set of tokens with a transformer network, and
reconstructs the remaining frames based on the compressed temporal context. To
minimize the reconstruction loss, the network must learn a compact
representation for each image, as well as capture temporal dynamics and object
permanence from temporal context. We perform evaluation on two visual reasoning
benchmarks, CATER and ACRE. We observe that pretraining is essential to achieve
compositional generalization for end-to-end visual reasoning. Our proposed
framework outperforms traditional supervised pretraining, including image
classification and explicit object detection, by large margins.Comment: NeurIPS 202
Efficient Parameter Mining and Freezing for Continual Object Detection
Continual Object Detection is essential for enabling intelligent agents to
interact proactively with humans in real-world settings. While
parameter-isolation strategies have been extensively explored in the context of
continual learning for classification, they have yet to be fully harnessed for
incremental object detection scenarios. Drawing inspiration from prior research
that focused on mining individual neuron responses and integrating insights
from recent developments in neural pruning, we proposed efficient ways to
identify which layers are the most important for a network to maintain the
performance of a detector across sequential updates. The presented findings
highlight the substantial advantages of layer-level parameter isolation in
facilitating incremental learning within object detection models, offering
promising avenues for future research and application in real-world scenarios.Comment: In Proceedings of the 19th International Joint Conference on Computer
Vision, Imaging and Computer Graphics Theory and Applications - Volume 2:
VISAPP, ISBN 978-989-758-679-8, ISSN 2184-4321, pages 466-47
- …