699 research outputs found
Learning scale-variant and scale-invariant features for deep image classification
Convolutional Neural Networks (CNNs) require large image corpora to be
trained on classification tasks. The variation in image resolutions, sizes of
objects and patterns depicted, and image scales, hampers CNN training and
performance, because the task-relevant information varies over spatial scales.
Previous work attempting to deal with such scale variations focused on
encouraging scale-invariant CNN representations. However, scale-invariant
representations are incomplete representations of images, because images
contain scale-variant information as well. This paper addresses the combined
development of scale-invariant and scale-variant representations. We propose a
multi- scale CNN method to encourage the recognition of both types of features
and evaluate it on a challenging image classification task involving
task-relevant characteristics at multiple scales. The results show that our
multi-scale CNN outperforms single-scale CNN. This leads to the conclusion that
encouraging the combined development of a scale-invariant and scale-variant
representation in CNNs is beneficial to image recognition performance
Deep Learning Framework For Intelligent Pavement Condition Rating: A direct classification approach for regional and local roads
Transport authorities rely on pavement characteristics to determine a pavement condition rating index. However, manually computing ratings can be a tedious, subjective, time-consuming, and training-intensive process. This paper presents a deep-learning framework for automatically rating the condition of rural road pavements using digital images captured from a dashboard-mounted camera. The framework includes pavement segmentation, data cleaning, image cropping and resizing, and pavement condition rating classification. A dataset of images, captured from diverse roads in Ireland and rated by two expert raters using the pavement surface condition index (PSCI) scale, was created. Deep-learning models were developed to perform pavement segmentation and condition rating classification. The automated PSCI rating achieved an average Cohen Kappa score and F1-score of 0.9 and 0.85, respectively, across 1–10 rating classes on an independent test set. The incorporation of unique image augmentation during training enabled the models to exhibit increased robustness against variations in background and clutter
Orientation-Independent Chinese Text Recognition in Scene Images
Scene text recognition (STR) has attracted much attention due to its broad
applications. The previous works pay more attention to dealing with the
recognition of Latin text images with complex backgrounds by introducing
language models or other auxiliary networks. Different from Latin texts, many
vertical Chinese texts exist in natural scenes, which brings difficulties to
current state-of-the-art STR methods. In this paper, we take the first attempt
to extract orientation-independent visual features by disentangling content and
orientation information of text images, thus recognizing both horizontal and
vertical texts robustly in natural scenes. Specifically, we introduce a
Character Image Reconstruction Network (CIRN) to recover corresponding printed
character images with disentangled content and orientation information. We
conduct experiments on a scene dataset for benchmarking Chinese text
recognition, and the results demonstrate that the proposed method can indeed
improve performance through disentangling content and orientation information.
To further validate the effectiveness of our method, we additionally collect a
Vertical Chinese Text Recognition (VCTR) dataset. The experimental results show
that the proposed method achieves 45.63% improvement on VCTR when introducing
CIRN to the baseline model.Comment: IJCAI 202
Visual Representation Learning with Transformer: A Sequence-to-Sequence Perspective
Visual representation learning is the key of solving various vision problems.
Relying on the seminal grid structure priors, convolutional neural networks
(CNNs) have been the de facto standard architectures of most deep vision
models. For instance, classical semantic segmentation methods often adopt a
fully-convolutional network (FCN) with an encoder-decoder architecture. The
encoder progressively reduces the spatial resolution and learns more abstract
visual concepts with larger receptive fields. Since context modeling is
critical for segmentation, the latest efforts have been focused on increasing
the receptive field, through either dilated (i.e., atrous) convolutions or
inserting attention modules. However, the FCN-based architecture remains
unchanged. In this paper, we aim to provide an alternative perspective by
treating visual representation learning generally as a sequence-to-sequence
prediction task. Specifically, we deploy a pure Transformer to encode an image
as a sequence of patches, without local convolution and resolution reduction.
With the global context modeled in every layer of the Transformer, stronger
visual representation can be learned for better tackling vision tasks. In
particular, our segmentation model, termed as SEgmentation TRansformer (SETR),
excels on ADE20K (50.28% mIoU, the first position in the test leaderboard on
the day of submission), Pascal Context (55.83% mIoU) and reaches competitive
results on Cityscapes. Further, we formulate a family of Hierarchical
Local-Global (HLG) Transformers characterized by local attention within windows
and global-attention across windows in a hierarchical and pyramidal
architecture. Extensive experiments show that our method achieves appealing
performance on a variety of visual recognition tasks (e.g., image
classification, object detection and instance segmentation and semantic
segmentation).Comment: Extended version of CVPR 2021 paper arXiv:2012.1584
Efficient Semantic Segmentation for Resource-Constrained Applications with Lightweight Neural Networks
This thesis focuses on developing lightweight semantic segmentation models tailored for resource-constrained applications, effectively balancing accuracy and computational efficiency. It introduces several novel concepts, including knowledge sharing, dense bottleneck, and feature re-usability, which enhance the feature hierarchy by capturing fine-grained details, long-range dependencies, and diverse geometrical objects within the scene. To achieve precise object localization and improved semantic representations in real-time environments, the thesis introduces multi-stage feature aggregation, feature scaling, and hybrid-path attention methods
Research on rainy day traffic sign recognition algorithm based on PMRNet
The recognition of traffic signs is of great significance to intelligent driving and traffic systems. Most current traffic sign recognition algorithms do not consider the impact of rainy weather. The rain marks will obscure the recognition target in the image, which will lead to the performance degradation of the algorithm, a problem that has yet to be solved. In order to improve the accuracy of traffic sign recognition in rainy weather, we propose a rainy traffic sign recognition algorithm. The algorithm in this paper includes two modules. First, we propose an image deraining algorithm based on the Progressive multi-scale residual network (PMRNet), which uses a multi-scale residual structure to extract features of different scales, so as to improve the utilization rate of the algorithm for information, combined with the Convolutional long-short term memory (ConvLSTM) network to enhance the algorithm's ability to extract rain mark features. Second, we use the CoT-YOLOv5 algorithm to recognize traffic signs on the recovered images. In this paper, in order to improve the performance of YOLOv5 (You-Only-Look-Once, YOLO), the 3 × 3 convolution in the feature extraction module is replaced by the Contextual Transformer (CoT) module to make up for the lack of global modeling capability of Convolutional Neural Network (CNN), thus improving the recognition accuracy. The experimental results show that the deraining algorithm based on PMRNet can effectively remove rain marks, and the evaluation indicators Peak Signal-to-Noise Ratio (PSNR) and Structural Similarity Index Measure (SSIM) are better than the other representative algorithms. The mean Average Precision (mAP) of the CoT-YOLOv5 algorithm on the TT100k datasets reaches 92.1%, which is 5% higher than the original YOLOv5
- …