Neighborhood Structure-Based Model for Multilingual Arbitrarily-Oriented Text Localization in Images/Videos
Text in an image or a video provides important clues and semantic information about the event depicted. Text localization is an interesting and challenging research problem in image processing due to irregular alignments, varying brightness, degradation, and complex backgrounds. Multilingual textual information takes on different geometrical shapes, which makes locating the text even harder. In this work, an effective model is presented to locate multilingual, arbitrarily oriented text. The proposed method develops a neighborhood structure model to locate the text region. Initially, max-min clustering is applied with a 3×3 sliding window to sharpen the text region. The neighborhood structure then creates a boundary for every component using the normal deviation calculated from the sharpened image. Finally, a double-stroke structure model is employed to locate the text region accurately. The presented model is evaluated on five standard datasets (NUS, arbitrarily oriented text, Hua's, MRRC, and a real-time video dataset) with performance metrics such as recall, precision, and F-measure.
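The 3×3 max-min sliding-window sharpening step the abstract mentions can be sketched as a local contrast (range) filter. This is a minimal illustration, not the paper's exact formulation: details such as edge padding and the use of the max-min range as the sharpening response are assumptions.

```python
import numpy as np

def max_min_sharpen(img: np.ndarray) -> np.ndarray:
    """For each pixel, compute the max-min range over its 3x3
    neighborhood -- a simple local-contrast measure that highlights
    text strokes against smoother background regions."""
    h, w = img.shape
    padded = np.pad(img.astype(float), 1, mode="edge")
    out = np.empty((h, w))
    for y in range(h):
        for x in range(w):
            win = padded[y:y + 3, x:x + 3]
            out[y, x] = win.max() - win.min()
    return out

# A tiny synthetic "image": a bright stroke on a dark background.
img = np.zeros((5, 5))
img[2, 1:4] = 255
resp = max_min_sharpen(img)
print(resp[2, 2])  # strong response on the stroke
print(resp[0, 0])  # no response in flat background
```

Pixels on or near the stroke get a large response while flat regions stay at zero, which is the kind of separation a subsequent boundary/deviation step can exploit.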
Text detection and recognition in natural scene images
This thesis addresses the problem of end-to-end text detection and recognition in
natural scene images based on deep neural networks. Scene text detection and recognition
aim to find regions in an image that are considered as text by human beings,
generate a bounding box for each word and output a corresponding sequence of
characters. As a useful task in image analysis, scene text detection and recognition
attract much attention in computer vision field. In this thesis, we tackle this problem
by taking advantage of the success in deep learning techniques.
Car license plates can be viewed as a special case of scene text, as both consist
of characters and appear in natural scenes. Nevertheless, they each have their own
specificities. We therefore start from car license plate detection and recognition,
and then extend the methods to general scene text with additional ideas.
For both tasks, we develop two approaches: a stepwise one and an integrated one.
Stepwise methods tackle text detection and recognition in separate steps with
separate models, whereas integrated methods handle both detection and recognition
simultaneously via a single model. All approaches are based on the powerful deep
Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs), given
the tremendous breakthroughs they have brought to the computer vision community.
To begin with, a stepwise framework is proposed to tackle text detection and
recognition, applied to car license plates and general scene text respectively.
A character CNN classifier is trained to detect characters in an image in a
sliding-window manner. The detected characters are then grouped into license
plates or text lines according to heuristic rules. A sequence-labeling-based
method is proposed to recognize the whole license plate or text line without
character-level segmentation.
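Sequence-labeling recognition without character-level segmentation is commonly realized with CTC-style training and decoding. As a hedged sketch (the thesis does not spell out the decoder here, and the label indices and blank convention below are assumptions), a minimal greedy CTC decoder looks like:

```python
import numpy as np

BLANK = 0  # CTC blank label index (assumed convention)

def ctc_greedy_decode(logits: np.ndarray) -> list:
    """Collapse per-frame predictions into a label sequence:
    take the argmax at each frame, merge consecutive repeats,
    then drop blanks -- no character segmentation required."""
    best = logits.argmax(axis=1)
    decoded, prev = [], BLANK
    for label in best:
        if label != prev and label != BLANK:
            decoded.append(int(label))
        prev = label
    return decoded

# Per-frame scores over {blank, 'A'=1, 'B'=2} for 6 frames:
# the frames read A A blank B B blank, i.e. "AB".
logits = np.array([
    [0.1, 0.8, 0.1],
    [0.1, 0.8, 0.1],
    [0.9, 0.05, 0.05],
    [0.1, 0.1, 0.8],
    [0.1, 0.1, 0.8],
    [0.9, 0.05, 0.05],
])
print(ctc_greedy_decode(logits))  # -> [1, 2]
```

The blank symbol is what lets the model emit repeated characters and skip explicit character boundaries, which is exactly why no segmentation step is needed.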
Building on the sequence-labeling-based recognition method, and to accelerate
processing, an integrated deep neural network is then proposed to address
car license plate detection and recognition concurrently. It integrates both CNNs
and RNNs in one network, and can be trained end-to-end. Both car license plate
bounding boxes and their labels are generated in a single forward pass of the
network. The whole process involves no heuristic rules, and avoids intermediate
procedures like image cropping or feature recalculation, which not only prevents
error accumulation, but also reduces the computational burden.
Lastly, the unified network is extended to simultaneous general text detection and
recognition in natural scenes. In contrast to the network for car license plates,
several innovations are proposed to accommodate the special characteristics of
general text. A varying-size RoI encoding method is proposed to handle the varied
aspect ratios of general text. An attention-based sequence-to-sequence learning
structure is adopted for word recognition, with the expectation that a
character-level language model can be learned in this manner. The whole framework
can be trained end-to-end, requiring only images, ground-truth bounding boxes, and
text labels. Through end-to-end training, the learned features become more
discriminative, which improves the overall performance. The convolutional features
are computed only once and shared by both detection and recognition, which saves
processing time. The proposed method achieves state-of-the-art performance on
several standard benchmark datasets.
Thesis (Ph.D.) -- University of Adelaide, School of Computer Science, 201
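The varying-size RoI encoding idea can be illustrated by pooling each region to a fixed height but a width that scales with its aspect ratio, so wide text lines keep more horizontal resolution. This is a simplified sketch under assumptions (nearest-neighbour resampling as the pooling operation, a capped maximum width); the thesis's exact scheme may differ.

```python
import numpy as np

def varying_size_roi_pool(feat, roi, out_h=4, max_w=32):
    """Pool an RoI to a fixed height and an aspect-ratio-dependent
    width, instead of a fixed square grid."""
    x0, y0, x1, y1 = roi
    region = feat[y0:y1, x0:x1]
    h, w = region.shape
    out_w = min(max_w, max(1, round(out_h * w / h)))
    # Nearest-neighbour resampling stands in for the pooling step.
    ys = np.arange(out_h) * h // out_h
    xs = np.arange(out_w) * w // out_w
    return region[np.ix_(ys, xs)]

feat = np.arange(100, dtype=float).reshape(10, 10)
wide = varying_size_roi_pool(feat, (0, 0, 10, 2))   # 2x10 text-line-like RoI
square = varying_size_roi_pool(feat, (0, 0, 4, 4))  # 4x4 square RoI
print(wide.shape)    # (4, 20): width preserved in proportion
print(square.shape)  # (4, 4)
```

A fixed-size RoI pooling would squeeze the 2×10 region into the same grid as the square one, destroying the horizontal detail that character recognition depends on.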
ClipSAM: CLIP and SAM Collaboration for Zero-Shot Anomaly Segmentation
Recently, foundational models such as CLIP and SAM have shown promising
performance for the task of Zero-Shot Anomaly Segmentation (ZSAS). However,
either CLIP-based or SAM-based ZSAS methods still suffer from non-negligible
key drawbacks: 1) CLIP primarily focuses on global feature alignment across
different inputs, leading to imprecise segmentation of local anomalous parts;
2) SAM tends to generate numerous redundant masks without proper prompt
constraints, resulting in complex post-processing requirements. In this work,
we innovatively propose a CLIP and SAM collaboration framework called ClipSAM
for ZSAS. The insight behind ClipSAM is to employ CLIP's semantic understanding
capability for anomaly localization and rough segmentation, which is further
used as the prompt constraints for SAM to refine the anomaly segmentation
results. In detail, we introduce a crucial Unified Multi-scale Cross-modal
Interaction (UMCI) module that interacts language with visual features at
multiple scales of CLIP to reason about anomaly positions. Then, we design a
novel Multi-level Mask Refinement (MMR) module, which utilizes the positional
information as multi-level prompts for SAM to acquire hierarchical levels of
masks and merges them. Extensive experiments validate the effectiveness of our
approach, achieving the best segmentation performance on the MVTec-AD and
VisA datasets.
Comment: 17 pages, 17 figures
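The core collaboration idea, turning a coarse anomaly heatmap from CLIP-based localization into prompts for SAM, can be sketched as follows. This is an illustrative assumption, not ClipSAM's actual MMR module: the helper below simply extracts peak locations at several confidence levels, which is one plausible way to form multi-level point prompts.

```python
import numpy as np

def heatmap_to_point_prompts(heatmap: np.ndarray, thresholds=(0.5, 0.7, 0.9)):
    """Turn a coarse anomaly heatmap into multi-level point prompts:
    at each threshold level, keep the peak location among the pixels
    exceeding that threshold. The resulting points could then be fed
    to a promptable segmenter such as SAM."""
    prompts = []
    for t in thresholds:
        mask = heatmap >= t
        if not mask.any():
            continue  # no region confident enough at this level
        masked = np.where(mask, heatmap, -np.inf)
        y, x = np.unravel_index(np.argmax(masked), heatmap.shape)
        prompts.append((t, (int(x), int(y))))
    return prompts

heat = np.zeros((8, 8))
heat[3, 5] = 0.95   # strong anomaly peak
heat[6, 1] = 0.60   # weaker response, below the upper thresholds
print(heatmap_to_point_prompts(heat))
```

Constraining SAM with such prompts is what avoids the "numerous redundant masks" problem the abstract describes: the segmenter only elaborates regions the language-vision model already flagged.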
Self-Supervised Predictive Convolutional Attentive Block for Anomaly Detection
Anomaly detection is commonly pursued as a one-class classification problem,
where models can only learn from normal training samples, while being evaluated
on both normal and abnormal test samples. Among the successful approaches for
anomaly detection, a distinguished category of methods relies on predicting
masked information (e.g. patches, future frames, etc.) and leveraging the
reconstruction error with respect to the masked information as an abnormality
score. Different from related methods, we propose to integrate the
reconstruction-based functionality into a novel self-supervised predictive
architectural building block. The proposed self-supervised block is generic and
can easily be incorporated into various state-of-the-art anomaly detection
methods. Our block starts with a convolutional layer with dilated filters,
where the center area of the receptive field is masked. The resulting
activation maps are passed through a channel attention module. Our block is
equipped with a loss that minimizes the reconstruction error with respect to
the masked area in the receptive field. We demonstrate the generality of our
block by integrating it into several state-of-the-art frameworks for anomaly
detection in images and videos, providing empirical evidence of considerable
performance improvements on MVTec AD, Avenue, and ShanghaiTech. We release our
code as open source at https://github.com/ristea/sspcab.
Comment: Accepted at CVPR 2022. Paper + supplementary (14 pages, 9 figures)
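The masked-receptive-field idea at the heart of the block can be sketched with a dilated 3×3 convolution whose centre tap is zeroed, so each output must be predicted purely from surrounding context; the reconstruction error against the true centre then serves as the self-supervised loss and, at test time, an abnormality score. This is a single-channel simplification under assumptions (one kernel, reflect padding, and the channel-attention module omitted), not the paper's full block.

```python
import numpy as np

def masked_dilated_response(img, kernel, dilation=2):
    """3x3 convolution with dilated taps and a masked (zeroed) centre,
    so the centre pixel never sees itself and must be reconstructed
    from its spatial context."""
    k = kernel.copy()
    k[1, 1] = 0.0  # mask the centre of the receptive field
    h, w = img.shape
    d = dilation
    padded = np.pad(img, d, mode="reflect")
    out = np.zeros((h, w))
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            out += k[dy + 1, dx + 1] * padded[d + dy * d : d + dy * d + h,
                                              d + dx * d : d + dx * d + w]
    return out

rng = np.random.default_rng(0)
img = rng.standard_normal((16, 16))
kernel = np.full((3, 3), 1.0 / 8.0)  # average over the 8 unmasked taps
pred = masked_dilated_response(img, kernel)
recon_error = float(np.mean((pred - img) ** 2))  # self-supervised loss / anomaly score
print(pred.shape)
```

On normal data the context is predictive and the error stays low; anomalous regions break that predictability, which is what makes the reconstruction error usable as an abnormality score.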
Joint Attention-Guided Feature Fusion Network for Saliency Detection of Surface Defects
Surface defect inspection plays an important role in the process of
industrial manufacture and production. Though Convolutional Neural Network
(CNN) based defect inspection methods have made huge leaps, they still confront
a lot of challenges such as defect scale variation, complex background, low
contrast, and so on. To address these issues, we propose a joint
attention-guided feature fusion network (JAFFNet) for saliency detection of
surface defects based on the encoder-decoder network. JAFFNet mainly
incorporates a joint attention-guided feature fusion (JAFF) module into
decoding stages to adaptively fuse low-level and high-level features. The JAFF
module learns to emphasize defect features and suppress background noise during
feature fusion, which is beneficial for detecting low-contrast defects. In
addition, JAFFNet introduces a dense receptive field (DRF) module following the
encoder to capture features with rich context information, which helps detect
defects of different scales. The JAFF module mainly utilizes a learned joint
channel-spatial attention map provided by high-level semantic features to guide
feature fusion. The attention map makes the model pay more attention to defect
features. The DRF module utilizes a sequence of multi-receptive-field (MRF)
units, each taking as input all the preceding MRF feature maps together with
the original input. The obtained DRF features capture rich context information with
a large range of receptive fields. Extensive experiments conducted on
SD-saliency-900, Magnetic tile, and DAGM 2007 indicate that our method achieves
promising performance in comparison with other state-of-the-art methods.
Meanwhile, our method reaches a real-time defect detection speed of 66 FPS.
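The attention-guided fusion the JAFF module performs can be sketched as follows: the high-level features supply a channel attention vector and a spatial attention map that gate the low-level features before fusion. This is a minimal stand-in under assumptions (mean-based attention, additive fusion), not the learned module from the paper.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def attention_guided_fusion(low, high):
    """Fuse low- and high-level feature maps of shape (C, H, W).
    High-level semantics provide a channel attention vector and a
    spatial attention map that gate the low-level features, so defect
    cues are emphasised and background noise suppressed before the
    two streams are summed."""
    c_att = sigmoid(high.mean(axis=(1, 2)))           # (C,) channel attention
    s_att = sigmoid(high.mean(axis=0))                # (H, W) spatial attention
    gated = low * c_att[:, None, None] * s_att[None]  # jointly gated low-level features
    return gated + high

rng = np.random.default_rng(1)
low = rng.standard_normal((8, 16, 16))   # fine detail, noisy
high = rng.standard_normal((8, 16, 16))  # coarse semantics
fused = attention_guided_fusion(low, high)
print(fused.shape)  # (8, 16, 16)
```

Gating the low-level stream with semantically derived attention is what lets the fusion keep fine boundary detail for low-contrast defects without re-admitting background clutter.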
Vulnerable road users and connected autonomous vehicles interaction: a survey
Vulnerable Road Users (VRUs) are a group of users within the vehicular traffic ecosystem that includes pedestrians, cyclists, and motorcyclists, among others. Connected autonomous vehicles (CAVs), in turn, combine communication technologies, which keep the vehicle ubiquitously connected, with automated technologies that assist or replace the human driver during the driving process. Autonomous vehicles are envisioned as a viable way to reduce road accidents by providing a safe environment for all road users, especially the most vulnerable. One of the problems facing autonomous vehicles is creating mechanisms that facilitate their integration, not only into the mobility environment but also into road society, in a safe and efficient way. In this paper, we analyze and discuss how this integration can take place, reviewing the work developed in recent years at each stage of vehicle-human interaction, analyzing the challenges posed by vulnerable users, and proposing solutions that contribute to addressing these challenges.
This work was partially funded by the Ministry of Economy, Industry, and Competitiveness of Spain under Grant: Supervision of drone fleet and optimization of commercial operations flight plans, PID2020-116377RB-C21.
Towards Universal Object Detection
Object detection is one of the most important and challenging research topics in computer vision. It plays an important role in our everyday lives and has many applications, e.g. surveillance, autonomous driving, robotics, drones, medical imaging, etc. The ultimate goal of object detection is a universal object detector that works well in any case, under any condition, like the human visual system. However, there are multiple challenges to the universality of object detection, e.g. scale variance, high-quality requirements, domain shift, computational constraints, etc. These prevent an object detector from being widely used across various object scales, in critical applications requiring extremely accurate localization, in scenarios with changing domain priors, and on diverse hardware. To address these challenges, multiple solutions are proposed in this thesis: an efficient multi-scale architecture for scale-invariant detection, a robust multi-stage framework effective for high-quality requirements, a cross-domain solution that extends universality over various domains, and a design of complexity-aware cascades together with a novel low-precision network to enhance universality under different computational constraints. All these efforts substantially improve the universality of object detection, allowing the advanced detector to be applied in broader environments.