Real-time Scene Text Detection with Differentiable Binarization
Recently, segmentation-based methods have become quite popular in scene text
detection, as the segmentation results can more accurately describe scene text
of various shapes such as curved text. However, the post-processing of
binarization is essential for segmentation-based detection, which converts
probability maps produced by a segmentation method into bounding boxes/regions
of text. In this paper, we propose a module named Differentiable Binarization
(DB), which can perform the binarization process in a segmentation network.
Optimized along with a DB module, a segmentation network can adaptively set the
thresholds for binarization, which not only simplifies the post-processing but
also enhances the performance of text detection. Based on a simple segmentation
network, we validate the performance improvements of DB on five benchmark
datasets, where it consistently achieves state-of-the-art results in terms of
both detection accuracy and speed. In particular, with a light-weight backbone,
the performance improvements by DB are significant so that we can look for an
ideal tradeoff between detection accuracy and efficiency. Specifically, with a
backbone of ResNet-18, our detector achieves an F-measure of 82.8, running at
62 FPS, on the MSRA-TD500 dataset. Code is available at:
https://github.com/MhLiao/DB
Comment: Accepted to AAAI 202
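The abstract above does not spell out how binarization is made differentiable. As a minimal sketch, assuming the approximate step function given in the published DB paper (a steep sigmoid of the difference between the probability map and a learned threshold map, with amplification factor k = 50), the idea can be illustrated on scalar values. The function names and the sample values below are hypothetical and chosen only for illustration; they are not taken from the authors' code.

```python
import math


def hard_binarization(p, t):
    """Standard (non-differentiable) thresholding of a probability p at threshold t."""
    return 1.0 if p >= t else 0.0


def differentiable_binarization(p, t, k=50.0):
    """Approximate step function: a steep sigmoid of (p - t).

    p : predicted text probability at a pixel, in [0, 1]
    t : threshold at the same pixel, in [0, 1] (learned by the network in DB)
    k : amplification factor; 50 is the value assumed here from the paper
    """
    return 1.0 / (1.0 + math.exp(-k * (p - t)))


if __name__ == "__main__":
    threshold = 0.3  # hypothetical per-pixel threshold
    for prob in (0.10, 0.28, 0.30, 0.32, 0.90):
        soft = differentiable_binarization(prob, threshold)
        hard = hard_binarization(prob, threshold)
        print(f"p={prob:.2f}  hard={hard:.0f}  soft={soft:.4f}")
```

Far from the threshold the soft output is close to the hard 0/1 decision, while near the threshold it varies smoothly, which is what allows gradients to flow through the binarization step during training.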
Real-time Arabic scene text detection using fully convolutional neural networks
The aim of this research is to propose a fully convolutional approach to the problem of real-time scene text detection for the Arabic language. Text detection is performed using a two-step multi-scale approach. The first step uses a lightweight fully convolutional network, TextBlockDetector FCN, an adaptation of VGG-16, to eliminate non-textual elements, localize wide-scale text, and estimate text scale. The second step determines a narrow scale range of text using a fully convolutional network for maximum performance. To evaluate the system, we compare the framework's results with those obtained by a single, fully deployed VGG-16 performing one-shot text detection, as well as with previous state-of-the-art results. For training and testing, we built a dataset of 575 manually processed images, with data augmentation to enrich the training process. The system achieves a precision of 0.651, versus 0.64 for the state of the art, and runs at 24.3 FPS, versus 31.7 FPS for a fully deployed VGG-16.
CCLAP: Controllable Chinese Landscape Painting Generation via Latent Diffusion Model
With the development of deep generative models, recent years have seen great
success of Chinese landscape painting generation. However, few works focus on
controllable Chinese landscape painting generation due to the lack of data and
limited modeling capabilities. In this work, we propose a controllable Chinese
landscape painting generation method named CCLAP, which can generate painting
with specific content and style based on Latent Diffusion Model. Specifically,
it consists of two cascaded modules, i.e., content generator and style
aggregator. The content generator module guarantees that the content of generated
paintings is specific to the input text, while the style aggregator module
generates paintings in a style corresponding to a reference image. Moreover, a
new dataset of Chinese landscape paintings named CLAP is collected for
comprehensive evaluation. Both the qualitative and quantitative results
demonstrate that our method achieves state-of-the-art performance, especially
in artfully-composed and artistic conception. Codes are available at
https://github.com/Robin-WZQ/CCLAP.
Comment: 8 pages, 13 figure
UNITS: Unsupervised Intermediate Training Stage for Scene Text Detection
Recent scene text detection methods are almost all based on deep learning and
are data-driven. Synthetic data is commonly adopted for pre-training due to
expensive annotation cost. However, there are obvious domain discrepancies
between synthetic data and real-world data. It may lead to sub-optimal
performance to directly adopt the model initialized by synthetic data in the
fine-tuning stage. In this paper, we propose a new training paradigm for scene
text detection, which introduces an UNsupervised Intermediate
Training Stage (UNITS) that builds a buffer path to
real-world data and can alleviate the gap between the pre-training stage and
fine-tuning stage. Three training strategies are further explored to perceive
information from real-world data in an unsupervised way. With UNITS, scene text
detectors are improved without introducing any parameters and computations
during inference. Extensive experimental results show consistent performance
improvements on three public datasets.
Comment: Accepted by ICME 202
A Text Recognition Algorithm Based on a Dual-Attention Mechanism in Complex Driving Environment
In response to problems such as complex backgrounds in the text recognition environment, perspective distortion, shallow handwriting, and mixed Chinese and English characters, we have designed an OCR algorithm framework comprising landmark extraction and correction, image enhancement, text detection, and text recognition. We have designed a DBNet based on a dual-attention mechanism and content-aware upsampling. We have also designed a text recognition module incorporating the central loss CRNN + CTC to improve content awareness. Experimental results show that the improved text detection network in this paper has increased accuracy by 5.09%, recall by 2.12%, and F-score by 3.46% on the ICDAR2015 dataset. The text recognition network has improved the accuracy of recognizing Chinese and English characters by 1.2%.
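The F-score quoted in detection results such as those above is conventionally the harmonic mean of precision and recall. The sketch below shows that standard definition only; the function name and the sample precision and recall values are hypothetical and are not results from any of the papers listed here.

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall (the F1 score).

    Returns 0.0 when both inputs are zero, to avoid dividing by zero.
    """
    if precision + recall == 0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Hypothetical values, for illustration only.
    p, r = 0.90, 0.80
    print(f"precision={p:.2f} recall={r:.2f} F-measure={f_measure(p, r):.4f}")
```

Because the harmonic mean is pulled toward the smaller of its two inputs, a detector cannot reach a high F-score by improving precision alone or recall alone.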
MixNet: Toward Accurate Detection of Challenging Scene Text in the Wild
Detecting small scene text instances in the wild is particularly challenging,
where the influence of irregular positions and nonideal lighting often leads to
detection errors. We present MixNet, a hybrid architecture that combines the
strengths of CNNs and Transformers, capable of accurately detecting small text
from challenging natural scenes, regardless of the orientations, styles, and
lighting conditions. MixNet incorporates two key modules: (1) the Feature
Shuffle Network (FSNet) to serve as the backbone and (2) the Central
Transformer Block (CTBlock) to exploit the 1D manifold constraint of the scene
text. We first introduce a novel feature shuffling strategy in FSNet to
facilitate the exchange of features across multiple scales, generating
high-resolution features superior to popular ResNet and HRNet. The FSNet
backbone has achieved significant improvements over many existing text
detection methods, including PAN, DB, and FAST. Then we design a complementary
CTBlock to leverage center line based features similar to the medial axis of
text regions and show that it can outperform contour-based approaches in
challenging cases when small scene texts appear closely. Extensive experimental
results show that MixNet, which mixes FSNet with CTBlock, achieves
state-of-the-art results on multiple scene text detection datasets.