Detection and Rectification of Arbitrary Shaped Scene Texts by using Text Keypoints and Links
Detection and recognition of scene texts of arbitrary shapes remain a grand challenge due to the rich variation in text line orientations, lengths, curvatures, etc. This paper presents a mask-guided multi-task network that detects and rectifies scene texts of arbitrary shapes reliably. Three types of keypoints are detected, which accurately specify the centre line and hence the shape of each text instance. In addition, four types of keypoint links are detected: the horizontal links associate the detected keypoints of each text instance, while the vertical links predict, for each keypoint, a pair of landmark points along the upper and lower text boundaries. Scene texts can thus be located by linking up the associated landmark points into localization polygon boxes, and rectified by transforming the polygon boxes via thin plate spline. Extensive experiments over several public datasets show that the use of text keypoints is tolerant to variation in text orientations, lengths, and curvatures, and achieves superior scene text detection and rectification performance compared with state-of-the-art methods.
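As a concrete illustration of the rectification step, here is a minimal sketch that warps a curved text instance to a horizontal rectangle with a thin plate spline, assuming the landmark points along the upper and lower text boundaries have already been predicted. It uses OpenCV's TPS transformer (opencv-contrib-python); the point count, output size, and final crop are illustrative choices, not the paper's exact setup.

```python
import cv2
import numpy as np

def rectify_text(image, upper_pts, lower_pts, out_w=256, out_h=64):
    """Warp a curved text region to an out_h x out_w horizontal rectangle.

    upper_pts, lower_pts: (K, 2) float32 landmark points along the upper and
    lower text boundaries (hypothetical output of the keypoint network).
    """
    k = len(upper_pts)
    src = np.concatenate([upper_pts, lower_pts]).reshape(1, -1, 2).astype(np.float32)

    # Target landmarks: evenly spaced points on the top and bottom edges of
    # the output rectangle.
    xs = np.linspace(0, out_w - 1, k, dtype=np.float32)
    top = np.stack([xs, np.zeros(k, np.float32)], axis=1)
    bot = np.stack([xs, np.full(k, out_h - 1, np.float32)], axis=1)
    dst = np.concatenate([top, bot]).reshape(1, -1, 2)

    tps = cv2.createThinPlateSplineShapeTransformer()
    matches = [cv2.DMatch(i, i, 0) for i in range(2 * k)]
    # warpImage applies the backward mapping, so the transformation is
    # estimated from the target shape back to the source shape.
    tps.estimateTransformation(dst, src, matches)
    return tps.warpImage(image)[:out_h, :out_w]
```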
Dataset Condensation via Generative Model
Dataset condensation aims to condense a large dataset with many training samples into a small set. Previous methods usually condense the dataset into pixel format, which suffers from slow optimization and a large number of parameters to be optimized. As image resolution and the number of classes increase, the number of learnable parameters grows accordingly, preventing condensation methods from scaling up to large datasets with diverse classes. Moreover, the relations among condensed samples have been neglected, and hence the feature distribution of condensed samples is often not diverse. To solve
these problems, we propose to condense the dataset into another format, a
generative model. Such a novel format allows for the condensation of large
datasets because the size of the generative model remains relatively stable as
the number of classes or image resolution increases. Furthermore, an
intra-class and an inter-class loss are proposed to model the relation of
condensed samples. Intra-class loss aims to create more diverse samples for
each class by pushing each sample away from the others of the same class.
Meanwhile, inter-class loss increases the discriminability of samples by
widening the gap between the centers of different classes. Extensive comparisons with state-of-the-art methods and our ablation studies confirm the effectiveness of our method and its individual components. To the best of our knowledge, we are the first to successfully conduct condensation on
ImageNet-1k.
Comment: old work, done in 202
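The two relation losses admit a compact sketch. The version below is illustrative rather than the paper's exact formulation: it assumes condensed samples have been encoded into feature vectors by some extractor, measures relations with cosine similarity, and introduces a margin hyperparameter of its own.

```python
import torch
import torch.nn.functional as F

def relation_losses(feats, labels, margin=0.5):
    """feats: (N, D) features of condensed samples; labels: (N,) class ids."""
    feats = F.normalize(feats, dim=1)
    sim = feats @ feats.t()                       # pairwise cosine similarity
    same = labels[:, None].eq(labels[None, :])
    off_diag = ~torch.eye(len(feats), dtype=torch.bool, device=feats.device)

    # Intra-class loss: penalize similarity between distinct samples of the
    # same class, pushing each sample away from the others of its class.
    intra = sim[same & off_diag].mean()

    # Inter-class loss: penalize similarity between class centers above a
    # hinge at (1 - margin), widening the gap between different classes.
    classes = labels.unique()
    centers = F.normalize(
        torch.stack([feats[labels == c].mean(0) for c in classes]), dim=1)
    csim = centers @ centers.t()
    off_c = ~torch.eye(len(classes), dtype=torch.bool, device=feats.device)
    inter = F.relu(csim[off_c] - (1.0 - margin)).mean()
    return intra, inter
```

In a condensation loop, both terms would be added to the main matching objective with suitable weights.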
Is synthetic data from generative models ready for image recognition?
Recent text-to-image generation models have shown promising results in
generating high-fidelity photo-realistic images. Though the results are
astonishing to human eyes, how applicable these generated images are for
recognition tasks remains under-explored. In this work, we extensively study
whether and how synthetic images generated from state-of-the-art text-to-image
generation models can be used for image recognition tasks, and focus on two
perspectives: synthetic data for improving classification models in data-scarce
settings (i.e. zero-shot and few-shot), and synthetic data for large-scale
model pre-training for transfer learning. We showcase the strengths and shortcomings of synthetic data from existing generative models, and propose strategies for applying synthetic data to recognition tasks more effectively. Code: https://github.com/CVMI-Lab/SyntheticData
Comment: ICLR 2023, spotlight
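To make the setting concrete, the sketch below builds a small synthetic training set with an off-the-shelf text-to-image model via the diffusers library; the checkpoint, prompt template, class names, and image counts are assumptions for illustration, not the paper's protocol.

```python
import os
import torch
from diffusers import StableDiffusionPipeline

# Load a public text-to-image checkpoint (an assumption; any
# diffusers-compatible model would do) with half precision on GPU.
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

os.makedirs("synthetic", exist_ok=True)
class_names = ["goldfish", "tabby cat", "school bus"]  # placeholder labels
for name in class_names:
    for i in range(8):  # a few images per class, as in a data-scarce setting
        image = pipe(f"a photo of a {name}").images[0]
        image.save(f"synthetic/{name.replace(' ', '_')}_{i}.png")
```

A classifier can then be trained or fine-tuned on the generated images, exactly as it would be on a real labeled dataset.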
Free-ATM: Exploring Unsupervised Learning on Diffusion-Generated Images with Free Attention Masks
Despite the rapid advancement of unsupervised learning in visual representation, it requires training on large-scale datasets that demand costly data collection and pose additional challenges due to data privacy concerns. Recently, synthetic images generated by text-to-image diffusion models have shown great potential for benefiting image recognition. Although
promising, there has been inadequate exploration dedicated to unsupervised
learning on diffusion-generated images. To address this, we start by uncovering
that diffusion models' cross-attention layers inherently provide
annotation-free attention masks aligned with corresponding text inputs on
generated images. We then investigate the problems of three prevalent unsupervised learning techniques (i.e., contrastive learning, masked modeling, and vision-language pretraining) and introduce customized solutions by fully
exploiting the aforementioned free attention masks. Our approach is validated
through extensive experiments that show consistent improvements in baseline
models across various downstream tasks, including image classification,
detection, segmentation, and image-text retrieval. By utilizing our method, it is possible to close the performance gap between unsupervised pretraining on synthetic data and pretraining on real-world data.
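The core observation admits a minimal sketch, assuming the cross-attention maps have already been captured from the denoising network during generation; the tensor layout, resolution, and threshold below are illustrative assumptions.

```python
import torch

def token_mask(attn_maps, token_idx, hw=(64, 64), thresh=0.5):
    """Derive an annotation-free mask for one prompt token.

    attn_maps: list of cross-attention tensors of shape (heads, H*W, tokens),
    collected across layers/timesteps at a common spatial resolution
    (an assumption about how the maps were captured).
    """
    # Average the chosen token's attention over heads, then over all maps.
    maps = torch.stack([a[:, :, token_idx].mean(dim=0) for a in attn_maps])
    m = maps.mean(dim=0).reshape(hw)
    m = (m - m.min()) / (m.max() - m.min() + 1e-8)   # normalize to [0, 1]
    return (m > thresh).float()                      # binary attention mask
```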
Analysis of the driving path of e-commerce to high-quality agricultural development in China: empirical evidence from mediating effect models
Purpose: This study investigates the impact of e-commerce on high-quality agricultural development (HQAD) in China. As the agricultural sector transitions towards higher-quality production in the digital era, understanding the influence pathways and mechanisms of e-commerce becomes crucial. We aim to quantify this influence through a hierarchical approach.
Methods: Utilizing provincial panel data from 2000 to 2021, we construct a comprehensive HQAD evaluation system using the entropy method. Parallel mediating effect models are employed to empirically assess the multi-level effects of e-commerce on HQAD.
Results: Benchmark regression analyses reveal a significant positive effect of e-commerce on HQAD, indicating its role as a key driver in China’s agricultural advancement. Mechanism tests identify several intermediary pathways through which e-commerce indirectly promotes HQAD, including market expansion, agricultural value chain optimization, enhanced social services, and improved infrastructure. Notably, market expansion and value chain optimization demonstrate the most substantial mediation effects, accounting for 43.27% and 14.18% of the total effect, respectively.
Discussion: This research contributes to the literature by establishing a comprehensive HQAD evaluation framework, providing a theoretical foundation for future studies. By incorporating circulation factors into the production system, we elucidate the complex influence mechanisms of e-commerce on agricultural production, addressing a significant research gap. Furthermore, we propose a novel “demand-driven supply optimization” paradigm, offering valuable insights for policy formulation aimed at fostering HQAD in China.
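For readers unfamiliar with mediating effect models, here is a minimal Baron-Kenny-style sketch of how a mediation share like the 43.27% above is computed; the variable names and data are hypothetical, and the paper's parallel mediation models are more elaborate.

```python
import numpy as np
import statsmodels.api as sm

def mediation_share(X, M, Y):
    """Share of X's total effect on Y transmitted through mediator M.

    X: e-commerce level, M: a mediator (e.g., a market expansion index),
    Y: HQAD index -- all 1-D arrays over province-year observations
    (hypothetical variables for illustration).
    """
    c = sm.OLS(Y, sm.add_constant(X)).fit().params[1]   # total effect: X -> Y
    a = sm.OLS(M, sm.add_constant(X)).fit().params[1]   # path a: X -> M
    XM = sm.add_constant(np.column_stack([X, M]))
    b = sm.OLS(Y, XM).fit().params[2]                   # path b: M -> Y given X
    return (a * b) / c                                  # indirect / total
```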
Image aesthetic style classification and region detection using Convolutional Neural Network
Convolutional Neural Networks (CNN) have become popular in recent years, especially in the field of image processing. They have been successfully applied to object image classification, object detection, video analysis, and so on with good results. Owing to the good feature extraction performance of CNNs, research on automatic aesthetic analysis of images by deep learning has started. However, previous work on image aesthetic analysis, such as [5], mainly addresses image aesthetic rating or binary aesthetic classification. Therefore, our project aims at learning image aesthetic styles using a CNN as well as generating bounding boxes of regions for the corresponding styles.
This project comprises two main parts: image aesthetic style classification and image aesthetic style region detection. We first build the network based on [5] and train an image aesthetic style classification model on the AVA Dataset [4] with some selected style classes after data cleaning. Using this pre-trained model, we then apply the Faster R-CNN [1] algorithm to image aesthetic style region detection. This is implemented by first manually labeling image aesthetic style regions in selected images of the AVA Dataset, building the corresponding Region Proposal Network and Fast R-CNN network [1] based on the RAPID network [5], and training on these labeled images with the pre-trained image aesthetic style classification model.
Bachelor of Engineering (Computer Engineering)
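As a rough illustration of the detection stage, the sketch below adapts torchvision's off-the-shelf Faster R-CNN to a handful of aesthetic-style classes; it stands in for the RAPID-based networks of [1] and [5], and the class count is a placeholder.

```python
import torchvision
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor

num_style_classes = 4  # placeholder: number of selected AVA style classes

# Start from a detector pre-trained on COCO, then swap the box predictor so
# its classification head covers the style classes plus background.
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
in_features = model.roi_heads.box_predictor.cls_score.in_features
model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_style_classes + 1)

# Training then follows the standard torchvision detection recipe on the
# manually labeled style-region boxes.
```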
Accurate and robust detection and recognition of texts in scene
Scene text detection and recognition aim to localize texts in natural scene images and output the corresponding character sequences. Automated scene text detection and recognition have attracted increasing interest in the computer vision and deep learning communities due to their wide range of applications in neural machine translation, autonomous driving, etc. Compared with preliminary research that focused on the design of hand-crafted features, modern deep-learning-based techniques have achieved significant improvements on scene text detection and recognition tasks. Such frameworks usually deploy convolutional neural networks (CNN), recurrent neural networks (RNN), or Transformers to extract image features for accurate text detection and recognition. However, automatically detecting and recognizing texts in scenes remains challenging due to the complexity of scene text images. First, texts in scenes exhibit high variability and diversity in appearance due to the complex patterns of texts (e.g., colors, fonts, etc.) and various environments (e.g., lighting, occlusion, etc.). Second, scene texts usually have different lengths, orientations, and shapes that may suffer from both perspective and curvature distortions. Third, scene images usually have complex backgrounds that may contain patterns similar to texts (e.g., trees, traffic signs, etc.). Any of these factors can lead to incorrect predictions in scene text detection and recognition tasks.
In this thesis, we propose several novel techniques for scene text detection and recognition that aim to produce more accurate detection and recognition of scene texts in different orientations, lengths, sizes, and shapes. First, we design a novel scene text detection approach that detects texts through border semantics awareness and bootstrapping. We introduce a bootstrapping technique that samples multiple 'subsections' of a word or text line and accordingly relieves the constraint of limited training data effectively. In addition, a semantics-aware text border detection technique is designed which produces four types of text border segments for text detection. Second, we develop a novel multi-scale shape regression network (MSR) for accurate scene text detection. It detects scene texts by predicting dense text boundary points instead of sparse quadrilateral vertices, which often suffer from regression errors when dealing with long text lines. Additionally, the multi-scale network extracts and fuses features at different scales concurrently and seamlessly, which demonstrates superb tolerance to text scale variation. Third, we design a mask-guided multi-task network that reliably detects and rectifies scene texts of arbitrary shapes. The proposed network detects text keypoints and landmark points for accurate text detection and rectification. Fourth, we propose a novel scene text recognition method, I2C2W, that is tolerant to geometric and photometric degradation by decomposing scene text recognition into two inter-connected tasks and leveraging the advances of the Transformer architecture. Extensive experiments show that the proposed techniques can accurately detect and recognize texts with various lengths, orientations, and shapes from natural scene images.
Doctor of Philosophy
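The multi-scale feature fusion mentioned in the second contribution admits a compact sketch. The version below is illustrative, not the thesis' exact MSR design: backbone features from several stages are projected to a common width, upsampled to the finest resolution, and summed; the channel sizes and fusion rule are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiScaleFusion(nn.Module):
    """Fuse feature maps from several backbone stages (assumed channel sizes)."""

    def __init__(self, in_channels=(256, 512, 1024, 2048), width=256):
        super().__init__()
        # 1x1 convolutions project every stage to a common channel width.
        self.lateral = nn.ModuleList(nn.Conv2d(c, width, 1) for c in in_channels)

    def forward(self, feats):
        # feats: list of feature maps, highest spatial resolution first.
        target = feats[0].shape[-2:]
        fused = [
            F.interpolate(l(f), size=target, mode="bilinear", align_corners=False)
            for l, f in zip(self.lateral, feats)
        ]
        return torch.stack(fused).sum(0)  # concurrent fusion across scales
```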