Gaussian Process Modeling for Upsampling Algorithms With Applications in Computer Vision and Computational Fluid Dynamics
Across a variety of fields, interpolation algorithms have been used to upsample low-resolution or coarse data fields. In this work, novel Gaussian Process based methods are employed to solve a variety of upsampling problems. Specifically, three applications are explored: coarse data prolongation in Adaptive Mesh Refinement (AMR) in the field of Computational Fluid Dynamics, accurate document image upsampling to enhance Optical Character Recognition (OCR) accuracy, and fast and accurate Single Image Super Resolution (SISR). For AMR, a new, efficient, and "3rd order accurate" algorithm called GP-AMR is presented. Next, a novel, non-zero mean, windowed GP model is generated to upsample low-resolution document images, yielding higher OCR accuracy than the industry standard. Finally, a hybrid GP convolutional neural network algorithm is used to generate a computationally efficient and high-quality SISR model.
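The core of GP-based upsampling is the posterior mean of a Gaussian Process fitted to the coarse samples. The following is a minimal 1-D illustration, a sketch rather than the GP-AMR algorithm itself: the squared-exponential kernel, length scale, jitter value, and the naive linear solver are all illustrative assumptions.

```python
import math

def rbf(x1, x2, ell=1.0):
    # Squared-exponential kernel (an assumed choice; GP-AMR may differ)
    return math.exp(-0.5 * ((x1 - x2) / ell) ** 2)

def solve(A, b):
    # Naive Gaussian elimination with partial pivoting (small systems only)
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        p = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[p] = M[p], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x

def gp_upsample(xs, ys, xq, ell=1.0, noise=1e-8):
    # Zero-mean GP posterior mean: m(xq) = k(xq, X) @ K^{-1} y
    K = [[rbf(a, b, ell) + (noise if i == j else 0.0)
          for j, b in enumerate(xs)] for i, a in enumerate(xs)]
    alpha = solve(K, ys)
    return [sum(rbf(q, x, ell) * a for x, a in zip(xs, alpha)) for q in xq]

# Upsample a coarse 1-D field onto a finer grid
coarse_x = [0.0, 1.0, 2.0, 3.0]
coarse_y = [0.0, 1.0, 0.0, -1.0]
fine_x = [0.5, 1.5, 2.5]
fine_y = gp_upsample(coarse_x, coarse_y, fine_x)
```

Because the posterior mean interpolates, the fine field passes (up to the jitter) through the coarse samples, which is what makes the prolongation conservative in practice.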
Pixel Adapter: A Graph-Based Post-Processing Approach for Scene Text Image Super-Resolution
Current scene text image super-resolution approaches primarily focus on
extracting robust features, acquiring text information, and complex training
strategies to generate super-resolution images. However, the upsampling module,
which is crucial in the process of converting low-resolution images to
high-resolution ones, has received little attention in existing works. To
address this gap, we propose the Pixel Adapter Module (PAM), based on graph
attention, to mitigate pixel distortion caused by upsampling. The PAM effectively
captures local structural information by allowing each pixel to interact with
its neighbors and update features. Unlike previous graph attention mechanisms,
our approach achieves 2-3 orders of magnitude improvement in efficiency and
memory utilization by eliminating the dependency on sparse adjacency matrices
and introducing a sliding window approach for efficient parallel computation.
Additionally, we introduce the MLP-based Sequential Residual Block (MSRB) for
robust feature extraction from text images, and a Local Contour Awareness loss
to enhance the model's perception of fine details.
Comprehensive experiments on TextZoom demonstrate that our proposed method
generates high-quality super-resolution images, surpassing existing methods in
recognition accuracy. For single-stage and multi-stage strategies, we achieved
improvements of 0.7% and 2.6%, respectively, increasing the performance from
52.6% and 53.7% to 53.3% and 56.3%. The code is available at
https://github.com/wenyu1009/RTSRN
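The sliding-window idea, where each pixel attends only over its local neighborhood so no sparse adjacency matrix is ever materialized, can be sketched as below. This is a minimal single-head toy with the features used directly as queries, keys, and values; the window size and the absence of learned projections are simplifying assumptions, not the PAM implementation.

```python
import math

def window_attention(feat, k=3):
    # feat: H x W grid of d-dim feature vectors (lists of floats).
    # Each pixel attends only over its k x k neighborhood, so the
    # graph structure is implicit and no adjacency matrix is built.
    H, W = len(feat), len(feat[0])
    d = len(feat[0][0])
    r = k // 2
    out = []
    for i in range(H):
        row = []
        for j in range(W):
            q = feat[i][j]
            neigh = [feat[a][b]
                     for a in range(max(0, i - r), min(H, i + r + 1))
                     for b in range(max(0, j - r), min(W, j + r + 1))]
            # Scaled dot-product scores, softmax-normalized over the window
            scores = [sum(qc * kc for qc, kc in zip(q, key)) / math.sqrt(d)
                      for key in neigh]
            m = max(scores)
            w = [math.exp(s - m) for s in scores]
            z = sum(w)
            # Output is a convex combination of the neighborhood features
            row.append([sum(wi * v[c] for wi, v in zip(w, neigh)) / z
                        for c in range(d)])
        out.append(row)
    return out

grid = [[[float(i + j)] for j in range(4)] for i in range(4)]
smoothed = window_attention(grid)
```

In a real implementation the windows are gathered once and processed in parallel, which is where the claimed efficiency gain over sparse-matrix graph attention comes from.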
Extracting structured information from 2D images
Convolutional neural networks can handle an impressive array of supervised learning tasks while relying on a single backbone architecture, suggesting that one solution fits all vision problems. But for many tasks, we can directly make use of the problem structure within neural networks to deliver more accurate predictions. In this thesis, we propose novel deep learning components that exploit the structured output space of an increasingly complex set of problems. We start from Optical Character Recognition (OCR) in natural scenes and leverage the constraints imposed by the spatial outline of letters and language requirements. Conventional OCR systems do not work well in natural scenes due to distortions, blur, or letter variability. We introduce a new attention-based model, equipped with extra information about the neuron positions to guide its focus across characters sequentially. It beats the previous state-of-the-art benchmark by a significant margin. We then turn to dense labeling tasks employing encoder-decoder architectures. We start with an experimental study that documents the drastic impact that decoder design can have on task performance. Rather than optimizing one decoder per task separately, we propose new robust layers for the upsampling of high-dimensional encodings. We show that these better suit the structured per-pixel output across all tasks. Finally, we turn to the problem of urban scene understanding. There is an elaborate structure in both the input space (multi-view recordings, aerial and street-view scenes) and the output space (multiple fine-grained attributes for holistic building understanding). We design new models that benefit from the relatively simple cuboidal geometry of buildings to create a single unified representation from multiple views.
To benchmark our model, we build a new large-scale multi-view dataset of building images and fine-grained attributes, and show systematic improvements over a broad range of strong CNN-based baselines.
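Position-guided attention of the kind described above can be illustrated with a toy: attention weights that favor the location just past the previously attended one, so the decoder walks across characters sequentially. The content-free quadratic score and the step size are illustrative assumptions, not the thesis model.

```python
import math

def attend(features, positions, prev_pos, gamma=1.0):
    # Positional attention sketch: scores peak at locations near
    # prev_pos + 1, nudging the focus one character to the right.
    scores = [-gamma * (p - (prev_pos + 1.0)) ** 2 for p in positions]
    m = max(scores)
    w = [math.exp(s - m) for s in scores]
    z = sum(w)
    w = [x / z for x in w]
    # Context vector: attention-weighted mix of per-location encodings
    ctx = [sum(wi * f[c] for wi, f in zip(w, features))
           for c in range(len(features[0]))]
    return ctx, w

features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # per-location encodings
positions = [0.0, 1.0, 2.0]
ctx, weights = attend(features, positions, prev_pos=0.0)
```

A trained model would combine such a positional term with content-based scores; the sketch only shows why injecting position information biases the focus to move in reading order.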
Person annotation in video sequences
In recent years, the demand for video tools to automatically annotate and classify large audiovisual datasets has increased considerably. One specific task in this field applies to TV broadcast videos: determining who appears in a video sequence, and when. This work builds on the ALBAYZIN evaluation series presented at IberSPEECH-RTVE 2018 in Barcelona; the purpose of this thesis is to improve the results obtained and to compare different face detection and tracking methods. We evaluate the performance of classic face detection techniques and of other techniques based on machine learning on a closed dataset of 34 known people. The remaining characters in the audiovisual document are labelled as "unknown". We work with short videos and images of each known character to build his/her model and, finally, evaluate the performance of the ALBAYZIN algorithm on a 2-hour video called "La noche en 24H", whose format resembles a news program. We analyze the results, the types of errors and scenarios encountered, and the solutions we propose for each of them, where applicable. In this work, we focus only on a monomodal basis of face recognition and tracking.
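The closed-set-plus-"unknown" decision described above is commonly implemented by matching a face embedding against a gallery of known people and rejecting weak matches. The sketch below assumes precomputed embeddings and a cosine-similarity threshold; both the embeddings and the threshold value are hypothetical, not the ALBAYZIN system.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def label_face(embedding, gallery, threshold=0.6):
    # gallery: name -> reference embedding for each known person.
    # Any face whose best match falls below the threshold is "unknown".
    best_name, best_sim = None, -1.0
    for name, ref in gallery.items():
        sim = cosine(embedding, ref)
        if sim > best_sim:
            best_name, best_sim = name, sim
    return best_name if best_sim >= threshold else "unknown"

# Hypothetical 2-D embeddings for two known people
gallery = {"person_a": [1.0, 0.0], "person_b": [0.0, 1.0]}
```

In practice the gallery embedding per person is built from the enrollment videos and images, and the threshold is tuned on held-out data to trade false accepts against false "unknown" labels.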
Real-time multiframe blind deconvolution of solar images
The quality of images of the Sun obtained from the ground is severely
limited by the perturbing effect of the turbulent Earth's atmosphere. The
post-facto correction of the images to compensate for the presence of the
atmosphere requires the combination of high-order adaptive optics techniques,
fast measurements to freeze the turbulent atmosphere, and very time-consuming
blind deconvolution algorithms. Under mild seeing conditions, blind
deconvolution algorithms can produce images of astonishing quality. They can be
very competitive with those obtained from space, with the huge advantage of the
flexibility of the instrumentation thanks to the direct access to the
telescope. In this contribution we leverage deep learning techniques to
significantly accelerate the blind deconvolution process and produce corrected
images at a peak rate of ~100 images per second. We present two different
architectures that produce excellent image corrections with noise suppression
while maintaining the photometric properties of the images. As a consequence,
polarimetric signals can be obtained with standard polarimetric modulation
without any significant artifact. With the expected improvements in computer
hardware and algorithms, we anticipate that on-site real-time correction of
solar images will be possible in the near future.
Comment: 16 pages, 12 figures, accepted for publication in A&
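The multiframe setting rests on a simple forward model: each short exposure is the same object convolved with a different instantaneous point spread function (PSF), plus noise. The sketch below shows that model in 1-D; the object and PSFs are made-up toy values, and a blind deconvolution method (classical or learned) would recover the object and PSFs jointly from such frames.

```python
def convolve(signal, psf):
    # Discrete convolution: the atmospheric forward model o_t = s * p_t
    n, m = len(signal), len(psf)
    out = [0.0] * (n + m - 1)
    for i, s in enumerate(signal):
        for j, p in enumerate(psf):
            out[i + j] += s * p
    return out

# Two short exposures of the same object, each blurred by a different
# (hypothetical) instantaneous PSF. A blind deconvolution network maps
# a burst of such frames back to a single deblurred estimate of `obj`.
obj = [0.0, 0.0, 1.0, 0.0, 0.0]
psfs = [[0.25, 0.5, 0.25], [0.5, 0.5, 0.0]]
frames = [convolve(obj, p) for p in psfs]
```

Because each PSF is normalized to sum to one, the total flux of the object is preserved in every frame, which is the photometric property the abstract emphasizes the corrections must maintain.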
MOVIN: Real-time Motion Capture using a Single LiDAR
Recent advancements in technology have brought forth new forms of interactive
applications, such as the social metaverse, where end users interact with each
other through their virtual avatars. In such applications, precise full-body
tracking is essential for an immersive experience and a sense of embodiment
with the virtual avatar. However, current motion capture systems are not easily
accessible to end users due to their high cost, the requirement for special
skills to operate them, or the discomfort associated with wearable devices. In
this paper, we present MOVIN, a data-driven generative method for real-time
motion capture with global tracking, using a single LiDAR sensor. Our
autoregressive conditional variational autoencoder (CVAE) model learns the
distribution of pose variations conditioned on the given 3D point cloud from
LiDAR. As a central factor for high-accuracy motion capture, we propose a novel
feature encoder to learn the correlation between the historical 3D point cloud
data and global, local pose features, resulting in effective learning of the
pose prior. Global pose features include root translation, rotation, and foot
contacts, while local features comprise joint positions and rotations.
Subsequently, a pose generator takes into account the sampled latent variable
along with the features from the previous frame to generate a plausible current
pose. Our framework accurately predicts the performer's 3D global information
and local joint details while effectively considering temporally coherent
movements across frames. We demonstrate the effectiveness of our architecture
through quantitative and qualitative evaluations, comparing it against
state-of-the-art methods. Additionally, we implement a real-time application to
showcase our method in real-world scenarios. The MOVIN dataset is available at
https://movin3d.github.io/movin_pg2023/
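The autoregressive CVAE rollout described above, sampling a latent per frame and decoding it together with the previous pose and the point-cloud feature, can be sketched structurally as below. The linear "decoder", its coefficients, and the fixed latent distribution are placeholders for the trained networks, not the MOVIN model.

```python
import random

def sample_latent(mu, sigma, rng):
    # Reparameterization trick: z = mu + sigma * eps, eps ~ N(0, 1)
    return [m + s * rng.gauss(0.0, 1.0) for m, s in zip(mu, sigma)]

def decode(z, prev_pose, cond):
    # Toy linear stand-in for the pose generator: next pose from the
    # latent, the previous pose, and a conditioning feature (a stand-in
    # for the encoded LiDAR point cloud).
    return [0.8 * p + 0.1 * zi + 0.1 * c
            for p, zi, c in zip(prev_pose, z, cond)]

def generate(pose0, conds, mu, sigma, seed=0):
    # Autoregressive rollout: each frame conditions on the previous pose,
    # which is what keeps the generated motion temporally coherent.
    rng = random.Random(seed)
    pose, traj = pose0, []
    for cond in conds:
        z = sample_latent(mu, sigma, rng)
        pose = decode(z, pose, cond)
        traj.append(pose)
    return traj

traj = generate([0.0, 0.0], [[1.0, 1.0]] * 5, [0.0, 0.0], [0.1, 0.1])
```

In the real system the "pose" would split into the global features (root translation, rotation, foot contacts) and local features (joint positions and rotations) named in the abstract.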
Investigating Ensembles of Single-class Classifiers for Multi-class Classification
Traditional methods of multi-class classification in machine learning involve the use of a monolithic feature extractor and classifier head trained on data from all of the classes at once. These architectures (especially the classifier head) are dependent on the number and types of classes, and are therefore rigid against changes to the class set. For best performance, one must retrain networks with these architectures from scratch, incurring a large cost in training time. These networks can also be biased towards classes with a large imbalance in training data compared to other classes. Instead, ensembles of so-called "single-class" classifiers can be used for multi-class classification by training an individual network for each class. We show that these ensembles of single-class classifiers are more flexible to changes to the class set than traditional models, and can be quickly retrained to accommodate small changes to the class set, such as adding, removing, splitting, or fusing classes. We also show that these ensembles are less biased towards classes with large imbalances in their training data than traditional models. Finally, we introduce a new, more powerful single-class classification architecture. These models are trained and tested on a plant disease dataset with high variance in the number of classes and the amount of data in each class, as well as on an Alzheimer's dataset with little data and a large imbalance in data between classes.
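The flexibility argument can be made concrete with a toy ensemble: one independently trained scorer per class, combined at prediction time by taking the best score. Here each "classifier" is just a class centroid scored by negative distance, an illustrative stand-in for the per-class networks in the abstract.

```python
def train_centroid(samples):
    # Stand-in single-class "classifier": the class centroid; its score
    # for a point is the negative squared distance to that centroid.
    d = len(samples[0])
    return [sum(s[c] for s in samples) / len(samples) for c in range(d)]

def score(model, x):
    return -sum((a - b) ** 2 for a, b in zip(model, x))

def predict(ensemble, x):
    # Multi-class decision from independent per-class scores: classes
    # can be added or removed without retraining the other members.
    return max(ensemble, key=lambda name: score(ensemble[name], x))

data = {"a": [[0.0, 0.0], [0.2, 0.0]], "b": [[1.0, 1.0], [0.8, 1.0]]}
ensemble = {name: train_centroid(xs) for name, xs in data.items()}
# Adding a class touches only its own model, never "a" or "b":
ensemble["c"] = train_centroid([[2.0, 2.0]])
```

Splitting or fusing classes works the same way: retrain only the affected members, leaving the rest of the ensemble intact.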
TextScanner: Reading Characters in Order for Robust Scene Text Recognition
Driven by deep learning and the large volume of data, scene text recognition
has evolved rapidly in recent years. Formerly, RNN-attention based methods have
dominated this field, but suffer from the problem of "attention drift"
in certain situations. Lately, semantic segmentation based algorithms have
proven effective at recognizing text of different forms (horizontal, oriented
and curved). However, these methods may produce spurious characters or miss
genuine characters, as they rely heavily on a thresholding procedure operated
on segmentation maps. To tackle these challenges, we propose in this paper an
alternative approach, called TextScanner, for scene text recognition.
TextScanner bears three characteristics: (1) Basically, it belongs to the
semantic segmentation family, as it generates pixel-wise, multi-channel
segmentation maps for character class, position and order; (2) Meanwhile, akin
to RNN-attention based methods, it also adopts RNN for context modeling; (3)
Moreover, it performs parallel prediction of character position and class,
and ensures that characters are transcribed in the correct order. The experiments
on standard benchmark datasets demonstrate that TextScanner outperforms the
state-of-the-art methods. Moreover, TextScanner shows its superiority in
recognizing more difficult text, such as Chinese transcripts, and in aligning
with target characters.
Comment: Accepted by AAAI-202
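The order-map decoding idea, one spatial attention map per character position combined with a shared per-pixel class map, can be sketched as follows. This is a minimal toy on a 1x2 "image" with a two-letter charset; the map shapes and values are illustrative, not TextScanner's trained outputs.

```python
def decode(class_maps, order_maps, charset):
    # class_maps: per-pixel distribution over characters (H x W x C).
    # order_maps: one spatial weighting map per character slot (K x H x W).
    # Character k = argmax_c sum_{x,y} order_k(x, y) * class(x, y, c),
    # so every slot is decoded independently, in parallel, and the
    # output string is in reading order by construction.
    text = []
    for order in order_maps:
        scores = [0.0] * len(charset)
        for i, row in enumerate(order):
            for j, w in enumerate(row):
                for c, p in enumerate(class_maps[i][j]):
                    scores[c] += w * p
        text.append(charset[max(range(len(charset)),
                                key=scores.__getitem__)])
    return "".join(text)

charset = "ab"
# 1 x 2 image: left pixel reads 'a', right pixel reads 'b'
class_maps = [[[0.9, 0.1], [0.1, 0.9]]]
order_maps = [[[1.0, 0.0]],   # slot 0 attends to the left pixel
              [[0.0, 1.0]]]   # slot 1 attends to the right pixel
```

Because each slot aggregates over the whole image, no per-character thresholding of segmentation maps is needed, which is how this scheme avoids the spurious/missing-character failure mode described above.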