424 research outputs found

    Pixel Adapter: A Graph-Based Post-Processing Approach for Scene Text Image Super-Resolution

    Current scene text image super-resolution approaches primarily focus on extracting robust features, acquiring text information, and complex training strategies to generate super-resolution images. However, the upsampling module, which is crucial in converting low-resolution images to high-resolution ones, has received little attention in existing works. To address this issue, we propose the Pixel Adapter Module (PAM), based on graph attention, to correct pixel distortion caused by upsampling. The PAM effectively captures local structural information by allowing each pixel to interact with its neighbors and update features. Unlike previous graph attention mechanisms, our approach achieves 2-3 orders of magnitude improvement in efficiency and memory utilization by eliminating the dependency on sparse adjacency matrices and introducing a sliding window approach for efficient parallel computation. Additionally, we introduce the MLP-based Sequential Residual Block (MSRB) for robust feature extraction from text images, and a Local Contour Awareness loss ($\mathcal{L}_{lca}$) to enhance the model's perception of details. Comprehensive experiments on TextZoom demonstrate that our proposed method generates high-quality super-resolution images, surpassing existing methods in recognition accuracy. For single-stage and multi-stage strategies, we achieved improvements of 0.7% and 2.6%, respectively, increasing the performance from 52.6% and 53.7% to 53.3% and 56.3%. The code is available at https://github.com/wenyu1009/RTSRN
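The sliding-window idea in this abstract can be illustrated with a minimal sketch. This is not the authors' PAM: it swaps in plain dot-product attention over a k x k window, and the function name `sliding_window_attention`, the edge padding, and the residual update are all assumptions made for illustration. The point it shows is that neighbours are gathered by shifting the feature map, so no N x N adjacency matrix is built. For a hypothetical 16 x 64 feature map (1,024 pixels), a dense adjacency would hold 1,048,576 entries, while 3 x 3 windows need only 9,216 attention weights.

```python
import numpy as np

def sliding_window_attention(feat, k=3):
    """Let each pixel attend to its k x k neighbours.

    feat: array of shape (H, W, C). Returns an array of the same shape.
    Illustrative only: dot-product attention, not the paper's graph attention.
    """
    H, W, C = feat.shape
    pad = k // 2
    # Replicate border pixels so every position has a full window.
    padded = np.pad(feat, ((pad, pad), (pad, pad), (0, 0)), mode="edge")
    # Gather neighbours by shifting the padded map: shape (H, W, k*k, C).
    neigh = np.stack(
        [padded[dy:dy + H, dx:dx + W] for dy in range(k) for dx in range(k)],
        axis=2,
    )
    # Similarity between each pixel and each of its neighbours: (H, W, k*k).
    scores = np.einsum("hwc,hwnc->hwn", feat, neigh) / np.sqrt(C)
    # Numerically stable softmax over the window.
    scores = scores - scores.max(axis=2, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=2, keepdims=True)
    # Residual update: original feature plus attention-weighted neighbours.
    return feat + np.einsum("hwn,hwnc->hwc", weights, neigh)
```

Because every window is processed with the same array operations, the whole map is updated in parallel, which is the property the abstract attributes to its sliding-window formulation.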

    Extracting structured information from 2D images

    Convolutional neural networks can handle an impressive array of supervised learning tasks while relying on a single backbone architecture, suggesting that one solution fits all vision problems. But for many tasks, we can directly make use of the problem structure within neural networks to deliver more accurate predictions. In this thesis, we propose novel deep learning components that exploit the structured output space of an increasingly complex set of problems. We start from Optical Character Recognition (OCR) in natural scenes and leverage the constraints imposed by the spatial outline of letters and by language requirements. Conventional OCR systems do not work well in natural scenes due to distortions, blur, or letter variability. We introduce a new attention-based model, equipped with extra information about the neuron positions to guide its focus across characters sequentially. It beats the previous state-of-the-art benchmark by a significant margin. We then turn to dense labeling tasks employing encoder-decoder architectures. We start with an experimental study that documents the drastic impact that decoder design can have on task performance. Rather than optimizing one decoder per task separately, we propose new robust layers for the upsampling of high-dimensional encodings. We show that these better suit the structured per-pixel output across all tasks. Finally, we turn to the problem of urban scene understanding. There is an elaborate structure in both the input space (multi-view recordings, aerial and street-view scenes) and the output space (multiple fine-grained attributes for holistic building understanding). We design new models that benefit from the relatively simple cuboid-like geometry of buildings to create a single unified representation from multiple views. To benchmark our model, we build a new large-scale multi-view dataset of building images and fine-grained attributes and show systematic improvements when compared to a broad range of strong CNN-based baselines.

    Person annotation in video sequences

    In recent years, the demand for video tools that automatically annotate and classify large audiovisual datasets has increased considerably. One specific task in this field applies to TV broadcast videos: determining who appears in a video sequence and when. This work builds on the ALBAYZIN evaluation series presented at IberSPEECH-RTVE 2018 in Barcelona; the purpose of this thesis is to improve on the results obtained there and to compare different face detection and tracking methods. We will evaluate the performance of classic face detection techniques and of techniques based on machine learning on a closed dataset of 34 known people. The remaining characters in the audiovisual document will be labelled as "unknown". We will work with short videos and images of each known character to build his/her model and, finally, evaluate the performance of the ALBAYZIN algorithm on a 2h video called "La noche en 24H", whose format resembles a news program. We will analyze the results, the types of errors and scenarios we encountered, and the solutions we propose for each of them, where one exists. In this work, we will focus only on monomodal face recognition and tracking.

    Real-time multiframe blind deconvolution of solar images

    The quality of images of the Sun obtained from the ground is severely limited by the perturbing effect of the Earth's turbulent atmosphere. The post-facto correction of the images to compensate for the presence of the atmosphere requires the combination of high-order adaptive optics techniques, fast measurements to freeze the turbulent atmosphere, and very time-consuming blind deconvolution algorithms. Under mild seeing conditions, blind deconvolution algorithms can produce images of astonishing quality. They can be very competitive with those obtained from space, with the huge advantage of the flexibility of the instrumentation thanks to the direct access to the telescope. In this contribution we leverage deep learning techniques to significantly accelerate the blind deconvolution process and produce corrected images at a peak rate of ~100 images per second. We present two different architectures that produce excellent image corrections with noise suppression while maintaining the photometric properties of the images. As a consequence, polarimetric signals can be obtained with standard polarimetric modulation without any significant artifact. With the expected improvements in computer hardware and algorithms, we anticipate that on-site real-time correction of solar images will be possible in the near future. Comment: 16 pages, 12 figures, accepted for publication in A&

    MOVIN: Real-time Motion Capture using a Single LiDAR

    Recent advancements in technology have brought forth new forms of interactive applications, such as the social metaverse, where end users interact with each other through their virtual avatars. In such applications, precise full-body tracking is essential for an immersive experience and a sense of embodiment with the virtual avatar. However, current motion capture systems are not easily accessible to end users due to their high cost, the requirement for special skills to operate them, or the discomfort associated with wearable devices. In this paper, we present MOVIN, a data-driven generative method for real-time motion capture with global tracking, using a single LiDAR sensor. Our autoregressive conditional variational autoencoder (CVAE) model learns the distribution of pose variations conditioned on the given 3D point cloud from LiDAR. As a central factor for high-accuracy motion capture, we propose a novel feature encoder to learn the correlation between the historical 3D point cloud data and global and local pose features, resulting in effective learning of the pose prior. Global pose features include root translation, rotation, and foot contacts, while local features comprise joint positions and rotations. Subsequently, a pose generator takes into account the sampled latent variable along with the features from the previous frame to generate a plausible current pose. Our framework accurately predicts the performer's 3D global information and local joint details while effectively considering temporally coherent movements across frames. We demonstrate the effectiveness of our architecture through quantitative and qualitative evaluations, comparing it against state-of-the-art methods. Additionally, we implement a real-time application to showcase our method in real-world scenarios. The MOVIN dataset is available at https://movin3d.github.io/movin_pg2023/

    Investigating Ensembles of Single-class Classifiers for Multi-class Classification

    Traditional methods of multi-class classification in machine learning involve the use of a monolithic feature extractor and classifier head trained on data from all of the classes at once. These architectures (especially the classifier head) are dependent on the number and types of classes, and are therefore rigid against changes to the class set. For best performance, one must retrain networks with these architectures from scratch, incurring a large cost in training time. In addition, these networks can be biased towards classes with a large imbalance in training data compared to other classes. Instead, ensembles of so-called "single-class" classifiers can be used for multi-class classification by training an individual network for each class. We show that these ensembles of single-class classifiers are more flexible to changes to the class set than traditional models, and can be quickly retrained to consider small changes to the class set, such as adding, removing, splitting, or fusing classes. We also show that these ensembles are less biased towards classes with large imbalances in their training data than traditional models. We also introduce a new, more powerful single-class classification architecture. These models are trained and tested on a plant disease dataset with high variance in the number of classes and amount of data in each class, as well as on an Alzheimer's dataset with low amounts of data and a large imbalance in data between classes.
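The flexibility argument in this abstract can be sketched in a few lines. The thesis trains one network per class; the stand-in below replaces each network with a hypothetical centroid-distance scorer, so `SingleClassModel`, `Ensemble`, and the toy two-dimensional features are assumptions for illustration only. What it shows is the structural property: adding or removing a class touches exactly one model, and the others are never retrained.

```python
import math

class SingleClassModel:
    """Hypothetical one-class scorer: higher score means closer to the class."""

    def __init__(self, samples):
        n = len(samples)
        # Per-dimension mean of the training samples for this class only.
        self.centroid = [sum(col) / n for col in zip(*samples)]

    def score(self, x):
        # Negative Euclidean distance to the class centroid.
        return -math.dist(x, self.centroid)

class Ensemble:
    """Multi-class classifier built from independent single-class models."""

    def __init__(self):
        self.models = {}

    def add_class(self, label, samples):
        # Only the new class is fitted; existing models stay untouched.
        self.models[label] = SingleClassModel(samples)

    def remove_class(self, label):
        del self.models[label]

    def predict(self, x):
        # The class whose model scores the input highest wins.
        return max(self.models, key=lambda label: self.models[label].score(x))

ens = Ensemble()
ens.add_class("healthy", [(0.0, 0.0), (0.2, 0.1)])
ens.add_class("rust", [(1.0, 1.0), (0.9, 1.1)])
```

Calling `ens.add_class("blight", ...)` later extends the label set without refitting "healthy" or "rust", in contrast to a monolithic classifier head whose output layer depends on the number of classes.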

    TextScanner: Reading Characters in Order for Robust Scene Text Recognition

    Driven by deep learning and the large volume of data, scene text recognition has evolved rapidly in recent years. Formerly, RNN-attention based methods dominated this field, but they suffer from the problem of attention drift in certain situations. Lately, semantic segmentation based algorithms have proven effective at recognizing text of different forms (horizontal, oriented and curved). However, these methods may produce spurious characters or miss genuine characters, as they rely heavily on a thresholding procedure operated on segmentation maps. To tackle these challenges, we propose in this paper an alternative approach, called TextScanner, for scene text recognition. TextScanner bears three characteristics: (1) Basically, it belongs to the semantic segmentation family, as it generates pixel-wise, multi-channel segmentation maps for character class, position and order; (2) Meanwhile, akin to RNN-attention based methods, it also adopts RNN for context modeling; (3) Moreover, it performs parallel prediction for character position and class, and ensures that characters are transcribed in the correct order. The experiments on standard benchmark datasets demonstrate that TextScanner outperforms the state-of-the-art methods. Moreover, TextScanner shows its superiority in recognizing more difficult text such as Chinese transcripts and aligning with target characters. Comment: Accepted by AAAI-202