10 research outputs found

    EvalCrafter: Benchmarking and Evaluating Large Video Generation Models

    Vision and language generative models have grown rapidly in recent years. For video generation, various open-source models and publicly available services have been released for generating videos of high visual quality. However, these methods are often evaluated with only a few academic metrics, for example FVD or IS. We argue that it is hard to judge large conditional generative models from such simple metrics, since these models are often trained on very large datasets and have multi-aspect abilities. Thus, we propose a new framework and pipeline to exhaustively evaluate the performance of the generated videos. To achieve this, we first construct a new prompt list for text-to-video generation by analyzing real-world prompts with the help of a large language model. Then, we evaluate state-of-the-art video generative models on our carefully designed benchmark in terms of visual quality, content quality, motion quality, and text-caption alignment, using around 18 objective metrics. To obtain the final leaderboard of the models, we also fit a series of coefficients to align the objective metrics with users' opinions. Based on the proposed opinion alignment method, our final score shows a higher correlation with users' opinions than simply averaging the metrics, demonstrating the effectiveness of the proposed evaluation method.
    Comment: Technical Report. Project page: https://evalcrafter.github.io
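The opinion-alignment step in the EvalCrafter abstract (fitting coefficients so that objective metrics track users' opinions) can be illustrated with a small sketch. The abstract does not state the fitting procedure, so ordinary least squares is an assumption here, and the names `fit_alignment` and `aligned_score` are hypothetical.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix, so row operations apply to both sides at once.
    aug = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - tail) / aug[i][i]
    return x


def fit_alignment(metric_rows, user_scores):
    """Least-squares fit of user_scores ~ w . metrics + bias (assumed model)."""
    rows = [list(r) + [1.0] for r in metric_rows]  # append a bias column
    k = len(rows[0])
    # Normal equations: (X^T X) w = X^T y
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(rows, user_scores)) for i in range(k)]
    return solve_linear_system(xtx, xty)


def aligned_score(weights, metrics):
    """Weighted sum of the metrics plus the bias stored in weights[-1]."""
    return sum(w * m for w, m in zip(weights, metrics)) + weights[-1]


# Toy data: two metrics per video and a synthetic "user opinion" score.
metric_rows = [[0.2, 0.9], [0.5, 0.4], [0.8, 0.7], [0.3, 0.1], [0.9, 0.2]]
user_scores = [2.0 * m1 + 0.5 * m2 + 1.0 for m1, m2 in metric_rows]
weights = fit_alignment(metric_rows, user_scores)
```

A plain average would weight every metric equally; the fitted weights instead let metrics that track user judgments more closely count for more in the final score.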

    Seq-UPS: Sequential Uncertainty-aware Pseudo-label Selection for Semi-Supervised Text Recognition

    This paper looks at semi-supervised learning (SSL) for image-based text recognition. One of the most popular SSL approaches is pseudo-labeling (PL). PL approaches assign labels to unlabeled data before re-training the model with a combination of labeled and pseudo-labeled data. However, PL methods are severely degraded by noise and are prone to over-fitting to noisy labels: poorly calibrated models generate erroneous high-confidence pseudo-labels, which renders threshold-based selection ineffective. Moreover, the combinatorial complexity of the hypothesis space and the error accumulation from multiple incorrect autoregressive steps make pseudo-labeling challenging for sequence models. To this end, we propose a pseudo-label generation and uncertainty-based data selection framework for semi-supervised text recognition. We first use beam-search inference to yield highly probable hypotheses and assign pseudo-labels to the unlabeled examples. Then we adopt an ensemble of models, sampled by applying dropout, to obtain a robust estimate of the uncertainty associated with each prediction, considering both the character-level and word-level predictive distributions to select good-quality pseudo-labels. Extensive experiments on several benchmark handwriting and scene-text datasets show that our method outperforms the baseline approaches and the previous state-of-the-art semi-supervised text-recognition methods.
    Comment: Accepted at WACV 202
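The selection idea in the Seq-UPS abstract (keep a pseudo-label only when dropout-sampled models agree at both the character and the word level) might look roughly like the sketch below. The abstract does not give the exact uncertainty measure, so the standard deviation across passes is an assumption, and the function name and all thresholds are hypothetical.

```python
import math
from statistics import mean, pstdev


def select_pseudo_label(passes, conf_min=0.5, char_std_max=0.1, word_std_max=0.1):
    """Decide whether to keep one pseudo-labeled example.

    `passes` holds one list per stochastic (dropout) forward pass; each list
    contains the probability that pass assigns to every character of the same
    beam-search hypothesis. The thresholds are illustrative assumptions.
    """
    # Character level: spread of each character's probability across passes.
    char_stds = [pstdev(position) for position in zip(*passes)]
    # Word level: probability of the whole hypothesis in each pass.
    word_probs = [math.prod(p) for p in passes]
    word_conf = mean(word_probs)
    word_std = pstdev(word_probs)
    return (
        word_conf >= conf_min
        and max(char_stds) <= char_std_max
        and word_std <= word_std_max
    )


# Three dropout passes over a three-character hypothesis.
stable = [[0.95, 0.90, 0.97], [0.94, 0.92, 0.96], [0.96, 0.91, 0.95]]
unstable = [[0.95, 0.90, 0.97], [0.40, 0.92, 0.96], [0.96, 0.30, 0.95]]
keep_stable = select_pseudo_label(stable)
keep_unstable = select_pseudo_label(unstable)
```

The point of checking the spread rather than a single confidence value is that a poorly calibrated model can be confidently wrong in one pass, while disagreement across passes exposes the uncertainty.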

    HRS-Bench: Holistic, Reliable and Scalable Benchmark for Text-to-Image Models

    In recent years, Text-to-Image (T2I) models have been extensively studied, especially with the emergence of diffusion models that achieve state-of-the-art results on T2I synthesis tasks. However, existing benchmarks heavily rely on subjective human evaluation, limiting their ability to holistically assess a model's capabilities. Furthermore, there is a significant gap between efforts in developing new T2I architectures and those in evaluation. To address this, we introduce HRS-Bench, a concrete evaluation benchmark for T2I models that is Holistic, Reliable, and Scalable. Unlike existing benchmarks that focus on limited aspects, HRS-Bench measures 13 skills that can be categorized into five major categories: accuracy, robustness, generalization, fairness, and bias. In addition, HRS-Bench covers 50 scenarios, including fashion, animals, transportation, food, and clothes. We evaluate nine recent large-scale T2I models using metrics that cover a wide range of skills. To probe the effectiveness of HRS-Bench, we conducted a human evaluation, which aligned with 95% of our evaluations on average. Our experiments demonstrate that existing models often struggle to generate images with the desired count of objects, visual text, or grounded emotions. We hope that our benchmark helps ease future text-to-image generation research. The code and data are available at https://eslambakr.github.io/hrsbench.github.i

    Text Detection and Recognition in the Wild

    Text detection and recognition (TDR) in highly structured environments with a clean background and consistent fonts (e.g., office documents, postal addresses, and bank cheques) is a well-understood problem (i.e., OCR); however, this is not the case for unstructured environments. The main objective of scene text detection is to locate text within images captured in the wild. For scene text recognition, the techniques map each detected or cropped word image into a string. Nowadays, deep learning architectures based on convolutional neural networks (CNNs) and recurrent neural networks (RNNs) dominate most of the recent state-of-the-art (SOTA) scene TDR methods. Most of the reported accuracies of current SOTA TDR methods are in the range of 80% to 90% on benchmark datasets with regular and clear text instances. However, these detection and recognition results deteriorate drastically, by 10% and 30% in terms of detection F-measure and word recognition accuracy respectively, on irregular or occluded text images. Transformers and their variations are new deep learning architectures that mitigate the above-mentioned issues for CNN- and RNN-based pipelines. Unlike RNNs, transformers are models that learn how to encode and decode data by looking not only backward but also forward in order to extract relevant information from a whole sequence. This thesis utilizes the transformer architecture to address the challenges of irregular (multi-oriented and arbitrarily shaped) and occluded text in images in the wild.
Our main contributions are as follows: (1) We first target irregular TDR with two separate architectures. In Chapter 4, unlike SOTA text detection frameworks that have complex pipelines and use many hand-designed components and post-processing stages, we design a conceptually more straightforward, end-to-end trainable transformer-based detector for multi-oriented scene text detection, which directly predicts the set of detections (i.e., text and box regions) of the input image. A central contribution of our work is a loss function tailored to the rotated text detection problem that leverages a rotated version of the generalized intersection over union score to capture rotated text instances adequately. In Chapter 5, we extend our previous architecture to arbitrarily shaped scene text detection. We design a new text detection technique that aims to better infer the n vertices of a polygon or the degree of a Bezier curve to represent irregular text instances. We also propose a loss function that combines a generalized-split-intersection-over-union loss defined over the piece-wise polygons. In Chapter 6, we show that our transformer-based architecture, without rectifying the input curved text instances, is more suitable than SOTA RNN-based frameworks equipped with rectification modules for irregular text recognition in images in the wild. Our main contribution in this chapter is leveraging a 2D Learnable Sinusoidal frequencies Positional Encoding (2LSPE) with a modified feed-forward neural network to better encode the 2D spatial dependencies of characters in irregular text instances. (2) Since TDR tasks encounter the same challenging problems (e.g., irregular text, illumination variations, low-resolution text, etc.), we present a new transformer model that can detect and recognize individual characters of text instances in an end-to-end manner.
Reading individual characters then yields a text spotting model that is robust to occlusion and arbitrarily shaped text, without needing polygon annotations or the multiple stages of detection and recognition modules used in SOTA text spotting architectures. In Chapter 7, unlike SOTA methods that combine two different pipelines of detection and recognition modules for complete text reading, we build on our text detection framework and leverage a recent transformer-based technique, namely the Deformable Patch-based Transformer (DPT), as a feature-extracting backbone to robustly read the class and box coordinates of irregular characters in images in the wild. (3) Finally, we address the occlusion problem by using a multi-task end-to-end scene text spotting framework. In Chapter 8, we leverage a recent transformer-based framework in deep learning, namely the Masked Autoencoder (MAE), as a backbone for scene text recognition and end-to-end scene text spotting pipelines to overcome the partial occlusion limitation. We design a new multitask end-to-end transformer network that directly outputs characters, word instances, and their bounding box representations, saving computational overhead as it eliminates multiple processing steps. The proposed unified framework can also detect and recognize arbitrarily shaped text instances without using polygon annotations.
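For readers unfamiliar with the generalized intersection over union (GIoU) score that the thesis abstract builds its rotated-text loss on, here is a minimal sketch for axis-aligned boxes only. This is a simplification: the rotated and split-polygon variants described in the abstract require polygon intersection and are not reproduced here. The box layout `(x_min, y_min, x_max, y_max)` and the function names are assumptions for illustration.

```python
def giou(box_a, box_b):
    """Generalized IoU of two axis-aligned boxes (x_min, y_min, x_max, y_max)."""
    ax0, ay0, ax1, ay1 = box_a
    bx0, by0, bx1, by1 = box_b
    area_a = (ax1 - ax0) * (ay1 - ay0)
    area_b = (bx1 - bx0) * (by1 - by0)
    # Overlap rectangle; width or height clamps to zero when boxes are disjoint.
    inter_w = max(0.0, min(ax1, bx1) - max(ax0, bx0))
    inter_h = max(0.0, min(ay1, by1) - max(ay0, by0))
    inter = inter_w * inter_h
    union = area_a + area_b - inter
    # Smallest axis-aligned box enclosing both inputs.
    enclose = (max(ax1, bx1) - min(ax0, bx0)) * (max(ay1, by1) - min(ay0, by0))
    # IoU minus the fraction of the enclosing box not covered by the union.
    return inter / union - (enclose - union) / enclose


def giou_loss(box_a, box_b):
    """Loss that is 0 for identical boxes and grows as boxes move apart."""
    return 1.0 - giou(box_a, box_b)


same = giou((0.0, 0.0, 2.0, 2.0), (0.0, 0.0, 2.0, 2.0))
overlapping = giou((0.0, 0.0, 2.0, 2.0), (1.0, 0.0, 3.0, 2.0))
disjoint = giou((0.0, 0.0, 1.0, 1.0), (2.0, 0.0, 3.0, 1.0))
```

Unlike plain IoU, which is zero for every pair of non-overlapping boxes, GIoU stays informative when the boxes are disjoint, which is why it is commonly used as a regression loss for detectors.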