16,740 research outputs found

    Unconstrained Scene Text and Video Text Recognition for Arabic Script

    Building robust recognizers for Arabic has always been challenging. We demonstrate the effectiveness of an end-to-end trainable CNN-RNN hybrid architecture in recognizing Arabic text in videos and natural scenes. We outperform the previous state of the art on two publicly available video text datasets, ALIF and ACTIV. For the scene text recognition task, we introduce a new Arabic scene text dataset and establish baseline results. For scripts like Arabic, a major challenge in developing robust recognizers is the lack of large quantities of annotated data. We overcome this by synthesising millions of Arabic text images from a large vocabulary of Arabic words and phrases. Our implementation is built on top of the model introduced in [37], which has proven quite effective for English scene text recognition. The model follows a segmentation-free, sequence-to-sequence transcription approach: the network transcribes a sequence of convolutional features from the input image into a sequence of target labels. This does away with the need to segment the input image into constituent characters/glyphs, which is often difficult for Arabic script. Further, the ability of RNNs to model contextual dependencies yields superior recognition results. Comment: 5 pages
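
    As a rough illustration of the segmentation-free, sequence-to-sequence transcription idea described above (columns of convolutional features fed to a recurrent network that emits a label sequence, typically trained with a CTC loss), here is a minimal PyTorch sketch. The layer sizes, class name, and the CTC pairing are assumptions for illustration, not the authors' exact architecture.

        import torch
        import torch.nn as nn

        class CRNN(nn.Module):
            """Image-to-sequence transcriber: CNN features -> BiLSTM -> per-column label scores."""
            def __init__(self, num_classes, img_height=32):
                super().__init__()
                self.cnn = nn.Sequential(
                    nn.Conv2d(1, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2, 2),    # H/2
                    nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2, 2),  # H/4
                    nn.Conv2d(128, 256, 3, padding=1), nn.ReLU(),
                    nn.MaxPool2d((2, 1), (2, 1)),                                     # halve height only
                )
                feat_h = img_height // 8
                self.rnn = nn.LSTM(256 * feat_h, 256, num_layers=2,
                                   bidirectional=True, batch_first=True)
                self.fc = nn.Linear(512, num_classes)  # num_classes includes the CTC blank

            def forward(self, images):                             # images: (B, 1, H, W)
                f = self.cnn(images)                               # (B, C, H', W')
                b, c, h, w = f.shape
                seq = f.permute(0, 3, 1, 2).reshape(b, w, c * h)   # one feature vector per image column
                out, _ = self.rnn(seq)
                return self.fc(out).log_softmax(-1)                # (B, W', num_classes), ready for nn.CTCLoss

    Training would pair this with nn.CTCLoss on word images (e.g. the synthetic Arabic data mentioned above) and their label sequences; no character segmentation is needed.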

    Improving Unsupervised Defect Segmentation by Applying Structural Similarity to Autoencoders

    Convolutional autoencoders have emerged as popular methods for unsupervised defect segmentation on image data. Most commonly, this task is performed by thresholding a pixel-wise reconstruction error based on an ℓ^p distance. This procedure, however, leads to large residuals whenever the reconstruction exhibits slight localization inaccuracies around edges. It also fails to reveal defective regions that have been visually altered while intensity values stay roughly consistent. We show that these problems prevent such approaches from being applied to complex real-world scenarios and that they cannot be easily avoided by employing more elaborate architectures such as variational or feature-matching autoencoders. We propose to use a perceptual loss function based on structural similarity, which examines inter-dependencies between local image regions, taking into account luminance, contrast, and structural information, instead of simply comparing single pixel values. It achieves significant performance gains over state-of-the-art approaches for unsupervised defect segmentation that use pixel-wise reconstruction error metrics, on a challenging real-world dataset of nanofibrous materials and a novel dataset of two woven fabrics.
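
    A hedged sketch of the core idea, replacing a per-pixel ℓ^p reconstruction error with a windowed structural-similarity (SSIM) comparison between input and reconstruction: the snippet below computes a simplified SSIM map with a uniform averaging window. The window size and the constants C1, C2 are conventional defaults, not values taken from the paper.

        import torch
        import torch.nn.functional as F

        def ssim(x, y, window_size=11, C1=0.01 ** 2, C2=0.03 ** 2):
            """Simplified SSIM map between two image batches in [0, 1], shape (B, C, H, W)."""
            c = x.shape[1]
            window = torch.ones(c, 1, window_size, window_size, device=x.device) / window_size ** 2
            pad = window_size // 2
            mu_x = F.conv2d(x, window, padding=pad, groups=c)      # local means
            mu_y = F.conv2d(y, window, padding=pad, groups=c)
            var_x = F.conv2d(x * x, window, padding=pad, groups=c) - mu_x ** 2   # local variances
            var_y = F.conv2d(y * y, window, padding=pad, groups=c) - mu_y ** 2
            cov_xy = F.conv2d(x * y, window, padding=pad, groups=c) - mu_x * mu_y
            return ((2 * mu_x * mu_y + C1) * (2 * cov_xy + C2)) / \
                   ((mu_x ** 2 + mu_y ** 2 + C1) * (var_x + var_y + C2))

        # training loss: minimise 1 - mean SSIM between input x and reconstruction x_hat
        # loss = 1.0 - ssim(x, autoencoder(x)).mean()
        # segmentation: threshold the per-pixel residual map 1 - ssim(x, x_hat)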

    Action Recognition in Videos: from Motion Capture Labs to the Web

    This paper presents a survey of human action recognition approaches based on visual data recorded from a single video camera. We propose an organizing framework which highlights the evolution of the area, with techniques moving from heavily constrained motion capture scenarios towards more challenging, realistic, "in the wild" videos. The proposed organization is based on the representation used as input for the recognition task, emphasizing the hypotheses assumed and thus the constraints imposed on the type of video that each technique is able to address. Making the hypotheses and constraints explicit renders the framework particularly useful for selecting a method for a given application. Another advantage of the proposed organization is that it allows the newest approaches to be categorized seamlessly alongside traditional ones, while providing an insightful perspective on the evolution of the action recognition task up to now. That perspective is the basis for the discussion at the end of the paper, where we also present the main open issues in the area. Comment: Preprint submitted to CVIU, survey paper, 46 pages, 2 figures, 4 tables

    Autonomous Cleaning of Corrupted Scanned Documents - A Generative Modeling Approach

    We study the task of cleaning scanned text documents that are strongly corrupted by dirt such as manual line strokes, spilled ink, etc. We aim at autonomously removing dirt from a single letter-size page based only on the information the page contains. Our approach, therefore, has to learn character representations without supervision and requires a mechanism to distinguish learned representations from irregular patterns. To learn character representations, we use a probabilistic generative model parameterizing pattern features, feature variances, the features' planar arrangements, and pattern frequencies. The latent variables of the model describe pattern class, pattern position, and the presence or absence of individual pattern features. The model parameters are optimized using a novel variational EM approximation. After learning, the parameters represent, independently of their absolute position, planar feature arrangements and their variances. A quality measure defined on the learned representation then allows for an autonomous discrimination between regular character patterns and the irregular patterns making up the dirt. The irregular patterns can thus be removed to clean the document. For a full Latin alphabet we found that a single page does not contain sufficiently many character examples. However, we show that a page containing a smaller number of character types can be cleaned efficiently and autonomously, even if heavily corrupted by dirt, solely based on the structural regularity of the characters it contains. In different examples using characters from different alphabets, we demonstrate the generality of the approach and discuss its implications for future developments. Comment: oral presentation and Google Student Travel Award; IEEE conference on Computer Vision and Pattern Recognition 201
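
    The generative model and the variational EM procedure are not specified here in enough detail to reproduce, but the final cleaning step the abstract describes, scoring each detected pattern with a quality measure derived from the learned representation and discarding low-scoring (irregular) patterns, can be sketched as follows. The function name, the blank-page reconstruction, and the thresholding rule are hypothetical illustrations, not the authors' implementation.

        import numpy as np

        def clean_page(page, patches, positions, quality_fn, threshold):
            """Keep only patches the learned model scores as regular character patterns.

            page       : 2-D array of the scanned page, values in [0, 1] (1 = white background)
            patches    : list of 2-D arrays, candidate patterns cut from the page
            positions  : list of (row, col) top-left coordinates of each patch
            quality_fn : hypothetical scalar quality measure from the learned representation
            threshold  : patterns scoring below this are treated as dirt and dropped
            """
            cleaned = np.ones_like(page)                        # start from a blank page
            for patch, (r, c) in zip(patches, positions):
                if quality_fn(patch) >= threshold:              # regular character: copy it back
                    h, w = patch.shape
                    cleaned[r:r + h, c:c + w] = np.minimum(cleaned[r:r + h, c:c + w], patch)
                # irregular patterns (dirt) are simply never copied over
            return cleaned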

    Boundary, Brightness, and Depth Interactions During Preattentive Representation and Attentive Recognition of Figure and Ground

    This article applies a recent theory of 3-D biological vision, called FACADE Theory, to explain several percepts which Kanizsa pioneered. These include 3-D pop-out of an occluding form in front of an occluded form, leading to completion and recognition of the occluded form; 3-D transparent and opaque percepts of Kanizsa squares, with and without Varin wedges; and interactions between percepts of illusory contours, brightness, and depth in response to 2-D Kanizsa images. These explanations clarify how a partially occluded object representation can be completed for purposes of object recognition, without the completed part of the representation necessarily being seen. The theory traces these percepts to neural mechanisms that compensate for measurement uncertainty and complementarity at individual cortical processing stages by using parallel and hierarchical interactions among several cortical processing stages. These interactions are modelled by a Boundary Contour System (BCS) that generates emergent boundary segmentations and a complementary Feature Contour System (FCS) that fills in surface representations of brightness, color, and depth. The BCS and FCS interact reciprocally with an Object Recognition System (ORS) that binds BCS boundary and FCS surface representations into attentive object representations. The BCS models the parvocellular LGN→Interblob→Interstripe→V4 cortical processing stream, the FCS models the parvocellular LGN→Blob→Thin Stripe→V4 cortical processing stream, and the ORS models inferotemporal cortex. Air Force Office of Scientific Research (F49620-92-J-0499); Defense Advanced Research Projects Agency (N00014-92-J-4015); Office of Naval Research (N00014-91-J-4100)

    BodyNet: Volumetric Inference of 3D Human Body Shapes

    Human shape estimation is an important task for video editing, animation, and the fashion industry. Predicting 3D human body shape from natural images, however, is highly challenging due to factors such as variation in human bodies, clothing, and viewpoint. Prior methods addressing this problem typically attempt to fit parametric body models with certain priors on pose and shape. In this work we argue for an alternative representation and propose BodyNet, a neural network for direct inference of volumetric body shape from a single image. BodyNet is an end-to-end trainable network that benefits from (i) a volumetric 3D loss, (ii) a multi-view re-projection loss, and (iii) intermediate supervision of 2D pose, 2D body part segmentation, and 3D pose. Each of these yields a performance improvement, as demonstrated by our experiments. To evaluate the method, we fit the SMPL model to our network output and show state-of-the-art results on the SURREAL and Unite the People datasets, outperforming recent approaches. Besides achieving state-of-the-art performance, our method also enables volumetric body-part segmentation. Comment: Appears in: European Conference on Computer Vision 2018 (ECCV 2018). 27 pages
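
    As a rough sketch of how the three supervision signals listed in the abstract might be combined during training (not the released BodyNet code), the snippet below sums a voxel-occupancy loss, re-projection losses obtained by max-projecting the predicted voxel grid onto two orthographic views, and auxiliary 2D-pose, part-segmentation, and 3D-pose terms. The loss weights, tensor layouts, and choice of individual loss functions are assumptions.

        import torch
        import torch.nn.functional as F

        def combined_loss(voxel_logits, gt_voxels,           # (B, D, H, W) occupancy logits and {0,1} targets
                          gt_front_sil, gt_side_sil,         # binary silhouettes matching the two projections
                          pose2d, gt_pose2d,                 # 2D-pose heatmaps
                          seg_logits, gt_seg,                # (B, K, H, W) part logits, (B, H, W) labels
                          pose3d, gt_pose3d,                 # 3D-pose targets
                          w_vox=1.0, w_proj=0.1, w_aux=0.1):
            probs = torch.sigmoid(voxel_logits)
            # (i) volumetric 3D loss on voxel occupancy
            loss_vox = F.binary_cross_entropy_with_logits(voxel_logits, gt_voxels)
            # (ii) multi-view re-projection loss: orthographic max-projections vs. silhouettes
            front = probs.max(dim=1).values                  # project along depth
            side = probs.max(dim=3).values                   # project along width
            loss_proj = F.binary_cross_entropy(front, gt_front_sil) + \
                        F.binary_cross_entropy(side, gt_side_sil)
            # (iii) intermediate supervision on 2D pose, body-part segmentation, and 3D pose
            loss_aux = F.mse_loss(pose2d, gt_pose2d) + \
                       F.cross_entropy(seg_logits, gt_seg) + \
                       F.mse_loss(pose3d, gt_pose3d)
            return w_vox * loss_vox + w_proj * loss_proj + w_aux * loss_aux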