31 research outputs found

    Lipreading with Long Short-Term Memory

    Lipreading, i.e. speech recognition from visual-only recordings of a speaker's face, can be achieved with a processing pipeline based solely on neural networks, yielding significantly better accuracy than conventional methods. Feed-forward and recurrent neural network layers (namely Long Short-Term Memory; LSTM) are stacked to form a single structure which is trained by back-propagating error gradients through all the layers. The performance of such a stacked network was experimentally evaluated and compared to a standard Support Vector Machine classifier using conventional computer vision features (Eigenlips and Histograms of Oriented Gradients). The evaluation was performed on data from 19 speakers of the publicly available GRID corpus. With 51 different words to classify, we report a best word accuracy on held-out evaluation speakers of 79.6% using the end-to-end neural network-based solution (11.6% improvement over the best feature-based solution evaluated). Comment: Accepted for publication at ICASSP 201
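
The stacking of a feed-forward layer and an LSTM layer described in this abstract can be illustrated with a minimal NumPy forward pass. This is only a sketch under stated assumptions, not the authors' network: the layer sizes, the random untrained weights, the per-frame feature length, and the choice to classify from the final LSTM state are all hypothetical. Only the count of 51 word classes comes from the abstract.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def init_params(n_in, n_ff, n_lstm, n_classes, seed=0):
    """Random, untrained weights (assumption: sizes are illustrative only)."""
    rng = np.random.default_rng(seed)
    s = 0.1
    return {
        "W_ff": rng.normal(0, s, (n_ff, n_in)),
        "b_ff": np.zeros(n_ff),
        # One matrix holds all four LSTM gates: input, forget, candidate, output.
        "W_lstm": rng.normal(0, s, (4 * n_lstm, n_ff + n_lstm)),
        "b_lstm": np.zeros(4 * n_lstm),
        "W_out": rng.normal(0, s, (n_classes, n_lstm)),
        "b_out": np.zeros(n_classes),
    }

def forward(frames, p):
    """Feed-forward layer per frame, then an LSTM over time, then softmax."""
    n_lstm = p["W_out"].shape[1]
    h = np.zeros(n_lstm)  # LSTM hidden state
    c = np.zeros(n_lstm)  # LSTM cell state
    for x in frames:
        z = np.tanh(p["W_ff"] @ x + p["b_ff"])  # feed-forward layer
        gates = p["W_lstm"] @ np.concatenate([z, h]) + p["b_lstm"]
        i, f, g, o = np.split(gates, 4)
        c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
        h = sigmoid(o) * np.tanh(c)
    # Assumption: the word is classified from the final hidden state.
    logits = p["W_out"] @ h + p["b_out"]
    e = np.exp(logits - logits.max())
    return e / e.sum()

# Hypothetical input: 20 frames, each flattened to 64 features.
params = init_params(n_in=64, n_ff=32, n_lstm=16, n_classes=51)
frames = np.random.default_rng(1).normal(size=(20, 64))
probs = forward(frames, params)
```

Because every operation in `forward` is differentiable, error gradients could be propagated from the softmax output back through the LSTM and the feed-forward layer, which is the sense in which the stack forms a single trainable structure.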

    Combining Residual Networks with LSTMs for Lipreading

    We propose an end-to-end deep learning architecture for word-level visual speech recognition. The system is a combination of spatiotemporal convolutional, residual and bidirectional Long Short-Term Memory networks. We train and evaluate it on the Lipreading In-The-Wild benchmark, a challenging database of 500 target words consisting of 1.28 s video excerpts from BBC TV broadcasts. The proposed network attains word accuracy equal to 83.0%, yielding a 6.8% absolute improvement over the current state-of-the-art, without using information about word boundaries during training or testing. Comment: Submitted to Interspeech 201

    Fast eigenspace decomposition of correlated images

    Includes bibliographical references. We present a computationally efficient algorithm for the eigenspace decomposition of correlated images. Our approach is motivated by the fact that for a planar rotation of a two-dimensional image, analytical expressions can be given for the eigendecomposition, based on the theory of circulant matrices. These analytical expressions turn out to be good first approximations of the eigendecomposition, even for three-dimensional objects rotated about a single axis. We use this observation to automatically determine the dimension of the subspace required to represent an image with a guaranteed user-specified accuracy, as well as to quickly compute a basis for the subspace. Examples show that the algorithm performs very well on a range of test images composed of three-dimensional objects rotated about a single axis. This work was supported by the Sze Tsao Chang Memorial Engineering Fund and by the Office of Naval Research under contract no. N00014-97-1-0540
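
The circulant-matrix property this abstract relies on can be sketched in a few lines: the eigenvalues of a circulant matrix are the discrete Fourier transform of its first column, and its eigenvectors are the Fourier basis vectors, so no iterative eigensolver is needed. The sketch below is an illustration of that standard result, not the paper's algorithm. The function names, the toy 4x4 matrix, and the cumulative-eigenvalue criterion used to pick the subspace dimension are assumptions made for the example.

```python
import numpy as np

def build_circulant(first_col):
    """Circulant matrix C with C[i, j] = c[(i - j) mod n]."""
    c = np.asarray(first_col, dtype=float)
    k = np.arange(c.size)
    return c[(k[:, None] - k[None, :]) % c.size]

def circulant_eigen(first_col):
    """Closed-form eigendecomposition of a circulant matrix."""
    c = np.asarray(first_col, dtype=float)
    n = c.size
    # Eigenvalues are the DFT of the first column.
    eigvals = np.fft.fft(c)
    # Eigenvectors are the normalized Fourier basis vectors (one per column).
    k = np.arange(n)
    eigvecs = np.exp(2j * np.pi * np.outer(k, k) / n) / np.sqrt(n)
    return eigvals, eigvecs

def subspace_dimension(eigvals, accuracy):
    """Smallest k whose largest k eigenvalues reach the requested fraction
    of the eigenvalue sum (assumed accuracy criterion, for illustration)."""
    lam = np.sort(np.abs(eigvals))[::-1]
    ratio = np.cumsum(lam) / lam.sum()
    k = int(np.searchsorted(ratio, accuracy)) + 1
    return min(k, lam.size)

# Toy symmetric circulant matrix standing in for an image correlation matrix.
c = [4.0, 1.0, 0.0, 1.0]
C = build_circulant(c)
vals, vecs = circulant_eigen(c)
k = subspace_dimension(vals, accuracy=0.6)
```

For a real image set the correlation matrix is only approximately circulant, which is why the abstract describes the analytical expressions as first approximations rather than exact results.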

    Combining residual networks with LSTMs for lipreading

    We propose an end-to-end deep learning architecture for word-level visual speech recognition. The system is a combination of spatiotemporal convolutional, residual and bidirectional Long Short-Term Memory networks. We trained and evaluated it on the Lipreading In-The-Wild benchmark, a challenging database with a 500-word vocabulary consisting of video excerpts from BBC TV broadcasts. The proposed network attains word accuracy equal to 83.0%, yielding a 6.8% absolute improvement over the current state-of-the-art.

    Analysis of eigendecomposition for sets of correlated images at different resolutions

    Includes bibliographical references. Eigendecomposition is a common technique that is performed on sets of correlated images in a number of computer vision and robotics applications. Unfortunately, the computation of an eigendecomposition can become prohibitively expensive when dealing with very high resolution images. While reducing the resolution of the images will reduce the computational expense, it is not known how this will affect the quality of the resulting eigendecomposition. The work presented here gives the theoretical background for quantifying the effects of varying the resolution of images on the eigendecomposition that is computed from those images. A computationally efficient algorithm for this eigendecomposition is proposed using derived analytical expressions. Examples show that this algorithm performs very well on arbitrary video sequences. This work was supported by the National Imagery and Mapping Agency under contract no. NMA201-00-1-1003 and through collaborative participation in the Robotics Consortium sponsored by the U.S. Army Research Laboratory under the Collaborative Technology Alliance Program, Cooperative Agreement DAAD19-01-2-0012

    Fast eigenspace decomposition of correlated images

    Includes bibliographical references. We present a computationally efficient algorithm for the eigenspace decomposition of correlated images. Our approach is motivated by the fact that for a planar rotation of a two-dimensional (2-D) image, analytical expressions can be given for the eigendecomposition, based on the theory of circulant matrices. These analytical expressions turn out to be good first approximations of the eigendecomposition, even for three-dimensional (3-D) objects rotated about a single axis. In addition, the theory of circulant matrices yields good approximations to the eigendecomposition for images that result when objects are translated and scaled. We use these observations to automatically determine the dimension of the subspace required to represent an image with a guaranteed user-specified accuracy, as well as to quickly compute a basis for the subspace. Examples show that the algorithm performs very well on a number of test cases ranging from images of 3-D objects rotated about a single axis to arbitrary video sequences. This work was supported by the Sze Tsao Chang Memorial Engineering Fund, the National Imagery and Mapping Agency under Contract NMA201-00-1-1003, and by the Office of Naval Research under Contract N00014-97-1-0640

    Quadtree-based eigendecomposition for pose estimation in the presence of occlusion and background clutter

    Includes bibliographical references (pages 29-30). Eigendecomposition-based techniques are popular for a number of computer vision problems, e.g., object and pose estimation, because they are purely appearance based and they require few on-line computations. Unfortunately, they also typically require an unobstructed view of the object whose pose is being detected. The presence of occlusion and background clutter precludes the use of the normalizations that are typically applied and significantly alters the appearance of the object under detection. This work presents an algorithm that is based on applying eigendecomposition to a quadtree representation of the image dataset used to describe the appearance of an object. This allows decisions concerning the pose of an object to be based on only those portions of the image in which the algorithm has determined that the object is not occluded. The accuracy and computational efficiency of the proposed approach are evaluated on 16 different objects with up to 50% of the object being occluded and on images of ships in a dockyard.

    Automatic Visual Speech Recognition

    Intelligent Systems. Electrical Engineering, Mathematics and Computer Scienc