Search CORE

7 research outputs found

Articulatory feature-based methods for acoustic and audio-visual speech recognition: Summary from the 2006 JHU Summer Workshop.

Author: Bartels Chris
Bezman Ari
Borges Nash
Dawson-Haggerty Stephen
Frankel Joe
Hasegawa-Johnson Mark
Kantor Arthur
King Simon
Lal Partha
Livescu Karen
Magimai-Doss Mathew
Saenko Kate
Woods Bronwyn
Yung Lisa
Çetin Ozgur
Publication venue
Publication date: 01/01/2007
Field of study

We report on investigations, conducted at the 2006 Johns HopkinsWorkshop, into the use of articulatory features (AFs) for observation and pronunciation models in speech recognition. In the area of observation modeling, we use the outputs of AF classiers both directly, in an extension of hybrid HMM/neural network models, and as part of the observation vector, an extension of the tandem approach. In the area of pronunciation modeling, we investigate a model having multiple streams of AF states with soft synchrony constraints, for both audio-only and audio-visual recognition. The models are implemented as dynamic Bayesian networks, and tested on tasks from the Small-Vocabulary Switchboard (SVitchboard) corpus and the CUAVE audio-visual digits corpus. Finally, we analyze AF classication and forced alignment using a newly collected set of feature-level manual transcriptions

CiteSeerX

Reconstruction of intelligible audio speech from visual speech information

Author: Le Cornu Thomas
Publication venue
Publication date: 01/11/2016
Field of study

The aim of the work conducted in this thesis is to reconstruct audio speech signals using information which can be extracted solely from a visual stream of a speaker's face, with application for surveillance scenarios and silent speech interfaces. Visual speech is limited to that which can be seen of the mouth, lips, teeth, and tongue, where the visual articulators convey considerably less information than in the audio domain, leading to the task being difficult. Accordingly, the emphasis is on the reconstruction of intelligible speech, with less regard given to quality. A speech production model is used to reconstruct audio speech, where methods are presented in this work for generating or estimating the necessary parameters for the model. Three approaches are explored for producing spectral-envelope estimates from visual features as this parameter provides the greatest contribution to speech intelligibility. The first approach uses regression to perform the visual-to-audio mapping, and then two further approaches are explored using vector quantisation techniques and classification models, with long-range temporal information incorporated at the feature and model-level. Excitation information, namely fundamental frequency and aperiodicity, is generated using artificial methods and joint-feature clustering approaches. Evaluations are first performed using mean squared error analyses and objective measures of speech intelligibility to refine the various system configurations, and then subjective listening tests are conducted to determine word-level accuracy, giving real intelligibility scores, of reconstructed speech. The best performing visual-to-audio domain mapping approach, using a clustering-and-classification framework with feature-level temporal encoding, is able to achieve audio-only intelligibility scores of 77 %, and audiovisual intelligibility scores of 84 %, on the GRID dataset. Furthermore, the methods are applied to a larger and more continuous dataset, with less favourable results, but with the belief that extensions to the work presented will yield a further increase in intelligibility

Recommended from our members

Human retroviruses and AIDS 1996. A compilation and analysis of nucleic acid and amino acid sequences

Author: Foley B.
Jeang K. T.
Korber B.
Los Alamos National Lab. NM (United States). Theoretical Div.
Mellors J. W.
Myers G.
National Institutes of Health Bethesda, MD (United States). Molecular Virology Section]
Univ. of Pittsburgh PA (United States)]
Wain-Hobson S.
Publication venue: 'Office of Scientific and Technical Information (OSTI)'
Publication date: 01/04/1997
Field of study

This compendium and the accompanying floppy diskettes are the result of an effort to compile and rapidly publish all relevant molecular data concerning the human immunodeficiency viruses (HIV) and related retroviruses. The scope of the compendium and database is best summarized by the five parts that it comprises: (1) Nuclear Acid Alignments and Sequences; (2) Amino Acid Alignments; (3) Analysis; (4) Related Sequences; and (5) Database Communications. Information within all the parts is updated throughout the year on the Web site, http://hiv-web.lanl.gov. While this publication could take the form of a review or sequence monograph, it is not so conceived. Instead, the literature from which the database is derived has simply been summarized and some elementary computational analyses have been performed upon the data. Interpretation and commentary have been avoided insofar as possible so that the reader can form his or her own judgments concerning the complex information. In addition to the general descriptions of the parts of the compendium, the user should read the individual introductions for each part

UNT Digital Library