
5 research outputs found

Eighty Challenges Facing Speech Input/Output Technologies

Author: Victor Zue
Publication date: 24/04/2020

During the past three decades, we have witnessed remarkable progress in the development of speech input/output technologies. Despite these successes, we are far from reaching human capabilities of recognizing nearly perfectly the speech spoken by many speakers, under varying acoustic environments, with essentially unrestricted vocabulary. Synthetic speech still sounds stilted and robot-like, lacking in real personality and emotion. There are many challenges that will remain unmet unless we can advance our fundamental understanding of human communication: how speech is produced and perceived, utilizing our innate linguistic competence. This paper outlines some of these challenges, ranging from signal presentation and lexical access to language understanding and multimodal integration, and speculates on how these challenges could be met.

CiteSeerX

Speaker Independent Acoustic-to-Articulatory Inversion

Author: Ji An
Publication venue: e-Publications@Marquette
Publication date: 01/10/2014

Acoustic-to-articulatory inversion, the determination of articulatory parameters from acoustic signals, is a difficult but important problem for many speech processing applications, such as automatic speech recognition (ASR) and computer-aided pronunciation training (CAPT). In recent years, several approaches have been successfully implemented for speaker-dependent models with parallel acoustic and kinematic training data. However, in many practical applications inversion is needed for new speakers for whom no articulatory data is available. To address this problem, this dissertation introduces a novel speaker adaptation approach called Parallel Reference Speaker Weighting (PRSW), based on parallel acoustic and articulatory Hidden Markov Models (HMM). This approach uses a robust normalized articulatory space and palate-referenced articulatory features combined with speaker-weighted adaptation to form an inversion mapping for new speakers that can accurately estimate articulatory trajectories. The proposed PRSW method is evaluated on the newly collected Marquette electromagnetic articulography - Mandarin Accented English (EMA-MAE) corpus using 20 native English speakers. Cross-speaker inversion results show that, given a good selection of reference speakers with consistent acoustic and articulatory patterns, the PRSW approach gives good speaker-independent inversion performance even without kinematic training data.

epublications@Marquette

The stop-like modification of /ð/ : a case study in the analysis and handling of speech variation

Author: Zhao Sherry Yi, 1980-
Publication venue: Massachusetts Institute of Technology
Publication date: 01/01/2007

Thesis (Ph.D.), Harvard-MIT Division of Health Sciences and Technology, 2007, by Sherry Y. Zhao. Includes bibliographical references (leaves 138-142).

Phonetic variation is pervasive in everyday speech. Studying these variations is essential for building acoustic models and lexical representations that effectively capture the variability of speech. This thesis examines one of the commonly occurring phonetic variations in English: the stop-like modification of the dental fricative /ð/. This variant exhibits a drastic change from the canonical /ð/; the manner of production is changed from one that is fricative to one that is stop-like. Furthermore, the place of articulation of stop-like /ð/ has been a point of uncertainty, leading to the confusion between stop-like /ð/ and /d/. This thesis aims to uncover the segmental context of stop-like /ð/, possible causes of the modification, whether the dental place of articulation is preserved despite modification, and if there are salient acoustic cues that distinguish between stop-like /ð/ and /d/. Word-initial /ð/ in the read speech of the TIMIT Database, the task-oriented spontaneous speech of the AEMT Corpus, and the non-task-oriented spontaneous speech of the Buckeye Corpus are examined acoustically. It is found that stop-like /ð/ occurs most often when it is preceded by silence or when preceded by a stop consonant. The occurrence is less frequent when /ð/ is preceded by a fricative or an affricate consonant. This modification rarely occurs when /ð/ is preceded by a vowel or liquid consonant. The findings suggest that possible factors that may contribute to the stop-like modification of /ð/ include physiological mechanisms of speech production, prosody, and/or other aspects of speaking style and manner. Acoustic analysis indicates that stop-like /ð/ is significantly different from /d/ in burst amplitude, burst spectrum shape, burst peak frequency, and second formant at following-vowel onset. Moreover, the acoustic differences indicate that the dental place of articulation is preserved for stop-like /ð/. Automatic classification experiments involving these acoustic measures suggest that they are robust in distinguishing stop-like /ð/ from /d/. Applications of these findings may lie in areas of automatic speech recognition, speech transcription, and development of acoustic measures for speech disorder diagnosis.

DSpace@MIT

Applications of broad class knowledge for noise robust speech recognition

Author: Sainath Tara N
Publication venue: Massachusetts Institute of Technology
Publication date: 01/01/2009

Thesis (Ph.D.), Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2009, by Tara N. Sainath. Cataloged from PDF version of thesis. Includes bibliographical references (p. 157-164).

This thesis introduces a novel technique for noise robust speech recognition by first describing a speech signal through a set of broad speech units, and then conducting a more detailed analysis from these broad classes. These classes are formed by grouping together parts of the acoustic signal that have similar temporal and spectral characteristics, and therefore have much less variability than typical sub-word units used in speech recognition (i.e., phonemes, acoustic units). We explore broad classes formed along phonetic and acoustic dimensions. This thesis first introduces an instantaneous adaptation technique to robustly recognize broad classes in the input signal. Given an initial set of broad class models and input speech data, we explore a gradient steepness metric using the Extended Baum-Welch (EBW) transformations to explain how much these initial models must be adapted to fit the target data. We incorporate this gradient metric into a Hidden Markov Model (HMM) framework for broad class recognition and illustrate that this metric allows for a simple and effective adaptation technique which does not suffer from issues such as data scarcity and computational intensity that affect other adaptation methods such as Maximum a-Posteriori (MAP), Maximum Likelihood Linear Regression (MLLR) and feature-space Maximum Likelihood Linear Regression (fMLLR). Broad class recognition experiments indicate that the EBW gradient metric method outperforms the standard likelihood technique, both when initial models are adapted via MLLR and without adaptation. Next, we explore utilizing broad class knowledge as a pre-processor for segment-based speech recognition systems, which have been observed to be quite sensitive to noise. The experiments are conducted with the SUMMIT segment-based speech recognizer, which detects landmarks, representing possible transitions between phonemes, from large energy changes in the acoustic signal. These landmarks are often poorly detected in noisy conditions. We investigate using the transitions between broad classes, which typically occur at areas of large acoustic change in the audio signal, to aid in landmark detection. We also explore broad classes motivated along both acoustic and phonetic dimensions. Phonetic recognition experiments indicate that utilizing either phonetically or acoustically motivated broad classes offers significant recognition improvements compared to the baseline landmark method in both stationary and non-stationary noise conditions. Finally, this thesis investigates using broad class knowledge for island-driven search. Reliable regions of a speech signal, known as islands, carry most information in the signal compared to unreliable regions, known as gaps. Most speech recognizers do not differentiate between island and gap regions during search and as a result most of the search computation is spent in unreliable regions. Island-driven search addresses this problem by first identifying islands in the speech signal and directing the search outwards from these islands. In this thesis, we develop a technique to identify islands from broad classes which have been confidently identified from the input signal. We explore a technique to prune the search space given island/gap knowledge. Finally, to further limit the amount of computation in unreliable regions, we investigate scoring less detailed broad class models in gap regions and more detailed phonetic models in island regions. Experiments on both small and large scale vocabulary tasks indicate that the island-driven search strategy results in an improvement in recognition accuracy and computation time.

DSpace@MIT

Feature extraction and event detection for automatic speech recognition

Author: Stouten Frederik
Publication venue: Ghent University. Faculty of Engineering
Publication date: 01/01/2008

Ghent University Academic Bibliography