1,747 research outputs found

    ARTICULATORY INFORMATION FOR ROBUST SPEECH RECOGNITION

    Current Automatic Speech Recognition (ASR) systems do not perform nearly as well as human speech recognition because they lack robustness against speech variability and noise contamination. The goal of this dissertation is to investigate these critical robustness issues, put forth different ways to address them, and finally present an ASR architecture based upon these robustness criteria. Acoustic variations adversely affect the performance of current phone-based ASR systems, in which speech is modeled as 'beads-on-a-string', where the beads are the individual phone units. While phone units are distinctive in the cognitive domain, they vary in the physical domain, and their variation arises from a combination of factors including speech style and speaking rate, a phenomenon commonly known as 'coarticulation'. Traditional ASR systems address such coarticulatory variations by using contextualized phone units such as triphones. Articulatory phonology accounts for coarticulatory variations by modeling speech as a constellation of constricting actions known as articulatory gestures. In such a framework, speech variations such as coarticulation and lenition are accounted for by gestural overlap in time and gestural reduction in space. To realize a gesture-based ASR system, articulatory gestures have to be inferred from the acoustic signal. At the initial stage of this research, a study was performed using synthetically generated speech to obtain a proof-of-concept that articulatory gestures can indeed be recognized from the speech signal. It was observed that having vocal tract constriction trajectories (TVs) as an intermediate representation facilitated the gesture recognition task from the speech signal. Presently no natural speech database contains articulatory gesture annotation; hence an automated iterative time-warping architecture is proposed that can annotate any natural speech database with articulatory gestures and TVs.
Two natural speech databases, X-ray microbeam and Aurora-2, were annotated; the former was used to train a TV-estimator and the latter to train a Dynamic Bayesian Network (DBN) based ASR architecture. The DBN architecture used two sets of observations: (a) acoustic features in the form of mel-frequency cepstral coefficients (MFCCs) and (b) TVs (estimated from the acoustic speech signal). In this setup the articulatory gestures were modeled as hidden random variables, hence eliminating the necessity for explicit gesture recognition. Word recognition results using the DBN architecture indicate that articulatory representations not only can help to account for coarticulatory variations but can also significantly improve the noise robustness of ASR system

    ANALYTIC MODELS IN PHONOLOGY

    Statistical and Dynamical Modeling of Riemannian Trajectories with Application to Human Movement Analysis

    The data explosion in the past decade is in part due to the widespread use of rich sensors that measure various physical phenomena -- gyroscopes that measure orientation in phones and fitness devices, the Microsoft Kinect which measures depth information, etc. A typical application requires inferring the underlying physical phenomenon from data, which is done using machine learning. A fundamental assumption in training models is that the data is Euclidean, i.e. the metric is the standard Euclidean distance governed by the L-2 norm. However, in many cases this assumption is violated, as when the data lies on non-Euclidean spaces such as Riemannian manifolds. While the underlying geometry accounts for the non-linearity, accurate analysis of human activity also requires temporal information to be taken into account. Human movement has a natural interpretation as a trajectory on the underlying feature manifold, as it evolves smoothly in time. A commonly occurring theme in many emerging problems is the need to represent, compare, and manipulate such trajectories in a manner that respects the geometric constraints. This dissertation is a comprehensive treatise on modeling Riemannian trajectories to understand and exploit their statistical and dynamical properties. Such properties allow us to formulate novel representations for Riemannian trajectories. For example, the physical constraints on human movement are rarely considered, which results in an unnecessarily large space of features, making search, classification and other applications more complicated. Exploiting statistical properties can help us understand the true space of such trajectories. In applications such as stroke rehabilitation where there is a need to differentiate between very similar kinds of movement, dynamical properties can be much more effective.
In this regard, we propose a generalization of the Lyapunov exponent to Riemannian manifolds and show its effectiveness for human activity analysis. The theory developed in this thesis naturally leads to several benefits in areas such as data mining, compression, dimensionality reduction, classification, and regression. Dissertation/Thesis. Doctoral Dissertation Electrical Engineering 201

    Doctor of Philosophy

    Statistical analysis of time-dependent imaging data is crucial for understanding normal anatomical development as well as disease progression. The most promising studies are of longitudinal design, where repeated observations are obtained from the same subjects. Analysis in this case is challenging due to the difficulty in modeling longitudinal changes, such as growth, and comparing changes across different populations. In any case, the study of anatomical change over time has the potential to further our understanding of many dynamic processes. What is needed are accurate computational models to capture, describe, and quantify anatomical change over time. Anatomical shape is encoded in a variety of representations, such as medical imaging data and derived geometric information extracted as points, curves, and/or surfaces. By considering various shape representations embedded into the same ambient space as a shape complex, either in 2D or 3D, we obtain a more comprehensive description of the anatomy than provided by a single isolated shape. In this dissertation, we develop spatiotemporal models of anatomical change designed to leverage multiple shape representations simultaneously. Rather than directly study the geometric changes to a shape itself, we instead consider how the ambient space deforms, which allows all embedded shapes to be included simultaneously in model estimation. Around this idea, we develop two complementary spatiotemporal models: a flexible nonparametric model designed to capture complex anatomical trajectories, and a generative model designed as a compact statistical representation of anatomical change. We present several ways spatiotemporal models can support the statistical analysis of scalar measurements, such as volume, extracted from shape.
Finally, we cover the statistical analysis of higher dimensional shape features to take better advantage of the rich morphometric information provided by shape, as well as the trajectory of change captured by spatiotemporal models

    Prosody-Based Automatic Segmentation of Speech into Sentences and Topics

    A crucial step in processing speech audio data for information extraction, topic detection, or browsing/playback is to segment the input into sentence and topic units. Speech segmentation is challenging, since the cues typically present for segmenting text (headers, paragraphs, punctuation) are absent in spoken language. We investigate the use of prosody (information gleaned from the timing and melody of speech) for these tasks. Using decision tree and hidden Markov modeling techniques, we combine prosodic cues with word-based approaches, and evaluate performance on two speech corpora, Broadcast News and Switchboard. Results show that the prosodic model alone performs on par with, or better than, word-based statistical language models -- for both true and automatically recognized words in news speech. The prosodic model achieves comparable performance with significantly less training data, and requires no hand-labeling of prosodic events. Across tasks and corpora, we obtain a significant improvement over word-only models using a probabilistic combination of prosodic and lexical information. Inspection reveals that the prosodic models capture language-independent boundary indicators described in the literature. Finally, cue usage is task and corpus dependent. For example, pause and pitch features are highly informative for segmenting news speech, whereas pause, duration and word-based cues dominate for natural conversation. Comment: 30 pages, 9 figures. To appear in Speech Communication 32(1-2), Special Issue on Accessing Information in Spoken Audio, September 200

    Rate-invariant analysis of covariance trajectories

    Statistical analysis of dynamic systems, such as videos and dynamic functional connectivity, is often translated into a problem of analyzing trajectories of relevant features, particularly covariance matrices. As an example, in video-based action recognition, a natural mathematical representation of activity videos is as parameterized trajectories on the set of symmetric, positive-definite matrices (SPDMs). The variable execution rates of actions, implying arbitrary parameterizations of trajectories, complicate their analysis and classification. To handle this challenge, we represent covariance trajectories using transported square-root vector fields (TSRVFs), constructed by parallel translating scaled-velocity vectors of trajectories to their starting points. The space of such representations forms a vector bundle on the SPDM manifold. Using a natural Riemannian metric on this vector bundle, we approximate geodesic paths and geodesic distances between trajectories in the quotient space of this vector bundle. This metric is invariant to the action of the reparameterization group, and leads to a rate-invariant analysis of trajectories. In the process, we remove the parameterization variability and temporally register trajectories during analysis. We demonstrate this framework in multiple contexts, using both generative statistical models and discriminative data analysis. The latter is illustrated using several applications involving video-based action recognition and dynamic functional connectivity analysis