

New Metrics on Image Articulation Manifolds Using Optical Flow

Author: Nagaraj Sriram
Publication date: 01/01/2011

Image articulation manifolds (IAMs) arise in a wide variety of contexts in image processing and computer vision applications. IAMs are a natural nonlinear model for image ensembles generated by the variation of imaging parameters (scale, pose, lighting, etc.). In the past, IAMs have been studied as embedded submanifolds of higher dimensional Euclidean spaces. However, this view suffers from two major defects: lack of a meaningful metric and reliance on linear transport operators via tangent vectors. Recent work in the area indicates the existence of better nonlinear transport operators for IAMs, with optical flow based transport being a prime candidate. In this thesis, we provide a detailed theoretical analysis of optical flow based transport on IAMs. In particular, we develop new analytical tools reminiscent of differential geometry to handle the a priori data-driven nature of IAMs using the notion of optical flow manifolds (OFMs). We define an appropriate metric on the IAM via a metric on the corresponding OFMs that satisfy certain local isometry conditions, and we show how to use this new metric to develop a host of mathematical tools such as optical flow fields on the IAM, parallel fields and parallel transport, as well as an intuitive notion of "optical curvature". We show that the space of optical flow fields along a path of constant optical curvature has a natural multiscale structure. We also consider the question of approximating non-parallel flow fields by parallel flow fields.

DSpace at Rice University

Isometry and convexity in dimensionality reduction

Author: Vasiloglou Nikolaos
Publication venue: Georgia Institute of Technology
Publication date: 30/03/2009

The amount of data generated every year is growing exponentially. The number of data points as well as the number of dimensions has increased dramatically over the past 15 years. The gap between the demand from the industry in data processing and the solutions provided by the machine learning community is increasing. Despite the growth in memory and computational power, advanced statistical processing on the order of gigabytes is beyond any possibility. Most sophisticated Machine Learning algorithms require at least quadratic complexity. With the current computer model architecture, algorithms with higher complexity than linear O(N) or O(N log N) are not considered practical. Dimensionality reduction is a challenging problem in machine learning. Often data represented as multidimensional points happen to have high dimensionality. It turns out that the information they carry can be expressed with far fewer dimensions. Moreover, the reduced dimensions of the data can have better interpretability than the original ones. There is a great variety of dimensionality reduction algorithms under the theory of Manifold Learning. Most of the methods, such as Isomap, Local Linear Embedding, Local Tangent Space Alignment, Diffusion Maps, etc., have been extensively studied under the framework of Kernel Principal Component Analysis (KPCA). In this dissertation we study two current state of the art dimensionality reduction methods, Maximum Variance Unfolding (MVU) and Non-Negative Matrix Factorization (NMF). These two dimensionality reduction methods do not fit under the umbrella of Kernel PCA. MVU is cast as a Semidefinite Program, a modern convex nonlinear optimization algorithm, that offers more flexibility and power compared to KPCA. Although MVU and NMF seem to be two disconnected problems, we show that there is a connection between them. Both are special cases of a general nonlinear factorization algorithm that we developed.
Two aspects of the algorithms are of particular interest: computational complexity and interpretability. In other words, computational complexity answers the question of how fast we can find the best solution of MVU/NMF for large data volumes. Since we are dealing with optimization programs, we need to find the global optimum. The global optimum is strongly connected with the convexity of the problem. Interpretability is strongly connected with local isometry, which gives meaning to relationships between data points. Another aspect of interpretability is the association of data with labeled information. The contributions of this thesis are the following:
1. MVU is modified so that it can scale more efficiently. Results are shown on 1 million speech datasets. Limitations of the method are highlighted.
2. An algorithm for fast computation of the furthest neighbors is presented for the first time in the literature.
3. Construction of optimal kernels for Kernel Density Estimation with modern convex programming is presented. For the first time we show that the Leave-One-Out Cross Validation (LOOCV) function is quasi-concave.
4. For the first time NMF is formulated as a convex optimization problem.
5. An algorithm for the problem of Completely Positive Matrix Factorization is presented.
6. A hybrid algorithm of MVU and NMF, isoNMF, is presented, combining advantages of both methods.
7. Isometric Separation Maps (ISM), a variation of MVU that contains classification information, is presented.
8. Large scale nonlinear dimensional analysis on the TIMIT speech database is performed.
9. A general nonlinear factorization algorithm is presented based on sequential convex programming.
Despite the efforts to scale the proposed methods up to 1 million data points in reasonable time, the gap between the industrial demand and the current state of the art is still orders of magnitude wide. Ph.D. Committee Chair: David Anderson; Committee Co-Chair: Alexander Gray; Committee Member: Anthony Yezzi; Committee Member: Hongyuan Zha; Committee Member: Justin Romberg; Committee Member: Ronald Schafe

Scholarly Materials And Research @ Georgia Tech

Computer analysis of children's non-native English speech for language learning and assessment

Author: Qian Mengjie
Publication date: 08/12/2021

Children's ASR appears to be more challenging than adults', and it is even more difficult when it comes to non-native children's speech. This research investigates different techniques to compensate for the effects of non-native and child speech on the performance of ASR systems. The study mainly utilises hybrid DNN-HMM systems with conventional DNNs, LSTMs and more advanced TDNN models. This work uses the CALL-ST corpus and TLT-school corpus to study children's non-native English speech. Initially, data augmentation was explored on the CALL-ST corpus to address the lack of data problem using the AMI corpus and PF-STAR German corpus. Feature selection, acoustic model adaptation and selection were also investigated on CALL-ST. More aspects of the ASR system, including pronunciation modelling, acoustic modelling, language modelling and system fusion, were explored on the TLT-school corpus, as this corpus has a larger amount of data. Then, the relationships between the CALL-ST and TLT-school corpora were studied and utilised to improve ASR performance. The other part of the present work is text processing for non-native children's English speech. We focused on providing accept/reject feedback to learners based on the text generated by the ASR system from learners' spoken responses. A rule-based and a machine learning-based system were proposed for making the judgement; several aspects of the systems were evaluated. The influence of the ASR system on the text processing system was explored.

University of Birmingham Research Archive, E-theses Repository

Sparse and Low-rank Modeling for Automatic Speech Recognition

Author: Dighe Pranay
Publication venue: Lausanne, EPFL
Publication date: 28/02/2019

This thesis deals with exploiting the low-dimensional multi-subspace structure of speech towards the goal of improving acoustic modeling for automatic speech recognition (ASR). Leveraging the parsimonious hierarchical nature of speech, we hypothesize that whenever a speech signal is measured in a high-dimensional feature space, the true class information is embedded in low-dimensional subspaces whereas noise is scattered as random high-dimensional erroneous estimations in the features. In this context, the contribution of this thesis is twofold: (i) identify sparse and low-rank modeling approaches as excellent tools for extracting the class-specific low-dimensional subspaces in speech features, and (ii) employ these tools under novel ASR frameworks to enrich the acoustic information present in the speech features towards the goal of improving ASR. Techniques developed in this thesis focus on deep neural network (DNN) based posterior features which, under the sparse and low-rank modeling approaches, unveil the underlying class-specific low-dimensional subspaces very elegantly. In this thesis, we tackle ASR tasks of varying difficulty, ranging from isolated word recognition (IWR) and connected digit recognition (CDR) to large-vocabulary continuous speech recognition (LVCSR). For IWR and CDR, we propose a novel Compressive Sensing (CS) perspective towards ASR. Here exemplar-based speech recognition is posed as a problem of recovering sparse high-dimensional word representations from compressed low-dimensional phonetic representations. In the context of LVCSR, this thesis argues that despite their power in representation learning, DNN based acoustic models still have room for improvement in exploiting the union of low-dimensional subspaces structure of speech data.
Therefore, this thesis proposes to enhance DNN posteriors by projecting them onto the manifolds of the underlying classes using principal component analysis (PCA) or compressive sensing based dictionaries. Projected posteriors are shown to be more accurate training targets for learning better acoustic models, resulting in improved ASR performance. The proposed approach is evaluated on both close-talk and far-field conditions, confirming the importance of sparse and low-rank modeling of speech in building a robust ASR framework. Finally, the conclusions of this thesis are further consolidated by an information theoretic analysis approach which explicitly quantifies the contribution of proposed techniques in improving ASR.

Infoscience - École polytechnique fédérale de Lausanne

By

Author: Andrew R. Plummer
Prof Mary E. Beckman
Prof William Schuler

Vowel normalization is a computation that is meant to account for the differences in the absolute direct (physical or psychophysical) representations of qualitatively equivalent vowel productions that arise due to differences in speaker properties such as body size types, age, gender, and other socially interpreted categories that are based on natural variation in vocal tract size and shape. In this dissertation, we address the metaphysical and epistemological aspects of vowel normalization pertaining to spoken language acquisition during early infancy. We begin by reviewing approaches to conceptualizing and modeling the phonetic components of early spoken language acquisition, forming a catalog of phenomena that serves as the basis for our discourse. We then establish the existence of a vowel normalization computation carried out by infants early in their spoken language acquisition, and put forward a conceptual and technical framework for its investigation which focuses attention on the generative nature of the computation. We then situate the acquisition of vowel normalization within a broader developmental framework encompassing a suite of vocal learning phenomena, including language-specific caretaker vocal exchanges.

CiteSeerX


Effects of manipulating fundamental frequency and speech rate on synthetic voice recognition performance and perceived speaker identity, sex, and age

Author: Gous GE
Publication date: 01/09/2017

Vocal fundamental frequency (F0) and speech rate provide the listener with important information relating to the identity, sex, and age of the speaker. Furthermore, it has also been demonstrated that manipulations in F0 or speech rate can lead to accentuation effects in voice memory. As a result, listeners appear to exaggerate the representation of a target voice in terms of F0 or speech rate, and mistakenly remember it as being higher or lower in F0, or faster or slower in speech rate, than the voice originally heard. The aim of this thesis was to understand the effect of manipulations/shifts in F0 or speech rate on voice matching performance and perceived speaker identity, sex, and age. Synthesised male and female voices speaking prescribed sentences were generated and shifted in either F0 or speech rate. In the first set of experiments (Experiments 2, 3, and 4), male and female listeners made judgements about the perceived identity, sex, or age of the speaker. In the second set of experiments (Experiments 5, 6, and 7), male and female listeners made target matching responses for voices presented with and without a delay, and with different spoken sentences. The results of Experiments 2, 3, and 4 indicated the following: (1) Shifts in either F0 or speech rate increased uncertainty about the identity of the speaker, though identity judgements were more robust to shifts in speech rate than they were to shifts in F0. (2) Shifts in F0 also increased uncertainty about speaker sex, but shifts in speech rate did not. Male voices were accurately perceived as male irrespective of the direction of manipulation in F0. However, for female voices, decreasing F0 increased the uncertainty of speaker sex (i.e., the voices were more likely to be perceived as male rather than female). (3) Increasing either F0 or speech rate resulted in both male and female voices sounding younger, whereas decreasing either F0 or speech rate led to listeners perceiving the voices as sounding older.
The results of Experiments 5, 6, and 7 indicated the following: (4) Shifts in either F0 or speech rate did increase matching errors for the target voice; however, there was no evidence of an accentuation effect. Specifically, for voices shifted in F0, there was an increase in the selection of voices higher in F0 compared to voices lower in F0. For voices shifted in speech rate, there was an increase in the selection of voices faster in speech rate compared to voices slower in speech rate, but only for slow speech rate target voices. (5) Accentuation errors were no more likely to occur when the inter-stimulus interval was increased, or (6) when a different sentence was spoken in the sequential voice pair to the one previously spoken by the target voice. The findings have theoretical and applied relevance. The work has provided a clearer understanding of how shifts in F0 or speech rate are likely to affect perceptions about the identity, sex, and age of the speaker than was possible to establish from previous studies. It has also contributed further to our understanding of the effect of shifts in F0 or speech rate on voice matching performance, and their importance in accurate recognition. This information might be insightful to the police and help to determine the accuracy of descriptions made about a voice and decisions made during a voice lineup, particularly if a suspect of a crime was likely to be disguising their voice.

Nottingham Trent Institutional Repository (IRep)