
    Soft margin estimation for automatic speech recognition

    In this study, a new discriminative learning framework called soft margin estimation (SME) is proposed for estimating the parameters of continuous-density hidden Markov models (HMMs). The proposed method makes direct use of two successful ideas: the margin in support vector machines, to improve generalization capability, and decision feedback learning in discriminative training, to enhance model separation in classifier design. SME directly maximizes the separation between competing models, so that a test sample still reaches the correct decision as long as its deviation from the training samples stays within a safe margin. Frame and utterance selection are integrated into a unified framework that picks out the training utterances and frames most critical for discriminating between competing models. SME offers a flexible and rigorous framework for incorporating new margin-based optimization criteria into HMM training. The choice of various loss functions is illustrated, and different kinds of separation measures are defined under a unified SME framework. SME is also shown to be able to jointly optimize feature extraction and HMMs. Both the generalized probabilistic descent algorithm and the extended Baum-Welch algorithm are applied to solve SME. SME has demonstrated a clear advantage over other discriminative training methods on several speech recognition tasks. Tested on the TIDIGITS digit recognition task, the proposed SME approach achieves a string accuracy of 99.61%, the best result reported in the literature at the time. On the 5k-word Wall Street Journal task, SME reduced the word error rate (WER) from 5.06% with MLE models to 3.81%, a relative WER reduction of 25%. This is the first attempt to show the effectiveness of margin-based acoustic modeling for large-vocabulary continuous speech recognition in an HMM framework.
    The generalization ability of SME was also well demonstrated on the Aurora 2 robust speech recognition task, with around 30% relative WER reduction from the clean-trained baseline.
    Ph.D. Committee Chair: Dr. Chin-Hui Lee; Committee Members: Dr. Anthony Joseph Yezzi, Dr. Biing-Hwang (Fred) Juang, Dr. Mark Clements, Dr. Ming Yua
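The margin idea that SME borrows from support vector machines can be illustrated with a minimal hinge-style loss sketch. This is not the authors' implementation; the function name and the separation scores are illustrative, with the separation taken as the log-likelihood gap between the correct model and its best competitor:

```python
import numpy as np

def soft_margin_loss(separation, margin=1.0):
    """Hinge-style loss: zero when the separation between the correct
    model and the best competing model exceeds the margin, and growing
    linearly as the sample falls inside or beyond the margin."""
    return np.maximum(0.0, margin - separation)

# separation d(x) = log p(x | correct model) - log p(x | best competitor)
separations = np.array([2.5, 0.4, -1.0])  # illustrative per-utterance scores
print(soft_margin_loss(separations))      # losses: 0.0, 0.6, 2.0
```

Only samples with a separation below the margin contribute to the loss, which is how frame and utterance selection naturally falls out of a margin-based criterion.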

    Solving Large-Margin Hidden Markov Model Estimation via Semidefinite Programming


    Hidden Markov Models

    Hidden Markov Models (HMMs), although known for decades, have recently enjoyed a resurgence and are still under active development. This book presents theoretical issues and a variety of HMM applications in speech recognition and synthesis, medicine, neuroscience, computational biology, bioinformatics, seismology, environmental protection, and engineering. I hope that readers will find this book useful and helpful for their own research.

    Kernel methods in machine learning

    We review machine learning methods employing positive definite kernels. These methods formulate learning and estimation problems in a reproducing kernel Hilbert space (RKHS) of functions defined on the data domain, expanded in terms of a kernel. Working in linear spaces of functions has the benefit of facilitating the construction and analysis of learning algorithms while at the same time allowing large classes of functions, including nonlinear functions as well as functions defined on nonvectorial data. We cover a wide range of methods, ranging from binary classifiers to sophisticated methods for estimation with structured data. Comment: published in the Annals of Statistics (http://www.imstat.org/aos/) by the Institute of Mathematical Statistics, at http://dx.doi.org/10.1214/009053607000000677
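The kernel expansion at the heart of RKHS methods can be sketched in a few lines. This is a generic illustration, not code from the survey: an RKHS function is a finite expansion f(x) = Σᔹ αᔹ k(xᔹ, x) over training points, here with a Gaussian (RBF) kernel and made-up centers and coefficients:

```python
import numpy as np

def rbf_kernel(x, y, gamma=1.0):
    """Gaussian (RBF) kernel, a standard positive definite kernel."""
    return np.exp(-gamma * np.sum((x - y) ** 2))

# Illustrative expansion points and coefficients (as would be produced
# by a kernel learning algorithm such as an SVM or kernel ridge regression).
centers = np.array([[0.0], [1.0], [2.0]])
alphas = np.array([1.0, -0.5, 0.25])

def f(x):
    """RKHS function: a weighted sum of kernel evaluations at the centers."""
    return sum(a * rbf_kernel(c, x) for a, c in zip(alphas, centers))

print(f(np.array([1.0])))
```

The representer theorem guarantees that the minimizer of many regularized learning problems has exactly this finite-expansion form, even though the RKHS itself is infinite-dimensional.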

    Positive Definite Kernels in Machine Learning

    This survey is an introduction to positive definite kernels and the set of methods they have inspired in the machine learning literature, namely kernel methods. We first discuss some properties of positive definite kernels as well as reproducing kernel Hilbert spaces, the natural extension of the set of functions \{k(x,\cdot), x\in\mathcal{X}\} associated with a kernel k defined on a space \mathcal{X}. We discuss at length the construction of kernel functions that take advantage of well-known statistical models. We provide an overview of numerous data-analysis methods that take advantage of reproducing kernel Hilbert spaces, and discuss the idea of combining several kernels to improve performance on certain tasks. We also provide a short cookbook of kernels that are particularly useful for certain data types such as images, graphs, or speech segments.
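The idea of combining kernels rests on closure properties of the positive definite class: sums and products of positive definite kernels are again positive definite. A minimal numerical sanity check (generic code, not from the survey) verifies this by confirming that the Gram matrix of each combined kernel has a non-negative spectrum on a small sample:

```python
import numpy as np

def gram(kernel, X):
    """Gram matrix K[i, j] = kernel(x_i, x_j)."""
    return np.array([[kernel(a, b) for b in X] for a in X])

def k_lin(x, y):   # linear kernel
    return np.dot(x, y)

def k_rbf(x, y):   # Gaussian kernel
    return np.exp(-np.sum((x - y) ** 2))

def k_sum(x, y):   # a sum of positive definite kernels is positive definite
    return k_lin(x, y) + k_rbf(x, y)

def k_prod(x, y):  # and so is their product
    return k_lin(x, y) * k_rbf(x, y)

X = [np.array([0.0, 1.0]), np.array([1.0, 1.0]), np.array([2.0, 0.5])]
for k in (k_lin, k_rbf, k_sum, k_prod):
    eigvals = np.linalg.eigvalsh(gram(k, X))
    assert eigvals.min() >= -1e-10  # non-negative spectrum: Gram matrix is PSD
```

A numerical check on one sample is of course no proof, but it makes the closure properties concrete; weighted sums with learned non-negative weights are the basis of multiple kernel learning.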

    Correlation features and a structured SVM family for phoneme classification and automatic speech recognition

    The foremost aim of this thesis is to introduce concepts that improve both phoneme classification and, as a direct consequence, automatic speech recognition. The most distinctive part of the new approach presented here is that the different stages of the analysis, from feature-vector creation to classification, are all developed on a common basis. This foundation is the interaction between correlation and the formal structure of a tristate phoneme model, which manifests itself in short-time weakly stationary segments and the transitions between them within phonemes. The tristate layout is a topology that partitions a phoneme, or more generally an observed frame, into three main sections: start, middle, and end. In combination with the well-known hidden Markov model (HMM), it models the above-mentioned alternation of stationary states and transitions.
    On the basis of weak stationarity and the tristate structure, the approach evolves as follows. A stochastic process such as a speech signal that is short-time weakly stationary has first- and second-order moments independent of time t; they are affected only by the time span between observations. This effect is reflected in the (auto)covariance of the process and carries over to (auto)correlation and, to some degree, to cross-correlation. In this light, based on common MFCC feature vectors, potential improvements from autocorrelation data are analyzed first, and, motivated by the results, new MFCC autocorrelation features and later specific cross-correlation features are introduced. Here it helps that, in contrast to the different components of a single MFCC vector (which roughly represent the different frequency bands), identical components across different MFCC vectors are in general not decorrelated.
    In a subsequent step, the cross-correlation transform is integrated into the support vector classifiers used for phoneme classification: a specialized reproducing kernel utilized by the classifiers is deduced directly from the transform. The theoretical prerequisites for the new kernel are derived, and its necessary properties are proven. In line with the new reproducing kernel, a family of classifiers is introduced whose structure, following the features, is modeled directly on the tristate topology and is likewise influenced by correlation: a specifically structured collection of classes and associated support vector classifiers. Taken together, these concepts aim at a framework that represents and models both stationarity and transitions within acoustic events more adequately than previous recognition and classification systems.
    To demonstrate the success of this approach, experiments show the improved recognition rates resulting from the new topology. The framework is then integrated into a common automatic speech recognition system and evaluated in this context; experiments comparing the new approach to a standard recognition system again reveal its potential. Finally, prospects and suggestions for further improvements conclude the thesis.
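The autocorrelation features built on MFCC sequences can be sketched as follows. The exact feature definition in the thesis may differ; this generic sketch only illustrates per-coefficient autocorrelation across frames, with the array shapes and lag choices as assumptions:

```python
import numpy as np

def autocorr_features(frames, max_lag=3):
    """Per-coefficient autocorrelation features of an MFCC frame sequence.

    frames: array of shape (n_frames, n_coeffs); returns an array of
    shape (max_lag, n_coeffs) whose entry [l, c] is the average product
    of coefficient c with itself l+1 frames later.
    """
    n_frames, n_coeffs = frames.shape
    feats = np.empty((max_lag, n_coeffs))
    for lag in range(1, max_lag + 1):
        feats[lag - 1] = np.mean(frames[:-lag] * frames[lag:], axis=0)
    return feats

rng = np.random.default_rng(0)
mfcc = rng.standard_normal((100, 13))  # 100 frames of 13 MFCC coefficients
print(autocorr_features(mfcc).shape)   # (3, 13)
```

Under short-time weak stationarity these averages depend only on the lag, not on absolute time, which is exactly the property the tristate segments are meant to capture.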


    Learning in vision and robotics

    I present my work on learning from video and robotic input. This is an important problem, with numerous potential applications. The use of machine learning makes it possible to obtain models that can handle noise and variation without explicitly programming for them. It also raises the possibility of robots that can interact more seamlessly with humans rather than only exhibiting hard-coded behaviors. I present my work in two areas: video action recognition and robot navigation. First, I present a video action recognition method that represents actions in video by sequences of retinotopic appearance and motion detectors, learns such models automatically from training data, and allows actions in new video to be recognized and localized completely automatically. Second, I present a new method that allows a mobile robot to learn word meanings from a combination of robot sensor measurements and sentential descriptions corresponding to a set of robotically driven paths. These word meanings support automatic driving from sentential input and generation of sentential descriptions of new paths. Finally, I also present work on a new action recognition dataset, and comparisons of the performance of recent methods on this dataset and others.

    On the use of support vector machines for phonetic classification

