494 research outputs found
Generalized Hidden Filter Markov Models Applied to Speaker Recognition
Classification of time series has wide Air Force, DoD and commercial interest, from automatic target recognition systems on munitions to recognition of speakers in diverse environments. The ability to effectively model the temporal information contained in a sequence is of paramount importance. Toward this goal, this research develops theoretical extensions to a class of stochastic models and demonstrates their effectiveness on the problem of text-independent (language constrained) speaker recognition. Specifically within the hidden Markov model architecture, additional constraints are implemented which better incorporate observation correlations and context, where standard approaches fail. Two methods of modeling correlations are developed, and their mathematical properties of convergence and reestimation are analyzed. These differ in modeling correlation present in the time samples and those present in the processed features, such as Mel frequency cepstral coefficients. The system models speaker dependent phonemes, making use of word dictionary grammars, and recognition is based on normalized log-likelihood Viterbi decoding. Both closed set identification and speaker verification using cohorts are performed on the YOHO database. YOHO is the only large scale, multiple-session, high-quality speech database for speaker authentication and contains over one hundred speakers stating combination locks. Equal error rates of 0.21% for males and 0.31% for females are demonstrated. A critical error analysis using a hypothesis test formulation provides the maximum number of errors observable while still meeting the goal error rates of 1% False Reject and 0.1% False Accept. Our system achieves this goal
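The equal error rates quoted above are the operating point where the false-reject and false-accept rates coincide. A minimal sketch of how such a figure could be computed from verification scores follows; the function name `equal_error_rate` and the toy score lists are illustrative assumptions, not the procedure or data of the cited work.

```python
def equal_error_rate(genuine, impostor):
    """Return (eer, threshold) at the candidate threshold where the
    false-reject and false-accept rates are closest to each other."""
    best = None
    # Every observed score is tried as a decision threshold.
    for t in sorted(set(genuine) | set(impostor)):
        frr = sum(s < t for s in genuine) / len(genuine)     # true speakers rejected
        far = sum(s >= t for s in impostor) / len(impostor)  # impostors accepted
        gap = abs(frr - far)
        if best is None or gap < best[0]:
            best = (gap, (frr + far) / 2, t)
    return best[1], best[2]


# Toy scores (assumed): higher means "more likely the claimed speaker".
genuine = [2.1, 1.8, 2.5, 0.9, 1.7]
impostor = [0.2, 1.0, 0.5, -0.3, 0.8]
eer, threshold = equal_error_rate(genuine, impostor)
```

With real data the two rates rarely match exactly, which is why the sketch reports their average at the closest point.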
Automatsko raspoznavanje hrvatskoga govora velikoga vokabulara
This paper presents procedures used for development of a Croatian large vocabulary automatic speech recognition system (LVASR). The proposed acoustic model is based on context-dependent triphone hidden Markov models and Croatian phonetic rules. Different acoustic and language models, developed using a large collection of Croatian speech, are discussed and compared. The paper proposes the feature vectors and acoustic modeling procedures that achieve the lowest word error rates for Croatian speech. In addition, Croatian language modeling procedures are evaluated and adopted for speaker independent spontaneous speech recognition. The presented experiments and results show that the proposed approach for automatic speech recognition using context-dependent acoustic modeling based on Croatian phonetic rules and a parameter tying procedure can be used for efficient Croatian large vocabulary speech recognition with word error rates below 5%.
Speech Recognition
Chapters in the first part of the book cover all the essential speech processing techniques for building robust, automatic speech recognition systems: the representation for speech signals and the methods for speech-features extraction, acoustic and language modeling, efficient algorithms for searching the hypothesis space, and multimodal approaches to speech recognition. The last part of the book is devoted to other speech processing applications that can use the information from automatic speech recognition for speaker identification and tracking, for prosody modeling in emotion-detection systems and in other speech processing applications that are able to operate in real-world environments, like mobile communication services and smart homes
Confidence Scoring and Speaker Adaptation in Mobile Automatic Speech Recognition Applications
Generally, the user group of a language is remarkably diverse in terms of speaker-specific characteristics such as dialect and speaking style. Hence, the quality of spoken content varies notably from one individual to another. This diversity causes problems for Automatic Speech Recognition systems. An Automatic Speech Recognition system should be able to assess its hypothesised results. This can be done by evaluating a confidence measure on the recognition results and comparing the resulting measure to a specified threshold. This measure, referred to as the confidence score, indicates how reliable a particular recognition result is for the given speech.
A system should perform optimally irrespective of input speaker characteristics. However, most systems are inflexible and non-adaptive, so speaker adaptability can be improved. To achieve these goals, a solid criterion is required to evaluate the quality of spoken content, and the system should be made robust and adaptive towards new speakers.
This thesis implements a confidence score using posterior probabilities to examine the quality of the output, based on the speech data and corpora provided by Devoca Oy. Furthermore, the speaker adaptation algorithms Maximum Likelihood Linear Regression and Maximum a Posteriori are applied to a GMM-HMM system and their results are compared. Experiments show that Maximum a Posteriori adaptation brings a 2% to 25% improvement in the word error rates of the semi-continuous model and is recommended for use in the commercial product. The results of other methods are also reported. In addition, a word graph is suggested as the method for obtaining posterior probabilities. Since the confidence score does not guarantee a comparable improvement in the results, it is proposed as an optional feature for the system.
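The idea of scoring a recognition result by its posterior probability and comparing it to a threshold can be sketched as follows. This is a simplified illustration that normalises the log scores of a short list of competing hypotheses. The thesis itself suggests a word graph for obtaining posteriors, so the N-best style input, the function names, and the 0.7 threshold are all assumptions.

```python
import math


def hypothesis_posteriors(log_scores):
    """Normalise the log scores of competing hypotheses into posteriors."""
    m = max(log_scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in log_scores]
    total = sum(exps)
    return [e / total for e in exps]


def accept(log_scores, threshold=0.7):
    """Return (index, confidence, accepted) for the top-scoring hypothesis."""
    post = hypothesis_posteriors(log_scores)
    best = max(range(len(post)), key=post.__getitem__)
    return best, post[best], post[best] >= threshold


# One clear winner: the confidence is high and the result is accepted.
clear = accept([-10.0, -12.0, -15.0])
# Two equally likely hypotheses: the confidence is 0.5 and the result is rejected.
ambiguous = accept([-10.0, -10.0])
```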
ARTICULATORY INFORMATION FOR ROBUST SPEECH RECOGNITION
Current Automatic Speech Recognition (ASR) systems fail to perform nearly as well as human speech recognition due to their lack of robustness against speech variability and noise contamination. The goal of this dissertation is to investigate these critical robustness issues, put forth different ways to address them and finally present an ASR architecture based upon these robustness criteria.
Acoustic variations adversely affect the performance of current phone-based ASR systems, in which speech is modeled as "beads-on-a-string", where the beads are the individual phone units. While phone units are distinctive in the cognitive domain, they vary in the physical domain, and their variation occurs due to a combination of factors including speech style, speaking rate etc.; a phenomenon commonly known as "coarticulation". Traditional ASR systems address such coarticulatory variations by using contextualized phone units such as triphones. Articulatory phonology accounts for coarticulatory variations by modeling speech as a constellation of constricting actions known as articulatory gestures. In such a framework, speech variations such as coarticulation and lenition are accounted for by gestural overlap in time and gestural reduction in space. To realize a gesture-based ASR system, articulatory gestures have to be inferred from the acoustic signal. At the initial stage of this research, a study was performed using synthetically generated speech to obtain a proof-of-concept that articulatory gestures can indeed be recognized from the speech signal. It was observed that having vocal tract constriction trajectories (TVs) as an intermediate representation facilitated the gesture recognition task from the speech signal.
Presently no natural speech database contains articulatory gesture annotation; hence an automated iterative time-warping architecture is proposed that can annotate any natural speech database with articulatory gestures and TVs. Two natural speech databases, X-ray microbeam and Aurora-2, were annotated, where the former was used to train a TV-estimator and the latter was used to train a Dynamic Bayesian Network (DBN) based ASR architecture. The DBN architecture used two sets of observations: (a) acoustic features in the form of mel-frequency cepstral coefficients (MFCCs) and (b) TVs (estimated from the acoustic speech signal). In this setup the articulatory gestures were modeled as hidden random variables, hence eliminating the necessity for explicit gesture recognition. Word recognition results using the DBN architecture indicate that articulatory representations not only can help to account for coarticulatory variations but can also significantly improve the noise robustness of ASR systems.
On adaptive decision rules and decision parameter adaptation for automatic speech recognition
Recent advances in automatic speech recognition are accomplished by designing a plug-in maximum a posteriori decision rule such that the forms of the acoustic and language model distributions are specified and the parameters of the assumed distributions are estimated from a collection of speech and language training corpora. Maximum-likelihood point estimation is by far the most prevalent training method. However, due to the problems of unknown speech distributions, sparse training data, high spectral and temporal variabilities in speech, and possible mismatch between training and testing conditions, a dynamic training strategy is needed. To cope with the changing speakers and speaking conditions in real operational conditions for high-performance speech recognition, such paradigms incorporate a small amount of speaker and environment specific adaptation data into the training process. Bayesian adaptive learning is an optimal way to combine prior knowledge in an existing collection of general models with a new set of condition-specific adaptation data. In this paper, the mathematical framework for Bayesian adaptation of acoustic and language model parameters is first described. Maximum a posteriori point estimation is then developed for hidden Markov models and a number of useful parameter densities commonly used in automatic speech recognition and natural language processing.
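How a maximum a posteriori point estimate combines prior knowledge with a small amount of adaptation data can be illustrated with the simplest textbook case: the mean of a scalar Gaussian with known variance and a conjugate Gaussian prior. The sketch below is not the paper's HMM formulation. The function name `map_mean` and the prior weight `tau` are assumptions chosen for illustration.

```python
def map_mean(prior_mean, tau, samples):
    """MAP estimate of a Gaussian mean under a conjugate Gaussian prior.

    tau acts as a pseudo-count: the prior mean is weighted as if it had
    been observed tau times before any adaptation data arrived.
    """
    n = len(samples)
    return (tau * prior_mean + sum(samples)) / (tau + n)


# No adaptation data: the estimate stays at the prior (general model) mean.
no_data = map_mean(0.0, 10.0, [])
# As much data as prior weight: the estimate lies halfway between the two.
some_data = map_mean(0.0, 10.0, [1.0] * 10)
# Plenty of data: the estimate approaches the sample mean.
much_data = map_mean(0.0, 10.0, [1.0] * 1000)
```

The three calls show the interpolation behaviour: with sparse data the estimate stays near the prior, and with abundant data it converges towards the maximum-likelihood estimate.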
Design of hardware architectures for HMM-based signal processing systems with applications to advanced human-machine interfaces
This thesis describes a new approach to the development of human-computer interfaces. In particular, it
considers pattern recognition systems based on Hidden Markov Models.
The research started from the development of techniques for the realization of natural language speech
recognition systems. The Hidden Markov Model (HMM) was chosen as the main algorithmic tool for
building the system. After the early work, the goal was extended to the development of a hardware
architecture providing a reconfigurable tool usable in any pattern recognition task, not only in
speech recognition.
The work is thus focused on the development of dedicated hardware architectures, but some new
results have also been obtained on the classification of electroencephalographic signals using
HMMs.
First, a system-level architecture was developed for use in HMM-based pattern recognition
systems. The architecture was conceived so that it can work as a stand-alone system. Then a
flexible and completely reconfigurable hardware HMM processor was described in VHDL, and
the design was successfully simulated. A parallel array of these processors is the core processing
block of the developed architecture.
Two suitable FPGA-based fast prototyping platforms were then identified as targets for
the implementation tests. Different configurations of parallel HMM processor arrays were set up and
mapped onto the target FPGAs. The solutions offering the best balance between
performance and resource utilization were selected.
Furthermore, a software HMM-based pattern recognition system was chosen as the reference system
for the functionality of the implemented subsystems. A set of tests was developed to verify
the correct functionality of the hardware. The implemented system was compared to the reference system
on the basis of the test results, and it was found that the behavior was as expected and the required
functionality was correctly achieved.
Finally, the implementation of the parallel HMM array was tested through two real-world
applications: a speech recognition task and a brain-computer interface task. In both cases the architecture
proved functionally suitable and powerful enough to handle the task without problems. Applying
hardware processing to speech recognition opens new perspectives in the design of this kind of system
because of the dramatic increase in performance. The application to brain-computer interfaces is of particular
interest because it introduces a new approach to EEG classification, showing how a future
development of interfaces based on the classification of spontaneous thought could be possible.
This work can evolve in many directions. Effort could be spent on
the implementation of the developed architecture as a stand-alone reconfigurable system suitable for any kind
of HMM-based pattern recognition task. The potential performance of such a system could open the way
to extremely complex real-time pattern recognition systems, and thus to the realization of truly multimodal
interfaces, with a variety of applications, from space systems to aids for the impaired.
Text-Independent Automatic Speaker Identification Using Partitioned Neural Networks
This dissertation introduces a binary partitioned approach to statistical pattern classification which is applied to talker identification using neural networks. In recent years artificial neural networks have been shown to work exceptionally well for small but difficult pattern classification tasks. However, their application to large tasks (i.e., having more than ten to 20 categories) is limited by a dramatic increase in required training time. The time required to train a single network to perform N-way classification is nearly proportional to the exponential of N. In contrast, the binary partitioned approach requires training times on the order of N^2. Besides partitioning, other related issues were investigated such as acoustic feature selection for speaker identification and neural network optimization.
The binary partitioned approach was used to develop an automatic speaker identification system for 120 male and 130 female speakers of a standard speech database. The system performs with 100% accuracy in a text-independent mode when trained with about nine to 14 seconds of speech and tested with six to eight seconds of speech.
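One way to see where an N^2 training cost can come from is to assume that the partitioning trains one small two-class classifier per pair of speakers and lets the classifiers vote. The abstract does not spell out the scheme, so the pairwise reading, the function names, and the toy decision rule below are all assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations


def pair_count(n):
    """Number of two-class classifiers needed for n speakers: n(n-1)/2."""
    return n * (n - 1) // 2


def identify(speakers, pairwise_winner):
    """Pick the speaker who wins the most pairwise decisions.

    pairwise_winner(a, b) is a hypothetical stand-in for a trained binary
    network that returns whichever of a or b better matches the utterance.
    """
    votes = Counter(pairwise_winner(a, b) for a, b in combinations(speakers, 2))
    return votes.most_common(1)[0][0]


# 120 male plus 130 female speakers, as in the abstract above.
classifiers_needed = pair_count(120 + 130)

# Toy decision rule (assumed): "b" wins every pair it takes part in.
winner = identify(["a", "b", "c"], lambda x, y: "b" if "b" in (x, y) else x)
```

Under this reading the number of classifiers grows quadratically with the number of speakers, while each classifier stays small because it only separates two speakers.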