
    Segmentation, Diarization and Speech Transcription: Surprise Data Unraveled

    In this thesis, research on large vocabulary continuous speech recognition for unknown audio conditions is presented. For automatic speech recognition systems based on statistical methods, it is important that the conditions of the audio used for training the statistical models match the conditions of the audio to be processed. Any mismatch will decrease the accuracy of the recognition. If it is unpredictable what kind of data can be expected, in other words if the conditions of the audio to be processed are unknown, it is impossible to tune the models. If the material consists of `surprise data', the output of the system is likely to be poor. In this thesis, methods are presented that require no external training data for training the models. These novel methods have been implemented in a large vocabulary continuous speech recognition system called SHoUT, which consists of three subsystems: speech/non-speech classification, speaker diarization and automatic speech recognition.
    The speech/non-speech classification subsystem separates speech from silence and from unknown audible non-speech events. The type of non-speech present in audio recordings can vary from paper shuffling in recordings of meetings to sound effects in television shows. Because it is unknown what type of non-speech needs to be detected, it is not possible to train high-quality statistical models for each type of non-speech sound. The speech/non-speech classification subsystem, also called the speech activity detection subsystem, therefore does not attempt to classify all audible non-speech in a single run. Instead, a bootstrap speech/silence classification is first obtained using a standard speech activity component. Next, the models for speech, silence and audible non-speech are trained on the target audio using the bootstrap classification. This approach makes it possible to classify speech and non-speech with high accuracy, without the need to know what kinds of sound are present in the audio recording.
    Once all non-speech is filtered out of the audio, it is the task of the speaker diarization subsystem to determine how many speakers occur in the recording and exactly when they are speaking. The speaker diarization subsystem applies agglomerative clustering to create clusters of speech fragments for each speaker in the recording. First, statistical speaker models are created on random chunks of the recording; by iteratively realigning the data, retraining the models and merging models that represent the same speaker, accurate speaker models are obtained for speaker clustering (a simplified sketch of this clustering loop is given after this abstract). This method does not require any statistical models developed on a training set, which makes the diarization subsystem insensitive to variation in audio conditions. Unfortunately, because the algorithm has O(n^3) complexity, this clustering method is slow for long recordings. Two variants of the subsystem are presented that reduce the required computational effort, so that the subsystem is applicable to long audio recordings as well.
    The automatic speech recognition subsystem developed for this research is based on Viterbi decoding on a fixed pronunciation prefix tree. Using the fixed tree, a flexible modular decoder could be developed, but it was not straightforward to apply full language model look-ahead efficiently. In this thesis, a novel method is discussed that makes it possible to apply language model look-ahead effectively on the fixed tree. Also, to obtain higher speech recognition accuracy on audio with unknown acoustic conditions, a selection from the numerous known methods for robust automatic speech recognition is applied and evaluated.
    The three individual subsystems as well as the entire system have been successfully evaluated on three international benchmarks. The diarization subsystem was evaluated at the NIST RT06s benchmark and the speech activity detection subsystem was tested at RT07s. The entire system was evaluated at N-Best, the first automatic speech recognition benchmark for Dutch.
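    The agglomerative clustering described in the abstract can be illustrated with a short, hedged sketch: the loop below trains a GMM per cluster and merges the pair whose combined model explains the data best, an ICSI-style BIC comparison with matched model complexity. It is a minimal illustration under stated assumptions (MFCC feature matrices per initial chunk, scikit-learn GMMs, hypothetical function names), not the SHoUT implementation; the iterative Viterbi realignment step is omitted for brevity.

    # Illustrative sketch of agglomerative speaker clustering, not the SHoUT code.
    # Assumes `chunks` is a list of (n_frames, n_mfcc) MFCC matrices, one per
    # initial random chunk of the recording.
    import numpy as np
    from sklearn.mixture import GaussianMixture

    def merge_score(x, y, n_components=4):
        """BIC-style score: > 0 suggests x and y come from the same speaker.

        The merged model gets 2 * n_components Gaussians, so its complexity
        matches the two separate models and no explicit BIC penalty is needed.
        """
        xy = np.vstack([x, y])
        merged = GaussianMixture(2 * n_components, covariance_type="diag").fit(xy)
        gx = GaussianMixture(n_components, covariance_type="diag").fit(x)
        gy = GaussianMixture(n_components, covariance_type="diag").fit(y)
        return merged.score(xy) * len(xy) - (gx.score(x) * len(x) + gy.score(y) * len(y))

    def agglomerative_diarization(chunks):
        clusters = list(chunks)
        while len(clusters) > 1:
            # Score every pair: O(n^2) pairs per merge and O(n) merges gives the
            # O(n^3) behaviour that motivates the thesis's faster variants.
            i, j, score = max(
                ((i, j, merge_score(clusters[i], clusters[j]))
                 for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
                key=lambda t: t[2])
            if score <= 0:       # no pair looks like a single speaker: stop merging
                break
            merged = np.vstack([clusters[i], clusters[j]])
            clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
        return clusters          # one feature matrix per hypothesised speaker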

    Bayesian Speaker Adaptation Based on a New Hierarchical Probabilistic Model

    In this paper, a new hierarchical Bayesian speaker adaptation method called HMAP is proposed that combines the advantages of three conventional algorithms, maximum a posteriori (MAP), maximum-likelihood linear regression (MLLR), and eigenvoice, resulting in excellent performance across a wide range of adaptation conditions. The new method efficiently utilizes intra-speaker and inter-speaker correlation information by modeling phone and speaker subspaces in a consistent hierarchical Bayesian way. The phone variations for a specific speaker are assumed to lie in a low-dimensional subspace. The phone coordinates, which are shared among different speakers, implicitly contain the intra-speaker correlation information. For a specific speaker, the phone variations, represented by speaker-dependent eigenphones, are concatenated into a supervector. The eigenphone supervector space is also a low-dimensional speaker subspace, which contains the inter-speaker correlation information. Using principal component analysis (PCA), a new hierarchical probabilistic model for the generation of the speech observations is obtained. Speaker adaptation based on the new hierarchical model is derived using the maximum a posteriori criterion in a top-down manner. Both batch adaptation and online adaptation schemes are proposed. With tuned parameters, the new method can handle varying amounts of adaptation data automatically and efficiently. Experimental results on a Mandarin Chinese continuous speech recognition task show good performance under all testing conditions.
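    The subspace idea underlying eigenvoice-style adaptation can be sketched briefly. The snippet below is an illustrative, simplified analogue under stated assumptions, not the paper's HMAP derivation: PCA over training speakers' mean supervectors gives a low-dimensional speaker subspace, and a new speaker's coordinates are estimated with a ridge-regularised (MAP-flavoured) projection. Function names such as build_speaker_subspace are hypothetical.

    # Illustrative subspace speaker adaptation in the eigenvoice spirit.
    import numpy as np

    def build_speaker_subspace(supervectors, k):
        """supervectors: (n_speakers, dim) stacked mean supervectors; k: subspace size."""
        mean = supervectors.mean(axis=0)
        centered = supervectors - mean
        # PCA via SVD: the first k right singular vectors span the speaker subspace.
        _, _, vt = np.linalg.svd(centered, full_matrices=False)
        return mean, vt[:k]                      # shapes (dim,), (k, dim)

    def adapt_speaker(mean, basis, adapt_supervector, prior_precision=1.0):
        """MAP-flavoured estimate of a new speaker's subspace coordinates.

        adapt_supervector: supervector estimated from the (small) adaptation set.
        prior_precision: Gaussian prior on the coordinates; larger values shrink
        the estimate towards the speaker-independent mean when data is scarce.
        """
        residual = adapt_supervector - mean
        # Ridge-regularised least squares = MAP with an isotropic Gaussian prior.
        A = basis @ basis.T + prior_precision * np.eye(basis.shape[0])
        coords = np.linalg.solve(A, basis @ residual)
        return mean + coords @ basis             # adapted mean supervector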

    Development of a Finnish-Language Spoken Dialog System for an Educational Robot (Suomenkielisen puhepohjaisen dialogijärjestelmän kehitys koulutusrobottiin)

    Spoken dialog systems are becoming part of everyday life, for example in personal assistants such as Apple's Siri, and they could be used in a vast range of other products. In this thesis, a spoken dialog system prototype was developed for use in an educational robot. The main challenge in an educational robot is recognizing children's speech. Children's speech varies significantly between speakers, which makes it difficult to recognize with a single acoustic model. The main focus of the thesis is on speech recognition and acoustic model adaptation. The acoustic model is trained on data gathered from adults and then adapted with data from children. The adaptation is done for each speaker separately and also as an average child adaptation. The results are compared to a commercial speech recognizer developed by Google Inc. The experiments show that adapting the adult model with data from each speaker separately decreases the word error rate from 8.1 % to 2.4 %, and the average child adaptation decreases it to 3.1 %. The adaptation combined vocal tract length normalization (VTLN) and constrained maximum likelihood linear regression (CMLLR). For comparison, the word error rate of the commercial recognizer is 7.4 %.
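    Since the comparison above rests on word error rate, a minimal reference implementation of the metric is sketched below. It is the standard word-level Levenshtein alignment, independent of the recognisers and adaptation techniques discussed; the example sentences are purely illustrative.

    # Minimal word error rate (WER) computation: edit distance over words,
    # normalised by the reference length.
    def word_error_rate(reference, hypothesis):
        ref, hyp = reference.split(), hypothesis.split()
        # dp[i][j] = minimum edits turning the first i reference words
        # into the first j hypothesis words
        dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
        for i in range(len(ref) + 1):
            dp[i][0] = i
        for j in range(len(hyp) + 1):
            dp[0][j] = j
        for i in range(1, len(ref) + 1):
            for j in range(1, len(hyp) + 1):
                sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
                dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)  # sub, del, ins
        return dp[len(ref)][len(hyp)] / max(len(ref), 1)

    # Example: one substitution in five reference words -> 20 % WER.
    print(word_error_rate("the robot reads a story", "the robot reads the story"))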