764 research outputs found
Identification of Non-Linguistic Speech Features
Over the last decade, technological advances have made it possible to envision real-world applications of speech technologies. One can foresee applications where a spoken query is recognized without prior knowledge of the language being spoken, for example at information centers in public places such as train stations and airports. Other applications may require accurate identification of the speaker for security reasons, including control of access to confidential information or telephone-based transactions. Ideally, the speaker's identity can be verified continually during the transaction, in a manner completely transparent to the user. With these views in mind, this paper presents a unified approach to identifying non-linguistic speech features from the recorded signal using phone-based acoustic likelihoods. The technique is shown to be effective for text-independent language, sex, and speaker identification, and can enable better and more natural human-machine interaction. With 2 s of speech, the language can be identified with better than 99% accuracy. Error in sex identification is about 1% on a per-sentence basis. Speaker identification accuracies of 98.5% on TIMIT (168 speakers) and 99.2% on BREF (65 speakers) were obtained with one utterance per speaker, and 100% with two utterances, for both corpora. An experiment using unsupervised adaptation for speaker identification on the 168 TIMIT speakers achieved the same identification accuracies as supervised adaptation.
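The decision rule behind phone-based acoustic likelihood identification can be sketched very simply: score the utterance under each candidate's phone models and pick the highest-scoring candidate. The scores and language names below are hypothetical placeholders, not values from the paper.

```python
def identify_language(log_likelihoods):
    """Pick the language whose phone models assign the highest
    acoustic log-likelihood to the utterance."""
    return max(log_likelihoods, key=log_likelihoods.get)

# Hypothetical per-language log-likelihoods for one 2 s utterance.
scores = {"French": -1042.7, "English": -1100.3}
print(identify_language(scores))  # French
```

The same argmax rule applies unchanged to sex and speaker identification; only the model inventory being compared differs.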
Multilingual Training and Cross-lingual Adaptation on CTC-based Acoustic Model
Multilingual models for Automatic Speech Recognition (ASR) are attractive as they have been shown to benefit from more training data, and better lend themselves to adaptation to under-resourced languages. However, initialisation from monolingual context-dependent models leads to an explosion of context-dependent states. Connectionist Temporal Classification (CTC) is a potential solution to this, as it performs well with monophone labels. We investigate multilingual CTC in the context of adaptation and regularisation techniques that have been shown to be beneficial in more conventional contexts. The multilingual model is trained to model a universal International Phonetic Alphabet (IPA)-based phone set using the CTC loss function. Learning Hidden Unit Contribution (LHUC) is investigated to perform language adaptive training. In addition, dropout during cross-lingual adaptation is also studied and tested in order to mitigate the overfitting problem.

Experiments show that the performance of the universal phoneme-based CTC system can be improved by applying LHUC, and that it is extensible to new phonemes during cross-lingual adaptation. Updating all the parameters shows consistent improvement on limited data. Applying dropout during adaptation can further improve the system and achieve competitive performance with Deep Neural Network / Hidden Markov Model (DNN/HMM) systems on limited data.
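The LHUC adaptation mentioned above amounts to learning one amplitude per hidden unit per language and re-scaling the unit's activation by it; a common parameterisation (assumed here, as the abstract does not give one) is r = 2·sigmoid(a), so amplitudes lie in (0, 2) and a = 0 leaves the network unchanged.

```python
import numpy as np

def lhuc_scale(hidden, lang_params):
    """Learning Hidden Unit Contributions: re-scale each hidden unit by a
    per-language amplitude r = 2*sigmoid(a). Only the vector `a` is
    learned during language-adaptive training; the shared weights are frozen."""
    r = 2.0 / (1.0 + np.exp(-lang_params))  # amplitudes in (0, 2)
    return hidden * r

rng = np.random.default_rng(0)
h = rng.standard_normal(8)   # activations of one hidden layer
a = np.zeros(8)              # untrained LHUC parameters -> r = 1
assert np.allclose(lhuc_scale(h, a), h)  # identity before adaptation
```

Because only a small vector per layer is adapted, LHUC is far less prone to overfitting on limited adaptation data than updating all parameters.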
Distant Speech Recognition for Home Automation: Preliminary Experimental Results in a Smart Home
This paper presents a study that is part of the Sweet-Home project, which aims at developing a new home automation system based on voice command. The study focused on two tasks: distant speech recognition and sentence spotting (e.g., recognition of domotic orders). Regarding the first task, different combinations of ASR systems, language models and acoustic models were tested. Fusion of ASR outputs by consensus and with a triggered language model (using a priori knowledge) was investigated. For the sentence spotting task, an algorithm based on distance evaluation between the current ASR hypotheses and the predefined set of keyword patterns was introduced in order to retrieve the correct sentences in spite of ASR errors. The techniques were assessed on real daily-living data collected in a 4-room smart home that was fully equipped with standard tactile commands and with 7 wireless microphones set in the ceiling. Thanks to Driven Decoding Algorithm techniques, a classical ASR system reached 7.9% WER, against 35% WER in the standard configuration and 15% with MLLR adaptation only. The best keyword pattern classification result obtained in distant speech conditions was 7.5% CER.
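The sentence-spotting idea above, matching a noisy ASR hypothesis against a predefined set of command patterns by distance, can be sketched with a word-level edit distance. The command strings below are hypothetical; the actual Sweet-Home grammar is not reproduced here.

```python
def edit_distance(a, b):
    """Word-level Levenshtein distance between two token sequences."""
    d = list(range(len(b) + 1))
    for i, wa in enumerate(a, 1):
        prev, d[0] = d[0], i
        for j, wb in enumerate(b, 1):
            prev, d[j] = d[j], min(d[j] + 1, d[j - 1] + 1, prev + (wa != wb))
    return d[-1]

def spot_sentence(hyp, patterns, max_dist=1):
    """Return the closest command pattern, tolerating up to `max_dist`
    word errors in the ASR hypothesis; None if nothing is close enough."""
    best = min(patterns, key=lambda p: edit_distance(hyp.split(), p.split()))
    return best if edit_distance(hyp.split(), best.split()) <= max_dist else None

# Hypothetical domotic commands:
cmds = ["turn on the light", "close the blind"]
print(spot_sentence("turn on the lights", cmds))  # turn on the light
```

The threshold trades recall against false triggers: a larger `max_dist` recovers more sentences despite ASR errors but risks firing on unrelated speech.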
Exploiting foreign resources for DNN-based ASR
Manual transcription of audio databases for the development of automatic speech recognition (ASR) systems is a costly and time-consuming process. In the context of deriving acoustic models adapted to a specific application, or in low-resource scenarios, it is therefore essential to explore alternatives capable of improving speech recognition results. In this paper, we investigate the relevance of foreign data characteristics, in particular domain and language, when using this data as an auxiliary data source for training ASR acoustic models based on deep neural networks (DNNs). The acoustic models are evaluated on a challenging bilingual database within the scope of the MediaParl project. Experimental results suggest that in-language (but out-of-domain) data is more beneficial than in-domain (but out-of-language) data when employed in either supervised or semi-supervised training of DNNs. The best performing ASR system, an HMM/GMM acoustic model that exploits a DNN as a discriminatively trained feature extractor, outperforms the best performing HMM/DNN hybrid by about 5% relative (in terms of WER). The accumulated relative gain with respect to the MFCC-HMM/GMM baseline is about 30% in WER.
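The relative gains quoted in such results are straightforward but easy to misread; the conversion from absolute WERs to a relative reduction is just a ratio. The WER values below are illustrative only, not figures from the paper.

```python
def relative_gain(baseline_wer, system_wer):
    """Relative WER reduction of a system over a baseline, in percent."""
    return 100.0 * (baseline_wer - system_wer) / baseline_wer

# Illustrative: a drop from 20.0% to 14.0% WER is a 30% relative gain.
assert round(relative_gain(20.0, 14.0)) == 30
```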
Language identification through parallel phone recognition, by Christine S. Chou.
Thesis (M.Eng.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 1994. Includes bibliographical references (leaves 32-33).
Improving Posterior Based Confidence Measures in Hybrid HMM/ANN Speech Recognition Systems
In this paper we define and investigate a set of confidence measures based on hybrid Hidden Markov Model/Artificial Neural Network (HMM/ANN) acoustic models. All these measures use the neural network to estimate the local phone posterior probabilities, which are then combined and normalized in different ways. Experimental results show that the use of an appropriate duration normalization is very important to obtain good estimates of the phone and word confidences. The different measures are evaluated at the phone and word levels on both an isolated word task (PHONEBOOK) and a continuous speech recognition task (BREF). It is shown that one of those confidence measures is well suited for utterance verification, and that (as one could expect) confidence measures at the word level perform better than those at the phone level. Finally, using the resulting approach on PHONEBOOK to rescore the N-best list is shown to yield a 34% decrease in word error rate.
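The role of duration normalization can be illustrated with one simple variant of such a measure (a sketch, not the paper's exact definition): averaging, rather than summing, the per-frame log posteriors over a phone segment makes long and short phones directly comparable.

```python
import math

def phone_confidence(frame_posteriors):
    """Duration-normalised log-posterior confidence for one phone segment:
    the average per-frame log posterior of the hypothesised phone."""
    logs = [math.log(p) for p in frame_posteriors]
    return sum(logs) / len(logs)

# Two hypothetical segments with the same per-frame posteriors but
# different durations get the same confidence after normalization:
short = [0.9, 0.8]
long_ = [0.9, 0.8, 0.9, 0.8]
assert abs(phone_confidence(short) - phone_confidence(long_)) < 1e-12
```

Without the division by segment length, the summed log posteriors would systematically penalise longer phones, which is exactly the bias duration normalization removes.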
Speech segmentation and clustering methods for a new speech recognition architecture
Traditional automatic speech recognition methods fall short of human speech perception performance. To close this gap, entirely new types of architectures for speech recognition must be developed; a system that learns speech and language on its own, as humans do, is one such alternative.

This thesis introduces a bottom-up starting point for such a speech processing system, consisting of a novel blind speech segmentation algorithm, a segmental feature extraction methodology, and data classification by incremental clustering. All methods were evaluated in extensive experiments with a broad range of test material, and the evaluation methodology itself was also scrutinized. The segmentation algorithm achieved above-standard results compared to what is found in the current literature on blind segmentation. Possible directions for follow-up research, including memory structures and intelligent top-down feedback in speech processing, are also outlined.
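A minimal sketch of the blind-segmentation idea (an assumed simplification, not the thesis algorithm): hypothesise a segment boundary wherever the distance between consecutive feature frames spikes, with no phonetic labels or trained models involved.

```python
import numpy as np

def blind_segment(features, threshold=1.0):
    """Hypothesise segment boundaries at frames where the Euclidean
    distance between consecutive feature vectors exceeds a threshold."""
    dist = np.linalg.norm(np.diff(features, axis=0), axis=1)
    return [i + 1 for i, d in enumerate(dist) if d > threshold]

# Two steady 'phones' with an abrupt spectral change between frames 2 and 3:
feats = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
                  [2.0, 2.0], [2.1, 2.0], [2.0, 2.1]])
print(blind_segment(feats))  # [3]
```

The resulting segments could then be featurised and grouped by incremental clustering, which is the pipeline the thesis describes.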
- …