Identification of Non-Linguistic Speech Features
Over the last decade, technological advances have been made which enable us to envision real-world applications of speech technologies. It is possible to foresee applications where the spoken query is to be recognized without any prior knowledge of the language being spoken, for example, at information centers in public places such as train stations and airports. Other applications may require accurate identification of the speaker for security reasons, including control of access to confidential information or for telephone-based transactions. Ideally, the speaker's identity can be verified continually during the transaction, in a manner completely transparent to the user. With these views in mind, this paper presents a unified approach to identifying non-linguistic speech features from the recorded signal using phone-based acoustic likelihoods. This technique is shown to be effective for text-independent language, sex, and speaker identification and can enable better and more natural human-machine interaction. With 2 s of speech, the language can be identified with better than 99% accuracy. Error in sex identification is about 1% on a per-sentence basis, and speaker identification accuracies of 98.5% on TIMIT (168 speakers) and 99.2% on BREF (65 speakers) were obtained with one utterance per speaker, and 100% with 2 utterances for both corpora. An experiment using unsupervised adaptation for speaker identification on the 168 TIMIT speakers yielded the same identification accuracies as those obtained with supervised adaptation.
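The phone-based acoustic likelihood approach can be sketched simply: each class (language, sex, or speaker) has its own set of acoustic models, the test utterance is scored against each set, and the class with the highest likelihood wins. A minimal illustration, with a single diagonal Gaussian standing in for each class's phone model set (all names and values here are hypothetical):

```python
import numpy as np

def log_gauss(frames, mean, var):
    # Frame-wise diagonal-Gaussian log-likelihood, a stand-in for
    # scoring with a full phone-based acoustic model set.
    return -0.5 * np.sum(
        np.log(2 * np.pi * var) + (frames - mean) ** 2 / var, axis=1
    )

def identify(frames, models):
    """Return the class whose acoustic model scores the utterance highest.

    `models` maps a class label (language, sex, or speaker) to a
    (mean, var) pair; a real system would use phone-based HMMs.
    """
    scores = {label: log_gauss(frames, m, v).sum()
              for label, (m, v) in models.items()}
    return max(scores, key=scores.get)

# Toy demo with synthetic "MFCC" frames.
rng = np.random.default_rng(0)
models = {
    "french": (np.zeros(13), np.ones(13)),
    "english": (np.full(13, 2.0), np.ones(13)),
}
utt = rng.normal(2.0, 1.0, size=(200, 13))  # frames near the "english" model
print(identify(utt, models))                # -> english
```

The same argmax-over-model-likelihoods structure serves all three identification tasks; only the model inventory changes.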
Multilingual Training and Cross-lingual Adaptation on CTC-based Acoustic Model
Multilingual models for Automatic Speech Recognition (ASR) are attractive as
they have been shown to benefit from more training data, and better lend
themselves to adaptation to under-resourced languages. However, initialisation
from monolingual context-dependent models leads to an explosion of
context-dependent states. Connectionist Temporal Classification (CTC) is a
potential solution to this as it performs well with monophone labels.
We investigate multilingual CTC in the context of adaptation and
regularisation techniques that have been shown to be beneficial in more
conventional contexts. The multilingual model is trained to model a universal
International Phonetic Alphabet (IPA)-based phone set using the CTC loss
function. Learning Hidden Unit Contribution (LHUC) is investigated to perform
language adaptive training. In addition, dropout during cross-lingual
adaptation is also studied and tested in order to mitigate the overfitting
problem.
Experiments show that the performance of the universal phoneme-based CTC
system can be improved by applying LHUC, and that the system is extensible to
new phonemes during cross-lingual adaptation. Updating all the parameters
yields consistent improvements on limited data. Applying dropout during
adaptation further improves the system, achieving performance competitive with
Deep Neural Network / Hidden Markov Model (DNN/HMM) systems on limited data.
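The LHUC idea mentioned above can be illustrated compactly: each hidden unit's activation is rescaled by a language-specific amplitude in (0, 2), and only these amplitudes are updated during language-adaptive training while the shared network weights stay fixed. A minimal NumPy sketch (not the authors' implementation):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lhuc_forward(hidden, lhuc_params):
    # LHUC rescales each hidden unit by a language-specific amplitude
    # r = 2 * sigmoid(a) in (0, 2); only `lhuc_params` is trained
    # during language-adaptive training.
    return 2.0 * sigmoid(lhuc_params) * hidden

hidden = np.ones(4)
neutral = lhuc_forward(hidden, np.zeros(4))  # params at 0 -> scale factor 1
print(neutral)                               # [1. 1. 1. 1.]
```

With the parameters at zero the layer is an identity, so adaptation starts from the unadapted multilingual model and departs from it only as far as the data warrants.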
Comparison of Spectral Properties of Read, Prepared and Casual Speech in French
In this paper, we investigate the acoustic properties of phonemes in three speaking styles: read speech, prepared speech and spontaneous speech. Our aim is to better understand why speech recognition systems still fail to achieve good performance on spontaneous speech. This work follows the work of Nakamura et al. \cite{nakamura2008} on Japanese speaking styles, with the difference that we here focus on French. Using Nakamura's method, we use classical speech recognition features, MFCC, and try to represent the effects of the speaking styles on the spectral space. Two measurements are defined in order to represent the spectral space reduction and the spectral variance extension. Experiments are then carried out to investigate whether we indeed find differences between the three speaking styles using these measurements. We finally compare our results to those obtained by Nakamura on Japanese to see if the same phenomenon appears.
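One simple way to quantify spectral space reduction, in the spirit of the measurements described above, is to compare how far per-phoneme MFCC centroids spread around the global centroid in each speaking style. A toy sketch (the 2-D "centroids" below are hypothetical; real MFCC vectors would be 12- or 13-dimensional, and the paper's exact measures may differ):

```python
import numpy as np

def spectral_space_size(phoneme_means):
    # Mean Euclidean distance of per-phoneme MFCC centroids from the
    # global centroid: a simple proxy for the size of the spectral space.
    centroids = np.asarray(phoneme_means, dtype=float)
    global_c = centroids.mean(axis=0)
    return np.linalg.norm(centroids - global_c, axis=1).mean()

read = [[0, 0], [4, 0], [0, 4], [4, 4]]    # well-separated phoneme centroids
spont = [[1, 1], [3, 1], [1, 3], [3, 3]]   # centralised (reduced) space
ratio = spectral_space_size(spont) / spectral_space_size(read)
print(ratio)  # < 1 indicates spectral space reduction in spontaneous speech
```

A ratio below 1 for spontaneous relative to read speech would indicate the centralisation of vowel and consonant targets that makes spontaneous speech harder to recognise.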
Distant Speech Recognition for Home Automation: Preliminary Experimental Results in a Smart Home
This paper presents a study that is part of the Sweet-Home project, which aims at developing a new home automation system based on voice command. The study focused on two tasks: distant speech recognition and sentence spotting (e.g., recognition of domotic orders). Regarding the first task, different combinations of ASR systems, language and acoustic models were tested. Fusion of ASR outputs by consensus and with a triggered language model (using a priori knowledge) was investigated. For the sentence spotting task, an algorithm based on distance evaluation between the current ASR hypotheses and the predefined set of keyword patterns was introduced in order to retrieve the correct sentences in spite of the ASR errors. The techniques were assessed on real daily-living data collected in a 4-room smart home that was fully equipped with standard tactile commands and with 7 wireless microphones set in the ceiling. Thanks to Driven Decoding Algorithm techniques, a classical ASR system reached 7.9% WER, against 35% WER in the standard configuration and 15% with MLLR adaptation only. The best keyword pattern classification result obtained in distant speech conditions was 7.5% CER.
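The sentence-spotting step can be sketched as a nearest-pattern search under word-level edit distance: a domotic order is accepted only if some predefined pattern lies within a small number of word edits of the ASR hypothesis. A minimal sketch (the exact distance measure and threshold used in the study may differ; the French patterns below are illustrative):

```python
def edit_distance(a, b):
    # Word-level Levenshtein distance with a rolling 1-D table.
    dp = list(range(len(b) + 1))
    for i, wa in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, wb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1,        # deletion
                                     dp[j - 1] + 1,    # insertion
                                     prev + (wa != wb))  # substitution
    return dp[len(b)]

def spot(hypothesis, patterns, max_dist=1):
    # Return the closest keyword pattern, tolerating ASR errors up to
    # `max_dist` word edits; None if nothing is close enough.
    hyp = hypothesis.split()
    best = min(patterns, key=lambda p: edit_distance(hyp, p.split()))
    return best if edit_distance(hyp, best.split()) <= max_dist else None

patterns = ["allume la lumiere", "eteins la lumiere", "ferme la porte"]
print(spot("allume le lumiere", patterns))  # -> allume la lumiere
```

Matching against whole command patterns rather than isolated keywords is what lets the spotter recover the intended order even when individual words are misrecognised.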
Exploiting foreign resources for DNN-based ASR
Manual transcription of audio databases for the development of automatic speech recognition (ASR) systems is a costly and time-consuming process. In the context of deriving acoustic models adapted to a specific application, or in low-resource scenarios, it is therefore essential to explore alternatives capable of improving speech recognition results. In this paper, we investigate the relevance of foreign data characteristics, in particular domain and language, when using this data as an auxiliary data source for training ASR acoustic models based on deep neural networks (DNNs). The acoustic models are evaluated on a challenging bilingual database within the scope of the MediaParl project. Experimental results suggest that in-language (but out-of-domain) data is more beneficial than in-domain (but out-of-language) data when employed in either supervised or semi-supervised training of DNNs. The best performing ASR system, an HMM/GMM acoustic model that exploits a DNN as a discriminatively trained feature extractor, outperforms the best performing HMM/DNN hybrid by about 5% relative (in terms of WER). The accumulated relative gain with respect to the MFCC-HMM/GMM baseline is about 30% in WER.
Olfactory vocabulary and collocation in French
This article is concerned with words pertaining to olfaction in first and second language French. Focusing on adjective collocates for odeur and parfum in word association tasks and in short written productions, the results show certain preferences for each of these words. A number of similarities and differences between natives and non-natives are noted. The question of typical nativelike language use is raised.
End-to-End Speech Recognition: A review for the French Language
Recently, end-to-end ASR based either on sequence-to-sequence networks or on
the CTC objective function gained a lot of interest from the community,
achieving competitive results over traditional systems using robust but complex
pipelines. One of the main features of end-to-end systems, in addition to the
ability to free themselves from extra linguistic resources such as dictionaries
or language models, is the capacity to model acoustic units such as characters,
subwords or directly words; opening up the capacity to directly translate
speech with different representations or levels of knowledge depending on the
target language. In this paper we propose a review of the existing end-to-end
ASR approaches for the French language. We compare results to conventional
state-of-the-art ASR systems and discuss which units are more suited to model
the French language.Comment: 10 pages, 2 column-styl
Adaptation Experiments on French MediaParl ASR
This document summarizes adaptation experiments done on the French MediaParl corpus and other French corpora. Baseline adaptation techniques are briefly presented and evaluated on the MediaParl task for speaker adaptation, speaker adaptive training, database combination and environmental adaptation. Results show that by applying baseline adaptation techniques, a relative WER reduction of up to 22.8% can be reached in French transcription accuracy. For the MediaParl task, the performance of systems trained on directly merged databases and of systems trained on databases combined via MAP adaptation did not differ significantly when a large amount of data was available. During the experiments, French data recorded in Switzerland behaved in a similar way to French data recorded in France, which suggests that the French spoken in Valais is close to the standard French spoken in France, and that differences in ASR accuracies between models trained on Swiss MediaParl and on French BREF are more likely caused by environmental factors or by more spontaneity in speech.
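The MAP-based database combination mentioned above can be illustrated for a single Gaussian mean: the prior model (trained on one database) is interpolated with sufficient statistics from the other database, with a relevance factor controlling how quickly the prior is overridden. A toy sketch, not the exact recipe used in the experiments:

```python
import numpy as np

def map_adapt_mean(prior_mean, frames, tau=10.0):
    # MAP update of a Gaussian mean: with few adaptation frames the
    # estimate stays near the prior; with many frames it approaches
    # the sample mean. `tau` is the relevance factor.
    n = len(frames)
    return (tau * prior_mean + frames.sum(axis=0)) / (tau + n)

prior = np.zeros(3)                  # mean from the first database's model
data = np.ones((90, 3))              # adaptation frames from the other corpus
print(map_adapt_mean(prior, data))   # -> 0.9 in each dimension (tau=10, n=90)
```

This interpolation is why MAP combination and direct merging converge in performance once the adaptation database is large: the data term dominates the prior.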
A characterization of the problem of new, out-of-vocabulary words in continuous-speech recognition and understanding
Thesis (Ph.D.), Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 1995. Includes bibliographical references (p. 167-173). By Irvine Lee Hetherington.