Unsupervised Subword Modeling Using Autoregressive Pretraining and Cross-Lingual Phone-Aware Modeling
This study addresses unsupervised subword modeling, i.e., learning feature
representations that can distinguish subword units of a language. The proposed
approach adopts a two-stage bottleneck feature (BNF) learning framework,
consisting of autoregressive predictive coding (APC) as a front-end and a
DNN-BNF model as a back-end. APC pretrained features are set as input features
to a DNN-BNF model. A language-mismatched ASR system is used to provide
cross-lingual phone labels for DNN-BNF model training. Finally, BNFs are
extracted as the subword-discriminative feature representation. A second aim of
this work is to investigate the robustness of our approach's effectiveness to
different amounts of training data. The results on Libri-light and the
ZeroSpeech 2017 databases show that APC is effective in front-end feature
pretraining. Our whole system outperforms the state of the art on both
databases. Cross-lingual phone labels for English data by a Dutch ASR
outperform those by a Mandarin ASR, possibly linked to the larger similarity of
Dutch compared to Mandarin with English. Our system is less sensitive to
training data amount when the training data is over 50 hours. APC pretraining
leads to a reduction of needed training material from over 5,000 hours to
around 200 hours with little performance degradation.
Comment: 5 pages, 3 figures. Accepted for publication in INTERSPEECH 2020,
Shanghai, Chin
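The abstract above names autoregressive predictive coding (APC) as the front-end but does not spell out its objective. As a rough sketch only: APC is commonly described as training a model to predict a frame a few steps ahead and penalising the L1 distance to the true future frame. The function name `apc_l1_loss`, the `shift` parameter, and the toy data below are illustrative assumptions, not the authors' implementation.

```python
import numpy as np


def apc_l1_loss(features, predictions, shift=1):
    """Mean L1 distance between predictions[t] and features[t + shift].

    Assumption: both arrays have shape (num_frames, feature_dim) and
    predictions[t] is the model's guess for the frame `shift` steps ahead.
    """
    features = np.asarray(features, dtype=float)
    predictions = np.asarray(predictions, dtype=float)
    if features.shape != predictions.shape:
        raise ValueError("features and predictions must have the same shape")
    if not 0 < shift < len(features):
        raise ValueError("shift must be between 1 and num_frames - 1")
    targets = features[shift:]       # future frames the model should predict
    outputs = predictions[:-shift]   # predictions that have a matching target
    return float(np.mean(np.abs(outputs - targets)))


# Toy data (hypothetical): 5 frames of 2-dimensional features.
frames = np.arange(10, dtype=float).reshape(5, 2)
perfect = np.roll(frames, -1, axis=0)  # row t holds frame t + 1
loss_perfect = apc_l1_loss(frames, perfect, shift=1)
loss_copy = apc_l1_loss(frames, frames, shift=1)  # "predict" the current frame
```

A predictor that only copies the current frame is penalised by the frame-to-frame change, which is why the objective pushes the model to encode how the signal evolves.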
The Effects of Background Noise on Native and Non-native Spoken-word Recognition: A Computational Modelling Approach
How does the presence of background noise affect the cognitive processes underlying spoken-word recognition? And how do these effects differ in native and non-native language listeners? We addressed these questions using artificial neural-network modelling. We trained a deep auto-encoder architecture on binary phonological and semantic representations of 121 English and Dutch translation equivalents. We also varied exposure to the two languages to generate "native English" and "non-native English" trained networks. These networks captured key effects in the performance (accuracy rates and the number of erroneous responses per word stimulus) of English and Dutch listeners in an offline English spoken-word identification experiment (Scharenborg et al., 2017), which considered clean and noisy listening conditions and three intensities of speech-shaped noise, applied word-initially or word-finally. Our simulations suggested that the effects of noise on native and non-native listening are comparable and can be accounted for within the same cognitive architecture for spoken-word recognition.
The Presence of Background Noise Extends the Competitor Space in Native and Non-Native Spoken-Word Recognition: Insights from Computational Modeling
Oral communication often takes place in noisy environments, which challenge spoken-word recognition. Previous research has suggested that the presence of background noise extends the number of candidate words competing with the target word for recognition and that this extension affects the time course and accuracy of spoken-word recognition. In this study, we further investigated the temporal dynamics of competition processes in the presence of background noise, and how these vary in listeners with different language proficiency (i.e., native and non-native) using computational modeling. We developed ListenIN (Listen-In-Noise), a neural-network model based on an autoencoder architecture, which learns to map phonological forms onto meanings in two languages and simulates native and non-native spoken-word comprehension. We also examined the model's activation states during online spoken-word recognition. These analyses demonstrated that the presence of background noise increases the number of competitor words, which are engaged in phonological competition and that this happens in similar ways intra and interlinguistically and in native and non-native listening. Taken together, our results support accounts positing a "many-additional-competitors scenario" for the effects of noise on spoken-word recognition.
Multimedia Computin
Improving Whispered Speech Recognition Performance using Pseudo-whispered based Data Augmentation
Whispering is a distinct form of speech known for its soft, breathy, and
hushed characteristics, often used for private communication. The acoustic
characteristics of whispered speech differ substantially from normally phonated
speech and the scarcity of adequate training data leads to low automatic speech
recognition (ASR) performance. To address the data scarcity issue, we use a
signal processing-based technique that transforms the spectral characteristics
of normal speech to those of pseudo-whispered speech. We augment an End-to-End
ASR with pseudo-whispered speech and achieve an 18.2% relative reduction in
word error rate for whispered speech compared to the baseline. Results for the
individual speaker groups in the wTIMIT database show the best results for US
English. Further investigation showed that the lack of glottal information in
whispered speech has the largest impact on whispered speech ASR performance.
Comment: Accepted to ASRU 202
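The abstract above reports its gain as a relative reduction in word error rate (WER). A minimal sketch of how such a figure is usually computed follows; the function name and the WER values are hypothetical and chosen only to make the arithmetic easy to follow. They are not results from the paper.

```python
def relative_wer_reduction(baseline_wer, new_wer):
    """Relative WER reduction in percent: 100 * (baseline - new) / baseline."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return 100.0 * (baseline_wer - new_wer) / baseline_wer


# Hypothetical values, not taken from the paper: a baseline WER of 20.0%
# and an augmented-system WER of 15.0%.
example_reduction = relative_wer_reduction(20.0, 15.0)
```

In this invented case the absolute reduction is 5 points while the relative reduction is 25%, which is why the two should not be confused when comparing systems.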
Cross-linguistic Influences on Sentence Accent Detection in Background Noise.
This paper investigates whether sentence accent detection in a non-native language is dependent on the (relative) similarity of the prosodic cues to accent in the non-native and the native language, and whether cross-linguistic differences in the use of local and more widely distributed (i.e., non-local) cues to sentence accent detection lead to differential effects of the presence of background noise on sentence accent detection in a non-native language. We compared Dutch, Finnish, and French non-native listeners of English, whose cueing and use of prosodic prominence is gradually further removed from English, and compared their results on a phoneme monitoring task in different levels of noise and a quiet condition to those of native listeners. Overall phoneme detection performance was high for the native and the non-native listeners, but deteriorated to the same extent in the presence of background noise. Crucially, relative similarity between the prosodic cues to sentence accent of one's native language compared to that of a non-native language does not determine the ability to perceive and use sentence accent for speech perception in that non-native language. Moreover, proficiency in the non-native language is not a straightforward predictor of sentence accent perception performance, although high proficiency in a non-native language can seemingly overcome certain differences at the prosodic level between the native and non-native language. Instead, performance is determined by the extent to which listeners rely on local cues (English and Dutch) versus cues that are more distributed (Finnish and French), as more distributed cues survive the presence of background noise better.
Bayesian Models for Unit Discovery on a Very Low Resource Language
Developing speech technologies for low-resource languages has become a very
active research field over the last decade. Among others, Bayesian models have
shown some promising results on artificial examples but still lack in situ
experiments. Our work applies state-of-the-art Bayesian models to unsupervised
Acoustic Unit Discovery (AUD) in a real low-resource language scenario. We also
show that Bayesian models can naturally integrate information from other
resourceful languages by means of an informative prior, leading to more consistent
discovered units. Finally, discovered acoustic units are used, either as the
1-best sequence or as a lattice, to perform word segmentation. Word
segmentation results show that this Bayesian approach clearly outperforms a
Segmental-DTW baseline on the same corpus.
Comment: Accepted to ICASSP 201
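The baseline mentioned above builds on dynamic time warping (DTW). The sketch below shows plain DTW between two whole sequences, which is only the underlying alignment idea and not the segmental variant used as the baseline. The function name, the Euclidean local cost, and the toy sequences are assumptions made for illustration.

```python
import numpy as np


def dtw_distance(seq_a, seq_b):
    """Cumulative cost of the best monotonic alignment of two sequences.

    Assumption: each sequence is 1-D (scalars) or 2-D (frames x features),
    and the local cost is the Euclidean distance between frames.
    """
    a = np.asarray(seq_a, dtype=float)
    b = np.asarray(seq_b, dtype=float)
    if a.ndim == 1:
        a = a[:, None]
    if b.ndim == 1:
        b = b[:, None]
    n, m = len(a), len(b)
    # cost[i, j] = best cost of aligning the first i frames of a
    # with the first j frames of b.
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = np.linalg.norm(a[i - 1] - b[j - 1])
            cost[i, j] = local + min(
                cost[i - 1, j],      # frame of a repeated against b
                cost[i, j - 1],      # frame of b repeated against a
                cost[i - 1, j - 1],  # both sequences advance
            )
    return float(cost[n, m])


# Toy sequences (hypothetical): the second is a time-stretched copy of the first.
stretched = dtw_distance([0, 1, 2], [0, 1, 1, 2])
offset = dtw_distance([0, 0], [1, 1])
```

Because the alignment may repeat frames, a time-stretched copy of a sequence costs nothing, whereas a constant offset accumulates cost at every aligned frame.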