Context effects on second-language learning of tonal contrasts.
Studies of lexical tone learning generally focus on monosyllabic contexts, while reports of phonetic learning benefits associated with input variability are based largely on experienced learners. This study trained inexperienced learners on Mandarin tonal contrasts to test two hypotheses regarding the influence of context and variability on tone learning. The first hypothesis was that increased phonetic variability of tones in disyllabic contexts makes initial tone learning more challenging in disyllabic than monosyllabic words. The second hypothesis was that the learnability of a given tone varies across contexts due to differences in tonal variability. Results of a word learning experiment supported both hypotheses: tones were acquired less successfully in disyllables than in monosyllables, and the relative difficulty of disyllables was closely related to contextual tonal variability. These results indicate limited relevance of monosyllable-based data on Mandarin learning for the disyllabic majority of the Mandarin lexicon. Furthermore, in the short term, variability can diminish learning; its effects are not necessarily beneficial but dependent on acquisition stage and other learner characteristics. These findings thus highlight the importance of considering contextual variability and the interaction between variability and type of learner in the design, interpretation, and application of research on phonetic learning.
Attention-Based End-to-End Speech Recognition on Voice Search
Recently, there has been a growing interest in end-to-end speech recognition
that directly transcribes speech to text without any predefined alignments. In
this paper, we explore the use of an attention-based encoder-decoder model for
Mandarin speech recognition on a voice search task. Previous attempts have
shown that applying attention-based encoder-decoder models to Mandarin speech
recognition was quite difficult due to the logographic orthography of Mandarin,
the large vocabulary and the conditional dependency of the attention model. In
this paper, we use character embedding to deal with the large vocabulary.
Several tricks are used for effective model training, including L2
regularization, Gaussian weight noise and frame skipping. We compare two
attention mechanisms and use attention smoothing to cover long context in the
attention model. Taken together, these tricks allow us to finally achieve a
character error rate (CER) of 3.58% and a sentence error rate (SER) of 7.43% on
the MiTV voice search dataset. With the addition of a trigram language model,
CER and SER reach 2.81% and 5.77%, respectively.
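For readers unfamiliar with the two metrics reported above, the following is a minimal sketch of how CER and SER are commonly computed, assuming CER is the total character-level edit distance divided by the total number of reference characters and SER is the fraction of utterances with at least one error. The function names `edit_distance` and `cer_and_ser` and the sample strings are illustrative assumptions, not taken from the paper.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (here: strings of characters)."""
    prev = list(range(len(hyp) + 1))  # distances from the empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to the empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer_and_ser(refs, hyps):
    """Return (character error rate, sentence error rate) over paired utterances."""
    total_chars = sum(len(r) for r in refs)
    char_errors = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    sent_errors = sum(1 for r, h in zip(refs, hyps) if r != h)
    return char_errors / total_chars, sent_errors / len(refs)


# Hypothetical voice-search queries; the second hypothesis has one wrong character.
refs = ["打开电视", "播放音乐"]
hyps = ["打开电视", "播放音月"]
cer, ser = cer_and_ser(refs, hyps)
```

Because a single wrong character makes the whole utterance count as an error, SER is always at least as sensitive as CER, which is consistent with the SER figures in the abstract being higher than the CER figures.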
Advances in Joint CTC-Attention based End-to-End Speech Recognition with a Deep CNN Encoder and RNN-LM
We present a state-of-the-art end-to-end Automatic Speech Recognition (ASR)
model. We learn to listen and write characters with a joint Connectionist
Temporal Classification (CTC) and attention-based encoder-decoder network. The
encoder is a deep Convolutional Neural Network (CNN) based on the VGG network.
The CTC network sits on top of the encoder and is jointly trained with the
attention-based decoder. During the beam search process, we combine the CTC
predictions, the attention-based decoder predictions and a separately trained
LSTM language model. We achieve a 5-10% error reduction compared to prior
systems on spontaneous Japanese and Chinese speech, and our end-to-end model
outperforms traditional hybrid ASR systems. Comment: Accepted for INTERSPEECH 201
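The combination of CTC, attention-decoder, and language-model predictions described above can be illustrated with a small sketch. It assumes a log-linear interpolation of the three log-probabilities, which is one common way to combine such scores; the abstract does not state the exact formula or weights, so `ctc_weight`, `lm_weight`, and the function names are hypothetical. The sketch also only rescores complete hypotheses, whereas the paper combines the scores during beam search.

```python
def joint_score(log_p_ctc, log_p_att, log_p_lm, ctc_weight=0.3, lm_weight=0.3):
    """Log-linear combination of three model scores (weights are assumed values)."""
    return (ctc_weight * log_p_ctc
            + (1.0 - ctc_weight) * log_p_att
            + lm_weight * log_p_lm)


def rerank(hypotheses, ctc_weight=0.3, lm_weight=0.3):
    """Sort (text, log_p_ctc, log_p_att, log_p_lm) tuples by joint score, best first."""
    return sorted(
        hypotheses,
        key=lambda h: joint_score(h[1], h[2], h[3], ctc_weight, lm_weight),
        reverse=True,
    )


# Hypothetical log-probabilities for two candidate transcriptions.
candidates = [
    ("hypothesis a", -1.0, -2.0, -3.0),
    ("hypothesis b", -2.0, -1.0, -1.0),
]
best = rerank(candidates)[0]
```

With these made-up numbers, the CTC score alone prefers the first candidate, but the attention and language-model scores outweigh it, so the second candidate ranks first.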