97 research outputs found
An Investigation of the Beliefs and Classroom Performances of the Overseas Students in Chinese Learning at DUT
This study investigates the beliefs of overseas students learning Chinese at DUT (Dalian University of Technology) and how those beliefs influence their classroom performance. It further examines the relationship between the overseas Chinese learners’ beliefs and their corresponding classroom performance. Both qualitative and quantitative methods are used to collect data. First, two questionnaires are designed to gather data on the learners’ beliefs and their classroom performance. In addition, classroom observation is used to record the students’ actual performance in Chinese classes, supplementing the questionnaire data. SPSS 17.0 is then used to analyse the relationship between the learners’ beliefs and their classroom performance. The paper concludes that the overseas Chinese learners’ beliefs and their classroom performance both influence their learning outcomes.
An Empirical Study on Speaking Proficiency Training for Chinese EFL Learners
Improving students’ speaking proficiency has always been a challenge for Chinese EFL teachers. Under the traditional training mode, students had low motivation to speak, insufficient exposure to authentic language input, inadequate teacher instruction on social strategies, and no collaborative learning environment in which to find a partner to practice English with. To address these problems of the traditional training mode, this research proposes a multi-dimensional training mode with DV as its medium, tasks as its center, cooperative learning as its form, campus English native speakers as its resource, and textbooks as its content. Results of the empirical study show the mode to be effective in raising students’ speaking proficiency, social strategy use, and motivation. Key words: EFL teaching in Chinese context; Speaking proficiency training; Task-based learning; Cooperative learning; DV
Internal Language Model Estimation Through Explicit Context Vector Learning for Attention-based Encoder-decoder ASR
An end-to-end (E2E) ASR model implicitly learns a prior Internal Language
Model (ILM) from the training transcripts. To fuse an external LM using Bayes
posterior theory, the log likelihood produced by the ILM has to be accurately
estimated and subtracted. In this paper we propose two novel approaches to
estimate the ILM based on the Listen-Attend-Spell (LAS) framework. The first
method replaces the context vector of the LAS decoder at every time step with a
vector learned from the training transcripts. The second method uses a
lightweight feed-forward network to directly map the query vector to the
context vector in a dynamic fashion. Because the context vectors are learned by
minimizing perplexity on the training transcripts, and their estimation is
independent of the encoder output, both methods learn the ILM accurately.
Experiments show that the ILMs achieve the lowest perplexity, indicating the
efficacy of the proposed methods. In addition, they significantly outperform
the shallow fusion method, as well as two previously proposed ILM Estimation
(ILME) approaches, on several datasets. Comment: Proceedings of INTERSPEECH
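The fusion rule the abstract describes can be sketched in a few lines. This is a minimal illustration, not the paper's implementation; the function name and the interpolation weights `lam` and `mu` are assumptions chosen for the example:

```python
def fuse_scores(log_p_e2e, log_p_lm, log_p_ilm, lam=0.5, mu=0.3):
    """Log-linear LM fusion per Bayes posterior theory: add the external
    LM score and subtract the estimated internal LM score from the E2E
    model score. lam and mu are illustrative interpolation weights."""
    return log_p_e2e + lam * log_p_lm - mu * log_p_ilm

# hypothetical per-hypothesis log-probabilities
fused = fuse_scores(-2.0, -1.5, -1.8)
```

The better the ILM estimate, the more cleanly the subtraction removes the prior the E2E model has absorbed from its training transcripts, which is why an accurate estimate matters before fusing an external LM.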
Rule-embedded network for audio-visual voice activity detection in live musical video streams
Detecting an anchor's voice in live musical streams is an important
preprocessing step for music and speech signal processing. Existing approaches
to voice activity detection (VAD) rely primarily on audio; however, audio-based
VAD struggles to focus on the target voice in noisy environments. With the help
of visual information, this paper proposes a rule-embedded network that fuses
the audio-visual (A-V) inputs to help the model better detect the target voice.
The core role of the rule in the model is to coordinate the relation between
the bi-modal information and to use the visual representations as a mask that
filters out information from non-target sounds. Experiments show that: 1) with
the cross-modal fusion enabled by the proposed rule, the A-V branch outperforms
the audio-only branch; 2) the bi-modal model far outperforms audio-only models,
indicating that incorporating both audio and visual signals is highly
beneficial for VAD. To attract more attention to cross-modal music and audio
signal processing, a new live musical video corpus with frame-level labels is
introduced. Comment: Submitted to ICASSP 202
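The masking rule can be pictured as a frame-wise gate. The sketch below is an assumption about the general mechanism (visual representation squashed to a mask that multiplies the audio features), not the paper's exact network; all names and shapes are illustrative:

```python
import numpy as np

def rule_fusion(audio_feat, visual_repr):
    """Illustrative version of the embedded rule: squash the visual
    representation to (0, 1) and use it as a frame-wise mask that
    suppresses audio frames where the target speaker is not visible."""
    mask = 1.0 / (1.0 + np.exp(-visual_repr))   # sigmoid gate per frame
    return audio_feat * mask                    # broadcast over feature dim

audio = np.ones((4, 8))                 # 4 frames, 8-dim audio features
visual = np.full((4, 1), 10.0)          # "target clearly visible" frames
masked = rule_fusion(audio, visual)     # audio passes through almost unchanged
```

With strongly negative visual evidence the gate approaches zero and the corresponding audio frames are filtered out, which is the coordinating role the rule plays between the two modalities.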
CIF-PT: Bridging Speech and Text Representations for Spoken Language Understanding via Continuous Integrate-and-Fire Pre-Training
Speech or text representations generated by pre-trained models contain
modal-specific information that can be combined to benefit spoken language
understanding (SLU) tasks. In this work, we propose a novel pre-training
paradigm termed Continuous Integrate-and-Fire Pre-Training (CIF-PT). It relies
on a simple but effective frame-to-token alignment, continuous
integrate-and-fire (CIF), to bridge the representations between speech and
text, jointly performing speech-to-text training and language model
distillation through CIF as the pre-training (PT). Evaluated on the SLU
benchmark SLURP, CIF-PT outperforms the state-of-the-art model by 1.94% in
accuracy and 2.71% in SLU-F1 on the tasks of intent classification and slot
filling, respectively. We also observe that the cross-modal representation
extracted by CIF-PT outperforms other neural interfaces for SLU tasks,
including the dominant speech representation learned from self-supervised
pre-training. Comment: Accepted by ACL 2023 Findings
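The CIF alignment itself is easy to sketch: per-frame weights are accumulated, and each time the accumulator crosses a threshold a token-level vector is fired from the frames integrated so far. This is a minimal, assumption-laden illustration of the general CIF mechanism, not the paper's implementation:

```python
import numpy as np

def cif(encoder_states, alphas, threshold=1.0):
    """Minimal continuous integrate-and-fire: accumulate per-frame
    weights alphas; when the sum crosses the threshold, emit the
    weighted sum of the frames integrated so far as one token vector,
    carrying any excess weight over to the next token."""
    tokens, acc = [], 0.0
    integ = np.zeros(encoder_states.shape[1])
    for h, a in zip(encoder_states, alphas):
        if acc + a < threshold:
            acc += a
            integ = integ + a * h
        else:
            spill = acc + a - threshold        # weight carried forward
            integ = integ + (a - spill) * h
            tokens.append(integ)
            acc, integ = spill, spill * h
    return (np.stack(tokens) if tokens
            else np.empty((0, encoder_states.shape[1])))

# four frames with weight 0.5 each fire two token-level vectors
token_vecs = cif(np.ones((4, 2)), [0.5, 0.5, 0.5, 0.5])
```

Because the fired vectors live at token granularity, they can be compared directly with text-side representations, which is what makes CIF a natural bridge for the joint training and distillation described above.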
Improving Large-scale Deep Biasing with Phoneme Features and Text-only Data in Streaming Transducer
Deep biasing for the Transducer can improve the recognition performance of
rare words or contextual entities, which is essential in practical
applications, especially for streaming Automatic Speech Recognition (ASR).
However, deep biasing with large-scale rare words remains challenging, as the
performance drops significantly when more distractors exist and there are words
with similar grapheme sequences in the bias list. In this paper, we combine the
phoneme and textual information of rare words in Transducers to distinguish
words with similar pronunciation or spelling. Moreover, training with
text-only data containing more rare words further benefits large-scale deep
biasing. Experiments on the LibriSpeech corpus demonstrate that the proposed
method achieves state-of-the-art rare-word error rates across bias lists of
different scales and difficulty levels. Comment: Submitted to ASRU 202
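Why phoneme features help separate graphemically similar bias words can be shown with a toy example. The vectors below are fabricated purely for illustration; the paper's actual embeddings and fusion are more elaborate:

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Toy vectors for two bias-list words whose spellings are nearly
# identical but whose pronunciations are distinct.
g1, g2 = np.array([1.0, 0.0]), np.array([0.99, 0.1])   # grapheme embeddings
p1, p2 = np.array([1.0, 0.0]), np.array([0.0, 1.0])    # phoneme embeddings

grapheme_only = cosine(g1, g2)                          # highly confusable
combined = cosine(np.concatenate([g1, p1]),
                  np.concatenate([g2, p2]))             # far less confusable
```

Concatenating the phoneme cue lowers the similarity between the two distractors, which is the intuition behind combining phoneme and textual information in the bias module.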
Graph Contrastive Learning with Implicit Augmentations
Existing graph contrastive learning methods rely on augmentation techniques
based on random perturbations (e.g., randomly adding or dropping edges and
nodes). Nevertheless, altering certain edges or nodes can unexpectedly change
the graph characteristics, and choosing the optimal perturbing ratio for each
dataset requires onerous manual tuning. In this paper, we introduce Implicit
Graph Contrastive Learning (iGCL), which utilizes augmentations in the latent
space learned from a Variational Graph Auto-Encoder by reconstructing graph
topological structure. Importantly, instead of explicitly sampling
augmentations from latent distributions, we further propose an upper bound for
the expected contrastive loss to improve the efficiency of our learning
algorithm. Thus, graph semantics can be preserved within the augmentations in
an intelligent way without arbitrary manual design or prior human knowledge.
Experimental results on both graph-level and node-level tasks show that the
proposed method achieves state-of-the-art performance compared with other
benchmarks, and ablation studies demonstrate the effectiveness of the modules
in iGCL.
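The latent-space augmentation that iGCL's upper bound replaces can be sketched with the standard reparameterisation trick. This is a naive, explicit-sampling version written for illustration; the function name, shapes, and seed are assumptions, and iGCL itself avoids this sampling:

```python
import numpy as np

rng = np.random.default_rng(0)

def latent_augment(mu, log_sigma, n_samples=2):
    """Explicitly sample augmented node embeddings from the VGAE
    posterior N(mu, sigma^2) via the reparameterisation trick.
    iGCL sidesteps this step by optimising an upper bound on the
    expected contrastive loss, but this is the operation the bound
    stands in for."""
    sigma = np.exp(log_sigma)
    eps = rng.standard_normal((n_samples,) + mu.shape)
    return mu + sigma * eps

# two augmented views of 3 node embeddings of dimension 4
views = latent_augment(np.zeros((3, 4)), np.full((3, 4), -10.0))
```

Because the augmentations come from a learned posterior rather than random edge or node perturbations, there is no perturbation ratio to tune per dataset.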
Leveraging phone-level linguistic-acoustic similarity for utterance-level pronunciation scoring
Recent studies on pronunciation scoring have explored the effect of
introducing phone embeddings as reference pronunciation, but mostly in an
implicit manner, i.e., by adding or concatenating the reference phone embedding
and the actual pronunciation of the target phone to form the phone-level
pronunciation quality representation. In this paper, we propose to use linguistic-acoustic
similarity to explicitly measure the deviation of non-native production from
its native reference for pronunciation assessment. Specifically, the deviation
is first estimated by the cosine similarity between reference phone embedding
and the corresponding acoustic embedding. Next, a phone-level Goodness of
Pronunciation (GOP) pre-training stage is introduced to guide this
similarity-based learning toward a better initialization of the two
embeddings. Finally, a transformer-based hierarchical pronunciation scorer
maps the sequence of phone embeddings and acoustic embeddings, along with
their similarity measures, to the final utterance-level score. Experimental
results on non-native databases suggest that the proposed system significantly
outperforms baselines in which the acoustic and phone embeddings are simply
added or concatenated. A further examination shows that the phone embeddings
learned with the proposed approach capture the linguistic-acoustic attributes
of native pronunciation as a reference. Comment: Accepted by ICASSP 202
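The deviation measure at the heart of this system is an ordinary cosine similarity between the two embeddings. A minimal sketch, with the function name and vectors chosen for illustration:

```python
import numpy as np

def phone_deviation(ref_phone_emb, acoustic_emb):
    """Cosine similarity between the reference phone embedding and the
    learner's acoustic embedding: values near 1 indicate native-like
    production, low values indicate a large deviation from the
    reference."""
    return float(ref_phone_emb @ acoustic_emb /
                 (np.linalg.norm(ref_phone_emb) *
                  np.linalg.norm(acoustic_emb)))
```

The per-phone similarities are then consumed, together with the embeddings themselves, by the hierarchical scorer that produces the utterance-level score.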