
    An Investigation of the Beliefs and Classroom Performances of the Overseas Students in Chinese Learning at DUT

    This research investigates the beliefs of overseas students learning Chinese at DUT (Dalian University of Technology) and how those beliefs influence their classroom performance, and further examines the relationship between the learners' beliefs and their corresponding classroom performance in Chinese learning. Both qualitative and quantitative methods are used to collect data. First, two questionnaires are designed to gather data on the learners' beliefs and their classroom performance. In addition, classroom observation is used to record the students' actual performance in Chinese classes, supplementing the questionnaire data. SPSS 17.0 is then used to compute the relationship between the learners' beliefs and their classroom performance. Finally, the paper concludes that overseas Chinese learners' beliefs and classroom performance influence their learning outcomes.
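
    The statistical step described above (performed in SPSS 17.0) is essentially a correlation analysis, which is easy to reproduce elsewhere. A minimal Python sketch of a Pearson correlation between belief scores and classroom-performance scores; the variable names and sample values are hypothetical, not the study's data:

    ```python
    # Minimal sketch: Pearson correlation between two questionnaire measures,
    # analogous to the SPSS 17.0 analysis described in the abstract.
    # The data below are hypothetical placeholders.
    from scipy.stats import pearsonr

    belief_scores = [3.2, 4.1, 2.8, 3.9, 4.5, 3.0]       # per-student belief questionnaire means
    performance_scores = [2.9, 4.3, 2.5, 3.6, 4.8, 3.1]  # per-student classroom-performance ratings

    r, p = pearsonr(belief_scores, performance_scores)
    print(f"Pearson r = {r:.3f}, p = {p:.3f}")  # strength and significance of the association
    ```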

    An Empirical Study on Speaking Proficiency Training for Chinese EFL Learners

    Improving students' speaking proficiency has always been a challenge for Chinese EFL teachers. Under the traditional training mode, students had low motivation to speak, insufficient exposure to authentic language input, inadequate teacher instruction on social strategies, and no collaborative learning environment in which to find a partner to practice English with. To address these problems, this research proposes a multi-dimensional training mode with DV as its media, tasks as its center, cooperative learning as its form, campus English native speakers as its resources, and textbooks as its content. Results of the empirical study show the mode to be effective in increasing students' speaking proficiency, social strategy use, and motivation.
    Key words: EFL teaching in Chinese context; Speaking proficiency training; Task-based learning; Cooperative learning; DV

    Internal Language Model Estimation Through Explicit Context Vector Learning for Attention-based Encoder-decoder ASR

    An end-to-end (E2E) ASR model implicitly learns a prior internal language model (ILM) from its training transcripts. To fuse an external LM using Bayes' posterior theory, the log-likelihood produced by the ILM has to be accurately estimated and subtracted. In this paper we propose two novel approaches to estimating the ILM in the Listen-Attend-Spell (LAS) framework. The first method replaces the context vector of the LAS decoder at every time step with a vector learned from the training transcripts. The second method uses a lightweight feed-forward network to dynamically map the query vector to a context vector. Since the context vectors are learned by minimizing perplexity on the training transcripts, and their estimation is independent of the encoder output, both methods learn the ILM accurately. Experiments show that the resulting ILMs achieve the lowest perplexity, indicating the efficacy of the proposed methods. They also significantly outperform shallow fusion, as well as two previously proposed ILM estimation (ILME) approaches, on several datasets.
    Comment: Proceedings of INTERSPEECH
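
    To make the first method concrete: score transcripts with the decoder alone while the attention context is replaced by a single learned vector, so the decoder acts as a language model detached from the encoder. A minimal PyTorch sketch under assumed module names and shapes, not the authors' code:

    ```python
    # Sketch of the first ILME idea: run the LAS decoder over transcripts with
    # a learned vector standing in for the acoustic context at every step.
    import torch
    import torch.nn as nn

    class ILMDecoder(nn.Module):
        def __init__(self, vocab_size, emb_dim=256, hid_dim=512, ctx_dim=512):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, emb_dim)
            self.rnn = nn.LSTMCell(emb_dim + ctx_dim, hid_dim)
            self.out = nn.Linear(hid_dim + ctx_dim, vocab_size)
            # Learned replacement for the attention context vector (method 1):
            self.learned_ctx = nn.Parameter(torch.zeros(ctx_dim))

        def forward(self, tokens):                      # tokens: (B, T) token ids
            B, T = tokens.shape
            h = tokens.new_zeros(B, self.rnn.hidden_size, dtype=torch.float)
            c = torch.zeros_like(h)
            ctx = self.learned_ctx.expand(B, -1)        # same vector at every step
            logits = []
            for t in range(T):
                emb = self.embed(tokens[:, t])
                h, c = self.rnn(torch.cat([emb, ctx], dim=-1), (h, c))
                logits.append(self.out(torch.cat([h, ctx], dim=-1)))
            return torch.stack(logits, dim=1)           # (B, T, V) ILM logits

    # Training: minimize cross-entropy (i.e., perplexity) on the training
    # transcripts, updating only learned_ctx with the decoder frozen; at
    # inference, subtract the resulting ILM score during external LM fusion.
    ```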

    Rule-embedded network for audio-visual voice activity detection in live musical video streams

    Detecting the anchor's voice in live musical streams is an important preprocessing step for music and speech signal processing. Existing approaches to voice activity detection (VAD) rely primarily on audio; however, audio-based VAD struggles to focus on the target voice in noisy environments. With the help of visual information, this paper proposes a rule-embedded network that fuses audio-visual (A-V) inputs to help the model better detect the target voice. The core role of the rule is to coordinate the relation between the bi-modal information, using the visual representations as a mask that filters out non-target sound. Experiments show that: 1) with the cross-modal fusion imposed by the proposed rule, the A-V branch outperforms the audio branch; and 2) the bi-modal model far outperforms audio-only models, indicating that incorporating both audio and visual signals is highly beneficial for VAD. To attract more attention to cross-modal music and audio signal processing, a new live musical video corpus with frame-level labels is introduced.
    Comment: Submitted to ICASSP 202
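
    The masking rule can be pictured as a soft gate derived from the visual stream that scales the audio embedding frame by frame. A minimal sketch of that gating pattern; layer sizes and names are illustrative assumptions, not the paper's architecture:

    ```python
    # Sketch of rule-embedded A-V fusion: frame-level visual activity becomes
    # a soft mask that gates the audio embedding, suppressing sounds that are
    # not produced by the on-screen anchor.
    import torch
    import torch.nn as nn

    class RuleEmbeddedVAD(nn.Module):
        def __init__(self, audio_dim=64, visual_dim=128, hid=128):
            super().__init__()
            self.audio_enc = nn.GRU(audio_dim, hid, batch_first=True)
            self.visual_enc = nn.GRU(visual_dim, hid, batch_first=True)
            self.mask_head = nn.Sequential(nn.Linear(hid, 1), nn.Sigmoid())
            self.classifier = nn.Linear(hid, 2)    # frame-level: voice / no voice

        def forward(self, audio_feats, visual_feats):  # (B, T, D_a), (B, T, D_v)
            a, _ = self.audio_enc(audio_feats)
            v, _ = self.visual_enc(visual_feats)
            mask = self.mask_head(v)               # (B, T, 1): the embedded "rule"
            fused = a * mask                       # visual mask filters non-target audio
            return self.classifier(fused)          # (B, T, 2) frame-level logits
    ```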

    CIF-PT: Bridging Speech and Text Representations for Spoken Language Understanding via Continuous Integrate-and-Fire Pre-Training

    Speech and text representations generated by pre-trained models contain modality-specific information that can be combined to benefit spoken language understanding (SLU) tasks. In this work, we propose a novel pre-training paradigm termed Continuous Integrate-and-Fire Pre-Training (CIF-PT). It relies on a simple but effective frame-to-token alignment, continuous integrate-and-fire (CIF), to bridge the representations of speech and text, jointly performing speech-to-text training and language-model distillation through CIF as the pre-training (PT) stage. Evaluated on the SLU benchmark SLURP, CIF-PT outperforms the state-of-the-art model by 1.94% in accuracy and 2.71% in SLU-F1 on intent classification and slot filling, respectively. We also observe that the cross-modal representation extracted by CIF-PT performs better on SLU tasks than other neural interfaces, including the dominant speech representations learned from self-supervised pre-training.
    Comment: Accepted by ACL 2023 Findings
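
    The CIF alignment the paradigm rests on accumulates per-frame weights until a threshold is crossed, then "fires" the integrated frames as one token-level vector. A simplified, single-utterance sketch of that mechanism under stated assumptions (not the authors' implementation):

    ```python
    # Sketch of continuous integrate-and-fire (CIF): integrate weighted frames
    # until the accumulated weight reaches a threshold, then emit one vector.
    import torch

    def cif(frames, weights, threshold=1.0):
        """frames: (T, D) encoder outputs; weights: (T,) values in [0, 1]."""
        token_vecs, acc_w = [], 0.0
        acc_v = torch.zeros(frames.size(1))
        for t in range(frames.size(0)):
            w = weights[t].item()
            if acc_w + w < threshold:               # keep integrating
                acc_w += w
                acc_v = acc_v + w * frames[t]
            else:                                   # fire: split this frame's weight
                w_cur = threshold - acc_w
                token_vecs.append(acc_v + w_cur * frames[t])
                acc_w = w - w_cur                   # leftover starts the next token
                acc_v = acc_w * frames[t]
        return torch.stack(token_vecs) if token_vecs else torch.empty(0, frames.size(1))

    # The fired vectors align one-to-one with text tokens, which is what lets
    # CIF-PT distill token-level LM representations into the speech side.
    ```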

    Improving Large-scale Deep Biasing with Phoneme Features and Text-only Data in Streaming Transducer

    Deep biasing for the Transducer can improve the recognition of rare words and contextual entities, which is essential in practical applications, especially for streaming automatic speech recognition (ASR). However, deep biasing with large-scale rare-word lists remains challenging: performance drops significantly when more distractors exist and when the bias list contains words with similar grapheme sequences. In this paper, we combine the phoneme and textual information of rare words in Transducers to distinguish words with similar pronunciation or spelling. Moreover, training with text-only data containing more rare words benefits large-scale deep biasing. Experiments on the LibriSpeech corpus demonstrate that the proposed method achieves state-of-the-art rare-word error rates across different scales and levels of bias lists.
    Comment: Submitted to ASRU 202
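
    One way to picture the phoneme-plus-text combination is a bias-word encoder that embeds each list entry from both its grapheme and phoneme sequences, so words that look alike but sound different remain separable. A hedged sketch; vocabulary sizes, dimensions, and pooling are assumptions, not the paper's exact design:

    ```python
    # Sketch: encode each bias-list word from graphemes AND phonemes, then
    # concatenate the two views into one vector per word.
    import torch
    import torch.nn as nn

    class BiasWordEncoder(nn.Module):
        def __init__(self, n_graphemes=30, n_phonemes=70, emb=64, hid=128):
            super().__init__()
            self.g_emb = nn.Embedding(n_graphemes, emb, padding_idx=0)
            self.p_emb = nn.Embedding(n_phonemes, emb, padding_idx=0)
            self.g_rnn = nn.LSTM(emb, hid, batch_first=True)
            self.p_rnn = nn.LSTM(emb, hid, batch_first=True)
            self.proj = nn.Linear(2 * hid, hid)

        def forward(self, graphemes, phonemes):   # (N, Lg), (N, Lp) padded ids
            _, (hg, _) = self.g_rnn(self.g_emb(graphemes))
            _, (hp, _) = self.p_rnn(self.p_emb(phonemes))
            word_vecs = self.proj(torch.cat([hg[-1], hp[-1]], dim=-1))  # (N, hid)
            return word_vecs  # attended over by the Transducer's biasing module
    ```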

    Graph Contrastive Learning with Implicit Augmentations

    Existing graph contrastive learning methods rely on augmentation techniques based on random perturbations (e.g., randomly adding or dropping edges and nodes). However, altering certain edges or nodes can unexpectedly change the graph's characteristics, and choosing the optimal perturbation ratio for each dataset requires onerous manual tuning. In this paper, we introduce Implicit Graph Contrastive Learning (iGCL), which draws augmentations from the latent space learned by a Variational Graph Auto-Encoder that reconstructs the graph's topological structure. Importantly, instead of explicitly sampling augmentations from the latent distributions, we further derive an upper bound on the expected contrastive loss to improve the efficiency of the learning algorithm. Graph semantics are thus preserved within the augmentations without arbitrary manual design or prior human knowledge. Experimental results on both graph-level and node-level tasks show that the proposed method achieves state-of-the-art performance compared to other benchmarks, and ablation studies demonstrate the effectiveness of each module in iGCL.
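
    To illustrate the latent-space augmentation being avoided by the bound: the explicit counterpart would reparameterize two samples per node from the VGAE posterior and contrast them. A sketch of that sampled version under assumed shapes; iGCL itself optimizes an upper bound on the expectation instead of sampling:

    ```python
    # Sketch: explicit latent-space "views" sampled from a VGAE posterior
    # N(mu, sigma^2), contrasted with an InfoNCE-style loss.
    import torch
    import torch.nn.functional as F

    def latent_views(mu, logvar):
        """mu, logvar: (N, D) from a trained VGAE encoder; returns two samples."""
        std = torch.exp(0.5 * logvar)
        z1 = mu + std * torch.randn_like(std)   # reparameterized sample 1
        z2 = mu + std * torch.randn_like(std)   # reparameterized sample 2
        return z1, z2

    def info_nce(z1, z2, tau=0.5):
        z1, z2 = F.normalize(z1, dim=-1), F.normalize(z2, dim=-1)
        logits = z1 @ z2.t() / tau               # (N, N) pairwise similarity
        labels = torch.arange(z1.size(0))        # positives lie on the diagonal
        return F.cross_entropy(logits, labels)

    # iGCL replaces this Monte-Carlo sampling with a closed-form upper bound
    # on the expected loss, which is where its efficiency gain comes from.
    ```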

    Leveraging phone-level linguistic-acoustic similarity for utterance-level pronunciation scoring

    Recent studies on pronunciation scoring have explored introducing phone embeddings as a reference pronunciation, but mostly in an implicit manner, i.e., adding or concatenating the reference phone embedding and the actual pronunciation of the target phone to form the phone-level pronunciation-quality representation. In this paper, we propose using linguistic-acoustic similarity to explicitly measure the deviation of non-native production from its native reference for pronunciation assessment. Specifically, the deviation is first estimated by the cosine similarity between the reference phone embedding and the corresponding acoustic embedding. Next, a phone-level Goodness of Pronunciation (GOP) pre-training stage is introduced to guide this similarity-based learning toward a better initialization of the two embeddings. Finally, a transformer-based hierarchical pronunciation scorer maps the sequence of phone embeddings and acoustic embeddings, along with their similarity measures, to the final utterance-level score. Experimental results on non-native databases show that the proposed system significantly outperforms baselines in which the acoustic and phone embeddings are simply added or concatenated. A further examination shows that the phone embeddings learned by the proposed approach capture linguistic-acoustic attributes of native pronunciation as a reference.
    Comment: Accepted by ICASSP 202
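
    The explicit deviation measure itself is just a per-phone cosine similarity that is then fed, alongside the embeddings, to the utterance-level scorer. A minimal sketch with assumed shapes and a hypothetical downstream feature layout:

    ```python
    # Sketch: per-phone cosine similarity between the reference phone embedding
    # and the realized acoustic embedding; low similarity = large deviation.
    import torch
    import torch.nn.functional as F

    def phone_deviation(phone_emb, acoustic_emb):
        """phone_emb, acoustic_emb: (P, D) per-phone vectors for one utterance."""
        sim = F.cosine_similarity(phone_emb, acoustic_emb, dim=-1)  # (P,) in [-1, 1]
        return sim

    # Hypothetical downstream use: concatenate [phone_emb, acoustic_emb, sim]
    # per phone and feed the sequence to a transformer whose pooled output
    # regresses the utterance-level pronunciation score.
    phone_emb = torch.randn(12, 256)
    acoustic_emb = torch.randn(12, 256)
    sim = phone_deviation(phone_emb, acoustic_emb).unsqueeze(-1)
    feats = torch.cat([phone_emb, acoustic_emb, sim], dim=-1)
    print(feats.shape)  # torch.Size([12, 513])
    ```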