
    TCT: A Cross-supervised Learning Method for Multimodal Sequence Representation

    Multimodal approaches generally offer better performance than unimodal ones in most tasks. However, efficiently learning semantic representations from multiple modalities is extremely challenging. To tackle this, we propose the Transformer based Cross-modal Translator (TCT), which learns unimodal sequence representations by translating from other related multimodal sequences in a supervised manner. Combining TCT with the Multimodal Transformer Network (MTN), we evaluate MTN-TCT on video-grounded dialogue, a task that uses multiple modalities. The proposed method reports new state-of-the-art performance on video-grounded dialogue, indicating that the representations learned by TCT are more semantically meaningful than those obtained directly from a single modality.
    Comment: submitted to ICASSP 202
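
    As a rough illustration of the idea of cross-modal translation, the sketch below (PyTorch) trains a small Transformer encoder-decoder to translate one modality's feature sequence into another's under a reconstruction loss. The dimensions, layer counts, and MSE loss are illustrative assumptions, not the paper's exact TCT configuration.

```python
import torch
import torch.nn as nn

class CrossModalTranslator(nn.Module):
    """Sketch only: translate a source-modality sequence (e.g. video features)
    into a target-modality sequence (e.g. audio features) with a Transformer
    encoder-decoder, supervised by a reconstruction loss."""
    def __init__(self, src_dim=2048, tgt_dim=128, d_model=256):
        super().__init__()
        self.src_proj = nn.Linear(src_dim, d_model)
        self.tgt_proj = nn.Linear(tgt_dim, d_model)
        self.transformer = nn.Transformer(d_model=d_model, nhead=4,
                                          num_encoder_layers=2,
                                          num_decoder_layers=2,
                                          batch_first=True)
        self.out = nn.Linear(d_model, tgt_dim)

    def forward(self, src_seq, tgt_seq):
        # src_seq: (B, T_src, src_dim), tgt_seq: (B, T_tgt, tgt_dim).
        # In practice the decoder input would be shifted/masked; omitted here.
        hidden = self.transformer(self.src_proj(src_seq), self.tgt_proj(tgt_seq))
        return self.out(hidden)                  # predicted target-modality features

model = CrossModalTranslator()
video = torch.randn(2, 30, 2048)                 # e.g. per-frame video features
audio = torch.randn(2, 50, 128)                  # e.g. log-mel audio features
pred = model(video, audio)
loss = nn.functional.mse_loss(pred, audio)       # cross-supervised reconstruction
loss.backward()
```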

    Cross-task pre-training for acoustic scene classification

    Acoustic scene classification (ASC) and acoustic event detection (AED) are different but related tasks. Acoustic scenes are shaped by the acoustic events that occur in them, and this event information can be useful for training ASC models. However, most datasets do not provide both acoustic event and scene labels. We therefore explore a cross-task pre-training mechanism that uses acoustic event information extracted from a pre-trained model to improve the ASC task. We present three cross-task pre-training architectures and evaluate them with both feature-based and fine-tuning strategies on two datasets: the TAU Urban Acoustic Scenes 2019 dataset and the TUT Acoustic Scenes 2017 dataset. Results show that cross-task pre-training can significantly improve ASC performance: compared with the official baselines, our best model achieves a relative improvement of 9.5% on the TAU Urban Acoustic Scenes 2019 dataset and 10% on the TUT Acoustic Scenes 2017 dataset.
    Comment: submitted to ICASSP202
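
    The feature-based versus fine-tuning distinction can be made concrete with a short sketch. The snippet below assumes a hypothetical CNN encoder pre-trained on an AED task and shows the two ways it can be reused for ASC; layer sizes, class count, and learning rates are illustrative, not the paper's settings.

```python
import torch
import torch.nn as nn

aed_encoder = nn.Sequential(                 # stand-in for the pre-trained AED model
    nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
)
num_scenes = 10                              # e.g. 10 scene classes in TAU 2019
scene_head = nn.Linear(32, num_scenes)

strategy = "feature"                         # or "finetune"
if strategy == "feature":
    # Feature-based: freeze the pre-trained encoder, train only the ASC head.
    for p in aed_encoder.parameters():
        p.requires_grad = False
    optimizer = torch.optim.Adam(scene_head.parameters(), lr=1e-3)
else:
    # Fine-tuning: update encoder and head jointly, smaller LR for the encoder.
    optimizer = torch.optim.Adam([
        {"params": aed_encoder.parameters(), "lr": 1e-4},
        {"params": scene_head.parameters(), "lr": 1e-3},
    ])

x = torch.randn(4, 1, 64, 500)               # batch of log-mel spectrograms
loss = nn.functional.cross_entropy(scene_head(aed_encoder(x)),
                                   torch.randint(0, num_scenes, (4,)))
loss.backward()
optimizer.step()
```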

    Towards End-to-End Code-Switching Speech Recognition

    Code-switching speech recognition has attracted increasing interest recently, but the need for expert linguistic knowledge has always been a major issue. End-to-end automatic speech recognition (ASR) simplifies the building of ASR systems considerably by predicting graphemes or characters directly from acoustic input. At the same time, it eliminates the need for expert linguistic knowledge, which makes it an attractive choice for code-switching ASR. This paper presents a hybrid CTC-Attention based end-to-end Mandarin-English code-switching (CS) speech recognition system and studies the effect of hybrid CTC-Attention based models, different modeling units, the inclusion of language identification, and different decoding strategies on code-switching ASR. On the SEAME corpus, our system achieves a mixed error rate (MER) of 34.24%.
    Comment: 5 pages, submitted to ICASSP 201
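
    Hybrid CTC-Attention systems are typically trained with an interpolated objective, L = λ·L_CTC + (1-λ)·L_att. The sketch below shows one way to compute such a loss in PyTorch; the weight λ = 0.3 and the tensor shapes are assumptions for illustration, not the values used in this paper.

```python
import torch
import torch.nn.functional as F

def hybrid_ctc_attention_loss(ctc_log_probs, input_lengths,
                              att_logits, ys_pad, target_lengths,
                              lam=0.3, blank=0, pad_id=-1):
    """Sketch of the hybrid objective L = lam * L_ctc + (1 - lam) * L_att.

    ctc_log_probs: (T, B, V) log-probabilities from the CTC branch
    att_logits:    (B, U, V) decoder logits for the attention branch
    ys_pad:        (B, U) target token ids, padded with pad_id
    """
    ctc_loss = F.ctc_loss(ctc_log_probs, ys_pad.clamp(min=0),
                          input_lengths, target_lengths,
                          blank=blank, zero_infinity=True)
    att_loss = F.cross_entropy(att_logits.transpose(1, 2), ys_pad,
                               ignore_index=pad_id)
    return lam * ctc_loss + (1 - lam) * att_loss

# Toy usage with random tensors.
B, T, U, V = 2, 50, 10, 30
loss = hybrid_ctc_attention_loss(
    torch.randn(T, B, V).log_softmax(-1), torch.full((B,), T),
    torch.randn(B, U, V), torch.randint(1, V, (B, U)),
    torch.full((B,), U))
```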

    A comparable study of modeling units for end-to-end Mandarin speech recognition

    End-to-end models have become increasingly popular for Mandarin speech recognition and achieve strong performance. Mandarin is a tonal language, unlike English, and requires special treatment of the acoustic modeling units. Several kinds of modeling units have been used for Mandarin, such as phonemes, syllables, and Chinese characters. In this work, we explore two major end-to-end models for Mandarin speech recognition: the connectionist temporal classification (CTC) model and the attention-based encoder-decoder model. We compare the performance of three modeling units of different granularity: context-dependent phonemes (CDP), syllables with tone, and Chinese characters. We find that all modeling units achieve comparable character error rates (CER) with the CTC model, while the Chinese-character attention model outperforms the syllable attention model. Furthermore, we find that the Chinese character is a reasonable unit for Mandarin speech recognition. On the DidiCallcenter task, the Chinese-character attention model achieves a CER of 5.68% and the CTC model a CER of 7.29%; on the DidiReading task, the CERs are 4.89% and 5.79%, respectively. Moreover, the attention model outperforms the CTC model on both datasets.
    Comment: 5 pages
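
    Since all configurations are compared by character error rate, a minimal sketch of CER scoring over Chinese characters is shown below; this is the standard Levenshtein-distance definition, not code from the paper.

```python
def character_error_rate(ref: str, hyp: str) -> float:
    """Edit distance over characters divided by reference length;
    for Mandarin, each Chinese character counts as one unit."""
    r, h = list(ref), list(hyp)
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(r)][len(h)] / max(len(r), 1)

print(character_error_rate("今天天气很好", "今天天气真好"))  # 1 substitution over 6 chars ~= 0.167
```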

    Giant exchange bias and ferromagnetism in the CoO shell of Co/CoO-MgO core-shell nanoparticles

    Using magnetron sputtering, we produced a series of Co/CoO-MgO nanoparticles on Si(100) substrates. High-resolution transmission electron microscopy (HRTEM) images show small, isolated Co clusters (cores) a few nm in size, covered with CoO shells and embedded in a MgO matrix. Resistivity as a function of Co atomic ratio exhibits a distinct percolation threshold, with a sharp decrease around 69% Co content. Across the threshold, the resistivity drops by about seven orders of magnitude. For a sample at this percolation threshold, we have observed a giant exchange bias field HE = 2460 Oe at T = 2 K, and using soft x-ray magnetic circular dichroism at the Co-L2,3 edge, we have detected a ferromagnetic (FM) signal originating from the antiferromagnetic CoO shell. Moreover, decreasing the Mg impurities reduces the FM signal from the CoO shell (namely the uncompensated spin density) and the size of HE, directly supporting the uncompensated spin model.

    A Further Study of Unsupervised Pre-training for Transformer Based Speech Recognition

    Building a good speech recognition system usually requires large amounts of transcribed data, which is expensive to collect. To tackle this problem, many unsupervised pre-training methods have been proposed. Among these methods, Masked Predictive Coding (MPC) achieved significant improvements on various speech recognition datasets with a BERT-like masked reconstruction loss and a Transformer backbone. However, many aspects of MPC have not been fully investigated. In this paper, we conduct a further study of MPC and focus on three important aspects: the effect of the speaking style of the pre-training data, its extension to streaming models, and how to better transfer knowledge learned in the pre-training stage to downstream tasks. Experiments reveal that pre-training data with a matching speaking style is more useful for downstream recognition tasks. A unified training objective combining APC and MPC provided an 8.46% relative error reduction for a streaming model trained on HKUST. Also, the combination of target data adaptation and layer-wise discriminative training helped the knowledge transfer of MPC, achieving a 3.99% relative error reduction on AISHELL over a strong baseline.
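
    For readers unfamiliar with the objective, the sketch below illustrates a BERT-like masked reconstruction loss for speech features in PyTorch. The masking ratio, model sizes, and L1 loss are illustrative assumptions rather than the exact MPC recipe studied here.

```python
import torch
import torch.nn as nn

class MaskedPredictiveCoding(nn.Module):
    """Sketch: zero out a fraction of input frames and train a Transformer
    encoder to reconstruct the original features at the masked positions."""
    def __init__(self, feat_dim=80, d_model=256, mask_ratio=0.15):
        super().__init__()
        self.mask_ratio = mask_ratio
        self.in_proj = nn.Linear(feat_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)
        self.out_proj = nn.Linear(d_model, feat_dim)

    def forward(self, feats):                      # feats: (B, T, feat_dim)
        mask = torch.rand(feats.shape[:2], device=feats.device) < self.mask_ratio
        masked = feats.masked_fill(mask.unsqueeze(-1), 0.0)
        pred = self.out_proj(self.encoder(self.in_proj(masked)))
        # Reconstruction loss computed only on the masked frames.
        return nn.functional.l1_loss(pred[mask], feats[mask])

mpc = MaskedPredictiveCoding()
loss = mpc(torch.randn(4, 200, 80))                # unlabeled log-mel features
loss.backward()
```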

    Transformer based unsupervised pre-training for acoustic representation learning

    Recently, a variety of acoustic tasks and related applications have arisen. For many acoustic tasks, the amount of labeled data may be limited. To handle this problem, we propose an unsupervised pre-training method that uses a Transformer based encoder to learn a general and robust high-level representation for all acoustic tasks. Experiments have been conducted on three kinds of acoustic tasks: speech emotion recognition, sound event detection, and speech translation. All experiments show that pre-training on a task's own training data can significantly improve performance. With a larger pre-training corpus combining the MuST-C, Librispeech, and ESC-US datasets, the UAR for speech emotion recognition improves by a further 4.3% absolute on the IEMOCAP dataset. For sound event detection, the F1 score improves by a further 1.5% absolute on the DCASE2018 Task 5 development set and 2.1% on the evaluation set. For speech translation, the BLEU score improves by a further 12.2% relative on the En-De dataset and 8.4% on the En-Fr dataset.
    Comment: Accepted by ICASSP 202
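
    A typical way to use such a pre-trained encoder downstream is to load its weights and fine-tune it with a small task-specific head, as in the sketch below. The checkpoint name, feature size, pooling, and class count are hypothetical placeholders, not the paper's configuration.

```python
import torch
import torch.nn as nn

feat_dim, d_model, num_classes = 80, 256, 4

in_proj = nn.Linear(feat_dim, d_model)
layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
encoder = nn.TransformerEncoder(layer, num_layers=4)
classifier = nn.Linear(d_model, num_classes)

# Initialise the encoder from unsupervised pre-training, then fine-tune end to end:
# encoder.load_state_dict(torch.load("pretrained_encoder.pt"))  # hypothetical checkpoint

feats = torch.randn(8, 300, feat_dim)          # batch of log-mel features
hidden = encoder(in_proj(feats))               # (B, T, d_model)
logits = classifier(hidden.mean(dim=1))        # mean-pool over time
loss = nn.functional.cross_entropy(logits, torch.randint(0, num_classes, (8,)))
loss.backward()
```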

    Audio Deep Fake Detection System with Neural Stitching for ADD 2022

    This paper describes our best system and methodology for ADD 2022: The First Audio Deep Synthesis Detection Challenge\cite{Yi2022ADD}. The same system was used for both rounds of evaluation in Track 3.2 with a similar training methodology. The first round of Track 3.2 data is generated by text-to-speech (TTS) or voice conversion (VC) algorithms, while the second round consists of fake audio generated by other participants in Track 3.1, aiming to spoof our systems. Our systems use a standard 34-layer ResNet with multi-head attention pooling \cite{india2019self} to learn a discriminative embedding for fake audio and spoof detection. We further utilize neural stitching to boost the model's generalization capability so that it performs equally well across different tasks; more details are explained in the following sections. The experiments show that our proposed method outperforms all other systems with a 10.1% equal error rate (EER) in Track 3.2.
    Comment: Accepted to ICASSP 202
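
    As an illustration of the pooling component, the sketch below implements a simple multi-head attentive pooling layer that turns frame-level embeddings into a single utterance-level embedding; the head count and dimensions are assumptions, and this is not necessarily identical to the cited pooling method.

```python
import torch
import torch.nn as nn

class MultiHeadAttentionPooling(nn.Module):
    """Sketch: each head learns its own attention weights over time, and the
    per-head pooled vectors are concatenated into one utterance embedding."""
    def __init__(self, in_dim=256, num_heads=4):
        super().__init__()
        assert in_dim % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = in_dim // num_heads
        self.score = nn.Linear(in_dim, num_heads)   # one scalar score per head

    def forward(self, x):                           # x: (B, T, in_dim)
        B, T, _ = x.shape
        attn = torch.softmax(self.score(x), dim=1)  # (B, T, H) weights over time
        x = x.view(B, T, self.num_heads, self.head_dim)
        pooled = (attn.unsqueeze(-1) * x).sum(dim=1)   # (B, H, head_dim)
        return pooled.reshape(B, -1)                   # (B, in_dim)

pool = MultiHeadAttentionPooling()
frames = torch.randn(2, 120, 256)                   # frame-level ResNet features
embedding = pool(frames)                            # (2, 256) utterance embedding
```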

    Audio-Visual Wake Word Spotting System For MISP Challenge 2021

    This paper presents the details of our system designed for Task 1 of the Multimodal Information Based Speech Processing (MISP) Challenge 2021. The purpose of Task 1 is to leverage both audio and video information to improve the environmental robustness of far-field wake word spotting. In the proposed system, we first apply speech enhancement algorithms such as beamforming and weighted prediction error (WPE) to process the multi-microphone conversational audio. Second, several data augmentation techniques are applied to simulate a more realistic far-field scenario. For the video information, the provided region of interest (ROI) is used to obtain the visual representation. Multi-layer CNNs then learn audio and visual representations, which are fed into our two-branch attention-based network for fusion, such as a Transformer or Conformer. Focal loss is used to fine-tune the model and improves performance significantly. Finally, multiple trained models are integrated by voting to achieve our final score of 0.091.
    Comment: Accepted to ICASSP 202
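
    The focal loss mentioned above down-weights easy examples so training concentrates on hard ones. A minimal binary version is sketched below; the γ and α values are common defaults, not necessarily the settings used in this system.

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, gamma=2.0, alpha=0.25):
    """Binary focal loss sketch for wake word detection.

    logits:  (B,) raw scores for the wake-word class
    targets: (B,) 0/1 labels (float)
    """
    bce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p_t = torch.exp(-bce)                      # probability of the true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    return (alpha_t * (1 - p_t) ** gamma * bce).mean()

loss = focal_loss(torch.randn(8), torch.randint(0, 2, (8,)).float())
```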

    A Human Immunoglobulin λ Locus Is Similarly Well Expressed in Mice and Humans

    Transgenic mice carrying a 380-kb region of the human immunoglobulin (Ig) λ light (L) chain locus in germline configuration were created. The introduced translocus on a yeast artificial chromosome (YAC) accommodates the most proximal Igλ variable region (V) gene cluster, including 15 Vλ genes that contribute to >60% of λ L chains in humans, all Jλ-Cλ segments, and the 3′ enhancer. HuIgλYAC mice were bred with animals in which mouse Igκ production was silenced by gene targeting. In the κ−/− background, human Igλ was expressed by ∼84% of splenic B cells. A striking result was that human Igλ was also produced at high levels in mice with a normal κ locus. Analysis of bone marrow cells showed that human Igλ and mouse Igκ were expressed at similar levels throughout B cell development, suggesting that the Igλ translocus and the endogenous κ locus rearrange independently and with equal efficiency at the same developmental stage. This is further supported by the finding that in hybridomas expressing human Igλ the endogenous L chain loci were in germline configuration. The presence of somatic hypermutation in the human Vλ genes indicated that the Igλ-expressing cells function normally. The finding that human λ genes can be utilized with similar efficiency in mice and humans implies that L chain expression is critically dependent on the configuration of the locus.