TCT: A Cross-supervised Learning Method for Multimodal Sequence Representation
Multimodal inputs yield more promising performance than unimodal ones in most tasks. However, efficiently learning the semantics of representations from multiple modalities is extremely challenging. To tackle this, we propose the Transformer based Cross-modal Translator (TCT), which learns unimodal sequence representations by translating from other related multimodal sequences in a supervised manner. Combining TCT with the Multimodal Transformer Network (MTN), we evaluate MTN-TCT on video-grounded dialogue, a task that uses multiple modalities. The proposed method reports new state-of-the-art performance on video-grounded dialogue, indicating that the representations learned by TCT are more semantically meaningful than directly using unimodal features.
Comment: submitted to ICASSP 202
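As an illustration of the cross-modal translation idea described above, here is a minimal PyTorch sketch: one modality's sequence is supervised to reconstruct a related modality's sequence through a Transformer. The modality pairing, feature dimensions, and the MSE reconstruction loss are assumptions for illustration, not details taken from the paper.

    # Minimal sketch of a cross-modal translator in the spirit of TCT (PyTorch).
    # Dimensions, modality pairing, and the MSE loss are illustrative assumptions.
    import torch
    import torch.nn as nn

    class CrossModalTranslator(nn.Module):
        def __init__(self, src_dim=128, tgt_dim=256, d_model=256, nhead=4, nlayers=2):
            super().__init__()
            self.src_proj = nn.Linear(src_dim, d_model)   # project source modality (e.g. audio)
            self.tgt_proj = nn.Linear(tgt_dim, d_model)   # project target modality (e.g. video)
            self.transformer = nn.Transformer(d_model=d_model, nhead=nhead,
                                              num_encoder_layers=nlayers,
                                              num_decoder_layers=nlayers,
                                              batch_first=True)
            self.out = nn.Linear(d_model, tgt_dim)        # reconstruct target features

        def forward(self, src, tgt):
            h = self.transformer(self.src_proj(src), self.tgt_proj(tgt))
            return self.out(h)

    # Training signal: regressing the related modality supervises the unimodal
    # source representation through cross-modal translation.
    model = CrossModalTranslator()
    audio = torch.randn(8, 50, 128)   # (batch, time, feature)
    video = torch.randn(8, 30, 256)
    loss = nn.functional.mse_loss(model(audio, video), video)
    loss.backward()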
Cross-task pre-training for acoustic scene classification
Acoustic scene classification (ASC) and acoustic event detection (AED) are different but related tasks. Acoustic scenes are shaped by the acoustic events that occur in them, which can provide useful information for training ASC models. However, most datasets provide either the acoustic event labels or the scene labels, but not both. We therefore explore a cross-task pre-training mechanism that uses acoustic event information extracted from a pre-trained model to improve the ASC task. We present three cross-task pre-training architectures and evaluate them under feature-based and fine-tuning strategies on two datasets: the TAU Urban Acoustic Scenes 2019 dataset and the TUT Acoustic Scenes 2017 dataset. Results show that cross-task pre-training can significantly improve ASC performance: our best model improves over the official baseline by 9.5% relative on the TAU Urban Acoustic Scenes 2019 dataset and by 10% on the TUT Acoustic Scenes 2017 dataset.
Comment: submitted to ICASSP 202
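The two transfer strategies named above could look roughly like the following minimal PyTorch sketch; the encoder architecture, the checkpoint name aed_encoder.pt, the class count, and the learning rates are hypothetical, not taken from the paper.

    # Minimal sketch of feature-based vs. fine-tuning transfer (PyTorch).
    import torch
    import torch.nn as nn

    encoder = nn.Sequential(              # stand-in for an AED-pretrained CNN encoder
        nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(),
        nn.AdaptiveAvgPool2d(1), nn.Flatten())
    encoder.load_state_dict(torch.load("aed_encoder.pt"))   # hypothetical checkpoint

    # Strategy 1: feature-based -- freeze the pre-trained encoder, train only the ASC head.
    for p in encoder.parameters():
        p.requires_grad = False
    asc_head = nn.Linear(32, 10)          # e.g. 10 acoustic scene classes

    # Strategy 2: fine-tuning -- update encoder and head jointly, encoder at a smaller LR.
    optimizer = torch.optim.Adam([
        {"params": encoder.parameters(), "lr": 1e-5},
        {"params": asc_head.parameters(), "lr": 1e-3},
    ])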
Towards End-to-End Code-Switching Speech Recognition
Code-switching speech recognition has attracted increasing interest recently, but the need for expert linguistic knowledge has always been a major obstacle. End-to-end automatic speech recognition (ASR) simplifies the building of ASR systems considerably by predicting graphemes or characters directly from acoustic input, and at the same time eliminates the need for expert linguistic knowledge, which makes it an attractive choice for code-switching ASR. This paper presents a hybrid CTC-attention based end-to-end Mandarin-English code-switching (CS) speech recognition system and studies the effect of hybrid CTC-attention based models, different modeling units, the inclusion of language identification, and different decoding strategies on the code-switching ASR task. On the SEAME corpus, our system achieves a mixed error rate (MER) of 34.24%.
Comment: 5 pages, submitted to ICASSP 201
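For orientation, the hybrid CTC-attention objective named above interpolates a CTC loss on the encoder with a cross-entropy loss on the attention decoder. A minimal PyTorch sketch follows; the interpolation weight lam and the tensor layouts are illustrative assumptions, not values from the paper.

    # Minimal sketch of a hybrid CTC-attention training objective (PyTorch).
    import torch
    import torch.nn as nn

    ctc_loss = nn.CTCLoss(blank=0, zero_infinity=True)
    ce_loss = nn.CrossEntropyLoss(ignore_index=-1)

    def hybrid_loss(log_probs, enc_lens, att_logits, ys_pad, ys_lens, lam=0.3):
        # log_probs: (T, B, V) log-softmax outputs of the CTC branch
        # att_logits: (B, L, V) decoder logits; ys_pad: (B, L) labels padded with -1
        l_att = ce_loss(att_logits.transpose(1, 2), ys_pad)  # (B, V, L) vs (B, L)
        ys_ctc = ys_pad.clamp_min(0)          # CTC targets must be non-negative;
        l_ctc = ctc_loss(log_probs, ys_ctc,   # padding beyond ys_lens is ignored
                         enc_lens, ys_lens)
        return lam * l_ctc + (1.0 - lam) * l_att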
A comparable study of modeling units for end-to-end Mandarin speech recognition
End-to-end speech recognition has become increasingly popular for Mandarin and has achieved promising performance. Mandarin is a tonal language, which differs from English and requires special treatment of the acoustic modeling units. Several kinds of modeling units have been used for Mandarin, such as the phoneme, the syllable, and the Chinese character. In this work, we explore two major end-to-end models for Mandarin speech recognition: the connectionist temporal classification (CTC) model and the attention-based encoder-decoder model. We compare the performance of three modeling units at different scales: the context-dependent phoneme (CDP), the syllable with tone, and the Chinese character. We find that all three types of modeling units achieve comparable character error rates (CER) with the CTC model, and that the Chinese-character attention model outperforms the syllable attention model. Overall, the Chinese character is a reasonable unit for Mandarin speech recognition. On the DidiCallcenter task, the Chinese-character attention model achieves a CER of 5.68% and the CTC model a CER of 7.29%; on the DidiReading task, the CERs are 4.89% and 5.79%, respectively. Moreover, the attention model outperforms the CTC model on both datasets.
Comment: 5 page
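For reference, the character error rate (CER) compared above is the Levenshtein edit distance between the reference and hypothesis character sequences, divided by the reference length. A minimal self-contained sketch:

    # Minimal sketch of character error rate via edit distance.
    def cer(ref: str, hyp: str) -> float:
        # dp[i][j] = minimum edits to turn ref[:i] into hyp[:j]
        dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
        for i in range(len(ref) + 1):
            dp[i][0] = i
        for j in range(len(hyp) + 1):
            dp[0][j] = j
        for i in range(1, len(ref) + 1):
            for j in range(1, len(hyp) + 1):
                sub = dp[i-1][j-1] + (ref[i-1] != hyp[j-1])
                dp[i][j] = min(sub, dp[i-1][j] + 1, dp[i][j-1] + 1)
        return dp[len(ref)][len(hyp)] / max(len(ref), 1)

    print(cer("今天天气好", "今天天很好"))  # 1 substitution / 5 chars = 0.2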
Giant exchange bias and ferromagnetism in the CoO shell of Co/CoO-MgO core-shell nanoparticles
Using magnetron sputtering, we produced a series of Co/CoO-MgO nanoparticles on Si(100) substrates. High-resolution transmission electron microscopy (HRTEM) images show small isolated Co clusters (cores) covered with CoO shells, a few nm in size, embedded in an MgO matrix. The resistivity as a function of Co atomic ratio exhibits a distinct percolation threshold, with a sharp decrease around 69% Co content; across the threshold, the resistivity drops by about 7 orders of magnitude. For a sample at this percolation threshold, we observed a giant exchange bias field HE = 2460 Oe at T = 2 K, and using soft x-ray magnetic circular dichroism at the Co L2,3 edge we detected a ferromagnetic (FM) signal originating from the antiferromagnetic CoO shell. Moreover, decreasing the Mg impurities reduces both the FM signal from the CoO shell (namely the uncompensated spin density) and the size of HE, which directly supports the uncompensated-spin model.
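For orientation, an exchange bias field is conventionally read off as the shift of the hysteresis-loop center along the field axis after field cooling. A minimal arithmetic sketch; the two coercive-field values are chosen purely for illustration so that |HE| matches the 2460 Oe quoted above, and are not measured values from this work.

    # Minimal sketch: exchange bias HE and coercivity HC from the two
    # coercive fields of a shifted hysteresis loop (values in Oe).
    def exchange_bias(h_c_left: float, h_c_right: float):
        h_e = (h_c_left + h_c_right) / 2.0   # loop-center shift = exchange bias
        h_c = (h_c_right - h_c_left) / 2.0   # half the loop width = coercivity
        return h_e, h_c

    h_e, h_c = exchange_bias(-3460.0, -1460.0)   # hypothetical loop edges
    print(h_e, h_c)  # -2460.0 1000.0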
A Further Study of Unsupervised Pre-training for Transformer Based Speech Recognition
Building a good speech recognition system usually requires large amounts of transcribed data, which is expensive to collect. To tackle this problem, many unsupervised pre-training methods have been proposed. Among these methods, Masked Predictive Coding (MPC) achieved significant improvements on various speech recognition datasets with a BERT-like masked-reconstruction loss and a Transformer backbone. However, many aspects of MPC have not been fully investigated. In this paper, we conduct a further study of MPC and focus on three important aspects: the effect of the speaking style of the pre-training data, the extension of MPC to streaming models, and how to better transfer the knowledge learned in pre-training to downstream tasks. Experiments reveal that pre-training data whose speaking style matches the downstream task is more useful for recognition. A unified training objective combining APC and MPC provides an 8.46% relative error reduction for a streaming model trained on HKUST. The combination of target-data adaptation and layer-wise discriminative training also helps the knowledge transfer of MPC, achieving a 3.99% relative error reduction on AISHELL over a strong baseline.
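A minimal sketch of a BERT-like masked-reconstruction objective of the kind described above (PyTorch): random frame spans are zeroed out and the Transformer encoder is trained to reconstruct them. The mask ratio, span length, and L1 loss are common choices assumed here, not confirmed details of this paper.

    # Minimal sketch of an MPC-style masked-reconstruction objective (PyTorch).
    import torch
    import torch.nn as nn

    def mpc_loss(encoder, feats, mask_ratio=0.15, span=3):
        # feats: (batch, time, dim) acoustic features, e.g. filterbanks
        b, t, d = feats.shape
        mask = torch.zeros(b, t, dtype=torch.bool)
        for i in range(b):
            starts = torch.randperm(t - span)[: int(t * mask_ratio / span)]
            for s in starts:
                mask[i, s:s + span] = True    # mask contiguous frame spans
        corrupted = feats.masked_fill(mask.unsqueeze(-1), 0.0)
        pred = encoder(corrupted)             # dim-preserving Transformer encoder
        return nn.functional.l1_loss(pred[mask], feats[mask])  # masked frames only

    encoder = nn.TransformerEncoder(
        nn.TransformerEncoderLayer(d_model=80, nhead=4, batch_first=True),
        num_layers=2)
    loss = mpc_loss(encoder, torch.randn(4, 100, 80))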
Transformer based unsupervised pre-training for acoustic representation learning
Recently, a variety of acoustic tasks and related applications have arisen. For many acoustic tasks, the amount of labeled data may be limited. To handle this problem, we propose an unsupervised pre-training method using a Transformer based encoder to learn a general and robust high-level representation for all acoustic tasks. Experiments have been conducted on three kinds of acoustic tasks: speech emotion recognition, sound event detection, and speech translation. All the experiments show that pre-training on a task's own training data can significantly improve performance. With a larger pre-training corpus combining the MuST-C, Librispeech, and ESC-US datasets, the UAR for speech emotion recognition further improves by an absolute 4.3% on the IEMOCAP dataset. For sound event detection, the F1 score further improves by an absolute 1.5% on the DCASE2018 Task 5 development set and 2.1% on the evaluation set. For speech translation, the BLEU score further improves by 12.2% relative on the En-De dataset and 8.4% on the En-Fr dataset.
Comment: Accepted by ICASSP 202
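For clarity, the unweighted average recall (UAR) quoted for emotion recognition is the mean of per-class recalls, so rare classes count as much as frequent ones. A minimal sketch:

    # Minimal sketch of unweighted average recall (UAR).
    from collections import defaultdict

    def uar(refs, hyps):
        hit, total = defaultdict(int), defaultdict(int)
        for r, h in zip(refs, hyps):
            total[r] += 1
            hit[r] += int(r == h)
        # average recall over classes, independent of class frequency
        return sum(hit[c] / total[c] for c in total) / len(total)

    print(uar(["ang", "ang", "hap", "sad"],
              ["ang", "hap", "hap", "sad"]))  # (0.5 + 1.0 + 1.0) / 3 = 0.833...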
Audio Deep Fake Detection System with Neural Stitching for ADD 2022
This paper describes our best system and methodology for ADD 2022: the First Audio Deep Synthesis Detection Challenge\cite{Yi2022ADD}. The same system was used, with a similar training methodology, for both rounds of evaluation in Track 3.2. The first-round Track 3.2 data is generated by text-to-speech (TTS) or voice conversion (VC) algorithms, while the second-round data consists of fake audio generated by other participants in Track 3.1, aiming to spoof our systems. Our systems use a standard 34-layer ResNet with multi-head attention pooling \cite{india2019self} to learn a discriminative embedding for fake-audio and spoof detection. We further utilize neural stitching to boost the model's generalization capability so that it performs equally well across tasks; more details are given in the following sections. The experiments show that our proposed method outperforms all other systems, with a 10.1% equal error rate (EER) in Track 3.2.
Comment: Accepted to ICASSP 202
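A minimal sketch of multi-head attention pooling over frame-level features, in the spirit of the cited approach (PyTorch): each head learns its own attention weights over time, and the weighted sums are concatenated into a fixed-length utterance embedding. Head count and dimensions are illustrative assumptions.

    # Minimal sketch of multi-head attention pooling (PyTorch).
    import torch
    import torch.nn as nn

    class AttentionPooling(nn.Module):
        def __init__(self, dim=256, heads=4):
            super().__init__()
            assert dim % heads == 0
            self.heads, self.dh = heads, dim // heads
            self.score = nn.Linear(dim, heads)       # one attention score per head

        def forward(self, x):                        # x: (batch, time, dim)
            b, t, _ = x.shape
            w = torch.softmax(self.score(x), dim=1)  # (b, t, heads): weights over time
            x = x.view(b, t, self.heads, self.dh)    # split channels across heads
            pooled = (w.unsqueeze(-1) * x).sum(dim=1)  # (b, heads, dh)
            return pooled.flatten(1)                 # fixed-length embedding

    emb = AttentionPooling()(torch.randn(2, 120, 256))   # -> shape (2, 256)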
Audio-Visual Wake Word Spotting System For MISP Challenge 2021
This paper presents the details of our system designed for Task 1 of the Multimodal Information Based Speech Processing (MISP) Challenge 2021. The purpose of Task 1 is to leverage both audio and video information to improve the environmental robustness of far-field wake word spotting. In the proposed system, we first take advantage of speech enhancement algorithms such as beamforming and weighted prediction error (WPE) to process the multi-microphone conversational audio. Second, several data augmentation techniques are applied to simulate a more realistic far-field scenario. For the video information, the provided region of interest (ROI) is used to obtain the visual representation. Multi-layer CNNs then learn the audio and visual representations, and these representations are fed into a two-branch attention-based network, such as a transformer or conformer, for fusion. The focal loss is used to fine-tune the model and improves performance significantly. Finally, multiple trained models are integrated by voting to achieve our final score of 0.091.
Comment: Accepted to ICASSP 202
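A minimal sketch of the binary focal loss in the form commonly used for such fine-tuning (PyTorch): it down-weights easy examples so training focuses on hard ones. The gamma and alpha values are the usual defaults, not values taken from the paper.

    # Minimal sketch of binary focal loss (PyTorch).
    import torch

    def focal_loss(logits, targets, gamma=2.0, alpha=0.25):
        # logits: (batch,) raw wake-word scores; targets: (batch,) in {0, 1}
        p = torch.sigmoid(logits)
        pt = torch.where(targets == 1, p, 1 - p)   # probability of the true class
        at = torch.where(targets == 1,
                         torch.full_like(p, alpha),
                         torch.full_like(p, 1 - alpha))
        # (1 - pt)^gamma shrinks the loss of well-classified examples
        return (-at * (1 - pt) ** gamma * torch.log(pt.clamp_min(1e-8))).mean()

    loss = focal_loss(torch.randn(8), torch.randint(0, 2, (8,)).float())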
A Human Immunoglobulin λ Locus Is Similarly Well Expressed in Mice and Humans
Transgenic mice carrying a 380-kb region of the human immunoglobulin (Ig) λ light (L) chain locus in germline configuration were created. The introduced translocus on a yeast artificial chromosome (YAC) accommodates the most proximal Igλ variable region (V) gene cluster, including 15 Vλ genes that contribute to >60% of λ L chains in humans, all Jλ-Cλ segments, and the 3′ enhancer. HuIgλYAC mice were bred with animals in which mouse Igκ production was silenced by gene targeting. In the κ−/− background, human Igλ was expressed by ∼84% of splenic B cells. A striking result was that human Igλ was also produced at high levels in mice with a normal κ locus. Analysis of bone marrow cells showed that human Igλ and mouse Igκ were expressed at similar levels throughout B cell development, suggesting that the Igλ translocus and the endogenous κ locus rearrange independently and with equal efficiency at the same developmental stage. This is further supported by the finding that in hybridomas expressing human Igλ the endogenous L chain loci were in germline configuration. The presence of somatic hypermutation in the human Vλ genes indicated that the Igλ-expressing cells function normally. The finding that human λ genes can be utilized with similar efficiency in mice and humans implies that L chain expression is critically dependent on the configuration of the locus.