Leveraging native language information for improved accented speech recognition
Recognition of accented speech is a long-standing challenge for automatic
speech recognition (ASR) systems, given the increasing worldwide population of
bilingual speakers with English as their second language. If we consider
foreign-accented speech as an interpolation of the native language (L1) and
English (L2), using a model that can simultaneously address both languages
would perform better at the acoustic level for accented speech. In this study,
we explore how an end-to-end recurrent neural network (RNN) trained system with
English and native languages (Spanish and Indian languages) could leverage data
of native languages to improve performance for accented English speech. To this
end, we examine pre-training with native languages, as well as multi-task
learning (MTL) in which the main task is trained with native English and the
secondary task is trained with Spanish or Indian languages. We show that the
proposed MTL model performs better than the pre-training approach and
outperforms a baseline model trained simply with English data. We suggest a new
setting for MTL in which the secondary task is trained with both English and
the native language, using the same output set. This proposed scenario yields
better performance with +11.95% and +17.55% character error rate gains over
baseline for Hispanic and Indian accents, respectively.
Comment: Accepted at Interspeech 201
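As a rough illustration of the multi-task setup described above, the sketch below pairs a shared encoder with an English (main-task) output head and a native-language (secondary-task) head, combined through a weighted loss. This is a minimal PyTorch sketch under an assumed CTC objective, not the authors' implementation; all module names, dimensions, and the weight `alpha` are illustrative assumptions.

```python
import torch.nn as nn

class MultiTaskASR(nn.Module):
    """Shared encoder with an English head (main task) and an L1 head (secondary task)."""
    def __init__(self, feat_dim=40, hidden=320, en_vocab=30, l1_vocab=35, alpha=0.3):
        super().__init__()
        self.encoder = nn.LSTM(feat_dim, hidden, num_layers=3,
                               batch_first=True, bidirectional=True)
        self.en_head = nn.Linear(2 * hidden, en_vocab)   # main task: English characters
        self.l1_head = nn.Linear(2 * hidden, l1_vocab)   # secondary task: L1 characters
        self.alpha = alpha                               # weight of the secondary loss
        self.ctc = nn.CTCLoss(blank=0, zero_infinity=True)

    def forward(self, feats, feat_lens, en_tgt, en_lens, l1_tgt, l1_lens):
        enc, _ = self.encoder(feats)                     # (B, T, 2H), no subsampling assumed
        en_logp = self.en_head(enc).log_softmax(-1)      # (B, T, V_en)
        l1_logp = self.l1_head(enc).log_softmax(-1)      # (B, T, V_l1)
        # nn.CTCLoss expects (T, B, V) log-probabilities
        loss_en = self.ctc(en_logp.transpose(0, 1), en_tgt, feat_lens, en_lens)
        loss_l1 = self.ctc(l1_logp.transpose(0, 1), l1_tgt, feat_lens, l1_lens)
        # in the proposed MTL variant, the secondary head would also be trained on
        # English data and share the same output set as the main head
        return loss_en + self.alpha * loss_l1
```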
A Speaker Diarization System for Studying Peer-Led Team Learning Groups
Peer-led team learning (PLTL) is a model for teaching STEM courses where
small student groups meet periodically to collaboratively discuss coursework.
Automatic analysis of PLTL sessions would help education researchers to get
insight into how learning outcomes are impacted by individual participation,
group behavior, team dynamics, etc. Towards this, speech and language
technology can help, and speaker diarization technology will lay the foundation
for analysis. In this study, a new corpus called CRSS-PLTL is established, which
contains speech data from 5 PLTL teams over a semester (10 sessions per team
with 5-to-8 participants in each team). In CRSS-PLTL, every participant wears a
LENA device (portable audio recorder) that provides multiple audio recordings
of the event. Our proposed solution is unsupervised and contains a new online
speaker change detection algorithm, termed the G3 algorithm, in conjunction with
Hausdorff-distance based clustering to provide improved detection accuracy.
Additionally, we exploit cross-channel information to refine our
diarization hypothesis. The proposed system provides good improvements in
diarization error rate (DER) over the baseline LIUM system. We also present
higher-level analysis such as the number of conversational turns taken in a
session, and speaking-time duration (participation) for each speaker.
Comment: 5 pages, 2 figures, 2 tables, Proceedings of INTERSPEECH 2016, San Francisco, US
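To make the clustering step concrete, here is a small NumPy sketch of the Hausdorff distance between two segments represented as sets of per-frame feature vectors. It is one illustrative reading of "Hausdorff-distance based clustering", not the CRSS-PLTL implementation; the segment representation, feature dimension, and merge threshold are all assumptions.

```python
import numpy as np

def hausdorff_distance(seg_a, seg_b):
    """Hausdorff distance between two segments given as (N, D) frame arrays."""
    # pairwise Euclidean distances between every frame of seg_a and every frame of seg_b
    d = np.linalg.norm(seg_a[:, None, :] - seg_b[None, :, :], axis=-1)
    # largest "nearest-neighbor" distance in either direction
    return max(d.min(axis=1).max(), d.min(axis=0).max())

# Segments whose distance falls below a tuned threshold would be merged into the
# same speaker cluster during agglomerative clustering.
seg_a = np.random.randn(120, 20)   # e.g. 120 frames of 20-dim features
seg_b = np.random.randn(95, 20)
print(hausdorff_distance(seg_a, seg_b))
```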
Cool Customers in the Stellar Graveyard IV: Spitzer Search for Mid-IR excesses Around Five DAs
Hydrogen atmosphere white dwarfs with metal lines, so-called DAZs, require
external accretion of material to explain the presence of weak metal line
absorption in their photospheres. The source of this material is currently
unknown, but could come from the interstellar medium, unseen companions, or
relic planetesimals from asteroid belt or Kuiper belt analogues. Accurate
mid-infrared photometry of these white dwarfs provides additional information to
solve the mystery of this accretion and to look for evidence of planetary
systems that have survived post main sequence evolution. We present {\em
Spitzer} IRAC photometry accurate to 3% for four DAZs and one DA with
circumstellar absorption lines in the UV. We search for excesses due to unseen
companions or circumstellar dust disks. We use {\em Hubble Space Telescope}
NICMOS imaging of these white dwarfs to gauge the level of background
contamination to our targets as well as rule out common proper motion
companions to WD 1620-391. All of our targets show no excesses due to
companions $\gtrsim$ 20 M$_{Jup}$, ruling out all but very low mass companions to these
white dwarfs at all separations. No excesses due to circumstellar disks are
observed, and we place limits on what types of disks may still be present.
Comment: 18 pages, 8 figures, Accepted to A
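As a toy numerical illustration of the excess search described above (the function and numbers are made up, not taken from the paper): with roughly 3% photometric uncertainty, an observed IRAC flux is compared to the predicted photospheric flux and flagged as an excess only if the difference is significant.

```python
def excess_significance(f_obs, f_phot, frac_err=0.03):
    """Significance (in sigma) of an apparent mid-IR excess, assuming a
    fractional photometric uncertainty `frac_err` on the observed flux."""
    return (f_obs - f_phot) / (frac_err * f_obs)

# A measurement 2% above the model photosphere is well under 1 sigma with
# 3% photometry, so it would not count as an infrared excess.
print(excess_significance(1.02, 1.00))   # ~0.65
```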
Speaker Representation Learning using Global Context Guided Channel and Time-Frequency Transformations
In this study, we propose the global context guided channel and
time-frequency transformations to model the long-range, non-local
time-frequency dependencies and channel variances in speaker representations.
We use the global context information to enhance important channels and
recalibrate salient time-frequency locations by computing the similarity
between the global context and local features. The proposed modules, together
with a popular ResNet based model, are evaluated on the VoxCeleb1 dataset,
which is a large scale speaker verification corpus collected in the wild. This
lightweight block can be easily incorporated into a CNN model with little
additional computational cost and effectively improves speaker verification
performance by a large margin compared to the baseline ResNet-LDE model and the
Squeeze&Excitation block. Detailed ablation studies are also
performed to analyze various factors that may impact the performance of the
proposed modules. We find that by employing the proposed L2-tf-GTFC
transformation block, the Equal Error Rate decreases from 4.56% to 3.07%, a
relative 32.68% reduction, along with a relative 27.28% improvement in the
DCF score. The results indicate that our proposed global context guided
transformation modules can efficiently improve the learned speaker
representations by achieving time-frequency and channel-wise feature
recalibration.
Comment: Accepted to Interspeech 202
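The sketch below is one way to read the channel-recalibration idea in this abstract: a softmax-pooled global context vector gates each channel of a time-frequency feature map. It is a hedged PyTorch approximation, not the released module; the layer shapes and reduction ratio are assumptions.

```python
import torch
import torch.nn as nn

class GlobalContextChannelGate(nn.Module):
    """Gate each channel of a (B, C, F, T) feature map with a global context vector."""
    def __init__(self, channels, reduction=8):
        super().__init__()
        self.context = nn.Conv2d(channels, 1, kernel_size=1)   # attention map over T-F
        self.transform = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
            nn.Sigmoid(),
        )

    def forward(self, x):                                       # x: (B, C, F, T)
        b, c, f, t = x.shape
        attn = self.context(x).view(b, 1, f * t).softmax(-1)    # (B, 1, F*T)
        ctx = torch.bmm(x.view(b, c, f * t), attn.transpose(1, 2))  # (B, C, 1) pooled context
        gate = self.transform(ctx.view(b, c, 1, 1))             # (B, C, 1, 1) channel gates
        return x * gate                                          # channel-wise recalibration
```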
Complex-Valued Time-Frequency Self-Attention for Speech Dereverberation
Several speech processing systems have demonstrated considerable performance
improvements when deep complex neural networks (DCNN) are coupled with
self-attention (SA) networks. However, the majority of DCNN-based studies on
speech dereverberation that employ self-attention do not explicitly account for
the inter-dependencies between real and imaginary features when computing
attention. In this study, we propose a complex-valued T-F attention (TFA)
module that models spectral and temporal dependencies by computing
two-dimensional attention maps across time and frequency dimensions. We
validate the effectiveness of our proposed complex-valued TFA module with the
deep complex convolutional recurrent network (DCCRN) using the REVERB challenge
corpus. Experimental findings indicate that integrating our complex-TFA module
with DCCRN improves overall speech quality and performance of back-end speech
applications, such as automatic speech recognition, compared to earlier
approaches for self-attention.
Comment: Interspeech 2022: ISCA Best Student Paper Award Finalist
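A possible reading of the complex-valued T-F attention idea is sketched below: attention maps along time and along frequency are computed from the concatenated real and imaginary feature maps, so the two components are gated jointly rather than independently. This is an assumption-laden PyTorch sketch, not the paper's module.

```python
import torch
import torch.nn as nn

class SimpleComplexTFA(nn.Module):
    """Joint time and frequency gating for complex (real, imag) feature maps."""
    def __init__(self, channels):
        super().__init__()
        # 2*channels because real and imaginary feature maps are concatenated
        self.time_attn = nn.Conv1d(2 * channels, channels, kernel_size=1)
        self.freq_attn = nn.Conv1d(2 * channels, channels, kernel_size=1)

    def forward(self, real, imag):                   # each: (B, C, F, T)
        x = torch.cat([real, imag], dim=1)           # (B, 2C, F, T)
        t_map = torch.sigmoid(self.time_attn(x.mean(dim=2)))   # (B, C, T) attention over time
        f_map = torch.sigmoid(self.freq_attn(x.mean(dim=3)))   # (B, C, F) attention over frequency
        attn = f_map.unsqueeze(-1) * t_map.unsqueeze(2)         # (B, C, F, T) 2-D attention map
        return real * attn, imag * attn              # both components gated jointly
```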
Efficient Adapter Tuning of Pre-trained Speech Models for Automatic Speaker Verification
With excellent generalization ability, self-supervised speech models have
shown impressive performance on various downstream speech tasks in the
pre-training and fine-tuning paradigm. However, as pre-trained models grow in
size, fine-tuning becomes practically infeasible due to heavy
computation and storage overhead, as well as the risk of overfitting. Adapters
are lightweight modules inserted into pre-trained models to facilitate
parameter-efficient adaptation. In this paper, we propose an effective adapter
framework designed for adapting self-supervised speech models to the speaker
verification task. With a parallel adapter design, our proposed framework
inserts two types of adapters into the pre-trained model, allowing the
adaptation of latent features within intermediate Transformer layers and output
embeddings from all Transformer layers. We conduct comprehensive experiments to
validate the efficiency and effectiveness of the proposed framework.
Experimental results on the VoxCeleb1 dataset demonstrate that the proposed
adapters surpass fine-tuning and other parameter-efficient transfer learning
methods, achieving superior performance while updating only 5% of the
parameters.
Comment: Accepted to ICASSP 202
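To illustrate the parallel-adapter idea, the sketch below wraps a frozen pre-trained Transformer layer with a small bottleneck adapter whose output is added to the layer output, so only the adapter weights are trained. The wrapper interface, bottleneck size, and activation are illustrative assumptions, not the paper's code.

```python
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    """Down-project, non-linearity, up-project; only these weights are trained."""
    def __init__(self, dim, bottleneck=64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)
        self.act = nn.GELU()

    def forward(self, x):
        return self.up(self.act(self.down(x)))

class ParallelAdapterLayer(nn.Module):
    """Wrap a frozen pre-trained Transformer layer with a parallel adapter branch."""
    def __init__(self, frozen_layer, dim, bottleneck=64):
        super().__init__()
        self.layer = frozen_layer
        for p in self.layer.parameters():
            p.requires_grad = False              # backbone stays frozen
        self.adapter = BottleneckAdapter(dim, bottleneck)

    def forward(self, x):
        # adapter runs in parallel with the frozen layer and its output is added
        return self.layer(x) + self.adapter(x)
```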
MixRep: Hidden Representation Mixup for Low-Resource Speech Recognition
In this paper, we present MixRep, a simple and effective data augmentation
strategy based on mixup for low-resource ASR. MixRep interpolates the feature
dimensions of hidden representations in the neural network; it can be applied
to both the acoustic feature input and the output of each layer, generalizing
the previous MixSpeech method. Further, we propose to combine the mixup with a
regularization along the time axis of the input, which is shown to be
complementary. We apply MixRep to a Conformer encoder of an E2E LAS
architecture trained with a joint CTC loss. We experiment on the WSJ dataset
and subsets of the SWB dataset, covering read and conversational telephone
speech. Experimental results show that MixRep consistently outperforms other
regularization methods for low-resource ASR. Compared to a strong SpecAugment
baseline, MixRep achieves +6.5% and +6.7% relative WER reductions on the
eval92 set and the Callhome part of the eval'2000 set, respectively.
Comment: Accepted to Interspeech 202
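The core mixup operation that MixRep applies to hidden representations can be sketched as below (hypothetical function name, not the released code): two utterances in a batch are interpolated with a Beta-sampled weight, and the same weight and permutation are later reused to mix their training losses.

```python
import torch

def mix_hidden(h, alpha=0.2):
    """Mixup on hidden representations h of shape (B, T, D)."""
    # interpolation weight drawn from a Beta(alpha, alpha) distribution
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    perm = torch.randperm(h.size(0))             # pair each utterance with another in the batch
    mixed = lam * h + (1.0 - lam) * h[perm]      # interpolate the hidden features
    # perm and lam are returned so the loss can be mixed accordingly, e.g.
    # lam * loss(mixed, targets) + (1 - lam) * loss(mixed, targets[perm])
    return mixed, perm, lam
```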
Data-driven Attention and Data-independent DCT based Global Context Modeling for Text-independent Speaker Recognition
Learning an effective speaker representation is crucial for achieving
reliable performance in speaker verification tasks. Speech signals are
high-dimensional, long, and variable-length sequences that entail a complex
hierarchical structure. Signals may contain diverse information at each
time-frequency (TF) location. For example, it may be more beneficial to focus
on high-energy parts for phoneme classes such as fricatives. The standard
convolutional layer that operates on neighboring local regions cannot capture
the complex TF global context information. In this study, a general global
time-frequency context modeling framework is proposed to leverage the context
information specifically for speaker representation modeling. First, a
data-driven attention-based context model is introduced to capture the
long-range and non-local relationship across different time-frequency
locations. Second, a data-independent 2D-DCT based context model is proposed to
improve model interpretability. A multi-DCT attention mechanism is presented to
improve modeling power with alternate DCT base forms. Finally, the global
context information is used to recalibrate salient time-frequency locations by
computing the similarity between the global context and local features. The
proposed lightweight blocks can be easily incorporated into a speaker model
with little additional computational cost and effectively improve speaker
verification performance by a large margin compared to the standard ResNet
model and the Squeeze&Excitation block. Detailed ablation studies are also
performed to analyze various factors that may impact the performance of the
proposed individual modules. Results from experiments show that the proposed
global context modeling framework can efficiently improve the learned speaker
representations by achieving channel-wise and time-frequency feature
recalibration.
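As a sketch of the data-independent variant described above (dimensions, basis indices, and the gating network are assumptions, not the paper's configuration, and only a single DCT basis is used for simplicity rather than the multi-DCT attention), each channel's time-frequency map can be projected onto a fixed 2D-DCT basis to obtain a global context vector, which then recalibrates the channels:

```python
import math
import torch
import torch.nn as nn

def dct_basis(n, k):
    """k-th 1D DCT-II basis vector of length n, normalized to unit norm."""
    i = torch.arange(n, dtype=torch.float32)
    b = torch.cos(math.pi * (i + 0.5) * k / n)
    return b / b.norm()

class DCTContextGate(nn.Module):
    """Gate channels of a (B, C, F, T) map using a fixed 2D-DCT context projection."""
    def __init__(self, channels, freq_bins, time_frames, kf=0, kt=0, reduction=8):
        super().__init__()
        # fixed (data-independent) 2D-DCT basis: outer product of two 1D bases
        basis = torch.outer(dct_basis(freq_bins, kf), dct_basis(time_frames, kt))
        self.register_buffer("basis", basis)          # (F, T), not trained
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(),
            nn.Linear(channels // reduction, channels), nn.Sigmoid(),
        )

    def forward(self, x):                              # x: (B, C, F, T)
        ctx = (x * self.basis).sum(dim=(2, 3))         # (B, C) one DCT coefficient per channel
        gate = self.fc(ctx)                            # (B, C) channel gates from global context
        return x * gate.unsqueeze(-1).unsqueeze(-1)    # channel-wise recalibration
```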