Text-Only Domain Adaptation for End-to-End Speech Recognition through Down-Sampling Acoustic Representation
Mapping the speech and text modalities into a shared representation space is a line of research that uses text-only data to improve end-to-end automatic speech recognition (ASR) performance in new domains. However, speech representations and text representations differ in length. Although previous methods up-sample the text representation to align with the acoustic modality, the result may not match the actual expected duration. In this paper, we propose a novel representation matching strategy that down-samples the acoustic representation to align with the text modality. By introducing a continuous integrate-and-fire (CIF) module that generates acoustic representations consistent with the token length, our ASR model can better learn unified representations from both modalities, enabling domain adaptation with text-only data from the target domain. Experimental results on new-domain data demonstrate the effectiveness of the proposed method.
Comment: Accepted by INTERSPEECH 2023. arXiv admin note: text overlap with arXiv:2309.0143
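
A minimal sketch of the continuous integrate-and-fire (CIF) down-sampling idea described above, assuming frame-level encoder outputs and per-frame sigmoid weights; the function and variable names, the firing threshold, and the handling of the trailing accumulation are illustrative assumptions rather than the paper's exact implementation.

    import torch

    def cif_downsample(frames: torch.Tensor, alpha: torch.Tensor,
                       threshold: float = 1.0) -> torch.Tensor:
        """Integrate per-frame weights until `threshold` is reached, then fire
        one token-level vector as the weighted sum of the accumulated frames.
        frames: (T, D) acoustic encoder outputs; alpha: (T,) sigmoid weights."""
        tokens = []
        accum_weight = 0.0
        accum_state = torch.zeros_like(frames[0])
        for h_t, a_t in zip(frames, alpha):
            a_t = float(a_t)
            if accum_weight + a_t < threshold:
                # keep integrating this frame into the current token
                accum_weight += a_t
                accum_state = accum_state + a_t * h_t
            else:
                # split the frame's weight: part completes the current token,
                # the remainder starts the next one
                fire_part = threshold - accum_weight
                tokens.append(accum_state + fire_part * h_t)
                accum_weight = a_t - fire_part
                accum_state = accum_weight * h_t
        # the trailing partial accumulation is dropped in this simplified sketch
        return torch.stack(tokens) if tokens else accum_state.unsqueeze(0)

    # Toy usage: 50 acoustic frames of dimension 256 compressed to ~token length.
    frames = torch.randn(50, 256)
    alpha = torch.sigmoid(torch.randn(50))
    token_level = cif_downsample(frames, alpha)
    print(token_level.shape)  # roughly (floor(sum(alpha)), 256)
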
FastGraphTTS: An Ultrafast Syntax-Aware Speech Synthesis Framework
This paper integrates graph-to-sequence modelling into an end-to-end text-to-speech framework for syntax-aware modelling of the input text. Specifically, the input text is parsed by a dependency parsing module to form a syntactic graph. The syntactic graph is then encoded by a graph encoder to extract syntactic hidden information, which is concatenated with the phoneme embedding and fed to the alignment and flow-based decoding modules to generate the raw audio waveform. The model is evaluated on two languages, English and Mandarin, using single-speaker, few-shot target-speaker, and multi-speaker datasets, respectively. Experimental results show better prosodic consistency between the input text and the generated audio, higher scores in the subjective prosodic evaluation, and the ability to perform voice conversion. In addition, the efficiency of the model is largely boosted through the design of an AI-chip operator, yielding a 5x acceleration.
Comment: Accepted by The 35th IEEE International Conference on Tools with Artificial Intelligence (ICTAI 2023).
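
A minimal sketch of how syntactic graph features might be fused with phoneme embeddings as the abstract describes, assuming a word-level dependency adjacency matrix, a single GCN-style message-passing layer, and a word-to-phoneme index map; all module names and dimensions are illustrative and not taken from FastGraphTTS.

    import torch
    import torch.nn as nn

    class SyntaxAwareFrontEnd(nn.Module):
        def __init__(self, n_phonemes=100, phone_dim=192, word_dim=128, syn_dim=64):
            super().__init__()
            self.phone_emb = nn.Embedding(n_phonemes, phone_dim)
            self.graph_layer = nn.Linear(word_dim, syn_dim)  # one GCN-style layer

        def forward(self, phone_ids, word_feats, adjacency, phone2word):
            # phone_ids: (P,) phoneme indices; word_feats: (W, word_dim);
            # adjacency: (W, W) dependency graph; phone2word: (P,) index map
            deg = adjacency.sum(dim=-1, keepdim=True).clamp(min=1.0)
            neighbor_mean = adjacency @ word_feats / deg     # aggregate neighbours
            syn_hidden = torch.tanh(self.graph_layer(neighbor_mean))
            # broadcast each word's syntactic vector to its phonemes, then concat
            syn_per_phone = syn_hidden[phone2word]
            return torch.cat([self.phone_emb(phone_ids), syn_per_phone], dim=-1)

    # Toy usage: a 3-word sentence with 7 phonemes.
    frontend = SyntaxAwareFrontEnd()
    phones = torch.tensor([4, 9, 9, 17, 3, 5, 8])
    words = torch.randn(3, 128)
    adj = torch.tensor([[0., 1., 0.], [1., 0., 1.], [0., 1., 0.]])
    p2w = torch.tensor([0, 0, 1, 1, 1, 2, 2])
    print(frontend(phones, words, adj, p2w).shape)  # (7, 192 + 64)
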
Exploring Timbre Disentanglement in Non-Autoregressive Cross-Lingual Text-to-Speech
In this paper, we present a FastPitch-based non-autoregressive cross-lingual Text-to-Speech (TTS) model built with a language-independent input representation and monolingual forced aligners. We propose a phoneme length regulator that solves the length mismatch between language-independent phonemes and monolingual alignment results. Our experiments show that (1) an increasing number of training speakers encourages the non-autoregressive cross-lingual TTS model to disentangle speaker and language representations, and (2) the variance adaptors of the FastPitch model help disentangle speaker identity from the learned representations in cross-lingual TTS. The subjective evaluation shows that our proposed model achieves decent speaker consistency and similarity. We further improve the naturalness of Mandarin-dominated mixed-lingual utterances by utilizing the controllability of our proposed model.
Comment: Submitted to ICASSP 202
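
For context, a minimal sketch of a FastSpeech/FastPitch-style length regulator, the duration-expansion step that a phoneme length regulator builds on; this shows the generic technique, not the paper's exact regulator, and the shapes are illustrative.

    import torch

    def length_regulate(phone_hidden: torch.Tensor,
                        durations: torch.Tensor) -> torch.Tensor:
        """phone_hidden: (P, D) phoneme-level encoder outputs;
        durations: (P,) integer frame counts per phoneme.
        Returns frame-level hidden states of shape (sum(durations), D)."""
        return torch.repeat_interleave(phone_hidden, durations, dim=0)

    # Toy usage: 4 phonemes expanded to 2 + 3 + 1 + 4 = 10 frames.
    hidden = torch.randn(4, 256)
    durs = torch.tensor([2, 3, 1, 4])
    frames = length_regulate(hidden, durs)
    print(frames.shape)  # torch.Size([10, 256])
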
Nonparallel Emotional Speech Conversion
We propose a nonparallel, data-driven emotional speech conversion method. It transfers the emotion-related characteristics of a speech signal while preserving the speaker's identity and linguistic content. Most existing approaches require parallel data and time alignment, which are unavailable in most real applications. We achieve nonparallel training with an unsupervised style transfer technique that learns a translation model between two distributions rather than a deterministic one-to-one mapping between paired examples. The conversion model consists of an encoder and a decoder for each emotion domain. We assume that the speech signal can be decomposed into an emotion-invariant content code and an emotion-related style code in latent space. Emotion conversion is performed by extracting and recombining the content code of the source speech and the style code of the target emotion. We tested our method on a nonparallel corpus with four emotions. Both subjective and objective evaluations show the effectiveness of our approach.
Comment: Published in INTERSPEECH 2019, 5 pages, 6 figures. Simulation available at http://www.jian-gao.org/emoga
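
A minimal sketch of the content/style recombination described above: encode the source utterance into an emotion-invariant content code, take a style code from a target-emotion reference, and decode their combination; the GRU-based modules, layer sizes, and concatenation-based fusion are illustrative assumptions, not the paper's architecture.

    import torch
    import torch.nn as nn

    class EmotionConverter(nn.Module):
        def __init__(self, feat_dim=80, content_dim=64, style_dim=16):
            super().__init__()
            self.content_enc = nn.GRU(feat_dim, content_dim, batch_first=True)
            self.style_enc = nn.GRU(feat_dim, style_dim, batch_first=True)
            self.decoder = nn.GRU(content_dim + style_dim, feat_dim, batch_first=True)

        def convert(self, src_feats, tgt_ref_feats):
            # src_feats: (1, T, feat_dim) source utterance (e.g. mel spectrogram)
            # tgt_ref_feats: (1, T2, feat_dim) reference utterance in target emotion
            content, _ = self.content_enc(src_feats)     # (1, T, content_dim)
            _, style = self.style_enc(tgt_ref_feats)     # (1, 1, style_dim) final state
            style = style.transpose(0, 1).expand(-1, content.size(1), -1)
            converted, _ = self.decoder(torch.cat([content, style], dim=-1))
            return converted                             # (1, T, feat_dim)

    # Toy usage: convert a neutral utterance using an angry reference.
    model = EmotionConverter()
    src = torch.randn(1, 120, 80)
    ref = torch.randn(1, 90, 80)
    print(model.convert(src, ref).shape)  # torch.Size([1, 120, 80])
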
Content-Dependent Fine-Grained Speaker Embedding for Zero-Shot Speaker Adaptation in Text-to-Speech Synthesis
Zero-shot speaker adaptation aims to clone an unseen speaker's voice without any adaptation time or parameters. Previous studies usually use a speaker encoder to extract a global fixed speaker embedding from the reference speech, and several attempts have explored variable-length speaker embeddings. However, they neglect to transfer the personal pronunciation characteristics related to phoneme content, leading to poor speaker similarity in terms of detailed speaking styles and pronunciation habits. To improve the speaker encoder's ability to model personal pronunciation characteristics, we propose content-dependent fine-grained speaker embeddings for zero-shot speaker adaptation. Local content embeddings and local speaker embeddings are extracted from the reference speech. Instead of modeling temporal relations, a reference attention module is introduced to model the content relevance between the reference speech and the input text, and to generate a fine-grained speaker embedding for each phoneme encoder output. Experimental results show that our proposed method improves the speaker similarity of synthesized speech, especially for unseen speakers.
Comment: Submitted to Interspeech 202
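
A minimal sketch of a reference attention module as described above, where phoneme encoder outputs query local content embeddings from the reference speech and aggregate the corresponding local speaker embeddings; the single-head scaled dot-product form and all dimensions are illustrative assumptions.

    import torch
    import torch.nn.functional as F

    def reference_attention(phone_hidden, ref_content, ref_speaker):
        """phone_hidden: (P, D) text-side queries (phoneme encoder outputs).
        ref_content: (T, D) local content embeddings from reference speech frames.
        ref_speaker: (T, S) local speaker embeddings from reference speech frames.
        Returns (P, S): a content-dependent speaker embedding per phoneme."""
        scores = phone_hidden @ ref_content.t() / phone_hidden.size(-1) ** 0.5
        weights = F.softmax(scores, dim=-1)   # relevance of each reference frame
        return weights @ ref_speaker

    # Toy usage: 12 phonemes attend over 60 reference frames.
    phones = torch.randn(12, 256)
    content = torch.randn(60, 256)
    speaker = torch.randn(60, 64)
    per_phone_spk = reference_attention(phones, content, speaker)
    print(per_phone_spk.shape)  # torch.Size([12, 64])
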
Conversational Speech Recognition by Learning Audio-textual Cross-modal Contextual Representation
Automatic Speech Recognition (ASR) in conversational settings presents unique
challenges, including extracting relevant contextual information from previous
conversational turns. Due to irrelevant content, error propagation, and
redundancy, existing methods struggle to extract longer and more effective
contexts. To address this issue, we introduce a novel conversational ASR system that extends the Conformer encoder-decoder model with cross-modal conversational representations. Our approach leverages a cross-modal extractor that combines pre-trained speech and text models through a specialized encoder and a modal-level mask input. This enables the extraction of richer historical speech context without explicit error propagation. We also incorporate conditional latent variational modules to learn conversation-level attributes such as role preference and topic coherence. By introducing both cross-modal and conversational representations into the decoder, our model retains context over longer sentences without information loss, achieving relative accuracy improvements of 8.8% and 23% on the Mandarin conversation datasets HKUST and MagicData-RAMC, respectively, compared to the standard Conformer model.
Comment: Submitted to TASL
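
A minimal sketch of a modal-level mask over historical speech and text representations, under the assumption that an entire modality is occasionally dropped during training so the extractor does not depend on both streams being present at inference time; the drop probabilities and concatenation-based fusion are illustrative, not the paper's exact design.

    import torch

    def modal_level_mask(speech_ctx, text_ctx, p_drop_speech=0.3, p_drop_text=0.3):
        """speech_ctx: (Ts, D) pre-trained speech-model features of prior turns.
        text_ctx: (Tt, D) pre-trained text-model features of prior transcripts.
        Returns the concatenated context with at most one modality zeroed out."""
        drop_speech = torch.rand(1).item() < p_drop_speech
        drop_text = (not drop_speech) and torch.rand(1).item() < p_drop_text
        if drop_speech:
            speech_ctx = torch.zeros_like(speech_ctx)
        if drop_text:
            text_ctx = torch.zeros_like(text_ctx)
        return torch.cat([speech_ctx, text_ctx], dim=0)   # (Ts + Tt, D)

    # Toy usage: 80 historical speech frames and 20 text tokens of shared width 512.
    ctx = modal_level_mask(torch.randn(80, 512), torch.randn(20, 512))
    print(ctx.shape)  # torch.Size([100, 512])
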