Towards Hierarchical Spoken Language Dysfluency Modeling
Speech disfluency modeling is the bottleneck for both speech therapy and
language learning. However, there is no effective AI solution to systematically
tackle this problem. We solidify the concept of disfluent speech and disfluent
speech modeling. We then present the Hierarchical Unconstrained Disfluency
Modeling (H-UDM) approach, a hierarchical extension of UDM that addresses both
disfluency transcription and detection while eliminating the need for extensive
manual annotation. Our experimental results provide clear evidence of the
effectiveness and reliability of the proposed methods on both transcription and
detection tasks.
Comment: 2024 EACL. Hierarchical extension of our previous workshop paper arXiv:2312.1281
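As a rough illustration of the hierarchical idea (not the authors' implementation), the sketch below rolls hypothetical phoneme-level dysfluency events up to word-level labels using only coarse word timings; all class and field names are assumptions.

```python
# Minimal sketch (assumed data structures, not the paper's code): aggregate
# phoneme-level dysfluency events into word-level labels via time alignment.
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class PhonemeEvent:
    phoneme: str
    start: float                 # seconds
    end: float
    dysfluency: Optional[str]    # e.g. "repetition", "block", or None

@dataclass
class Word:
    text: str
    start: float
    end: float

def word_level_dysfluencies(words: List[Word], events: List[PhonemeEvent]) -> List[List[str]]:
    """Attach each phoneme-level dysfluency to the word whose span contains its onset."""
    labels: List[List[str]] = [[] for _ in words]
    for ev in events:
        if ev.dysfluency is None:
            continue
        for i, w in enumerate(words):
            if w.start <= ev.start < w.end:
                labels[i].append(ev.dysfluency)
                break
    return labels
```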
Robust Disentangled Variational Speech Representation Learning for Zero-shot Voice Conversion
Traditional studies on voice conversion (VC) have made progress with parallel
training data and known speakers. Good voice conversion quality is obtained by
exploring better alignment modules or expressive mapping functions. In this
study, we investigate zero-shot VC from a novel perspective of self-supervised
disentangled speech representation learning. Specifically, we achieve the
disentanglement by balancing the information flow between global speaker
representation and time-varying content representation in a sequential
variational autoencoder (VAE). Zero-shot voice conversion is then performed by
feeding an arbitrary speaker embedding and the content embeddings to the VAE
decoder. In addition, an on-the-fly data augmentation training strategy is
applied to make the learned representations noise-invariant. On the TIMIT and
VCTK datasets, we achieve state-of-the-art performance on both objective
evaluation, i.e., speaker verification (SV) on the speaker and content
embeddings, and subjective evaluation, i.e., voice naturalness and similarity,
and the system remains robust even with noisy source/target utterances.
Comment: Accepted to 2022 ICASSP
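To make the disentanglement scheme concrete, here is a minimal PyTorch sketch of a sequential VAE with one global speaker latent and per-frame content latents; layer types and sizes are illustrative assumptions rather than the paper's exact architecture.

```python
# Sketch only: one utterance-level speaker latent, one latent per frame for
# content, both combined in the decoder.
import torch
import torch.nn as nn

class SequentialVAE(nn.Module):
    def __init__(self, n_mels=80, z_spk=64, z_con=32, hidden=256):
        super().__init__()
        self.encoder = nn.LSTM(n_mels, hidden, batch_first=True, bidirectional=True)
        # Global speaker posterior from time-pooled encoder states.
        self.spk_mu = nn.Linear(2 * hidden, z_spk)
        self.spk_logvar = nn.Linear(2 * hidden, z_spk)
        # Time-varying content posterior, one latent per frame.
        self.con_mu = nn.Linear(2 * hidden, z_con)
        self.con_logvar = nn.Linear(2 * hidden, z_con)
        self.decoder = nn.LSTM(z_spk + z_con, hidden, batch_first=True)
        self.out = nn.Linear(hidden, n_mels)

    @staticmethod
    def reparam(mu, logvar):
        return mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)

    def forward(self, mel):                      # mel: (B, T, n_mels)
        h, _ = self.encoder(mel)                 # (B, T, 2*hidden)
        pooled = h.mean(dim=1)                   # utterance-level summary
        z_spk = self.reparam(self.spk_mu(pooled), self.spk_logvar(pooled))
        z_con = self.reparam(self.con_mu(h), self.con_logvar(h))
        # Broadcast the global speaker code over time and decode.
        z = torch.cat([z_spk.unsqueeze(1).expand(-1, mel.size(1), -1), z_con], dim=-1)
        dec, _ = self.decoder(z)
        return self.out(dec), z_spk, z_con

# Zero-shot conversion: decode the source utterance's content latents together
# with the speaker latent inferred from an unseen target speaker's utterance.
```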
AV-data2vec: Self-supervised Learning of Audio-Visual Speech Representations with Contextualized Target Representations
Self-supervision has shown great potential for audio-visual speech
recognition by vastly reducing the amount of labeled data required to build
good systems. However, existing methods are either not entirely end-to-end or
do not train joint representations of both modalities. In this paper, we
introduce AV-data2vec, which addresses these challenges and builds audio-visual
representations by predicting contextualized target representations, an
approach that has been successful in the uni-modal case. The model uses a shared transformer
encoder for both audio and video and can combine both modalities to improve
speech recognition. Results on LRS3 show that AV-data2vec consistently
outperforms existing methods under all settings with the same amount of data
and model size.
Comment: 2023 ASRU
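A rough PyTorch sketch of this training signal, under simplified assumptions (additive fusion of the modalities, single-layer targets): a shared Transformer encoder predicts, at masked positions, the contextualized representations produced by an EMA teacher copy of itself from the full input. This illustrates the general data2vec-style objective, not the exact AV-data2vec recipe.

```python
# Illustrative masked prediction with an EMA teacher; dimensions are assumed.
import copy
import torch
import torch.nn as nn

class SharedAVEncoder(nn.Module):
    def __init__(self, d_audio=104, d_video=512, d_model=256, n_layers=4):
        super().__init__()
        self.audio_proj = nn.Linear(d_audio, d_model)
        self.video_proj = nn.Linear(d_video, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, n_layers)

    def forward(self, audio, video):             # both: (B, T, d_*)
        # Simple additive fusion; either modality could be zeroed out (dropped).
        x = self.audio_proj(audio) + self.video_proj(video)
        return self.transformer(x)               # (B, T, d_model)

def ema_update(teacher, student, decay=0.999):
    with torch.no_grad():
        for pt, ps in zip(teacher.parameters(), student.parameters()):
            pt.mul_(decay).add_(ps, alpha=1 - decay)

student = SharedAVEncoder()
teacher = copy.deepcopy(student).eval()

audio, video = torch.randn(2, 50, 104), torch.randn(2, 50, 512)
mask = torch.rand(2, 50) < 0.3                   # positions to predict

with torch.no_grad():
    targets = teacher(audio, video)              # contextualized targets
preds = student(audio * ~mask.unsqueeze(-1), video * ~mask.unsqueeze(-1))
loss = nn.functional.mse_loss(preds[mask], targets[mask])
loss.backward()
ema_update(teacher, student)
```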
Deep Neural Convolutive Matrix Factorization for Articulatory Representation Decomposition
Most of the research on data-driven speech representation learning has
focused on raw audio in an end-to-end manner, paying little attention to its
internal phonological or gestural structure. This work investigates speech
representations derived from articulatory kinematics signals, using a neural
implementation of convolutive sparse matrix factorization to decompose the
articulatory data into interpretable gestures and gestural scores. By applying
sparsity constraints, the gestural scores capture the discrete combinatorial
properties of phonological gestures. Phoneme recognition experiments were
additionally performed to show that the gestural scores indeed successfully
encode phonological information. The proposed work thus builds a bridge between
articulatory phonology and deep neural networks to obtain informative,
intelligible, interpretable, and efficient speech representations.
Comment: Submitted to 2022 Interspeech
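The decomposition can be sketched in PyTorch as follows, assuming hypothetical dimensions (12 articulatory channels, 40 gesture templates): a non-negative gestural score is inferred by a small encoder and convolved with a learned non-negative gesture dictionary to reconstruct the trajectories, with an L1 penalty encouraging sparse scores. This illustrates convolutive sparse factorization in general, not the paper's exact network.

```python
# Sketch of neural convolutive sparse factorization; sizes and the sparsity
# weight are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class NeuralConvolutiveMF(nn.Module):
    def __init__(self, n_artic=12, n_gestures=40, kernel=20, hidden=128):
        super().__init__()
        # Encoder: articulatory frames -> non-negative gestural score.
        self.encoder = nn.Conv1d(n_artic, hidden, kernel_size=5, padding=2)
        self.to_score = nn.Conv1d(hidden, n_gestures, kernel_size=1)
        # Dictionary of gesture templates, kept non-negative via softplus.
        self.templates = nn.Parameter(torch.randn(n_artic, n_gestures, kernel))

    def forward(self, x):                         # x: (B, n_artic, T)
        h = F.relu(self.encoder(x))
        score = F.relu(self.to_score(h))          # non-negative gestural score
        W = F.softplus(self.templates)            # non-negative gesture templates
        # Convolutive reconstruction: each articulatory channel is a sum of
        # gesture templates activated at the times given by the score.
        x_hat = F.conv1d(score, W, padding=W.shape[-1] // 2)
        return x_hat[..., : x.shape[-1]], score

model = NeuralConvolutiveMF()
x = torch.randn(4, 12, 200).abs()                 # toy articulatory trajectories
x_hat, score = model(x)
loss = F.mse_loss(x_hat, x) + 1e-3 * score.abs().mean()   # reconstruction + L1 sparsity
loss.backward()
```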
Enhancing GAN-Based Vocoders with Contrastive Learning Under Data-limited Condition
Vocoder models have recently achieved substantial progress in generating
authentic audio comparable to human quality while significantly reducing memory
requirements and inference time. However, these data-hungry generative models
require large-scale audio data to learn good representations. In this paper, we
apply contrastive learning methods when training the vocoder to improve its
perceptual quality without modifying its architecture or adding more data. We
design an auxiliary task with mel-spectrogram contrastive learning to enhance
the utterance-level quality of the vocoder model under data-limited conditions.
We also extend the task to include waveforms to improve the model's multi-modal
understanding and to address the discriminator overfitting problem. We optimize
the additional tasks jointly with the GAN training objectives. Our results show
that these tasks substantially improve model performance in data-limited
settings, and our analysis indicates that the proposed design successfully
alleviates discriminator overfitting and produces audio of higher fidelity.
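As an illustration of this kind of auxiliary objective (with assumed encoders and temperature, not the paper's exact design), the sketch below computes an InfoNCE loss that pulls together embeddings of a mel-spectrogram and of the corresponding waveform from the same utterance, treating other utterances in the batch as negatives; the resulting loss would be added to the generator's GAN objectives.

```python
# Sketch of a mel/waveform contrastive auxiliary loss; encoders are assumed.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MelEncoder(nn.Module):
    def __init__(self, n_mels=80, dim=128):
        super().__init__()
        self.net = nn.Sequential(nn.Conv1d(n_mels, dim, 3, padding=1), nn.ReLU(),
                                 nn.AdaptiveAvgPool1d(1), nn.Flatten())
    def forward(self, mel):                        # (B, n_mels, T)
        return F.normalize(self.net(mel), dim=-1)  # (B, dim), unit-norm

class WavEncoder(nn.Module):
    def __init__(self, dim=128):
        super().__init__()
        self.net = nn.Sequential(nn.Conv1d(1, dim, 15, stride=8, padding=7), nn.ReLU(),
                                 nn.AdaptiveAvgPool1d(1), nn.Flatten())
    def forward(self, wav):                        # (B, 1, L)
        return F.normalize(self.net(wav), dim=-1)

def info_nce(a, b, temperature=0.1):
    logits = a @ b.t() / temperature               # (B, B) similarity matrix
    labels = torch.arange(a.size(0))               # positives on the diagonal
    return F.cross_entropy(logits, labels)

mel_enc, wav_enc = MelEncoder(), WavEncoder()
mel, generated_wav = torch.randn(8, 80, 100), torch.randn(8, 1, 8000)
aux_loss = info_nce(mel_enc(mel), wav_enc(generated_wav))
# total_generator_loss = gan_loss + lambda_aux * aux_loss   (optimized jointly)
```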
VoxGenesis: Unsupervised Discovery of Latent Speaker Manifold for Speech Synthesis
Achieving nuanced and accurate emulation of human voice has been a
longstanding goal in artificial intelligence. Although significant progress has
been made in recent years, mainstream speech synthesis models still rely on
supervised speaker modeling and explicit reference utterances.
However, there are many aspects of human voice, such as emotion, intonation,
and speaking style, for which it is hard to obtain accurate labels. In this
paper, we propose VoxGenesis, a novel unsupervised speech synthesis framework
that can discover a latent speaker manifold and meaningful voice editing
directions without supervision. VoxGenesis is conceptually simple. Instead of
mapping speech features to waveforms deterministically, VoxGenesis transforms a
Gaussian distribution into speech distributions conditioned and aligned by
semantic tokens. This forces the model to learn a speaker distribution
disentangled from the semantic content. During inference, sampling from the
Gaussian distribution enables the creation of novel speakers with distinct
characteristics. More importantly, the exploration of latent space uncovers
human-interpretable directions associated with specific speaker characteristics
such as gender attributes, pitch, tone, and emotion, allowing for voice editing
by manipulating the latent codes along these identified directions. We conduct
extensive experiments to evaluate the proposed VoxGenesis using both subjective
and objective metrics, finding that it produces significantly more diverse and
realistic speakers with distinct characteristics than previous approaches.
We also show that latent space manipulation produces consistent and
human-identifiable effects that are not detrimental to the speech quality,
which was not possible with previous approaches. Audio samples of VoxGenesis
can be found at https://bit.ly/VoxGenesis.
Comment: preprint
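The interface implied by the abstract can be sketched as follows; the generator body, the latent dimensionality, and the `pitch_direction` vector are placeholders rather than the paper's model or its discovered directions. The point is that new speakers come from fresh Gaussian samples, and voice editing moves the same latent along a discovered direction while the semantic tokens keep the content fixed.

```python
# Illustrative sampling-and-editing interface; all components are assumed.
import torch
import torch.nn as nn

class SpeakerConditionedGenerator(nn.Module):
    def __init__(self, z_dim=128, n_tokens=512, d_model=256, n_mels=80):
        super().__init__()
        self.token_emb = nn.Embedding(n_tokens, d_model)
        self.z_proj = nn.Linear(z_dim, d_model)
        self.net = nn.GRU(d_model, d_model, batch_first=True)
        self.out = nn.Linear(d_model, n_mels)

    def forward(self, z, tokens):                 # z: (B, z_dim), tokens: (B, T)
        h = self.token_emb(tokens) + self.z_proj(z).unsqueeze(1)
        y, _ = self.net(h)
        return self.out(y)                        # (B, T, n_mels)

gen = SpeakerConditionedGenerator()
tokens = torch.randint(0, 512, (1, 120))          # semantic tokens of an utterance

z = torch.randn(1, 128)                           # fresh Gaussian sample = novel speaker
mel_novel = gen(z, tokens)

pitch_direction = torch.randn(128)                # placeholder for a discovered direction
mel_edited = gen(z + 2.0 * pitch_direction, tokens)   # same content, shifted voice attribute
```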
Unconstrained Dysfluency Modeling for Dysfluent Speech Transcription and Detection
Dysfluent speech modeling requires time-accurate and silence-aware
transcription at both the word and phonetic levels. However, current
research in dysfluency modeling primarily focuses on either transcription or
detection, and the performance of each aspect remains limited. In this work, we
present an unconstrained dysfluency modeling (UDM) approach that addresses both
transcription and detection in an automatic and hierarchical manner. UDM
eliminates the need for extensive manual annotation by providing a
comprehensive solution. Furthermore, we introduce a simulated dysfluent dataset
called VCTK++ to enhance the capabilities of UDM in phonetic transcription. Our
experimental results demonstrate the effectiveness and robustness of our
proposed methods in both transcription and detection tasks.
Comment: 2023 ASRU
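To give a concrete, though simplified, picture of the detection stage, the sketch below applies rule-based checks to an assumed time-accurate, silence-aware phonetic transcription against the reference phoneme sequence, flagging repetitions, insertions, and long silent blocks; the rules and threshold are assumptions for illustration, not UDM's actual logic.

```python
# Illustrative rules over an assumed time-aligned phonetic transcription;
# not UDM's actual detection algorithm.
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class AlignedPhone:
    phoneme: str     # phone label, or "sil" for silence
    start: float     # seconds
    end: float

def detect_dysfluencies(aligned: List[AlignedPhone], reference: List[str],
                        block_threshold: float = 0.5) -> List[Tuple[str, float, float]]:
    events = []
    ref_idx = 0
    prev = None
    for seg in aligned:
        if seg.phoneme == "sil":
            # A long silence inside the utterance is flagged as a block.
            if seg.end - seg.start >= block_threshold:
                events.append(("block", seg.start, seg.end))
            continue
        if prev is not None and seg.phoneme == prev.phoneme:
            events.append(("repetition", seg.start, seg.end))
        elif ref_idx < len(reference) and seg.phoneme == reference[ref_idx]:
            ref_idx += 1                       # expected phoneme, advance reference
        else:
            events.append(("insertion", seg.start, seg.end))
        prev = seg
    return events
```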