Accented Text-to-Speech Synthesis with a Conditional Variational Autoencoder
Accent plays a significant role in speech communication, influencing
understanding capabilities and also conveying a person's identity. This paper
introduces a novel and efficient framework for accented Text-to-Speech (TTS)
synthesis based on a Conditional Variational Autoencoder. It has the ability to
synthesize a selected speaker's speech that is converted to any desired target
accent. Our thorough experiments validate the effectiveness of our proposed
framework using both objective and subjective evaluations. The results also
show remarkable performance in terms of the ability to manipulate accents in
the synthesized speech and provide a promising avenue for future accented TTS
research.
Comment: preprint submitted to a conference, under review
Text-to-Audio Generation using Instruction-Tuned LLM and Latent Diffusion Model
The immense scale of recent large language models (LLMs) enables many
interesting properties, such as instruction- and chain-of-thought-based
fine-tuning, that have significantly improved zero- and few-shot performance in
many natural language processing (NLP) tasks. Inspired by such successes, we
adopt an instruction-tuned LLM, Flan-T5, as the text encoder for
text-to-audio (TTA) generation, a task where the goal is to generate audio
from its textual description. Prior works on TTA either pre-trained a joint
text-audio encoder or used a non-instruction-tuned model, such as T5.
Consequently, our latent diffusion model (LDM)-based approach TANGO outperforms
the state-of-the-art AudioLDM on most metrics and stays comparable on the rest
on the AudioCaps test set, despite training the LDM on a 63 times smaller dataset
and keeping the text encoder frozen. This improvement might also be attributed
to the adoption of audio pressure level-based sound mixing for training set
augmentation, whereas the prior methods take a random mix.
Comment: https://github.com/declare-lab/tang
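The abstract above contrasts pressure level-based sound mixing with a random mix but does not give a formula. The following is a minimal sketch of one common level-balancing scheme, offered as an illustration and not as the authors' implementation. The function names, the RMS-based level estimate (no perceptual weighting), and the exact mixing formula are all assumptions.

```python
import numpy as np


def pressure_level_db(x: np.ndarray, eps: float = 1e-12) -> float:
    """Rough signal level in dB from RMS amplitude (assumption: no A-weighting)."""
    rms = np.sqrt(np.mean(np.square(x)))
    return 20.0 * np.log10(rms + eps)


def level_aware_mix(x1: np.ndarray, x2: np.ndarray):
    """Mix two equal-length clips so the louder one does not drown out the other.

    Returns the mixed signal and the weight p applied to x1.
    """
    g1, g2 = pressure_level_db(x1), pressure_level_db(x2)
    # The weight on x1 shrinks as x1 gets louder relative to x2.
    p = 1.0 / (1.0 + 10.0 ** ((g1 - g2) / 20.0))
    # Dividing by the norm of the weights keeps the overall energy comparable.
    mixed = (p * x1 + (1.0 - p) * x2) / np.sqrt(p**2 + (1.0 - p) ** 2)
    return mixed, p
```

With this weighting, two clips of equal level get p = 0.5, while a clip that is 20 dB louder than its partner gets a weight of roughly 1/11, so both sources contribute similar amplitude to the mix. A random mix would instead draw p independently of the clip levels.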
A Review of Deep Learning Techniques for Speech Processing
The field of speech processing has undergone a transformative shift with the
advent of deep learning. The use of multiple processing layers has enabled the
creation of models capable of extracting intricate features from speech data.
This development has paved the way for unparalleled advancements in speech
recognition, text-to-speech synthesis, automatic speech recognition, and
emotion recognition, propelling the performance of these tasks to unprecedented
heights. The power of deep learning techniques has opened up new avenues for
research and innovation in the field of speech processing, with far-reaching
implications for a range of industries and applications. This review paper
provides a comprehensive overview of the key deep learning models and their
applications in speech-processing tasks. We begin by tracing the evolution of
speech processing research, from early approaches, such as MFCC and HMM, to
more recent advances in deep learning architectures, such as CNNs, RNNs,
transformers, conformers, and diffusion models. We categorize the approaches
and compare their strengths and weaknesses for solving speech-processing tasks.
Furthermore, we extensively cover various speech-processing tasks, datasets,
and benchmarks used in the literature and describe how different deep-learning
networks have been utilized to tackle these tasks. Additionally, we discuss the
challenges and future directions of deep learning in speech processing,
including the need for more parameter-efficient, interpretable models and the
potential of deep learning for multimodal speech processing. By examining the
field's evolution, comparing and contrasting different approaches, and
highlighting future directions and challenges, we hope to inspire further
research in this exciting and rapidly advancing field.
ADAPTERMIX: Exploring the Efficacy of Mixture of Adapters for Low-Resource TTS Adaptation
There are significant challenges for speaker adaptation in text-to-speech for
languages that are not widely spoken or for speakers with accents or dialects
that are not well-represented in the training data. To address this issue, we
propose the use of the "mixture of adapters" method. This approach involves
adding multiple adapters within a backbone-model layer to learn the unique
characteristics of different speakers. Our approach outperforms the baseline,
with a noticeable improvement of 5% observed in speaker preference tests when
using only one minute of data for each new speaker. Moreover, following the
adapter paradigm, we fine-tune only the adapter parameters (11% of the total
model parameters). This is a significant achievement in parameter-efficient
speaker adaptation, and one of the first models of its kind. Overall, our
proposed approach offers a promising solution for speech synthesis,
particularly for adapting to speakers from diverse backgrounds.
Comment: Interspeech 202
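The abstract above describes adding multiple adapters within a backbone-model layer and fine-tuning only the adapter parameters. The sketch below shows one generic way such a layer could be wired, using NumPy for the forward pass only. It is an assumption-laden illustration, not the paper's architecture: the class name, the softmax gate, the bottleneck size, and the zero initialisation are all hypothetical choices.

```python
import numpy as np


class MixtureOfAdapters:
    """Several bottleneck adapters combined by a per-frame softmax gate (hypothetical design)."""

    def __init__(self, d_model: int, bottleneck: int, n_adapters: int, rng: np.random.Generator):
        # Down-projection of each adapter: (K, d_model, bottleneck).
        self.down = rng.normal(0.0, 0.02, (n_adapters, d_model, bottleneck))
        # Up-projection starts at zero so the layer is initially an identity map.
        self.up = np.zeros((n_adapters, bottleneck, d_model))
        # Gate that scores each adapter for every frame: (d_model, K).
        self.gate = rng.normal(0.0, 0.02, (d_model, n_adapters))

    def gate_weights(self, h: np.ndarray) -> np.ndarray:
        """Softmax over adapters for each frame; h has shape (frames, d_model)."""
        logits = h @ self.gate
        logits = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
        weights = np.exp(logits)
        return weights / weights.sum(axis=-1, keepdims=True)

    def __call__(self, h: np.ndarray) -> np.ndarray:
        weights = self.gate_weights(h)                                      # (T, K)
        hidden = np.maximum(0.0, np.einsum("td,kdb->ktb", h, self.down))    # ReLU bottleneck
        outputs = np.einsum("ktb,kbd->ktd", hidden, self.up)                # (K, T, d_model)
        mixed = np.einsum("tk,ktd->td", weights, outputs)                   # gate-weighted sum
        return h + mixed                                                    # residual connection

    def n_parameters(self) -> int:
        """Number of adapter-side parameters (the only ones that would be trained)."""
        return self.down.size + self.up.size + self.gate.size
```

In a setup like this, the backbone weights would stay frozen and only the arrays held by the layer would be updated for a new speaker, which is what keeps the trainable fraction of the model small.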
Partnering with women collectives for delivering essential women’s nutrition interventions in tribal areas of eastern India: a scoping study
Background: We examined the feasibility of engaging women collectives
in delivering a package of women’s nutrition messages/services as
a funded stakeholder in three tribal-dominated districts of Odisha,
Jharkhand and Chhattisgarh States, in eastern India. These districts
have a high prevalence of child stunting and poor government service
outreach. Methods: The study, conducted between July 2014 and March 2015,
adopted an exploratory mixed-methods design (review of coverage data and
government reports, field interviews and focus group discussions with
multiple stakeholders and intended communities) to assess coverage of
women’s nutrition services. A capacity assessment tool was
developed to map all types of community collectives and assess their
awareness, institutional and programme capacity as a funded stakeholder
for delivering women’s nutrition services/behaviour promotion.
Results: Limited targeting of the pre-pregnancy period, delays in first
trimester registration of pregnant women, and low micronutrient
supplementation supply and awareness issues emerged as key bottlenecks
in improving women’s nutrition in these districts. Amongst the 18
different types of community collectives mapped, Self Help Groups
(SHGs) and their federations (tier 2 and tier 3), with total membership
of over 650,000, emerged as the most promising community collective due
to their vast network, governance structure, bank linkage, and regular
interface. Nearly 400,000 (or 20% of women) in these districts can be
reached through the mapped 31,919 SHGs. SHGs with organisational
readiness for receiving and managing grants for income generation and
community development activities varied from 41 to 94% across study
districts. Stakeholders perceived that SHG federations managing grants
from government could be engaged for nutrition promotion and service
delivery, and that SHG weekly meetings could serve as a community interface for
discussing/resolving local issues impeding access to services.
Conclusions: Women SHGs (with tier 2 and tier 3) can become direct
grantees for strengthening coverage of women’s nutrition
interventions in these tribal districts/pockets, provided they are
capacitated, supervised and given safeguards against exploitation and
violence.
SPEAKER EMBEDDINGS FOR DIARIZATION OF BROADCAST DATA IN THE ALLIES CHALLENGE
Diarization consists of the segmentation of speech signals and the clustering of homogeneous speaker segments. State-of-the-art systems typically operate on speaker embeddings, such as i-vectors or neural x-vectors, extracted from mel-frequency cepstral coefficients (MFCCs) or spectrograms. The recent SincNet architecture extracts x-vectors directly from raw speech signals. The work reported in this paper compares the performance of different embeddings extracted from MFCCs or the raw signal for speaker diarization of broadcast media treated with compression and sub-sampling, operations which typically degrade performance. Experiments are performed with the new ALLIES database that was designed to complement existing, publicly available French corpora of broadcast radio and TV shows. Results show that, in adverse conditions, with compression and sampling mismatch, SincNet x-vectors outperform i-vectors and x-vectors by relative DERs of 43% and 73% respectively. Additionally, we found that SincNet x-vectors are not the absolute best embeddings but are more robust to data mismatch than others.
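The abstract above reports relative diarization error rate (DER) figures without stating how a relative figure is derived. A minimal sketch of the usual calculation follows, assuming the figures are relative reductions with respect to a baseline system. The function name and the example DER values are hypothetical and are not taken from the paper.

```python
def relative_der_reduction(baseline_der: float, system_der: float) -> float:
    """Relative DER reduction in percent: how much of the baseline error is removed."""
    if baseline_der <= 0.0:
        raise ValueError("baseline_der must be positive")
    return 100.0 * (baseline_der - system_der) / baseline_der


# Hypothetical absolute DERs (in percent), chosen only to illustrate the arithmetic.
baseline, improved = 20.0, 11.4
print(f"relative reduction: {relative_der_reduction(baseline, improved):.1f}%")
```

A relative figure on its own says nothing about the absolute error rates, so two systems with the same relative gain can still differ widely in absolute DER.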