17 research outputs found

    Retaggio


    Accented Text-to-Speech Synthesis with a Conditional Variational Autoencoder

    Accent plays a significant role in speech communication, influencing how well speech is understood and conveying a person's identity. This paper introduces a novel and efficient framework for accented Text-to-Speech (TTS) synthesis based on a Conditional Variational Autoencoder. It can synthesize a selected speaker's speech converted to any desired target accent. Our thorough experiments validate the effectiveness of the proposed framework using both objective and subjective evaluations. The results also show remarkable performance in manipulating accents in the synthesized speech and provide a promising avenue for future accented TTS research.
    Comment: preprint submitted to a conference, under review

    Text-to-Audio Generation using Instruction-Tuned LLM and Latent Diffusion Model

    The immense scale of recent large language models (LLMs) enables many interesting properties, such as instruction- and chain-of-thought-based fine-tuning, that have significantly improved zero- and few-shot performance in many natural language processing (NLP) tasks. Inspired by such successes, we adopt an instruction-tuned LLM, Flan-T5, as the text encoder for text-to-audio (TTA) generation, a task where the goal is to generate audio from its textual description. Prior works on TTA either pre-trained a joint text-audio encoder or used a non-instruction-tuned model, such as T5. Our latent diffusion model (LDM)-based approach, TANGO, outperforms the state-of-the-art AudioLDM on most metrics and stays comparable on the rest on the AudioCaps test set, despite training the LDM on a 63 times smaller dataset and keeping the text encoder frozen. This improvement might also be attributed to the adoption of audio pressure level-based sound mixing for training set augmentation, whereas the prior methods take a random mix.
    Comment: https://github.com/declare-lab/tang

    A Review of Deep Learning Techniques for Speech Processing

    The field of speech processing has undergone a transformative shift with the advent of deep learning. The use of multiple processing layers has enabled the creation of models capable of extracting intricate features from speech data. This development has paved the way for unparalleled advancements in speech recognition, text-to-speech synthesis, automatic speech recognition, and emotion recognition, propelling the performance of these tasks to unprecedented heights. The power of deep learning techniques has opened up new avenues for research and innovation in the field of speech processing, with far-reaching implications for a range of industries and applications. This review paper provides a comprehensive overview of the key deep learning models and their applications in speech-processing tasks. We begin by tracing the evolution of speech processing research, from early approaches, such as MFCC and HMM, to more recent advances in deep learning architectures, such as CNNs, RNNs, transformers, conformers, and diffusion models. We categorize the approaches and compare their strengths and weaknesses for solving speech-processing tasks. Furthermore, we extensively cover various speech-processing tasks, datasets, and benchmarks used in the literature and describe how different deep-learning networks have been utilized to tackle these tasks. Additionally, we discuss the challenges and future directions of deep learning in speech processing, including the need for more parameter-efficient, interpretable models and the potential of deep learning for multimodal speech processing. By examining the field's evolution, comparing and contrasting different approaches, and highlighting future directions and challenges, we hope to inspire further research in this exciting and rapidly advancing field

    ADAPTERMIX: Exploring the Efficacy of Mixture of Adapters for Low-Resource TTS Adaptation

    There are significant challenges for speaker adaptation in text-to-speech for languages that are not widely spoken or for speakers with accents or dialects that are not well represented in the training data. To address this issue, we propose the use of the "mixture of adapters" method. This approach involves adding multiple adapters within a backbone-model layer to learn the unique characteristics of different speakers. Our approach outperforms the baseline, with a noticeable improvement of 5% observed in speaker preference tests when using only one minute of data for each new speaker. Moreover, following the adapter paradigm, we fine-tune only the adapter parameters (11% of the total model parameters). This is a significant achievement in parameter-efficient speaker adaptation, and one of the first models of its kind. Overall, our proposed approach offers a promising solution for speech synthesis, particularly for adapting to speakers from diverse backgrounds.
    Comment: Interspeech 202

    Partnering with women collectives for delivering essential women’s nutrition interventions in tribal areas of eastern India: a scoping study

    Background: We examined the feasibility of engaging women collectives in delivering a package of women’s nutrition messages/services as a funded stakeholder in three tribal-dominated districts of Odisha, Jharkhand and Chhattisgarh States, in eastern India. These districts have a high prevalence of child stunting and poor government service outreach.
    Methods: Conducted between July 2014 and March 2015, the study adopted an exploratory mixed-methods design (review of coverage data and government reports, field interviews and focus group discussions with multiple stakeholders and intended communities) to assess coverage of women’s nutrition services. A capacity assessment tool was developed to map all types of community collectives and assess their awareness, institutional and programme capacity as a funded stakeholder for delivering women’s nutrition services/behaviour promotion.
    Results: Limited targeting of the pre-pregnancy period, delays in first trimester registration of pregnant women, and low micronutrient supplementation supply and awareness issues emerged as key bottlenecks in improving women’s nutrition in these districts. Amongst the 18 different types of community collectives mapped, Self Help Groups (SHGs) and their federations (tier 2 and tier 3), with total membership of over 650,000, emerged as the most promising community collective due to their vast network, governance structure, bank linkage, and regular interface. Nearly 400,000 (or 20% of women) in these districts can be reached through the mapped 31,919 SHGs. SHGs with organisational readiness for receiving and managing grants for income generation and community development activities varied from 41 to 94% across study districts. Stakeholders perceived that SHG federations managing grants from government can be engaged for nutrition promotion and service delivery, and that SHG weekly meetings can serve as a community interface for discussing/resolving local issues impeding access to services.
    Conclusions: Women SHGs (with tier 2 and tier 3) can become direct grantees for strengthening coverage of women’s nutrition interventions in these tribal districts/pockets, provided they are capacitated, supervised and given safeguards against exploitation and violence.

    SPEAKER EMBEDDINGS FOR DIARIZATION OF BROADCAST DATA IN THE ALLIES CHALLENGE

    Diarization consists of segmenting speech signals and clustering homogeneous speaker segments. State-of-the-art systems typically operate on speaker embeddings, such as i-vectors or neural x-vectors, extracted from mel-frequency cepstral coefficients (MFCCs) or spectrograms. The recent SincNet architecture extracts x-vectors directly from raw speech signals. The work reported in this paper compares the performance of different embeddings extracted from MFCCs or the raw signal for speaker diarization of broadcast media treated with compression and sub-sampling, operations which typically degrade performance. Experiments are performed with the new ALLIES database, which was designed to complement existing, publicly available French corpora of broadcast radio and TV shows. Results show that, in adverse conditions with compression and sampling mismatch, SincNet x-vectors outperform i-vectors and x-vectors by relative DERs of 43% and 73%, respectively. Additionally, we found that SincNet x-vectors are not the absolute best embeddings but are more robust to data mismatch than others.
