Attentive Adversarial Learning for Domain-Invariant Training
Adversarial domain-invariant training (ADIT) has proven effective in
suppressing the effects of domain variability in acoustic modeling and has led
to improved performance in automatic speech recognition (ASR). In ADIT, an
auxiliary domain classifier takes in equally-weighted deep features from a deep
neural network (DNN) acoustic model and is trained to improve their
domain-invariance by optimizing an adversarial loss function. In this work, we
propose an attentive ADIT (AADIT) in which we advance the domain classifier
with an attention mechanism to automatically weight the input deep features
according to their importance in domain classification. With this attentive
re-weighting, AADIT can focus on the domain normalization of phonetic
components that are more susceptible to domain variability and generate deep
features with improved domain-invariance and senone-discriminativity over ADIT.
Most importantly, the attention block serves only as an external component to
the DNN acoustic model and is not involved in ASR, so AADIT can be used to
improve acoustic modeling with any DNN architecture. More generally, the
same methodology can improve any adversarial learning system with an auxiliary
discriminator. Evaluated on the CHiME-3 dataset, AADIT achieves 13.6% and 9.3%
relative WER improvements over a multi-conditional model and a
strong ADIT baseline, respectively.
Comment: 5 pages, 1 figure, ICASSP 201
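
The attentive re-weighting described in this abstract can be illustrated with a small forward-pass sketch. This is not the authors' implementation: the dot-product scoring, the `query` vector, the linear domain classifier and all numeric values are assumptions chosen for illustration, and the adversarial (gradient-reversal) update of the acoustic model is omitted. It only shows how per-frame attention weights replace the equal weighting used in plain ADIT before the domain classifier computes its loss.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(features, query):
    """One weight per frame, from a dot-product score against a
    hypothetical learned query vector (an assumption of this sketch)."""
    scores = [sum(f * q for f, q in zip(frame, query)) for frame in features]
    return softmax(scores)


def weighted_context(features, weights):
    """Attention-weighted sum of the per-frame deep features."""
    dim = len(features[0])
    return [sum(w * frame[d] for w, frame in zip(weights, features))
            for d in range(dim)]


def domain_loss(context, domain_weights, domain_label):
    """Cross-entropy of a linear domain classifier on the context vector."""
    logits = [sum(c * w for c, w in zip(context, row)) for row in domain_weights]
    probs = softmax(logits)
    return -math.log(probs[domain_label])


# Toy deep features: 3 frames, 3 dimensions (illustrative values only).
features = [[0.2, -1.0, 0.5], [1.5, 0.3, -0.7], [-0.4, 0.8, 0.1]]
query = [1.0, 0.0, -1.0]
domain_weights = [[0.5, -0.2, 0.1], [-0.3, 0.4, 0.9]]  # 2 domains

attn = attention_weights(features, query)
# A zero query gives equal weights, i.e. the equally-weighted ADIT case.
uniform = attention_weights(features, [0.0, 0.0, 0.0])

loss_attentive = domain_loss(weighted_context(features, attn),
                             domain_weights, domain_label=0)
loss_uniform = domain_loss(weighted_context(features, uniform),
                           domain_weights, domain_label=0)
print(attn, loss_attentive, loss_uniform)
```

In a full adversarial setup, the domain classifier and attention parameters would be trained to minimize this loss while the acoustic model's feature layers are trained to maximize it; because the attention block sits only on the classifier side, it can be discarded at recognition time, as the abstract notes.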
Empowering Communication: Speech Technology for Indian and Western Accents through AI-powered Speech Synthesis
Neural Text-to-speech (TTS) synthesis is a powerful technology that can
generate speech using neural networks. One of the most remarkable features of
TTS synthesis is its capability to produce speech in the voice of different
speakers. This paper introduces voice cloning and speech synthesis
(https://pypi.org/project/voice-cloning/), an open-source Python package for
helping people with speech disorders communicate more effectively, as well as
for professionals seeking to integrate voice cloning or speech synthesis
capabilities into their projects. This package aims to generate synthetic
speech that sounds like the natural voice of an individual, but it does not
replace the natural human voice. The architecture of the system comprises a
speaker verification system, a synthesizer, a vocoder, and noise reduction.
The speaker verification system is trained on a varied set of speakers to
achieve optimal generalization performance without relying on transcriptions.
The synthesizer is trained on both audio and transcriptions and generates a Mel
spectrogram from text, which the vocoder converts into the corresponding audio
signal. The audio signal is then processed by a noise reduction algorithm to
eliminate unwanted noise and enhance speech clarity. The performance of
synthesized speech from seen and unseen speakers is then evaluated using
subjective and objective measures such as Mean Opinion Score (MOS), Gross
Pitch Error (GPE), and Spectral Distortion (SD). The
model can create speech in distinct voices by including speaker characteristics
that are chosen randomly.
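
Of the objective measures named in this abstract, Gross Pitch Error is simple enough to sketch. The abstract does not define it, so the version below assumes a commonly used convention: among frames that are voiced in both the reference and the synthesized pitch tracks, GPE is the fraction whose F0 deviates from the reference by more than 20%. The function name, the use of `0.0` to mark unvoiced frames, the `tolerance` parameter and the sample F0 values are all assumptions of this sketch, not details taken from the package.

```python
def gross_pitch_error(ref_f0, syn_f0, tolerance=0.2):
    """Fraction of frames voiced in both tracks whose synthesized F0
    deviates from the reference by more than `tolerance` (relative).

    F0 values are in Hz; 0.0 marks an unvoiced frame (assumed convention).
    """
    if len(ref_f0) != len(syn_f0):
        raise ValueError("pitch tracks must have the same number of frames")
    # Only frames voiced in both tracks are compared.
    both_voiced = [(r, s) for r, s in zip(ref_f0, syn_f0) if r > 0 and s > 0]
    if not both_voiced:
        return 0.0
    gross = sum(1 for r, s in both_voiced if abs(s - r) > tolerance * r)
    return gross / len(both_voiced)


# Illustrative frame-aligned F0 tracks in Hz.
ref = [0.0, 100.0, 110.0, 120.0, 0.0, 200.0]
syn = [0.0, 105.0, 140.0, 118.0, 90.0, 150.0]

gpe = gross_pitch_error(ref, syn)
print(gpe)  # 0.5: two of the four jointly voiced frames exceed 20%
```

The sketch assumes the two pitch tracks are already time-aligned frame by frame; in practice the F0 values would come from a pitch tracker applied to the reference and synthesized audio.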