77 research outputs found
Generic Indic Text-to-speech Synthesisers with Rapid Adaptation in an End-to-end Framework
Building text-to-speech (TTS) synthesisers for Indian languages is a
difficult task owing to the large number of active languages. Indian languages
can be classified into a finite set of families, the most prominent among them
being Indo-Aryan and Dravidian. The proposed work exploits this property to build a
generic TTS system using multiple languages from the same family in an
end-to-end framework. Generic systems are quite robust as they are capable of
capturing a variety of phonotactics across languages. These systems are then
adapted to a new language in the same family using small amounts of adaptation
data. Experiments indicate that good quality TTS systems can be built using
only 7 minutes of adaptation data. An average degradation mean opinion score of
3.98 is obtained for the adapted TTSes.
Extensive analysis of systematic interactions between languages in the
generic TTSes is carried out. x-vectors are included as speaker embeddings to
synthesise text in a particular speaker's voice. An interesting observation is
that the prosody of the target speaker's voice is preserved. These results are
quite promising as they indicate the capability of generic TTSes to handle
speaker and language switching seamlessly, along with the ease of adaptation to
a new language.
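For illustration only, the following PyTorch sketch outlines the general recipe the abstract describes: a generic encoder-decoder acoustic model conditioned on an x-vector speaker embedding and briefly fine-tuned on a small amount of adaptation data. The module layout, dimensions, dummy data, and the omission of attention/duration modelling are all assumptions, not the authors' implementation.

```python
# Hypothetical sketch: adapting a generic seq2seq acoustic model to a new language
# with x-vector speaker conditioning. All module names, sizes, and data are dummies;
# the attention/duration modelling of a real end-to-end TTS is deliberately omitted.
import torch
import torch.nn as nn

class GenericAcousticModel(nn.Module):
    def __init__(self, n_phones=256, emb_dim=128, xvec_dim=512, mel_dim=80):
        super().__init__()
        self.phone_emb = nn.Embedding(n_phones, emb_dim)
        self.encoder = nn.GRU(emb_dim, 256, batch_first=True, bidirectional=True)
        self.xvec_proj = nn.Linear(xvec_dim, 512)   # x-vector added to every encoder frame
        self.decoder = nn.GRU(512, 512, batch_first=True)
        self.mel_out = nn.Linear(512, mel_dim)

    def forward(self, phones, xvec):
        enc, _ = self.encoder(self.phone_emb(phones))      # (B, T, 512)
        enc = enc + self.xvec_proj(xvec).unsqueeze(1)      # speaker conditioning
        dec, _ = self.decoder(enc)
        return self.mel_out(dec)                           # predicted mel frames

model = GenericAcousticModel()      # assume weights pretrained on one language family
opt = torch.optim.Adam(model.parameters(), lr=1e-4)

# stand-in for a few minutes of adaptation data: phone ids, x-vectors, mel targets
phones = torch.randint(0, 256, (8, 120))
xvecs = torch.randn(8, 512)
mels = torch.randn(8, 120, 80)

for step in range(100):             # brief fine-tuning on the new language
    loss = nn.functional.l1_loss(model(phones, xvecs), mels)
    opt.zero_grad()
    loss.backward()
    opt.step()
```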
A Generative Model For Zero Shot Learning Using Conditional Variational Autoencoders
Zero-shot learning in image classification refers to the setting where images
from some novel classes are absent in the training data, but other information,
such as natural language descriptions or attribute vectors of the classes, is
available. This setting is important in the real world, since one may not be
able to obtain images of all the possible classes at training time. While previous
approaches have tried to model the relationship between the class-attribute
space and the image space via some transfer function, so as to characterise the
image space corresponding to an unseen class, we take a different approach: we
generate samples from the given attributes using a conditional variational
autoencoder and use the generated samples to classify the unseen classes. By
extensive testing on four benchmark
datasets, we show that our model outperforms the state of the art, particularly
in the more realistic generalized setting, where the training classes can also
appear at test time along with the novel classes.
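A hedged PyTorch sketch of the core idea (not the paper's exact architecture) follows: train a conditional VAE on seen-class image features and their attribute vectors, then decode random latents conditioned on an unseen class's attributes to obtain pseudo-features for training a downstream classifier. The feature, attribute, and latent sizes and the dummy data are assumptions.

```python
# Illustrative sketch: a conditional VAE over image features conditioned on class
# attributes; decoded samples for an unseen class's attributes serve as training
# data for a downstream classifier.
import torch
import torch.nn as nn

FEAT, ATTR, Z = 2048, 85, 64        # assumed CNN feature, attribute, latent sizes

class CVAE(nn.Module):
    def __init__(self):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(FEAT + ATTR, 512), nn.ReLU())
        self.mu = nn.Linear(512, Z)
        self.logvar = nn.Linear(512, Z)
        self.dec = nn.Sequential(nn.Linear(Z + ATTR, 512), nn.ReLU(),
                                 nn.Linear(512, FEAT))

    def forward(self, x, a):
        h = self.enc(torch.cat([x, a], dim=1))
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)   # reparameterisation
        return self.dec(torch.cat([z, a], dim=1)), mu, logvar

def vae_loss(recon, x, mu, logvar):
    rec = nn.functional.mse_loss(recon, x, reduction='sum')
    kld = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return rec + kld

cvae = CVAE()
opt = torch.optim.Adam(cvae.parameters(), lr=1e-3)
x_seen, a_seen = torch.randn(256, FEAT), torch.rand(256, ATTR)    # dummy seen-class data
for _ in range(50):
    recon, mu, logvar = cvae(x_seen, a_seen)
    loss = vae_loss(recon, x_seen, mu, logvar)
    opt.zero_grad()
    loss.backward()
    opt.step()

# generate pseudo-features for an unseen class from its attribute vector alone
a_unseen = torch.rand(1, ATTR).repeat(100, 1)
with torch.no_grad():
    fake_feats = cvae.dec(torch.cat([torch.randn(100, Z), a_unseen], dim=1))
# fake_feats can now be used to train any classifier covering the unseen class
```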
Zero resource speech synthesis using transcripts derived from perceptual acoustic units
Zerospeech synthesis is the task of building vocabulary-independent speech
synthesis systems, where transcriptions are not available for training data. It
is, therefore, necessary to convert training data into a sequence of
fundamental acoustic units that can be used for synthesis during testing. This
paper attempts to discover and model perceptual acoustic units consisting of
steady-state and transient regions in speech. The transients roughly
correspond to CV and VC units, while the steady-state regions correspond to
sonorants and fricatives. The speech signal is first preprocessed by segmenting
it into CVC-like units using a short-term energy-like contour. These CVC
segments are clustered using a connected-components-based graph clustering
technique. The clustered CVC segments are initialized such that the onsets (CV)
and decays (VC) correspond to transients, and the rhymes correspond to
steady-states. Following this initialization, the units are allowed to
re-organise over the continuous speech into a final set of acoustic units (AUs)
in an HMM-GMM framework. AU sequences thus
obtained are used to train synthesis models. The performance of the proposed
approach is evaluated on the Zerospeech 2019 challenge database. Subjective and
objective scores show that reasonably good quality synthesis with low bit rate
encoding can be achieved using the proposed AUs.
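As a rough illustration of the first preprocessing stage only, the NumPy sketch below places segment boundaries at valleys of a short-term energy contour to obtain CVC-like chunks. The frame sizes, smoothing window, and valley rule are assumptions, and the subsequent graph clustering and HMM-GMM refinement are not reproduced.

```python
# Simplified sketch of energy-based segmentation into CVC-like chunks.
import numpy as np

def short_term_energy(signal, frame_len=400, hop=160):
    """Frame-wise log energy (25 ms frames, 10 ms hop at 16 kHz)."""
    n_frames = 1 + max(0, (len(signal) - frame_len) // hop)
    frames = np.stack([signal[i * hop:i * hop + frame_len] for i in range(n_frames)])
    return np.log(np.sum(frames ** 2, axis=1) + 1e-10)

def segment_at_energy_valleys(energy, min_gap=10):
    """Place segment boundaries at local minima of the smoothed energy contour."""
    smooth = np.convolve(energy, np.ones(5) / 5, mode='same')
    boundaries = [0]
    for t in range(1, len(smooth) - 1):
        is_valley = smooth[t] < smooth[t - 1] and smooth[t] < smooth[t + 1]
        if is_valley and t - boundaries[-1] >= min_gap:   # skip very short segments
            boundaries.append(t)
    boundaries.append(len(smooth))
    return list(zip(boundaries[:-1], boundaries[1:]))     # (start, end) frame pairs

wav = np.random.randn(16000 * 2)                          # dummy 2 s, 16 kHz waveform
segments = segment_at_energy_valleys(short_term_energy(wav))
```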
Spoof detection using time-delay shallow neural network and feature switching
Detecting spoofed utterances is a fundamental problem in voice-based
biometrics. Spoofing can be performed either by logical access, such as speech
synthesis and voice conversion, or by physical access, such as replaying a
pre-recorded utterance. Inspired by the state-of-the-art x-vector based
speaker verification approach, this paper proposes a time-delay shallow neural
network (TD-SNN) for spoof detection for both logical and physical access. The
novelty of the proposed TD-SNN system vis-a-vis conventional DNN systems is
that it can handle variable length utterances during testing. Performance of
the proposed TD-SNN systems and the baseline Gaussian mixture models (GMMs) is
analyzed on the ASV-spoof-2019 dataset. The performance of the systems is
measured in terms of the minimum normalized tandem detection cost function
(min-t-DCF). When studied with individual features, the TD-SNN system
consistently outperforms the GMM system for physical access. For logical
access, GMM surpasses TD-SNN systems for certain individual features. When
combined with the decision-level feature switching (DLFS) paradigm, the best
TD-SNN system outperforms the best baseline GMM system on evaluation data with
a relative improvement of 48.03% and 49.47% for logical and physical access,
respectively.
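A speculative PyTorch sketch of a time-delay network with statistics pooling follows, showing one way such a model can score utterances of any length with a fixed-size output. The layer sizes, dilations, and the 60-dimensional input features are assumptions rather than the paper's exact TD-SNN.

```python
# Sketch: dilated 1-D convolutions over frame-level features, then mean/std pooling
# so utterances of any length yield a fixed-size vector scored as bonafide vs. spoof.
import torch
import torch.nn as nn

class TDSNN(nn.Module):
    def __init__(self, feat_dim=60, hidden=256):
        super().__init__()
        self.frame_layers = nn.Sequential(
            nn.Conv1d(feat_dim, hidden, kernel_size=5, dilation=1), nn.ReLU(),
            nn.Conv1d(hidden, hidden, kernel_size=3, dilation=2), nn.ReLU(),
        )
        self.classifier = nn.Sequential(
            nn.Linear(2 * hidden, 128), nn.ReLU(),
            nn.Linear(128, 2),                        # bonafide vs. spoof
        )

    def forward(self, feats):                         # feats: (B, feat_dim, T), any T
        h = self.frame_layers(feats)
        stats = torch.cat([h.mean(dim=2), h.std(dim=2)], dim=1)   # statistics pooling
        return self.classifier(stats)

model = TDSNN()
scores_short = model(torch.randn(4, 60, 150))         # ~1.5 s utterances
scores_long = model(torch.randn(4, 60, 600))          # ~6 s utterances, same model
```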
Using Signal Processing in Tandem With Adapted Mixture Models for Classifying Genomic Signals
Genomic signal processing has been used successfully in bioinformatics to
analyze biomolecular sequences and gain varied insights into DNA structure,
gene organization, protein binding, sequence evolution, etc. But challenges
remain in finding the appropriate spectral representation of a biomolecular
sequence, especially when multiple variable-length sequences need to be handled
consistently. In this study, we address this challenge in the context of the
well-studied problem of classifying genomic sequences into different taxonomic
units (strain, phylum, order, etc.). We propose a novel technique that employs
signal processing in tandem with Gaussian mixture models to improve the
spectral representation of a sequence and subsequently the taxonomic
classification accuracies. The sequences are first transformed into spectra
and projected onto a subspace where sequences belonging to different taxa are
better distinguishable. Our method outperforms a similar state-of-the-art
method on established benchmark datasets by an absolute margin of 6.06% in
accuracy.
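For a flavour of this kind of pipeline, here is a small Python sketch of a common genomic-signal-processing recipe: map each nucleotide to a binary indicator signal, take a power spectrum resampled to a fixed length so variable-length sequences become comparable, and score taxa with per-class Gaussian mixture models. The mapping, spectrum length, toy data, and GMM settings are assumptions, and the paper's subspace projection and adapted mixture models are not reproduced here.

```python
# Toy sketch: spectral representation of DNA plus per-taxon GMM scoring.
import numpy as np
from sklearn.mixture import GaussianMixture

def dna_power_spectrum(seq, n_bins=128):
    seq = seq.upper()
    spectrum = np.zeros(n_bins)
    for base in "ACGT":                               # Voss-style indicator per base
        indicator = np.array([1.0 if c == base else 0.0 for c in seq])
        power = np.abs(np.fft.rfft(indicator)) ** 2
        grid = np.linspace(0, len(power) - 1, n_bins) # fixed-length spectral grid
        spectrum += np.interp(grid, np.arange(len(power)), power)
    return spectrum / (len(seq) + 1e-9)

train = {"taxon_A": ["ACGTACGTGGCC" * 20, "AACCGGTTACGT" * 25],
         "taxon_B": ["ATATATATCGCG" * 22, "GCGCGCATATTA" * 18]}
models = {}
for taxon, seqs in train.items():
    X = np.stack([dna_power_spectrum(s) for s in seqs])
    models[taxon] = GaussianMixture(n_components=1, covariance_type="diag").fit(X)

query = dna_power_spectrum("ACGTACGTGGCC" * 15).reshape(1, -1)
prediction = max(models, key=lambda t: models[t].score(query))
```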
Evidence of Task-Independent Person-Specific Signatures in EEG using Subspace Techniques
Electroencephalography (EEG) signals are promising as alternatives to other
biometrics owing to their protection against spoofing. Previous studies have
focused on capturing individual variability by analyzing
task/condition-specific EEG. This work attempts to model biometric signatures
independent of task/condition by normalizing the associated variance. Toward
this goal, the paper extends ideas from subspace-based text-independent speaker
recognition and proposes novel modifications for modeling multi-channel EEG
data. The proposed techniques assume that biometric information is present in
the entire EEG signal and accumulate statistics across time in a high
dimensional space. These high dimensional statistics are then projected to a
lower dimensional space where the biometric information is preserved. The lower
dimensional embeddings obtained using the proposed approach are shown to be
task-independent. The best subspace system identifies individuals with
accuracies of 86.4% and 35.9% on datasets with 30 and 920 subjects,
respectively, using just nine EEG channels. The paper also provides insights
into the subspace model's scalability to unseen tasks and individuals during
training and the number of channels needed for subspace modeling.Comment: \copyright 2021 IEEE. Personal use of this material is permitted.
Permission from IEEE must be obtained for all other uses, in any current or
future media, including reprinting/republishing this material for advertising
or promotional purposes, creating new collective works, for resale or
redistribution to servers or lists, or reuse of any copyrighted component of
this work in other work
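A rough NumPy/scikit-learn sketch of the general subspace recipe described above: accumulate time-aggregated statistics from multi-channel EEG into a high-dimensional vector, learn a lower-dimensional projection, and identify subjects by cosine scoring of the embeddings. The choice of statistics (channel means and covariances), the PCA projection, and the synthetic data are illustrative stand-ins, not the paper's model.

```python
# Sketch: high-dimensional EEG statistics -> low-dimensional subspace embedding
# -> cosine scoring for subject identification. All data here are synthetic.
import numpy as np
from sklearn.decomposition import PCA

def eeg_supervector(eeg):
    """eeg: (channels, time). Stack channel means and upper-triangular covariance."""
    means = eeg.mean(axis=1)
    cov = np.cov(eeg)
    iu = np.triu_indices(cov.shape[0])
    return np.concatenate([means, cov[iu]])

rng = np.random.default_rng(0)
biases = [rng.normal(size=(9, 1)) for _ in range(30)]    # crude per-subject signatures
X, y = [], []
for subj, bias in enumerate(biases):                     # 30 subjects x 5 recordings
    for _ in range(5):
        X.append(eeg_supervector(bias + rng.normal(size=(9, 1000))))
        y.append(subj)
X, y = np.array(X), np.array(y)

pca = PCA(n_components=20).fit(X)                        # high-dim stats -> embedding
emb = pca.transform(X)
enroll = np.array([emb[y == s].mean(axis=0) for s in range(30)])

def identify(test_vec):
    e = pca.transform(test_vec.reshape(1, -1))[0]
    sims = enroll @ e / (np.linalg.norm(enroll, axis=1) * np.linalg.norm(e) + 1e-9)
    return int(np.argmax(sims))

test = eeg_supervector(biases[3] + rng.normal(size=(9, 1000)))
print(identify(test))                                    # typically resolves to 3
```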
- …