Refining a Deep Learning-based Formant Tracker using Linear Prediction Methods
In this study, formant tracking is investigated by refining the formants
tracked by an existing data-driven tracker, DeepFormants, using the formants
estimated in a model-driven manner by linear prediction (LP)-based methods. As
LP-based formant estimation methods, conventional covariance analysis (LP-COV)
and the recently proposed quasi-closed phase forward-backward (QCP-FB) analysis
are used. In the proposed refinement approach, the contours of the three lowest
formants are first predicted by the data-driven DeepFormants tracker, and the
predicted formants are replaced frame-wise with local spectral peaks given by
the model-driven LP-based methods. The refinement procedure can be plugged into
the DeepFormants tracker with no need for any new training. Two refined
DeepFormants trackers were compared with the original DeepFormants and with
five known traditional trackers using the popular vocal tract resonance (VTR)
corpus. The results indicated that the data-driven DeepFormants trackers
outperformed the conventional trackers and that the best performance was
obtained by refining the formants predicted by DeepFormants using QCP-FB
analysis. In addition, by tracking formants using VTR speech that was corrupted
by additive noise, the study showed that the refined DeepFormants trackers were
more resilient to noise than the reference trackers. In general, these results
suggest that LP-based model-driven approaches, which have traditionally been
used in formant estimation, can be combined with a modern data-driven tracker
easily with no further training to improve the tracker's performance.
Comment: Computer Speech and Language, Vol. 81, Article 101515, June 202
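As an illustration of the frame-wise refinement idea described above, the sketch below snaps each formant predicted by a data-driven tracker to the nearest local peak of an LP-based spectrum (e.g. from LP-COV or QCP-FB analysis). The function name, array layout, and the 300 Hz tolerance are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def refine_formants(predicted, lp_peaks, max_dev_hz=300.0):
    """Frame-wise refinement sketch: snap each predicted formant to the
    closest LP-based spectral peak, keeping the prediction when no peak
    lies within max_dev_hz (a hypothetical tolerance)."""
    refined = predicted.copy()
    for t, peaks in enumerate(lp_peaks):
        if peaks.size == 0:
            continue  # no candidate peaks in this frame; keep the prediction
        for k, f_pred in enumerate(predicted[t]):
            j = np.argmin(np.abs(peaks - f_pred))   # nearest LP peak
            if abs(peaks[j] - f_pred) <= max_dev_hz:
                refined[t, k] = peaks[j]            # frame-wise replacement
    return refined

# Toy usage: one frame of predicted F1-F3 and four candidate LP peaks.
pred = np.array([[500.0, 1500.0, 2500.0]])
peaks = [np.array([480.0, 1530.0, 2490.0, 3500.0])]
print(refine_formants(pred, peaks))   # -> [[ 480. 1530. 2490.]]
```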
Time-Varying Quasi-Closed-Phase Analysis for Accurate Formant Tracking in Speech Signals
In this paper, we propose a new method for the accurate estimation and
tracking of formants in speech signals using time-varying quasi-closed-phase
(TVQCP) analysis. Conventional formant tracking methods typically adopt a
two-stage estimate-and-track strategy wherein an initial set of formant
candidates are estimated using short-time analysis (e.g., 10--50 ms), followed
by a tracking stage based on dynamic programming or a linear state-space model.
One of the main disadvantages of these approaches is that the tracking stage,
however good it may be, cannot improve upon the formant estimation accuracy of
the first stage. The proposed TVQCP method provides single-stage formant
tracking that combines the estimation and tracking stages into one. TVQCP
analysis combines three approaches to improve formant estimation and tracking:
(1) it uses temporally weighted quasi-closed-phase analysis to derive
closed-phase estimates of the vocal tract with reduced interference from the
excitation source, (2) it increases the residual sparsity by using L1
optimization, and (3) it uses time-varying linear prediction analysis over long
time windows (e.g., 100--200 ms) to impose a continuity constraint on the vocal
tract model and hence on the formant trajectories. Formant tracking experiments
with a wide variety of synthetic and natural speech signals show that the
proposed TVQCP method performs better than conventional and popular formant
tracking tools, such as Wavesurfer and Praat (based on dynamic programming),
the KARMA algorithm (based on Kalman filtering), and DeepFormants (based on
deep neural networks trained in a supervised manner). Matlab scripts for the
proposed method can be found at: https://github.com/njaygowda/ftrac
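As a rough illustration of the time-varying linear prediction ingredient, the sketch below expands each predictor coefficient on a polynomial time basis over a long window and solves an (optionally temporally weighted) least-squares problem. The basis choice, normalized time axis, and function name are assumptions for illustration; the full TVQCP analysis additionally involves the quasi-closed-phase weighting and the sparsity-promoting optimization described above.

```python
import numpy as np

def time_varying_lp(x, order, n_basis, w=None):
    """Time-varying LP sketch: each predictor coefficient a_k(n) is expanded
    on a polynomial basis over the whole (long) analysis window, and the
    expansion coefficients are estimated by weighted least squares.
    With n_basis = 1 this reduces to ordinary covariance-method LP."""
    x = np.asarray(x, dtype=float)
    N = len(x)
    w = np.ones(N) if w is None else np.asarray(w, dtype=float)
    t_norm = np.linspace(-1.0, 1.0, N)                          # normalized time
    basis = np.vstack([t_norm ** m for m in range(n_basis)]).T  # (N, n_basis)
    rows, targets, weights = [], [], []
    for n in range(order, N):
        past = x[n - order:n][::-1]            # x[n-1], ..., x[n-order]
        rows.append(np.kron(past, basis[n]))   # regressors for all (k, m) pairs
        targets.append(x[n])
        weights.append(w[n])
    A = np.asarray(rows) * np.sqrt(weights)[:, None]
    b = np.asarray(targets) * np.sqrt(weights)
    coeffs, *_ = np.linalg.lstsq(A, b, rcond=None)
    return coeffs.reshape(order, n_basis)      # b[k, m]: a_k(n) = sum_m b[k,m] t^m

# Toy usage: a long window (1600 samples of a noisy sinusoid).
rng = np.random.default_rng(0)
sig = np.sin(2 * np.pi * 0.06 * np.arange(1600)) + 0.01 * rng.standard_normal(1600)
print(time_varying_lp(sig, order=8, n_basis=3).shape)   # -> (8, 3)
```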
Genetic diversity study in tropical carrot (Daucus carota L.)
A genetic diversity study was conducted at the ICAR-Indian Institute of Horticultural Research, Bengaluru, during 2018-19. In this study, 80 accessions were evaluated for 16 yield and yield-attributing traits. Mahalanobis' D2 analysis grouped these accessions into seven clusters. Cluster I was the largest with 69 genotypes, followed by cluster III comprising six genotypes, while clusters II, IV, V, VI and VII contained one genotype each. Among the traits studied, yield contributed the most (38.04%) towards diversity, followed by root weight (26.58%), root color (9.18%) and plant height (6.7%). The remaining traits, root weight (g), leaf weight (g), root weight (g), number of leaves, TSS (°Brix), leaf weight (g), root diameter (mm), core diameter (mm), and root cracking, contributed 3.45, 2.09, 1.77, 1.71, 1.55, 1.52, 1.46, 1.33, 1.01 and 0.82 percent, respectively. The diversity analysis gives an indication of the genetic variation among the carrot accessions, which will prove useful in the selection of diverse parents in crop improvement programmes.
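For readers unfamiliar with the statistic, the sketch below computes the Mahalanobis D2 distance between the trait-mean vectors of two accessions given a pooled within-group covariance matrix. The trait names and numbers are made up, and the subsequent clustering of the pairwise D2 matrix (e.g. by Tocher's method) is not shown.

```python
import numpy as np

def mahalanobis_d2(mean_i, mean_j, pooled_cov):
    """Mahalanobis D2 between the trait-mean vectors of two accessions,
    given a pooled within-group covariance matrix of the traits."""
    d = np.asarray(mean_i, dtype=float) - np.asarray(mean_j, dtype=float)
    return float(d @ np.linalg.solve(pooled_cov, d))

# Toy usage with three hypothetical traits: root weight (g), plant height (cm),
# and TSS (degrees Brix); all values are illustrative, not from the study.
pooled_cov = np.array([[25.0, 2.0, 0.5],
                       [2.0, 16.0, 0.3],
                       [0.5, 0.3, 1.2]])
print(mahalanobis_d2([120.0, 60.0, 8.5], [95.0, 55.0, 9.1], pooled_cov))
```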
Data-driven grapheme-to-phoneme representations for a lexicon-free text-to-speech
Grapheme-to-Phoneme (G2P) is an essential first step in any modern,
high-quality Text-to-Speech (TTS) system. Most of the current G2P systems rely
on carefully hand-crafted lexicons developed by experts. This poses a two-fold
problem. Firstly, the lexicons are generated using a fixed phoneme set,
usually ARPABET or IPA, which might not be the optimal way to represent
phonemes for all languages. Secondly, the man-hours required to produce such an
expert lexicon are very high. In this paper, we eliminate both of these issues
by using recent advances in self-supervised learning to obtain data-driven
phoneme representations instead of fixed representations. We compare our
lexicon-free approach against strong baselines that utilize a well-crafted
lexicon. Furthermore, we show that our data-driven lexicon-free method performs
as well as or even marginally better than the conventional rule-based or
lexicon-based neural G2Ps in terms of Mean Opinion Score (MOS) while using no
prior language lexicon or phoneme set, i.e. no linguistic expertise.
Comment: Accepted at ICASSP 202
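One common way to obtain such data-driven, phoneme-like representations is to quantize frame-level self-supervised speech features into a discrete unit inventory, for example with k-means. The sketch below illustrates that general idea on random stand-in features; it is not claimed to be the paper's exact procedure, and the inventory size and feature dimensions are arbitrary assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

# Stand-in for frame-level self-supervised speech features (e.g. the output of
# a wav2vec 2.0-style encoder); shape (n_frames, feat_dim), random values here.
rng = np.random.default_rng(0)
features = rng.standard_normal((5000, 256))

# Quantize the feature space into a discrete inventory of data-driven units
# that can take the place of a fixed phoneme set such as ARPABET or IPA.
n_units = 100   # assumed inventory size, not taken from the paper
km = KMeans(n_clusters=n_units, n_init=10, random_state=0).fit(features)

# Each frame is now a unit index; a grapheme-to-unit model could be trained to
# predict these indices directly, removing the need for an expert lexicon.
print(km.predict(features[:20]))
```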
On the compression of shallow non-causal ASR models using knowledge distillation and tied-and-reduced decoder for low-latency on-device speech recognition
Recently, the cascaded two-pass architecture has emerged as a strong
contender for on-device automatic speech recognition (ASR). A cascade of causal
and shallow non-causal encoders coupled with a shared decoder enables operation
in both streaming and look-ahead modes. In this paper, we propose a shallow
cascaded model that combines various model compression techniques such as
knowledge distillation, shared decoder, and tied-and-reduced transducer network
in order to reduce the model footprint. The shared decoder is changed into a
tied-and-reduced network. The cascaded two-pass model is further compressed
using knowledge distillation with a Kullback-Leibler divergence loss on the
model posteriors. We demonstrate a 50% reduction in the size of a 41 M
parameter cascaded teacher model with no noticeable degradation in ASR accuracy
and a 30% reduction in latency.
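A minimal sketch of the knowledge-distillation loss mentioned above, written in PyTorch: the student posteriors are pulled towards the teacher posteriors with a Kullback-Leibler divergence. The tensor shapes, the temperature, and the function name are illustrative assumptions rather than the paper's exact configuration.

```python
import torch
import torch.nn.functional as F

def distillation_kl_loss(student_logits, teacher_logits, temperature=1.0):
    """KL-divergence distillation loss on model posteriors (sketch).
    Both logit tensors are assumed to have shape (batch, time, vocab)."""
    log_p_student = F.log_softmax(student_logits / temperature, dim=-1)
    p_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    # 'batchmean' matches the mathematical definition of the KL divergence;
    # the temperature**2 factor keeps the gradient scale comparable across T.
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * temperature ** 2

# Toy usage with random logits standing in for teacher and student outputs.
student = torch.randn(2, 5, 30)
teacher = torch.randn(2, 5, 30)
print(distillation_kl_loss(student, teacher, temperature=2.0))
```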
Mitigating the Exposure Bias in Sentence-Level Grapheme-to-Phoneme (G2P) Transduction
The Text-to-Text Transfer Transformer (T5) has recently been considered for
Grapheme-to-Phoneme (G2P) transduction. As a follow-up, a tokenizer-free
byte-level model based on T5, referred to as ByT5, recently gave promising
results on word-level G2P conversion by representing each input character with
its corresponding UTF-8 encoding. Although it is generally understood that
sentence-level or paragraph-level G2P can improve usability in real-world
applications as it is better suited to handling heteronyms and linking sounds
between words, we find that using ByT5 for these scenarios is nontrivial. Since
ByT5 operates on the character level, it requires longer decoding steps, which
deteriorates the performance due to the exposure bias commonly observed in
auto-regressive generation models. This paper shows that the performance of
sentence-level and paragraph-level G2P can be improved by mitigating such
exposure bias using our proposed loss-based sampling method.
Comment: INTERSPEECH 202
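The paper's specific loss-based sampling criterion is not reproduced here; as a generic illustration of how exposure bias can be reduced, the sketch below mixes gold and model-predicted tokens in the decoder input, in the spirit of scheduled sampling. The function name, shapes, and sampling probability are assumptions.

```python
import torch

def mix_decoder_inputs(gold_ids, model_ids, sample_prob):
    """Scheduled-sampling-style mixing (generic sketch, not the paper's
    loss-based criterion): each decoder input position independently takes the
    model's own previous prediction with probability sample_prob, otherwise the
    gold token, so that training conditions resemble inference more closely."""
    use_model = torch.rand(gold_ids.shape) < sample_prob
    return torch.where(use_model, model_ids, gold_ids)

# Toy usage with made-up byte ids, as a ByT5-style model would consume.
gold = torch.randint(0, 256, (2, 8))
pred = torch.randint(0, 256, (2, 8))
print(mix_decoder_inputs(gold, pred, sample_prob=0.25))
```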
Quasi-closed phase forward-backward linear prediction analysis of speech for accurate formant detection and estimation
Recently, a quasi-closed phase (QCP) analysis of speech signals for accurate glottal inverse filtering was proposed. However, QCP analysis, which belongs to the family of temporally weighted linear prediction (WLP) methods, uses the conventional forward type of sample prediction. This may not be the best choice, especially when computing WLP models with a hard-limiting weighting function: a sample-selective minimization of the prediction error in WLP reduces the effective number of samples available within a given window frame. To counter this problem, a modified quasi-closed phase forward-backward (QCP-FB) analysis is proposed, wherein each sample is predicted from both its past and its future samples, thereby utilizing the available samples more effectively. Formant detection and estimation experiments on synthetic vowels generated using a physical modeling approach, as well as on natural speech utterances, show that the proposed QCP-FB method yields statistically significant improvements over conventional linear prediction and QCP methods.
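To make the forward-backward idea concrete, the sketch below estimates a single set of prediction coefficients by jointly minimizing temporally weighted forward and backward squared prediction errors, in a modified-covariance-style formulation. The uniform default weights, window length, and function name are illustrative assumptions; a quasi-closed-phase weighting function would be supplied in place of the uniform weights.

```python
import numpy as np

def fb_weighted_lp(x, order, w=None):
    """Forward-backward weighted LP sketch: each sample is predicted from its
    past (forward error) and from its future (backward error), and a single
    coefficient set minimizes the sum of the two weighted squared errors."""
    x = np.asarray(x, dtype=float)
    N = len(x)
    w = np.ones(N) if w is None else np.asarray(w, dtype=float)
    R = np.zeros((order, order))
    r = np.zeros(order)
    for n in range(order, N):                   # forward-prediction terms
        past = x[n - order:n][::-1]             # x[n-1], ..., x[n-order]
        R += w[n] * np.outer(past, past)
        r += w[n] * x[n] * past
    for n in range(N - order):                  # backward-prediction terms
        future = x[n + 1:n + order + 1]         # x[n+1], ..., x[n+order]
        R += w[n] * np.outer(future, future)
        r += w[n] * x[n] * future
    return np.linalg.solve(R, r)                # predictor coefficients

# Toy usage on a noisy two-resonance signal.
rng = np.random.default_rng(0)
sig = (np.sin(2 * np.pi * 0.05 * np.arange(400))
       + 0.5 * np.sin(2 * np.pi * 0.21 * np.arange(400))
       + 0.01 * rng.standard_normal(400))
print(fb_weighted_lp(sig, order=8)[:4])
```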