1,221 research outputs found
Physiologically-Motivated Feature Extraction Methods for Speaker Recognition
Speaker recognition has received a great deal of attention from the speech community, and significant gains in robustness and accuracy have been obtained over the past decade. However, the features used for identification are still primarily representations of overall spectral characteristics, and thus the models are primarily phonetic in nature, differentiating speakers based on overall pronunciation patterns. This creates difficulties in terms of the amount of enrollment data and complexity of the models required to cover the phonetic space, especially in tasks such as identification where enrollment and testing data may not have similar phonetic coverage. This dissertation introduces new features based on vocal source characteristics intended to capture physiological information related to the laryngeal excitation energy of a speaker. These features, including RPCC, GLFCC and TPCC, represent the unique characteristics of speech production not represented in current state-of-the-art speaker identification systems. The proposed features are evaluated through three experimental paradigms: cross-lingual speaker identification, cross song-type avian speaker identification, and mono-lingual speaker identification. The experimental results show that the proposed features provide information about speaker characteristics that is significantly different in nature from the phonetically-focused information present in traditional spectral features. The incorporation of the proposed glottal source features offers significant overall improvement to the robustness and accuracy of speaker identification tasks.
Time-Varying Modeling of Glottal Source and Vocal Tract and Sequential Bayesian Estimation of Model Parameters for Speech Synthesis
Speech is generated by articulators acting on a phonatory source. Identifying this phonatory source and recovering the articulatory geometry are individually challenging, ill-posed problems, called speech separation and articulatory inversion, respectively. There is a trade-off between the decomposition and the recovered articulatory geometry because the mapping between an articulatory configuration and the speech produced is not unique. Moreover, measurements obtained only from a microphone sensor offer no invasive insight, which adds further difficulty to an already hard problem. A joint non-invasive estimation strategy that couples articulatory and phonatory knowledge would lead to better articulatory speech synthesis. In this thesis, a joint estimation strategy for speech separation and articulatory geometry recovery is studied. Unlike previous periodic/aperiodic decomposition methods that use stationary speech models within a frame, the proposed model provides a non-stationary speech decomposition method. A parametric glottal source model and an articulatory vocal tract response are represented in a dynamic state-space formulation. The unknown parameters of the speech generation components are estimated using sequential Monte Carlo methods under some specific assumptions. The proposed approach is compared with other glottal inverse filtering methods, including iterative adaptive inverse filtering, state-space inverse filtering, and the quasi-closed phase method.
Dissertation/Thesis, Masters Thesis, Electrical Engineering 201
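As a rough illustration of what sequential Monte Carlo estimation in a state-space setting can look like, the sketch below runs a bootstrap particle filter on a toy problem. It is not the thesis model: the scalar random-walk state, the Gaussian observation model, and every name and parameter value (`bootstrap_particle_filter`, `process_std`, `obs_std`, and so on) are assumptions chosen only to keep the example short.

```python
import math
import random


def bootstrap_particle_filter(observations, n_particles=500,
                              process_std=0.05, obs_std=0.2, seed=0):
    """Track a slowly varying scalar parameter from noisy observations.

    Hypothetical toy model (an assumption, not the model from the abstract):
        state:       x[t] = x[t-1] + Gaussian noise with std process_std
        observation: y[t] = x[t]   + Gaussian noise with std obs_std
    Returns one posterior-mean estimate per observation.
    """
    rng = random.Random(seed)
    # Initialise the particle cloud around the first observation.
    particles = [observations[0] + rng.gauss(0.0, obs_std)
                 for _ in range(n_particles)]
    estimates = []
    for y in observations:
        # 1. Propagate each particle through the random-walk state model.
        particles = [p + rng.gauss(0.0, process_std) for p in particles]
        # 2. Weight each particle by the Gaussian likelihood of the observation.
        weights = [math.exp(-0.5 * ((y - p) / obs_std) ** 2) for p in particles]
        total = sum(weights)
        if total == 0.0:
            # All likelihoods underflowed: fall back to uniform weights.
            weights = [1.0 / n_particles] * n_particles
        else:
            weights = [w / total for w in weights]
        # 3. The weighted mean of the particles is the state estimate.
        estimates.append(sum(w * p for w, p in zip(weights, particles)))
        # 4. Multinomial resampling to counter weight degeneracy.
        particles = rng.choices(particles, weights=weights, k=n_particles)
    return estimates


if __name__ == "__main__":
    noise = random.Random(1)
    truth = [1.0 + 0.01 * t for t in range(100)]
    noisy = [x + noise.gauss(0.0, 0.2) for x in truth]
    smoothed = bootstrap_particle_filter(noisy)
    print(smoothed[-1], truth[-1])
```

In a speech setting the state would instead hold time-varying glottal source and vocal tract parameters and the observation would be the microphone signal, but the propagate, weight, estimate, resample loop has the same shape.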
Glottal-synchronous speech processing
Glottal-synchronous speech processing is a field of speech science where the pseudoperiodicity
of voiced speech is exploited. Traditionally, speech processing involves segmenting
and processing short speech frames of predefined length; this may fail to exploit the inherent
periodic structure of voiced speech which glottal-synchronous speech frames have
the potential to harness. Glottal-synchronous frames are often derived from the glottal
closure instants (GCIs) and glottal opening instants (GOIs).
The SIGMA algorithm was developed for the detection of GCIs and GOIs from
the Electroglottograph signal with a measured accuracy of up to 99.59%. For GCI and
GOI detection from speech signals, the YAGA algorithm provides a measured accuracy
of up to 99.84%. Multichannel speech-based approaches are shown to be more robust to
reverberation than single-channel algorithms.
The GCIs are applied to real-world applications including speech dereverberation,
where SNR is improved by up to 5 dB, and to prosodic manipulation where the importance
of voicing detection in glottal-synchronous algorithms is demonstrated by subjective
testing. The GCIs are further exploited in a new area of data-driven speech modelling,
providing new insights into speech production and a set of tools to aid deployment into
real-world applications. The technique is shown to be applicable in areas of speech coding,
identification and artificial bandwidth extension of telephone speech.
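The idea of cutting frames at glottal closure instants rather than at a fixed length can be sketched as follows. This is a minimal illustration that assumes the GCIs have already been detected by some external algorithm and are given as sorted sample indices; the function name `glottal_synchronous_frames` and the choice of two-cycle frames are assumptions, not details taken from the abstract.

```python
def glottal_synchronous_frames(signal, gcis):
    """Cut one frame per interior GCI, spanning the two adjacent glottal cycles.

    signal: sequence of samples.
    gcis:   sorted sample indices of glottal closure instants, assumed to come
            from a separate detector (not implemented here).
    The frame for an interior GCI runs from the previous GCI up to, but not
    including, the next GCI, so frame length follows the local pitch period.
    """
    frames = []
    for prev_gci, _gci, next_gci in zip(gcis, gcis[1:], gcis[2:]):
        frames.append(list(signal[prev_gci:next_gci]))
    return frames


if __name__ == "__main__":
    samples = list(range(100))      # stand-in for a speech waveform
    closures = [10, 30, 55, 75]     # hypothetical GCI positions
    for frame in glottal_synchronous_frames(samples, closures):
        print(frame[0], len(frame))
```

Unlike fixed-length framing, the frame boundaries here move with the pitch period, so each frame covers a whole number of glottal cycles.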
Modeling biomechanical influence of epilaryngeal stricture on the vocal folds: A low-dimensional model of vocal-ventricular coupling
Purpose: Physiological and phonetic studies suggest that, at moderate levels of epilaryngeal stricture, the ventricular folds impinge upon the vocal folds and influence their dynamical behavior, which is thought to be responsible for constricted laryngeal sounds. In this work, the authors examine this hypothesis through biomechanical modeling. Method: The dynamical response of a low-dimensional, lumped-element model of the vocal folds under the influence of vocal-ventricular fold coupling was evaluated. The model was assessed for F0 and cover-mass phase difference. Case studies of simulations of different constricted phonation types and of glottal stop illustrate various additional aspects of model performance. Results: Simulated vocal-ventricular fold coupling lowers F0 and perturbs the mucosal wave. It also appears to reinforce irregular patterns of oscillation, and it can enhance laryngeal closure in glottal stop production. Conclusion: The effects of simulated vocal-ventricular fold coupling are consistent with sounds, such as creaky voice, harsh voice, and glottal stop, that have been observed to involve epilaryngeal stricture and apparent contact between the vocal folds and ventricular folds. This supports the view that vocal-ventricular fold coupling is important in the vibratory dynamics of such sounds and, furthermore, suggests that these sounds may intrinsically require epilaryngeal stricture.