Analysis of nonmodal glottal event patterns with application to automatic speaker recognition
Thesis (Ph. D.)--Harvard-MIT Division of Health Sciences and Technology, 2008. Includes bibliographical references (p. 211-215). Regions of phonation exhibiting nonmodal characteristics are likely to contain information about speaker identity, language, dialect, and vocal-fold health. As a basis for testing such dependencies, we develop a representation of patterns in the relative timing and height of nonmodal glottal pulses. To extract the timing and height of candidate pulses, we investigate a variety of inverse-filtering schemes, including maximum-entropy deconvolution, which minimizes the predictability of a signal, and minimum-entropy deconvolution, which maximizes pulse-likeness. Hybrid formulations of these methods are also considered. We then derive a theoretical framework for understanding frequency- and time-domain properties of a pulse sequence, a process that sheds light on the transformation of nonmodal pulse trains into useful parameters. In the frequency domain, we introduce the first comprehensive mathematical derivation of the effect of deterministic and stochastic source perturbation on the short-time spectrum. We also propose a pitch representation of nonmodality that provides an alternative viewpoint on the frequency content and does not rely on Fourier bases. In developing time-domain properties, we use projected low-dimensional histograms of feature vectors derived from pulse timing and height parameters. For these features, we have found clusters of distinct pulse patterns, reflecting a wide variety of glottal-pulse phenomena including near-modal phonation, shimmer and jitter, diplophonia and triplophonia, and aperiodicity. Using temporal relationships between successive feature vectors, we have also developed an algorithm to separate these different classes of glottal-pulse characteristics.
We have used our glottal-pulse-pattern representation to automatically test for one signal dependency: speaker dependence of glottal-pulse sequences. This choice is motivated by differences observed between talkers in our separated feature space. Using an automatic speaker verification experiment, we investigate tradeoffs in speaker dependency for short-time pulse patterns, reflecting local irregularity, as well as long-time patterns related to higher-level cyclic variations. Results, using speakers with a broad array of modal and nonmodal behaviors, indicate high accuracy in speaker recognition, complementary to the use of conventional mel-cepstral features. These results suggest that there is rich structure in the source excitation that provides information about a particular speaker's identity. By Nicolas Malyska. Ph.D.
Development of acoustic analysis techniques for use in diagnosis of vocal pathology
Acoustic analysis as used in the vocal pathology literature has come to mean any spectrum or waveform measurement taken from the digitised speech signal. The purpose of the work set out in the present thesis is to investigate the currently available acoustic measures, to test their validity and to introduce new measures. More specifically, pitch extraction techniques and perturbation measures have been tested, several harmonic-to-noise ratio techniques have been implemented and thoroughly investigated (three of which are new) and cepstral and other spectral measures have been examined. Also, ratios relevant to voice source characteristics and perceptual correlation have been considered in addition to the traditional harmonic-to-noise ratios. A study of these approaches has revealed that many measurement problems arise and that the separation of the indices into independent measures is not a simple issue. The most commonly used acoustic measures for diagnosis of vocal pathology are jitter, shimmer and the harmonic-to-noise ratio. However, several researchers have shown that these measures are not independent and therefore may give ambiguous information. For example, the addition of random noise causes increased jitter measurements and the introduction of jitter causes a reduced harmonic-to-noise ratio. Recent studies have shown that the glottal waveform, and hence the vibratory pattern of the vocal folds, may be estimated in terms of spectral measurements. However, in order to provide spectral characterisation of the vibratory pattern in pathological voice types, the effects of jitter and shimmer on the speech spectrum must first be removed. These issues are thoroughly addressed in this thesis. The foundation has been laid for future studies that will investigate the vibratory pattern of the vocal folds based on spectral evaluation of tape-recorded data.
All analysis techniques are tested by first running them on specially designed synthesis data files and then on a group of 13 patients with varying pathologies and a group of 12 normal speakers. Finally, the possibility of using digital spectrograms for speaker identification purposes has been addressed.
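As an illustration of the perturbation measures named in this abstract, the sketch below computes jitter and shimmer in their common "local" form: the mean absolute difference between consecutive cycles, divided by the mean value. The abstract does not say which definitions the thesis uses, so this formula, the function names and the sample values are all assumptions made for illustration.

```python
from statistics import mean


def local_perturbation(values):
    """Mean absolute difference between consecutive values,
    relative to the mean value (assumed 'local' definition)."""
    if len(values) < 2:
        raise ValueError("need at least two cycles")
    diffs = [abs(b - a) for a, b in zip(values, values[1:])]
    return mean(diffs) / mean(values)


def jitter_local(periods):
    """Cycle-to-cycle variation of pitch periods (in seconds)."""
    return local_perturbation(periods)


def shimmer_local(amplitudes):
    """Cycle-to-cycle variation of peak amplitudes."""
    return local_perturbation(amplitudes)


# Hypothetical pitch periods and peak amplitudes for four glottal cycles.
periods = [0.010, 0.011, 0.010, 0.011]
amplitudes = [1.0, 0.9, 1.0, 0.9]
print(f"jitter:  {jitter_local(periods):.4f}")
print(f"shimmer: {shimmer_local(amplitudes):.4f}")
```

Both measures depend on the same cycle boundaries, which is one reason noise in the signal can inflate them together, as the abstract notes.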
Broadcast speech and the effect of voice quality on the listener : a study of the various components which categorise listener perception by vocal characteristics.
Voice quality is crucial to the art of the broadcast speaker. Acceptable voice
quality is a necessity for an acceptable microphone voice and essential therefore for
employment as a broadcaster. This thesis investigates the characteristics of the
voice which provide that acceptability; and categorises the features which lead the
listener to make judgements about their vocal likes and dislikes. These subjective
judgements are explored by investigating the psychological, medical, and innate
features contributing to the vocal perceptions of the listener. Voice quality is
related to the efficiency of the larynx and its importance to voice production; and
to the various vocal disorders which can affect the broadcaster.
It becomes evident throughout the thesis that each listener receives a clear
impression of the personality of the speaker through the features present in the
voice. Many of these impressions however are based on stereotypes. The thesis
relates these stereotypical judgements to accents, investigating their relationship to
the 'BBC' voice, the 'World Service' voice, the 'ILR' voice and the 'reporter's
voice'. It is shown that the listener's subjective impression of the voice and the
broadcaster's personality is formed by the presentational and physical aspects of voice
quality.
Listener perceptions of voice acceptability are tested and discussed. The data is
analysed to provide a set of dominant characteristics from which are drawn voice
histograms and frequency polygons.
The result is a set of preferred voice characteristics which apply specifically to the
broadcast speaker and which can be sought during the selection process.
Text-Independent Voice Conversion
This thesis deals with text-independent solutions for voice conversion. It first introduces the use of vocal tract length normalization (VTLN) for voice conversion. The presented variants of VTLN allow speaker characteristics to be changed easily by means of a few trainable parameters. Furthermore, it is shown how VTLN can be expressed in the time domain, strongly reducing the computational cost while keeping high speech quality. The second text-independent voice conversion paradigm is residual prediction. In particular, two proposed techniques, residual smoothing and the application of unit selection, result in substantial improvements in both speech quality and voice similarity. In order to apply the well-studied linear transformation paradigm to text-independent voice conversion, two text-independent speech alignment techniques are introduced. One is based on automatic segmentation and mapping of artificial phonetic classes and the other is a completely data-driven approach with unit selection. The latter achieves a performance very similar to the conventional text-dependent approach in terms of speech quality and similarity. It is also successfully applied to cross-language voice conversion. The investigations of this thesis are based on several corpora of three different languages, i.e., English, Spanish, and German. Results are also presented from the multilingual voice conversion evaluation in the framework of the international speech-to-speech translation project TC-Star.
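To make concrete the idea of changing speaker characteristics with a few trainable parameters, the sketch below implements one widely used frequency-warping function, the bilinear (all-pass) warp, which has a single parameter alpha. The abstract does not state which warping functions the thesis uses, so this choice, the function name and the sample values are assumptions made for illustration.

```python
import math


def bilinear_warp(omega, alpha):
    """Warp a normalised frequency omega in [0, pi] with the bilinear
    (all-pass) function. alpha must satisfy |alpha| < 1; alpha = 0
    leaves the frequency axis unchanged. Assumed warping function,
    not necessarily the one used in the thesis."""
    if not -1.0 < alpha < 1.0:
        raise ValueError("alpha must lie strictly between -1 and 1")
    return omega + 2.0 * math.atan(
        alpha * math.sin(omega) / (1.0 - alpha * math.cos(omega))
    )


# Warp a few normalised frequencies with a hypothetical alpha of 0.1.
for k in range(5):
    omega = k * math.pi / 4
    print(f"{omega:.3f} -> {bilinear_warp(omega, 0.1):.3f}")
```

The warp keeps the endpoints 0 and pi fixed and is monotonic, so a positive alpha shifts spectral content upward and a negative alpha shifts it downward.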