4,013 research outputs found
Relating Objective and Subjective Performance Measures for AAM-based Visual Speech Synthesizers
We compare two approaches for synthesizing visual speech using Active Appearance Models (AAMs): one that utilizes acoustic features as input, and one that utilizes a phonetic transcription as input. Both synthesizers are trained using the same data and the performance is measured using both objective and subjective testing. We investigate the impact of likely sources of error in the synthesized visual speech by introducing typical errors into real visual speech sequences and subjectively measuring the perceived degradation. When only a small region (e.g. a single syllable) of ground-truth visual speech is incorrect, we find that the subjective score for the entire sequence is lower than that of sequences generated by our synthesizers. This observation motivates further consideration of an often ignored issue: to what extent are subjective measures correlated with objective measures of performance? Significantly, we find that the most commonly used objective measures of performance are not necessarily the best indicator of viewer perception of quality. We empirically evaluate alternatives and show that the cost of a dynamic time warp of synthesized visual speech parameters to the respective ground-truth parameters is a better indicator of subjective quality.
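As an illustration of the objective measure named in this abstract, the following is a minimal sketch of a dynamic time warp (DTW) cost between a synthesized parameter sequence and its ground truth. The function name `dtw_cost`, the Euclidean frame distance, and the unconstrained step pattern are assumptions made for this example; the abstract does not specify how the authors computed the warp.

```python
import math


def dtw_cost(synth, truth):
    """Cumulative cost of the cheapest DTW alignment between two sequences.

    Each sequence is a list of equal-length parameter vectors (one per frame).
    The frame distance is Euclidean, which is an assumption of this sketch.
    """
    n, m = len(synth), len(truth)
    inf = float("inf")
    # cost[i][j] = cheapest alignment of the first i synth frames
    # with the first j ground-truth frames.
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = math.dist(synth[i - 1], truth[j - 1])
            # Allowed steps: advance either sequence alone, or both together.
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


# Hypothetical one-dimensional parameter tracks, for illustration only.
ground_truth = [[0.0], [1.0], [2.0]]
stretched = [[0.0], [1.0], [1.0], [2.0]]  # same shape, one frame held longer
print(dtw_cost(stretched, ground_truth))
```

Because the warp absorbs purely temporal differences, a sequence that merely holds a frame longer incurs no cost, whereas a frame-by-frame error measure would penalize it.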
Speech vocoding for laboratory phonology
Using phonological speech vocoding, we propose a platform for exploring
relations between phonology and speech processing, and in broader terms, for
exploring relations between the abstract and physical structures of a speech
signal. Our goal is to make a step towards bridging phonology and speech
processing and to contribute to the program of Laboratory Phonology. We show
three application examples for laboratory phonology: compositional phonological
speech modelling, a comparison of phonological systems and an experimental
phonological parametric text-to-speech (TTS) system. The featural
representations of the following three phonological systems are considered in
this work: (i) Government Phonology (GP), (ii) the Sound Pattern of English
(SPE), and (iii) the extended SPE (eSPE). Comparing GP- and eSPE-based vocoded
speech, we conclude that the latter achieves slightly better results than the
former. However, GP - the most compact phonological speech representation -
performs comparably to the systems with a higher number of phonological
features. The parametric TTS based on phonological speech representation, and
trained from an unlabelled audiobook in an unsupervised manner, achieves
intelligibility of 85% of the state-of-the-art parametric speech synthesis. We
envision that the presented approach paves the way for researchers in both
fields to form meaningful hypotheses that are explicitly testable using the
concepts developed and exemplified in this paper. On the one hand, laboratory
phonologists might test the applied concepts of their theoretical models, and
on the other hand, the speech processing community may utilize the concepts
developed for the theoretical phonological models for improvements of the
current state-of-the-art applications
Kalman tracking of linear predictor and harmonic noise models for noisy speech enhancement
This paper presents a speech enhancement method based on the tracking and denoising of the formants of a linear prediction (LP) model of the spectral envelope of speech and the parameters of a harmonic noise model (HNM) of its excitation. The main advantages of tracking and denoising the prominent energy contours of speech are the efficient use of the spectral and temporal structures of successive speech frames and a mitigation of the processing artefact known as "musical noise" or "musical tones". The formant-tracking linear prediction (FTLP) model estimation consists of three stages: (a) speech pre-cleaning based on a spectral amplitude estimation, (b) formant-tracking across successive speech frames using the Viterbi method, and (c) Kalman filtering of the formant trajectories across successive speech frames. The HNM parameters for the excitation signal comprise: the voiced/unvoiced decision, the fundamental frequency, the harmonics' amplitudes and the variance of the noise component of excitation. A frequency-domain pitch extraction method is proposed that searches for the peak signal-to-noise ratios (SNRs) at the harmonics. For each speech frame several pitch candidates are calculated. An estimate of the pitch trajectory across successive frames is obtained using a Viterbi decoder. The trajectories of the noisy excitation harmonics across successive speech frames are modeled and denoised using Kalman filters. The proposed method is used to deconstruct noisy speech, de-noise its model parameters and then reconstitute speech from its cleaned parts. Experimental evaluations show the performance gains of the formant tracking, pitch extraction and noise reduction stages.
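To make stage (c) concrete, here is a minimal sketch of Kalman filtering applied to a single noisy formant trajectory. It assumes a scalar random-walk state model with fixed variances; the function name `kalman_filter_track`, the parameter values, and the sample frequencies are hypothetical, and the abstract does not state the state model the authors actually used.

```python
def kalman_filter_track(observations, process_var=1.0, obs_var=100.0):
    """Forward Kalman filter for one trajectory under a random-walk model.

    observations: noisy per-frame values (e.g. a formant frequency in Hz).
    process_var:  assumed variance of the frame-to-frame drift.
    obs_var:      assumed variance of the measurement noise.
    """
    estimate = observations[0]  # initialise from the first frame
    variance = obs_var
    filtered = [estimate]
    for z in observations[1:]:
        # Predict: the state is assumed unchanged, but uncertainty grows.
        variance += process_var
        # Update: blend prediction and measurement by the Kalman gain.
        gain = variance / (variance + obs_var)
        estimate += gain * (z - estimate)
        variance *= 1.0 - gain
        filtered.append(estimate)
    return filtered


# Hypothetical noisy first-formant track in Hz, for illustration only.
noisy_f1 = [500.0, 520.0, 480.0, 510.0, 490.0]
print(kalman_filter_track(noisy_f1))
```

A larger `obs_var` relative to `process_var` makes the filter trust the measurements less, which gives a smoother track that is slower to follow genuine formant movement.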
Timbre space as synthesis space: towards a navigation based approach to timbre specification
Much research into timbre, its perception and classification over the last forty years has modelled timbre as an n-dimensional co-ordinate space or timbre space, whose axes are measurable acoustical quantities (variously, spectral density, simultaneity of partial onsets etc.). Typically, these spaces have been constructed from data generated from similarity/dissimilarity listening tests, using multidimensional scaling (MDS) analysis techniques. Our current research concerns the computer-assisted synthesis of new timbres using a timbre space search strategy, in which a previously constructed simple timbre space is used as a search space by an algorithm designed to synthesize desired new timbres steered by iterative user input. The success of such an algorithm clearly depends on establishing a suitable mapping between the space's quantifiable features and its perceptual features. We therefore present here, firstly, some of the findings of a series of listening tests aimed at establishing the perceptual topography and granularity of a simple, predefined timbre space, and secondly, the results of preliminary tests of two search strategies designed to navigate this space. The behaviour of these strategies in a circumscribed space of this kind, together with the corresponding user experience, is intended to provide a baseline for applications in a more complex space.
Speaker-normalized sound representations in the human auditory cortex
The acoustic dimensions that distinguish speech sounds (like the vowel differences in "boot" and "boat") also differentiate speakers' voices. Therefore, listeners must normalize across speakers without losing linguistic information. Past behavioral work suggests an important role for auditory contrast enhancement in normalization: preceding context affects listeners' perception of subsequent speech sounds. Here, using intracranial electrocorticography in humans, we investigate whether and how such context effects arise in auditory cortex. Participants identified speech sounds that were preceded by phrases from two different speakers whose voices differed along the same acoustic dimension as target words (the lowest resonance of the vocal tract). In every participant, target vowels evoke a speaker-dependent neural response that is consistent with the listener's perception, and which follows from a contrast enhancement model. Auditory cortex processing thus displays a critical feature of normalization, allowing listeners to extract meaningful content from the voices of diverse speakers.
Speech Development by Imitation
The Double Cone Model (DCM) is a model
of how the brain transforms sensory input to
motor commands through successive stages of
data compression and expansion. We have
tested a subset of the DCM on speech recognition, production and imitation. The experiments show that the DCM is a good candidate
for an artificial speech processing system that
can develop autonomously. We show that the
DCM can learn a repertoire of speech sounds
by listening to speech input. It is also able to
link the individual elements of speech to sequences that can be recognized or reproduced,
thus allowing the system to imitate spoken
language
Perceptually smooth timbral guides by state-space analysis of phase-vocoder parameters
Sculptor is a phase-vocoder-based package of programs
that allows users to explore timbral manipulation
of sound in real time. It is the product
of a research program seeking ultimately to perform
gestural capture by analysis of the sound a
performer makes using a conventional instrument.
Since the phase-vocoder output is of high dimensionality,
typically more than 1,000 channels per
analysis frame, mapping phase-vocoder output to
appropriate input parameters for a synthesizer is
only feasible in theory