26 research outputs found
Models and analysis of vocal emissions for biomedical applications
This book of Proceedings collects the papers presented at the 3rd International Workshop on Models and Analysis of Vocal Emissions for Biomedical Applications, MAVEBA 2003, held 10-12 December 2003, Firenze, Italy. The workshop is organised every two years, and aims to stimulate contacts between specialists active in research and industrial developments, in the area of voice analysis for biomedical applications. The scope of the Workshop includes all aspects of voice modelling and analysis, ranging from fundamental research to all kinds of biomedical applications and related established and advanced technologies.
A maximum margin dynamic model with its application to brain signal analysis
Ph.D. (Doctor of Philosophy)
Features of hearing: applications of machine learning to uncover the building blocks of hearing
Recent advances in machine learning have instigated a renewed interest in using machine learning approaches to better understand human sensory processing. This line of research is particularly interesting for speech research since speech comprehension is uniquely human, which complicates obtaining detailed neural recordings. In this thesis, I explore how machine learning can be used to uncover new knowledge about the auditory system, with a focus on discovering robust auditory features. The resulting increased understanding of the noise robustness of human hearing may help to better assist those with hearing loss and improve Automatic Speech Recognition (ASR) systems. First, I show how computational neuroscience and machine learning can be combined to generate hypotheses about auditory features. I introduce a neural feature detection model with a modest number of parameters that is compatible with auditory physiology. By testing feature detector variants in a speech classification task, I confirm the importance of both well-studied and lesser-known auditory features. Second, I investigate whether ASR software is a good candidate model of the human auditory system. By comparing several state-of-the-art ASR systems to the results from humans on a range of psychometric experiments, I show that these ASR systems diverge markedly from humans in at least some psychometric tests. This implies that none of these systems act as a strong proxy for human speech recognition, although some may be useful when asking more narrowly defined questions. For neuroscientists, this thesis exemplifies how machine learning can be used to generate new hypotheses about human hearing, while also highlighting the caveats of investigating systems that may work fundamentally differently from the human brain. For machine learning engineers, I point to tangible directions for improving ASR systems. 
To motivate the continued cross-fertilization between these fields, a toolbox that allows researchers to assess new ASR systems has been released. Open Access
Model-Based Speech Enhancement
Abstract
A method of speech enhancement is developed that reconstructs clean speech from
a set of acoustic features using a harmonic plus noise model of speech. This is a significant
departure from traditional filtering-based methods of speech enhancement.
A major challenge with this approach is to estimate accurately the acoustic features
(voicing, fundamental frequency, spectral envelope and phase) from noisy speech.
This is achieved using maximum a-posteriori (MAP) estimation methods that operate
on the noisy speech. In each case a prior model of the relationship between the
noisy speech features and the estimated acoustic feature is required. These models
are approximated using speaker-independent GMMs of the clean speech features
that are adapted to speaker-dependent models using MAP adaptation and for noise
using the Unscented Transform.
Objective results are presented to optimise the proposed system, and a set of subjective
tests compares the approach with traditional enhancement methods. Three-way
listening tests examining signal quality, background noise intrusiveness and
overall quality show the proposed system to be highly robust to noise, performing
significantly better than conventional methods of enhancement in terms of background
noise intrusiveness. However, the proposed method is shown to reduce signal
quality, with overall quality measured to be roughly equivalent to that of the Wiener
filter.
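As an illustration of the harmonic plus noise idea described in this abstract, the following minimal Python sketch rebuilds a single frame from a fundamental frequency, a spectral envelope and a voicing decision. It is not the thesis implementation: the function name `synthesize_hnm_frame`, the zero-phase harmonics, the hard voiced/unvoiced switch, the sample rate and the frame length are all assumptions made for the example, and the MAP estimation of the features from noisy speech is not shown.

```python
import numpy as np


def synthesize_hnm_frame(f0, envelope, voiced, sample_rate=8000,
                         n_samples=160, seed=0):
    """Rebuild one speech frame from acoustic features (illustrative only).

    f0       : fundamental frequency in Hz (used for voiced frames)
    envelope : callable mapping an array of frequencies in Hz to linear
               amplitudes (a stand-in for the estimated spectral envelope)
    voiced   : True for a harmonic frame, False for a noise frame
    """
    t = np.arange(n_samples) / sample_rate

    if voiced:
        # Harmonic part: sinusoids at integer multiples of f0 up to Nyquist,
        # with amplitudes sampled from the spectral envelope.
        # Assumption: all harmonics start at zero phase.
        n_harmonics = int((sample_rate / 2) // f0)
        freqs = f0 * np.arange(1, n_harmonics + 1)
        amps = envelope(freqs)
        phases = 2.0 * np.pi * freqs[:, None] * t[None, :]
        return (amps[:, None] * np.cos(phases)).sum(axis=0)

    # Noise part: white Gaussian noise shaped by the envelope in the
    # frequency domain, then transformed back to the time domain.
    rng = np.random.default_rng(seed)
    noise = rng.standard_normal(n_samples)
    bin_freqs = np.fft.rfftfreq(n_samples, d=1.0 / sample_rate)
    shaped = np.fft.rfft(noise) * envelope(bin_freqs)
    return np.fft.irfft(shaped, n=n_samples)
```

A call such as `synthesize_hnm_frame(120.0, my_envelope, True)` returns one 20 ms frame at the assumed 8 kHz rate, where `my_envelope` is a hypothetical callable supplied by the caller. A fuller system would also overlap-add successive frames and use estimated phases rather than zero phase.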
Low-dimensional representations of neural time-series data with applications to peripheral nerve decoding
Bioelectronic medicines, implanted devices that influence physiological states by peripheral neuromodulation, have promise as a new way of treating diverse conditions from rheumatism to diabetes. We here explore ways of creating nerve-based feedback for the implanted systems to act in a dynamically adapting closed loop.
In a first empirical component, we carried out decoding studies on in vivo recordings of cat and rat bladder afferents. In a low-resolution dataset, we used information theory to select informative frequency bands of the neural activity and then related them to bladder pressure. In a second high-resolution dataset, we analysed the population code for bladder pressure, again using information theory, and proposed an informed decoding approach that promises enhanced robustness and automatic re-calibration by creating a low-dimensional population vector.
Coming from a different direction of more general time-series analysis, we embedded a set of peripheral nerve recordings in a space of main firing characteristics by dimensionality reduction in a high-dimensional feature-space and automatically proposed single efficiently implementable estimators for each identified characteristic. For bioelectronic medicines, this feature-based pre-processing method enables an online signal characterisation of low-resolution data where spike sorting is impossible but simple power-measures discard informative structure. Analyses were based on surrogate data from a self-developed and flexibly adaptable computer model that we made publicly available.
The wider utility of two feature-based analysis methods developed in this work was demonstrated on a variety of datasets from across science and industry. (1) Our feature-based generation of interpretable low-dimensional embeddings for unknown time-series datasets answers a need for simplifying and harvesting the growing body of sequential data that characterises modern science. (2) We propose an additional, supervised pipeline to tailor feature subsets to collections of classification problems. On a literature-standard library of time-series classification tasks, we distilled 22 generically useful estimators and made them easily accessible. Open Access
Models and analysis of vocal emissions for biomedical applications: 5th International Workshop: December 13-15, 2007, Firenze, Italy
The proceedings of the MAVEBA Workshop, which is held on a biennial basis, collect the scientific papers presented as both oral and poster contributions during the conference. The main subjects are the development of theoretical and mechanical models as an aid to the study of the main phonatory dysfunctions, and biomedical engineering methods for the analysis of voice signals and images, as a support to clinical diagnosis and classification of vocal pathologies. The Workshop has the sponsorship of: Ente Cassa Risparmio di Firenze, COST Action 2103, Biomedical Signal Processing and Control Journal (Elsevier Eds.), IEEE Biomedical Engineering Soc. Special Issues of International Journals have been, and will be, published, collecting selected papers from the conference.
Intelligibility of synthetic speech in noise and reverberation
Synthetic speech is a valuable means of output, in a range of application contexts,
for people with visual, cognitive, or other impairments or for situations where other
means are not practicable. Noise and reverberation occur in many of these application
contexts and are known to have devastating effects on the intelligibility of natural
speech, yet very little was known about the effects on synthetic speech based on unit
selection or hidden Markov models.
In this thesis, we put forward an approach for assessing the intelligibility of
synthetic and natural speech in noise, reverberation, or a combination of the two.
The approach uses an experimental methodology consisting of Amazon Mechanical
Turk, Matrix sentences, and noises that approximate the real world, evaluated with
generalized linear mixed models.
The experimental methodologies were assessed against their traditional counterparts
and were found to provide a number of additional benefits, whilst maintaining
equivalent measures of relative performance. Subsequent experiments were carried
out to establish the efficacy of the approach in measuring intelligibility in noise and
then reverberation. Finally, the approach was applied to natural speech and the two
synthetic speech systems in combinations of noise and reverberation.
We have examined and reported on the intelligibility of current synthesis systems in
real-life noises and reverberation using techniques that bridge the gap between the
audiology and speech synthesis communities and using Amazon Mechanical Turk. In
the process, we establish Amazon Mechanical Turk and Matrix sentences as valuable
tools in the assessment of synthetic speech intelligibility.
Discovering Dynamic Visemes
Abstract
This thesis introduces a set of new, dynamic units of visual speech which are learnt
using computer vision and machine learning techniques. Rather than clustering
phoneme labels as is done traditionally, the visible articulators of a speaker are
tracked and automatically segmented into short, visually intuitive speech gestures
based on the dynamics of the articulators. The segmented gestures are clustered
into dynamic visemes, such that movements relating to the same visual function
appear within the same cluster. Speech animation can then be generated on any
facial model by mapping a phoneme sequence to a sequence of dynamic visemes,
and stitching together an example of each viseme in the sequence. Dynamic visemes
model coarticulation and maintain the dynamics of the original speech, so simple
blending at the concatenation boundaries ensures a smooth transition. The efficacy
of dynamic visemes for computer animation is formally evaluated both objectively
and subjectively, and compared with traditional phoneme to static lip-pose interpolation.