Calibrated Prediction Intervals for Neural Network Regressors
Ongoing developments in neural network models are continually advancing the
state of the art in terms of system accuracy. However, the predicted labels
should not be regarded as the only core output; also important is a
well-calibrated estimate of the prediction uncertainty. Such estimates and
their calibration are critical in many practical applications. Despite their
accuracy advantage, contemporary neural networks are generally poorly
calibrated and as such do not produce reliable output probability estimates.
Further, while post-processing
calibration solutions can be found in the relevant literature, these tend to be
for systems performing classification. In this regard, we herein present two
novel methods for acquiring calibrated prediction intervals for neural network
regressors: empirical calibration and temperature scaling. In experiments using
different regression tasks from the audio and computer vision domains, we find
that both our proposed methods are indeed capable of producing calibrated
prediction intervals for neural network regressors with any desired confidence
level, a finding that is consistent across all datasets and neural network
architectures we experimented with. We also derive a practical recommendation
for producing more accurate calibrated prediction intervals. The source code
implementing our proposed methods for computing calibrated prediction intervals
is publicly available.
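The abstract does not spell out how empirical calibration works, so the following is only a minimal sketch of one common way to calibrate prediction intervals on held-out data, not the authors' released implementation. It assumes the regressor outputs a mean and a standard deviation per example; the function names `fit_interval_scale`, `prediction_interval`, and `coverage` are hypothetical.

```python
import numpy as np


def fit_interval_scale(y_cal, mu_cal, sigma_cal, confidence):
    """Find a multiplier q so that mu +/- q * sigma covers roughly
    `confidence` of the held-out calibration targets.

    Assumption (not from the abstract): the regressor predicts a mean
    `mu` and a standard deviation `sigma` for every example.
    """
    y_cal = np.asarray(y_cal, dtype=float)
    mu_cal = np.asarray(mu_cal, dtype=float)
    sigma_cal = np.asarray(sigma_cal, dtype=float)
    # Normalised absolute residuals on the calibration set.
    z = np.abs(y_cal - mu_cal) / sigma_cal
    # The empirical quantile of z is the half-width, in units of sigma.
    return float(np.quantile(z, confidence))


def prediction_interval(mu, sigma, q):
    """Return (lower, upper) bounds of the scaled interval."""
    mu = np.asarray(mu, dtype=float)
    sigma = np.asarray(sigma, dtype=float)
    return mu - q * sigma, mu + q * sigma


def coverage(y, lower, upper):
    """Fraction of targets that fall inside their interval."""
    y = np.asarray(y, dtype=float)
    return float(np.mean((y >= lower) & (y <= upper)))


if __name__ == "__main__":
    # Synthetic, overconfident regressor: the true noise std is 2,
    # but the model reports sigma = 1.
    rng = np.random.default_rng(0)
    mu = rng.normal(size=1000)
    y = mu + 2.0 * rng.normal(size=1000)
    sigma = np.ones(1000)
    q = fit_interval_scale(y, mu, sigma, confidence=0.9)
    lower, upper = prediction_interval(mu, sigma, q)
    print("multiplier:", q, "coverage:", coverage(y, lower, upper))
```

In this sketch the multiplier is fitted once per confidence level on a calibration set and then reused on new predictions; whether the intervals remain calibrated depends on the calibration data being representative of the test data.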
Speech Recognition
Chapters in the first part of the book cover all the essential speech processing techniques for building robust, automatic speech recognition systems: the representation of speech signals and the methods for speech feature extraction, acoustic and language modeling, efficient algorithms for searching the hypothesis space, and multimodal approaches to speech recognition. The last part of the book is devoted to other speech processing applications that can use the information from automatic speech recognition for speaker identification and tracking, for prosody modeling in emotion-detection systems, and in other speech processing applications that are able to operate in real-world environments, like mobile communication services and smart homes.
The development of a new rating scale for the perceptual assessment of tracheoesophageal voice quality outcome following total laryngectomy
PhD Thesis
Perceptual assessment of voice in people with surgical voice restoration (SVR) is essential to evaluate surgical and other interventions aimed at delivering optimal voice quality. Currently there are no tools to measure this that do not have issues of validity and reliability.
This work describes the development and trialling of investigatory versions of three scales to address this situation: a) the Sunderland Tracheoesophageal Perceptual Scale (SToPS) for professional raters, b) the Naïve Rater Scale for non-specialist raters and c) the Patient and Carer Scale.
In the final testing of the pilot version, 55 speakers using tracheoesophageal voice were evaluated by twelve Speech and Language Therapists (SLTs) and ten Ear, Nose and Throat (ENT) surgeons, grouped by whether or not they were experienced at assessing voice.
Ten naïve raters assessed the voice stimuli within a test-retest design. Forty tracheoesophageal speakers and thirty-seven carers attended an interview to rate their own or their relative’s voice. Inter-rater agreement was then calculated between SLT, ENT, naïve, patient and carer groups with weighted kappa coefficients.
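For readers unfamiliar with the statistic, a weighted kappa coefficient between two raters on an ordinal scale can be sketched as follows. This is an illustration only: the abstract does not say which weighting scheme the thesis used, so both the linear and quadratic options here are assumptions, and the function name `weighted_kappa` is hypothetical. Ratings are assumed to be integer category indices from 0 to `n_categories - 1`.

```python
import numpy as np


def weighted_kappa(ratings_a, ratings_b, n_categories, weights="linear"):
    """Cohen's weighted kappa for two raters scoring the same items.

    Ratings must be integers in [0, n_categories - 1] and n_categories
    must be at least 2. `weights` selects linear or quadratic
    disagreement weights; the choice is an assumption, not taken from
    the thesis.
    """
    a = np.asarray(ratings_a, dtype=int)
    b = np.asarray(ratings_b, dtype=int)
    if a.shape != b.shape:
        raise ValueError("both raters must score the same items")

    # Observed joint proportions of (rater A category, rater B category).
    observed = np.zeros((n_categories, n_categories))
    np.add.at(observed, (a, b), 1.0)
    observed /= observed.sum()

    # Proportions expected by chance from the two raters' marginals.
    expected = np.outer(observed.sum(axis=1), observed.sum(axis=0))

    # Disagreement weights grow with the distance between categories.
    idx = np.arange(n_categories)
    distance = np.abs(idx[:, None] - idx[None, :]) / (n_categories - 1)
    if weights == "quadratic":
        distance = distance ** 2
    elif weights != "linear":
        raise ValueError("weights must be 'linear' or 'quadratic'")

    expected_disagreement = (distance * expected).sum()
    if expected_disagreement == 0:
        raise ValueError("kappa is undefined when both raters use one category")
    return 1.0 - (distance * observed).sum() / expected_disagreement
```

A value of 1 indicates perfect agreement and 0 indicates agreement no better than chance; the weighting means that near misses on the ordinal scale are penalised less than distant ones.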
Strength of agreement values (Landis and Koch 1977) were compared to profession and expertise. Expert SLTs achieved “good” agreement for nine of fourteen parameters. Naïve judges attained “good” levels of inter- and intra-rater agreement for the parameters Overall Grade and Social Acceptability. The greatest inter-group consensus was for patients and carers, with “good” agreement for Intelligibility, Volume and Wetness. The only other “good” agreement was between naïve/ENT and naïve/SLT groups for Overall Grade.
The scales are ready for clinical use with the proviso that future work will determine whether it is possible to enhance agreement so less experienced judges can achieve “good” levels of agreement for more parameters and examine which perceptual parameters might be more prominent or vital for outcomes for different groups.
City Hospitals Sunderland NHS Foundation Trust
Towards uncertainty-aware and label-efficient machine learning of human expressive behaviour
The ability to recognise emotional expressions from non-verbal behaviour plays a key role in human-human interaction. Endowing machines with the same ability is critical to enriching human-computer interaction. Despite receiving widespread attention so far, human-level automatic recognition of affective expressions is still an elusive task for machines. Towards improving the current state of machine learning methods applied to affect recognition, this thesis identifies two challenges: label ambiguity and label scarcity.
Firstly, this thesis notes that it is difficult to establish a clear one-to-one mapping between inputs (face images or speech segments) and their target emotion labels, considering that emotion perception is inherently subjective. As a result, the problem of label ambiguity naturally arises in the manual annotations of affect. Ignoring this fundamental problem, most existing affect recognition methods implicitly assume a one-to-one input-target mapping and use deterministic function learning. In contrast, this thesis proposes to learn non-deterministic functions based on uncertainty-aware probabilistic models, as they can naturally accommodate the one-to-many input-target mapping. Besides improving the affect recognition performance, the proposed uncertainty-aware models in this thesis demonstrate three important applications: adaptive multimodal affect fusion, human-in-the-loop learning of affect, and improved performance on downstream behavioural analysis tasks like personality trait estimation.
Secondly, this thesis aims to address the challenge of scarcity of affect-labelled datasets, caused by the cumbersome and time-consuming nature of the affect annotation process. To this end, this thesis notes that audio and visual feature encoders used in the existing models are label-inefficient, i.e. learning them requires large amounts of labelled training data. As a solution, this thesis proposes to pre-train the feature encoders using unlabelled data to make them more label-efficient, i.e. able to achieve good emotion recognition performance with as few labelled training examples as possible. A novel self-supervised pre-training method is proposed in this thesis by posing hand-engineered emotion features as task-specific representation learning priors. By leveraging large amounts of unlabelled audiovisual data, the proposed self-supervised pre-training method demonstrates much better label efficiency compared to the commonly employed pre-training methods.
A speaker classification framework for non-intrusive user modeling : speech-based personalization of in-car services
Speaker Classification, i.e. the automatic detection of certain characteristics of a person based on his or her voice, has a variety of applications in modern computer technology and artificial intelligence: as a non-intrusive source for user modeling, it can be employed for personalization of human-machine interfaces in numerous domains. This dissertation presents a principled approach to the design of a novel Speaker Classification system for automatic age and gender recognition which meets these demands. Based on literature studies, methods and concepts dealing with the underlying pattern recognition task are developed. The final system consists of an incremental GMM-SVM supervector architecture with several optimizations. An extensive data-driven experiment series explores the parameter space and serves as evaluation of the component. Further experiments investigate the language-independence of the approach. As an essential part of this thesis, a framework is developed that implements all tasks associated with the design and evaluation of Speaker Classification in an integrated development environment that is able to generate efficient runtime modules for multiple platforms. Applications from the automotive field and other domains demonstrate the practical benefit of the technology for personalization, e.g. by increasing local danger warning lead time for elderly drivers.
Ubiquitous Technologies for Emotion Recognition
Emotions play a very important role in how we think and behave. As such, the emotions we feel every day can compel us to act and influence the decisions and plans we make about our lives. Being able to measure, analyze, and better comprehend how or why our emotions may change is thus of much relevance to understand human behavior and its consequences. Despite the great efforts made in the past in the study of human emotions, it is only now, with the advent of wearable, mobile, and ubiquitous technologies, that we can aim to sense and recognize emotions, continuously and in real time. This book brings together the latest experiences, findings, and developments regarding ubiquitous sensing, modeling, and the recognition of human emotions