A Survey of Available Corpora For Building Data-Driven Dialogue Systems: The Journal Version
During the past decade, several areas of speech and language understanding have witnessed substantial breakthroughs from the use of data-driven models. In the area of dialogue systems, the trend is less obvious, and most practical systems are still built through significant engineering and expert knowledge. Nevertheless, several recent results suggest that data-driven approaches are feasible and quite promising. To facilitate research in this area, we have carried out a wide survey of publicly available datasets suitable for data-driven learning of dialogue systems. We discuss important characteristics of these datasets, how they can be used to learn diverse dialogue strategies, and their other potential uses. We also examine methods for transfer learning between datasets and the use of external knowledge. Finally, we discuss the appropriate choice of evaluation metrics for the learning objective.
Accurate and budget-efficient text, image, and video analysis systems powered by the crowd
Crowdsourcing systems empower individuals and companies to outsource labor-intensive tasks that cannot currently be solved by automated methods and are expensive to tackle by domain experts. Crowdsourcing platforms are traditionally used to provide training labels for supervised machine learning algorithms. Crowdsourced tasks are distributed among internet workers who typically have a range of skills and knowledge, differing previous exposure to the task at hand, and biases that may influence their work. This inhomogeneity of the workforce makes the design of accurate and efficient crowdsourcing systems challenging. This dissertation presents solutions to improve existing crowdsourcing systems in terms of accuracy and efficiency. It explores crowdsourcing tasks in two application areas: political discourse and annotation of biomedical and everyday images. The first part of the dissertation investigates how workers' behavioral factors and their unfamiliarity with data can be leveraged by crowdsourcing systems to control quality. Through studies that involve familiar and unfamiliar image content, the thesis demonstrates the benefit of explicitly accounting for a worker's familiarity with the data when designing annotation systems powered by the crowd. The thesis next presents Crowd-O-Meter, a system that automatically predicts the vulnerability of crowd workers to believing “fake news” in text and video. The second part of the dissertation explores the reversed relationship between machine learning and crowdsourcing by incorporating machine learning techniques for quality control of crowdsourced end products. In particular, it investigates whether machine learning can be used to improve the quality of crowdsourced results while also respecting budget constraints.
The thesis proposes an image analysis system called ICORD that utilizes behavioral cues of the crowd worker, augmented by automated evaluation of image features, to dynamically infer the quality of a worker-drawn outline of a cell in a microscope image. ICORD determines the need to seek additional annotations from other workers in a budget-efficient manner. Next, the thesis proposes a budget-efficient machine learning system that uses fewer workers to analyze easy-to-label data and more workers for data that require extra scrutiny. The system learns a mapping from data features to the number of allocated crowd workers for two case studies: sentiment analysis of Twitter messages and segmentation of biomedical images. Finally, the thesis uncovers the potential for the design of hybrid crowd-algorithm methods by describing an interactive system for cell tracking in time-lapse microscopy videos, based on a prediction model that determines when automated cell tracking algorithms fail and human interaction is needed to ensure accurate tracking.
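The feature-to-worker-count mapping can be illustrated with a toy allocation rule followed by majority-vote aggregation. The difficulty scores, thresholds, and worker counts below are invented for illustration and are not the thesis's learned model:

```python
from collections import Counter

def allocate_workers(difficulty, budget_levels=(1, 3, 5)):
    """Map a per-item difficulty score in [0, 1] to a worker count:
    easy items get one worker, hard items get the full budget."""
    if difficulty < 0.3:
        return budget_levels[0]
    if difficulty < 0.7:
        return budget_levels[1]
    return budget_levels[2]

def aggregate(labels):
    """Majority vote over the labels collected from the allocated workers."""
    return Counter(labels).most_common(1)[0][0]
```

In a learned system, `allocate_workers` would be replaced by a regressor trained on data features, but the budget trade-off it encodes is the same.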
Speech segmentation and speaker diarisation for transcription and translation
This dissertation outlines work related to Speech Segmentation – segmenting an audio
recording into regions of speech and non-speech, and Speaker Diarization – further
segmenting those regions into those pertaining to homogeneous speakers.
Knowing not only what was said but also who said it and when has many useful
applications. As well as providing a richer level of transcription for speech, we will
show how such knowledge can improve Automatic Speech Recognition (ASR) system
performance and can also benefit downstream Natural Language Processing (NLP)
tasks such as machine translation and punctuation restoration.
While segmentation and diarization may appear to be relatively simple tasks to
describe, in practice we find that they are very challenging and, in general, ill-defined
problems. Therefore, we first provide a formalisation of each of the problems
as the sub-division of speech within acoustic space and time. Here, we see that the
task can become very difficult when we want to partition this domain into our target
classes of speakers, whilst avoiding other classes that reside in the same space, such as
phonemes. We present a theoretical framework for describing and discussing the tasks
as well as introducing existing state-of-the-art methods and research.
Current Speaker Diarization systems are notoriously sensitive to hyper-parameters
and lack robustness across datasets. Therefore, we present a method which uses a series
of oracle experiments to expose the limitations of current systems and to identify
the system components to which these limitations can be attributed. We also demonstrate how Diarization
Error Rate (DER), the dominant error metric in the literature, is not a comprehensive
or reliable indicator of overall performance or of error propagation to subsequent
downstream tasks. These results inform our subsequent research.
We find that, as a precursor to Speaker Diarization, the task of Speech Segmentation
is a crucial first step in the system chain. Current methods typically do not account
for the inherent structure of spoken discourse. As such, we explored a novel method
which exploits an utterance-duration prior in order to better model the segment distribution
of speech. We show how this method improves not only segmentation, but also
the performance of subsequent speech recognition, machine translation and speaker
diarization systems.
Typical ASR transcriptions do not include punctuation and the task of enriching
transcriptions with this information is known as ‘punctuation restoration’. The benefit
is not only improved readability but also better compatibility with NLP systems
that expect sentence-like units such as in conventional machine translation. We show
how segmentation and diarization are related tasks that are able to contribute acoustic
information that complements existing linguistically-based punctuation approaches.
There is a growing demand for speech technology applications in the broadcast media
domain. This domain presents many new challenges including diverse noise and
recording conditions. We show that the capacity of existing GMM-HMM based speech
segmentation systems is limited in such scenarios, and we present a Deep Neural Network
(DNN) based method which offers more robust speech segmentation, resulting
in improved speech recognition performance on a television broadcast dataset.
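The frame-classification task at the heart of such a segmenter can be sketched with a toy stand-in: a logistic regression separating speech from non-speech frames using a single synthetic "energy" feature. The thesis's DNN and its acoustic features are, of course, far richer; this only illustrates the per-frame speech/non-speech decision:

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic per-frame "energy": low for non-speech, high for speech.
energy = np.concatenate([rng.normal(0.2, 0.05, 200),   # non-speech frames
                         rng.normal(0.8, 0.10, 200)])  # speech frames
y = np.concatenate([np.zeros(200), np.ones(200)])

# Logistic regression trained by plain gradient descent.
w, b = 0.0, 0.0
for _ in range(5000):
    p = 1.0 / (1.0 + np.exp(-(w * energy + b)))
    w -= 0.5 * np.mean((p - y) * energy)
    b -= 0.5 * np.mean(p - y)

pred = (1.0 / (1.0 + np.exp(-(w * energy + b)))) > 0.5
accuracy = np.mean(pred == y)
```

A real segmenter classifies every frame of a recording this way and then smooths the decisions into contiguous speech regions.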
Ultimately, we show that speech segmentation is an inherently ill-defined
problem whose solution depends strongly on the downstream task
for which it is intended.
Nesting optimization with adversarial games, meta-learning, and deep equilibrium models
Nested optimization, whereby an optimization problem is constrained by the solutions of other optimization problems, has recently seen a surge in its application to Deep Learning.
While the study of such problems started nearly a century ago in the context of market theory, many of the algorithms developed since do not scale to modern Deep Learning applications. In this thesis, I push the understanding and applicability of nested optimization to three machine learning domains: 1) adversarial games, 2) meta-learning and 3) deep equilibrium models. For each domain, I tackle a particular goal.
In 1) I adversarially learn model compression, in the case where training data is not available, in 2) I meta-learn hyperparameters for long optimization processes without introducing greediness, and in 3) I use deep equilibrium models to improve temporal coherence in video landmark detection.
The first part of my thesis deals with casting model compression as an adversarial game. Performing knowledge transfer from a large teacher network to a smaller student is a popular task in deep learning. However, due to growing dataset sizes and stricter privacy regulations, it is increasingly common not to have access to the data that was used to train the teacher. I propose a novel method which trains a student to match the predictions of its teacher without using any data or metadata. This is achieved by nesting the training optimization of the student with that of an adversarial generator, which searches for images on which the student poorly matches the teacher. These images are used to train the student in an online fashion. The student closely approximates its teacher for simple datasets like SVHN, and on CIFAR10 I improve on the state-of-the-art for few-shot distillation (with images per class), despite using no data. Finally, I also propose a metric to quantify the degree of belief matching between teacher and student in the vicinity of decision boundaries, and observe a significantly higher match between the zero-shot student and the teacher than between a student distilled with real data and the teacher.
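The nested generator/student loop can be caricatured in one dimension, with random search standing in for the learned adversarial generator. The tanh models, learning rate, and search budget are all invented for illustration; only the structure (inner loop finds worst-case inputs, outer loop fits the student on them) reflects the method:

```python
import numpy as np

rng = np.random.default_rng(0)
teacher = lambda x: np.tanh(2.0 * x)   # frozen "teacher network"

w = 0.0  # single student parameter: student(x) = tanh(w * x)
for _ in range(500):
    # "Generator" step: among random candidates, pick the input where the
    # student disagrees most with the teacher (stand-in for a learned generator).
    cands = rng.uniform(-3.0, 3.0, size=32)
    x = cands[np.argmax((np.tanh(w * cands) - teacher(cands)) ** 2)]
    # Student step: one gradient step matching the teacher at that input.
    err = np.tanh(w * x) - teacher(x)
    w -= 0.2 * 2.0 * err * (1.0 - np.tanh(w * x) ** 2) * x

# After training, the student matches the teacher across the input range
# despite never seeing "real" data.
grid = np.linspace(-3.0, 3.0, 61)
max_gap = np.max(np.abs(np.tanh(w * grid) - teacher(grid)))
```

The adversarial choice of `x` is what drives the student toward the teacher everywhere, not just where random inputs happen to land.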
The second part of my thesis deals with meta-learning hyperparameters in the case when the nested optimization to be differentiated is itself solved by many gradient steps. Gradient-based hyperparameter optimization has earned widespread popularity in the context of few-shot meta-learning, but remains broadly impractical for tasks with long horizons (many gradient steps), due to memory scaling and gradient degradation issues. A common workaround is to learn hyperparameters online, but this introduces greediness which comes with a significant performance drop. I propose forward-mode differentiation with sharing (FDS), a simple and efficient algorithm which tackles memory scaling issues with forward-mode differentiation, and gradient degradation issues by sharing hyperparameters that are contiguous in time. I provide theoretical guarantees about the noise reduction properties of my algorithm, and demonstrate its efficiency empirically by differentiating through gradient steps of unrolled optimization. I consider large hyperparameter search ranges on CIFAR-10 where I significantly outperform greedy gradient-based alternatives, while achieving speedups compared to the state-of-the-art black-box methods.
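The forward-mode idea (carrying the tangent d(theta)/d(eta) alongside theta, so memory stays constant in the horizon length rather than growing as it would with reverse-mode unrolling) can be sketched for a single learning-rate hyperparameter on a quadratic loss. FDS's hyperparameter sharing and noise-reduction machinery are omitted; this shows only the forward accumulation:

```python
def loss_and_hypergrad(theta0, eta, T):
    """Unroll T steps of SGD on L(theta) = theta^2 / 2 and return the final
    loss together with its exact derivative w.r.t. the learning rate eta,
    computed in forward mode with O(1) memory."""
    theta, d = theta0, 0.0           # d = d(theta)/d(eta), the tangent
    for _ in range(T):
        grad = theta                  # dL/dtheta for L = theta^2 / 2
        # Tangent of the update theta -> theta - eta * grad,
        # where d(grad)/d(eta) = d because grad == theta here.
        d = d - (grad + eta * d)
        theta = theta - eta * grad
    # Chain rule at the end: dL/deta = theta_T * d(theta_T)/d(eta).
    return 0.5 * theta ** 2, theta * d
```

For this quadratic, theta_T = (1 - eta)^T * theta0, so the hypergradient can be checked against the closed form exactly.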
The third part of my thesis deals with converting deep equilibrium models to a form of nested optimization in order to perform robust video landmark detection. Cascaded computation, whereby predictions are recurrently refined over several stages, has been a persistent theme throughout the development of landmark detection models. I show that the recently proposed deep equilibrium model (DEQ) can be naturally adapted to this form of computation, given appropriate regularization. My landmark model achieves state-of-the-art performance on the challenging WFLW facial landmark dataset, reaching normalized mean error with fewer parameters and a training memory cost of in the number of recurrent modules. Furthermore, I show that DEQs are particularly suited for landmark detection in videos. In this setting, it is typical to train on still images due to the lack of labeled videos. This can lead to a “flickering” effect at inference time on video, whereby a model can rapidly oscillate between different plausible solutions across consecutive frames. I show that the DEQ root-solving problem can be turned into a constrained optimization problem in a way that emulates recurrence at inference time, despite not having access to temporal data at training time. I call this “Recurrence without Recurrence”, and demonstrate that it helps reduce landmark flicker by introducing a new metric, and contributing a new facial landmark video dataset targeting landmark uncertainty. On the hard subset of this new dataset, made up of videos, my model improves the accuracy and temporal coherence by and respectively, compared to the strongest previously published model using a hand-tuned conventional filter.
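The core DEQ computation, reusing one layer's weights until its output reaches a fixed point, can be sketched as below. The toy sizes and the small weight scale (which guarantees a contraction so the iteration converges) are illustrative; the landmark model itself is far larger and uses a proper root solver:

```python
import numpy as np

def deq_forward(W, x, iters=100):
    """Solve z* = tanh(W z* + x) by fixed-point iteration: the same weights
    are reused at every 'depth' instead of stacking distinct layers."""
    z = np.zeros_like(x)
    for _ in range(iters):
        z = np.tanh(W @ z + x)
    return z

rng = np.random.default_rng(0)
W = 0.2 * rng.standard_normal((4, 4))  # small scale -> contraction mapping
x = rng.standard_normal(4)
z_star = deq_forward(W, x)
```

Because the output is defined implicitly by the fixed point rather than by a fixed number of layers, inference-time constraints (such as the temporal coupling used for video) can be added to the root-solving problem without retraining.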
Automatic recognition of multiparty human interactions using dynamic Bayesian networks
Relating statistical machine learning approaches to the automatic analysis of multiparty
communicative events, such as meetings, is an ambitious research area. We
have investigated automatic meeting segmentation both in terms of “Meeting Actions”
and “Dialogue Acts”. Dialogue acts model the discourse structure at a fine-grained
level, highlighting individual speaker intentions. Group meeting actions describe
the same process at a coarse level, highlighting interactions between different
meeting participants and showing overall group intentions.
A framework based on probabilistic graphical models such as dynamic Bayesian
networks (DBNs) has been investigated for both tasks. Our first set of experiments
is concerned with the segmentation and structuring of meetings (recorded using
multiple cameras and microphones) into sequences of group meeting actions such
as monologue, discussion and presentation. We outline four families of multimodal
features based on speaker turns, lexical transcription, prosody, and visual motion
that are extracted from the raw audio and video recordings. We relate these low-level
multimodal features to complex group behaviours, proposing a multistream modelling
framework based on dynamic Bayesian networks. Later experiments are
concerned with the automatic recognition of Dialogue Acts (DAs) in multiparty
conversational speech. We present a joint generative approach based on a switching
DBN for DA recognition in which segmentation and classification of DAs are
carried out in parallel. This approach models a set of features, related to lexical
content and prosody, and incorporates a weighted interpolated factored language
model. In conjunction with this joint generative model, we have also investigated
the use of a discriminative approach, based on conditional random fields, to perform
a reclassification of the segmented DAs.
The DBN based approach yielded significant improvements when applied both
to the meeting action and the dialogue act recognition task. On both tasks, the DBN
framework provided an effective factorisation of the state-space and a flexible infrastructure
able to integrate a heterogeneous set of resources such as continuous
and discrete multimodal features, and statistical language models. Although our
experiments have principally targeted multiparty meetings, the features, models,
and methodologies developed in this thesis can be employed for a wide range
of applications. Moreover, both group meeting actions and DAs offer valuable insights
into the current conversational context, providing useful cues and features
for several related research areas such as speaker addressing and focus-of-attention
modelling, automatic speech recognition and understanding, and topic and decision detection.
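The generative decoding idea can be caricatured with a two-state HMM, the simplest possible DBN: hidden meeting actions emit an observed active-speaker count at each time step, and Viterbi decoding recovers the action sequence. All probabilities below are invented for illustration; the thesis's switching DBNs factor the state space far more richly and use multimodal features:

```python
import numpy as np

states = ["monologue", "discussion"]
trans = np.log([[0.9, 0.1],            # meeting actions persist over time
                [0.1, 0.9]])
# P(active speaker count 1..3 | state), columns indexed 0..2.
emit = np.log([[0.8, 0.15, 0.05],      # monologue: usually one speaker
               [0.2, 0.40, 0.40]])     # discussion: often several
prior = np.log([0.5, 0.5])

def viterbi(obs):
    """Most likely meeting-action sequence for observed speaker counts."""
    v = prior + emit[:, obs[0]]
    back = []
    for o in obs[1:]:
        scores = v[:, None] + trans    # scores[i, j]: best path ending i -> j
        back.append(scores.argmax(axis=0))
        v = scores.max(axis=0) + emit[:, o]
    path = [int(v.argmax())]
    for b in reversed(back):
        path.append(int(b[path[-1]]))
    path.reverse()
    return [states[s] for s in path]
```

The transition prior is what segments the stream into coherent action spans instead of flipping state on every noisy observation.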
From Word Models to World Models: Translating from Natural Language to the Probabilistic Language of Thought
How does language inform our downstream thinking? In particular, how do
humans make meaning from language, and how can we leverage a theory of
linguistic meaning to build machines that think in more human-like ways? In
this paper, we propose rational meaning construction, a computational
framework for language-informed thinking that combines neural models of
language with probabilistic models for rational inference. We frame linguistic
meaning as a context-sensitive mapping from natural language into a
probabilistic language of thought (PLoT), a general-purpose symbolic
substrate for probabilistic, generative world modeling. Our architecture
integrates two powerful computational tools that have not previously come
together: we model thinking with probabilistic programs, an expressive
representation for flexible commonsense reasoning; and we model meaning
construction with large language models (LLMs), which support
broad-coverage translation from natural language utterances to code expressions
in a probabilistic programming language. We illustrate our framework in action
through examples covering four core domains from cognitive science:
probabilistic reasoning, logical and relational reasoning, visual and physical
reasoning, and social reasoning about agents and their plans. In each, we show
that LLMs can generate context-sensitive translations that capture
pragmatically-appropriate linguistic meanings, while Bayesian inference with
the generated programs supports coherent and robust commonsense reasoning. We
extend our framework to integrate cognitively-motivated symbolic modules to
provide a unified commonsense thinking interface from language. Finally, we
explore how language can drive the construction of world models themselves.
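As a toy instance of the pipeline, suppose an LLM had translated the utterance "I flipped the coin three times and saw heads every time; is it the trick coin?" into the small generative program below; Bayesian inference (here, rejection sampling) then answers the query. The prior, bias values, and query are all invented for illustration and stand in for a real probabilistic programming language:

```python
import random

random.seed(0)

def model():
    """Generative 'world model': the coin is fair or heads-biased, 50/50 prior."""
    bias = random.choice([0.5, 0.9])
    flips = [random.random() < bias for _ in range(3)]
    return bias, flips

def posterior_trick_coin(observed_heads=3, samples=20000):
    """Rejection sampling: keep only runs consistent with the observation,
    then ask how often the trick (0.9-biased) coin generated them."""
    kept = [bias for bias, flips in (model() for _ in range(samples))
            if sum(flips) == observed_heads]
    return sum(b == 0.9 for b in kept) / len(kept)

p = posterior_trick_coin()  # analytically: 0.729 / (0.729 + 0.125) ~ 0.854
```

The division of labor mirrors the paper's framework: translation produces the program, and generic probabilistic inference, not the language model, produces the coherent answer.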
Video coding for compression and content-based functionality
The lifetime of this research project has seen two dramatic developments in the area of digital video coding. The first has been the progress of compression research leading to a factor of two improvement over existing standards, much wider deployment possibilities and the development of the new international ITU-T Recommendation H.263. The second has been a radical change in the approach to video content production with the introduction of the content-based coding concept and the addition of scene composition information to the encoded bit-stream. Content-based coding is central to the latest international standards efforts from the ISO/IEC MPEG working group.
This thesis reports on extensions to existing compression techniques exploiting a priori knowledge about scene content. Existing standardised block-based compression techniques were extended with work on arithmetic entropy coding and intra-block prediction; these form part of the H.263 and MPEG-4 specifications respectively. Object-based coding techniques were developed within a collaborative simulation model, known as SIMOC, then extended with ideas on grid motion vector modelling and vector accuracy confidence estimation. An improved confidence measure for encouraging motion smoothness is proposed.
Object-based coding ideas, together with those from other model- and layer-based coding approaches, influenced the development of content-based coding within MPEG-4. This standard made considerable progress in the newly adopted field of content-based video coding, defining normative techniques for arbitrary shape and texture coding. The means of generating this information for the content to be coded, the analysis problem, was intentionally not specified. Further research work in this area concentrated on video segmentation and analysis techniques to exploit the benefits of content-based coding for generic frame-based video. The work reported here introduces the use of a clustering algorithm on raw data features to provide an initial segmentation of video data and subsequent tracking of those image regions through video sequences. Collaborative video analysis frameworks from COST 211quat and MPEG-4, combining results from many other segmentation schemes, are also introduced.
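A minimal version of the clustering idea is k-means on raw per-pixel features (intensity plus position) applied to a toy frame; the thesis's actual features, clustering algorithm, and region tracking are more elaborate, so this only illustrates how raw features yield an initial region segmentation:

```python
import numpy as np

def kmeans(X, iters=20):
    """Two-cluster k-means, deterministically seeded on the darkest and
    brightest pixels so the toy example needs no random initialisation."""
    centers = X[[int(np.argmin(X[:, 0])), int(np.argmax(X[:, 0]))]]
    for _ in range(iters):
        labels = np.argmin(((X[:, None, :] - centers) ** 2).sum(-1), axis=1)
        centers = np.array([X[labels == j].mean(axis=0) for j in range(2)])
    return labels

# Toy 8x8 frame: a bright object on a dark background.
frame = np.zeros((8, 8))
frame[2:6, 2:6] = 1.0
ys, xs = np.mgrid[0:8, 0:8]
# Per-pixel features: intensity (weighted up) plus normalised position.
X = np.stack([frame.ravel() * 4.0, ys.ravel() / 8.0, xs.ravel() / 8.0], axis=1)
labels = kmeans(X).reshape(8, 8)
```

The recovered label map separates the object region from the background; tracking then follows such regions from frame to frame.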