Student-teacher training with diverse decision tree ensembles
Student-teacher training allows a large teacher model or ensemble of teachers to be compressed into a single student model, for the purpose of efficient decoding. However, current approaches in automatic speech recognition assume that the state clusters, often defined by Phonetic Decision Trees (PDT), are the same across all models. This limits the diversity that can be captured within the ensemble, and also the flexibility when selecting the complexity of the student model output. This paper examines an extension to student-teacher training that allows for the possibility of having different PDTs between teachers, and also for the student to have a different PDT from the teacher. The proposal is to train the student to emulate the logical context-dependent state posteriors of the teacher, instead of the frame posteriors. This leads to a method of mapping frame posteriors from one PDT to another. This approach is evaluated on three speech recognition tasks: the Tok Pisin and Javanese low-resource conversational telephone speech tasks from the IARPA Babel programme, and the HUB4 English broadcast news task.
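The mapping between decision trees lends itself to a short illustration. Below is a minimal numpy sketch of one plausible reading of the idea, not the paper's exact recipe: each teacher cluster's posterior mass is spread over the logical context-dependent states it covers, in proportion to an assumed state prior, then re-pooled under the student's tree. All sizes, mappings, and the uniform prior are hypothetical.

```python
import numpy as np

# Hypothetical setup: 6 logical context-dependent states, teacher PDT with
# 4 clusters, student PDT with 3 clusters. Each logical state belongs to
# exactly one cluster per tree.
logical_to_teacher = np.array([0, 0, 1, 2, 2, 3])  # teacher cluster per logical state
logical_to_student = np.array([0, 1, 1, 2, 2, 2])  # student cluster per logical state
logical_prior = np.full(6, 1.0 / 6)                # assumed prior over logical states

def map_posteriors(teacher_post, l2t, l2s, prior, n_student_clusters):
    """Redistribute one frame's teacher-cluster posteriors over the logical
    context-dependent states (prior-weighted within each cluster), then pool
    the resulting logical-state posteriors under the student's tree."""
    cluster_prior = np.bincount(l2t, weights=prior)               # mass per teacher cluster
    logical_post = teacher_post[l2t] * prior / cluster_prior[l2t]  # P(logical state | frame)
    return np.bincount(l2s, weights=logical_post, minlength=n_student_clusters)

teacher_post = np.array([0.1, 0.2, 0.6, 0.1])  # one frame, 4 teacher clusters
student_post = map_posteriors(teacher_post, logical_to_teacher,
                              logical_to_student, logical_prior, 3)
print(student_post, student_post.sum())  # a valid distribution over 3 student clusters
```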
General sequence teacher-student learning
In automatic speech recognition, performance gains can often be obtained by combining an ensemble of multiple models. However, this can be computationally expensive when performing recognition. Teacher-student learning alleviates this cost by training a single student model to emulate the combined ensemble behaviour. Only this student needs to be used for recognition. Previously investigated teacher-student criteria often limit the forms of diversity allowed in the ensemble, and only propagate information from the teachers to the student at the frame level. This paper addresses both of these issues by examining teacher-student learning within a sequence-level framework, and assessing the flexibility that these approaches offer. Various sequence-level teacher-student criteria are examined in this work, to propagate sequence posterior information. A training criterion based on the KL-divergence between context-dependent state sequence posteriors is proposed that allows for a diversity of state cluster sets to be present in the ensemble. This criterion is shown to be an upper bound on a more general KL-divergence between word sequence posteriors, which places even fewer restrictions on the ensemble diversity, but whose gradient can be expensive to compute. These methods are evaluated on the AMI meeting transcription and MGB-3 television broadcast audio tasks.
This research was partly funded under the ALTA Institute, University of Cambridge. Thanks to Cambridge Assessment English, University of Cambridge, for supporting this research.
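The criterion described above can be written compactly. As a hedged sketch (the notation is assumed, not taken from the paper), with $P_{\mathcal{T}}$ and $P_{\mathcal{S}}$ denoting teacher-ensemble and student posteriors, $\mathbf{s}$ ranging over context-dependent state sequences, and $\omega$ over word sequences:

```latex
\mathcal{F}_{\text{state}}
  = \mathrm{KL}\!\left( P_{\mathcal{T}}(\mathbf{s} \mid \mathbf{O})
      \,\middle\|\, P_{\mathcal{S}}(\mathbf{s} \mid \mathbf{O}) \right)
  = \sum_{\mathbf{s}} P_{\mathcal{T}}(\mathbf{s} \mid \mathbf{O})
      \log \frac{P_{\mathcal{T}}(\mathbf{s} \mid \mathbf{O})}
                {P_{\mathcal{S}}(\mathbf{s} \mid \mathbf{O})},
\qquad
\mathrm{KL}\!\left( P_{\mathcal{T}}(\omega \mid \mathbf{O})
    \,\middle\|\, P_{\mathcal{S}}(\omega \mid \mathbf{O}) \right)
  \le \mathcal{F}_{\text{state}}.
```

The inequality is consistent with the abstract's claim: word sequence posteriors are marginals of state sequence posteriors, so by the log-sum inequality the word-level divergence cannot exceed the state-level one.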
Born Again Neural Networks
Knowledge distillation (KD) consists of transferring knowledge from one machine learning model (the teacher) to another (the student). Commonly, the teacher is a high-capacity model with formidable performance, while the student is more compact. By transferring knowledge, one hopes to benefit from the student's compactness. We study KD from a new perspective: rather than compressing models, we train students parameterized identically to their teachers. Surprisingly, these Born-Again Networks (BANs) outperform their teachers significantly, both on computer vision and language modeling tasks. Our experiments with BANs based on DenseNets demonstrate state-of-the-art performance on the CIFAR-10 (3.5%) and CIFAR-100 (15.5%) datasets, by validation error. Additional experiments explore two distillation objectives: (i) Confidence-Weighted by Teacher Max (CWTM) and (ii) Dark Knowledge with Permuted Predictions (DKPP). Both methods elucidate the essential components of KD, demonstrating a role of the teacher outputs on both predicted and non-predicted classes. We present experiments with students of various capacities, focusing on the under-explored case where students overpower teachers. Our experiments show significant advantages from transferring knowledge between DenseNets and ResNets in either direction.
Comment: Published @ICML 2018
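For concreteness, here is a minimal PyTorch sketch of the standard KD objective, plus one plausible reading of the CWTM weighting described above. The temperature `T`, the mixing weight `alpha`, and the exact CWTM normalisation are assumptions, not details from the paper.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Standard KD: cross-entropy on hard labels plus temperature-softened
    KL to the teacher's distribution. For BANs, student and teacher share
    the same architecture; only the parameters differ."""
    hard = F.cross_entropy(student_logits, labels)
    soft = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                    F.softmax(teacher_logits / T, dim=-1),
                    reduction="batchmean") * (T * T)
    return alpha * hard + (1 - alpha) * soft

def cwtm_loss(student_logits, teacher_logits, labels):
    """One reading of CWTM: weight each sample's cross-entropy by the
    teacher's maximum predicted probability, normalised over the batch."""
    w = F.softmax(teacher_logits, dim=-1).max(dim=-1).values
    w = w / w.sum()
    ce = F.cross_entropy(student_logits, labels, reduction="none")
    return (w * ce).sum()
```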
Ensemble generation and compression for speech recognition
For many tasks in machine learning, performance gains can often be obtained by combining together an ensemble of multiple systems. In Automatic Speech Recognition (ASR), a range of approaches can be used to combine an ensemble when performing recognition. However, many of these have computational costs that scale linearly with the ensemble size. One method to address this is teacher-student learning, which compresses the ensemble into a single student. The student is trained to emulate the combined ensemble, and only the student needs to be used when performing recognition. This thesis investigates both methods for ensemble generation and methods for ensemble compression.
The first contribution of this thesis is to explore approaches of generating multiple systems for an ensemble. The combined ensemble performance depends on both the accuracy of the individual members of the ensemble, as well as the diversity between their behaviours. The structured nature of speech allows for many ways that systems can be made different from each other. The experiments suggest that significant combination gains can be obtained by combining systems with different acoustic models, sets of state clusters, and sets of sub-word units. When performing recognition, these ensembles can be combined at the hypothesis and frame levels. However, these combination methods can be computationally expensive, as data is processed by multiple systems.
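Frame-level combination, as referenced above, has a simple form when all systems share one state-cluster set. The following is a minimal numpy sketch under that assumption, with uniform interpolation weights when none are given; it also makes clear why differing state-cluster sets break this scheme.

```python
import numpy as np

def combine_frame_level(posteriors, weights=None):
    """Frame-level ensemble combination: linearly interpolate the per-frame
    state-cluster posteriors of all members. Requires every system to use
    the same set of state clusters (same last dimension)."""
    posteriors = np.stack(posteriors)        # (n_systems, n_frames, n_states)
    if weights is None:
        weights = np.full(len(posteriors), 1.0 / len(posteriors))
    return np.tensordot(weights, posteriors, axes=1)  # (n_frames, n_states)

# Toy usage: two systems, 2 frames, 3 shared state clusters.
p1 = np.array([[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]])
p2 = np.array([[0.5, 0.3, 0.2], [0.2, 0.6, 0.2]])
print(combine_frame_level([p1, p2]))
```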
This thesis also considers approaches to compress an ensemble, and reduce the computational cost when performing recognition. Teacher-student learning is one such method. In standard teacher-student learning, information about the per-frame state cluster posteriors is propagated from the teacher ensemble to the student, to train the student to emulate the ensemble. However, this has two limitations. First, it requires that the teachers and student all use the same set of state clusters. This limits the allowed forms of diversities that the ensemble can have. Second, ASR is a sequence modelling task, and the frame-level posteriors that are propagated may not effectively convey all information about the sequence-level behaviours of the teachers. This thesis addresses both of these limitations.
The second contribution of this thesis is to address the first limitation, and allow for different sets of state clusters between systems. The proposed method maps the state cluster posteriors from the teachers' sets of state clusters to that of the student. The map is derived by considering a distance measure between posteriors of unclustered logical context-dependent states, instead of the usual state cluster. The experiments suggest that this proposed method can allow a student to effectively learn from an ensemble that has a diversity of state cluster sets. However, the experiments also suggest that the student may need to have a large set of state clusters to effectively emulate this ensemble. This thesis proposes to use a student with a multi-task topology, with an output layer for each of the different sets of state clusters. This can capture the phonetic resolution of having multiple sets of state clusters, while having fewer parameters than a student with a single large output layer.
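A multi-task student of the kind described could look like the following PyTorch sketch: a shared trunk with one output layer per state-cluster set. The layer sizes and the three cluster-set sizes are hypothetical.

```python
import torch.nn as nn

class MultiTaskStudent(nn.Module):
    """Sketch of a multi-task student topology: a shared acoustic trunk
    feeding one output head per set of state clusters."""
    def __init__(self, n_in=40, n_hidden=512, cluster_set_sizes=(3000, 4000, 5000)):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Linear(n_in, n_hidden), nn.ReLU(),
            nn.Linear(n_hidden, n_hidden), nn.ReLU(),
        )
        self.heads = nn.ModuleList(nn.Linear(n_hidden, n) for n in cluster_set_sizes)

    def forward(self, x):
        h = self.trunk(x)
        return [head(h) for head in self.heads]  # one set of logits per PDT
```

The shared trunk is what keeps the parameter count below that of a single output layer covering the union of all cluster sets, while each head still captures its own tree's phonetic resolution.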
The third contribution of this thesis is to address the second limitation of standard teacher-student learning, that only frame-level information is propagated to emulate the ensemble behaviour for the sequence modelling ASR task. This thesis proposes to generalise teacher-student learning to the sequence level, and propagate sequence posterior information. The proposed methods can also allow for many forms of ensemble diversities. The experiments suggest that by using these sequence-level methods, a student can learn to emulate the ensemble better. Recently, the lattice-free method has been proposed to train a system directly toward a sequence discriminative criterion. Ensembles of these systems can exhibit highly diverse behaviours, because the systems are not biased toward any cross-entropy forced alignments. It is difficult to apply standard frame-level teacher-student learning with these lattice-free systems, as they are often not designed to produce state cluster posteriors. Sequence-level teacher-student learning operates directly on the sequence posteriors, and can therefore be used directly with these lattice-free systems.
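As a rough illustration of sequence-level propagation, the sketch below approximates the sequence posterior KL using an n-best list of competing hypotheses rather than the lattices or lattice-free computations used in the thesis; the score-to-posterior normalisation is likewise an assumption.

```python
import torch

def sequence_kd_loss(teacher_scores, student_scores):
    """N-best approximation of sequence-level teacher-student learning:
    normalise each side's hypothesis scores (shape: batch x n_best) into
    sequence posteriors and minimise their KL divergence."""
    p_t = torch.softmax(teacher_scores, dim=-1)          # teacher posteriors
    log_p_s = torch.log_softmax(student_scores, dim=-1)  # student log-posteriors
    return torch.sum(p_t * (torch.log(p_t) - log_p_s), dim=-1).mean()
```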
The proposals in this thesis are assessed on four ASR tasks. These are the augmented multi-party interaction meeting transcription, IARPA Babel Tok Pisin conversational telephone speech, English broadcast news, and multi-genre broadcast tasks. These datasets provide a variety of quantities of training data, recording environments, and speaking styles.
Robust Model Compression Using Deep Hypotheses
Machine learning models should ideally be compact and robust. Compactness provides efficiency and comprehensibility, whereas robustness provides resilience. Both topics have been studied in recent years, but in isolation. Here we present a robust model compression scheme which is independent of model types: it can compress ensembles, neural networks, and other types of models into diverse types of small models. The main building block is the notion of depth, derived from robust statistics. Originally, depth was introduced as a measure of the centrality of a point in a sample, such that the median is the deepest point. This concept was extended to classification functions, which makes it possible to define the depth of a hypothesis and the median hypothesis. Algorithms have been suggested to approximate the median, but they have been limited to binary classification. In this study, we present a new algorithm, the Multiclass Empirical Median Optimization (MEMO) algorithm, that finds a deep hypothesis in multi-class tasks, and prove its correctness. This leads to our Compact Robust Estimated Median Belief Optimization (CREMBO) algorithm for robust model compression. We demonstrate the success of this algorithm empirically by compressing neural networks and random forests into small decision trees, which are interpretable models, and show that they are more accurate and robust than other comparable methods. In addition, our empirical study shows that our method outperforms knowledge distillation on DNN-to-DNN compression.
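For context, the generic compression-by-imitation setup that such work builds on (this is not the CREMBO algorithm itself, just the baseline idea of fitting a small interpretable model to a teacher's predictions) can be sketched with scikit-learn:

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

# Fit a small, interpretable tree on the labels predicted by a larger
# teacher model, rather than on the raw ground-truth labels.
X, y = load_digits(return_X_y=True)
teacher = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
student = DecisionTreeClassifier(max_depth=6, random_state=0)
student.fit(X, teacher.predict(X))  # imitate the teacher
print("agreement with teacher:",
      (student.predict(X) == teacher.predict(X)).mean())
```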
Multi-teacher knowledge distillation as an effective method for compressing ensembles of neural networks
Deep learning has contributed greatly to many successes in artificial intelligence in recent years. Today, it is possible to train models that have thousands of layers and hundreds of billions of parameters. Large-scale deep models have achieved great success, but the enormous computational complexity and gigantic storage requirements make it extremely difficult to implement them in real-time applications. On the other hand, the size of the dataset is still a real problem in many domains. Data are often missing, too expensive, or impossible to obtain for other reasons. Ensemble learning is a partial solution to the problems of small datasets and overfitting. However, ensemble learning in its basic version is associated with a linear increase in computational complexity. We analyzed the impact of the ensemble decision-fusion mechanism and examined various methods of sharing the decisions, including voting algorithms. We used a modified knowledge distillation framework as a decision-fusion mechanism, which additionally allows the entire ensemble to be compressed into the weight space of a single model. We showed that knowledge distillation can aggregate knowledge from multiple teachers into a single student model and, at the same computational complexity, obtain a better-performing model than one trained in the standard manner. We developed our own method for mimicking the responses of all teachers simultaneously. We tested these solutions on several benchmark datasets. Finally, we present a wide range of applications of the efficient multi-teacher knowledge distillation framework. In the first example, we used knowledge distillation to develop models that could automate corrosion detection on aircraft fuselages. The second example describes the detection of smoke on observation cameras in order to counteract wildfires in forests.
Comment: Doctoral dissertation in the field of computer science, machine learning. Application of knowledge distillation as aggregation of ensemble models, along with several uses. 140 pages, 67 figures, 13 tables.
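One simple decision-fusion rule for multi-teacher distillation is sketched below: average the temperature-softened distributions of all teachers into a single soft target. This is an assumed rule for illustration, not necessarily the dissertation's own fusion method; the temperature `T` is likewise a hypothetical hyperparameter.

```python
import torch
import torch.nn.functional as F

def multi_teacher_targets(teacher_logits_list, T=2.0):
    """Fuse an ensemble's decisions by averaging the teachers'
    temperature-softened output distributions."""
    probs = [F.softmax(t / T, dim=-1) for t in teacher_logits_list]
    return torch.stack(probs).mean(dim=0)

def multi_teacher_kd_loss(student_logits, teacher_logits_list, T=2.0):
    """Train one student against the fused soft target, so only the
    student is needed at inference time."""
    target = multi_teacher_targets(teacher_logits_list, T)
    return F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                    target, reduction="batchmean") * (T * T)
```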
Knowledge Base Population using Semantic Label Propagation
A crucial aspect of a knowledge base population system that extracts new facts from text corpora is the generation of training data for its relation extractors. In this paper, we present a method that maximizes the effectiveness of newly trained relation extractors at a minimal annotation cost. Manual labeling can be significantly reduced by Distant Supervision, which is a method to construct training data automatically by aligning a large text corpus with an existing knowledge base of known facts. For example, all sentences mentioning both 'Barack Obama' and 'US' may serve as positive training instances for the relation born_in(subject,object). However, distant supervision typically results in a highly noisy training set: many training sentences do not really express the intended relation. We propose to combine distant supervision with minimal manual supervision in a technique called feature labeling, to eliminate noise from the large and noisy initial training set, resulting in a significant increase of precision. We further improve on this approach by introducing the Semantic Label Propagation method, which uses the similarity between low-dimensional representations of candidate training instances to extend the training set in order to increase recall while maintaining high precision. Our proposed strategy for generating training data is studied and evaluated on an established test collection designed for knowledge base population tasks. The experimental results show that the Semantic Label Propagation strategy leads to substantial performance gains when compared to existing approaches, while requiring an almost negligible manual annotation effort.
Comment: Submitted to Knowledge-Based Systems, special issue on Knowledge Bases for Natural Language Processing
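The distant supervision step described above reduces to a simple alignment rule, sketched here on a toy corpus (the knowledge base entry and sentences are invented for illustration). The second sentence shows exactly the noise problem the abstract points out: it mentions both entities without expressing the relation.

```python
# Toy distant-supervision sketch: any sentence containing both entities of a
# known fact becomes a (noisy) positive instance for that relation.
kb = {("Barack Obama", "US"): "born_in"}  # example fact, as in the abstract
corpus = [
    "Barack Obama was born in the US in 1961.",
    "Barack Obama visited the US Congress.",  # noisy: matches, but not born_in
]

training_set = []
for (subj, obj), relation in kb.items():
    for sentence in corpus:
        if subj in sentence and obj in sentence:
            training_set.append((sentence, subj, obj, relation))

for example in training_set:
    print(example)  # both sentences are labeled positive, illustrating the noise
```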
Toward Transparent Sequence Models with Model-Based Tree Markov Model
In this study, we address the interpretability issue in complex, black-box machine learning models applied to sequence data. We introduce the Model-Based Tree Hidden Semi-Markov Model (MOB-HSMM), an inherently interpretable model aimed at detecting high mortality risk events and discovering hidden patterns associated with mortality risk in Intensive Care Units (ICU). This model leverages knowledge distilled from Deep Neural Networks (DNN) to enhance predictive performance while offering clear explanations. Our experimental results indicate the improved performance of model-based trees (MOB trees) achieved by employing an LSTM to learn sequential patterns, which are then transferred to the MOB trees. Integrating MOB trees with the Hidden Semi-Markov Model (HSMM) in the MOB-HSMM enables uncovering potential and explainable sequences using the available information.