DeepKey: Towards End-to-End Physical Key Replication From a Single Photograph
This paper describes DeepKey, an end-to-end deep neural architecture capable
of taking a digital RGB image of an 'everyday' scene containing a pin tumbler
key (e.g. lying on a table or carpet) and fully automatically inferring a
printable 3D key model. We report on the key detection performance and describe
how candidates can be transformed into physical prints. We show an example
opening a real-world lock. Our system is described in detail, providing a
breakdown of all components including key detection, pose normalisation,
bitting segmentation and 3D model inference. We provide an in-depth evaluation
and conclude by reflecting on limitations, applications, potential security
risks and societal impact. We contribute the DeepKey Datasets of 5,300+ images
covering a few test keys with bounding boxes, pose and unaligned mask data.
Comment: 14 pages, 12 figures
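The staged pipeline named in the abstract (detection, pose normalisation, bitting segmentation, 3D model inference) can be sketched as a simple function chain. All stage bodies below are placeholder stubs and the bitting values are invented for illustration; this is not the paper's actual models.

```python
# Illustrative sketch of a DeepKey-style staged pipeline. Every function
# body is a hypothetical stub (assumption), standing in for a learned model.

def detect_key(image):
    # Stage 1 (stub): locate the key in the scene, return a bounding box.
    return {"bbox": (10, 20, 110, 60)}

def normalise_pose(image, detection):
    # Stage 2 (stub): rotate/scale the detected crop to a canonical pose.
    return {"aligned_crop": "canonical-view"}

def segment_bitting(aligned):
    # Stage 3 (stub): extract the bitting (cut-depth) profile of the key.
    return {"bitting": [3, 5, 2, 6, 4]}

def infer_3d_model(bitting):
    # Stage 4 (stub): map cut depths to a printable 3D model specification.
    depths = "-".join(map(str, bitting["bitting"]))
    return {"mesh": f"key-model-{depths}"}

def deepkey_pipeline(image):
    det = detect_key(image)
    aligned = normalise_pose(image, det)
    bitting = segment_bitting(aligned)
    return infer_3d_model(bitting)

result = deepkey_pipeline("photo.jpg")
```

The value of the staged decomposition is that each component can be evaluated and swapped independently, which is how the paper reports its per-component breakdown.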
Optimisation Methods For Training Deep Neural Networks in Speech Recognition
Automatic Speech Recognition (ASR) is an example of a sequence-level classification task where, given an acoustic waveform, the goal is to produce the correct word-level hypotheses. In machine learning, a classification problem such as ASR is solved in two stages: an inference stage that models the uncertainty associated with the choice of hypothesis given the acoustic waveform using a mathematical model, and a decision stage that employs the inference model in conjunction with decision theory to make optimal class assignments. With the advent of careful network initialisation and GPU computing, hybrid Hidden Markov Models (HMMs) augmented with Deep Neural Networks (DNNs) have been shown to outperform traditional HMMs using Gaussian Mixture Models (GMMs) in solving the inference problem for ASR. In comparison to GMMs, DNNs are better able to model the underlying non-linear data manifold due to their deep and complex structure. While the structure of such models gives rich modelling capability, it also creates complex dependencies between the parameters, which can make learning difficult via first-order stochastic gradient descent (SGD). The task of finding the best procedure to train DNNs continues to be an active area of research and has been made even more challenging by the availability of ever more training data. This thesis focuses on designing better optimisation approaches to train hybrid HMM-DNN models using a sequence-level discriminative criterion, a natural loss function that preserves the sequential ordering of frames within a spoken utterance. The thesis presents an implementation of the second-order Hessian-Free (HF) optimisation method, and shows how the method can be made efficient through appropriate modifications to the Conjugate Gradient algorithm. To achieve better convergence than SGD, this work also explores the Natural Gradient (NG) method for training DNNs with discriminative sequence training.
In the DNN literature, the NG method has been applied to train models under the Maximum Likelihood objective criterion. A novel contribution of this thesis is to extend this approach to the domain of Minimum Bayes Risk objective functions for discriminative sequence training. With sigmoid models trained on 50-hour and 200-hour training sets from the Multi-Genre Broadcast 1 (MGB1) transcription task, the NG method, applied in an HF-style optimisation framework, is shown to achieve better Word Error Rate (WER) reductions on the MGB1 development set than SGD-based sequence training.
This thesis also addresses the particular issue of overfitting between the training criterion and WER, which primarily arises during sequence training of DNN models that use Rectified Linear Units (ReLUs) as activation functions. It is shown how, by scaling with the Gauss-Newton matrix, the HF method, unlike other approaches, can overcome this issue. Since different optimisers work best with different models, it is attractive to have a consistent optimisation framework that is agnostic to the choice of activation function. To address this, the thesis develops the geometry of the underlying function space captured by different realisations of DNN model parameters, and presents the design considerations for an optimisation algorithm to be well defined on this space. Building on this analysis, a novel optimisation technique called NGHF is presented that uses both the direction of steepest descent on a probabilistic manifold and local curvature information to effectively probe the error surface. The method rests on an alternative derivation of Taylor's theorem using the concepts of manifolds, tangent vectors and directional derivatives from the perspective of Information Geometry. Apart from being well defined on the function space, when framed within an HF-style optimisation framework, NGHF is shown to achieve the greatest WER reductions from sequence training on the MGB1 development set with both sigmoid and ReLU based models trained on the 200-hour MGB1 training set. An evaluation of the above optimisation methods for training different DNN model architectures is also presented.
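The core computational trick behind Hessian-Free optimisation is that the curvature system G d = -g is solved by conjugate gradients, which only ever needs matrix-vector products with G, never G itself. A minimal sketch, assuming a small explicit symmetric positive-definite matrix as a stand-in for the Gauss-Newton matrix (in real HF training the product G @ v comes from an extra forward/backward pass through the network, not shown here):

```python
import numpy as np

def cg_solve(matvec, b, iters=50, tol=1e-12):
    """Conjugate gradients for A x = b, with A accessed only via matvec."""
    x = np.zeros_like(b)
    r = b - matvec(x)          # initial residual
    p = r.copy()               # initial search direction
    rs = r @ r
    for _ in range(iters):
        Ap = matvec(p)
        alpha = rs / (p @ Ap)  # exact line search along p
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        if rs_new < tol:
            break
        p = r + (rs_new / rs) * p   # conjugate direction update
        rs = rs_new
    return x

# Toy stand-in for the Gauss-Newton matrix and gradient (assumptions).
rng = np.random.default_rng(0)
A = rng.standard_normal((5, 5))
G = A @ A.T + 5.0 * np.eye(5)   # symmetric positive definite by construction
g = rng.standard_normal(5)

step = cg_solve(lambda v: G @ v, -g)   # HF update direction d = -G^{-1} g
```

Because only matvec is required, the same solver works when G is too large to store, which is exactly the regime of DNN training.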
Generalization of Extended Baum-Welch Parameter Estimation for Discriminative Training and Decoding
We demonstrate the generalizability of the Extended Baum-Welch (EBW) algorithm not only for HMM parameter estimation but for decoding as well. We show that there can exist a general function associated with the objective function under EBW that reduces to the well-known auxiliary function used in the Baum-Welch algorithm for maximum likelihood estimates. We generalize the representation of the model-parameter updates by making use of a differentiable function (such as the arithmetic or geometric mean) of the updated and current model parameters and describe its effect on the learning rate during HMM parameter estimation. Improvements on speech recognition tasks are also presented.
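For a discrete distribution, the classic EBW re-estimation step that this line of work builds on interpolates the current parameters with discriminative statistics: p_i' = (c_i + D p_i) / (sum_j c_j + D), where the c_i may be negative and the constant D both guarantees a valid distribution and acts as an inverse learning rate. A minimal sketch with toy numbers (the counts and D below are illustrative assumptions, not the paper's experiments):

```python
def ebw_update(counts, p, D):
    """One EBW step for a discrete distribution.

    counts: discriminative statistics c_i (numerator minus denominator
            occupancies; may be negative).
    p:      current parameters, a valid probability distribution.
    D:      smoothing constant; larger D means a smaller effective step.
    """
    Z = sum(counts) + D
    return [(c + D * pi) / Z for c, pi in zip(counts, p)]

p = [0.5, 0.3, 0.2]
counts = [2.0, -0.5, 0.5]          # toy discriminative counts (assumption)
p_new = ebw_update(counts, p, D=4.0)
```

With D chosen large enough, the updated values remain positive and sum to one, which is why D controls the learning rate: as D grows, p_new shrinks toward the current p.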
Automatic speech recognition: from study to practice
Today, automatic speech recognition (ASR) is widely used for different purposes, such as robotics, multimedia, and medical and industrial applications. Although much research has been carried out in this field over the past decades, there is still considerable room for improvement. In order to start working in this area, a thorough knowledge of ASR systems, as well as of their weak points and open problems, is essential. In addition, practical experience reliably deepens theoretical understanding. With this in mind, this master's thesis first reviews the principal structure of standard HMM-based ASR systems from a technical point of view: feature extraction, acoustic modeling, language modeling and decoding. The most significant challenges in ASR systems are then discussed; these concern both the characteristics of internal components and the external factors that affect ASR system performance. Furthermore, we have implemented a Spanish-language recognizer using the HTK toolkit. Finally, based on studies of different sources in the field of ASR, two open research lines are suggested for future work.
Multitask Learning with Low-Level Auxiliary Tasks for Encoder-Decoder Based Speech Recognition
End-to-end training of deep learning-based models allows for implicit
learning of intermediate representations based on the final task loss. However,
the end-to-end approach ignores the useful domain knowledge encoded in explicit
intermediate-level supervision. We hypothesize that using intermediate
representations as auxiliary supervision at lower levels of deep networks may
be a good way of combining the advantages of end-to-end training and more
traditional pipeline approaches. We present experiments on conversational
speech recognition where we use lower-level tasks, such as phoneme recognition,
in a multitask training approach with an encoder-decoder model for direct
character transcription. We compare multiple types of lower-level tasks and
analyze the effects of the auxiliary tasks. Our results on the Switchboard
corpus show that this approach improves recognition accuracy over a standard
encoder-decoder model on the Eval2000 test set.
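The multitask objective described above combines the main character-transcription loss with a weighted lower-level auxiliary loss (e.g. phoneme recognition) attached to an intermediate encoder layer. A minimal sketch, assuming simple cross-entropy stand-ins and an illustrative auxiliary weight (not the paper's actual losses or hyperparameters):

```python
import numpy as np

def cross_entropy(probs, target):
    # Negative log-probability of the reference label.
    return -np.log(probs[target])

def multitask_loss(char_probs, char_target,
                   phone_probs, phone_target, aux_weight=0.3):
    # Main task: character-level decoder loss.
    main = cross_entropy(char_probs, char_target)
    # Auxiliary task: phoneme loss from a lower encoder layer.
    aux = cross_entropy(phone_probs, phone_target)
    # Weighted sum; aux_weight trades off auxiliary supervision.
    return main + aux_weight * aux

char_probs = np.array([0.7, 0.2, 0.1])   # toy decoder posterior (assumption)
phone_probs = np.array([0.6, 0.4])       # toy intermediate posterior (assumption)
loss = multitask_loss(char_probs, 0, phone_probs, 0)
```

Setting aux_weight to zero recovers plain end-to-end training, which makes the auxiliary supervision easy to ablate, as the paper's comparisons do.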
Large-margin Gaussian mixture modeling for automatic speech recognition
Thesis (S.M.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2008. Includes bibliographical references (p. 101-103).
Discriminative training for acoustic models has been widely studied to improve the performance of automatic speech recognition systems. To enhance the generalization ability of discriminatively trained models, a large-margin training framework has recently been proposed. This work investigates large-margin training in detail, integrates the training with more flexible classifier structures such as hierarchical classifiers and committee-based classifiers, and compares the performance of the proposed modeling scheme with existing discriminative methods such as minimum classification error (MCE) training. Experiments are performed on a standard phonetic classification task and a large vocabulary speech recognition (LVCSR) task. In the phonetic classification experiments, the proposed modeling scheme yields about 1.5% absolute error reduction over the current state of the art. In the LVCSR experiments on the MIT lecture corpus, the large-margin model has about 6.0% absolute word error rate reduction over the baseline model and about 0.6% absolute error rate reduction over the MCE model.
by Hung-An Chang. S.M.
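The large-margin criterion can be summarised as a hinge penalty: the correct class's discriminant score must beat every competitor's score by at least a fixed margin, and only violations contribute to the loss. A minimal sketch with toy scores and margin (assumptions, not the thesis's acoustic models):

```python
def large_margin_loss(scores, correct, margin=1.0):
    """Hinge-style large-margin loss.

    scores:  per-class discriminant scores (higher is better).
    correct: index of the reference class.
    margin:  required score gap over each competitor.
    """
    loss = 0.0
    for c, s in enumerate(scores):
        if c == correct:
            continue
        # Penalise only when the margin constraint is violated.
        loss += max(0.0, margin - (scores[correct] - s))
    return loss

loss = large_margin_loss([2.0, 0.5, 1.8], correct=0)
```

Here the second competitor (score 1.8) is within the margin of the correct class (score 2.0), so only that pair contributes; enforcing the gap, rather than just correct classification, is what gives the trained models their better generalization.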