
    GEOMETRIC APPROACHES TO DISCRIMINATIVE TRAINING

    Discriminative training, as a general machine learning approach, has wide applications in tasks such as Natural Language Processing (NLP) and Automatic Speech Recognition (ASR). In this thesis, we are interested in online methods for discriminative training because of their simplicity, efficiency and scalability. The novel methods we propose are summarized as follows.

    First, an interesting subclass of online learning algorithms adopts multiplicative rather than additive strategies to update the parameters of linear models, but none of these algorithms can be used directly for structured prediction as required by many NLP tasks. We extend the multiplicative Winnow algorithm to a structured version and the additive MIRA algorithm to a multiplicative version, and apply them to NLP tasks. We also interpret the relationship between EG and prod, two multiplicative algorithms, from an information-geometric perspective.

    Second, although general online learning frameworks, notably Online Mirror Descent (OMD), exist and subsume many specific algorithms, they are not well suited to deriving multiplicative algorithms. We therefore propose a new general framework, the Generalized Multiplicative Update (GMU), that is multiplicative in nature and from which many specific multiplicative algorithms are easily derived. We then propose a subclass of GMU, the q-Exponentiated Gradient (qEG) method, that elucidates the relationship among several of these algorithms. To better understand the difference between OMD and GMU, we further analyze these algorithms from a Riemannian-geometric perspective. We also extend OMD and GMU to accelerated versions by adding momentum terms.

    Third, although natural gradient descent (NGD) is often hard to apply in practice because of its computational cost, we propose a novel approach to CRF training that allows NGD to be applied efficiently. The loss functions, defined via Bregman divergences, generalize the log-likelihood objective and can easily be coupled with NGD for optimization. The proposed framework is flexible, allowing us to choose convex functions that lead to better training performance.

    Finally, traditional vector-space linear models require estimating as many parameters as there are model features. With millions of features, a common situation in many NLP tasks, this can complicate training, especially when labeled training data is scarce. We propose a novel online learning approach that shifts from vector space to tensor space, dramatically reducing the number of parameters to be estimated. The resulting model is highly regularized and is particularly well suited to training in low-resource settings.
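
    To make the additive-versus-multiplicative distinction concrete, the sketch below contrasts a plain gradient step with a standard Exponentiated Gradient (EG) step on the probability simplex. It is a minimal illustration under our own assumptions (a toy squared-error loss and made-up variable names), not code or notation taken from the thesis.

        import numpy as np

        def additive_update(w, grad, eta=0.1):
            # Standard additive (gradient-descent style) step.
            return w - eta * grad

        def eg_update(w, grad, eta=0.1):
            # Exponentiated Gradient (EG): multiplicative step followed by
            # renormalization, so the weights stay on the probability simplex.
            w_new = w * np.exp(-eta * grad)
            return w_new / w_new.sum()

        # Toy usage: one step on 0.5 * (w.x - y)^2 for a single example.
        w = np.full(4, 0.25)                        # uniform start on the simplex
        x, y = np.array([1.0, 0.5, -0.2, 0.0]), 1.0
        grad = (w.dot(x) - y) * x                   # gradient of the squared error
        print(additive_update(w, grad))
        print(eg_update(w, grad))

    The multiplicative form keeps all weights positive, and with the normalization it is the classical mirror-descent update under a relative-entropy (KL) regularizer, which is one standard way the additive and multiplicative families are connected.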