
    Large Scale Kernel Methods for Fun and Profit

    Kernel methods are among the most flexible classes of machine learning models with strong theoretical guarantees. Wide classes of functions can be approximated arbitrarily well with kernels, while fast convergence and learning rates have been formally shown to hold. Exact kernel methods are known to scale poorly with increasing dataset size, and we believe that one of the factors limiting their usage in modern machine learning is the lack of scalable and easy-to-use algorithms and software. The main goal of this thesis is to study kernel methods from the point of view of efficient learning, with particular emphasis on large-scale data, but also on low-latency training and user efficiency. We improve the state of the art for scaling kernel solvers to datasets with billions of points using the Falkon algorithm, which combines random projections with fast optimization. Running it on GPUs, we show how to fully utilize the available computing power for training kernel machines. To improve the ease of use of approximate kernel solvers, we propose an algorithm for automated hyperparameter tuning. By minimizing a penalized loss function, a model can be learned together with its hyperparameters, reducing the time needed for user-driven experimentation. In the setting of multi-class learning, we show that – under stringent but realistic assumptions on the separation between classes – a wide set of algorithms needs far fewer data points than in the more general setting (without assumptions on class separation) to reach the same accuracy. The first part of the thesis develops a framework for efficient and scalable kernel machines. This raises the question of whether our approaches can be used successfully in real-world applications, especially compared to alternatives based on deep learning, which are often deemed hard to beat. The second part investigates this question on two main applications, chosen because of the paramount importance of having an efficient algorithm. First, we consider the problem of instance segmentation of images taken from the iCub robot. Here Falkon is used as part of a larger pipeline, but the efficiency afforded by our solver is essential to ensure smooth human-robot interactions. Second, we consider time-series forecasting of wind speed, analysing the relevance of different physical variables to the predictions. We investigate different schemes to adapt i.i.d. learning to the time-series setting. Overall, this work aims to demonstrate, through novel algorithms and examples, that kernel methods are up to computationally demanding tasks, and that there are concrete applications in which their use is warranted and more efficient than that of other, more complex, and less theoretically grounded models.
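    To make the scaling idea concrete, below is a minimal NumPy sketch of the Nyström-style approximation that underlies solvers such as Falkon: a random subset of m ≪ n points serves as centers, so training reduces to a small m × m linear system instead of the full n × n one. The kernel, toy data, and regularization value are illustrative assumptions; the actual Falkon solver adds preconditioning, conjugate-gradient iterations, and GPU execution on top of this basic idea.

```python
import numpy as np

def gaussian_kernel(A, B, sigma=1.0):
    # Pairwise squared distances via the (a-b)^2 = a^2 + b^2 - 2ab expansion.
    sq = (A**2).sum(1)[:, None] + (B**2).sum(1)[None, :] - 2 * A @ B.T
    return np.exp(-sq / (2 * sigma**2))

rng = np.random.default_rng(0)
n, m, d, lam = 2000, 100, 5, 1e-3
X = rng.standard_normal((n, d))
y = np.sin(X.sum(1)) + 0.1 * rng.standard_normal(n)

centers = X[rng.choice(n, size=m, replace=False)]   # m << n Nystrom centers
K_nm = gaussian_kernel(X, centers)                  # n x m cross-kernel
K_mm = gaussian_kernel(centers, centers)            # m x m center kernel

# Standard Nystrom kernel ridge regression system:
#   (K_nm^T K_nm + lam * n * K_mm) alpha = K_nm^T y
alpha = np.linalg.solve(K_nm.T @ K_nm + lam * n * K_mm, K_nm.T @ y)
y_hat = K_nm @ alpha                                # in-sample predictions
print("train MSE:", np.mean((y - y_hat) ** 2))
```

    The key saving is that only an m × m system is ever solved, while the n × m cross-kernel can be computed in blocks, which is what makes billion-point datasets feasible in principle.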

    Hash kernels and structured learning

    With vast amounts of data being generated, processing massive datasets remains a challenge for machine learning algorithms. We propose hash kernels as efficient kernels that can deal with massive multi-class problems. We show a principled way to compute the kernel matrix for data streams and sparse feature spaces, and we further generalise the approach via sampling to graphs. We then exploit the connection between hash kernels and compressed sensing, and apply hashing to face recognition, which yields significant speed-ups over the state of the art with competitive accuracy. We also give a recovery rate for the sparse representation and a bound on the recognition rate. As hash kernels can deal with data with structure in the input, such as graphs and face images, the second part of the thesis moves on to an even more challenging task - dealing with data with structure in the output. Recent advances in machine learning exploit the dependencies among outputs, making it possible to deal with complex, structured data. We study the most popular structured learning algorithms and categorise them into two categories - probabilistic approaches and max-margin approaches. We show the connections between different algorithms, reformulate them in the empirical risk minimisation framework, and compare their advantages and disadvantages, which helps in choosing a suitable algorithm according to the characteristics of the application. This thesis makes both practical and theoretical contributions. On the practical side, we show real-world applications of structured learning: a) We propose a novel approach to automatic paragraph segmentation, namely training semi-Markov models discriminatively using a max-margin method. This allows us to model the sequential nature of the problem and to incorporate features of a whole paragraph, such as paragraph coherence, which cannot be used in previous models. b) We jointly segment and recognise actions in video sequences with a discriminative semi-Markov model framework, which incorporates features that capture the characteristics of boundary frames, action segments and neighbouring action segments. A Viterbi-like algorithm is devised to efficiently solve the induced optimisation problem. c) We propose a novel hybrid loss combining Conditional Random Fields (CRFs) and Support Vector Machines (SVMs), and apply it to various applications such as text chunking, named entity recognition and joint image categorisation. On the theoretical side: a) We study recent advances in PAC-Bayes bounds and apply them to structured learning. b) We propose a more refined notion of Fisher consistency, namely Conditional Fisher Consistency for Classification (CFCC), which conditions on knowledge of the true distribution of class labels. c) We show that the hybrid loss has the advantages of both CRFs and SVMs - it is consistent and has a tight PAC-Bayes bound which shrinks as the margin increases. d) We also introduce probabilistic margins, which take the label distribution into account, and show that many existing algorithms can be viewed as special cases of this new margin concept, which may help in understanding existing algorithms as well as in designing new ones. Finally, we discuss future directions such as tightening PAC-Bayes bounds, adaptive hybrid losses and graphical model inference via compressed sensing.
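    As a concrete illustration of the hashing idea, here is a minimal Python sketch of signed feature hashing, the construction behind hash kernels: each token is hashed into a fixed number of buckets, with a second bit of the hash supplying a sign so that inner products in the hashed space are unbiased estimates of the original ones. The function name and bucket count are illustrative, not taken from the thesis.

```python
import hashlib
import numpy as np

def hashed_features(tokens, n_buckets=2**10):
    # Signed feature hashing: bucket index from the low bits of the hash,
    # sign from a higher bit, keeping the hashed inner product unbiased.
    x = np.zeros(n_buckets)
    for tok in tokens:
        h = int(hashlib.md5(tok.encode()).hexdigest(), 16)  # stable 128-bit hash
        idx = h % n_buckets
        sign = 1.0 if (h >> 64) % 2 == 0 else -1.0
        x[idx] += sign
    return x

doc_a = "the quick brown fox".split()
doc_b = "the quick red fox".split()
xa, xb = hashed_features(doc_a), hashed_features(doc_b)
# Approximates the exact bag-of-words inner product (3 shared tokens here),
# while the dimensionality stays fixed no matter how large the vocabulary grows.
print("hashed kernel value:", xa @ xb)
```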

    The Emerging Trends of Multi-Label Learning

    Exabytes of data are generated daily by humans, leading to a growing need for new efforts to deal with the grand challenges that big data brings to multi-label learning. For example, extreme multi-label classification is an active and rapidly growing research area that deals with classification tasks involving an extremely large number of classes or labels, and utilizing massive data with limited supervision to build multi-label classification models is becoming valuable for practical applications. Beyond these, tremendous efforts have been devoted to harvesting the strong learning capability of deep learning to better capture label dependencies in multi-label learning, which is key for deep learning to address real-world classification tasks. However, there has been a lack of systematic studies that focus explicitly on analyzing the emerging trends and new challenges of multi-label learning in the era of big data. It is imperative to call for a comprehensive survey to fulfill this mission and delineate future research directions and new applications. (Accepted to TPAMI.)
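    For a runnable starting point, the scikit-learn sketch below shows one classical way of modeling the label dependencies that the survey highlights: a classifier chain, in which each per-label binary model also sees the predictions for earlier labels. The synthetic data is an illustrative assumption; the survey itself covers far broader approaches, including extreme-scale and deep methods.

```python
from sklearn.datasets import make_multilabel_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.multioutput import ClassifierChain

# Synthetic multi-label data: each sample may carry several of 5 labels.
X, Y = make_multilabel_classification(n_samples=500, n_classes=5, random_state=0)
X_tr, X_te, Y_tr, Y_te = train_test_split(X, Y, random_state=0)

# Chain of logistic regressions; label k's model is fed labels 0..k-1,
# which lets correlated labels inform each other.
chain = ClassifierChain(LogisticRegression(max_iter=1000), random_state=0)
chain.fit(X_tr, Y_tr)

# score() on multi-label data reports subset (exact-match) accuracy.
print("subset accuracy:", chain.score(X_te, Y_te))
```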

    Advanced models of supervised structural clustering

    The strength and power of structured prediction approaches in machine learning originates from a proper recognition and exploitation of the inherent structural dependencies within the complex objects that structural models are trained to output. Among the complex tasks that have benefited from structured prediction approaches, clustering is of special interest. Structured output models that represent clusters by latent graph structures have made the task of supervised clustering tractable. While in practice these models have proved effective in solving the complex NLP task of coreference resolution, in this thesis we explore their capacity to be extended to other tasks and domains, as well as methods for performing such adaptation and for improvement in general, which, as a result, go beyond clustering and are broadly applicable in structured prediction. Studying the extensibility of structural approaches to supervised clustering, we apply them to two different domains in two different ways. First, in the networking domain, we cluster network traffic by adapting the model to take into account the continuity of incoming data. Our experiments demonstrate that the structural clustering approach is not only effective in such a scenario but also, from a different perspective, provides a novel and potentially useful tool for detecting anomalies. The other part of our work is dedicated to assessing the amenability of the structural clustering model to joint learning with another structural model, for ranking. Our preliminary analysis in the context of answer-passage reranking in question answering reveals a potential benefit of incorporating auxiliary clustering structures. The intrinsic complexity of the clustering task, and of its evaluation scenarios, gave us grounds for studying the possibility and the effect of optimizing task-specific complex measures in structured prediction algorithms. It is common for structured prediction approaches to optimize surrogate loss functions, rather than the actual task-specific ones, in order to facilitate inference and preserve efficiency. In this thesis we first study when surrogate losses are sufficient and then take a step towards enabling direct optimization of complex structural loss functions. We propose to learn an approximation of a complex loss by a regressor from data. We formulate a general structural framework for learning with a learned loss, which, applied to a particular clustering problem – coreference resolution – i) enables the optimization of a coreference metric that by itself has high computational complexity, and ii) delivers an improvement over standard structural models that optimize simple surrogate objectives. We foresee this idea being helpful in many structured prediction applications, also as a means of adaptation to specific evaluation scenarios, and especially when a good loss approximation is found by a regressor from an induced feature space that allows good factorization over the underlying structure.
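    The "learning with a learned loss" idea can be sketched in a few lines: fit a regressor that approximates an expensive, non-decomposable clustering loss from cheap features that factorize over the structure, then use the regressor as a surrogate during learning. The features, toy data, and choice of regressor below are illustrative assumptions, not the thesis's actual coreference setup.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import adjusted_rand_score

rng = np.random.default_rng(0)
true_labels = rng.integers(0, 4, size=60)          # gold clustering of 60 items

def cheap_features(pred, gold):
    # Simple statistics that factorize over item pairs.
    same_pred = pred[:, None] == pred[None, :]
    same_gold = gold[:, None] == gold[None, :]
    return [(same_pred == same_gold).mean(),       # pairwise agreement with gold
            float(len(np.unique(pred))),           # number of predicted clusters
            same_pred.mean()]                      # fraction of same-cluster pairs

# Candidate clusterings: copies of the gold clustering with increasing noise.
cands = [np.where(rng.random(60) < p, rng.integers(0, 4, 60), true_labels)
         for p in np.linspace(0.0, 0.9, 200)]
X = np.array([cheap_features(c, true_labels) for c in cands])
# The "expensive" target loss, stands in for a costly coreference metric.
y = np.array([1.0 - adjusted_rand_score(true_labels, c) for c in cands])

idx = rng.permutation(len(cands))
train, test = idx[:150], idx[150:]
reg = RandomForestRegressor(random_state=0).fit(X[train], y[train])
err = np.abs(reg.predict(X[test]) - y[test]).mean()
print("held-out approximation error of the learned loss:", round(err, 4))
```

    Once the regressor tracks the complex metric well enough, it can replace the hand-crafted surrogate inside the structured learner's objective, which is the mechanism the thesis exploits for coreference resolution.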

    Gradient boosting in automatic machine learning: feature selection and hyperparameter optimization

    The goal of automatic machine learning (AutoML) is to automate all aspects of model selection in (supervised) predictive modeling. This thesis deals with gradient boosting techniques in the context of AutoML, with a focus on gradient tree boosting and component-wise gradient boosting. Both techniques share a common methodology, but their goals are quite different: while gradient tree boosting is widely used in machine learning as a powerful prediction algorithm, the strength of component-wise gradient boosting lies in feature selection and the modeling of high-dimensional data. Extensions of component-wise gradient boosting to multidimensional prediction functions are considered as well. The challenge of hyperparameter optimization for these algorithms is discussed, focusing on Bayesian optimization and efficient early-stopping strategies. A large-scale random search over the hyperparameters of various learning algorithms shows the critical influence of hyperparameter configurations on model quality; these data can serve as a foundation for new AutoML and meta-learning approaches. Furthermore, advanced feature selection strategies are summarized and a new method based on shadow features (permuted variables) is introduced. Finally, an AutoML approach based on these results and on best practices for feature selection and hyperparameter optimization is proposed, with the goal of simplifying and stabilizing AutoML while maintaining high prediction accuracy. This approach is compared to AutoML methods that use much more complex search spaces and ensembling techniques. Four software packages for the statistical programming language R have been newly developed or extended as part of this thesis: mlrMBO, a general framework for Bayesian optimization; autoxgboost, an automatic machine learning framework that heavily utilizes gradient tree boosting; compboost, a modular framework for component-wise boosting written in C++; and gamboostLSS, a framework for component-wise boosting of generalized additive models for location, scale and shape.
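    To illustrate the shadow-feature idea in runnable form, the sketch below appends a permuted copy of every feature, fits gradient tree boosting, and keeps only the real features whose importance exceeds that of the best shadow. The thesis's own implementation is in R; this Python/scikit-learn rendering of the general idea, including the max-shadow threshold rule, is a simplifying assumption.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(0)
# shuffle=False puts the 3 informative features in the first columns.
X, y = make_classification(n_samples=400, n_features=10, n_informative=3,
                           n_redundant=0, shuffle=False, random_state=0)

# Shadow features: each column permuted independently, destroying any
# relationship with y while preserving each feature's marginal distribution.
shadows = rng.permuted(X, axis=0)
X_aug = np.hstack([X, shadows])

model = GradientBoostingClassifier(random_state=0).fit(X_aug, y)
imp = model.feature_importances_
real_imp, shadow_imp = imp[:10], imp[10:]

# Keep only real features that beat the best importance achieved by noise.
selected = np.where(real_imp > shadow_imp.max())[0]
print("selected feature indices:", selected)
```

    The shadows act as a data-driven noise baseline, so the selection threshold adapts to the dataset instead of being a fixed importance cutoff.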