    Training Support Vector Machines Using Frank-Wolfe Optimization Methods

    Training a Support Vector Machine (SVM) requires solving a quadratic programming (QP) problem whose computational cost becomes prohibitive for large-scale datasets. Traditional optimization methods cannot be directly applied in these cases, mainly due to memory restrictions. By adopting a slightly different objective function and under mild conditions on the kernel used within the model, efficient algorithms to train SVMs have been devised under the name of Core Vector Machines (CVMs). This framework exploits the equivalence of the resulting learning problem with the task of computing a Minimal Enclosing Ball (MEB) in a feature space, where data is implicitly embedded by a kernel function. In this paper, we improve on the CVM approach by proposing two novel methods to build SVMs based on the Frank-Wolfe algorithm, recently revisited as a fast method to approximate the solution of a MEB problem. In contrast to CVMs, our algorithms do not require solving a sequence of increasingly complex QPs and are defined using only analytic optimization steps. Experiments on a large collection of datasets show that our methods scale better than CVMs in most cases, sometimes at the price of a slightly lower accuracy. Like CVMs, the proposed methods can be easily extended to machine learning problems other than binary classification. Unlike CVMs, however, they also yield effective classifiers with kernels that do not satisfy the condition required by CVMs, and can thus be used for a wider set of problems.
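
As an illustration of the "analytic optimization steps" mentioned in the abstract above, the following is a minimal sketch of a basic Frank-Wolfe iteration on the standard MEB dual, maximize sum_i alpha_i K_ii - alpha' K alpha over the simplex, with an exact line search. It is not the authors' implementation: the function name `frank_wolfe_meb`, the starting point, the stopping rule and the toy data are all assumptions made for this example.

```python
import numpy as np

def frank_wolfe_meb(K, max_iter=1000, tol=1e-8):
    """Basic Frank-Wolfe on the MEB dual (hypothetical helper, not from the paper).

    Maximizes phi(alpha) = alpha . diag(K) - alpha' K alpha over the simplex.
    Returns the weights alpha and the squared radius phi(alpha).
    """
    n = K.shape[0]
    diag = np.diag(K).copy()
    alpha = np.zeros(n)
    alpha[0] = 1.0                      # assumption: start at the first vertex
    for _ in range(max_iter):
        grad = diag - 2.0 * (K @ alpha)
        i = int(np.argmax(grad))        # vertex with the steepest ascent
        d = -alpha.copy()
        d[i] += 1.0                     # direction e_i - alpha
        gap = float(grad @ d)           # Frank-Wolfe duality gap
        if gap <= tol:
            break
        curv = float(d @ K @ d)
        # Exact line search for a concave quadratic, clipped to [0, 1].
        step = 1.0 if curv <= 0.0 else min(1.0, gap / (2.0 * curv))
        alpha += step * d
    radius_sq = float(alpha @ diag - alpha @ K @ alpha)
    return alpha, radius_sq

# Toy data with a linear kernel: three points in the plane.
X = np.array([[-1.0, 0.0], [1.0, 0.0], [0.0, 0.5]])
K = X @ X.T
alpha, radius_sq = frank_wolfe_meb(K)
center = alpha @ X                      # ball center in input space
```

Each iteration needs only one kernel column's worth of information and a closed-form step size, which is the sense in which no inner QP solver is required.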

    A Novel Frank-Wolfe Algorithm. Analysis and Applications to Large-Scale SVM Training

    Recently, there has been renewed interest in the machine learning community in variants of a sparse greedy approximation procedure for concave optimization known as the Frank-Wolfe (FW) method. In particular, this procedure has been successfully applied to train large-scale instances of non-linear Support Vector Machines (SVMs). Specializing FW to SVM training has yielded not only efficient algorithms but also important theoretical results, including convergence analyses of training algorithms and new characterizations of model sparsity. In this paper, we present and analyze a novel variant of the FW method based on a new way to perform away steps, a classic strategy used to accelerate the convergence of the basic FW procedure. Our formulation and analysis focus on a general concave maximization problem on the simplex. However, the specialization of our algorithm to quadratic forms is strongly related to some classic methods in computational geometry, namely the Gilbert and MDM algorithms. On the theoretical side, we demonstrate that the method matches the guarantees in terms of convergence rate and number of iterations obtained by using classic away steps. In particular, the method enjoys a linear rate of convergence, a result that has recently been proved for MDM on quadratic forms. On the practical side, we provide experiments on several classification datasets and evaluate the results using statistical tests. Experiments show that our method is faster than the FW method with classic away steps, and works well even in cases where classic away steps slow down the algorithm. Furthermore, these improvements are obtained without sacrificing the predictive accuracy of the resulting SVM model. Comment: Revised version (October 2013). Title and abstract have been revised. Section 5 was added. Some proofs have been summarized (full-length proofs are available in the previous version).

    Learning Discriminative Bayesian Networks from High-dimensional Continuous Neuroimaging Data

    Due to their causal semantics, Bayesian networks (BNs) have been widely employed to discover underlying relationships in data in exploratory studies, such as brain research. Despite its success in modeling the probability distribution of variables, a BN is by nature a generative model, which is not necessarily discriminative. As a result, it may overlook subtle but critical network changes that are worth investigating across populations. In this paper, we propose to improve the discriminative power of BN models for continuous variables from two different perspectives, which lead to two general discriminative learning frameworks for Gaussian Bayesian networks (GBNs). In the first framework, we employ the Fisher kernel to bridge the generative GBN models and the discriminative SVM classifiers, and convert GBN parameter learning into Fisher kernel learning by minimizing a generalization error bound of SVMs. In the second framework, we employ the max-margin criterion and build it directly upon GBN models to explicitly optimize the classification performance of the GBNs. The advantages and disadvantages of the two frameworks are discussed and experimentally compared. Both demonstrate strong power in learning discriminative GBN parameters for neuroimaging-based brain network analysis, while maintaining reasonable representation capacity. The contributions of this paper also include a new Directed Acyclic Graph (DAG) constraint with a theoretical guarantee to ensure the graph validity of the GBN. Comment: 16 pages and 5 figures for the article (excluding appendix).

    Geometric Approach to Support Vector Machines Learning for Large Datasets

    The dissertation introduces Sphere Support Vector Machines (SphereSVM) and Minimal Norm Support Vector Machines (MNSVM) as new fast classification algorithms that use geometrical properties of the underlying classification problems to efficiently obtain models describing training data. SphereSVM combines the minimal enclosing ball approach, state-of-the-art nearest point problem solvers, and probabilistic techniques. Blending the three significantly speeds up the training phase of SVMs and reaches practically the same accuracy as other classification models on several large real datasets within the strict validation framework of double (nested) cross-validation (CV). MNSVM is a further simplification of the SphereSVM algorithm. Here, a relatively complex classification task is converted into one of the simplest geometrical problems, the minimal norm problem. This results in additional speedup compared to SphereSVM. The results promote both SphereSVM and MNSVM as strong alternatives for handling large and ultra-large datasets in a reasonable time without resorting to the various parallelization schemes recently proposed for SVM algorithms. Variants of both algorithms that work without an explicit bias term are also presented. In addition, other techniques aimed at improving time efficiency are discussed (such as over-relaxation and an improved support vector selection scheme). Finally, the accuracy and performance of all these modifications are carefully analyzed, and results based on the nested cross-validation procedure are shown.

    Definition and learning of logic-based kernels for categorical data, and application to collaborative filtering

    The continuous pursuit of better prediction quality has gradually led to the development of increasingly complex machine learning models, e.g., deep neural networks. Despite their great success in many domains, the black-box nature of these models makes them unsuitable for applications in which understanding the model is at least as important as prediction accuracy, such as medical applications. On the other hand, more interpretable models, such as decision trees, are in general much less accurate. In this thesis, we try to merge the positive aspects of these two worlds by injecting interpretable elements into complex methods. We focus on kernel methods, which offer an elegant framework that decouples learning algorithms from data representations. In particular, the first main contribution of this thesis is the proposal of a new family of Boolean kernels, i.e., kernels defined on binary data, with the aim of creating interpretable feature spaces. Assuming binary input vectors, the core idea is to build embedding spaces in which the dimensions represent logical formulas (of a specific form) of the input variables. As a result, the solution of a kernel machine can be represented as a weighted sum of logical propositions, from which human-readable rules can be extracted. Our framework provides a constructive and efficient way to calculate Boolean kernels of different forms (e.g., disjunctive, conjunctive, DNF, CNF). We show that on binary classification tasks over categorical datasets the proposed kernels achieve state-of-the-art performance. We also provide some theoretical properties regarding the expressiveness of such kernels. The second main contribution is the development of a new multiple kernel learning (MKL) algorithm to automatically learn the best representation (avoiding validation).
We start from a theoretical result which states that, under mild conditions, any dot-product kernel can be seen as a non-negative linear combination of Boolean conjunctive kernels. Then, our MKL algorithm learns non-parametrically the best combination of the conjunctive kernels. This algorithm is designed to optimize the radius-margin ratio of the combined kernel, which has been shown to be an upper bound of the Leave-One-Out error. An extensive empirical evaluation on several binary classification tasks shows that our MKL technique is able to outperform state-of-the-art MKL approaches. A third contribution is the proposal of another kernel family for binary input data, which aims to overcome the limitations of the Boolean kernels. In this case the focus is not exclusively on interpretability, but also on expressivity. With this new framework, which we dub the propositional kernel framework, it is possible to build kernel functions able to create feature spaces containing almost any kind of logical proposition. Finally, the last contribution is the application of the Boolean kernels to Recommender Systems, specifically to top-N recommendation tasks. First, we propose a novel kernel-based collaborative filtering method and apply our Boolean kernels on top of it. Empirical results on several collaborative filtering datasets show that less expressive kernels can alleviate the sparsity issue, which is typical of this kind of application.
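
To make the idea of a Boolean conjunctive kernel concrete, here is a small sketch under a stated assumption: the monotone conjunctive kernel of arity d counts the conjunctions of d distinct input variables that are true in both binary vectors, which equals the binomial coefficient of their dot product. The abstract does not give this formula, and the function names below are hypothetical; the brute-force version is included only to check the closed form.

```python
from itertools import combinations
from math import comb

def conjunctive_kernel(x, z, d):
    """Closed form (assumed): choose d variables among those active in both x and z."""
    dot = sum(a & b for a, b in zip(x, z))
    return comb(dot, d)

def conjunctive_kernel_explicit(x, z, d):
    """Brute force: enumerate every conjunction of d distinct variables."""
    count = 0
    for idx in combinations(range(len(x)), d):
        if all(x[i] for i in idx) and all(z[i] for i in idx):
            count += 1
    return count

x = [1, 1, 0, 1, 1]
z = [1, 1, 1, 1, 0]
k_fast = conjunctive_kernel(x, z, 2)
k_slow = conjunctive_kernel_explicit(x, z, 2)
```

The closed form never builds the embedding explicitly, which is why such kernels can be evaluated efficiently even though the feature space has one dimension per logical formula.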

    Design of Machine Learning Algorithms with Applications to Breast Cancer Detection

    Machine learning is concerned with the design and development of algorithms and techniques that allow computers to 'learn' from experience with respect to some class of tasks and performance measure. One application of machine learning is to improve the accuracy and efficiency of computer-aided diagnosis systems that assist physicians, radiologists, cardiologists, neuroscientists, and health-care technologists. This thesis focuses on machine learning and its applications to breast cancer detection. Emphasis is laid on preprocessing of features, pattern classification, and model selection. Before the classification task, feature selection and feature transformation may be performed to reduce the dimensionality of the features and to improve the classification performance. A genetic algorithm (GA) can be employed for feature selection based on different measures of data separability or the estimated risk of a chosen classifier. A separate nonlinear transformation can be performed by applying kernel principal component analysis and kernel partial least squares. Different classifiers are proposed in this work. The SOM-RBF network combines self-organizing maps (SOMs) and radial basis function (RBF) networks, with the RBF centers set as the weight vectors of neurons from the competitive layer of a trained SOM. The pairwise Rayleigh quotient (PRQ) classifier seeks one discriminating boundary by maximizing an unconstrained optimization objective, named the PRQ criterion, formed with a set of pairwise constraints instead of individual training samples. The strict 2-surface proximal (S2SP) classifier seeks two proximal planes, not necessarily parallel, to fit the distribution of the samples in the original feature space or a kernel-defined feature space, by maximizing two strict optimization objectives with a 'square of sum' optimization factor.
Two variations of the support vector data description (SVDD) with negative samples (NSVDD) are proposed by involving different forms of slack vectors, which learn a closed spherically shaped boundary, named the supervised compact hypersphere (SCH), around a set of samples in the target class. We extend the NSVDDs to solve multi-class classification problems based on distances between the samples and the centers of the learned SCHs in a kernel-defined feature space, using a combination of linear discriminant analysis and the nearest-neighbor rule. The problem of model selection is studied to pick the best values of the hyperparameters for a parametric classifier. To choose the optimal kernel or regularization parameters of a classifier, we investigate different criteria, such as the validation error estimate and the leave-one-out bound, as well as different optimization methods, such as grid search, gradient descent, and GA. By viewing the tuning problem of the multiple parameters of a 2-norm support vector machine (SVM) as an identification problem of a nonlinear dynamic system, we design a tuning system by employing the extended Kalman filter based on cross-validation. Independent kernel optimization based on different measures of data separability is also investigated for different kernel-based classifiers. Numerous computer experiments using benchmark datasets verify the theoretical results and compare the techniques in terms of classification accuracy or area under the receiver operating characteristic curve. Computational requirements, such as the computing time and the number of hyperparameters, are also discussed. All of the presented methods are applied to breast cancer detection from fine-needle aspiration and in mammograms, as well as to screening of knee-joint vibroarthrographic signals and automatic monitoring of roller bearings with vibration signals.
Experimental results demonstrate that these methods improve classification performance. For breast cancer detection, instead of only providing a binary diagnostic decision of 'malignant' or 'benign', we propose methods to assign a measure of confidence of malignancy to an individual mass, by calculating the probabilities of it being benign and malignant with a single classifier or a set of classifiers.

    Weakly supervised learning via statistical sufficiency

    The thesis introduces a novel algorithmic framework for weakly supervised learning, namely, for any problem between supervised and unsupervised learning from the standpoint of the labels. Weak supervision is the reality in many applications of machine learning where training is performed with partially missing, aggregated-level and/or noisy labels. The approach is grounded on the concept of statistical sufficiency and its transposition to loss functions. Our solution is problem-agnostic yet constructive, as it boils down to a simple two-step procedure. First, estimate a sufficient statistic for the labels from weak supervision. Second, plug the estimate into a (newly defined) linear-odd loss function and learn the model by any gradient-based solver, with a simple adaptation. We apply the same approach to several challenging learning problems: (i) learning from label proportions, (ii) learning with noisy labels for both linear classifiers and deep neural networks, and (iii) learning from feature-wise distributed datasets where the entity matching function is unknown.

    Ocean Remote Sensing with Synthetic Aperture Radar

    The ocean covers approximately 71% of the Earth’s surface and 90% of the biosphere, and contains 97% of Earth’s water. Synthetic Aperture Radar (SAR) can image the ocean surface in all weather conditions, day or night. SAR remote sensing for ocean and coastal monitoring has become a research hotspot in geoscience and remote sensing. This book, Progress in SAR Oceanography, provides an update on the current state of the science of ocean remote sensing with SAR. Overall, the book presents a variety of marine applications, such as oceanic surface and internal waves, wind, bathymetry, oil spills, coastline and intertidal zone classification, detection of ships and other man-made objects, as well as remotely sensed data assimilation. The book is aimed at a wide audience, ranging from graduate students, university teachers and working scientists to policy makers and managers. Efforts have been made to highlight general principles as well as the state-of-the-art technologies in the field of SAR oceanography.

    Mineral identification using data-mining in hyperspectral infrared imagery

    Applications of infrared imaging in geology are mainly hyperspectral. Among other things, they enable mineral identification, mapping, and range estimation. Most often, these acquisitions are made in situ, using either airborne sensors or portable devices. The discovery of indicator minerals has greatly improved mineral exploration, owing in part to the use of portable instruments. In this context, the development of automated systems would increase both the quality of exploration and the precision with which indicators are detected. The work carried out in this doctorate falls within this framework. The subject was the use of machine learning methods applied to the analysis (processing) of hyperspectral images acquired at infrared wavelengths. The objective was the identification of small mineral grains used as mineralogical indicators. A potential application of this research would be the development of a software tool to assist in the analysis of samples during mineral exploration. The experiments were conducted in the laboratory in the thermal infrared range (Long Wave InfraRed, LWIR), from 7.7 μm to 11.8 μm. These tests led to a proposed method for computing continuum removal. The method used in these tests relies on non-negative matrix factorization (NMF). Using a rank-1 factorization, the down-welling radiance can be deduced and then compared and analyzed against other, more common methods. Analysis of the spectral results against several existing spectral libraries demonstrated the suppression of the continuum.
The experiments leading to this result were conducted using an Infragold plate and an LWIR macro lens. Automatic identification of grains of different materials, such as pyrope, olivine and quartz, was then undertaken. In a comparison between supervised and unsupervised approaches, the latter proved more appropriate because its behavior is independent of the training stage. To confirm the quality of these results, four experiments were carried out. In the first experiment, two algorithms were evaluated for clustering using the False Colour Composite (FCC) approach. This test showed a convergence speed up to twenty times faster, as well as significantly improved identification performance compared with results in the literature. However, tests on LWIR data showed poor prediction of the grain surface when the grains were irregular and mineral aggregates were present. The second experiment consisted of a comparative quantitative analysis between two Ground Truth (GT) databases, named rigid-GT and observed-GT (rigid-GT: manual labeling of the region; observed-GT: manual labeling of the pixels). The accuracy of the results was 1.5 times better with the observed-GT database than with rigid-GT. For the last two experiments, data from a scanning electron microscope (SEM) and from an X-ray fluorescence (XRF) microscope were added. These data introduced information on both the mineral aggregates and the grain surfaces. The results were compared with automatic mineral identification techniques using ArcGIS. The latter showed promising performance for automatic identification and was also used for GT validation.
Overall, the four methods of this thesis are beneficial methodologies for mineral identification. They have the advantage of being non-destructive, relatively accurate, and computationally inexpensive, which could qualify them for use in laboratory conditions or in the field.

    The geological applications of hyperspectral infrared imagery mainly consist of mineral identification, mapping, airborne or portable instruments, and core logging. Finding mineral indicators offers considerable benefits for mineralogy and mineral exploration, which usually involve the use of portable instruments and core logging. Moreover, the development of faster and more mechanized systems increases the precision of identifying mineral indicators and avoids possible misclassification. Therefore, the objective of this thesis was to create a tool that uses hyperspectral infrared imagery and processes the data through image analysis and machine learning methods to identify small mineral grains used as mineral indicators. This system could be applied in different circumstances to assist geological analysis and mineral exploration. The experiments were conducted in laboratory conditions in the long-wave infrared (LWIR, 7.7 μm to 11.8 μm), with an LWIR macro lens (to improve spatial resolution), an Infragold plate, and a heating source. The process began with a method to calculate the continuum removal. The approach applies Non-negative Matrix Factorization (NMF) to extract a rank-1 NMF and estimate the down-welling radiance, and then compares it with other conventional methods. The results indicate successful suppression of the continuum from the spectra and enable the spectra to be compared with spectral libraries. Afterwards, to obtain an automated system, supervised and unsupervised approaches were tested for the identification of pyrope, olivine and quartz grains.
The results indicated that the unsupervised approach was more suitable because its behavior is independent of the training stage. Once these results were obtained, two algorithms were tested to create False Color Composites (FCC) using a clustering approach. The results of this comparison indicate significant computational efficiency (more than 20 times faster) and promising performance for mineral identification. Finally, the reliability of the automated LWIR hyperspectral mineral identification was tested, and the difficulty of identifying irregular grain surfaces and mineral aggregates was verified. The results were compared with two different Ground Truths (GT), rigid-GT and observed-GT, for quantitative evaluation. Observed-GT increased the accuracy up to 1.5 times that of rigid-GT. The samples were also examined by micro X-ray fluorescence (XRF) and scanning electron microscopy (SEM) to retrieve information on the mineral aggregates and the grain surfaces (biotite, epidote, goethite, diopside, smithsonite, tourmaline, kyanite, scheelite, pyrope, olivine, and quartz). The results of XRF imagery were compared with automatic mineral identification techniques using ArcGIS; they showed promising performance for automatic identification and were used for GT validation. Overall, the four methods in this thesis (1. continuum removal methods; 2. classification or clustering methods for mineral identification; 3. two algorithms for clustering of mineral spectra; 4. reliability verification) are beneficial methodologies for identifying minerals. These methods have the advantages of being non-destructive, relatively accurate, and of low computational complexity, and they might be used to identify and assess mineral grains in laboratory conditions or in the field.
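
The rank-1 NMF step mentioned in the abstract can be sketched as follows. This is only an illustration of a rank-1 non-negative factorization X ≈ w h' computed with multiplicative-style alternating updates; the abstract does not say how the resulting factor is turned into a down-welling radiance estimate, so that step is left out. The function name `rank1_nmf`, the initialization and the synthetic matrix are assumptions made for this example.

```python
import numpy as np

def rank1_nmf(X, n_iter=200, eps=1e-12):
    """Rank-1 NMF of a non-negative matrix X (hypothetical helper).

    For rank 1 the classic multiplicative updates reduce to the
    alternating closed-form updates used here; both factors stay
    non-negative as long as X is non-negative.
    """
    X = np.asarray(X, dtype=float)
    w = X.mean(axis=1) + eps            # assumption: positive starting point
    for _ in range(n_iter):
        h = X.T @ w / (w @ w + eps)     # best h for the current w
        w = X @ h / (h @ h + eps)       # best w for the current h
    return w, h

# Synthetic check: rows play the role of spectra sharing one common shape.
a = np.array([1.0, 2.0, 3.0])           # per-spectrum scale (made up)
b = np.array([0.5, 1.0, 2.0, 4.0])      # shared spectral shape (made up)
X = np.outer(a, b)
w, h = rank1_nmf(X)
reconstruction = np.outer(w, h)
```

On real spectra the matrix is not exactly rank 1, so the reconstruction only captures the dominant shared component; whether that component is a good continuum estimate is a question the thesis addresses, not this sketch.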