    Unsupervised Structural Embedding Methods for Efficient Collective Network Mining

    How can we align accounts of the same user across social networks? Can we identify the professional role of an email user from their patterns of communication? Can we predict the medical effects of chemical compounds from their atomic network structure? Many problems in graph data mining, including all of the above, are defined on multiple networks. The central element to all of these problems is cross-network comparison, whether at the level of individual nodes or entities in the network or at the level of entire networks themselves. To perform this comparison meaningfully, we must describe the entities in each network expressively in terms of patterns that generalize across the networks. Moreover, because the networks in question are often very large, our techniques must be computationally efficient. In this thesis, we propose scalable unsupervised methods that embed nodes in vector space by mapping nodes with similar structural roles in their respective networks, even if they come from different networks, to similar parts of the embedding space. We perform network alignment by matching nodes across two or more networks based on the similarity of their embeddings, and refine this process by reinforcing the consistency of each node’s alignment with those of its neighbors. By characterizing the distribution of node embeddings in a graph, we develop graph-level feature vectors that are highly effective for graph classification. With principled sparsification and randomized approximation techniques, we make all our methods computationally efficient and able to scale to graphs with millions of nodes or edges. We demonstrate the effectiveness of structural node embeddings on industry-scale applications, and propose an extensive set of embedding evaluation techniques that lay the groundwork for further methodological development and application.
    PhD thesis, Computer Science & Engineering, University of Michigan, Horace H. Rackham School of Graduate Studies. http://deepblue.lib.umich.edu/bitstream/2027.42/162895/1/mheimann_1.pd
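    As a hedged illustration of the cross-network idea only (this is a generic sketch, not the thesis's embedding or alignment algorithms; the degree-based features and the toy graphs are assumptions made for the example), the snippet below describes every node by a few structural statistics of its neighborhood and aligns two graphs by nearest-neighbor matching of those descriptors.

```python
# A toy illustration only: describe each node by simple structural features
# (its degree plus the mean and max degree of its neighbors), then align two
# graphs by matching every node of G1 to the node of G2 with the closest vector.
import numpy as np
import networkx as nx
from scipy.spatial.distance import cdist

def structural_features(G):
    """Return (node list, feature matrix) of per-node structural descriptors."""
    nodes = list(G.nodes())
    rows = []
    for v in nodes:
        nbr_degs = [G.degree(u) for u in G.neighbors(v)] or [0]
        rows.append([G.degree(v), np.mean(nbr_degs), np.max(nbr_degs)])
    return nodes, np.array(rows, dtype=float)

def align(G1, G2):
    """Greedy alignment by nearest-neighbor search in structural feature space."""
    n1, X1 = structural_features(G1)
    n2, X2 = structural_features(G2)
    D = cdist(X1, X2)                                    # pairwise Euclidean distances
    return {n1[i]: n2[j] for i, j in enumerate(D.argmin(axis=1))}

if __name__ == "__main__":
    G = nx.karate_club_graph()
    H = nx.relabel_nodes(G, {v: f"n{v}" for v in G.nodes()})   # same graph, renamed nodes
    matching = align(G, H)
    recovered = sum(matching[v] == f"n{v}" for v in G.nodes())
    print(f"{recovered} of {G.number_of_nodes()} nodes matched to their true counterpart")
```

    Because the descriptors depend only on local structure and never on node identities, the comparison stays meaningful across disjoint networks, which is the property the thesis builds on with far richer embeddings and neighborhood-consistency refinement.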

    Large Scale Kernel Methods for Fun and Profit

    Kernel methods are among the most flexible classes of machine learning models with strong theoretical guarantees. Wide classes of functions can be approximated arbitrarily well with kernels, while fast convergence and learning rates have been formally shown to hold. Exact kernel methods are known to scale poorly with increasing dataset size, and we believe that one of the factors limiting their usage in modern machine learning is the lack of scalable and easy-to-use algorithms and software. The main goal of this thesis is to study kernel methods from the point of view of efficient learning, with particular emphasis on large-scale data, but also on low-latency training and user efficiency. We improve the state of the art for scaling kernel solvers to datasets with billions of points using the Falkon algorithm, which combines random projections with fast optimization. Running it on GPUs, we show how to fully utilize available computing power for training kernel machines. To boost the ease of use of approximate kernel solvers, we propose an algorithm for automated hyperparameter tuning. By minimizing a penalized loss function, a model can be learned together with its hyperparameters, reducing the time needed for user-driven experimentation. In the setting of multi-class learning, we show that, under stringent but realistic assumptions on the separation between classes, a wide set of algorithms needs far fewer data points than in the more general setting (without assumptions on class separation) to reach the same accuracy. The first part of the thesis develops a framework for efficient and scalable kernel machines. This raises the question of whether our approaches can be used successfully in real-world applications, especially compared to alternatives based on deep learning, which are often deemed hard to beat. The second part aims to investigate this question on two main applications, chosen because of the paramount importance of having an efficient algorithm. First, we consider the problem of instance segmentation of images taken from the iCub robot. Here Falkon is used as part of a larger pipeline, but the efficiency afforded by our solver is essential to ensure smooth human-robot interactions. In the second instance, we consider time-series forecasting of wind speed, analysing the relevance of different physical variables to the predictions themselves. We investigate different schemes to adapt i.i.d. learning to the time-series setting. Overall, this work aims to demonstrate, through novel algorithms and examples, that kernel methods are up to computationally demanding tasks, and that there are concrete applications in which their use is warranted and more efficient than that of other, more complex, and less theoretically grounded models.
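    As a sketch of the approximation that makes such scaling possible (plain Nyström kernel ridge regression in NumPy, not the Falkon solver itself; the Gaussian kernel, the number of centers m, and the toy data are assumptions made for illustration), restricting the solution to m randomly chosen centers shrinks the linear system from n x n to m x m.

```python
# Plain Nystrom kernel ridge regression: an assumption-laden stand-in for Falkon,
# which additionally uses preconditioned conjugate gradients and GPU computation.
import numpy as np

def gaussian_kernel(A, B, sigma):
    """Gaussian (RBF) kernel matrix between the rows of A and the rows of B."""
    d2 = (A**2).sum(1)[:, None] + (B**2).sum(1)[None, :] - 2.0 * A @ B.T
    return np.exp(-d2 / (2.0 * sigma**2))

def nystrom_krr(X, y, m=100, lam=1e-3, sigma=1.0, seed=0):
    """Fit kernel ridge regression restricted to m random Nystrom centers.
    Solves (Knm^T Knm + n * lam * Kmm) alpha = Knm^T y, an m x m system."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    centers = X[rng.choice(n, size=min(m, n), replace=False)]
    Knm = gaussian_kernel(X, centers, sigma)
    Kmm = gaussian_kernel(centers, centers, sigma)
    A = Knm.T @ Knm + n * lam * Kmm + 1e-10 * np.eye(len(centers))  # small jitter for stability
    alpha = np.linalg.solve(A, Knm.T @ y)
    return lambda Xnew: gaussian_kernel(Xnew, centers, sigma) @ alpha

# toy usage: fit a noisy sine curve with 50 centers instead of 2000 kernel rows
X = np.linspace(0.0, 6.0, 2000)[:, None]
y = np.sin(X[:, 0]) + 0.1 * np.random.default_rng(1).standard_normal(2000)
predict = nystrom_krr(X, y, m=50, lam=1e-4, sigma=0.5)
print("train MSE:", np.mean((predict(X) - y) ** 2))
```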

    PSSA: PCA-domain superpixelwise singular spectral analysis for unsupervised hyperspectral image classification.

    Although supervised classification of hyperspectral images (HSI) has achieved success in remote sensing, its applications in real scenarios are often constrained, mainly due to insufficient or entirely missing labelled data. As a result, unsupervised HSI classification based on data clustering is highly desired, yet it generally suffers from high computational cost and low classification accuracy, especially in large datasets. To tackle these challenges, a novel unsupervised spatial-spectral HSI classification method is proposed. By combining entropy rate superpixel segmentation (ERS), superpixel-based principal component analysis (PCA), and PCA-domain 2D singular spectral analysis (SSA), both the efficacy and the efficiency of feature extraction are improved, followed by anchor-based graph clustering (AGC) for effective classification. Experiments on three publicly available and five self-collected aerial HSI datasets have fully demonstrated the efficacy of the proposed PCA-domain superpixelwise SSA (PSSA) method, with a gain of 15–20% in overall accuracy compared to several state-of-the-art methods. In addition, as an extra outcome, the HSI dataset we acquired is provided freely online.
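    The overall pipeline can be pictured with a heavily simplified stand-in (assumptions made for illustration only: SLIC superpixels replace ERS, plain PCA on per-superpixel mean spectra replaces the superpixelwise PCA plus 2D SSA feature extraction, k-means replaces anchor-based graph clustering, and the hyperspectral cube is synthetic).

```python
# Heavily simplified stand-in for the PSSA pipeline on a synthetic HSI cube.
import numpy as np
from skimage.segmentation import slic
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
cube = rng.random((64, 64, 30))                       # toy HSI cube: rows x cols x bands

# 1) spatial over-segmentation into superpixels (ERS in the paper; SLIC here)
labels = slic(cube, n_segments=200, compactness=0.1, channel_axis=-1)

# 2) one mean spectrum per superpixel, then spectral dimensionality reduction
uniq = np.unique(labels)
spectra = np.stack([cube[labels == s].mean(axis=0) for s in uniq])
feats = PCA(n_components=5).fit_transform(spectra)

# 3) unsupervised classification of the superpixels, mapped back to the pixel grid
clusters = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(feats)
classmap = clusters[np.searchsorted(uniq, labels)]
print(classmap.shape, np.unique(classmap))
```

    Working at the superpixel rather than the pixel level is what keeps the clustering step small; the paper's feature-extraction and clustering choices then determine the accuracy.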

    Efficient second-order online kernel learning with adaptive embedding

    Online kernel learning (OKL) is a flexible framework to approach prediction problems, since the large approximation space provided by reproducing kernel Hilbert spaces can contain an accurate function for the problem. Nonetheless, optimizing over this space is computationally expensive. Not only do first-order methods accumulate O(√T) more loss than the optimal function, but the curse of kernelization also results in O(t) per-step complexity. Second-order methods get closer to the optimum much faster, suffering only O(log T) regret, but second-order updates are even more expensive, with an O(t²) per-step cost. Existing approximate OKL methods try to reduce this complexity either by limiting the Support Vectors (SV) introduced in the predictor, or by avoiding the kernelization process altogether using an embedding. Nonetheless, as long as the size of the approximation space or the number of SVs does not grow over time, an adversary can always exploit the approximation process. In this paper, we propose PROS-N-KONS, a method that combines Nyström sketching, which projects the input points into a small, accurate embedded space, with efficient second-order updates in that space. The embedded space is continuously updated to guarantee that the embedding remains accurate, and we show that the per-step cost grows only with the effective dimension of the problem and not with T. Moreover, the second-order updates allow us to achieve logarithmic regret. We empirically compare our algorithm on recent large-scale benchmarks and show that it performs favorably.
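    The flavour of the approach can be conveyed with a minimal sketch (not PROS-N-KONS itself: the embedding below is a fixed random Fourier feature map rather than an adaptively updated Nyström dictionary, and a rank-one Sherman-Morrison update stands in for the paper's second-order step). Once the data live in a d-dimensional embedded space, each second-order update costs O(d²) regardless of how many points have been seen.

```python
# Minimal second-order online learner in a *fixed* random-feature embedding.
import numpy as np

class SecondOrderOnlineRegressor:
    """Online ridge regression with rank-one Sherman-Morrison updates.
    Each step costs O(d^2) in the embedding dimension d, independent of t."""
    def __init__(self, dim_in, d=200, sigma=1.0, lam=1.0, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.normal(scale=1.0 / sigma, size=(dim_in, d))   # random Fourier frequencies
        self.b = rng.uniform(0.0, 2.0 * np.pi, d)
        self.Ainv = np.eye(d) / lam                                # inverse regularized covariance
        self.theta = np.zeros(d)

    def _embed(self, x):
        return np.sqrt(2.0 / len(self.b)) * np.cos(x @ self.W + self.b)

    def predict(self, x):
        return self._embed(x) @ self.theta

    def update(self, x, y):
        phi = self._embed(x)
        Aphi = self.Ainv @ phi
        self.Ainv -= np.outer(Aphi, Aphi) / (1.0 + phi @ Aphi)     # Sherman-Morrison rank-one update
        self.theta += self.Ainv @ phi * (y - phi @ self.theta)     # second-order correction

# toy streaming usage
rng = np.random.default_rng(1)
model = SecondOrderOnlineRegressor(dim_in=2, d=200, sigma=0.5)
for _ in range(2000):
    x = rng.uniform(-1.0, 1.0, 2)
    model.update(x, np.sin(3 * x[0]) * np.cos(3 * x[1]))
x_test = np.array([0.3, -0.2])
print("abs error on a fresh point:", abs(model.predict(x_test) - np.sin(0.9) * np.cos(-0.6)))
```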

    A machine learning approach to the unsupervised segmentation of mitochondria in subcellular electron microscopy data

    Recent advances in cellular and subcellular microscopy have demonstrated their potential for unravelling the mechanisms of various diseases at the molecular level. The biggest challenge in both human- and computer-based visual analysis of micrographs is the variety of nanostructures and mitochondrial morphologies. The state of the art is, however, dominated by manual data annotation, and early attempts to automate the segmentation process were based on supervised machine learning techniques, which require large datasets for training. Given a minimal number of training sequences or none at all, unsupervised machine learning formulations, such as spectral dimensionality reduction, are known to be superior at detecting salient image structures. This thesis presents three major contributions developed around the spectral clustering framework, which has proven able to capture perceptual organization features. Firstly, we approach the problem of mitochondria localization. We propose a novel grouping method for the extracted line segments that describe the normal mitochondrial morphology. Experimental findings show that the clusters obtained successfully model the inner mitochondrial membrane folding and can therefore be used as markers for the subsequent segmentation approaches. Secondly, we develop an unsupervised mitochondria segmentation framework. This method mimics the ability of human vision to extrapolate salient membrane structures in a micrograph. Furthermore, we design robust non-parametric similarity models according to the Gestalt laws of visual segregation. Experiments demonstrate that such models automatically adapt to the statistical structure of the biological domain and deliver strong performance in pixel classification tasks under a wide variety of distributional assumptions. The last major contribution addresses the computational complexity of spectral clustering. Here, we introduce a new anticorrelation-based spectral clustering formulation with the objective of improving both the speed and the quality of segmentation. The experimental findings show the applicability of our dimensionality reduction algorithm to very large-scale problems as well as to asymmetric, dense and non-Euclidean datasets.
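    For readers unfamiliar with the underlying machinery, the following is a minimal normalized spectral clustering demo on a tiny synthetic image (not the thesis's anticorrelation-based formulation; the affinity built from intensity and pixel position, and all bandwidths, are assumptions chosen for illustration).

```python
# Minimal normalized spectral clustering of pixels in a tiny synthetic image.
import numpy as np
from scipy.linalg import eigh
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
img = np.zeros((20, 20))
img[5:15, 5:15] = 1.0                                   # a bright "organelle" on a dark background
img += 0.1 * rng.standard_normal(img.shape)

ys, xs = np.mgrid[0:img.shape[0], 0:img.shape[1]]
feats = np.stack([img.ravel(), 0.05 * ys.ravel(), 0.05 * xs.ravel()], axis=1)
d2 = ((feats[:, None, :] - feats[None, :, :]) ** 2).sum(-1)
W = np.exp(-d2 / (2.0 * 0.1 ** 2))                      # Gaussian affinity on intensity + position

Dm = np.diag(1.0 / np.sqrt(W.sum(axis=1)))
L = np.eye(len(W)) - Dm @ W @ Dm                        # symmetric normalized graph Laplacian
_, vecs = eigh(L, subset_by_index=[0, 1])               # two smallest eigenvectors
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(vecs)
print(labels.reshape(img.shape))
```

    The eigendecomposition of the dense affinity matrix is the expensive step; the thesis's last contribution targets exactly that cost.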

    Diffusion, methods and applications

    Unpublished doctoral thesis, read at the Universidad Autónoma de Madrid, Escuela Politécnica Superior, Departamento de Ingeniería Informática. Date of defence: June 2014.
    Big Data, an important problem nowadays, can be understood in terms of a very large number of patterns, a very large pattern dimension or, often, both. In this thesis, we will concentrate on the high dimensionality issue, applying manifold learning techniques for visualizing and analyzing such patterns. The core technique will be Diffusion Maps (DM) and its Anisotropic Diffusion (AD) version, introduced by Ronald R. Coifman and his school at Yale University, and of which we will give a complete, systematic, compact and self-contained treatment. This will be done after a brief survey of previous manifold learning methods. The algorithmic contributions of the thesis will be centered on two computational challenges of diffusion methods: the potentially high cost of the similarity matrix eigenanalysis that is needed to define the diffusion embedding coordinates, and the difficulty of computing this embedding over new patterns not available for the initial eigenanalysis. With respect to the first issue, we will show how the AD setup can be used to skip the eigenanalysis when looking for local models. In this case, local patterns will be selected through a k-Nearest Neighbors search using a properly defined local Mahalanobis distance that enables neighbors to be found in the latent variable space underlying the AD model while working directly with the observable patterns, thus avoiding the potentially costly similarity matrix eigenanalysis. The second proposed algorithm, which we will call Auto-adaptative Laplacian Pyramids (ALP), focuses on the out-of-sample embedding extension and consists of a modification of the classical Laplacian Pyramids (LP) method. In this new algorithm the LP iterations will be combined with an estimate of the Leave-One-Out CV error, which makes it possible to define, directly during training, a criterion to estimate the optimal stopping point of this iterative algorithm. This thesis will also present several application contributions to important problems in renewable energy and medical imaging. More precisely, we will show how DM is a good method for dimensionality reduction of meteorological weather predictions, providing tools to visualize and describe these data, as well as to cluster them in order to define local models. In turn, we will apply our AD-based localized search method first to find the location in the human body of CT scan images and then to predict wind energy ramps on both individual farms and over the whole of Spain. We will see that, in both cases, our results improve on the current state-of-the-art methods. Finally, we will compare our ALP proposal with the well-known Nyström method as well as with LP on two high-dimensional problems, the time compression of meteorological data and the analysis of meteorological variables relevant in daily radiation forecasts. In both cases we will show that ALP compares favorably with the other approaches for out-of-sample extension problems.
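    A compact sketch of the core Diffusion Maps construction follows (basic form only, without the anisotropic variant, the local Mahalanobis search, or the ALP out-of-sample extension discussed above; the Gaussian bandwidth and the toy data are assumptions): build a Gaussian affinity, row-normalize it into a Markov transition matrix, and use its leading non-trivial eigenvectors, scaled by their eigenvalues, as embedding coordinates.

```python
# Basic Diffusion Maps: Gaussian affinity -> row-stochastic diffusion operator ->
# leading non-trivial eigenvectors (scaled by eigenvalues) as embedding coordinates.
import numpy as np

def diffusion_maps(X, n_components=2, epsilon=1.0, t=1):
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    K = np.exp(-d2 / epsilon)
    P = K / K.sum(axis=1, keepdims=True)                # Markov transition matrix
    evals, evecs = np.linalg.eig(P)
    order = np.argsort(-evals.real)
    evals, evecs = evals.real[order], evecs.real[:, order]
    # drop the trivial constant eigenvector (eigenvalue 1) and scale by lambda^t
    return evecs[:, 1:n_components + 1] * evals[1:n_components + 1] ** t

# toy usage: a noisy circle is mapped onto coordinates that reflect its angle
theta = np.linspace(0.0, 2.0 * np.pi, 200, endpoint=False)
X = np.c_[np.cos(theta), np.sin(theta)]
X += 0.05 * np.random.default_rng(0).standard_normal(X.shape)
Y = diffusion_maps(X, n_components=2, epsilon=0.5)
print(Y.shape)
```

    The eigendecomposition of the dense operator is exactly the cost that the thesis's local-model approach avoids, and embedding a new point would require an out-of-sample extension such as Nyström or ALP.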