29 research outputs found

    Unsupervised Discretization by Two-dimensional MDL-based Histogram

    Full text link
    Unsupervised discretization is a crucial step in many knowledge discovery tasks. The state-of-the-art method for one-dimensional data infers locally adaptive histograms using the minimum description length (MDL) principle, but the multi-dimensional case is far less studied: current methods consider the dimensions one at a time (if not independently), which result in discretizations based on rectangular cells of adaptive size. Unfortunately, this approach is unable to adequately characterize dependencies among dimensions and/or results in discretizations consisting of more cells (or bins) than is desirable. To address this problem, we propose an expressive model class that allows for far more flexible partitions of two-dimensional data. We extend the state of the art for the one-dimensional case to obtain a model selection problem based on the normalised maximum likelihood, a form of refined MDL. As the flexibility of our model class comes at the cost of a vast search space, we introduce a heuristic algorithm, named PALM, which partitions each dimension alternately and then merges neighbouring regions, all using the MDL principle. Experiments on synthetic data show that PALM 1) accurately reveals ground truth partitions that are within the model class (i.e., the search space), given a large enough sample size; 2) approximates well a wide range of partitions outside the model class; 3) converges, in contrast to its closest competitor IPD; and 4) is self-adaptive with regard to both sample size and local density structure of the data despite being parameter-free. Finally, we apply our algorithm to two geographic datasets to demonstrate its real-world potential.Comment: 30 pages, 9 figure

    Non-parametric Methods for Correlation Analysis in Multivariate Data with Applications in Data Mining

    Get PDF
    In this thesis, we develop novel methods for correlation analysis in multivariate data, with a special focus on mining correlated subspaces. Our methods handle major open challenges arisen when combining correlation analysis with subspace mining. Besides traditional correlation analysis, we explore interaction-preserving discretization of multivariate data and causality analysis. We conduct experiments on a variety of real-world data sets. The results validate the benefits of our methods

    Distributed multi-label learning on Apache Spark

    Get PDF
    This thesis proposes a series of multi-label learning algorithms for classification and feature selection implemented on the Apache Spark distributed computing model. Five approaches for determining the optimal architecture to speed up multi-label learning methods are presented. These approaches range from local parallelization using threads to distributed computing using independent or shared memory spaces. It is shown that the optimal approach performs hundreds of times faster than the baseline method. Three distributed multi-label k nearest neighbors methods built on top of the Spark architecture are proposed: an exact iterative method that computes pair-wise distances, an approximate tree-based method that indexes the instances across multiple nodes, and an approximate local sensitive hashing method that builds multiple hash tables to index the data. The results indicated that the predictions of the tree-based method are on par with those of an exact method while reducing the execution times in all the scenarios. The aforementioned method is then used to evaluate the quality of a selected feature subset. The optimal adaptation for a multi-label feature selection criterion is discussed and two distributed feature selection methods for multi-label problems are proposed: a method that selects the feature subset that maximizes the Euclidean norm of individual information measures, and a method that selects the subset of features maximizing the geometric mean. The results indicate that each method excels in different scenarios depending on type of features and the number of labels. Rigorous experimental studies and statistical analyses over many multi-label metrics and datasets confirm that the proposals achieve better performances and provide better scalability to bigger data than the methods compared in the state of the art

    Image segmentation, evaluation, and applications

    Get PDF
    This thesis aims to advance research in image segmentation by developing robust techniques for evaluating image segmentation algorithms. The key contributions of this work are as follows. First, we investigate the characteristics of existing measures for supervised evaluation of automatic image segmentation algorithms. We show which of these measures is most effective at distinguishing perceptually accurate image segmentation from inaccurate segmentation. We then apply these measures to evaluating four state-of-the-art automatic image segmentation algorithms, and establish which best emulates human perceptual grouping. Second, we develop a complete framework for evaluating interactive segmentation algorithms by means of user experiments. Our system comprises evaluation measures, ground truth data, and implementation software. We validate our proposed measures by showing their correlation with perceived accuracy. We then use our framework to evaluate four popular interactive segmentation algorithms, and demonstrate their performance. Finally, acknowledging that user experiments are sometimes prohibitive in practice, we propose a method of evaluating interactive segmentation by algorithmically simulating the user interactions. We explore four strategies for this simulation, and demonstrate that the best of these produces results very similar to those from the user experiments

    Teak: A Novel Computational And Gui Software Pipeline For Reconstructing Biological Networks, Detecting Activated Biological Subnetworks, And Querying Biological Networks.

    Get PDF
    As high-throughput gene expression data becomes cheaper and cheaper, researchers are faced with a deluge of data from which biological insights need to be extracted and mined since the rate of data accumulation far exceeds the rate of data analysis. There is a need for computational frameworks to bridge the gap and assist researchers in their tasks. The Topology Enrichment Analysis frameworK (TEAK) is an open source GUI and software pipeline that seeks to be one of many tools that fills in this gap and consists of three major modules. The first module, the Gene Set Cultural Algorithm, de novo infers biological networks from gene sets using the KEGG pathways as prior knowledge. The second and third modules query against the KEGG pathways using molecular profiling data and query graphs, respectively. In particular, the second module, also called TEAK, is a network partitioning module that partitions the KEGG pathways into both linear and nonlinear subpathways. In conjunction with molecular profiling data, the subpathways are ranked and displayed to the user within the TEAK GUI. Using a public microarray yeast data set, previously unreported fitness defects for dpl1 delta and lag1 delta mutants under conditions of nitrogen limitation were found using TEAK. Finally, the third module, the Query Structure Enrichment Analysis framework, is a network query module that allows researchers to query their biological hypotheses in the form of Directed Acyclic Graphs against the KEGG pathways

    Latent Disentanglement for the Analysis and Generation of Digital Human Shapes

    Get PDF
    Analysing and generating digital human shapes is crucial for a wide variety of applications ranging from movie production to healthcare. The most common approaches for the analysis and generation of digital human shapes involve the creation of statistical shape models. At the heart of these techniques is the definition of a mapping between shapes and a low-dimensional representation. However, making these representations interpretable is still an open challenge. This thesis explores latent disentanglement as a powerful technique to make the latent space of geometric deep learning based statistical shape models more structured and interpretable. In particular, it introduces two novel techniques to disentangle the latent representation of variational autoencoders and generative adversarial networks with respect to the local shape attributes characterising the identity of the generated body and head meshes. This work was inspired by a shape completion framework that was proposed as a viable alternative to intraoperative registration in minimally invasive surgery of the liver. In addition, one of these methods for latent disentanglement was also applied to plastic surgery, where it was shown to improve the diagnosis of craniofacial syndromes and aid surgical planning

    Image Partitioning based on Semidefinite Programming

    Full text link
    Many tasks in computer vision lead to combinatorial optimization problems. Automatic image partitioning is one of the most important examples in this context: whether based on some prior knowledge or completely unsupervised, we wish to find coherent parts of the image. However, the inherent combinatorial complexity of such problems often prevents to find the global optimum in polynomial time. For this reason, various approaches have been proposed to find good approximative solutions for image partitioning problems. As an important example, we will first consider different spectral relaxation techniques: based on straightforward eigenvector calculations, these methods compute suboptimal solutions in short time. However, the main contribution of this thesis is to introduce a novel optimization technique for discrete image partitioning problems which is based on a semidefinite programming relaxation. In contrast to approximation methods employing annealing algorithms, this approach involves solving a convex optimization problem, which does not suffer from possible local minima. Using interior point techniques, the solution of the relaxation can be found in polynomial time, and without elaborate parameter tuning. High quality solutions to the original combinatorial problem are then obtained with a randomized rounding technique. The only potential drawback of the semidefinite relaxation approach is that the number of variables of the optimization problem is squared. Nevertheless, it can still be applied to problems with up to a few thousand variables, as is demonstrated for various computer vision tasks including unsupervised segmentation, perceptual grouping and image restoration. Concerning problems of higher dimensionality, we study two different approaches to effectively reduce the number of variables. The first one is based on probabilistic sampling: by considering only a small random fraction of the pixels in the image, our semidefinite relaxation method can be applied in an efficient way while maintaining a reliable quality of the resulting segmentations. The second approach reduces the problem size by computing an over-segmentation of the image in a preprocessing step. After that, the image is partitioned based on the resulting "superpixels" instead of the original pixels. Since the real world does not consist of pixels, it can even be argued that this is the more natural image representation. Initially, our semidefinite relaxation method is defined only for binary partitioning problems. To derive image segmentations into multiple parts, one possibility is to apply the binary approach in a hierarchical way. Besides this natural extension, we also discuss how multiclass partitioning problems can be solved in a direct way based on semidefinite relaxation techniques

    Shape Representations Using Nested Descriptors

    Get PDF
    The problem of shape representation is a core problem in computer vision. It can be argued that shape representation is the most central representational problem for computer vision, since unlike texture or color, shape alone can be used for perceptual tasks such as image matching, object detection and object categorization. This dissertation introduces a new shape representation called the nested descriptor. A nested descriptor represents shape both globally and locally by pooling salient scaled and oriented complex gradients in a large nested support set. We show that this nesting property introduces a nested correlation structure that enables a new local distance function called the nesting distance, which provides a provably robust similarity function for image matching. Furthermore, the nesting property suggests an elegant flower like normalization strategy called a log-spiral difference. We show that this normalization enables a compact binary representation and is equivalent to a form a bottom up saliency. This suggests that the nested descriptor representational power is due to representing salient edges, which makes a fundamental connection between the saliency and local feature descriptor literature. In this dissertation, we introduce three examples of shape representation using nested descriptors: nested shape descriptors for imagery, nested motion descriptors for video and nested pooling for activities. We show evaluation results for these representations that demonstrate state-of-the-art performance for image matching, wide baseline stereo and activity recognition tasks

    Aprendizaje multi-etiqueta distribuido en Apache Spark

    Get PDF
    This thesis proposes a series of multi-label learning algorithms for classication and feature selection implemented on the Apache Spark distributed computing model. Five approaches for determining the optimal architecture to speed up the multi-label learning methods are presented. These approaches range from local parallelization using threads to distributed computing using independent or shared memory spaces. It is shown that the optimal approach performs hundreds of times faster than the baseline method. Three distributed multi-label k nearest neighbors methods built on top of the Spark architecture are proposed: an exact iterative method that computes pair-wise distances, an approximate tree-based method that indexes the instances across multiple nodes, and an approximate local sensitive hashing method that builds multiple hash tables to index the data. The results indicated that the predictions of the tree-based method are on par with those of an exact method while reducing the execution times in all the scenarios. The aforementioned method is then used to evaluate the quality of a selected feature subset. The optimal adaptation for a multi-label feature selection criterion is discussed and two distributed feature selection methods for multi-label problems are proposed: a method that selects the feature subset that maximizes the Euclidean norm of the individual information measures, and a method selects the subset of features that maximize the geometrical mean. The results indicate that each method excels in di_erent scenarios depending on type of features and the number of labels. Rigorous experimental studies and statistical analyses over many multi-label metrics and datasets con_rm that the proposals achieve better performances and provide better scalability to bigger data than the methods compared in the state of the art.Esta Tesis Doctoral propone unos algoritmos de clasificación y selección de atributos para aprendizaje multi-etiqueta distribuidos implementados en Apache Spark. Cinco estrategias para determinar la arquitectura óptima para acelerar el aprendizaje multi-etiqueta son presentadas. Estas estrategias varían desde la paralelización local utilizando hilos hasta la distribución de la computación utilizando espacios de memoria compartidos o independientes. Ha sido demostrado que la estrategia óptima permite ejecutar cientos de veces más rápido que el método de referencia. Se proponen tres métodos distribuidos de \k nearest neighbors" multi-etiqueta sobre la arquitectura de Spark seleccionada: un método exacto que computa iterativamente las distancias, un método aproximado que usa un árbol para indexar las instancias, y un método aproximado que utiliza tablas hash para indexar las instancias. Los resultados indican que las predicciones del método basado en árboles son equivalente a aquellas producidas por un método exacto a la vez que reduce los tiempos de ejecución en todos los escenarios. Dicho método es utilizado para evaluar la calidad de un subconjunto de atributos. Se discute el criterio para seleccionar atributos en problemas multi-etiqueta, proponiendo: un método que selecciona el subconjunto de atributos cuyas medidas de información individuales poseen la mayor norma Euclídea, y un método que selecciona el subconjunto de atributos con la mayor media geométrica. Los resultados indican que cada método destaca en escenarios diferentes dependiendo del tipo de atributos y el número de etiquetas. Los estudios experimentales y análisis estadísticos utilizando múltiples métricas y datos multi-etiqueta confirman que nuestras propuestas alcanzan un mejor rendimiento y proporcionan una mejor escalabilidad para datos de gran tamaño respecto a los métodos de referencia

    27th Annual European Symposium on Algorithms: ESA 2019, September 9-11, 2019, Munich/Garching, Germany

    Get PDF
    corecore