
    Selective inference after convex clustering with $\ell_1$ penalization

    Classical inference methods notoriously fail when applied to data-driven test hypotheses or inference targets. Instead, dedicated methodologies are required to obtain statistical guarantees for these selective inference problems. Selective inference is particularly relevant post-clustering, typically when testing a difference in mean between two clusters. In this paper, we address convex clustering with $\ell_1$ penalization, by leveraging related selective inference tools for regression, based on Gaussian vectors conditioned to polyhedral sets. In the one-dimensional case, we prove a polyhedral characterization of obtaining given clusters, which enables us to suggest a test procedure with statistical guarantees. This characterization also allows us to provide a computationally efficient regularization path algorithm. Then, we extend the above test procedure and guarantees to multi-dimensional clustering with $\ell_1$ penalization, and also to more general multi-dimensional clusterings that aggregate one-dimensional ones. With various numerical experiments, we validate our statistical guarantees and we demonstrate the power of our methods to detect differences in mean between clusters. Our methods are implemented in the R package poclin. Comment: 40 pages, 8 figures
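    For intuition, here is a minimal sketch of the truncated-Gaussian test that polyhedral selective inference of this kind rests on, assuming the truncation interval [lo, hi] has already been derived from the clustering event (the paper's polyhedral characterization); the function name and interface are illustrative, not from the poclin package.

```python
import numpy as np
from scipy.stats import norm

def truncated_gaussian_pvalue(stat, sigma, lo, hi):
    """One-sided p-value for a N(0, sigma^2) test statistic that the
    selection event constrains to lie in the interval [lo, hi]."""
    # P(Z >= stat | Z in [lo, hi]) for Z ~ N(0, sigma^2), computed with
    # survival functions for numerical stability in the tails.
    num = norm.sf(stat / sigma) - norm.sf(hi / sigma)
    den = norm.sf(lo / sigma) - norm.sf(hi / sigma)
    return num / den

# A difference-in-means statistic of 2.1 (sd 1) truncated by the
# selection event to [1.5, inf): the selective p-value is ~0.27,
# far larger than the naive norm.sf(2.1) ~ 0.018.
print(truncated_gaussian_pvalue(2.1, 1.0, 1.5, np.inf))
```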

    Data Representation for Learning and Information Fusion in Bioinformatics

    This thesis deals with the rigorous application of nonlinear dimension reduction and data organization techniques to biomedical data analysis. The Laplacian Eigenmaps algorithm is representative of these methods and has been widely applied in manifold learning and related areas. While the asymptotic manifold recovery behavior of such methods has been well characterized, the clustering properties of Laplacian embeddings with finite data have so far been supported largely by heuristic arguments. We develop a precise bound characterizing cluster structure preservation under Laplacian embeddings. From this foundation, we introduce flexible and mathematically well-founded approaches for information fusion and feature representation. These methods are applied to three substantial case studies in bioinformatics, illustrating their capacity to extract scientifically valuable information from complex data.
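    For context, a bare-bones unnormalized Laplacian Eigenmaps embedding (binary k-nearest-neighbor graph, smallest nontrivial Laplacian eigenvectors); the thesis's constructions and its cluster-preservation bound are of course more refined than this sketch.

```python
import numpy as np
from scipy.spatial.distance import cdist
from scipy.linalg import eigh

def laplacian_eigenmaps(X, n_components=2, n_neighbors=10):
    """Embed the rows of X using the smallest nontrivial eigenvectors
    of an unnormalized k-nearest-neighbor graph Laplacian."""
    D = cdist(X, X)
    # Binary symmetric k-NN adjacency (column 0 of argsort is the point itself).
    idx = np.argsort(D, axis=1)[:, 1:n_neighbors + 1]
    W = np.zeros_like(D)
    rows = np.repeat(np.arange(len(X)), n_neighbors)
    W[rows, idx.ravel()] = 1.0
    W = np.maximum(W, W.T)
    L = np.diag(W.sum(axis=1)) - W        # unnormalized graph Laplacian
    # Eigenvector 0 is constant; the next ones carry the embedding.
    _, vecs = eigh(L)
    return vecs[:, 1:n_components + 1]
```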

    A New Measure for Analyzing and Fusing Sequences of Objects

    This work is related to the combinatorial data analysis problem of seriation, used for data visualization and exploratory analysis. Seriation re-sequences the data so that more similar samples or objects appear closer together, whereas dissimilar ones are further apart. Despite the large number of algorithms for such re-sequencing, there has been no systematic way of analyzing the resulting sequences, comparing them, or fusing them to obtain a single unifying one. We propose a new positional proximity measure that evaluates the similarity of two arbitrary sequences based on their agreement on pairwise positional information of the sequenced objects. Furthermore, we present various statistical properties of this measure as well as its normalized version, modeled as an instance of the generalized correlation coefficient. Based on this measure, we define a new procedure for consensus seriation that fuses multiple arbitrary sequences, based on a quadratic assignment problem formulation and an efficient way of approximating its solution. We also derive theoretical links with other permutation distance functions and present their associated combinatorial optimization forms for consensus tasks. The utility of the proposed contributions is demonstrated through the comparison and fusion of multiple seriation algorithms we have implemented, using many real-world datasets from different application domains.
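    The paper's measure is defined on pairwise positional information of the sequenced objects; as a rough stand-in (not the paper's exact definition), the sketch below scores the fraction of object pairs whose relative order agrees in two sequences, which conveys the flavor of comparing seriation orders.

```python
from itertools import combinations

def positional_agreement(seq_a, seq_b):
    """Fraction of object pairs whose relative order agrees in two
    sequences over the same objects (a Kendall-style stand-in for the
    paper's positional proximity measure)."""
    pos_a = {obj: i for i, obj in enumerate(seq_a)}
    pos_b = {obj: i for i, obj in enumerate(seq_b)}
    agree = total = 0
    for x, y in combinations(seq_a, 2):
        total += 1
        # The pair agrees if x and y appear in the same order in both.
        agree += (pos_a[x] - pos_a[y]) * (pos_b[x] - pos_b[y]) > 0
    return agree / total

# Seriation orders are typically meaningful only up to reversal, so a
# consumer might take max(m, 1 - m) when comparing two orders.
print(positional_agreement(list("abcd"), list("abdc")))  # 5/6
```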

    Entanglement thresholds for random induced states

    For a random quantum state on $H = C^d \otimes C^d$ obtained by partial tracing a random pure state on $H \otimes C^s$, we consider the question of whether it is typically separable or typically entangled. For this problem, we show the existence of a sharp threshold $s_0 = s_0(d)$ of order roughly $d^3$. More precisely, for any $a > 0$ and for $d$ large enough, such a random state is entangled with very large probability when $s < (1-a)s_0$, and separable with very large probability when $s > (1+a)s_0$. One consequence of this result is as follows: for a system of $N$ identical particles in a random pure state, there is a threshold $k_0 = k_0(N) \sim N/5$ such that two subsystems of $k$ particles each typically share entanglement if $k > k_0$, and typically do not share entanglement if $k < k_0$. Our methods work also for multipartite systems and for "unbalanced" systems such as $C^d \otimes C^{d'}$, $d \neq d'$. The arguments rely on random matrices, classical convexity, high-dimensional probability and geometry of Banach spaces; some of the auxiliary results may be of reference value. A high-level non-technical overview of the results of this paper and of a related article arXiv:1011.0275 can be found in arXiv:1112.4582. Comment: 34 pages; v.3: reorganized proof, new results only in section 7.1, references added; v.2: main result strengthened (much stronger threshold property) allowing the sharp "N-particle" interpretation of the results stated in the abstract, new appendix on majorization and $\infty$-Wasserstein distance, references added
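    To make the setup concrete: the induced measure obtained by partial tracing a uniformly random pure state is the normalized complex Wishart ensemble, which is easy to sample. The PPT check below is only a necessary condition for separability, so it is a simple stand-in for, not a substitute for, the paper's exact separability analysis.

```python
import numpy as np

def random_induced_state(d, s, seed=0):
    """Random state on C^d (x) C^d induced by partial tracing a uniform
    pure state over an s-dimensional environment; equivalently a
    normalized complex Wishart matrix G G* with G of size d^2 x s."""
    rng = np.random.default_rng(seed)
    G = rng.standard_normal((d * d, s)) + 1j * rng.standard_normal((d * d, s))
    rho = G @ G.conj().T
    return rho / np.trace(rho).real

def is_ppt(rho, d):
    """Partial-transpose test: False certifies entanglement, while True
    is merely consistent with separability."""
    rho_pt = (rho.reshape(d, d, d, d)      # axes (i, j, k, l)
                 .transpose(0, 3, 2, 1)    # transpose the second factor
                 .reshape(d * d, d * d))
    return np.linalg.eigvalsh(rho_pt).min() >= -1e-10

# Small environment (s far below the threshold s_0 ~ d^3): the induced
# state is typically NPT, hence entangled.
print(is_ppt(random_induced_state(d=3, s=2), d=3))
```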

    K-means based clustering and context quantization


    Networked Data Analytics: Network Comparison And Applied Graph Signal Processing

    Networked data structures have become big, ubiquitous, and pervasive. As our day-to-day activities become more incorporated with and influenced by the digital world, we rely more on our intuition to provide us with a high-level idea and subconscious understanding of the encountered data. This thesis aims at translating the qualitative intuitions we have about networked data into quantitative and formal tools by designing rigorous yet reasonable algorithms. In a nutshell, this thesis constructs models to compare and cluster networked data, to simplify a complicated networked structure, and to formalize the notion of smoothness and variation for domain-specific signals on a network. This thesis consists of two interrelated thrusts, which explore both the scenarios where networks have intrinsic value and are themselves the object of study, and those where the interest is in signals defined on top of the networks, so we leverage the information in the network to analyze the signals. Our results suggest that the intuition we have in analyzing huge data can be transformed into rigorous algorithms, and often the intuition results in superior performance, new observations, better complexity, and/or bridging of two commonly implemented methods. Even though different in the principles they investigate, both thrusts are built on what we see as a contemporary alternation in data analytics: from building an algorithm and then understanding it, to having an intuition and then building an algorithm around it. We show that in order to formalize the intuitive idea of measuring the difference between a pair of networks of arbitrary sizes, we can design two algorithms based on the intuition of finding mappings between the node sets or of mapping one network into a subset of another network. Such methods also lead to a clustering algorithm to categorize networked data structures. Besides, we can define the notion of frequencies of a given network by ordering features in the network according to how important they are to the overall information conveyed by the network. The proposed algorithms succeed in comparing collaboration histories of researchers, clustering research communities via their publication patterns, categorizing moving objects from uncertain measurements, and separating networks constructed from different processes. In the context of data analytics on top of networks, we design domain-specific tools by leveraging recent advances in graph signal processing, which formalizes the intuitive notion of smoothness and variation of signals defined on top of networked structures and generalizes conventional Fourier analysis to the graph domain. Specifically, we show how these tools can be used to better classify cancer subtypes by considering genetic profiles as signals on top of gene-to-gene interaction networks, to gain new insights into the differences between human beings in learning new tasks and switching attention by considering brain activities as signals on top of brain connectivity networks, and to demonstrate how common methods in rating prediction are special graph filters and, based on this observation, to design novel recommendation system algorithms.
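    A small illustration of the graph-signal-processing vocabulary the thesis builds on (not its specific algorithms): Laplacian eigenvalues act as frequencies, projection onto the eigenvectors is the graph Fourier transform, and the quadratic form $x^T L x$ quantifies the variation of a signal x on the graph.

```python
import numpy as np

def graph_fourier(W, x):
    """Graph Fourier transform of a signal x on a graph with symmetric
    adjacency matrix W, via the eigenbasis of the combinatorial
    Laplacian L = D - W; eigenvalues play the role of frequencies."""
    L = np.diag(W.sum(axis=1)) - W
    freqs, U = np.linalg.eigh(L)          # ascending "frequencies"
    x_hat = U.T @ x                       # spectral coefficients
    smoothness = x @ L @ x                # total variation x^T L x
    return freqs, x_hat, smoothness

# Path graph on 4 nodes; a slowly varying signal has small x^T L x.
W = np.array([[0, 1, 0, 0], [1, 0, 1, 0],
              [0, 1, 0, 1], [0, 0, 1, 0]], dtype=float)
freqs, x_hat, s = graph_fourier(W, np.array([0.0, 1.0, 2.0, 3.0]))
print(s)  # 3.0: the sum of squared differences across the 3 edges
```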

    Efficient Data Driven Multi Source Fusion

    Data/information fusion is an integral component of many existing and emerging applications; e.g., remote sensing, smart cars, Internet of Things (IoT), and Big Data, to name a few. While fusion aims to achieve better results than any one individual input can provide, the challenge is often to determine the underlying mathematics of aggregation suitable for an application. In this dissertation, I focus on the following three aspects of aggregation: (i) efficient data-driven learning and optimization, (ii) extensions and new aggregation methods, and (iii) feature and decision level fusion for machine learning with applications to signal and image processing. The Choquet integral (ChI), a powerful nonlinear aggregation operator, is a parametric way (with respect to the fuzzy measure (FM)) to generate a wealth of aggregation operators. The FM has $2^N$ variables and $N(2^{N-1})$ monotonicity constraints for $N$ inputs. As a result, learning the ChI parameters from data quickly becomes impractical for most applications. Herein, I propose a scalable learning procedure (linear with respect to training sample size) for the ChI that identifies and optimizes only data-supported variables. As such, the computational complexity of the learning algorithm is proportional to the complexity of the solver used. This method also includes an imputation framework to obtain scalar values for data-unsupported (aka missing) variables and a compression algorithm (lossy or lossless) for the learned variables. I also propose a genetic algorithm (GA) to optimize the ChI for non-convex, multi-modal, and/or analytical objective functions. This algorithm introduces two operators that automatically preserve the constraints; therefore, there is no need to explicitly enforce the constraints, as required by traditional GAs. In addition, this algorithm provides an efficient representation of the search space with the minimal set of vertices. Furthermore, I study different strategies for extending the fuzzy integral to missing data, and I propose a goal programming framework to aggregate inputs from heterogeneous sources for ChI learning. Last, my work in remote sensing involves visual-clustering-based band group selection and Lp-norm multiple kernel learning based feature-level fusion in hyperspectral image processing to enhance pixel-level classification.
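    For reference, the discrete Choquet integral itself is short to state in code; the dissertation's contribution is learning the fuzzy measure g at scale, which the sketch below takes as given (the dict-based representation of g is illustrative).

```python
import numpy as np

def choquet_integral(h, g):
    """Discrete Choquet integral of the inputs h with respect to a
    fuzzy measure g, given as a dict from frozensets of input indices
    to [0, 1], with g[frozenset()] = 0 and g[all indices] = 1."""
    order = np.argsort(h)[::-1]        # visit inputs in descending order
    total, prev, subset = 0.0, 0.0, frozenset()
    for idx in order:
        subset = subset | {int(idx)}
        # Weight each input by the measure increment of its level set.
        total += h[idx] * (g[subset] - prev)
        prev = g[subset]
    return total

# Two sources with g({0}) = 0.3, g({1}) = 0.6, g({0,1}) = 1:
g = {frozenset(): 0.0, frozenset({0}): 0.3,
     frozenset({1}): 0.6, frozenset({0, 1}): 1.0}
print(choquet_integral(np.array([0.8, 0.5]), g))  # 0.8*0.3 + 0.5*0.7 = 0.59
```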

    Weakly monotonic averaging with application to image processing
