    Robust Correlation Clustering

    In this paper, we introduce and study the Robust-Correlation-Clustering problem: given a graph G = (V, E) where every edge is labeled + or - (denoting similar or dissimilar pairs of vertices), and a parameter m, the goal is to delete a set D of m vertices and partition the remaining vertices V \ D into clusters so as to minimize the cost of the clustering, defined as the number of + edges with endpoints in different clusters plus the number of - edges with endpoints in the same cluster. This generalizes the classical Correlation-Clustering problem, which is the special case m = 0. Correlation clustering is useful when we have (only) qualitative information about the similarity or dissimilarity of pairs of points, and Robust-Correlation-Clustering equips this model with the capability to handle noise in datasets. In this work, we present a constant-factor bi-criteria algorithm for Robust-Correlation-Clustering on complete graphs (our solution is O(1)-approximate w.r.t. the cost while discarding O(1)·m points as outliers), and we complement this by showing that no finite approximation is possible if the outlier budget may not be violated. Our algorithm is very simple: it first performs an LP-based pre-processing step to delete O(m) vertices, and then runs a particular Correlation-Clustering algorithm, ACNAlg [Ailon et al., 2005], on the residual instance. We then consider general graphs, give (O(log n), O(log^2 n)) bi-criteria algorithms, and show a hardness of α_MC on both the cost and the outlier violation, where α_MC is the hardness lower bound for the Minimum-Multicut problem.
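    The abstract names the two-stage structure but not the details of the LP pre-processing, so the sketch below covers only the second stage: a minimal Python version of the pivot-style ACN subroutine [Ailon et al., 2005], together with the disagreement cost it minimizes, assuming a complete graph in which every pair not listed as a + edge is a - edge. The function names are illustrative, not from the paper.

```python
import random

def pivot_correlation_clustering(vertices, plus_edges, seed=None):
    """ACN-style pivot algorithm: repeatedly pick a random pivot,
    cluster it with its remaining + neighbors, and recurse on the rest.
    Pairs absent from plus_edges are treated as - edges (complete graph)."""
    rng = random.Random(seed)
    plus = {frozenset(e) for e in plus_edges}
    remaining = set(vertices)
    clusters = []
    while remaining:
        pivot = rng.choice(sorted(remaining))
        cluster = {pivot} | {v for v in remaining
                             if v != pivot and frozenset((pivot, v)) in plus}
        clusters.append(cluster)
        remaining -= cluster
    return clusters

def clustering_cost(vertices, plus_edges, clusters):
    """Disagreements: + edges across clusters plus - edges inside clusters."""
    label = {v: i for i, c in enumerate(clusters) for v in c}
    plus = {frozenset(e) for e in plus_edges}
    vs = sorted(vertices)
    cost = 0
    for i, u in enumerate(vs):
        for v in vs[i + 1:]:
            same = label[u] == label[v]
            is_plus = frozenset((u, v)) in plus
            cost += (is_plus and not same) or (not is_plus and same)
    return cost
```

    On complete graphs this pivot step is the piece the paper reuses after its LP-based deletion of O(m) vertices; the code above is only that generic subroutine, not the robust algorithm itself.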

    Estimation of intrinsic dimension via clustering

    The problem of estimating the intrinsic dimension of a set of points in high-dimensional space is a critical issue for a wide range of disciplines, including genomics, finance, and networking. Current estimation techniques depend on either the ambient or the intrinsic dimension in terms of computational complexity, which may cause these methods to become intractable for large data sets. In this paper, we present a clustering-based methodology that exploits the inherent self-similarity of data to efficiently estimate the intrinsic dimension of a set of points. When the data satisfies a specified general clustering condition, we prove that the estimated dimension approaches the true Hausdorff dimension. Experiments show that the clustering-based approach allows for more efficient and accurate intrinsic dimension estimation than all prior techniques, even when the data does not exhibit obvious self-similar structure. Finally, we present empirical results which show that the clustering-based estimation allows for a natural partitioning of data points that lie on separate manifolds of varying intrinsic dimension.
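    The abstract does not spell out the clustering condition or the estimator itself, so as a rough illustration of the self-similarity principle it relies on, here is a minimal sketch of a standard correlation-dimension estimate, which recovers the dimension d from the scaling C(r) ~ r^d of the fraction of point pairs within distance r. This is a stand-in for the idea, not the paper's method.

```python
import numpy as np
from scipy.spatial.distance import pdist

def correlation_dimension(points, radii):
    """Estimate intrinsic dimension from the scaling of the pairwise
    correlation sum C(r) ~ r^d: the slope of log C(r) vs. log r."""
    dists = pdist(points)                      # all pairwise distances
    counts = np.array([(dists < r).mean() for r in radii])
    mask = counts > 0                          # avoid log(0) at tiny radii
    slope, _ = np.polyfit(np.log(radii[mask]), np.log(counts[mask]), 1)
    return slope

# e.g. points on a 2-D patch embedded in R^10: estimate should be near 2
rng = np.random.default_rng(0)
X = np.zeros((2000, 10))
X[:, :2] = rng.uniform(0, 2 * np.pi, (2000, 2))
print(correlation_dimension(X, np.logspace(-1.5, 0, 12)))
```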

    Consensus clustering and functional interpretation of gene-expression data

    Microarray analysis using clustering algorithms can suffer from a lack of inter-method consistency in assigning related gene-expression profiles to clusters. Obtaining a consensus set of clusters from a number of clustering methods should improve confidence in gene-expression analysis. Here we introduce consensus clustering, which provides such an advantage. When coupled with a statistically based gene functional analysis, our method allowed the identification of novel genes regulated by NFκB and the unfolded protein response in certain B-cell lymphomas.
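    As a concrete, hypothetical illustration of combining several clustering methods into a consensus, the sketch below uses the common evidence-accumulation scheme: count how often each pair of expression profiles co-clusters across base methods, then cluster the resulting co-association matrix. The paper's own consensus procedure may well differ in its details.

```python
import numpy as np
from sklearn.cluster import KMeans, AgglomerativeClustering, SpectralClustering

def consensus_clusters(X, k, base_models=None):
    """Evidence-accumulation consensus: run several base clusterers,
    build a pairwise co-association matrix, and partition it."""
    if base_models is None:
        base_models = [
            KMeans(n_clusters=k, n_init=10, random_state=0),
            AgglomerativeClustering(n_clusters=k),
            SpectralClustering(n_clusters=k, random_state=0),
        ]
    n = len(X)
    coassoc = np.zeros((n, n))
    for model in base_models:
        labels = model.fit_predict(X)
        coassoc += (labels[:, None] == labels[None, :])  # co-cluster counts
    coassoc /= len(base_models)
    # final partition: average-linkage clustering on 1 - co-association
    final = AgglomerativeClustering(n_clusters=k, metric="precomputed",
                                    linkage="average")
    return final.fit_predict(1.0 - coassoc)
```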

    Machine Learning and Integrative Analysis of Biomedical Big Data

    Recent developments in high-throughput technologies have accelerated the accumulation of massive amounts of omics data from multiple sources: genome, epigenome, transcriptome, proteome, metabolome, etc. Traditionally, data from each source (e.g., genome) is analyzed in isolation using statistical and machine learning (ML) methods. Integrative analysis of multi-omics and clinical data is key to new biomedical discoveries and advancements in precision medicine. However, data integration poses new computational challenges and exacerbates those associated with single-omics studies. Specialized computational approaches are required to effectively and efficiently perform integrative analysis of biomedical data acquired from diverse modalities. In this review, we discuss state-of-the-art ML-based approaches for tackling five specific computational challenges associated with integrative analysis: the curse of dimensionality, data heterogeneity, missing data, class imbalance, and scalability.
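    As a toy illustration of how some of these challenges surface in practice, the sketch below wires together a naive "early integration" pipeline: per-modality imputation (missing data) and scaling (heterogeneity), feature concatenation, dimensionality reduction (curse of dimensionality), and a class-weighted classifier (imbalance). It is an assumption-laden example for orientation only, not a pipeline advocated by the review.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

def early_integration_model(omics_blocks, y):
    """Toy early-integration pipeline over a list of omics matrices
    (one per modality, same samples in the same row order)."""
    processed = []
    for X in omics_blocks:  # e.g. [genome, transcriptome, proteome]
        block = make_pipeline(
            SimpleImputer(strategy="median"),  # missing data
            StandardScaler(),                  # heterogeneity of scales
        )
        processed.append(block.fit_transform(X))
    X_all = np.hstack(processed)               # naive concatenation
    model = make_pipeline(
        PCA(n_components=min(50, X_all.shape[1])),   # curse of dimensionality
        LogisticRegression(class_weight="balanced",  # class imbalance
                           max_iter=1000),
    )
    return model.fit(X_all, y)
```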