
    Clustering by compression

    We present a new method for clustering based on compression. The method does not use subject-specific features or background knowledge, and works as follows: First, we determine a universal similarity distance, the normalized compression distance or NCD, computed from the lengths of compressed data files (singly and in pairwise concatenation). Second, we apply a hierarchical clustering method. The NCD is universal in that it is not restricted to a specific application area and works across application area boundaries. A theoretical precursor, the normalized information distance, co-developed by one of the authors, is provably optimal but uses the non-computable notion of Kolmogorov complexity. We propose precise notions of similarity metric and normal compressor, and show that the NCD based on a normal compressor is a similarity metric that approximates universality. To extract a hierarchy of clusters from the distance matrix, we determine a dendrogram (binary tree) by a new quartet method and a fast heuristic to implement it. The method is implemented and available as public software, and is robust under choice of different compressors. To substantiate our claims of universality and robustness, we report evidence of successful application in areas as diverse as genomics, virology, languages, literature, music, handwritten digits, astronomy, and combinations of objects from completely different domains, using statistical, dictionary, and block-sorting compressors. In genomics we presented new evidence for major questions in mammalian evolution, based on whole-mitochondrial genomic analysis: the Eutherian orders and the Marsupionta hypothesis against the Theria hypothesis.
    Comment: LaTeX, 27 pages, 20 figures
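    The NCD has a closed form that is easy to try with any off-the-shelf compressor. Below is a minimal sketch (not the authors' published software) using Python's bz2 module, a block-sorting compressor of the kind the abstract mentions; it implements NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), where C(s) is the length of s after compression.

```python
import bz2

def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance from compressed lengths C(.)."""
    cx = len(bz2.compress(x))
    cy = len(bz2.compress(y))
    cxy = len(bz2.compress(x + y))  # concatenation compresses well iff x and y share structure
    return (cxy - min(cx, cy)) / max(cx, cy)

a = b"the quick brown fox jumps over the lazy dog " * 40
b = b"the quick brown fox leaps over the lazy dog " * 40
c = bytes(range(256)) * 8
print(ncd(a, b))  # similar data: NCD near 0
print(ncd(a, c))  # unrelated data: NCD near 1
```

    Values slightly above 1 are possible because real compressors only approximate Kolmogorov complexity; hierarchical clustering is then run on the pairwise NCD matrix.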

    Fronthaul-Constrained Cloud Radio Access Networks: Insights and Challenges

    As a promising paradigm for fifth-generation (5G) wireless communication systems, cloud radio access networks (C-RANs) have been shown to reduce both capital and operating expenditures, as well as to provide high spectral efficiency (SE) and energy efficiency (EE). The fronthaul in such networks, defined as the transmission link between a baseband unit (BBU) and a remote radio head (RRH), requires high capacity but is often constrained. This article comprehensively surveys recent advances in fronthaul-constrained C-RANs, including system architectures and key techniques. In particular, it discusses key techniques for alleviating the impact of the constrained fronthaul on SE/EE and quality of service for users, including compression and quantization, large-scale coordinated processing and clustering, and resource allocation optimization. Open issues in terms of software-defined networking, network function virtualization, and partial centralization are also identified.
    Comment: 5 figures, accepted by IEEE Wireless Communications. arXiv admin note: text overlap with arXiv:1407.3855 by other authors
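    The survey only names these techniques; as a toy illustration of the compression-and-quantization idea (an assumption for illustration, not a scheme from the article), the sketch below uniformly quantizes complex baseband I/Q samples, trading bits per sample on the BBU-RRH link against signal quality.

```python
import numpy as np

def quantize_iq(samples: np.ndarray, bits: int) -> np.ndarray:
    """Uniform scalar quantization of the I and Q components.

    Fewer bits per sample means a lower fronthaul bit rate at the
    cost of quantization noise (illustrative only).
    """
    levels = 2 ** bits
    peak = np.max(np.abs(np.concatenate([samples.real, samples.imag])))
    step = 2 * peak / levels
    q = lambda v: np.clip(np.round(v / step), -levels // 2, levels // 2 - 1) * step
    return q(samples.real) + 1j * q(samples.imag)

rng = np.random.default_rng(0)
x = rng.standard_normal(4096) + 1j * rng.standard_normal(4096)
for b in (4, 8, 12):
    err = x - quantize_iq(x, b)
    sqnr = 10 * np.log10(np.mean(np.abs(x) ** 2) / np.mean(np.abs(err) ** 2))
    print(f"{b} bits/component -> ~{sqnr:.1f} dB SQNR")
```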

    3D oil reservoir visualisation using octree compression techniques utilising logical grid co-ordinates

    Octree compression techniques have been used for several years to compress large three-dimensional data sets into homogeneous regions. This compression technique is ideally suited to datasets in which similar values occur in clusters. Oil engineers represent reservoirs as a three-dimensional grid in which hydrocarbons occur naturally in clusters. This research examines the efficiency of storing these grids using octree compression techniques, where grid cells are broken into active and inactive regions. Initial experiments yielded high compression ratios, as only active leaf nodes and their ancestor (header) nodes are stored as a bitstream to file on disk. Savings in computational time and memory were possible at decompression, as only active leaf nodes are sent to the graphics card, eliminating the need to reconstruct the original matrix. This results in a more compact vertex table, which can be loaded into the graphics card more quickly, giving shorter refresh delay times.
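    As a rough illustration of that active/inactive encoding (a hypothetical simplification, not the code from this research), the sketch below recursively subdivides a cubic boolean grid: homogeneous octants become leaf bits, mixed octants subdivide into eight children, so a clustered reservoir collapses to a short bitstream.

```python
import numpy as np

def encode_octree(grid: np.ndarray, bits: list[int]) -> None:
    """Emit a bitstream: 0 = internal node (subdivide into 8 octants);
    1 followed by the region's value = homogeneous leaf (1 active, 0 inactive)."""
    if grid.min() == grid.max():        # homogeneous octant -> leaf
        bits.append(1)
        bits.append(int(grid.flat[0]))
        return
    bits.append(0)                      # mixed octant -> recurse
    h = grid.shape[0] // 2
    for i in (0, 1):
        for j in (0, 1):
            for k in (0, 1):
                encode_octree(grid[i*h:(i+1)*h, j*h:(j+1)*h, k*h:(k+1)*h], bits)

grid = np.zeros((8, 8, 8), dtype=bool)  # toy reservoir with one active cluster
grid[0:4, 0:4, 0:4] = True
bits: list[int] = []
encode_octree(grid, bits)
print(f"{len(bits)} bits instead of {grid.size} cells")  # 17 vs 512
```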

    Preconditioned Data Sparsification for Big Data with Applications to PCA and K-means

    We analyze a compression scheme for large data sets that randomly keeps a small percentage of the components of each data sample. The benefit is that the output is a sparse matrix, so subsequent processing, such as PCA or K-means, is significantly faster, especially in a distributed-data setting. Furthermore, the sampling is single-pass and applicable to streaming data. The sampling mechanism is a variant of previous methods proposed in the literature, combined with a randomized preconditioning to smooth the data. We provide guarantees for PCA in terms of the covariance matrix, and guarantees for K-means in terms of the error in the center estimators at a given step. We present numerical evidence showing both that our bounds are nearly tight and that our algorithms provide a real benefit when applied to standard test data sets, as well as certain benefits over related sampling approaches.
    Comment: 28 pages, 10 figures
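    A minimal sketch of the general recipe, under stated assumptions: here the randomized preconditioning is taken to be a random sign flip followed by an orthonormal DCT (the paper's exact preconditioner and sampling distribution may differ), after which each entry is kept independently with probability p and rescaled by 1/p, so the sparsified matrix is an unbiased estimate of the preconditioned data.

```python
import numpy as np
from scipy.fft import dct

rng = np.random.default_rng(0)

def precondition_and_sparsify(X: np.ndarray, p: float) -> np.ndarray:
    """Smooth each sample with a random-sign + orthonormal-DCT transform,
    then keep each entry with probability p, rescaled by 1/p (single pass)."""
    n, d = X.shape
    signs = rng.choice([-1.0, 1.0], size=d)   # random diagonal preconditioner
    Y = dct(X * signs, norm="ortho", axis=1)  # orthonormal, so distances are preserved
    mask = rng.random((n, d)) < p
    return np.where(mask, Y / p, 0.0)

X = rng.standard_normal((1000, 64))
Xs = precondition_and_sparsify(X, p=0.05)
print("kept fraction:", np.count_nonzero(Xs) / Xs.size)
# PCA or K-means then runs on the sparse Xs; because the transform is
# orthonormal, results can be interpreted in the preconditioned basis.
```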

    Compressive Mining: Fast and Optimal Data Mining in the Compressed Domain

    Real-world data typically contain repeated and periodic patterns. This suggests that they can be effectively represented and compressed using only a few coefficients of an appropriate basis (e.g., Fourier, wavelets, etc.). However, distance estimation when the data are represented using different sets of coefficients is still a largely unexplored area. This work studies the optimization problems related to obtaining the tightest lower/upper bound on Euclidean distances when each data object is potentially compressed using a different set of orthonormal coefficients. Our technique leads to tighter distance estimates, which translates into more accurate search, learning, and mining operations directly in the compressed domain. We formulate the problem of estimating lower/upper distance bounds as an optimization problem. We establish the properties of optimal solutions, and leverage the theoretical analysis to develop a fast algorithm to obtain an exact solution to the problem. The suggested solution provides the tightest estimation of the L2-norm or the correlation. We show that typical data-analysis operations, such as k-NN search or k-Means clustering, can operate more accurately using the proposed compression and distance reconstruction technique. We compare it with many other prevalent compression and reconstruction techniques, including random projections and PCA-based techniques. We highlight a surprising result, namely that when the data are highly sparse in some basis, our technique may even outperform PCA-based compression. The contributions of this work are generic, as our methodology is applicable to any sequential or high-dimensional data as well as to any orthogonal data transformation used for the underlying data compression scheme.
    Comment: 25 pages, 20 figures, accepted in VLDB
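    For intuition about what such bounds look like, here is a simple triangle-inequality baseline (not the paper's optimal algorithm): each object keeps its own top-k orthonormal coefficients plus one scalar, the discarded energy, and valid lower/upper bounds follow from writing x = xk + xr with ||xr|| = sqrt(ex).

```python
import numpy as np
from scipy.fft import dct

def distance_bounds(xk, yk, ex, ey):
    """Lower/upper bounds on ||x - y|| from partial coefficients.

    xk, yk: coefficient vectors with discarded entries zeroed
            (each object may keep a different subset).
    ex, ey: discarded energy ||x||^2 - ||xk||^2 per object.
    Triangle inequality: ||xk - yk|| - slack <= ||x - y|| <= ||xk - yk|| + slack.
    """
    dk = np.linalg.norm(xk - yk)
    slack = np.sqrt(ex) + np.sqrt(ey)
    return max(0.0, dk - slack), dk + slack

def keep_topk(coeffs, k):
    """Keep each object's own top-k coefficients by magnitude."""
    out = np.zeros_like(coeffs)
    idx = np.argsort(np.abs(coeffs))[-k:]
    out[idx] = coeffs[idx]
    return out, float(max(np.sum(coeffs**2) - np.sum(out**2), 0.0))

rng = np.random.default_rng(1)
x, y = rng.standard_normal(256), rng.standard_normal(256)
# The DCT with norm="ortho" is orthonormal, so it preserves L2 distances.
Xk, ex = keep_topk(dct(x, norm="ortho"), 32)
Yk, ey = keep_topk(dct(y, norm="ortho"), 32)
lo, hi = distance_bounds(Xk, Yk, ex, ey)
print(lo, np.linalg.norm(x - y), hi)  # true distance lies in [lo, hi]
```

    The paper's contribution is tightening such bounds by optimizing over all signals consistent with the stored coefficients and energies.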

    Normalized Web Distance and Word Similarity

    There is a great deal of work in cognitive psychology, linguistics, and computer science about using word (or phrase) frequencies in context in text corpora to develop measures for word similarity or word association, going back to at least the 1960s. The goal of this chapter is to introduce the normalized web distance (NWD) method to determine similarity between words and phrases. It is a general way to tap the amorphous low-grade knowledge available for free on the Internet, typed in by local users aiming at personal gratification of diverse objectives, and yet globally achieving what is effectively the largest semantic electronic database in the world. Moreover, this database is available for all by using any search engine that can return aggregate page-count estimates for a large range of search queries. In the paper introducing the NWD it was called the `normalized Google distance (NGD),' but since Google does not allow computer searches anymore, we opt for the more neutral and descriptive NWD.
    Comment: LaTeX, 20 pages, 7 figures. To appear in: Handbook of Natural Language Processing, Second Edition, Nitin Indurkhya and Fred J. Damerau, Eds., CRC Press, Taylor and Francis Group, Boca Raton, FL, 2010, ISBN 978-142008592
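    Concretely, with page counts f(x), f(y), f(x, y) and N the (estimated) total number of pages indexed, NWD(x, y) = (max{log f(x), log f(y)} - log f(x, y)) / (log N - min{log f(x), log f(y)}). A minimal sketch; the counts below are illustrative stand-ins, not live search results:

```python
from math import log

def nwd(fx: float, fy: float, fxy: float, n: float) -> float:
    """Normalized web distance from aggregate page counts.

    fx, fy: pages containing term x / term y; fxy: pages with both;
    n: total pages indexed by the search engine (an estimate suffices).
    """
    lx, ly, lxy = log(fx), log(fy), log(fxy)
    return (max(lx, ly) - lxy) / (log(n) - min(lx, ly))

# Terms that co-occur often get a small NWD (illustrative counts).
print(nwd(fx=46_700_000, fy=12_200_000, fxy=2_630_000, n=8e9))  # ~0.44
```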