    Choosing Compressor for Computing Normalized Compression Distance

    Modern measurement technologies in sciences such as biology produce huge amounts of measurement data. A major problem is that the data need to be classified so that it becomes possible to tell, for example, cancerous cells from regular cells. However, there is seldom enough information about the processes producing the measured data, so traditional feature-based classification algorithms might prove useless. Normalized information distance is a universal metric that can, in theory, be used to cluster all kinds of data based on their algorithmic similarity without knowing the interesting features beforehand. However, normalized information distance depends on algorithmic Kolmogorov complexity, which cannot be computed. Normalized compression distance is a construct based on normalized information distance that replaces the uncomputable Kolmogorov complexity with compressed file size. Theoretically, any good enough data compressor can be used to compute normalized compression distance. In practice, different compressors are known to produce dissimilar results. The purpose of this work is to construct a framework for evaluating the performance of different compressors when computing the normalized compression distance. The evaluation framework presented in this work is able to uncover differences in the performance of different compressors given different input lengths and input types. Four different compressors were evaluated using the framework, and the results show how different features of the compressors affect their performance. In addition, it became possible to give recommendations about which compressors to use for different kinds of inputs.
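
The kind of comparison this abstract describes can be sketched as follows. This is a minimal illustration, not the evaluation framework from the work itself: the choice of `zlib`, `bz2`, and `lzma` from the Python standard library, the function names `ncd` and `compare_compressors`, and the sample inputs are all assumptions made for the example. The distance uses the usual NCD form, (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), where C is the compressed size in bytes.

```python
import bz2
import lzma
import random
import zlib

# Hypothetical compressor table for illustration; the abstract does not
# name the four compressors that were actually evaluated.
COMPRESSORS = {
    "zlib": lambda data: zlib.compress(data, 9),
    "bz2": lambda data: bz2.compress(data, 9),
    "lzma": lambda data: lzma.compress(data),
}


def ncd(x: bytes, y: bytes, compress) -> float:
    """Normalized compression distance of x and y under one compressor."""
    cx = len(compress(x))
    cy = len(compress(y))
    cxy = len(compress(x + y))  # compressed size of the concatenation
    return (cxy - min(cx, cy)) / max(cx, cy)


def compare_compressors(x: bytes, y: bytes) -> dict:
    """Compute the NCD of the same input pair under every compressor."""
    return {name: ncd(x, y, fn) for name, fn in COMPRESSORS.items()}


if __name__ == "__main__":
    rng = random.Random(0)
    text = b"the quick brown fox jumps over the lazy dog. " * 40
    noise = bytes(rng.randrange(256) for _ in range(len(text)))
    pairs = {"text vs text": (text, text), "text vs noise": (text, noise)}
    for label, (a, b) in pairs.items():
        print(label, compare_compressors(a, b))
```

Running the same pairs at several input lengths and input types would be one way to expose the compressor-dependent differences the abstract mentions.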

    IDENTIFICATION OF COVER SONGS USING INFORMATION THEORETIC MEASURES OF SIMILARITY

    13 pages, 5 figures, 4 tables. v3: Accepted version

    Kolmogorov Complexity in perspective. Part II: Classification, Information Processing and Duality

    We survey diverse approaches to the notion of information: from Shannon entropy to Kolmogorov complexity. Two of the main applications of Kolmogorov complexity are presented: randomness and classification. The survey is divided into two parts published in the same volume. Part II is dedicated to the relation between logic and information systems, within the scope of Kolmogorov algorithmic information theory. We present a recent application of Kolmogorov complexity: classification using compression, an idea with provocative implementations by authors such as Bennett, Vitanyi and Cilibrasi. This stresses how Kolmogorov complexity, besides being a foundation for randomness, is also related to classification. Another approach to classification is also considered: the so-called "Google classification". It uses another original and attractive idea, which is connected to classification using compression and to Kolmogorov complexity from a conceptual point of view. We present and unify these different approaches to classification in terms of Bottom-Up versus Top-Down operational modes, of which we point out the fundamental principles and the underlying duality. We look at the way these two dual modes are used in different approaches to information systems, particularly the relational model for databases introduced by Codd in the 1970s. This allows us to point out diverse forms of a fundamental duality. These operational modes are also reinterpreted in the context of the comprehension schema of axiomatic set theory ZF. This leads us to develop how Kolmogorov complexity is linked to intensionality, abstraction, classification and information systems. Comment: 43 pages

    Normalized Information Distance

    The normalized information distance is a universal distance measure for objects of all kinds. It is based on Kolmogorov complexity and thus uncomputable, but there are ways to utilize it. First, compression algorithms can be used to approximate the Kolmogorov complexity if the objects have a string representation. Second, for names and abstract concepts, page count statistics from the World Wide Web can be used. These practical realizations of the normalized information distance can then be applied to machine learning tasks, especially clustering, to perform feature-free and parameter-free data mining. This chapter discusses the theoretical foundations of the normalized information distance and both practical realizations. It presents numerous examples of successful real-world applications based on these distance measures, ranging from bioinformatics to music clustering to machine translation. Comment: 33 pages, 12 figures, pdf, in: Normalized information distance, in: Information Theory and Statistical Learning, Eds. M. Dehmer, F. Emmert-Streib, Springer-Verlag, New York, To appear
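
For orientation, the three quantities this abstract refers to are usually written as below. The notation is an assumption taken from the standard literature rather than from the abstract: K is Kolmogorov complexity, C(x) is the compressed length of x under a chosen compressor, f(x) is the number of pages containing the term x, f(x, y) is the number containing both terms, and N is the total number of indexed pages.

```latex
% Normalized information distance (uncomputable, defined via K)
\mathrm{NID}(x, y) = \frac{\max\{K(x \mid y),\, K(y \mid x)\}}{\max\{K(x),\, K(y)\}}

% Compression-based realization: C replaces K
\mathrm{NCD}(x, y) = \frac{C(xy) - \min\{C(x),\, C(y)\}}{\max\{C(x),\, C(y)\}}

% Page-count realization for names and abstract concepts
\mathrm{NGD}(x, y) = \frac{\max\{\log f(x),\, \log f(y)\} - \log f(x, y)}
                          {\log N - \min\{\log f(x),\, \log f(y)\}}
```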

    Clustering by compression

    We present a new method for clustering based on compression. The method doesn't use subject-specific features or background knowledge, and works as follows: First, we determine a universal similarity distance, the normalized compression distance or NCD, computed from the lengths of compressed data files (singly and in pairwise concatenation). Second, we apply a hierarchical clustering method. The NCD is universal in that it is not restricted to a specific application area, and works across application area boundaries. A theoretical precursor, the normalized information distance, co-developed by one of the authors, is provably optimal but uses the non-computable notion of Kolmogorov complexity. We propose precise notions of similarity metric and normal compressor, and show that the NCD based on a normal compressor is a similarity metric that approximates universality. To extract a hierarchy of clusters from the distance matrix, we determine a dendrogram (binary tree) by a new quartet method and a fast heuristic to implement it. The method is implemented and available as public software, and is robust under choice of different compressors. To substantiate our claims of universality and robustness, we report evidence of successful application in areas as diverse as genomics, virology, languages, literature, music, handwritten digits, astronomy, and combinations of objects from completely different domains, using statistical, dictionary, and block sorting compressors. In genomics we presented new evidence for major questions in Mammalian evolution, based on whole-mitochondrial genomic analysis: the Eutherian orders and the Marsupionta hypothesis against the Theria hypothesis. Comment: LaTeX, 27 pages, 20 figures
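
The two steps in this abstract (pairwise NCD, then hierarchical clustering) can be sketched roughly as follows. This is only an illustration under stated assumptions: it uses `zlib` as the compressor and swaps in plain single-linkage agglomerative clustering in place of the quartet method the authors describe, and the names `distance_matrix`, `single_linkage`, and the toy inputs are hypothetical.

```python
import zlib
from itertools import combinations


def c(data: bytes) -> int:
    """Compressed length in bytes (zlib is an assumed stand-in compressor)."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """NCD from compressed sizes, singly and in pairwise concatenation."""
    cx, cy = c(x), c(y)
    return (c(x + y) - min(cx, cy)) / max(cx, cy)


def distance_matrix(items: dict) -> dict:
    """Map each unordered pair of item names to its NCD."""
    return {
        frozenset((a, b)): ncd(items[a], items[b])
        for a, b in combinations(items, 2)
    }


def single_linkage(items: dict, n_clusters: int) -> list:
    """Merge the two closest clusters until n_clusters remain.

    Single linkage is used here for brevity; it is NOT the quartet
    tree method from the paper.
    """
    dist = distance_matrix(items)
    clusters = [{name} for name in items]

    def gap(pair):
        i, j = pair
        return min(dist[frozenset((a, b))]
                   for a in clusters[i] for b in clusters[j])

    while len(clusters) > n_clusters:
        i, j = min(combinations(range(len(clusters)), 2), key=gap)
        clusters[i] |= clusters[j]  # i < j, so deleting j keeps i valid
        del clusters[j]
    return clusters


if __name__ == "__main__":
    samples = {
        "en_a": b"the cat sat on the mat and looked at the dog. " * 20,
        "en_b": b"the dog sat on the mat and looked at the cat. " * 20,
        "num_a": b"0123456789 9876543210 0246813579 " * 20,
        "num_b": b"9876543210 0123456789 1357902468 " * 20,
    }
    print(single_linkage(samples, 2))
```

No subject-specific features are supplied anywhere; the grouping comes only from how well the concatenated inputs compress together.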