140 research outputs found

    The Google Similarity Distance

    Words and phrases acquire meaning from the way they are used in society, from their relative semantics to other words and phrases. For computers the equivalent of "society" is "database," and the equivalent of "use" is "way to search the database." We present a new theory of similarity between words and phrases based on information distance and Kolmogorov complexity. To fix thoughts we use the world-wide-web as database, and Google as search engine. The method is also applicable to other search engines and databases. This theory is then applied to construct a method to automatically extract similarity, the Google similarity distance, of words and phrases from the world-wide-web using Google page counts. The world-wide-web is the largest database on earth, and the context information entered by millions of independent users averages out to provide automatic semantics of useful quality. We give applications in hierarchical clustering, classification, and language translation. We give examples to distinguish between colors and numbers, cluster names of paintings by 17th century Dutch masters and names of books by English novelists, the ability to understand emergencies, and primes, and we demonstrate the ability to do a simple automatic English-Spanish translation. Finally, we use the WordNet database as an objective baseline against which to judge the performance of our method. We conduct a massive randomized trial in binary classification using support vector machines to learn categories based on our Google distance, resulting in a mean agreement of 87% with the expert crafted WordNet categories. Comment: 15 pages, 10 figures; changed some text/figures/notation/part of theorem. Incorporated referees' comments. This is the final published version up to some minor changes in the galley proof
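
    The abstract above does not spell out the distance itself. As a rough sketch, the commonly cited form of the normalized Google distance is NGD(x, y) = (max{log f(x), log f(y)} - log f(x, y)) / (log N - min{log f(x), log f(y)}), where f(x) is the page count for term x, f(x, y) the count for pages containing both terms, and N the number of indexed pages. The function name, argument names, and the page counts in the usage lines are illustrative assumptions, not values taken from the paper or from any search engine.

```python
import math


def normalized_google_distance(fx: int, fy: int, fxy: int, n: int) -> float:
    """Sketch of the NGD formula from raw page counts.

    fx, fy -- page counts for each term alone (assumed > 0)
    fxy    -- page count for pages containing both terms
    n      -- assumed total number of indexed pages (normalizing constant)
    """
    if fx <= 0 or fy <= 0:
        raise ValueError("single-term page counts must be positive")
    if n <= min(fx, fy):
        raise ValueError("n must exceed the smaller single-term count")
    if fxy <= 0:
        # The terms never co-occur: treat the distance as infinite.
        return math.inf
    log_fx, log_fy, log_fxy = math.log(fx), math.log(fy), math.log(fxy)
    numerator = max(log_fx, log_fy) - log_fxy
    denominator = math.log(n) - min(log_fx, log_fy)
    return numerator / denominator


if __name__ == "__main__":
    # Hypothetical counts, chosen only to exercise the function.
    print(normalized_google_distance(fx=9_000_000, fy=4_000_000,
                                     fxy=1_500_000, n=10_000_000_000))
```

    Terms that always appear together give a distance of 0, and the value grows as co-occurrence becomes rarer relative to the individual counts.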

    A New Quartet Tree Heuristic for Hierarchical Clustering

    We consider the problem of constructing an optimal-weight tree from the 3*(n choose 4) weighted quartet topologies on n objects, where optimality means that the summed weight of the embedded quartet topologies is optimal (so it can be the case that the optimal tree embeds all quartets as non-optimal topologies). We present a heuristic for reconstructing the optimal-weight tree, and a canonical manner to derive the quartet-topology weights from a given distance matrix. The method repeatedly transforms a bifurcating tree, with all objects involved as leaves, achieving a monotonic approximation to the exact single globally optimal tree. This contrasts with other heuristic search methods from biological phylogeny, like DNAML or quartet puzzling, which repeatedly and incrementally construct a solution from a random order of objects, and subsequently add agreement values. Comment: 22 pages, 14 figures
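
    To make the 3*(n choose 4) count concrete, the sketch below enumerates every quartet topology uv|wx on n objects and attaches a weight computed from a distance matrix. The abstract does not say how the weights are derived; the rule used here, weight(uv|wx) = d(u, v) + d(w, x), is an assumption made for illustration, and the function name is hypothetical.

```python
from itertools import combinations
from math import comb


def quartet_topology_weights(dist):
    """Map each quartet topology ((u, v), (w, x)) to an assumed weight.

    dist is a symmetric n x n distance matrix given as a list of lists.
    Assumption: the weight of topology uv|wx is dist[u][v] + dist[w][x].
    """
    n = len(dist)
    weights = {}
    for u, v, w, x in combinations(range(n), 4):
        # Each set of four objects can be split into two pairs in 3 ways.
        for pair1, pair2 in (((u, v), (w, x)),
                             ((u, w), (v, x)),
                             ((u, x), (v, w))):
            weights[(pair1, pair2)] = (dist[pair1[0]][pair1[1]]
                                       + dist[pair2[0]][pair2[1]])
    return weights


if __name__ == "__main__":
    # Hypothetical 5-object distance matrix, for illustration only.
    d = [[0, 1, 4, 5, 6],
         [1, 0, 4, 5, 6],
         [4, 4, 0, 2, 6],
         [5, 5, 2, 0, 6],
         [6, 6, 6, 6, 0]]
    w = quartet_topology_weights(d)
    print(len(w), 3 * comb(len(d), 4))
```

    A tree search such as the one the abstract describes would then score a candidate tree by summing the weights of the topologies that tree embeds, one per quartet.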

    Normalized Web Distance and Word Similarity

    There is a great deal of work in cognitive psychology, linguistics, and computer science, about using word (or phrase) frequencies in context in text corpora to develop measures for word similarity or word association, going back to at least the 1960s. The goal of this chapter is to introduce the normalized web distance (NWD) method to determine similarity between words and phrases. It is a general way to tap the amorphous low-grade knowledge available for free on the Internet, typed in by local users aiming at personal gratification of diverse objectives, and yet globally achieving what is effectively the largest semantic electronic database in the world. Moreover, this database is available for all by using any search engine that can return aggregate page-count estimates for a large range of search-queries. In the paper introducing the NWD it was called "normalized Google distance (NGD)," but since Google doesn't allow computer searches anymore, we opt for the more neutral and descriptive NWD. Comment: Latex, 20 pages, 7 figures, to appear in: Handbook of Natural Language Processing, Second Edition, Nitin Indurkhya and Fred J. Damerau Eds., CRC Press, Taylor and Francis Group, Boca Raton, FL, 2010, ISBN 978-142008592

    Normalized Information Distance

    The normalized information distance is a universal distance measure for objects of all kinds. It is based on Kolmogorov complexity and thus uncomputable, but there are ways to utilize it. First, compression algorithms can be used to approximate the Kolmogorov complexity if the objects have a string representation. Second, for names and abstract concepts, page count statistics from the World Wide Web can be used. These practical realizations of the normalized information distance can then be applied to machine learning tasks, especially clustering, to perform feature-free and parameter-free data mining. This chapter discusses the theoretical foundations of the normalized information distance and both practical realizations. It presents numerous examples of successful real-world applications based on these distance measures, ranging from bioinformatics to music clustering to machine translation. Comment: 33 pages, 12 figures, pdf, in: Normalized information distance, in: Information Theory and Statistical Learning, Eds. M. Dehmer, F. Emmert-Streib, Springer-Verlag, New-York, To appear
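
    The first realization mentioned above, approximating Kolmogorov complexity with a compressor, is usually written as the normalized compression distance NCD(x, y) = (C(xy) - min{C(x), C(y)}) / max{C(x), C(y)}, where C is the compressed length. The sketch below assumes zlib as the compressor, which is a convenience choice for illustration and not something the abstract specifies; the function names and sample strings are hypothetical.

```python
import zlib


def compressed_size(data: bytes) -> int:
    """Length in bytes of the zlib-compressed input (stand-in for C)."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance between two non-empty byte strings."""
    cx, cy = compressed_size(x), compressed_size(y)
    cxy = compressed_size(x + y)  # compress the concatenation xy
    return (cxy - min(cx, cy)) / max(cx, cy)


if __name__ == "__main__":
    # Illustrative inputs only.
    text = b"the quick brown fox jumps over the lazy dog. " * 40
    digits = b"0123456789" * 180
    print("same text:     ", ncd(text, text))
    print("different data:", ncd(text, digits))
```

    Because a real compressor only approximates the ideal, the values are not exactly 0 for identical inputs and can slightly exceed 1 for unrelated ones; only the relative ordering of distances is meaningful for clustering.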

    Hierarchical structuring of Cultural Heritage objects within large aggregations

    Huge amounts of cultural content have been digitised and are available through digital libraries and aggregators like Europeana.eu. However, it is not easy for a user to have an overall picture of what is available nor to find related objects. We propose a method for hierarchically structuring cultural objects at different similarity levels. We describe a fast, scalable clustering algorithm with an automated field selection method for finding semantic clusters. We report a qualitative evaluation on the cluster categories based on records from the UK and a quantitative one on the results from the complete Europeana dataset. Comment: The paper has been published in the proceedings of the TPDL conference, see http://tpdl2013.info. For the final version see http://link.springer.com/chapter/10.1007%2F978-3-642-40501-3_2

    CFT Duals for Extreme Black Holes

    It is argued that the general four-dimensional extremal Kerr-Newman-AdS-dS black hole is holographically dual to a (chiral half of a) two-dimensional CFT, generalizing an argument given recently for the special case of extremal Kerr. Specifically, the asymptotic symmetries of the near-horizon region of the general extremal black hole are shown to be generated by a Virasoro algebra. Semiclassical formulae are derived for the central charge and temperature of the dual CFT as functions of the cosmological constant, Newton's constant and the black hole charges and spin. We then show, assuming the Cardy formula, that the microscopic entropy of the dual CFT precisely reproduces the macroscopic Bekenstein-Hawking area law. This CFT description becomes singular in the extreme Reissner-Nordstrom limit where the black hole has no spin. At this point a second dual CFT description is proposed in which the global part of the U(1) gauge symmetry is promoted to a Virasoro algebra. This second description is also found to reproduce the area law. Various further generalizations including higher dimensions are discussed. Comment: 18 pages; v2 minor change

    An almost sure limit theorem for super-Brownian motion

    We establish an almost sure scaling limit theorem for super-Brownian motion on $\mathbb{R}^d$ associated with the semi-linear equation $u_t = \frac{1}{2}\Delta u + \beta u - \alpha u^2$, where $\alpha$ and $\beta$ are positive constants. In this case, the spectral theoretical assumptions required in Chen et al. (2008) are not satisfied. An example is given to show that the main results also hold for some sub-domains in $\mathbb{R}^d$. Comment: 14 pages

    Evaluation of Fused Pyrrolothiazole Systems as Correctors of Mutant CFTR Protein

    Cystic fibrosis (CF) is a genetic disease caused by mutations that impair the function of the CFTR chloride channel. The most frequent mutation, F508del, causes misfolding and premature degradation of CFTR protein. This defect can be overcome with pharmacological agents named "correctors". So far, at least three different classes of correctors have been identified based on the additive/synergistic effects that are obtained when compounds of different classes are combined. The development of class 2 correctors has lagged behind that of compounds belonging to the other classes. It was shown that the efficacy of the prototypical class 2 corrector, the bithiazole corr-4a, could be improved by generating conformationally-locked bithiazoles. In the present study, we investigated the effect of tricyclic pyrrolothiazoles as analogues of constrained bithiazoles. Thirty-five compounds were tested using the functional assay based on the halide-sensitive yellow fluorescent protein (HS-YFP) that measured CFTR activity. One compound, having a six-atom carbocycle central ring in the tricyclic pyrrolothiazole system and bearing a pivalamide group at the thiazole moiety and a 5-chloro-2-methoxyphenyl carboxamide at the pyrrole ring, significantly increased F508del-CFTR activity. This compound could lead to the synthesis of a novel class of CFTR correctors.

    Cumulants and the moment algebra: tools for analysing weak measurements

    Recently it has been shown that cumulants significantly simplify the analysis of multipartite weak measurements. Here we consider the mathematical structure that underlies this, and find that it can be formulated in terms of what we call the moment algebra. Apart from resulting in simpler proofs, the flexibility of this structure allows generalizations of the original results to a number of weak measurement scenarios, including one where the weakly interacting pointers reach thermal equilibrium with the probed system. Comment: Journal reference added, minor correction