140 research outputs found
The Google Similarity Distance
Words and phrases acquire meaning from the way they are used in society, from
their relative semantics to other words and phrases. For computers the
equivalent of `society' is `database,' and the equivalent of `use' is `way to
search the database.' We present a new theory of similarity between words and
phrases based on information distance and Kolmogorov complexity. To fix
thoughts we use the world-wide-web as database, and Google as search engine.
The method is also applicable to other search engines and databases. This
theory is then applied to construct a method to automatically extract
similarity, the Google similarity distance, of words and phrases from the
world-wide-web using Google page counts. The world-wide-web is the largest
database on earth, and the context information entered by millions of
independent users averages out to provide automatic semantics of useful
quality. We give applications in hierarchical clustering, classification, and
language translation. We give examples to distinguish between colors and
numbers, cluster names of paintings by 17th century Dutch masters and names of
books by English novelists, the ability to understand emergencies, and primes,
and we demonstrate the ability to do a simple automatic English-Spanish
translation. Finally, we use the WordNet database as an objective baseline
against which to judge the performance of our method. We conduct a massive
randomized trial in binary classification using support vector machines to
learn categories based on our Google distance, resulting in a mean agreement
of 87% with the expert-crafted WordNet categories.
Comment: 15 pages, 10 figures; changed some text/figures/notation/part of
theorem. Incorporated referees' comments. This is the final published version
up to some minor changes in the galley proof
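The page-count distance described in this abstract can be sketched as follows. This is a minimal illustration, assuming the standard normalized Google distance formula, NGD(x, y) = (max(log f(x), log f(y)) - log f(x, y)) / (log N - min(log f(x), log f(y))). The function name and the page counts passed to it are hypothetical; no real search engine is queried.

```python
import math


def normalized_google_distance(fx: int, fy: int, fxy: int, n: int) -> float:
    """Distance between two terms from page counts.

    fx, fy -- pages containing each term alone (hypothetical inputs)
    fxy    -- pages containing both terms
    n      -- assumed total number of indexed pages
    """
    if fx <= 0 or fy <= 0 or fxy <= 0:
        # Terms that never occur, or never co-occur, are treated as infinitely far apart.
        return math.inf
    log_fx, log_fy, log_fxy = math.log(fx), math.log(fy), math.log(fxy)
    return (max(log_fx, log_fy) - log_fxy) / (math.log(n) - min(log_fx, log_fy))


# Made-up counts: two terms with 1000 hits each that co-occur on 10 pages.
distance = normalized_google_distance(1000, 1000, 10, 10**6)
```

Terms that always co-occur get distance 0; the value grows as co-occurrence becomes rarer relative to the individual counts.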
A New Quartet Tree Heuristic for Hierarchical Clustering
We consider the problem of constructing an optimal-weight tree from the
3*(n choose 4) weighted quartet topologies on n objects, where optimality means
that the summed weight of the embedded quartet topologies is optimal (so it can
be the case that the optimal tree embeds all quartets as non-optimal
topologies). We present a heuristic for reconstructing the optimal-weight tree,
and a canonical manner to derive the quartet-topology weights from a given
distance matrix. The method repeatedly transforms a bifurcating tree, with all
objects involved as leaves, achieving a monotonic approximation to the exact
single globally optimal tree. This contrasts with other heuristic search methods
from biological phylogeny, like DNAML or quartet puzzling, which repeatedly and
incrementally construct a solution from a random order of objects, and
subsequently add agreement values.
Comment: 22 pages, 14 figure
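The count of 3*(n choose 4) quartet topologies can be made concrete with a short sketch. Each set of four objects {a, b, c, d} can be split into two pairs in three ways. The cost used here, the sum of the two within-pair distances, is an assumption chosen for illustration; the abstract only says that weights are derived from a distance matrix in a canonical manner. All function names are hypothetical.

```python
from itertools import combinations


def quartet_topologies(labels):
    """Yield every quartet topology as a pair of pairs, e.g. ((a, b), (c, d))."""
    for a, b, c, d in combinations(labels, 4):
        yield (a, b), (c, d)
        yield (a, c), (b, d)
        yield (a, d), (b, c)


def topology_cost(dist, topology):
    """Assumed cost of a topology: the sum of its two within-pair distances."""
    (u, v), (w, x) = topology
    return dist[u][v] + dist[w][x]


# Toy symmetric distance matrix on 4 objects: 0 and 1 are close, 2 and 3 are close.
dist = [
    [0.0, 0.1, 0.9, 0.8],
    [0.1, 0.0, 0.7, 0.9],
    [0.9, 0.7, 0.0, 0.2],
    [0.8, 0.9, 0.2, 0.0],
]
# Under the assumed cost, the cheapest topology pairs the close objects together.
best = min(quartet_topologies(range(4)), key=lambda t: topology_cost(dist, t))
```

The heuristic in the abstract searches over whole trees; this sketch only enumerates and scores the individual quartet topologies that such a tree would embed.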
Normalized Web Distance and Word Similarity
There is a great deal of work in cognitive psychology, linguistics, and
computer science, about using word (or phrase) frequencies in context in text
corpora to develop measures for word similarity or word association, going back
to at least the 1960s. The goal of this chapter is to introduce the
normalized web distance (NWD) method to determine similarity between words and
phrases. It is a general way to tap the amorphous low-grade knowledge available
for free on the Internet, typed in by local users aiming at personal
gratification of diverse objectives, and yet globally achieving what is
effectively the largest semantic electronic database in the world. Moreover,
this database is available for all by using any search engine that can return
aggregate page-count estimates for a large range of search-queries. In the
paper introducing the NWD it was called `normalized Google distance (NGD),' but
since Google doesn't allow computer searches anymore, we opt for the more
neutral and descriptive NWD.
Comment: LaTeX, 20 pages, 7 figures, to appear in: Handbook of Natural
Language Processing, Second Edition, Nitin Indurkhya and Fred J. Damerau
Eds., CRC Press, Taylor and Francis Group, Boca Raton, FL, 2010, ISBN
978-142008592
Normalized Information Distance
The normalized information distance is a universal distance measure for
objects of all kinds. It is based on Kolmogorov complexity and thus
uncomputable, but there are ways to utilize it. First, compression algorithms
can be used to approximate the Kolmogorov complexity if the objects have a
string representation. Second, for names and abstract concepts, page count
statistics from the World Wide Web can be used. These practical realizations of
the normalized information distance can then be applied to machine learning
tasks, especially clustering, to perform feature-free and parameter-free data
mining. This chapter discusses the theoretical foundations of the normalized
information distance and both practical realizations. It presents numerous
examples of successful real-world applications based on these distance
measures, ranging from bioinformatics to music clustering to machine
translation.
Comment: 33 pages, 12 figures, pdf, in: Normalized information distance, in:
Information Theory and Statistical Learning, Eds. M. Dehmer, F.
Emmert-Streib, Springer-Verlag, New-York, To appea
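The compression-based approximation mentioned in this abstract is commonly written as the normalized compression distance, NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), where C is the compressed length. The sketch below assumes zlib as the compressor; that choice and the function names are illustrative, not taken from the chapter.

```python
import zlib


def compressed_size(data: bytes) -> int:
    """Length of the zlib-compressed data, a rough stand-in for Kolmogorov complexity."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance between two byte strings."""
    cx, cy = compressed_size(x), compressed_size(y)
    cxy = compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


# Example inputs: two closely related texts.
text_a = b"the quick brown fox jumps over the lazy dog " * 20
text_b = b"the quick brown fox leaps over the lazy cat " * 20
similarity_score = ncd(text_a, text_b)
```

Smaller values indicate that one string helps compress the other, which is the sense in which the measure is feature-free: no domain-specific features are extracted before comparison.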
Hierarchical structuring of Cultural Heritage objects within large aggregations
Huge amounts of cultural content have been digitised and are available
through digital libraries and aggregators like Europeana.eu. However, it is not
easy for a user to have an overall picture of what is available nor to find
related objects. We propose a method for hierarchically structuring cultural
objects at different similarity levels. We describe a fast, scalable clustering
algorithm with an automated field selection method for finding semantic
clusters. We report a qualitative evaluation on the cluster categories based on
records from the UK and a quantitative one on the results from the complete
Europeana dataset.
Comment: The paper has been published in the proceedings of the TPDL
conference, see http://tpdl2013.info. For the final version see
http://link.springer.com/chapter/10.1007%2F978-3-642-40501-3_2
CFT Duals for Extreme Black Holes
It is argued that the general four-dimensional extremal Kerr-Newman-AdS-dS
black hole is holographically dual to a (chiral half of a) two-dimensional CFT,
generalizing an argument given recently for the special case of extremal Kerr.
Specifically, the asymptotic symmetries of the near-horizon region of the
general extremal black hole are shown to be generated by a Virasoro algebra.
Semiclassical formulae are derived for the central charge and temperature of
the dual CFT as functions of the cosmological constant, Newton's constant and
the black hole charges and spin. We then show, assuming the Cardy formula, that
the microscopic entropy of the dual CFT precisely reproduces the macroscopic
Bekenstein-Hawking area law. This CFT description becomes singular in the
extreme Reissner-Nordstrom limit where the black hole has no spin. At this
point a second dual CFT description is proposed in which the global part of the
U(1) gauge symmetry is promoted to a Virasoro algebra. This second description
is also found to reproduce the area law. Various further generalizations
including higher dimensions are discussed.
Comment: 18 pages; v2 minor change
An almost sure limit theorem for super-Brownian motion
We establish an almost sure scaling limit theorem for super-Brownian motion
on associated with the semi-linear equation , where and are positive constants. In
this case, the spectral theoretical assumptions required in Chen et al.
(2008) are not satisfied. An example is given to show that the main results
also hold for some sub-domains in .
Comment: 14 page
Evaluation of Fused Pyrrolothiazole Systems as Correctors of Mutant CFTR Protein
Cystic fibrosis (CF) is a genetic disease caused by mutations that impair the function of the CFTR chloride channel. The most frequent mutation, F508del, causes misfolding and premature degradation of CFTR protein. This defect can be overcome with pharmacological agents named "correctors". So far, at least three different classes of correctors have been identified based on the additive/synergistic effects that are obtained when compounds of different classes are combined. The development of class 2 correctors has lagged behind that of compounds belonging to the other classes. It was shown that the efficacy of the prototypical class 2 corrector, the bithiazole corr-4a, could be improved by generating conformationally-locked bithiazoles. In the present study, we investigated the effect of tricyclic pyrrolothiazoles as analogues of constrained bithiazoles. Thirty-five compounds were tested using the functional assay based on the halide-sensitive yellow fluorescent protein (HS-YFP) that measured CFTR activity. One compound, having a six-atom carbocycle central ring in the tricyclic pyrrolothiazole system and bearing a pivalamide group at the thiazole moiety and a 5-chloro-2-methoxyphenyl carboxamide at the pyrrole ring, significantly increased F508del-CFTR activity. This compound could lead to the synthesis of a novel class of CFTR correctors.
Cumulants and the moment algebra: tools for analysing weak measurements
Recently it has been shown that cumulants significantly simplify the analysis
of multipartite weak measurements. Here we consider the mathematical structure
that underlies this, and find that it can be formulated in terms of what we
call the moment algebra. Apart from resulting in simpler proofs, the
flexibility of this structure allows generalizations of the original results to
a number of weak measurement scenarios, including one where the weakly
interacting pointers reach thermal equilibrium with the probed system.
Comment: Journal reference added, minor correction