Automatic Meaning Discovery Using Google
We survey a new area of parameter-free similarity distance measures useful in
data-mining, pattern recognition, learning and automatic semantics extraction.
Given a family of distances on a set of objects, a distance is universal up to
a certain precision for that family if it minorizes every distance in the
family between every two objects in the set, up to the stated precision (we do
not require the universal distance to be an element of the family). We
consider similarity distances for two types of objects: literal objects that as
such contain all of their meaning, like genomes or books, and names for
objects. The latter may have literal embodiments like the first type, but may
also be abstract like ``red'' or ``christianity.'' For the first type we
consider a family of computable distance measures corresponding to parameters
expressing similarity according to particular features between pairs of literal
objects. For the second type we consider similarity distances generated by web
users corresponding to particular semantic relations between (the names for)
the designated objects. For both families we give universal similarity distance
measures, incorporating all particular distance measures in the family. In the
first case the universal distance is based on compression and in the second
case it is based on Google page counts related to search terms. In both cases
experiments on a massive scale give evidence of the viability of the
approaches.
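
The abstract above says the universal distance for literal objects is based on compression but does not give a formula. A minimal sketch, assuming the commonly cited normalized compression distance NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), where C is the compressed length under some real compressor. The choice of zlib and the helper names `compressed_size` and `ncd` are illustrative assumptions, not taken from the paper.

```python
import zlib


def compressed_size(data: bytes) -> int:
    """Length in bytes of `data` after zlib compression (assumed compressor)."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance between two byte strings.

    Values near 0 suggest the objects share most of their structure;
    values near 1 suggest they share little. Real compressors only
    approximate the ideal, so results can fall slightly outside [0, 1].
    """
    cx, cy = compressed_size(x), compressed_size(y)
    cxy = compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


if __name__ == "__main__":
    # Hypothetical inputs, chosen only to illustrate the call.
    a = b"the quick brown fox jumps over the lazy dog " * 20
    b = b"the quick brown fox leaps over the lazy cat " * 20
    print(ncd(a, a), ncd(a, b))
```

Because the measure needs only a compressor, it has no feature parameters to tune, which is the sense in which such distances are called parameter-free.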
The Google Similarity Distance
Words and phrases acquire meaning from the way they are used in society, from
their relative semantics to other words and phrases. For computers the
equivalent of `society' is `database,' and the equivalent of `use' is `way to
search the database.' We present a new theory of similarity between words and
phrases based on information distance and Kolmogorov complexity. To fix
thoughts we use the world-wide-web as database, and Google as search engine.
The method is also applicable to other search engines and databases. This
theory is then applied to construct a method to automatically extract
similarity, the Google similarity distance, of words and phrases from the
world-wide-web using Google page counts. The world-wide-web is the largest
database on earth, and the context information entered by millions of
independent users averages out to provide automatic semantics of useful
quality. We give applications in hierarchical clustering, classification, and
language translation. We give examples to distinguish between colors and
numbers, cluster names of paintings by 17th century Dutch masters and names of
books by English novelists, the ability to understand emergencies, and primes,
and we demonstrate the ability to do a simple automatic English-Spanish
translation. Finally, we use the WordNet database as an objective baseline
against which to judge the performance of our method. We conduct a massive
randomized trial in binary classification using support vector machines to
learn categories based on our Google distance, resulting in a mean agreement
of 87% with the expert crafted WordNet categories.
Comment: 15 pages, 10 figures; changed some text/figures/notation/part of
theorem. Incorporated referees' comments. This is the final published version
up to some minor changes in the galley proof.
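
The abstract above describes a distance computed from page counts without stating it. A minimal sketch, assuming the commonly cited form NGD(x, y) = (max(log f(x), log f(y)) - log f(x, y)) / (log N - min(log f(x), log f(y))), where f(x) is the number of pages containing x, f(x, y) the number containing both terms, and N the total number of indexed pages. The function name `ngd` and the counts in the usage lines are hypothetical; obtaining real page counts from a search engine is outside this sketch.

```python
import math


def ngd(fx: int, fy: int, fxy: int, n: int) -> float:
    """Normalized Google distance from page counts.

    fx, fy: pages containing each term alone (assumed inputs)
    fxy:    pages containing both terms
    n:      total number of indexed pages (a normalizing constant)
    """
    if min(fx, fy, fxy) <= 0:
        # Terms that never occur, or never co-occur, are treated as
        # infinitely far apart.
        return math.inf
    lx, ly, lxy = math.log(fx), math.log(fy), math.log(fxy)
    return (max(lx, ly) - lxy) / (math.log(n) - min(lx, ly))


if __name__ == "__main__":
    # Hypothetical counts, for illustration only.
    print(ngd(fx=9_000_000, fy=4_000_000, fxy=1_500_000, n=10_000_000_000))
    # Terms that always co-occur are at distance 0.
    print(ngd(fx=1000, fy=1000, fxy=1000, n=10**9))
```

Terms that tend to appear on the same pages get a small distance, while terms that rarely co-occur relative to their individual frequencies get a large one.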
Normalized Web Distance and Word Similarity
There is a great deal of work in cognitive psychology, linguistics, and
computer science, about using word (or phrase) frequencies in context in text
corpora to develop measures for word similarity or word association, going back
to at least the 1960s. The goal of this chapter is to introduce the
normalized web distance (NWD) method to determine similarity between words and
phrases. It is a general way to tap the amorphous low-grade knowledge available
for free on the Internet, typed in by local users aiming at personal
gratification of diverse objectives, and yet globally achieving what is
effectively the largest semantic electronic database in the world. Moreover,
this database is available for all by using any search engine that can return
aggregate page-count estimates for a large range of search-queries. In the
paper introducing the NWD it was called `normalized Google distance (NGD),' but
since Google doesn't allow computer searches anymore, we opt for the more
neutral and descriptive NWD.
Comment: Latex, 20 pages, 7 figures, to appear in: Handbook of Natural
Language Processing, Second Edition, Nitin Indurkhya and Fred J. Damerau
Eds., CRC Press, Taylor and Francis Group, Boca Raton, FL, 2010, ISBN
978-142008592
Application of Natural Language Processing and Evidential Analysis to Web-Based Intelligence Information Acquisition
The quality of decisions made in business and government relates directly to the quality of the information used to formulate the decision. This information may be retrieved from an organization's knowledge base (Intranet) or from the World Wide Web. Information held on an intelligence service's Intranet can be efficiently manipulated by technologies based upon either semantics, such as ontologies, or statistics, such as meaning-based computing. These technologies require complex processing of large amounts of textual information. However, they cannot currently be effectively applied to Web-based search due to various obstacles, such as lack of semantic tagging. A new approach proposed in this paper supports Web-based search for intelligence information utilizing evidence-based natural language processing (NLP). This approach combines traditional NLP methods for filtering of Web-search results, Grounded Theory to test the completeness of the evidence, and Evidential Analysis to test the quality of gathered information. The enriched information derived from the Web-search will be transferred to the intelligence service's knowledge base for handling by an effective Intranet search system, thus substantially increasing the information available for intelligence analysis. The paper will show that the quality of retrieved information is significantly enhanced by the discovery of previously unknown facts derived from known facts.
Geospatial data integration with Semantic Web services: the eMerges approach
Geographic space still lacks the semantics allowing a unified view of spatial data. Indeed, as a unique but all-encompassing domain, it presents specificities that geospatial applications are still unable to handle. Moreover, to be useful, new spatial applications need to match human cognitive abilities of spatial representation and reasoning. In this context, eMerges, an approach to geospatial data integration based on Semantic Web Services (SWS), allows the unified representation and manipulation of heterogeneous spatial data sources. eMerges provides this integration by mediating legacy spatial data sources to high-level spatial ontologies through SWS and by presenting, for each object, context-dependent affordances. This generic approach is applied here in the context of an emergency management use case developed in collaboration with emergency planners of public agencies.
Learning Object Categories From Internet Image Searches
In this paper, we describe a simple approach to learning models of visual object categories from images gathered from Internet image search engines. The images for a given keyword are typically highly variable, with a large fraction being unrelated to the query term, and thus pose a challenging environment from which to learn. By training our models directly from Internet images, we remove the need to laboriously compile training data sets, required by most other recognition approaches; this opens up the possibility of learning object category models “on-the-fly.” We describe two simple approaches, derived from the probabilistic latent semantic analysis (pLSA) technique for text document analysis, that can be used to automatically learn object models from these data. We show two applications of the learned model: first, to rerank the images returned by the search engine, thus improving the quality of the search engine; and second, to recognize objects in other image data sets.