Faster Clustering via Preprocessing
We examine the efficiency of clustering a set of points when the
encompassing metric space may be preprocessed in advance. In computational
problems of this genre, there is a first stage of preprocessing, whose input is
a collection of points M; the next stage receives as input a query set
Q ⊆ M, and should report a clustering of Q according to some
objective, such as 1-median, in which case the answer is a point minimizing
the sum of distances to the points of Q.
We design fast algorithms that approximately solve such problems under
standard clustering objectives like k-center and k-median, when the metric
has low doubling dimension. By leveraging the preprocessing stage, our
algorithms achieve query time that is near-linear in the query size |Q|,
and is (almost) independent of the total number of points |M|.
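The 1-median objective above can be illustrated with a brute-force sketch (with no preprocessing, which is precisely the cost the paper's algorithms avoid); the point set M and query set Q below are made-up examples:

```python
# Brute-force 1-median: illustrates the objective only, not the paper's
# fast preprocessed algorithm.
import math

def dist(p, q):
    """Euclidean distance between points given as coordinate tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def one_median(M, Q):
    """Return the point c in M minimizing sum_{q in Q} dist(c, q)."""
    return min(M, key=lambda c: sum(dist(c, q) for q in Q))

M = [(0.0, 0.0), (1.0, 0.0), (5.0, 5.0), (0.5, 0.1)]  # preprocessed point set
Q = [(0.0, 0.0), (1.0, 0.0)]                          # query set
print(one_median(M, Q))  # (0.0, 0.0): ties (1.0, 0.0) at cost 1.0; min keeps the first
```

This scan costs time linear in |M| per query; the point of the paper is to push the per-query cost down to near-linear in |Q| instead.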
Statistical keyword detection in literary corpora
Understanding the complexity of human language requires an appropriate
analysis of the statistical distribution of words in texts. We consider the
information retrieval problem of detecting and ranking the relevant words of a
text by means of statistical information referring to the "spatial" use of the
words. Shannon's entropy of information is used as a tool for automatic keyword
extraction. By using The Origin of Species by Charles Darwin as a
representative text sample, we show the performance of our detector and compare
it with other proposals in the literature. The randomly shuffled text receives
special attention as a tool for calibrating the ranking indices.
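One common way to realize such an entropy-based detector is to split the text into equal parts and score each word by the Shannon entropy of its distribution over those parts: clustered (relevant) words score low, while evenly spread function words score high, and a shuffled copy of the text gives the calibration baseline. This is a generic sketch under that assumption, not the paper's exact detector, and the toy text is made up:

```python
# Score words by the "spatial" unevenness of their occurrences: low entropy
# over text parts means the word is clustered, a hallmark of relevant words.
import math
from collections import Counter

def entropy_scores(words, parts=8):
    """Map each word to the Shannon entropy of its distribution over parts."""
    size = max(1, len(words) // parts)
    counts = {}
    for i, w in enumerate(words):
        part = min(i // size, parts - 1)
        counts.setdefault(w, Counter())[part] += 1
    scores = {}
    for w, c in counts.items():
        total = sum(c.values())
        scores[w] = -sum((n / total) * math.log2(n / total) for n in c.values())
    return scores

# Toy text: "whale" occurs in one burst, "the"/"of" are spread throughout.
words = (["the", "of"] * 20) + ["whale"] * 10 + (["the", "of"] * 20)
scores = entropy_scores(words)
print(min(scores, key=scores.get))  # whale: lowest entropy, most clustered
```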
Intersecting Faces: Non-negative Matrix Factorization With New Guarantees
Non-negative matrix factorization (NMF) is a natural model of admixture and
is widely used in science and engineering. A plethora of algorithms have been
developed to tackle NMF, but due to the non-convex nature of the problem, there
is little guarantee on how well these methods work. Recently, a surge of
research has focused on a very restricted class of NMFs, called separable NMF,
where provably correct algorithms have been developed. In this paper, we
propose the notion of subset-separable NMF, which substantially generalizes the
property of separability. We show that subset-separability is a natural
necessary condition for the factorization to be unique or to have minimum
volume. We develop the Face-Intersect algorithm, which provably and
efficiently solves subset-separable NMF under natural conditions, and we prove
that our algorithm is robust to small noise. We explore the performance of
Face-Intersect in simulations and discuss settings where it empirically
outperforms state-of-the-art methods. Our work is a step towards finding
provably correct algorithms that solve large classes of NMF problems.
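For background, the baseline problem Face-Intersect addresses is the standard NMF factorization V ≈ WH with nonnegative factors. The sketch below uses the classical multiplicative updates of Lee and Seung, not the Face-Intersect algorithm itself, and illustrates why generic methods come with few guarantees: the objective is non-convex, so only local improvement is assured.

```python
# Classical multiplicative-update NMF (Lee & Seung), NOT Face-Intersect:
# factor a nonnegative V as W @ H with W, H >= 0.
import numpy as np

def nmf(V, rank, iters=500, eps=1e-9, seed=0):
    """Multiplicative updates for min ||V - W @ H||_F with W, H >= 0."""
    rng = np.random.default_rng(seed)
    n, m = V.shape
    W = rng.random((n, rank)) + eps
    H = rng.random((rank, m)) + eps
    for _ in range(iters):
        H *= (W.T @ V) / (W.T @ W @ H + eps)  # updates preserve nonnegativity
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

rng = np.random.default_rng(1)
V = rng.random((6, 4)) @ rng.random((4, 8))  # exactly rank-4 nonnegative data
W, H = nmf(V, rank=4)
err = np.linalg.norm(V - W @ H) / np.linalg.norm(V)
print(f"relative reconstruction error: {err:.4f}")
```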
Bibliographic Analysis on Research Publications using Authors, Categorical Labels and the Citation Network
Bibliographic analysis considers the author's research areas, the citation
network and the paper content among other things. In this paper, we combine
these three in a topic model that produces a bibliographic model of authors,
topics and documents, using a nonparametric extension of a combination of the
Poisson mixed-topic link model and the author-topic model. This gives rise to
the Citation Network Topic Model (CNTM). We propose a novel and efficient
inference algorithm for the CNTM to explore subsets of research publications
from CiteSeerX. The publication datasets are organised into three corpora,
totalling about 168k publications with about 62k authors. The queried
datasets are made available online. In three publicly available corpora in
addition to the queried datasets, our proposed model demonstrates an improved
performance in both model fitting and document clustering, compared to several
baselines. Moreover, our model allows extraction of additional useful knowledge
from the corpora, such as the visualisation of the author-topics network.
Additionally, we propose a simple method to incorporate supervision into topic
modelling to achieve further improvement on the clustering task.
The Google Similarity Distance
Words and phrases acquire meaning from the way they are used in society, from
their relative semantics to other words and phrases. For computers the
equivalent of `society' is `database,' and the equivalent of `use' is `way to
search the database.' We present a new theory of similarity between words and
phrases based on information distance and Kolmogorov complexity. To fix
thoughts we use the world-wide-web as database, and Google as search engine.
The method is also applicable to other search engines and databases. This
theory is then applied to construct a method to automatically extract
similarity, the Google similarity distance, of words and phrases from the
world-wide-web using Google page counts. The world-wide-web is the largest
database on earth, and the context information entered by millions of
independent users averages out to provide automatic semantics of useful
quality. We give applications in hierarchical clustering, classification, and
language translation. We give examples to distinguish between colors and
numbers, cluster names of paintings by 17th century Dutch masters and names of
books by English novelists, the ability to understand emergencies, and primes,
and we demonstrate the ability to do a simple automatic English-Spanish
translation. Finally, we use the WordNet database as an objective baseline
against which to judge the performance of our method. We conduct a massive
randomized trial in binary classification using support vector machines to
learn categories based on our Google distance, resulting in a mean agreement
of 87% with the expert-crafted WordNet categories.
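The distance described above is the normalized Google distance (NGD), computed from the page counts f(x) and f(y) of the individual terms, the joint count f(x, y), and the total number N of indexed pages. The counts below are made-up illustrative numbers, not real hit counts:

```python
# Normalized Google distance from page counts.
from math import log

def ngd(fx, fy, fxy, N):
    """NGD(x, y) = (max(log fx, log fy) - log fxy)
                   / (log N - min(log fx, log fy))."""
    lx, ly = log(fx), log(fy)
    return (max(lx, ly) - log(fxy)) / (log(N) - min(lx, ly))

N = 8e9  # illustrative total page count, of the order used in the paper's era
print(ngd(1e6, 2e6, 5e5, N))  # frequently co-occurring terms: small distance
print(ngd(1e6, 2e6, 10, N))   # rarely co-occurring terms: larger distance
```

Terms that almost always appear together get a distance near 0, while terms that co-occur far less than their individual frequencies predict get a large distance; this is what the clustering and classification experiments feed into the support vector machines.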