Faster Clustering via Preprocessing
We examine the efficiency of clustering a set of points when the
encompassing metric space may be preprocessed in advance. In computational
problems of this genre, there is a first stage of preprocessing, whose input is
a collection of points M; the next stage receives as input a query set
Q ⊆ M, and should report a clustering of Q according to some
objective, such as 1-median, in which case the answer is a point minimizing
the sum of distances to the points of Q.
We design fast algorithms that approximately solve such problems under
standard clustering objectives like k-center and k-median, when the metric
has low doubling dimension. By leveraging the preprocessing stage, our
algorithms achieve query time that is near-linear in the query size |Q|,
and is (almost) independent of the total number of points |M|.
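The 1-median objective above can be illustrated with a brute-force sketch (with no preprocessing, which is precisely the cost the paper's algorithms avoid); the point set M and query set Q below are made-up examples:

```python
# Brute-force 1-median: illustrates the objective only, not the paper's
# fast preprocessed algorithm.
import math

def dist(p, q):
    """Euclidean distance between points given as coordinate tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def one_median(M, Q):
    """Return the point c in M minimizing sum_{q in Q} dist(c, q)."""
    return min(M, key=lambda c: sum(dist(c, q) for q in Q))

M = [(0.0, 0.0), (1.0, 0.0), (5.0, 5.0), (0.5, 0.1)]  # preprocessed point set
Q = [(0.0, 0.0), (1.0, 0.0)]                          # query set
print(one_median(M, Q))  # (0.0, 0.0): ties (1.0, 0.0) at cost 1.0; min keeps the first
```

This scan costs time linear in |M| per query; the point of the paper is to push the per-query cost down to near-linear in |Q| instead.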
Statistical keyword detection in literary corpora
Understanding the complexity of human language requires an appropriate
analysis of the statistical distribution of words in texts. We consider the
information retrieval problem of detecting and ranking the relevant words of a
text by means of statistical information referring to the "spatial" use of the
words. Shannon's entropy of information is used as a tool for automatic keyword
extraction. By using The Origin of Species by Charles Darwin as a
representative text sample, we show the performance of our detector and compare
it with other proposals in the literature. The randomly shuffled text receives
special attention as a tool for calibrating the ranking indices.
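One common way to realize such an entropy-based detector is to split the text into equal parts and score each word by the Shannon entropy of its distribution over those parts: clustered (relevant) words score low, while evenly spread function words score high, and a shuffled copy of the text gives the calibration baseline. This is a generic sketch under that assumption, not the paper's exact detector, and the toy text is made up:

```python
# Score words by the "spatial" unevenness of their occurrences: low entropy
# over text parts means the word is clustered, a hallmark of relevant words.
import math
from collections import Counter

def entropy_scores(words, parts=8):
    """Map each word to the Shannon entropy of its distribution over parts."""
    size = max(1, len(words) // parts)
    counts = {}
    for i, w in enumerate(words):
        part = min(i // size, parts - 1)
        counts.setdefault(w, Counter())[part] += 1
    scores = {}
    for w, c in counts.items():
        total = sum(c.values())
        scores[w] = -sum((n / total) * math.log2(n / total) for n in c.values())
    return scores

# Toy text: "whale" occurs in one burst, "the"/"of" are spread throughout.
words = (["the", "of"] * 20) + ["whale"] * 10 + (["the", "of"] * 20)
scores = entropy_scores(words)
print(min(scores, key=scores.get))  # whale: lowest entropy, most clustered
```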
Intersecting Faces: Non-negative Matrix Factorization With New Guarantees
Non-negative matrix factorization (NMF) is a natural model of admixture and
is widely used in science and engineering. A plethora of algorithms have been
developed to tackle NMF, but due to the non-convex nature of the problem, there
is little guarantee on how well these methods work. Recently, a surge of
research has focused on a very restricted class of NMFs, called separable NMF,
where provably correct algorithms have been developed. In this paper, we
propose the notion of subset-separable NMF, which substantially generalizes the
property of separability. We show that subset-separability is a natural
necessary condition for the factorization to be unique or to have minimum
volume. We develop the Face-Intersect algorithm, which provably and
efficiently solves subset-separable NMF under natural conditions, and we prove
that our algorithm is robust to small noise. We explore the performance of
Face-Intersect in simulations and discuss settings where it empirically
outperforms state-of-the-art methods. Our work is a step towards finding
provably correct algorithms that solve large classes of NMF problems.
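For background, the baseline problem Face-Intersect addresses is the standard NMF factorization V ≈ WH with nonnegative factors. The sketch below uses the classical multiplicative updates of Lee and Seung, not the Face-Intersect algorithm itself, and illustrates why generic methods come with few guarantees: the objective is non-convex, so only local improvement is assured.

```python
# Classical multiplicative-update NMF (Lee & Seung), NOT Face-Intersect:
# factor a nonnegative V as W @ H with W, H >= 0.
import numpy as np

def nmf(V, rank, iters=500, eps=1e-9, seed=0):
    """Multiplicative updates for min ||V - W @ H||_F with W, H >= 0."""
    rng = np.random.default_rng(seed)
    n, m = V.shape
    W = rng.random((n, rank)) + eps
    H = rng.random((rank, m)) + eps
    for _ in range(iters):
        H *= (W.T @ V) / (W.T @ W @ H + eps)  # updates preserve nonnegativity
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

rng = np.random.default_rng(1)
V = rng.random((6, 4)) @ rng.random((4, 8))  # exactly rank-4 nonnegative data
W, H = nmf(V, rank=4)
err = np.linalg.norm(V - W @ H) / np.linalg.norm(V)
print(f"relative reconstruction error: {err:.4f}")
```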
Bibliographic Analysis on Research Publications using Authors, Categorical Labels and the Citation Network
Bibliographic analysis considers the author's research areas, the citation
network and the paper content among other things. In this paper, we combine
these three in a topic model that produces a bibliographic model of authors,
topics and documents, using a nonparametric extension of a combination of the
Poisson mixed-topic link model and the author-topic model. This gives rise to
the Citation Network Topic Model (CNTM). We propose a novel and efficient
inference algorithm for the CNTM to explore subsets of research publications
from CiteSeerX. The publication datasets are organised into three corpora,
totalling about 168k publications with about 62k authors. The queried
datasets are made available online. In three publicly available corpora in
addition to the queried datasets, our proposed model demonstrates an improved
performance in both model fitting and document clustering, compared to several
baselines. Moreover, our model allows extraction of additional useful knowledge
from the corpora, such as the visualisation of the author-topics network.
Additionally, we propose a simple method to incorporate supervision into topic
modelling to achieve further improvement on the clustering task.
The Google Similarity Distance
Words and phrases acquire meaning from the way they are used in society, from
their relative semantics to other words and phrases. For computers the
equivalent of `society' is `database,' and the equivalent of `use' is `way to
search the database.' We present a new theory of similarity between words and
phrases based on information distance and Kolmogorov complexity. To fix
thoughts we use the world-wide-web as database, and Google as search engine.
The method is also applicable to other search engines and databases. This
theory is then applied to construct a method to automatically extract
similarity, the Google similarity distance, of words and phrases from the
world-wide-web using Google page counts. The world-wide-web is the largest
database on earth, and the context information entered by millions of
independent users averages out to provide automatic semantics of useful
quality. We give applications in hierarchical clustering, classification, and
language translation. We give examples to distinguish between colors and
numbers, cluster names of paintings by 17th century Dutch masters and names of
books by English novelists, the ability to understand emergencies, and primes,
and we demonstrate the ability to do a simple automatic English-Spanish
translation. Finally, we use the WordNet database as an objective baseline
against which to judge the performance of our method. We conduct a massive
randomized trial in binary classification using support vector machines to
learn categories based on our Google distance, resulting in a mean agreement
of 87% with the expert-crafted WordNet categories.
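The distance described above is the normalized Google distance (NGD), computed from the page counts f(x) and f(y) of the individual terms, the joint count f(x, y), and the total number N of indexed pages. The counts below are made-up illustrative numbers, not real hit counts:

```python
# Normalized Google distance from page counts.
from math import log

def ngd(fx, fy, fxy, N):
    """NGD(x, y) = (max(log fx, log fy) - log fxy)
                   / (log N - min(log fx, log fy))."""
    lx, ly = log(fx), log(fy)
    return (max(lx, ly) - log(fxy)) / (log(N) - min(lx, ly))

N = 8e9  # illustrative total page count, of the order used in the paper's era
print(ngd(1e6, 2e6, 5e5, N))  # frequently co-occurring terms: small distance
print(ngd(1e6, 2e6, 10, N))   # rarely co-occurring terms: larger distance
```

Terms that almost always appear together get a distance near 0, while terms that co-occur far less than their individual frequencies predict get a large distance; this is what the clustering and classification experiments feed into the support vector machines.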