51 research outputs found
A Taxonomy of Hyperlink Hiding Techniques
Hidden links are designed solely for search engines rather than visitors. To
get high search engine rankings, link hiding techniques are usually used for
the profitability of black industries, such as illicit game servers, false
medical services, illegal gambling, and less attractive high-profit industry,
etc. This paper investigates hyperlink hiding techniques on the Web, and gives
a detailed taxonomy. We believe the taxonomy can help develop appropriate
countermeasures. Study on 5,583,451 Chinese sites' home pages indicate that
link hidden techniques are very prevalent on the Web. We also tried to explore
the attitude of Google towards link hiding spam by analyzing the PageRank
values of relative links. The results show that more should be done to punish
the hidden link spam.Comment: 12 pages, 2 figure
Representation Independent Analytics Over Structured Data
Database analytics algorithms leverage quantifiable structural properties of
the data to predict interesting concepts and relationships. The same
information, however, can be represented using many different structures and
the structural properties observed over particular representations do not
necessarily hold for alternative structures. Thus, there is no guarantee that
current database analytics algorithms will still provide the correct insights,
no matter what structures are chosen to organize the database. Because these
algorithms tend to be highly effective over some choices of structure, such as
that of the databases used to validate them, but not so effective with others,
database analytics has largely remained the province of experts who can find
the desired forms for these algorithms. We argue that in order to make database
analytics usable, we should use or develop algorithms that are effective over a
wide range of choices of structural organizations. We introduce the notion of
representation independence, study its fundamental properties for a wide range
of data analytics algorithms, and empirically analyze the amount of
representation independence of some popular database analytics algorithms. Our
results indicate that most algorithms are not generally representation
independent and find the characteristics of more representation independent
heuristics under certain representational shifts
PageRank: Standing on the shoulders of giants
PageRank is a Web page ranking technique that has been a fundamental
ingredient in the development and success of the Google search engine. The
method is still one of the many signals that Google uses to determine which
pages are most important. The main idea behind PageRank is to determine the
importance of a Web page in terms of the importance assigned to the pages
hyperlinking to it. In fact, this thesis is not new, and has been previously
successfully exploited in different contexts. We review the PageRank method and
link it to some renowned previous techniques that we have found in the fields
of Web information retrieval, bibliometrics, sociometry, and econometrics
The Bregman Variational Dual-Tree Framework
Graph-based methods provide a powerful tool set for many non-parametric
frameworks in Machine Learning. In general, the memory and computational
complexity of these methods is quadratic in the number of examples in the data
which makes them quickly infeasible for moderate to large scale datasets. A
significant effort to find more efficient solutions to the problem has been
made in the literature. One of the state-of-the-art methods that has been
recently introduced is the Variational Dual-Tree (VDT) framework. Despite some
of its unique features, VDT is currently restricted only to Euclidean spaces
where the Euclidean distance quantifies the similarity. In this paper, we
extend the VDT framework beyond the Euclidean distance to more general Bregman
divergences that include the Euclidean distance as a special case. By
exploiting the properties of the general Bregman divergence, we show how the
new framework can maintain all the pivotal features of the VDT framework and
yet significantly improve its performance in non-Euclidean domains. We apply
the proposed framework to different text categorization problems and
demonstrate its benefits over the original VDT.Comment: Appears in Proceedings of the Twenty-Ninth Conference on Uncertainty
in Artificial Intelligence (UAI2013
Second order accurate distributed eigenvector computation for extremely large matrices
We propose a second-order accurate method to estimate the eigenvectors of
extremely large matrices thereby addressing a problem of relevance to
statisticians working in the analysis of very large datasets. More
specifically, we show that averaging eigenvectors of randomly subsampled
matrices efficiently approximates the true eigenvectors of the original matrix
under certain conditions on the incoherence of the spectral decomposition. This
incoherence assumption is typically milder than those made in matrix completion
and allows eigenvectors to be sparse. We discuss applications to spectral
methods in dimensionality reduction and information retrieval.Comment: Complete proofs are included on averaging performanc
- …