A Comparison of Techniques for Sampling Web Pages
As the World Wide Web grows rapidly, it is increasingly challenging to gather representative information about it. Instead of crawling the web exhaustively, one has to resort to other techniques, such as sampling, to determine its properties. A uniform random sample of the web would be useful for determining the percentage of web pages in a specific language, on a topic, or in a top-level domain. Unfortunately, no approach has been shown to sample web pages in an unbiased way. Three promising web sampling algorithms are based on random walks. Each has been evaluated individually, but evaluations on different data sets do not permit a direct comparison. In this paper we compare these algorithms directly: we performed three random walks on the web under the same conditions and analyzed their outcomes in detail. We discuss the strengths and weaknesses of each algorithm and propose improvements based on the experimental results.
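To make the random-walk idea concrete, here is a minimal sketch (not the paper's code) of a Metropolis-Hastings random walk, one standard way to approximate a uniform node sample from a link graph; the toy adjacency dict and all names are illustrative assumptions.

```python
import random

def mh_random_walk(graph, start, steps):
    """Walk `steps` steps, accepting a move with probability
    min(1, deg(current)/deg(neighbor)) so that the walk's stationary
    distribution is uniform over nodes rather than degree-biased."""
    samples = []
    current = start
    for _ in range(steps):
        neighbor = random.choice(graph[current])
        # Bias correction: moves toward high-degree pages are sometimes
        # rejected, undoing the walk's natural preference for them.
        if random.random() < len(graph[current]) / len(graph[neighbor]):
            current = neighbor
        samples.append(current)
    return samples

if __name__ == "__main__":
    # Tiny symmetric toy "web"; a stand-in for real crawled link data.
    toy_web = {
        "a": ["b", "c"],
        "b": ["a", "c", "d"],
        "c": ["a", "b"],
        "d": ["b"],
    }
    walk = mh_random_walk(toy_web, "a", 10_000)
    # After a burn-in, sample frequencies should be roughly uniform.
    for page in sorted(toy_web):
        print(page, walk[2_000:].count(page))
```

An uncorrected simple random walk would instead oversample high-degree pages, which is exactly the kind of bias the paper's comparison is concerned with.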
Naive Bayes vs. Decision Trees vs. Neural Networks in the Classification of Training Web Pages
Web classification has been attempted through many different technologies. In this study we concentrate on comparing Neural Network (NN), Naïve Bayes (NB), and Decision Tree (DT) classifiers for the automatic analysis and classification of attribute data from training course web pages. We introduce an enhanced NB classifier and run the same data sample through the DT and NN classifiers to determine the success rate of our classifier in the training courses domain. This research shows that our enhanced NB classifier not only outperforms the traditional NB classifier, but also performs as well as, if not better than, some more popular rival techniques. This paper also shows that, overall, our NB classifier is the best choice for the training courses domain, achieving an F-measure of over 97% despite being trained with fewer samples than any of the classification systems we have encountered.
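As background for the comparison, a baseline multinomial Naive Bayes classifier over tokenized page text might look like the sketch below; the tiny training set and whitespace tokenization are invented for illustration and this is not the paper's enhanced classifier.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs):
    """docs: list of (token_list, label). Returns log-priors and
    Laplace-smoothed per-class word log-likelihoods."""
    class_counts = Counter(label for _, label in docs)
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in docs:
        word_counts[label].update(tokens)
        vocab.update(tokens)
    priors = {c: math.log(n / len(docs)) for c, n in class_counts.items()}
    loglik = {}
    for c in class_counts:
        total = sum(word_counts[c].values()) + len(vocab)
        loglik[c] = {w: math.log((word_counts[c][w] + 1) / total)
                     for w in vocab}
    return priors, loglik, vocab

def classify(tokens, priors, loglik, vocab):
    # Pick the class with the highest posterior log-probability.
    scores = {c: priors[c] + sum(loglik[c][w] for w in tokens if w in vocab)
              for c in priors}
    return max(scores, key=scores.get)

training = [  # invented toy data for the training-courses domain
    ("intro to python course enrol now".split(), "course"),
    ("advanced java training dates and fees".split(), "course"),
    ("our company history and mission".split(), "other"),
    ("contact us for press enquiries".split(), "other"),
]
model = train_nb(training)
print(classify("python training fees".split(), *model))  # -> "course"
```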
Updating collection representations for federated search
To facilitate the search for relevant information across a set of online distributed collections, a federated information retrieval system typically represents each collection, centrally, by a set of vocabularies or sampled documents. Accurate retrieval therefore depends on how precisely each representation reflects the underlying content stored in that collection. As collections evolve over time, collection representations should be updated to reflect any changes; however, no solution for this has yet been proposed. In this study we examine the implications of out-of-date representation sets on retrieval accuracy and propose three different policies for managing the necessary updates. Each policy is evaluated on a testbed of forty-four dynamic collections over an eight-week period. Our findings show that out-of-date representations significantly degrade performance over time, but that adopting a suitable update policy can minimise this problem.
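As a hedged illustration of what an update policy might look like, the sketch below re-samples any collection whose freshly probed vocabulary has drifted too far from the stored representation; the drift measure, threshold, and helper names are assumptions for illustration, not the paper's actual three policies.

```python
import random

def jaccard(a, b):
    """Set overlap in [0, 1]; 1.0 means identical vocabularies."""
    return len(a & b) / len(a | b) if a | b else 1.0

def update_policy(collections, representations, threshold=0.7, probe=50):
    """Probe a small random sample of each live collection and refresh
    any stored representation whose vocabulary has drifted."""
    for name, docs in collections.items():
        probe_vocab = set()
        for doc in random.sample(docs, min(probe, len(docs))):
            probe_vocab.update(doc.split())
        if jaccard(probe_vocab, representations[name]) < threshold:
            # Representation is stale: rebuild it from the fresh probe.
            representations[name] = probe_vocab
            print(f"refreshed representation for {name}")

if __name__ == "__main__":
    live = {"news": ["election results today", "storm warning issued"]}
    reps = {"news": {"football", "scores"}}   # badly out of date
    update_policy(live, reps)
```

A real system would trade off probe cost against staleness, which is exactly the tension the paper's eight-week evaluation explores.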
LiveRank: How to Refresh Old Datasets
This paper considers the problem of refreshing a dataset. More precisely, given a collection of nodes gathered at some time (web pages, users from an online social network) along with some structure (hyperlinks, social relationships), we want to identify a significant fraction of the nodes that still exist at present time. The liveness of an old node can be tested through an online query at present time. We call a LiveRank a ranking of the old pages such that active nodes are more likely to appear first. The quality of a LiveRank is measured by the number of queries necessary to identify a given fraction of the active nodes when using the LiveRank order. We study different scenarios, from a static setting where the LiveRank is computed before any query is made, to dynamic settings where the LiveRank can be updated as queries are processed. Our results show that building on PageRank can lead to efficient LiveRanks, for web graphs as well as for online social networks.
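A minimal sketch of the static LiveRank idea under toy assumptions: rank old nodes by PageRank, probe liveness in that order, and count the queries needed to recover a target fraction of the active nodes. The graph, liveness oracle, and parameters are invented for illustration.

```python
def pagerank(graph, damping=0.85, iters=50):
    """Plain power-iteration PageRank over a dict-of-lists graph."""
    nodes = list(graph)
    rank = {n: 1 / len(nodes) for n in nodes}
    for _ in range(iters):
        new = {n: (1 - damping) / len(nodes) for n in nodes}
        for n in nodes:
            out = graph[n]
            if not out:
                continue
            share = damping * rank[n] / len(out)
            for m in out:
                new[m] += share
        rank = new
    return rank

def queries_to_fraction(order, is_live, fraction=0.5):
    """Number of liveness queries needed to find `fraction` of all
    live nodes when probing in the given order (the LiveRank metric)."""
    target = fraction * sum(is_live(n) for n in order)
    found = 0
    for i, n in enumerate(order, 1):
        found += is_live(n)
        if found >= target:
            return i
    return len(order)

if __name__ == "__main__":
    old_graph = {"a": ["b"], "b": ["a", "c"], "c": ["a"], "d": ["a"]}
    alive = {"a", "b"}                       # toy liveness oracle
    order = sorted(old_graph, key=pagerank(old_graph).get, reverse=True)
    print(queries_to_fraction(order, alive.__contains__))
```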
Crawling Facebook for Social Network Analysis Purposes
We describe our work on the collection and analysis of massive data describing the connections between participants in online social networks. Alternative approaches to social network data collection are defined and evaluated in practice against the popular Facebook web site. Thanks to our ad hoc, privacy-compliant crawlers, two large samples, comprising millions of connections, have been collected; the data is anonymous and organized as an undirected graph. We describe a set of tools that we developed to analyze specific properties of such social-network graphs, including degree distribution, centrality measures, scaling laws, and the distribution of friendship.
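One of the listed analyses, the degree distribution, is straightforward to compute from an anonymized undirected edge list; the sketch below uses invented toy edges in place of the crawled samples.

```python
from collections import Counter

# Illustrative anonymized edge list; a stand-in for the crawled data.
edges = [(1, 2), (1, 3), (2, 3), (3, 4), (4, 5)]

# In an undirected graph each edge contributes to both endpoints.
degree = Counter()
for u, v in edges:
    degree[u] += 1
    degree[v] += 1

# Histogram: how many nodes have each degree.
distribution = Counter(degree.values())
for d in sorted(distribution):
    print(f"degree {d}: {distribution[d]} node(s)")
```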
Universal Indexes for Highly Repetitive Document Collections
Indexing highly repetitive collections has become a relevant problem with the emergence of large repositories of versioned documents, among other applications. These collections may reach huge sizes, but are formed mostly of documents that are near-copies of others. Traditional techniques for indexing these collections fail to properly exploit their regularities in order to reduce space.

We introduce new techniques for compressing inverted indexes that exploit this near-copy regularity. They are based on run-length, Lempel-Ziv, or grammar compression of the differential inverted lists, instead of the usual practice of gap-encoding them. We show that, in this highly repetitive setting, our compression methods significantly reduce the space obtained with classical techniques, at the price of moderate slowdowns. Moreover, our best methods are universal, that is, they do not need to know the versioning structure of the collection, nor that a clear versioning structure even exists.

We also introduce compressed self-indexes in the comparison. These are designed for general strings (not only natural language texts) and represent the text collection plus the index structure (not an inverted index) in integrated form. We show that these techniques can compress much further, using a small fraction of the space required by our new inverted indexes. Yet, they are orders of magnitude slower.

Comment: This research has received funding from the European Union's Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie Actions H2020-MSCA-RISE-2015 BIRDS GA No. 69094
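To illustrate the contrast the abstract draws, the sketch below gap-encodes a posting list and then run-length compresses the gaps, which pays off when near-copy documents produce long runs of identical gaps; the posting list is invented, and the paper's actual methods also cover Lempel-Ziv and grammar compression of differential lists.

```python
from itertools import groupby

def gaps(postings):
    """Classical gap encoding: store the first docid, then differences."""
    return [postings[0]] + [b - a for a, b in zip(postings, postings[1:])]

def run_length(seq):
    """Collapse runs of identical values into (value, run_length) pairs."""
    return [(value, len(list(group))) for value, group in groupby(seq)]

# Versioned, near-copy documents often yield regularly spaced postings:
postings = [5, 6, 7, 8, 105, 106, 107, 108, 205, 206, 207, 208]
g = gaps(postings)
print(g)              # [5, 1, 1, 1, 97, 1, 1, 1, 97, 1, 1, 1]
print(run_length(g))  # long runs of identical gaps compress well
```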
Invisible Pixels Are Dead, Long Live Invisible Pixels!
Privacy on the World Wide Web has deteriorated ever since the 1990s. The tracking of browsing habits by different third parties has been at the center of this deterioration. Web cookies and so-called web beacons have been the classical ways to implement third-party tracking. Given the introduction of more sophisticated technical tracking solutions and other fundamental transformations, the use of classical image-based web beacons might be expected to have lost its appeal. According to a sample of over thirty thousand images collected from popular websites, this paper shows that such an assumption is a fallacy: classical 1 x 1 images are still commonly used for third-party tracking on the contemporary web. While it seems that ad-blockers are unable to fully block these classical image-based tracking beacons, the paper further demonstrates that even limited information can be used to accurately distinguish the third-party 1 x 1 images from other images. An average classification accuracy of 0.956 is reached in the empirical experiment. With these results the paper contributes to the ongoing attempts to better understand the lack of privacy on the World Wide Web and the means by which the situation might eventually be improved.

Comment: Forthcoming in the 17th Workshop on Privacy in the Electronic Society (WPES 2018), Toronto, AC
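As a hedged illustration of classifying 1 x 1 tracker images from limited information, the rule-based sketch below uses invented features and thresholds; the paper's experiment relies on a trained statistical classifier rather than hand-set rules.

```python
def looks_like_tracking_pixel(img):
    """img: dict with pixel dimensions, byte size, and whether the image
    is served from a different domain than the embedding page."""
    return (
        img["width"] == 1 and img["height"] == 1
        and img["third_party"]
        and img["bytes"] < 100   # tiny payload, e.g. a blank 1 x 1 GIF
    )

# Invented sample records standing in for crawled image metadata.
samples = [
    {"width": 1, "height": 1, "bytes": 43, "third_party": True},
    {"width": 600, "height": 400, "bytes": 52_311, "third_party": False},
]
for s in samples:
    print(looks_like_tracking_pixel(s))
```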
- …