Data Cleaning Methods for Client and Proxy Logs
In this paper we present our experiences with the cleaning of Web client and proxy usage logs, based on a long-term browsing study with 25 participants. A detailed clickstream log, recorded using a Web intermediary, was combined with a second log of user interface actions, captured by a modified Firefox browser for a subset of the participants. The consolidated data from both records revealed many page requests that were not directly related to user actions. For participants who had no ad-filtering system installed, these artifacts made up one third of all transferred Web pages. Three major causes were identified: HTML frames and iframes, advertisements, and automatic page reloads. The experience gained during the data cleaning process may help other researchers choose adequate filtering methods for their own data.
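The three causes named above suggest simple per-request filters. The sketch below is illustrative only: the field names, the heuristics, and the ad-host blocklist are assumptions, not the authors' actual cleaning pipeline.

```python
# Hypothetical sketch of the cleaning idea: drop proxy-log page requests
# that were not triggered by a deliberate user action. Field names and
# heuristics are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class Request:
    url: str
    is_frame: bool      # sub-request from an HTML frame or iframe
    host: str
    auto_refresh: bool  # automatic reload (e.g. meta refresh, script)

# Illustrative blocklist; a real study would use an ad-filter list.
AD_HOSTS = {"ads.example.com", "doubleclick.net"}

def is_user_page(req: Request) -> bool:
    """Keep only requests plausibly caused by a user action."""
    if req.is_frame:          # frames/iframes load without a click
        return False
    if req.host in AD_HOSTS:  # advertisements
        return False
    if req.auto_refresh:      # automatic page reloads
        return False
    return True

def clean(log):
    return [r for r in log if is_user_page(r)]
```

Each filter corresponds to one of the three artifact causes the paper reports; in practice the frame and reload signals would have to be reconstructed from referrer and timing information in the logs.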
Web users' information retrieval methods and skills
When trying to locate information on the Web, people are faced with a variety of options. This research reviewed how a group of health-related professionals approached the task of finding a named document. Most were eventually successful, but the majority encountered problems in their search techniques. Even experienced Web users had problems when working with an unfamiliar interface and without access to their favourites. No relationship was found between the number of years' experience Web users had and the efficiency of their searching strategy. The research concludes that if people are to use the Web quickly and efficiently as an effective information retrieval tool, rather than merely a recreational tool for surfing the Internet, they need both an understanding of the medium and its tools and the skills to use them effectively; the majority of participants in this study lacked both.
Characterizations of User Web Revisit Behavior
In this article we update and extend earlier long-term studies of users' page revisit behavior. Revisits ar
VEWS: A Wikipedia Vandal Early Warning System
We study the problem of detecting vandals on Wikipedia before any human or
known vandalism detection system flags them, so that potential vandals can be
presented early to Wikipedia administrators. We leverage
multiple classical ML approaches, but develop 3 novel sets of features. Our
Wikipedia Vandal Behavior (WVB) approach uses a novel set of user editing
patterns as features to classify some users as vandals. Our Wikipedia
Transition Probability Matrix (WTPM) approach uses a set of features derived
from a transition probability matrix and then reduces it via a neural net
auto-encoder to classify some users as vandals. The VEWS approach merges the
previous two approaches. Without using any information (e.g. reverts) provided
by other users, these algorithms each have over 85% classification accuracy.
Moreover, when temporal recency is considered, accuracy goes to almost 90%. We
carry out detailed experiments on a new data set we have created consisting of
about 33K Wikipedia users (including both a black list and a white list of
editors) and containing 770K edits. We describe specific behaviors that
distinguish between vandals and non-vandals. We show that VEWS beats ClueBot NG
and STiki, the best known algorithms today for vandalism detection. Moreover,
VEWS detects far more vandals than ClueBot NG and on average, detects them 2.39
edits before ClueBot NG when both detect the vandal. However, we show that the
combination of VEWS and ClueBot NG can give a fully automated vandal early
warning system with even higher accuracy. (To appear in Proceedings of the
21st ACM SIGKDD Conference on Knowledge Discovery and Data Mining, KDD 2015.)
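The WTPM features described above are built from how often a user moves from one kind of edit action to the next. As a minimal sketch of that idea, the function below row-normalizes transition counts over a sequence of actions; the action alphabet is an assumption, and the paper's neural-net autoencoder compression step is omitted.

```python
# Sketch of a transition probability matrix over user actions, in the
# spirit of the WTPM features. The action labels are illustrative; the
# paper's exact feature definitions and autoencoder step are not shown.
def transition_matrix(actions):
    """Row-normalized counts of action -> next-action transitions."""
    states = sorted(set(actions))
    idx = {s: i for i, s in enumerate(states)}
    counts = [[0.0] * len(states) for _ in states]
    for a, b in zip(actions, actions[1:]):
        counts[idx[a]][idx[b]] += 1.0
    for row in counts:            # normalize each row to probabilities
        total = sum(row)
        if total:
            for j in range(len(row)):
                row[j] /= total
    return states, counts
```

Flattening such a matrix (one per user) yields a fixed-length feature vector that a classifier, or an autoencoder followed by a classifier, can consume.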
Agents, Bookmarks and Clicks: A topical model of Web traffic
Analysis of aggregate and individual Web traffic has shown that PageRank is a
poor model of how people navigate the Web. Using the empirical traffic patterns
generated by a thousand users, we characterize several properties of Web
traffic that cannot be reproduced by Markovian models. We examine both
aggregate statistics capturing collective behavior, such as page and link
traffic, and individual statistics, such as entropy and session size. No model
currently explains all of these empirical observations simultaneously. We show
that all of these traffic patterns can be explained by an agent-based model
that takes into account several realistic browsing behaviors. First, agents
maintain individual lists of bookmarks (a non-Markovian memory mechanism) that
are used as teleportation targets. Second, agents can retreat along visited
links, a branching mechanism that also allows us to reproduce behaviors such as
the use of a back button and tabbed browsing. Finally, agents are sustained by
visiting novel pages of topical interest, with adjacent pages being more
topically related to each other than distant ones. This modulates the
probability that an agent continues to browse or starts a new session, allowing
us to recreate heterogeneous session lengths. The resulting model is capable of
reproducing the collective and individual behaviors we observe in the empirical
data, reconciling the narrowly focused browsing patterns of individual users
with the extreme heterogeneity of aggregate traffic measurements. This result
allows us to identify a few salient features that are necessary and sufficient
to interpret the browsing patterns observed in our data. In addition to the
descriptive and explanatory power of such a model, our results may lead the way
to more sophisticated, realistic, and effective ranking and crawling
algorithms. (10 pages, 16 figures, 1 table. Long version of a paper to appear
in Proceedings of the 21st ACM Conference on Hypertext and Hypermedia.)
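The three mechanisms the abstract lists (bookmark teleportation, backtracking along visited links, and topicality-driven session termination) can be sketched as a simple agent loop. All parameter values and the termination rule below are assumptions for illustration, not the paper's calibrated model.

```python
# Illustrative agent-based surfer: a personal bookmark list used as
# teleportation targets, a back stack (back button / tabs), and a fixed
# stopping probability standing in for the topicality mechanism.
import random

def browse_session(graph, start, p_back=0.2, p_teleport=0.1,
                   p_stop=0.05, bookmarks=None, rng=None):
    rng = rng or random.Random(0)
    bookmarks = bookmarks if bookmarks is not None else [start]
    page, stack, visited = start, [], [start]
    while rng.random() >= p_stop:        # session ends with prob. p_stop
        r = rng.random()
        if r < p_back and stack:         # retreat along a visited link
            page = stack.pop()
        elif r < p_back + p_teleport:    # teleport to a bookmark
            page = rng.choice(bookmarks)
            stack.clear()
        else:                            # follow an outgoing link
            links = graph.get(page, [])
            if links:
                stack.append(page)
                page = rng.choice(links)
            else:                        # dead end: fall back to a bookmark
                page = rng.choice(bookmarks)
        visited.append(page)
    return visited
```

In the paper's model the stopping decision depends on how topically novel adjacent pages are; a constant `p_stop` is the simplest stand-in and would not reproduce the heterogeneous session lengths the authors report.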
I-pot: a new approach utilising visual and contextual cues to support users in graphical web browser revisitation.
With a quarter of the world's population now having access to the Internet, web efficiency and optimal use are of growing importance to all users. Revisitation, where a user wants to return to a website visited in the recent past, is becoming correspondingly more important. The static, textual approaches in the latest versions of mainstream web browsers leave much to be desired. This paper proposes a new approach that uses organic visual and contextual cues to support users in this task area.
Automatic classification of web pages into bookmark categories
We describe a technique to automatically classify a web page into an existing bookmark category whenever a user decides to bookmark a page. HyperBK compares a bag-of-words representation of the page to descriptions of categories in the user's bookmark file. Unlike default web browser dialogs, in which the user may be presented with the category into which he or she saved the last bookmarked page, HyperBK also offers the category most similar to the page being bookmarked. The user can opt to save the page to the last category used, create a new category, or save the page elsewhere. In an evaluation, the user's preferred category was offered on average 67% of the time.
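The bag-of-words matching idea above can be sketched as cosine similarity between word-count vectors of the new page and of each category's description. HyperBK's actual representation, weighting, and scoring may differ; the function and category names here are illustrative.

```python
# Sketch of bag-of-words category suggestion: score each bookmark
# category by cosine similarity between word-count vectors. This is an
# illustration of the general technique, not HyperBK's implementation.
from collections import Counter
import math

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def suggest_category(page_text: str, categories: dict) -> str:
    """Return the category whose description best matches the page."""
    page = Counter(page_text.lower().split())
    return max(categories,
               key=lambda c: cosine(page,
                                    Counter(categories[c].lower().split())))
```

A real system would also apply stop-word removal and term weighting (e.g. TF-IDF) before comparing vectors.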
Individual and collective information practices
Talk given at the Thémat'IC 2007 study day "Information literacy for adults: issues and methods" ("La maîtrise de l'information par les adultes : enjeux et méthodes"), Strasbourg, March 2007.
How people recognize previously seen Web pages from titles, URLs and thumbnails
The selectable lists of pages offered by web browsers' history and bookmark facilities ostensibly make it easier for people to return to previously visited pages. These lists show the pages as abstractions, typically as truncated titles and URLs, and more rarely as small thumbnail images. Yet we have little knowledge of how recognizable these representations really are. Consequently, we carried out a study that compared the recognizability of thumbnails between various image sizes, and of titles and URLs between various string sizes. Our results quantify the tradeoff between the size of these representations and their recognizability. These findings directly contribute to how history and bookmark lists should be designed.