7,528 research outputs found
Of course we share! Testing Assumptions about Social Tagging Systems
Social tagging systems have established themselves as an important part in
today's web and have attracted the interest from our research community in a
variety of investigations. The overall vision of our community is that simply
through interactions with the system, i.e., through tagging and sharing of
resources, users would contribute to building useful semantic structures as
well as resource indexes using uncontrolled vocabulary not only due to the
easy-to-use mechanics. Henceforth, a variety of assumptions about social
tagging systems have emerged, yet testing them has been difficult due to the
absence of suitable data. In this work we thoroughly investigate three
available assumptions - e.g., is a tagging system really social? - by examining
live log data gathered from the real-world public social tagging system
BibSonomy. Our empirical results indicate that while some of these assumptions
hold to a certain extent, other assumptions need to be reflected and viewed in
a very critical light. Our observations have implications for the design of
future search and other algorithms to better reflect the actual user behavior
Recommended from our members
Analysing web search logs to determine session boundaries for user-oriented learning
Incremental learning approaches based on user search activities provide a means of building adaptive information retrieval systems. To develop more effective user-oriented learning techniques for the Web, we need to be able to identify a meaningful session unit from which we can learn. Without this, we run a high risk of grouping together activities that are unrelated or perhaps not from the same user. We are interested in detecting boundaries of sequences between related activities (sessions) that would group the activities for a learning purpose. Session boundaries, in Reuters transaction logs, were detected automatically. The generated boundaries were compared with human judgements. The comparison confirmed that a meaningful session threshold for establishing these session boundaries was confined to a 11-15 minute range
Improving the efficiency of spam filtering through cache architecture
Blacklists (BLs), also called Domain Name Systembased Blackhole List (DNSBLs) are the databases of known internet addresses used by the spammers to send out the spam mails. Mail servers use these lists to filter out the e-mails coming from different spam sources. In contrary, Whitelists (WLs) are the explicit list of senders from whom e-mail can be accepted or delivered. Mail Transport Agent (MTA) is usually configured to reject, challenge or flag the messages which have been sent from the sources listed on one or more DNSBLs and to allow the messages from the sources listed on the WLs. In this paper, we are demonstrating how the bandwidth (the overall requests and responses that need to go over the network) performance is improved by using local caches for BLs and WLs. The actual sender\u27s IP addresses are extracted from the e-mail log. These are then compared with the list in the local caches to find out if they should be accepted or not, before they are checked against the global DNSBLs by running \u27DNSBL queries\u27 (if required). Around three quarters of the e-mail sources have been observed to be filtered locally through caches with this method. Provision of local control over the lists and lower search (filtering) time are the other related benefits. Š 2008 IEEE
A Comparative analysis: QA evaluation questions versus real-world queries
This paper presents a comparative analysis of user queries to a web search engine, questions to a Q&A service (answers.com), and questions employed in question answering (QA) evaluations at TREC and CLEF. The analysis shows that user queries to search engines contain mostly content words (i.e. keywords) but lack structure words (i.e. stopwords) and capitalization. Thus, they resemble natural language input after case folding and stopword removal. In contrast, topics for QA evaluation and questions to answers.com mainly
consist of fully capitalized and syntactically well-formed questions. Classification experiments using a na¨Ĺve Bayes classifier show that stopwords play an important role in determining the expected answer type. A classification based on stopwords is considerably more accurate (47.5% accuracy) than a classification based on all query words (40.1% accuracy) or on content words (33.9% accuracy). To
simulate user input, questions are preprocessed by case folding and stopword removal. Additional classification experiments aim at reconstructing the syntactic wh-word frame of a question, i.e. the embedding of the interrogative word. Results indicate that this part of
questions can be reconstructed with moderate accuracy (25.7%), but for a classification problem with a much larger number of classes compared to classifying queries by expected answer type (2096 classes vs. 130 classes). Furthermore, eliminating stopwords can lead to multiple reconstructed questions with a different or with the opposite meaning (e.g. if negations or temporal restrictions are included). In conclusion, question reconstruction from short user queries can be seen as a new realistic evaluation challenge for QA systems
Population Density-based Hospital Recommendation with Mobile LBS Big Data
The difficulty of getting medical treatment is one of major livelihood issues
in China. Since patients lack prior knowledge about the spatial distribution
and the capacity of hospitals, some hospitals have abnormally high or sporadic
population densities. This paper presents a new model for estimating the
spatiotemporal population density in each hospital based on location-based
service (LBS) big data, which would be beneficial to guiding and dispersing
outpatients. To improve the estimation accuracy, several approaches are
proposed to denoise the LBS data and classify people by detecting their various
behaviors. In addition, a long short-term memory (LSTM) based deep learning is
presented to predict the trend of population density. By using Baidu
large-scale LBS logs database, we apply the proposed model to 113 hospitals in
Beijing, P. R. China, and constructed an online hospital recommendation system
which can provide users with a hospital rank list basing the real-time
population density information and the hospitals' basic information such as
hospitals' levels and their distances. We also mine several interesting
patterns from these LBS logs by using our proposed system
PlayeRank: data-driven performance evaluation and player ranking in soccer via a machine learning approach
The problem of evaluating the performance of soccer players is attracting the
interest of many companies and the scientific community, thanks to the
availability of massive data capturing all the events generated during a match
(e.g., tackles, passes, shots, etc.). Unfortunately, there is no consolidated
and widely accepted metric for measuring performance quality in all of its
facets. In this paper, we design and implement PlayeRank, a data-driven
framework that offers a principled multi-dimensional and role-aware evaluation
of the performance of soccer players. We build our framework by deploying a
massive dataset of soccer-logs and consisting of millions of match events
pertaining to four seasons of 18 prominent soccer competitions. By comparing
PlayeRank to known algorithms for performance evaluation in soccer, and by
exploiting a dataset of players' evaluations made by professional soccer
scouts, we show that PlayeRank significantly outperforms the competitors. We
also explore the ratings produced by {\sf PlayeRank} and discover interesting
patterns about the nature of excellent performances and what distinguishes the
top players from the others. At the end, we explore some applications of
PlayeRank -- i.e. searching players and player versatility --- showing its
flexibility and efficiency, which makes it worth to be used in the design of a
scalable platform for soccer analytics
- âŚ