What Users Ask a Search Engine: Analyzing One Billion Russian Question Queries
We analyze the question queries submitted to a large commercial web search engine to gain insight into what people ask, and to better tailor the search results to the users’ needs. Based on a dataset of about one billion question queries submitted during the year 2012, we investigate askers’ querying behavior with the support of automatic query categorization. While the importance of question queries is likely to increase, at present they make up only 3–4% of the total search traffic. Since questions are such a small part of the query stream and are more likely to be unique than shorter queries, clickthrough information is typically rather sparse. Thus, query categorization methods based on the categories of clicked web documents do not work well for questions. As an alternative, we propose a robust question query classification method that uses the labeled questions from a large community question answering (CQA) platform as a training set. The resulting classifier is then transferred to the web search questions. Even though questions on CQA platforms tend to differ from web search questions, our categorization method proves competitive with strong baselines with respect to classification accuracy. To show the scalability of our proposed method, we apply the classifiers to about one billion question queries and discuss the trade-offs between performance and accuracy that different classification models offer. Our findings reveal what people ask a search engine and also how this contrasts with behavior on a CQA platform.
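The transfer idea above (train on labeled CQA questions, then classify unlabeled search queries) can be sketched with a tiny unigram Naive Bayes classifier. The categories, example questions, and the model choice are illustrative assumptions, not the paper's actual classifier or data:

```python
from collections import Counter, defaultdict
import math

# Hypothetical labeled CQA questions; the paper trains on a far larger
# labeled set from a real CQA platform, and its categories differ.
cqa_training = [
    ("health", "what are symptoms of the flu"),
    ("health", "how to lower blood pressure"),
    ("tech", "how to reset a router"),
    ("tech", "what is the best antivirus software"),
]

def train_nb(examples):
    """Count unigrams per category for a Naive Bayes classifier."""
    word_counts = defaultdict(Counter)
    cat_counts = Counter()
    vocab = set()
    for cat, text in examples:
        cat_counts[cat] += 1
        for w in text.split():
            word_counts[cat][w] += 1
            vocab.add(w)
    return word_counts, cat_counts, vocab

def classify(query, model):
    """Score each category with add-one smoothing; return the best one."""
    word_counts, cat_counts, vocab = model
    total = sum(cat_counts.values())
    best_cat, best_score = None, float("-inf")
    for cat in cat_counts:
        score = math.log(cat_counts[cat] / total)  # log prior
        n_cat = sum(word_counts[cat].values())
        for w in query.split():
            score += math.log((word_counts[cat][w] + 1) / (n_cat + len(vocab)))
        if score > best_score:
            best_cat, best_score = cat, score
    return best_cat

# Train on CQA data, then apply the classifier to raw search queries:
model = train_nb(cqa_training)
```

Because the model needs only query text, no clickthrough data, it sidesteps the sparsity problem the abstract describes for clicked-document categorization.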
Pyramid: Enhancing Selectivity in Big Data Protection with Count Featurization
Protecting vast quantities of data poses a daunting challenge for the growing
number of organizations that collect, stockpile, and monetize it. The ability
to distinguish data that is actually needed from data collected "just in case"
would help these organizations to limit the latter's exposure to attack. A
natural approach might be to monitor data use and retain only the working-set
of in-use data in accessible storage; unused data can be evicted to a highly
protected store. However, many of today's big data applications rely on machine
learning (ML) workloads that are periodically retrained by accessing, and thus
exposing to attack, the entire data store. Training set minimization methods,
such as count featurization, are often used to limit the data needed to train
ML workloads to improve performance or scalability. We present Pyramid, a
limited-exposure data management system that builds upon count featurization to
enhance data protection. As such, Pyramid uniquely introduces both the idea and
proof-of-concept for leveraging training set minimization methods to instill
rigor and selectivity into big data management. We integrated Pyramid into
Spark Velox, a framework for ML-based targeting and personalization. We
evaluate it on three applications and show that Pyramid approaches
state-of-the-art models while training on less than 1% of the raw data.
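Count featurization, the training-set minimization technique Pyramid builds on, can be sketched in a few lines: a high-cardinality raw feature (e.g. a user id) is replaced by compact per-value label counts, so models can train without touching the raw records. The class and feature names below are invented for illustration, not Pyramid's API:

```python
from collections import defaultdict

class CountTable:
    """Maps a raw categorical value to per-label counts (value -> [neg, pos])."""

    def __init__(self):
        self.counts = defaultdict(lambda: [0, 0])

    def update(self, value, label):
        """Fold one observation into the table (label is 0 or 1)."""
        self.counts[value][label] += 1

    def featurize(self, value):
        """Replace a raw value with (positive count, negative count, positive rate)."""
        neg, pos = self.counts[value]
        total = neg + pos
        rate = pos / total if total else 0.0
        return pos, neg, rate

# Hypothetical click log: a model would train on the count features,
# never on the raw (user, click) records themselves.
table = CountTable()
clicks = [("user_42", 1), ("user_42", 0), ("user_42", 1), ("user_7", 0)]
for user, clicked in clicks:
    table.update(user, clicked)
```

The protection angle is that only the small count table must stay in accessible storage; the raw event log can be evicted to a highly protected store, which is the selectivity the abstract describes.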
Online Learning Models for Content Popularity Prediction in Wireless Edge Caching
Caching popular contents in advance is an important technique to achieve the
low latency requirement and to reduce the backhaul costs in future wireless
communications. Considering a network with base stations distributed as a
Poisson point process (PPP), optimal content placement caching probabilities
are derived for a known popularity profile, which is unknown in practice. In this
paper, online prediction (OP) and online learning (OL) methods are presented
based on popularity prediction model (PPM) and Grassmannian prediction model
(GPM), to predict the content profile for future time slots for time-varying
popularities. In OP, the problem of finding the coefficients is modeled as a
constrained non-negative least squares (NNLS) problem which is solved with a
modified NNLS algorithm. In addition, these two models are compared with
log-request prediction model (RPM), information prediction model (IPM) and
average success probability (ASP) based model. Next, in OL methods for the
time-varying case, the cumulative mean squared error (MSE) is minimized and the
MSE regret is analyzed for each of the models. Moreover, for quasi-time varying
case where the popularity changes block-wise, KWIK (know what it knows)
learning method is modified for these models to improve the prediction MSE and
ASP performance. Simulation results show that for OP, PPM and GPM provide the
best ASP among these models, concluding that minimum mean squared error based
models do not necessarily result in optimal ASP. OL based models yield
approximately similar ASP and MSE, while for quasi-time varying case, KWIK
methods provide better performance, which has been verified with the MovieLens
dataset.
Comment: 9 figures, 29 pages
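The constrained NNLS step in the OP method above fits non-negative combination coefficients. As a minimal sketch, a plain projected-gradient solver can stand in for the paper's modified NNLS algorithm; the toy matrix, step size, and iteration count are assumptions for illustration:

```python
def nnls_projected_gradient(A, b, steps=5000, lr=0.01):
    """Minimize ||A x - b||^2 subject to x >= 0 by projected gradient descent."""
    m, n = len(A), len(A[0])
    x = [0.0] * n
    for _ in range(steps):
        # residual r = A x - b
        r = [sum(A[i][j] * x[j] for j in range(n)) - b[i] for i in range(m)]
        # gradient g = 2 A^T r
        g = [2 * sum(A[i][j] * r[i] for i in range(m)) for j in range(n)]
        # gradient step, then project onto the non-negative orthant
        x = [max(0.0, x[j] - lr * g[j]) for j in range(n)]
    return x

# Toy system: combine two past popularity profiles to match an observed one.
A = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
b = [2.0, -1.0, 2.0]  # the -1 entry forces the non-negativity constraint to bind
x = nnls_projected_gradient(A, b)
```

Here the unconstrained least-squares solution would take the second coefficient negative; the projection clamps it to zero, which is exactly what distinguishes NNLS from ordinary least squares.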
Emerging Phishing Trends and Effectiveness of the Anti-Phishing Landing Page
Each month, more attacks are launched with the aim of making web users
believe that they are communicating with a trusted entity which compels them to
share their personal, financial information. Phishing costs Internet users
billions of dollars every year. Researchers at Carnegie Mellon University (CMU)
created an anti-phishing landing page supported by Anti-Phishing Working Group
(APWG) with the aim to train users on how to prevent themselves from phishing
attacks. It is used by financial institutions, phish site take down vendors,
government organizations, and online merchants. When a potential victim clicks
on a phishing link that has been taken down, he / she is redirected to the
landing page. In this paper, we present the comparative analysis on two
datasets that we obtained from APWG's landing page log files; one, from
September 7, 2008 - November 11, 2009, and the other from January 1, 2014 - April
30, 2014. We found that the landing page has been successful in training users
against phishing. Forty-six percent of users clicked fewer phishing
URLs from January 2014 to April 2014, which shows that training from the landing
page helped users not to fall for phishing attacks. Our analysis shows that
phishers have started to modify their techniques by creating more
legitimate-looking URLs and buying a large number of domains to increase their activity. We
observed that phishers are exploiting ICANN accredited registrars to launch
their attacks even after strict surveillance. We saw that phishers are trying
to exploit free subdomain registration services to carry out attacks. In this
paper, we also compared the phishing e-mails used by phishers to lure victims
in 2008 and 2014. We found that the phishing e-mails have changed considerably
over time. Phishers have adopted new techniques like sending promotional
e-mails and emotionally targeting users into clicking phishing URLs.
Sequence Modelling for Analysing Student Interaction with Educational Systems
The analysis of log data generated by online educational systems is an
important task for improving the systems, and furthering our knowledge of how
students learn. This paper uses previously unseen log data from Edulab, the
largest provider of digital learning for mathematics in Denmark, to analyse the
sessions of its users, where 1.08 million student sessions are extracted from a
subset of their data. We propose to model students as a distribution of
different underlying student behaviours, where the sequence of actions from
each session belongs to an underlying student behaviour. We model student
behaviour as Markov chains, such that a student is modelled as a distribution
of Markov chains, which are estimated using a modified k-means clustering
algorithm. The resulting Markov chains are readily interpretable, and in a
qualitative analysis around 125,000 student sessions are identified as
exhibiting unproductive student behaviour. Based on our results this student
representation is promising, especially for educational systems offering many
different learning uses, and offers an alternative to the common approach in the
literature of modelling student behaviour as a single Markov chain.
Comment: The 10th International Conference on Educational Data Mining 201
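The session model above (each session becomes an empirical Markov transition matrix, and sessions are grouped by clustering those matrices) can be sketched as follows. The action names, sessions, and this plain k-means are illustrative assumptions; the paper's modified k-means differs in its details:

```python
import random

# Hypothetical action vocabulary for a math learning system.
ACTIONS = ["read", "solve", "skip"]

def transition_matrix(session):
    """Row-normalized empirical transition counts for one action sequence."""
    n = len(ACTIONS)
    idx = {a: i for i, a in enumerate(ACTIONS)}
    counts = [[0.0] * n for _ in range(n)]
    for a, b in zip(session, session[1:]):
        counts[idx[a]][idx[b]] += 1
    for row in counts:
        total = sum(row)
        if total:
            for j in range(n):
                row[j] /= total
    return counts

def dist(P, Q):
    """Squared Frobenius distance between two transition matrices."""
    return sum((p - q) ** 2 for rp, rq in zip(P, Q) for p, q in zip(rp, rq))

def kmeans(mats, k, iters=20, seed=0):
    """Cluster transition matrices; each centroid is an average matrix."""
    random.seed(seed)
    centers = random.sample(mats, k)
    n = len(ACTIONS)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for M in mats:
            clusters[min(range(k), key=lambda c: dist(M, centers[c]))].append(M)
        for c in range(k):
            if clusters[c]:
                centers[c] = [[sum(M[i][j] for M in clusters[c]) / len(clusters[c])
                               for j in range(n)] for i in range(n)]
    return centers

sessions = [
    ["read", "solve", "read", "solve"],  # productive-looking pattern
    ["skip", "skip", "skip", "read"],    # unproductive-looking pattern
]
mats = [transition_matrix(s) for s in sessions]
```

The interpretability claimed in the abstract comes from the centroids themselves: each cluster center is a transition matrix whose large entries name the dominant action-to-action habits of that behaviour group.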
Why It Takes So Long to Connect to a WiFi Access Point
Today's WiFi networks deliver a large fraction of traffic. However, the
performance and quality of WiFi networks are still far from satisfactory. Among
many popular quality metrics (throughput, latency), the probability of
successfully connecting to WiFi APs and the time cost of the WiFi connection
set-up process are two of the most critical metrics that affect WiFi users'
experience. To understand the WiFi connection set-up process in real-world
settings, we carry out measurement studies on million mobile users from
representative cities associating with million APs in billion WiFi
sessions, collected from a mobile "WiFi Manager" App that tops the Android/iOS
App market. To the best of our knowledge, we are the first to do such a
large-scale study on: how large the WiFi connection set-up time cost is, what factors
affect the WiFi connection set-up process, and what can be done to reduce the
WiFi connection set-up time cost. Based on the measurement analysis, we develop
a machine learning based AP selection strategy that can significantly improve
WiFi connection set-up performance, against the conventional strategy purely
based on signal strength, by reducing the connection set-up failures from
to and reducing time costs of the connection set-up
processes by more than times.
Comment: 11 pages, conference
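The ML-based AP selection idea above, score candidate APs by predicted connection-success probability rather than raw signal strength, can be sketched with a small logistic model. The features (signal, load), the training history, and the model itself are assumptions for illustration, not the paper's actual strategy:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic(data, epochs=2000, lr=0.5):
    """Fit P(success) = sigmoid(w1*signal + w2*load + b) by per-sample gradient descent."""
    w, b = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for (signal, load), y in data:
            p = sigmoid(w[0] * signal + w[1] * load + b)
            err = p - y
            w[0] -= lr * err * signal
            w[1] -= lr * err * load
            b -= lr * err
    return w, b

# Hypothetical history: (normalized signal strength, AP load) -> connected?
history = [
    ((0.9, 0.9), 0), ((0.9, 0.8), 0),  # strong signal but overloaded: failures
    ((0.6, 0.1), 1), ((0.5, 0.2), 1),  # weaker signal but idle: successes
]
w, b = train_logistic(history)

def pick_ap(candidates):
    """Return the (name, features) pair with the highest predicted success."""
    return max(candidates,
               key=lambda ap: sigmoid(w[0] * ap[1][0] + w[1] * ap[1][1] + b))

aps = [("AP-strong-busy", (0.9, 0.85)), ("AP-weak-idle", (0.55, 0.15))]
```

On this toy history the learned model prefers the lightly loaded AP over the strongest-signal one, which is the behaviour the abstract contrasts with the conventional signal-strength-only strategy.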