A Random Forest Guided Tour
The random forest algorithm, proposed by L. Breiman in 2001, has been
extremely successful as a general-purpose classification and regression method.
The approach, which combines several randomized decision trees and aggregates
their predictions by averaging, has shown excellent performance in settings
where the number of variables is much larger than the number of observations.
Moreover, it is versatile enough to be applied to large-scale problems, is
easily adapted to various ad-hoc learning tasks, and returns measures of
variable importance. The present article reviews the most recent theoretical
and methodological developments for random forests. Emphasis is placed on the
mathematical forces driving the algorithm, with special attention given to the
selection of parameters, the resampling mechanism, and variable importance
measures. This review is intended to provide non-experts easy access to the
main ideas
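As a rough illustration of the mechanism this abstract summarizes (randomized trees fitted on resampled data, with predictions aggregated by averaging), the following Python sketch may help. It is a toy under stated assumptions, not Breiman's full algorithm: each tree is only a depth-one regression stump, and the names `fit_stump`, `fit_forest`, `predict_forest`, `n_trees`, and `max_features` are hypothetical choices made for this example.

```python
import random
from statistics import mean


def fit_stump(X, y, feature_ids):
    """Fit a depth-one regression tree using only the candidate features."""
    best = None
    for j in feature_ids:
        for threshold in sorted({row[j] for row in X}):
            left = [t for row, t in zip(X, y) if row[j] <= threshold]
            right = [t for row, t in zip(X, y) if row[j] > threshold]
            if not left or not right:
                continue
            lm, rm = mean(left), mean(right)
            # Sum of squared errors of the two leaf means.
            sse = sum((t - lm) ** 2 for t in left) + sum((t - rm) ** 2 for t in right)
            if best is None or sse < best[0]:
                best = (sse, j, threshold, lm, rm)
    if best is None:
        # No valid split exists: always predict the sample mean.
        m = mean(y)
        return (feature_ids[0], float("inf"), m, m)
    return best[1:]


def predict_stump(stump, row):
    j, threshold, left_mean, right_mean = stump
    return left_mean if row[j] <= threshold else right_mean


def fit_forest(X, y, n_trees=50, max_features=1, seed=0):
    """Fit n_trees stumps, each on a bootstrap sample and a random feature subset."""
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]   # bootstrap resample
        feats = rng.sample(range(p), max_features)   # random candidate features
        forest.append(fit_stump([X[i] for i in idx], [y[i] for i in idx], feats))
    return forest


def predict_forest(forest, row):
    """Aggregate the individual tree predictions by averaging."""
    return mean([predict_stump(stump, row) for stump in forest])
```

In this sketch the two sources of randomness are the bootstrap resample and the random feature subset considered at the split. Because a stump has a single split, drawing the subset once per tree coincides with drawing it once per split.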
Can Deep Learning Predict Risky Retail Investors? A Case Study in Financial Risk Behavior Forecasting
The paper examines the potential of deep learning to support decisions in
financial risk management. We develop a deep learning model for predicting
whether individual spread traders secure profits from future trades. This task
embodies typical modeling challenges faced in risk and behavior forecasting.
Conventional machine learning requires data that is representative of the
feature-target relationship and relies on the often costly development,
maintenance, and revision of handcrafted features. Consequently, modeling
highly variable, heterogeneous patterns such as trader behavior is challenging.
Deep learning promises a remedy. Learning hierarchical distributed
representations of the data in an automatic manner (e.g. risk taking behavior),
it uncovers generative features that determine the target (e.g., trader's
profitability), avoids manual feature engineering, and is more robust toward
change (e.g. dynamic market conditions). The results of employing a deep
network for operational risk forecasting confirm the feature learning
capability of deep learning, provide guidance on designing a suitable network
architecture and demonstrate the superiority of deep learning over machine
learning and rule-based benchmarks.
Comment: Within the "equal" contribution, Yaodong Yang contributed the core
deep learning algorithm along with its experimental results, and the first
draft of the manuscript (including Figure 1,2,3,4,7,8,9,11, and Table 3
On-the-Job Learning with Bayesian Decision Theory
Our goal is to deploy a high-accuracy system starting with zero training
examples. We consider an "on-the-job" setting, where as inputs arrive, we use
real-time crowdsourcing to resolve uncertainty where needed and output our
prediction when confident. As the model improves over time, the reliance on
crowdsourcing queries decreases. We cast our setting as a stochastic game based
on Bayesian decision theory, which allows us to balance latency, cost, and
accuracy objectives in a principled way. Computing the optimal policy is
intractable, so we develop an approximation based on Monte Carlo Tree Search.
We tested our approach on three datasets---named-entity recognition, sentiment
classification, and image classification. On the NER task we obtained more than
an order of magnitude reduction in cost compared to full human annotation,
while boosting performance relative to the expert provided labels. We also
achieve an 8% F1 improvement over having a single human label the whole set, and
a 28% F1 improvement over online learning.
Comment: As appearing in NIPS 201
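The trade-off between query cost and accuracy described in this abstract can be illustrated with a much simpler, one-step expected-loss comparison. The sketch below is not the paper's method (which casts the problem as a stochastic game and approximates the policy with Monte Carlo Tree Search); it is a myopic rule under assumed inputs, and the function name `should_query` and all of its parameters are hypothetical.

```python
def should_query(p_correct, p_correct_after_query, error_cost, query_cost):
    """Return True if paying for one crowd query has lower expected loss
    than answering immediately.

    p_correct: current probability that the model's prediction is right.
    p_correct_after_query: assumed probability of being right after one query.
    error_cost: loss incurred by a wrong output.
    query_cost: price (money and/or latency) of one crowd query.
    """
    loss_answer_now = (1.0 - p_correct) * error_cost
    loss_query_first = query_cost + (1.0 - p_correct_after_query) * error_cost
    return loss_query_first < loss_answer_now
```

Under this rule an uncertain prediction triggers a query while a confident one is output directly, which matches the qualitative behavior the abstract describes as the model improves.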
FECBench: A Holistic Interference-aware Approach for Application Performance Modeling
Services hosted in multi-tenant cloud platforms often encounter performance
interference due to contention for non-partitionable resources, which in turn
causes unpredictable behavior and degradation in application performance. To
grapple with these problems and to define effective resource management
solutions for their services, providers often must expend significant efforts
and incur prohibitive costs in developing performance models of their services
under a variety of interference scenarios on different hardware. This is a hard
problem due to the wide range of possible co-located services and their
workloads, and the growing heterogeneity in the runtime platforms including the
use of fog and edge-based resources, not to mention the accidental complexity
in performing application profiling under a variety of scenarios. To address
these challenges, we present FECBench, a framework to guide providers in
building performance interference prediction models for their services without
incurring undue costs and efforts. The contributions of the paper are as
follows. First, we developed a technique to build resource stressors that can
stress multiple system resources all at once in a controlled manner to gain
insights about the interference on an application's performance. Second, to
overcome the need for exhaustive application profiling, FECBench intelligently
uses the design of experiments (DoE) approach to enable users to build
surrogate performance models of their services. Third, FECBench maintains an
extensible knowledge base of application combinations that create resource
stresses across the multi-dimensional resource design space. Empirical results
using real-world scenarios to validate the efficacy of FECBench show that the
predicted application performance has a median error of only 7.6% across all
test cases, with 5.4% in the best case and 13.5% in the worst case
A Survey of Prediction Using Social Media
Social media comprises interactive applications and platforms for creating,
sharing, and exchanging user-generated content. The past ten years have
brought huge growth in social media, especially online social networking
services, and it is changing our ways to organize and communicate. It
aggregates opinions and feelings of diverse groups of people at low cost.
Mining the attributes and contents of social media gives us an opportunity to
discover social structure characteristics, analyze action patterns
qualitatively and quantitatively, and sometimes the ability to predict future
human-related events. In this paper, we first discuss the realms that can be
predicted with current social media, then overview available predictors and
techniques of prediction, and finally discuss challenges and possible future
directions.
Comment: 20 page
Attacking Machine Learning models as part of a cyber kill chain
Machine learning is gaining popularity in the network security domain as many
more network-enabled devices get connected, as malicious activities become
stealthier, and as new technologies like Software Defined Networking emerge.
Compromising a machine learning model is a desirable goal. In fact, spammers have
been quite successful getting through machine learning enabled spam filters for
years. Although adversarial machine learning has been studied before, no prior
work has considered it within a defense-in-depth environment, in which
correct classification alone may not be good enough. For the first time, this
paper proposes a cyber kill-chain for attacking machine learning models
together with a proof of concept. The intention is to provide a high level
attack model that inspires more secure processes in the
research, design, and implementation of machine learning based security solutions.
Comment: 8 page
A Survey of Online Experiment Design with the Stochastic Multi-Armed Bandit
Adaptive and sequential experiment design is a well-studied area in numerous
domains. We survey and synthesize the work of the online statistical learning
paradigm referred to as multi-armed bandits, integrating the existing research
as a resource for a certain class of online experiments. We first explore the
traditional stochastic model of a multi-armed bandit, then explore a taxonomic
scheme of complications to that model, for each complication relating it to a
specific requirement or consideration of the experiment design context.
Finally, at the end of the paper, we present a table of known upper bounds on
regret for all studied algorithms, providing both perspectives for future
theoretical work and a decision-making tool for practitioners looking for
theoretical guarantees.
Comment: 49 pages, 1 figur
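For readers unfamiliar with the stochastic multi-armed bandit model, a minimal Python sketch of UCB1, one well-known index policy for this setting, is given here. It is offered only as an illustration and is not taken from the survey itself; the function name `ucb1` and the `pull` callback are assumptions made for the example, and rewards are assumed to lie in [0, 1].

```python
import math


def ucb1(pull, n_arms, horizon):
    """Run UCB1 for `horizon` rounds and return how often each arm was played.

    pull: callable taking an arm index and returning a reward in [0, 1].
    """
    counts = [0] * n_arms    # number of times each arm was played
    sums = [0.0] * n_arms    # total reward collected from each arm
    for t in range(1, horizon + 1):
        if t <= n_arms:
            arm = t - 1      # initialization: play every arm once
        else:
            # Pick the arm with the largest empirical mean plus exploration bonus.
            arm = max(
                range(n_arms),
                key=lambda a: sums[a] / counts[a]
                + math.sqrt(2.0 * math.log(t) / counts[a]),
            )
        reward = pull(arm)
        counts[arm] += 1
        sums[arm] += reward
    return counts
```

In an online experiment each arm would correspond to one variant, and `pull` would serve that variant to the next user and return the observed outcome. The exploration bonus shrinks as an arm is played more often, so clearly inferior variants are shown less and less frequently.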
A Bayesian Perspective of Statistical Machine Learning for Big Data
Statistical Machine Learning (SML) refers to a body of algorithms and methods
by which computers are allowed to discover important features of input data
sets which are often very large in size. The very task of feature discovery
from data is essentially the meaning of the keyword 'learning' in SML.
Theoretical justifications for the effectiveness of the SML algorithms are
underpinned by sound principles from different disciplines, such as Computer
Science and Statistics. The theoretical underpinnings particularly justified by
statistical inference methods are together termed as statistical learning
theory.
This paper provides a review of SML from a Bayesian decision theoretic point
of view -- where we argue that many SML techniques are closely connected to
making inference by using the so-called Bayesian paradigm. We discuss many
important SML techniques such as supervised and unsupervised learning, deep
learning, online learning and Gaussian processes especially in the context of
very large data sets where these are often employed. We present a dictionary
which maps the key concepts of SML from Computer Science and Statistics. We
illustrate the SML techniques with three moderately large data sets where we
also discuss many practical implementation issues. Thus the review is
especially targeted at statisticians and computer scientists who are aspiring
to understand and apply SML for moderately large to big data sets.
Comment: 26 pages, 3 figures, Review pape
Student Success Prediction in MOOCs
Predictive models of student success in Massive Open Online Courses (MOOCs)
are a critical component of effective content personalization and adaptive
interventions. In this article we review the state of the art in predictive
models of student success in MOOCs and present a categorization of MOOC
research according to the predictors (features), prediction (outcomes), and
underlying theoretical model. We critically survey work across each category,
providing data on the raw data source, feature engineering, statistical model,
evaluation method, prediction architecture, and other aspects of these
experiments. Such a review is particularly useful given the rapid expansion of
predictive modeling research in MOOCs since the emergence of major MOOC
platforms in 2012. This survey reveals several key methodological gaps, which
include extensive filtering of experimental subpopulations, ineffective student
model evaluation, and the use of experimental data which would be unavailable
for real-world student success prediction and intervention, which is the
ultimate goal of such models. Finally, we highlight opportunities for future
research, which include temporal modeling, research bridging predictive and
explanatory student models, work which contributes to learning theory, and
evaluating long-term learner success in MOOCs
A Study of WhatsApp Usage Patterns and Prediction Models without Message Content
Internet social networks have become a ubiquitous application allowing people
to easily share text, pictures, and audio and video files. Popular networks
include WhatsApp, Facebook, Reddit and LinkedIn. We present an extensive study
of the usage of the WhatsApp social network, an Internet messaging application
that is quickly replacing SMS messaging. In order to better understand people's
use of the network, we provide an analysis of over 6 million messages from over
100 users, with the objective of building demographic prediction models using
activity data. We performed extensive statistical and numerical analysis of the
data and found significant differences in WhatsApp usage across people of
different genders and ages. We also inputted the data into the Weka data mining
package and studied models created from decision tree and Bayesian network
algorithms. We found that different genders and age demographics had
significantly different usage habits in almost all message and group
attributes. We also noted differences in users' group behavior and created
prediction models, including the likelihood that a given group would have
relatively more file attachments, a larger number of participants, a higher
frequency of activity, quicker response times, and shorter messages. We were
successful in quantifying and predicting a user's
gender and age demographic. Similarly, we were able to predict different types
of group usage. All models were built without analyzing message content. We
present a detailed discussion about the specific attributes that were contained
in all predictive models and suggest possible applications based on these
results.
Comment: 24 page