42,091 research outputs found
Evaluation of a Tree-based Pipeline Optimization Tool for Automating Data Science
As the field of data science continues to grow, there will be an
ever-increasing demand for tools that make machine learning accessible to
non-experts. In this paper, we introduce the concept of tree-based pipeline
optimization for automating one of the most tedious parts of machine
learning---pipeline design. We implement an open source Tree-based Pipeline
Optimization Tool (TPOT) in Python and demonstrate its effectiveness on a
series of simulated and real-world benchmark data sets. In particular, we show
that TPOT can design machine learning pipelines that provide a significant
improvement over a basic machine learning analysis while requiring little to no
input nor prior knowledge from the user. We also address the tendency for TPOT
to design overly complex pipelines by integrating Pareto optimization, which
produces compact pipelines without sacrificing classification accuracy. As
such, this work represents an important step toward fully automating machine
learning pipeline design.Comment: 8 pages, 5 figures, preprint to appear in GECCO 2016, edits not yet
made from reviewer comment
Engineering Crowdsourced Stream Processing Systems
A crowdsourced stream processing system (CSP) is a system that incorporates
crowdsourced tasks in the processing of a data stream. This can be seen as
enabling crowdsourcing work to be applied on a sample of large-scale data at
high speed, or equivalently, enabling stream processing to employ human
intelligence. It also leads to a substantial expansion of the capabilities of
data processing systems. Engineering a CSP system requires the combination of
human and machine computation elements. From a general systems theory
perspective, this means taking into account inherited as well as emerging
properties from both these elements. In this paper, we position CSP systems
within a broader taxonomy, outline a series of design principles and evaluation
metrics, present an extensible framework for their design, and describe several
design patterns. We showcase the capabilities of CSP systems by performing a
case study that applies our proposed framework to the design and analysis of a
real system (AIDR) that classifies social media messages during time-critical
crisis events. Results show that compared to a pure stream processing system,
AIDR can achieve a higher data classification accuracy, while compared to a
pure crowdsourcing solution, the system makes better use of human workers by
requiring much less manual work effort
Dropout Model Evaluation in MOOCs
The field of learning analytics needs to adopt a more rigorous approach for
predictive model evaluation that matches the complex practice of
model-building. In this work, we present a procedure to statistically test
hypotheses about model performance which goes beyond the state-of-the-practice
in the community to analyze both algorithms and feature extraction methods from
raw data. We apply this method to a series of algorithms and feature sets
derived from a large sample of Massive Open Online Courses (MOOCs). While a
complete comparison of all potential modeling approaches is beyond the scope of
this paper, we show that this approach reveals a large gap in dropout
prediction performance between forum-, assignment-, and clickstream-based
feature extraction methods, where the latter is significantly better than the
former two, which are in turn indistinguishable from one another. This work has
methodological implications for evaluating predictive or AI-based models of
student success, and practical implications for the design and targeting of
at-risk student models and interventions
False News On Social Media: A Data-Driven Survey
In the past few years, the research community has dedicated growing interest
to the issue of false news circulating on social networks. The widespread
attention on detecting and characterizing false news has been motivated by
considerable backlashes of this threat against the real world. As a matter of
fact, social media platforms exhibit peculiar characteristics, with respect to
traditional news outlets, which have been particularly favorable to the
proliferation of deceptive information. They also present unique challenges for
all kind of potential interventions on the subject. As this issue becomes of
global concern, it is also gaining more attention in academia. The aim of this
survey is to offer a comprehensive study on the recent advances in terms of
detection, characterization and mitigation of false news that propagate on
social media, as well as the challenges and the open questions that await
future research on the field. We use a data-driven approach, focusing on a
classification of the features that are used in each study to characterize
false information and on the datasets used for instructing classification
methods. At the end of the survey, we highlight emerging approaches that look
most promising for addressing false news
- …