11,123 research outputs found
CS Circles: An In-Browser Python Course for Beginners
Computer Science Circles is a free programming website for beginners that is
designed to be fun, easy to use, and accessible to the broadest possible
audience. We teach Python since it is simple yet powerful, and the course
content is well-structured but written in plain language. The website has over
one hundred exercises in thirty lesson pages, plus special features to help
teachers support their students. It is available in both English and French. We
discuss the philosophy behind the course and its design, we describe how it was
implemented, and we give statistics on its use.Comment: To appear in SIGCSE 201
PrivacyScore: Improving Privacy and Security via Crowd-Sourced Benchmarks of Websites
Website owners make conscious and unconscious decisions that affect their
users, potentially exposing them to privacy and security risks in the process.
In this paper we introduce PrivacyScore, an automated website scanning portal
that allows anyone to benchmark security and privacy features of multiple
websites. In contrast to existing projects, the checks implemented in
PrivacyScore cover a wider range of potential privacy and security issues.
Furthermore, users can control the ranking and analysis methodology. Therefore,
PrivacyScore can also be used by data protection authorities to perform
regularly scheduled compliance checks. In the long term we hope that the
transparency resulting from the published benchmarks creates an incentive for
website owners to improve their sites. The public availability of a first
version of PrivacyScore was announced at the ENISA Annual Privacy Forum in June
2017.Comment: 14 pages, 4 figures. A german version of this paper discussing the
legal aspects of this system is available at arXiv:1705.0888
Automating biomedical data science through tree-based pipeline optimization
Over the past decade, data science and machine learning has grown from a
mysterious art form to a staple tool across a variety of fields in academia,
business, and government. In this paper, we introduce the concept of tree-based
pipeline optimization for automating one of the most tedious parts of machine
learning---pipeline design. We implement a Tree-based Pipeline Optimization
Tool (TPOT) and demonstrate its effectiveness on a series of simulated and
real-world genetic data sets. In particular, we show that TPOT can build
machine learning pipelines that achieve competitive classification accuracy and
discover novel pipeline operators---such as synthetic feature
constructors---that significantly improve classification accuracy on these data
sets. We also highlight the current challenges to pipeline optimization, such
as the tendency to produce pipelines that overfit the data, and suggest future
research paths to overcome these challenges. As such, this work represents an
early step toward fully automating machine learning pipeline design.Comment: 16 pages, 5 figures, to appear in EvoBIO 2016 proceeding
Shingle 2.0: generalising self-consistent and automated domain discretisation for multi-scale geophysical models
The approaches taken to describe and develop spatial discretisations of the
domains required for geophysical simulation models are commonly ad hoc, model
or application specific and under-documented. This is particularly acute for
simulation models that are flexible in their use of multi-scale, anisotropic,
fully unstructured meshes where a relatively large number of heterogeneous
parameters are required to constrain their full description. As a consequence,
it can be difficult to reproduce simulations, ensure a provenance in model data
handling and initialisation, and a challenge to conduct model intercomparisons
rigorously. This paper takes a novel approach to spatial discretisation,
considering it much like a numerical simulation model problem of its own. It
introduces a generalised, extensible, self-documenting approach to carefully
describe, and necessarily fully, the constraints over the heterogeneous
parameter space that determine how a domain is spatially discretised. This
additionally provides a method to accurately record these constraints, using
high-level natural language based abstractions, that enables full accounts of
provenance, sharing and distribution. Together with this description, a
generalised consistent approach to unstructured mesh generation for geophysical
models is developed, that is automated, robust and repeatable, quick-to-draft,
rigorously verified and consistent to the source data throughout. This
interprets the description above to execute a self-consistent spatial
discretisation process, which is automatically validated to expected discrete
characteristics and metrics.Comment: 18 pages, 10 figures, 1 table. Submitted for publication and under
revie
Recommended from our members
Building thermal load prediction through shallow machine learning and deep learning
Building thermal load prediction informs the optimization of cooling plant and thermal energy storage. Physics-based prediction models of building thermal load are constrained by the model and input complexity. In this study, we developed 12 data-driven models (7 shallow learning, 2 deep learning, and 3 heuristic methods) to predict building thermal load and compared shallow machine learning and deep learning. The 12 prediction models were compared with the measured cooling demand. It was found XGBoost (Extreme Gradient Boost) and LSTM (Long Short Term Memory) provided the most accurate load prediction in the shallow and deep learning category, and both outperformed the best baseline model, which uses the previous day's data for prediction. Then, we discussed how the prediction horizon and input uncertainty would influence the load prediction accuracy. Major conclusions are twofold: first, LSTM performs well in short-term prediction (1 h ahead) but not in long term prediction (24 h ahead), because the sequential information becomes less relevant and accordingly not so useful when the prediction horizon is long. Second, the presence of weather forecast uncertainty deteriorates XGBoost's accuracy and favors LSTM, because the sequential information makes the model more robust to input uncertainty. Training the model with the uncertain rather than accurate weather data could enhance the model's robustness. Our findings have two implications for practice. First, LSTM is recommended for short-term load prediction given that weather forecast uncertainty is unavoidable. Second, XGBoost is recommended for long term prediction, and the model should be trained with the presence of input uncertainty
BEAT: An Open-Source Web-Based Open-Science Platform
With the increased interest in computational sciences, machine learning (ML),
pattern recognition (PR) and big data, governmental agencies, academia and
manufacturers are overwhelmed by the constant influx of new algorithms and
techniques promising improved performance, generalization and robustness.
Sadly, result reproducibility is often an overlooked feature accompanying
original research publications, competitions and benchmark evaluations. The
main reasons behind such a gap arise from natural complications in research and
development in this area: the distribution of data may be a sensitive issue;
software frameworks are difficult to install and maintain; Test protocols may
involve a potentially large set of intricate steps which are difficult to
handle. Given the raising complexity of research challenges and the constant
increase in data volume, the conditions for achieving reproducible research in
the domain are also increasingly difficult to meet.
To bridge this gap, we built an open platform for research in computational
sciences related to pattern recognition and machine learning, to help on the
development, reproducibility and certification of results obtained in the
field. By making use of such a system, academic, governmental or industrial
organizations enable users to easily and socially develop processing
toolchains, re-use data, algorithms, workflows and compare results from
distinct algorithms and/or parameterizations with minimal effort. This article
presents such a platform and discusses some of its key features, uses and
limitations. We overview a currently operational prototype and provide design
insights.Comment: References to papers published on the platform incorporate
- …