Leakage and the Reproducibility Crisis in ML-based Science
The use of machine learning (ML) methods for prediction and forecasting has
become widespread across the quantitative sciences. However, there are many
known methodological pitfalls, including data leakage, in ML-based science. In
this paper, we systematically investigate reproducibility issues in ML-based
science. We show that data leakage is indeed a widespread problem and has led
to severe reproducibility failures. Specifically, through a survey of
literature in research communities that adopted ML methods, we find 17 fields
where errors have been found, collectively affecting 329 papers and in some
cases leading to wildly overoptimistic conclusions. Based on our survey, we
present a fine-grained taxonomy of 8 types of leakage that range from textbook
errors to open research problems.
We argue for fundamental methodological changes to ML-based science so that
cases of leakage can be caught before publication. To that end, we propose
model info sheets for reporting scientific claims based on ML models that would
address all types of leakage identified in our survey. To investigate the
impact of reproducibility errors and the efficacy of model info sheets, we
undertake a reproducibility study in a field where complex ML models are
believed to vastly outperform older statistical models such as Logistic
Regression (LR): civil war prediction. We find that all papers claiming the
superior performance of complex ML models compared to LR models fail to
reproduce due to data leakage, and complex ML models don't perform
substantively better than decades-old LR models. While none of these errors
could have been caught by reading the papers, model info sheets would enable
the detection of leakage in each case.
A Critical Look at Decentralized Personal Data Architectures
While the Internet was conceived as a decentralized network, the most widely
used web applications today tend toward centralization. Control increasingly
rests with centralized service providers who, as a consequence, have also
amassed unprecedented amounts of data about the behaviors and personalities of
individuals.
Developers, regulators, and consumer advocates have looked to alternative
decentralized architectures as the natural response to threats posed by these
centralized services. The result has been a great variety of solutions that
include personal data stores (PDS), infomediaries, Vendor Relationship
Management (VRM) systems, and federated and distributed social networks. And
yet, for all these efforts, decentralized personal data architectures have seen
little adoption.
This position paper attempts to account for these failures, challenging the
accepted wisdom in the web community on the feasibility and desirability of
these approaches. We start with a historical discussion of the development of
various categories of decentralized personal data architectures. Then we survey
the main ideas to illustrate the common themes among these efforts. We tease
apart the design characteristics of these systems from the social values that
they (are intended to) promote. We use this understanding to point out numerous
drawbacks of the decentralization paradigm, some inherent and others
incidental. We end with recommendations for designers of these systems for
working towards goals that are achievable, but perhaps more limited in scope
and ambition.