The miracle of the Septuagint and the promise of data mining in economics
This paper argues that the sometimes-conflicting results of a modern revisionist literature on data mining in econometrics reflect different approaches to solving the central problem of model uncertainty in a science of non-experimental data. The literature has entered an exciting phase: theoretical development, methodological reflection, considerable technological strides on the computing front, and interesting empirical applications are providing momentum for this branch of econometrics. The organising principle for this discussion of data mining is a philosophical spectrum that sorts the various econometric traditions according to their epistemological assumptions (about the underlying data-generating process, or DGP), starting with nihilism at one end and reaching claims of encompassing the DGP at the other; call it the DGP-spectrum. In the course of exploring this spectrum the reader will encounter various Bayesian, specific-to-general (S-G) as well as general-to-specific (G-S) methods. To set the stage for this exploration the paper starts with a description of data mining, its potential risks, and a short section on potential institutional safeguards against these problems.
Keywords: data mining, model selection, automated model selection, general-to-specific modelling, extreme bounds analysis, Bayesian model selection
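As a concrete illustration of the general-to-specific (G-S) tradition mentioned above, the sketch below runs a toy backward elimination: start from a deliberately over-parameterized ("general") regression and repeatedly drop the least significant regressor. This is a minimal sketch for illustration only, not the LSE/Hendry algorithm or any packaged implementation; the simulated data, the 5% threshold, and the function name are assumptions.

```python
import numpy as np
import statsmodels.api as sm

def general_to_specific(y, X, alpha=0.05):
    """Toy general-to-specific selection: start from the full ("general")
    model and iteratively remove the regressor with the largest p-value
    until every remaining regressor is significant at level alpha."""
    cols = list(range(X.shape[1]))
    while cols:
        model = sm.OLS(y, sm.add_constant(X[:, cols])).fit()
        pvals = model.pvalues[1:]          # skip the intercept
        worst = int(np.argmax(pvals))
        if pvals[worst] <= alpha:          # all survivors are significant
            return cols, model
        del cols[worst]                    # drop the least significant regressor
    return cols, None

# Simulated DGP: only the first two of eight candidate regressors matter.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))
y = 1.5 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(size=200)
kept, fitted = general_to_specific(y, X)
print("retained regressors:", kept)
```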
On Cognitive Preferences and the Plausibility of Rule-based Models
It is conventional wisdom in machine learning and data mining that logical
models such as rule sets are more interpretable than other models, and that
among such rule-based models, simpler models are more interpretable than more
complex ones. In this position paper, we question this latter assumption by
focusing on one particular aspect of interpretability, namely the plausibility
of models. Roughly speaking, we equate the plausibility of a model with the
likelihood that a user accepts it as an explanation for a prediction. In
particular, we argue that, all other things being equal, longer explanations
may be more convincing than shorter ones, and that the predominant bias for
shorter models, which is typically necessary for learning powerful
discriminative models, may not be suitable when it comes to user acceptance of
the learned models. To that end, we first recapitulate evidence for and against
this postulate, and then report the results of an evaluation in a
crowd-sourcing study based on about 3,000 judgments. The results do not reveal a strong preference for simple rules, whereas we can observe a weak preference for longer rules in some domains. We then relate these results to well-known cognitive biases such as the conjunction fallacy, the representativeness heuristic, or the recognition heuristic, and investigate their relation to rule length and plausibility.
Comment: V4: Another rewrite of section on interpretability to clarify focus on plausibility and relation to interpretability, comprehensibility, and justifiability
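For readers unfamiliar with the conjunction fallacy invoked above: a conjunction A AND B can never be more probable than A alone, yet people often judge the longer, more specific statement to be more likely. A tiny numerical sketch with made-up counts:

```python
# Hypothetical data set of 1,000 instances; counts are invented for illustration.
n_total = 1000
n_A = 400          # instances satisfying condition A (e.g., "age > 30")
n_A_and_B = 150    # instances satisfying both A and B ("age > 30 AND income > 50k")

p_A = n_A / n_total
p_A_and_B = n_A_and_B / n_total

# The longer rule "A AND B" can never be MORE probable than "A" alone...
assert p_A_and_B <= p_A
print(f"P(A) = {p_A:.2f}, P(A and B) = {p_A_and_B:.2f}")
# ...yet the paper's point is that users may still find the longer,
# more specific rule more plausible as an explanation.
```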
Efficient Optimization of Performance Measures by Classifier Adaptation
In practical applications, machine learning algorithms are often needed to learn classifiers that optimize domain-specific performance measures. Previous research has focused on learning the needed classifier in isolation, yet learning a nonlinear classifier for nonlinear and nonsmooth performance measures remains hard. In this paper, rather than learning the needed classifier by optimizing the specific performance measure directly, we circumvent this problem by proposing a novel two-step approach called CAPO: first train nonlinear auxiliary classifiers with existing learning methods, then adapt the auxiliary classifiers to the specific performance measure. In the first step, the auxiliary classifiers can be obtained efficiently with off-the-shelf learning algorithms. For the second step, we show that the classifier adaptation problem can be reduced to a quadratic programming problem similar to linear SVMperf, which can be solved efficiently. By exploiting nonlinear auxiliary classifiers, CAPO generates a nonlinear classifier that optimizes a large variety of performance measures, including all performance measures based on the contingency table as well as AUC, while keeping high computational efficiency. Empirical studies show that CAPO is effective and computationally efficient, and is even more efficient than linear SVMperf.
Comment: 30 pages, 5 figures, to appear in IEEE Transactions on Pattern Analysis and Machine Intelligence, 201…
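A rough sketch of the two-step idea follows. It is not the authors' CAPO algorithm: the adaptation step below fits an ordinary class-weighted linear hinge-loss model on the auxiliary classifier's scores rather than solving the SVMperf-style quadratic program, and the data, models, and balanced-accuracy target are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, weights=[0.9, 0.1],
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Step 1: nonlinear auxiliary classifier from an off-the-shelf learner.
aux = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)

# Step 2: adapt the auxiliary scores with a cheap linear model whose class
# weights are tuned toward the target measure (here balanced accuracy, a
# stand-in for a contingency-table-based measure).
z_tr = aux.predict_proba(X_tr)[:, [1]]   # auxiliary scores as a single feature
z_te = aux.predict_proba(X_te)[:, [1]]
adapted = SGDClassifier(loss="hinge", class_weight="balanced",
                        random_state=0).fit(z_tr, y_tr)

print("auxiliary:", balanced_accuracy_score(y_te, aux.predict(X_te)))
print("adapted  :", balanced_accuracy_score(y_te, adapted.predict(z_te)))
```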
Agent-Based Simulations of Blockchain protocols illustrated via Kadena's Chainweb
While many distributed consensus protocols provide robust liveness and
consistency guarantees under the presence of malicious actors, quantitative
estimates of how economic incentives affect security are few and far between.
In this paper, we describe a system for simulating how adversarial agents, both
economically rational and Byzantine, interact with a blockchain protocol. This
system provides statistical estimates for the economic difficulty of an attack
and how the presence of certain actors influences protocol-level statistics,
such as the expected time to regain liveness. This simulation system is
influenced by the design of algorithmic trading and reinforcement learning
systems that use explicit modeling of an agent's reward mechanism to evaluate
and optimize a fully autonomous agent. We implement and apply this simulation
framework to Kadena's Chainweb, a parallelized Proof-of-Work system whose
security and censorship resistance depend in complex ways on miner incentive
compliance. We provide the first formal description of Chainweb in the
literature and use this formal description to motivate our simulation
design. Our simulation results include a phase transition in block height
design. Our simulation results include a phase transition in block height
growth rate as a function of shard connectivity and empirical evidence that
censorship in Chainweb is too costly for rational miners to engage in. We
conclude with an outlook on how simulation can guide and optimize protocol
development in a variety of contexts, including Proof-of-Stake parameter
optimization and peer-to-peer networking design.
Comment: 10 pages, 7 figures, accepted to the IEEE S&B 2019 conference
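The flavor of such an agent-based simulation can be sketched with a toy single-chain model in which miners win blocks in proportion to hash power and a censoring miner forgoes transaction fees. This is not Chainweb's braided multi-chain protocol; every agent, share, and fee parameter below is invented.

```python
import random

random.seed(42)

MINERS = [
    # (name, hash-power share, censors transactions?)
    ("honest_a", 0.45, False),
    ("honest_b", 0.35, False),
    ("censor",   0.20, True),   # rational miner that drops some transactions
]

def mine_block():
    """Pick the next block's winner with probability proportional to hash power."""
    r, acc = random.random(), 0.0
    for name, share, censors in MINERS:
        acc += share
        if r < acc:
            return name, censors
    return MINERS[-1][0], MINERS[-1][2]

def simulate(n_blocks=10_000, fee_per_censored_tx=0.1):
    """Tally block wins per miner and the fees the censor forgoes."""
    wins = {name: 0 for name, _, _ in MINERS}
    censor_fee_loss = 0.0
    for _ in range(n_blocks):
        name, censors = mine_block()
        wins[name] += 1
        if censors:
            censor_fee_loss += fee_per_censored_tx  # fees left on the table
    return wins, censor_fee_loss

wins, loss = simulate()
print("blocks won:", wins)
print(f"fees forgone by the censoring miner: {loss:.1f}")
```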
On the role of pre and post-processing in environmental data mining
The quality of discovered knowledge depends heavily on data quality. Unfortunately, real data tend to contain noise, uncertainty, errors, redundancies, or even irrelevant information. The more complex the reality to be analyzed, the higher the risk of getting low-quality data. Knowledge Discovery from Databases (KDD) offers a global framework for preparing data in the right form to perform correct analyses. On the other hand, the quality of decisions taken upon KDD results depends not only on the quality of the results themselves, but also on the capacity of the system to communicate those results in an understandable form. Environmental systems are particularly complex, and environmental users particularly require clarity in their results. In this paper some details about how this can be achieved are provided, and the role of pre- and post-processing in the whole process of Knowledge Discovery in environmental systems is discussed.
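A minimal sketch of the pre- and post-processing roles discussed above, assuming a hypothetical environmental table with a sensor outlier and missing readings; the column names and thresholds are invented for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical raw environmental readings with typical quality problems.
raw = pd.DataFrame({
    "station":  ["A", "A", "B", "B", "B"],
    "no2_ugm3": [21.0, np.nan, 19.5, 430.0, 18.9],   # one missing, one outlier
    "temp_c":   [14.2, 14.1, 13.8, 13.9, np.nan],
})

def preprocess(df, no2_max=300.0):
    """Pre-processing: flag physically implausible NO2 readings as missing,
    then impute remaining gaps per station with the station median."""
    df = df.copy()
    df.loc[df["no2_ugm3"] > no2_max, "no2_ugm3"] = np.nan
    for col in ["no2_ugm3", "temp_c"]:
        df[col] = df.groupby("station")[col].transform(lambda s: s.fillna(s.median()))
    return df

clean = preprocess(raw)

# Post-processing: communicate results in a form the end user understands,
# e.g. a per-station summary rather than raw model output.
print(clean.groupby("station")["no2_ugm3"].agg(["mean", "max"]).round(1))
```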
Feature selection algorithms: a survey and experimental evaluation
In view of the substantial number of existing feature selection algorithms, the need arises for criteria that make it possible to decide adequately which algorithm to use in a given situation. This work reviews several fundamental algorithms found in the literature and assesses their performance in a controlled scenario. A scoring measure ranks the algorithms by taking into account the amount of relevance, irrelevance, and redundancy in sample data sets. This measure computes the degree of matching between the output given by the algorithm and the known optimal solution. Sample size effects are also studied.
Postprint (published version)
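The scoring measure described, matching an algorithm's selected subset against the known optimal one, can be sketched as simple set-overlap statistics; the precision/recall/Jaccard combination below is an assumed stand-in, not necessarily the paper's exact formula.

```python
def selection_scores(selected, optimal):
    """Compare a feature-selection output against the known optimal subset.

    Returns precision (selected features that are truly relevant),
    recall (relevant features that were found), and the Jaccard index.
    """
    selected, optimal = set(selected), set(optimal)
    tp = len(selected & optimal)
    precision = tp / len(selected) if selected else 0.0
    recall = tp / len(optimal) if optimal else 0.0
    jaccard = tp / len(selected | optimal) if selected | optimal else 1.0
    return {"precision": precision, "recall": recall, "jaccard": jaccard}

# Synthetic example: features 0-4 are relevant, 5-9 irrelevant or redundant.
print(selection_scores(selected=[0, 1, 2, 7], optimal=[0, 1, 2, 3, 4]))
# {'precision': 0.75, 'recall': 0.6, 'jaccard': 0.5}
```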