Bounded Guaranteed Algorithms for Concave Impurity Minimization Via Maximum Likelihood
Partitioning algorithms play a key role in many scientific and engineering
disciplines. A partitioning algorithm divides a set into a number of disjoint
subsets, or partitions. Often, the quality of the resulting partitions is
measured by the amount of impurity in each partition: the smaller the
impurity, the higher the quality of the partitions. In general, for a given
impurity measure specified by a function of the partitions, finding the
minimum-impurity partitions is an NP-hard problem. Let $N$ be the number of
$d$-dimensional elements in a set and $K$ be the number of desired partitions;
an exhaustive search over all possible partitions to find a minimum-impurity
partition then has complexity $O(K^N)$, which quickly becomes impractical for
many applications even with modest values of $N$ and $K$. Thus, many
approximate algorithms with polynomial time complexity have been proposed, but
few provide a bounded guarantee. In this paper, an upper bound and a lower bound for a class
of impurity functions are constructed. Based on these bounds, we propose a
low-complexity partitioning algorithm with bounded guarantee based on the
maximum likelihood principle. Theoretical analyses of the bounded guarantee
of the algorithm are given for two well-known impurity functions, the Gini
index and entropy. The proposed algorithm achieves state-of-the-art results in
terms of approximation quality while retaining polynomial time complexity. In
addition, a heuristic greedy-merge algorithm with polynomial time complexity
is proposed. Although the greedy-merge algorithm does not provide a bounded
guarantee, its performance is comparable to that of the state-of-the-art
methods. Our results also generalize some well-known information-theoretic
bounds such as Fano's inequality and Boyd-Chiang's bound.

Comment: 13 pages, 6 figures
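To make the objective above concrete, the sketch below computes the weighted Gini impurity of a candidate assignment of labeled elements to partitions. This is a generic illustration of the impurity objective being minimized, not the paper's bounded-guarantee algorithm; all function and variable names are ours.

```python
from collections import defaultdict

def gini_impurity(counts):
    """Gini impurity 1 - sum(p_c^2) of one partition, given its
    per-class counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return 1.0 - sum((c / total) ** 2 for c in counts)

def total_impurity(labels, assignment, num_classes):
    """Weighted Gini impurity over all partitions.

    labels: class label of each element;
    assignment: partition id of each element (same order).
    """
    buckets = defaultdict(lambda: [0] * num_classes)
    for label, part in zip(labels, assignment):
        buckets[part][label] += 1
    n = len(labels)
    return sum(sum(counts) / n * gini_impurity(counts)
               for counts in buckets.values())
```

A pure split such as `total_impurity([0, 0, 1, 1], [0, 0, 1, 1], 2)` scores 0, while mixing the classes across partitions raises the weighted impurity; exhaustively scoring all assignments this way is exactly the $O(K^N)$ search the paper's algorithms avoid.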
Ranking Median Regression: Learning to Order through Local Consensus
This article is devoted to the problem of predicting the value taken by a
random permutation $\Sigma$, describing the preferences of an individual over a
set of numbered items $\{1, \ldots, n\}$ say, based on the observation of
an input/explanatory r.v. $X$ (e.g. characteristics of the individual), when
error is measured by the Kendall $\tau$ distance. In the probabilistic
formulation of the 'Learning to Order' problem we propose, which extends the
framework for statistical Kemeny ranking aggregation developed in
\citet{CKS17}, this boils down to recovering conditional Kemeny medians of
$\Sigma$ given $X$ from i.i.d. training examples $(X_1, \Sigma_1), \ldots,
(X_N, \Sigma_N)$. For this reason, this statistical learning problem is
referred to as \textit{ranking median regression} here. Our contribution is
twofold. We first propose a probabilistic theory of ranking median regression:
the set of optimal elements is characterized, the performance of empirical risk
minimizers is investigated in this context, and situations where fast learning
rates can be achieved are also exhibited. Next, we introduce the concept of
local consensus/median in order to derive efficient methods for ranking median
regression. The major advantage of this local learning approach lies in its
close connection with the widely studied Kemeny aggregation problem. From an
algorithmic perspective, this makes it possible to build predictive rules for
ranking median regression by implementing efficient techniques for (approximate)
Kemeny median computation at a local level in a tractable manner. In particular,
versions of $k$-nearest neighbor and tree-based methods, tailored to ranking
median regression, are investigated. The accuracy of piecewise constant ranking
median regression rules is studied under a specific smoothness assumption on
$\Sigma$'s conditional distribution given $X$.
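The two ingredients of the abstract above, the Kendall $\tau$ distance and the Kemeny median, can be sketched as follows. This is a brute-force illustration of the definitions (feasible only for small $n$), not the article's local-consensus method; names are ours.

```python
from itertools import combinations, permutations

def kendall_tau(sigma, tau):
    """Kendall tau distance: number of item pairs on which two rankings
    disagree. sigma[i] is the rank given to item i."""
    return sum(1 for i, j in combinations(range(len(sigma)), 2)
               if (sigma[i] - sigma[j]) * (tau[i] - tau[j]) < 0)

def kemeny_median(rankings):
    """Exhaustive Kemeny median: the permutation minimizing the total
    Kendall tau distance to the observed rankings."""
    n = len(rankings[0])
    return min(permutations(range(n)),
               key=lambda cand: sum(kendall_tau(cand, r) for r in rankings))
```

Ranking median regression replaces this single global median with conditional medians: the local-consensus approach of the article amounts to running a Kemeny aggregation like `kemeny_median` over only the training rankings whose inputs $X_i$ fall near the query point (e.g. within a tree cell or among the $k$ nearest neighbors).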
An Introduction to Recursive Partitioning: Rationale, Application and Characteristics of Classification and Regression Trees, Bagging and Random Forests
Recursive partitioning methods have become popular and widely used tools for nonparametric regression and classification in many scientific fields. Random forests in particular, which can deal with large numbers of predictor variables even in the presence of complex interactions, have been applied successfully in genetics, clinical medicine and bioinformatics within the past few years.
High dimensional problems are common not only in genetics, but also in some areas of psychological research, where only a few subjects can be measured due to time or cost constraints, yet a large amount of data is generated for each subject. Random forests have been shown to achieve high prediction accuracy in such applications, and provide descriptive variable importance measures reflecting the impact of each variable through both main effects and interactions.
The aim of this work is to introduce the principles of the standard recursive partitioning methods as well as recent methodological improvements, to illustrate their usage for low and high dimensional data exploration, but also to point out limitations of the methods and potential pitfalls in their practical application.
Application of the methods is illustrated using freely available implementations in the R system for statistical computing.
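The core mechanism behind bagging and random forests described above, fitting many trees on bootstrap resamples and aggregating their votes, can be sketched in a few lines. This is a language-agnostic toy (decision stumps rather than full trees, and no random feature subsampling), not the R implementations the work uses; all names are ours.

```python
import random

def _majority(labels):
    """Most frequent label; 0 for an empty leaf."""
    return max(set(labels), key=labels.count) if labels else 0

def fit_stump(X, y):
    """Fit a one-split decision tree: pick the feature/threshold pair
    whose majority-vote leaves maximize training accuracy."""
    best = None
    for f in range(len(X[0])):
        for t in sorted({row[f] for row in X}):
            left = [yi for row, yi in zip(X, y) if row[f] <= t]
            right = [yi for row, yi in zip(X, y) if row[f] > t]
            lm, rm = _majority(left), _majority(right)
            acc = sum((lm if row[f] <= t else rm) == yi
                      for row, yi in zip(X, y))
            if best is None or acc > best[0]:
                best = (acc, f, t, lm, rm)
    _, f, t, lm, rm = best
    return lambda row: lm if row[f] <= t else rm

def bagged_ensemble(X, y, n_trees=25, seed=0):
    """Bagging: fit each stump on a bootstrap resample of the data and
    predict by majority vote over the ensemble."""
    rng = random.Random(seed)
    stumps = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(X)) for _ in range(len(X))]
        stumps.append(fit_stump([X[i] for i in idx], [y[i] for i in idx]))
    def predict(row):
        votes = [s(row) for s in stumps]
        return max(set(votes), key=votes.count)
    return predict
```

A random forest additionally restricts each split to a random subset of the predictor variables, which decorrelates the trees; the aggregation step is the same majority vote shown here.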
Integrating Economic Knowledge in Data Mining Algorithms
The assessment of knowledge derived from databases depends on many factors. Decision makers often need to convince others about the correctness and effectiveness of knowledge induced from data. The current data mining techniques do not contribute much to this process of persuasion. Part of this limitation can be removed by integrating knowledge from experts in the field, encoded in some accessible way, with knowledge derived from patterns stored in the database. In this paper we discuss, in particular, methods for implementing monotonicity constraints in economic decision problems. This prior knowledge is combined with data mining algorithms based on decision trees and neural networks. The method is illustrated in a hedonic price model.

Keywords: knowledge; neural networks; data mining; decision trees
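A monotonicity constraint of the kind discussed above states, for example, that the predicted price in a hedonic price model should never decrease as floor area grows. The sketch below checks such a constraint along one feature and enforces it post hoc with a running maximum; this is a crude illustration of the constraint itself, not the paper's method of building the constraint into trees and neural networks, and all names are ours.

```python
def is_monotone(values):
    """True if the sequence of predictions is nondecreasing."""
    return all(a <= b for a, b in zip(values, values[1:]))

def enforce_monotone(predict, grid):
    """Evaluate a model along an increasing grid of one feature
    (e.g. floor area) and repair violations with a running maximum,
    yielding nondecreasing predictions."""
    out, running = [], float("-inf")
    for x in grid:
        running = max(running, predict(x))
        out.append(running)
    return out
```

For instance, a model that dips at one grid point, say predictions [0, 1, 0, 3], is repaired to [0, 1, 1, 3]; integrating the constraint during training, as the paper does, avoids such ad hoc corrections.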