A systematic comparison of supervised classifiers
Pattern recognition techniques have been employed in a myriad of industrial,
medical, commercial and academic applications. To tackle such a diversity of
data, many techniques have been devised. However, despite the long tradition of
pattern recognition research, there is no technique that yields the best
classification in all scenarios. Therefore, considering as many techniques as
possible is a fundamental practice in applications
aiming at high accuracy. Typical works comparing methods either emphasize the
performance of a given algorithm in validation tests or systematically compare
various algorithms, assuming that the practical use of these methods is done by
experts. On many occasions, however, researchers have to deal with their
practical classification tasks without in-depth knowledge of the underlying
mechanisms behind the parameters. In fact, the adequate choice of
classifiers and parameters alike in such practical circumstances constitutes a
long-standing problem and is the subject of the current paper. We carried out a
study on the performance of nine well-known classifiers implemented in the Weka
framework and compared how their accuracy depends on the parameter
configuration. The analysis of performance with default parameters
revealed that the k-nearest neighbors method outperforms the other methods by a
large margin when high-dimensional datasets are considered. When other
parameter configurations were allowed, we found that the performance of the SVM
can be improved by more than 20% even if parameters are set
randomly. Taken together, the investigation conducted in this paper suggests
that, apart from the SVM implementation, Weka's default parameter configuration
provides performance close to that achieved with the optimal configuration.
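
To make the kind of experiment described above concrete, here is a minimal,
hypothetical sketch using the Weka Java API: two of the nine classifier
families (k-nearest neighbors via IBk and the SVM via SMO) are evaluated with
default parameters by 10-fold cross-validation. The dataset path "iris.arff" is
a placeholder, and this illustrates the general setup only, not the paper's
actual protocol or choice of datasets.

```java
import java.util.Random;
import weka.classifiers.Classifier;
import weka.classifiers.Evaluation;
import weka.classifiers.functions.SMO;
import weka.classifiers.lazy.IBk;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class DefaultParameterComparison {
    public static void main(String[] args) throws Exception {
        // Load an ARFF dataset; "iris.arff" is a placeholder path.
        Instances data = new DataSource("iris.arff").getDataSet();
        data.setClassIndex(data.numAttributes() - 1);

        // Two of the compared classifiers, instantiated with Weka's defaults:
        // IBk is k-nearest neighbors (default k = 1); SMO trains an SVM.
        Classifier[] classifiers = { new IBk(), new SMO() };

        for (Classifier c : classifiers) {
            Evaluation eval = new Evaluation(data);
            // 10-fold cross-validation with a fixed seed for reproducibility.
            eval.crossValidateModel(c, data, 10, new Random(1));
            System.out.printf("%s: %.2f%% correct%n",
                    c.getClass().getSimpleName(), eval.pctCorrect());
        }
    }
}
```

Re-running the same loop after perturbing the parameters (for instance, the
SVM's complexity constant via SMO's setC method) would mirror the paper's
random-configuration setting.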
Beta-trees: Multivariate histograms with confidence statements
Multivariate histograms are difficult to construct due to the curse of
dimensionality. Motivated by k-d trees in computer science, we show how to
construct an efficient data-adaptive partition of Euclidean space that
possesses the following two properties: With high confidence the distribution
from which the data are generated is close to uniform on each rectangle of the
partition; and despite the data-dependent construction we can give guaranteed
finite sample simultaneous confidence intervals for the probabilities (and
hence for the average densities) of each rectangle in the partition. This
partition will automatically adapt to the sizes of the regions where the
distribution is close to uniform. The methodology produces confidence intervals
whose widths depend only on the probability content of the rectangles and not
on the dimensionality of the space, thus avoiding the curse of dimensionality.
Moreover, the widths essentially match the optimal widths in the univariate
setting. The simultaneous validity of the confidence intervals allows one to
use this construction, which we call Beta-trees, for various data-analytic
purposes. We illustrate this by using Beta-trees for visualizing data and for
multivariate mode-hunting.
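
As a rough illustration of the data-adaptive partitioning idea (not the
authors' actual algorithm, whose stopping rule and confidence intervals are
derived in the paper), the sketch below recursively splits a point set at
coordinate medians in the style of a k-d tree and reports the empirical
probability of each resulting rectangle. The name Beta-trees presumably
reflects the fact that, for i.i.d. data, the probability content cut off at an
order statistic follows a Beta distribution; the interval computation built on
that fact is omitted here.

```java
import java.util.Arrays;
import java.util.Random;

// Sketch of a k-d-tree-style median partition: each leaf corresponds to a
// rectangle of the partition, and its empirical probability is reported.
// The Beta-tree's stopping rule and simultaneous confidence intervals are
// NOT implemented here.
public class MedianPartitionSketch {
    static void partition(double[][] pts, int lo, int hi,
                          int dim, int leafSize, int n) {
        if (hi - lo <= leafSize) {
            // Empirical probability of this rectangle of the partition.
            System.out.printf("leaf: %d points, phat = %.3f%n",
                    hi - lo, (hi - lo) / (double) n);
            return;
        }
        final int d = dim;
        // Order the node's points by coordinate d and split at the median,
        // i.e., at an order statistic of that coordinate.
        Arrays.sort(pts, lo, hi, (a, b) -> Double.compare(a[d], b[d]));
        int mid = lo + (hi - lo) / 2;
        int next = (dim + 1) % pts[0].length;  // cycle through the dimensions
        partition(pts, lo, mid, next, leafSize, n);
        partition(pts, mid, hi, next, leafSize, n);
    }

    public static void main(String[] args) {
        Random rng = new Random(1);
        double[][] pts = new double[512][2];
        for (double[] p : pts) { p[0] = rng.nextGaussian(); p[1] = rng.nextGaussian(); }
        partition(pts, 0, pts.length, 0, 32, pts.length);
    }
}
```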