The k-NN algorithm for compositional data: a revised approach with and without zero values present
In compositional data, an observation is a vector with non-negative
components which sum to a constant, typically 1. Data of this type arise in
many areas, such as geology, archaeology, biology, economics and political
science among others. The goal of this paper is to extend the taxicab metric
and a newly suggested metric for compositional data by employing a power
transformation. Both metrics are to be used in the k-nearest neighbours
algorithm regardless of the presence of zeros. Examples with real data are
exhibited.
Comment: This manuscript will appear at
http://www.jds-online.com/volume-12-number-3-july-201
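A minimal sketch of the general recipe described above, assuming the simple power
transformation x_j^alpha / sum_k x_k^alpha and the taxicab (L1) distance inside a plain
k-nearest-neighbours classifier; the exact metrics proposed in the paper may differ, and
all function names below are illustrative only.

import numpy as np

def power_transform(X, alpha):
    # Power-transform each composition: x_j^alpha / sum_k x_k^alpha.
    # With alpha > 0, zero components stay zero, so zeros need no imputation.
    Xa = np.power(X, alpha)
    return Xa / Xa.sum(axis=1, keepdims=True)

def knn_taxicab(X_train, y_train, X_new, alpha=0.5, k=5):
    # Classify new compositions by majority vote among the k nearest
    # power-transformed training compositions under the taxicab (L1) metric.
    Zt = power_transform(X_train, alpha)
    Zn = power_transform(X_new, alpha)
    preds = []
    for z in Zn:
        dist = np.abs(Zt - z).sum(axis=1)          # taxicab distances
        nearest = y_train[np.argsort(dist)[:k]]
        labels, counts = np.unique(nearest, return_counts=True)
        preds.append(labels[np.argmax(counts)])
    return np.array(preds)

# Toy usage: 3-part compositions, a few of them containing a zero component.
rng = np.random.default_rng(0)
X = rng.dirichlet([2, 3, 5], size=60)
X[:5, 0] = 0.0
X[:5] /= X[:5].sum(axis=1, keepdims=True)          # re-close the rows with zeros
y = (X[:, 2] > 0.5).astype(int)
print(knn_taxicab(X[:50], y[:50], X[50:], alpha=0.5, k=5))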
A novel, divergence based, regression for compositional data
In compositional data, an observation is a vector with non-negative
components which sum to a constant, typically 1. Data of this type arise in
many areas, such as geology, archaeology, biology, economics and political
science amongst others. The goal of this paper is to propose a new,
divergence-based regression modelling technique for compositional data. To do
so, a recently proved metric, which is a special case of the Jensen-Shannon
divergence, is employed. A strong advantage of this new regression technique is
that zeros are naturally handled. An example with real data and simulation
studies are presented, and both are compared with the log-ratio based
regression suggested by Aitchison in 1986.
Comment: This is a preprint of the paper accepted for publication in the
Proceedings of the 28th Panhellenic Statistics Conference, 15-18/4/2015,
Athens, Greece.
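A rough sketch of how a divergence-based compositional regression of this kind can be set
up: fitted compositions come from a multinomial-logit (inverse additive log-ratio) link,
and the coefficients minimise the total Jensen-Shannon divergence between observed and
fitted compositions, which stays finite when observed components are zero. The link, the
plain JS divergence (rather than the paper's exact metric) and the optimiser are
assumptions made for illustration.

import numpy as np
from scipy.optimize import minimize

def fitted_compositions(X, B):
    # Multinomial-logit link: the first component acts as the reference.
    eta = np.column_stack([np.zeros(len(X)), X @ B])
    e = np.exp(eta - eta.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def js_divergence(P, Q, eps=1e-12):
    # Row-wise Jensen-Shannon divergence; zero components contribute ~0.
    M = 0.5 * (P + Q)
    def kl(A, C):
        A = np.clip(A, eps, None)
        C = np.clip(C, eps, None)
        return np.sum(A * np.log(A / C), axis=1)
    return 0.5 * kl(P, M) + 0.5 * kl(Q, M)

def fit_js_regression(X, Y):
    # Estimate the coefficient matrix by minimising the total divergence
    # between the observed compositions Y (n x D) and the fitted ones.
    n, p = X.shape
    D = Y.shape[1]
    def objective(b):
        return js_divergence(Y, fitted_compositions(X, b.reshape(p, D - 1))).sum()
    res = minimize(objective, np.zeros(p * (D - 1)), method="BFGS")
    return res.x.reshape(p, D - 1)

# Toy usage: intercept plus one covariate, 3-part compositional response.
rng = np.random.default_rng(1)
x = rng.normal(size=100)
X = np.column_stack([np.ones(100), x])
Y = fitted_compositions(X, np.array([[0.5, -0.5], [1.0, 0.3]]))
print(fit_js_regression(X, Y).round(2))    # should be close to the coefficients used above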
Regression analysis with compositional data containing zero values
Comment: The paper has been accepted for publication in the Chilean Journal of
Statistics. It consists of 12 pages with 4 figures.
Improved classification for compositional data using the α-transformation
In compositional data analysis an observation is a vector containing
non-negative values, only the relative sizes of which are considered to be of
interest. Without loss of generality, a compositional vector can be taken to be
a vector of proportions that sum to one. Data of this type arise in many areas
including geology, archaeology, biology, economics and political science. In
this paper we investigate methods for classification of compositional data. Our
approach centres on the idea of using the α-transformation to transform
the data and then to classify the transformed data via regularised discriminant
analysis and the k-nearest neighbours algorithm. Using the
α-transformation generalises two rival approaches in compositional data
analysis, one (when α = 1) that treats the data as though they were
Euclidean, ignoring the compositional constraint, and another (when α = 0)
that employs Aitchison's centred log-ratio transformation. A numerical study
with several real datasets shows that whether using α = 1 or α = 0
gives better classification performance depends on the dataset, and moreover
that using an intermediate value of α can sometimes give better
performance than using either 1 or 0.
Comment: This is a 17-page preprint and has been accepted for publication at
the Journal of Classification.
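A brief sketch of the pipeline described in this abstract: α-transform the compositions,
then hand the transformed data to an off-the-shelf classifier. The transformation below is
a simplified version (it omits the Helmert sub-matrix used for the isometric variant), and
k-nearest neighbours stands in for the paper's regularised discriminant analysis; α would
normally be chosen by cross-validation.

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def alpha_transform(X, alpha):
    # Box-Cox-type alpha-transformation of compositions (simplified):
    # (D * x_j^alpha / sum_k x_k^alpha - 1) / alpha, which tends to the
    # centred log-ratio transformation as alpha -> 0.
    X = np.asarray(X, dtype=float)
    D = X.shape[1]
    if abs(alpha) < 1e-12:
        logX = np.log(X)                       # requires strictly positive parts
        return logX - logX.mean(axis=1, keepdims=True)
    S = np.power(X, alpha)
    S = S / S.sum(axis=1, keepdims=True)
    return (D * S - 1.0) / alpha

def classify_alpha_knn(X_train, y_train, X_test, alpha=0.5, k=5):
    # k-NN on the alpha-transformed data; alpha = 1 essentially treats the
    # data as Euclidean, while alpha -> 0 recovers the log-ratio approach.
    clf = KNeighborsClassifier(n_neighbors=k)
    clf.fit(alpha_transform(X_train, alpha), y_train)
    return clf.predict(alpha_transform(X_test, alpha))

# Toy usage on simulated 4-part compositions.
rng = np.random.default_rng(2)
X = rng.dirichlet([4, 3, 2, 1], size=80)
y = (X[:, 0] > X[:, 1]).astype(int)
print(classify_alpha_knn(X[:60], y[:60], X[60:], alpha=0.5, k=5))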
The FEDHC Bayesian network learning algorithm
The paper proposes a new hybrid Bayesian network learning algorithm, termed
Forward Early Dropping Hill Climbing (FEDHC), devised to work with either
continuous or categorical variables. Specifically for the case of continuous
data, a robust-to-outliers version of FEDHC, which can be adopted by other BN
learning algorithms, is proposed. Further, the paper demonstrates that the only
implementation of MMHC in the statistical software R is prohibitively
expensive, and a new implementation is offered. FEDHC is tested via Monte
Carlo simulations that distinctly show it is computationally efficient and
produces Bayesian networks of similar or higher accuracy than MMHC and PCHC.
Finally, an application of the FEDHC, PCHC and MMHC algorithms to real data
from the field of economics is demonstrated using the statistical software
R.
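The sketch below illustrates the "forward selection with early dropping" idea that gives
the algorithm its name, for continuous data and Pearson partial-correlation tests: at each
forward step the most significant remaining candidate is added, and every candidate that
is no longer significant given the current selection is permanently discarded. It is only
an illustration of that search heuristic under those assumptions, not the authors' R
implementation of FEDHC, which also includes the subsequent hill-climbing (scoring) phase.

import numpy as np
from scipy import stats

def partial_corr_pvalue(x, y, Z):
    # P-value for the Pearson partial correlation of x and y given the
    # columns of Z, via regression residuals and Fisher's z-transform.
    n = len(x)
    if Z.shape[1] > 0:
        A = np.column_stack([np.ones(n), Z])
        x = x - A @ np.linalg.lstsq(A, x, rcond=None)[0]
        y = y - A @ np.linalg.lstsq(A, y, rcond=None)[0]
    r = np.corrcoef(x, y)[0, 1]
    z = 0.5 * np.log((1 + r) / (1 - r)) * np.sqrt(n - Z.shape[1] - 3)
    return 2 * stats.norm.sf(abs(z))

def forward_early_dropping(X, y, threshold=0.05):
    # Greedy forward selection with early dropping: candidates that fail the
    # test given the current selection are removed for good, which is what
    # keeps the procedure computationally cheap.
    selected, candidates = [], list(range(X.shape[1]))
    while candidates:
        Z = X[:, selected] if selected else np.empty((len(y), 0))
        pvals = {j: partial_corr_pvalue(X[:, j], y, Z) for j in candidates}
        best = min(pvals, key=pvals.get)
        if pvals[best] > threshold:
            break
        selected.append(best)
        candidates = [j for j in candidates if j != best and pvals[j] <= threshold]
    return selected

# Toy usage: only columns 1 and 4 actually influence the response.
rng = np.random.default_rng(3)
X = rng.normal(size=(300, 10))
y = X[:, 1] - 2 * X[:, 4] + rng.normal(size=300)
print(forward_early_dropping(X, y))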
Hypothesis testing for two population means: parametric or non-parametric test?
The parametric Welch t-test and the non-parametric Wilcoxon-Mann-Whitney
test are the most commonly used two independent sample means tests. More recent
testing approaches include the non-parametric empirical likelihood and
exponential empirical likelihood. However, the applicability of these
non-parametric likelihood testing procedures is limited partially because of
their tendency to inflate the type I error in small sized samples. In order to
circumvent the type I error problem, we propose simple calibrations using the t
distribution and bootstrapping. The two non-parametric likelihood testing
procedures, with and without those calibrations, are then compared against the
Wilcoxon-Mann-Whitney test and the Welch t-test. The comparisons are
implemented via extensive Monte Carlo simulations on the grounds of type I
error and power in small/medium sized samples generated from various non-normal
populations. The simulation studies clearly demonstrate that a) the t
calibration improves the type I error of the empirical likelihood, b) bootstrap
calibration improves the type I error of both non-parametric likelihoods, c)
the Welch t-test, with or without bootstrap calibration, attains the nominal
type I error and produces similar levels of power to the former testing procedures,
and d) the Wilcoxon-Mann-Whitney test produces inflated type I error while the
computation of an exact p-value is not feasible in the presence of ties with
discrete data. Further, an application to real gene expression data illustrates
the high computational cost and thus the impracticality of the non-parametric
likelihoods. Overall, the Welch t-test, which is highly computationally
efficient and readily interpretable, is shown to be the best method when
testing equality of two population means.
Comment: Accepted for publication in the Journal of Statistical Computation
and Simulation.
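As an illustration of one of the calibrations discussed, the sketch below bootstraps the
Welch statistic under the null hypothesis by re-centring both samples at their combined
mean before resampling; this is a standard resampling scheme and may differ in detail from
the one used in the paper.

import numpy as np
from scipy import stats

def welch_stat(x, y):
    # Welch's t statistic for two independent samples with unequal variances.
    return (x.mean() - y.mean()) / np.sqrt(x.var(ddof=1) / len(x) + y.var(ddof=1) / len(y))

def welch_bootstrap_pvalue(x, y, B=2000, seed=0):
    # Two-sided p-value for equal means: compare the observed statistic with
    # its bootstrap distribution under the null (samples re-centred at the
    # combined mean) instead of the t reference distribution.
    rng = np.random.default_rng(seed)
    t_obs = welch_stat(x, y)
    grand = np.concatenate([x, y]).mean()
    xc, yc = x - x.mean() + grand, y - y.mean() + grand
    t_boot = np.array([welch_stat(rng.choice(xc, size=len(x), replace=True),
                                  rng.choice(yc, size=len(y), replace=True))
                       for _ in range(B)])
    return np.mean(np.abs(t_boot) >= abs(t_obs))

# Toy usage on small, skewed samples, next to the plain Welch t-test from SciPy.
rng = np.random.default_rng(4)
x = rng.exponential(1.0, size=15)
y = rng.exponential(1.5, size=20)
print(welch_bootstrap_pvalue(x, y))
print(stats.ttest_ind(x, y, equal_var=False).pvalue)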
A data-based power transformation for compositional data
Compositional data analysis is carried out either by neglecting the
compositional constraint and applying standard multivariate data analysis, or
by transforming the data using the logs of the ratios of the components. In
this work we examine a more general transformation which includes both
approaches as special cases. It is a power transformation and involves a single
parameter, α. The transformation has two equivalent versions. The first
is the stay-in-the-simplex version, which is the power transformation as
defined by Aitchison in 1986. The second version, which is a linear
transformation of the power transformation, is a Box-Cox type transformation.
We discuss a parametric way of estimating the value of α, namely
maximization of its profile likelihood (assuming multivariate normality of the
transformed data), and the equivalence between the two versions is exhibited.
Other ways include maximization of the correct classification probability in
discriminant analysis and maximization of the pseudo R-squared (as defined by
Aitchison in 1986) in linear regression. We examine the relationship between
the α-transformation, the raw data approach and the isometric log-ratio
transformation. Furthermore, we also define a suitable family of metrics
corresponding to the family of α-transformations and consider the
corresponding family of Fréchet means.
Comment: Published in the proceedings of the 4th International Workshop on
Compositional Data Analysis.
http://congress.cimne.com/codawork11/frontal/default.as
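To make the two equivalent versions mentioned in the abstract concrete, the sketch below
computes both the stay-in-the-simplex power transformation and a Box-Cox-type rescaling of
it, and checks numerically that the latter approaches the centred log-ratio transformation
as α approaches 0. The exact scaling used here, the Helmert sub-matrix of the isometric
version and the Jacobian needed for the profile likelihood are simplifying assumptions, so
treat this only as an illustration of the idea.

import numpy as np

def power_version(X, alpha):
    # Stay-in-the-simplex version: Aitchison's power transformation,
    # x_j^alpha / sum_k x_k^alpha, which remains a composition.
    S = np.power(X, alpha)
    return S / S.sum(axis=1, keepdims=True)

def boxcox_version(X, alpha):
    # Box-Cox-type version: an affine rescaling of the power version that
    # leaves the simplex and behaves like a Box-Cox transformation in alpha.
    D = X.shape[1]
    return (D * power_version(X, alpha) - 1.0) / alpha

def clr(X):
    # Centred log-ratio transformation: the limit of the Box-Cox-type
    # version as alpha -> 0 (requires strictly positive components).
    logX = np.log(X)
    return logX - logX.mean(axis=1, keepdims=True)

# The two versions carry the same information (they differ by an affine map),
# and for small alpha the Box-Cox-type version is close to the clr.
X = np.array([[0.2, 0.3, 0.5],
              [0.1, 0.1, 0.8]])
print(np.allclose(boxcox_version(X, 0.001), clr(X), atol=1e-2))   # True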