
    The k-NN algorithm for compositional data: a revised approach with and without zero values present

    In compositional data, an observation is a vector with non-negative components which sum to a constant, typically 1. Data of this type arise in many areas, such as geology, archaeology, biology, economics and political science, among others. The goal of this paper is to extend the taxicab metric and a newly suggested metric for compositional data by employing a power transformation. Both metrics are to be used in the k-nearest neighbours algorithm regardless of the presence of zeros. Examples with real data are exhibited. Comment: This manuscript will appear at http://www.jds-online.com/volume-12-number-3-july-201
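
    As a rough illustration of the idea, the sketch below power-transforms the compositions and classifies with k-NN under the taxicab (L1) metric in R. The function name, arguments and majority-vote details are assumptions for illustration, not the authors' implementation.

        # Minimal sketch, assuming a power transform followed by taxicab k-NN.
        # knn_comp and its arguments are illustrative names, not the authors' API.
        knn_comp <- function(train, labels, test, k = 5, alpha = 1) {
          pow <- function(m, a) { m <- m^a; m / rowSums(m) }  # stays in the simplex; zeros stay zero
          tr <- pow(train, alpha)
          te <- pow(test, alpha)
          apply(te, 1, function(z) {
            d <- rowSums(abs(sweep(tr, 2, z)))   # taxicab distances to all training points
            nn <- order(d)[1:k]                  # indices of the k nearest neighbours
            names(which.max(table(labels[nn])))  # majority vote among the neighbours
          })
        }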

    A novel, divergence based, regression for compositional data

    In compositional data, an observation is a vector with non-negative components which sum to a constant, typically 1. Data of this type arise in many areas, such as geology, archaeology, biology, economics and political science, amongst others. The goal of this paper is to propose a new, divergence-based regression modelling technique for compositional data. To do so, a recently proved metric which is a special case of the Jensen-Shannon divergence is employed. A strong advantage of this new regression technique is that zeros are naturally handled. An example with real data and simulation studies are presented, and both are compared with the log-ratio based regression suggested by Aitchison in 1986. Comment: This is a preprint of the paper accepted for publication in the Proceedings of the 28th Panhellenic Statistics Conference, 15-18/4/2015, Athens, Greece.
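
    The general shape of such a divergence-based regression can be sketched in R as below: fitted compositions come from a multinomial-logit link and the Jensen-Shannon divergence between observed and fitted compositions is minimised numerically. The link choice, function names and optimiser are assumptions, not the paper's exact estimator.

        # Hedged sketch of divergence-based regression for compositions.
        js_div <- function(p, q) {
          m <- (p + q) / 2
          kl <- function(a, b) sum(ifelse(a == 0, 0, a * log(a / b)))  # 0 * log 0 = 0
          (kl(p, m) + kl(q, m)) / 2        # zero components are handled naturally
        }
        jsd_reg <- function(y, x) {        # y: n x D compositions, x: covariates
          X <- cbind(1, as.matrix(x)); p <- ncol(X); D <- ncol(y)
          obj <- function(b) {
            B <- cbind(0, matrix(b, nrow = p))    # first component as the baseline
            Fhat <- exp(X %*% B); Fhat <- Fhat / rowSums(Fhat)  # fitted compositions
            sum(sapply(seq_len(nrow(y)), function(i) js_div(y[i, ], Fhat[i, ])))
          }
          optim(numeric(p * (D - 1)), obj, method = "BFGS")
        }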

    Regression analysis with compositional data containing zero values

    Comment: The paper has been accepted for publication in the Chilean Journal of Statistics. It consists of 12 pages with 4 figures.

    Improved classification for compositional data using the α-transformation

    In compositional data analysis an observation is a vector containing non-negative values, only the relative sizes of which are considered to be of interest. Without loss of generality, a compositional vector can be taken to be a vector of proportions that sum to one. Data of this type arise in many areas including geology, archaeology, biology, economics and political science. In this paper we investigate methods for classification of compositional data. Our approach centres on the idea of using the α-transformation to transform the data and then to classify the transformed data via regularised discriminant analysis and the k-nearest neighbours algorithm. Using the α-transformation generalises two rival approaches in compositional data analysis, one (when α = 1) that treats the data as though they were Euclidean, ignoring the compositional constraint, and another (when α = 0) that employs Aitchison's centred log-ratio transformation. A numerical study with several real datasets shows that whether using α = 1 or α = 0 gives better classification performance depends on the dataset, and moreover that using an intermediate value of α can sometimes give better performance than using either 1 or 0. Comment: This is a 17-page preprint and has been accepted for publication at the Journal of Classification.
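
    A sketch of the tuning idea in R is given below: transform the data for each candidate α and pick the value that maximises cross-validated k-NN accuracy. It uses class::knn with Euclidean distances and omits the Helmert sub-matrix rotation of the full α-transformation (which leaves Euclidean distances unchanged); the grid, k and all names are illustrative assumptions.

        # Hedged sketch: choose alpha by cross-validated k-NN accuracy.
        library(class)                            # for knn()
        alpha_trans <- function(x, a) {           # x: matrix, rows are compositions
          if (abs(a) < 1e-12)                     # a = 0: centred log-ratio limit
            return(log(x) - rowMeans(log(x)))     # requires strictly positive parts
          u <- x^a / rowSums(x^a)
          (ncol(x) * u - 1) / a                   # Box-Cox-type version, no Helmert step
        }
        cv_alpha <- function(x, y, alphas = seq(0, 1, by = 0.1), k = 5, folds = 10) {
          id <- sample(rep(1:folds, length.out = nrow(x)))  # random fold labels
          acc <- sapply(alphas, function(a) {
            z <- alpha_trans(x, a)
            mean(sapply(1:folds, function(f) {
              pred <- knn(z[id != f, ], z[id == f, ], y[id != f], k = k)
              mean(pred == y[id == f])            # accuracy on the held-out fold
            }))
          })
          alphas[which.max(acc)]                  # best-performing alpha
        }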

    The FEDHC Bayesian network learning algorithm

    The paper proposes a new hybrid Bayesian network learning algorithm, termed Forward Early Dropping Hill Climbing (FEDHC), devised to work with either continuous or categorical variables. Specifically, for the case of continuous data, a robust-to-outliers version of FEDHC, which can be adopted by other BN learning algorithms, is proposed. Further, the paper shows that the only implementation of MMHC in the statistical software R is prohibitively expensive, and a new implementation is offered. FEDHC is tested via Monte Carlo simulations that clearly show it is computationally efficient and produces Bayesian networks of similar or higher accuracy than MMHC and PCHC. Finally, an application of the FEDHC, PCHC and MMHC algorithms to real data from the field of economics is demonstrated using the statistical software R.
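
    The core "forward early dropping" device can be illustrated generically in R, outside the Bayesian-network setting: a forward search that permanently discards candidate variables whose p-value exceeds a threshold, shrinking the candidate pool at every step. This is only a sketch of the skeleton idea, with linear regression standing in for the conditional-independence tests a real BN learner would use; it is not the FEDHC algorithm or the authors' code.

        # Generic sketch of forward selection with early dropping (illustrative names).
        fed_select <- function(y, x, threshold = 0.05) {
          selected <- integer(0)
          candidates <- seq_len(ncol(x))
          while (length(candidates) > 0) {
            pvals <- sapply(candidates, function(j) {
              fit <- summary(lm(y ~ x[, c(selected, j)]))
              fit$coefficients[length(selected) + 2, 4]   # p-value of candidate j
            })
            keep <- pvals < threshold       # early dropping: discard the rest for good
            candidates <- candidates[keep]
            pvals <- pvals[keep]
            if (length(candidates) == 0) break
            best <- which.min(pvals)        # admit the strongest remaining candidate
            selected <- c(selected, candidates[best])
            candidates <- candidates[-best]
          }
          selected
        }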

    Hypothesis testing for two population means: parametric or non-parametric test?

    The parametric Welch t-test and the non-parametric Wilcoxon-Mann-Whitney test are the most commonly used tests for comparing the means of two independent samples. More recent testing approaches include the non-parametric empirical likelihood and the exponential empirical likelihood. However, the applicability of these non-parametric likelihood testing procedures is limited, partially because of their tendency to inflate the type I error in small samples. In order to circumvent the type I error problem, we propose simple calibrations using the t distribution and bootstrapping. The two non-parametric likelihood testing procedures, with and without those calibrations, are then compared against the Wilcoxon-Mann-Whitney test and the Welch t-test. The comparisons are implemented via extensive Monte Carlo simulations in terms of type I error and power, in small and medium sized samples generated from various non-normal populations. The simulation studies clearly demonstrate that a) the t calibration improves the type I error of the empirical likelihood, b) the bootstrap calibration improves the type I error of both non-parametric likelihoods, c) the Welch t-test, with or without bootstrap calibration, attains the nominal type I error and produces similar levels of power to the former testing procedures, and d) the Wilcoxon-Mann-Whitney test produces inflated type I error, while the computation of an exact p-value is not feasible in the presence of ties with discrete data. Further, an application to real gene expression data illustrates the high computational cost, and thus the impracticality, of the non-parametric likelihoods. Overall, the Welch t-test, which is highly computationally efficient and readily interpretable, is shown to be the best method when testing equality of two population means. Comment: Accepted for publication in the Journal of Statistical Computation and Simulation.
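
    For instance, a bootstrap calibration of the Welch t-test can be sketched in R as below: each sample is centred at the pooled mean so that the null hypothesis holds, and the observed statistic is compared with the bootstrap distribution of the statistic. The centring scheme, function name and B are illustrative assumptions, not the paper's exact procedure.

        # Minimal sketch of a bootstrap-calibrated Welch t-test.
        boot_welch <- function(x, y, B = 999) {
          t_obs <- t.test(x, y)$statistic          # Welch statistic on the data
          mu <- mean(c(x, y))                      # shift both samples to a common mean
          xs <- x - mean(x) + mu                   # so the null hypothesis holds
          ys <- y - mean(y) + mu
          t_boot <- replicate(B, t.test(sample(xs, replace = TRUE),
                                        sample(ys, replace = TRUE))$statistic)
          (1 + sum(abs(t_boot) >= abs(t_obs))) / (B + 1)  # two-sided bootstrap p-value
        }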

    A data-based power transformation for compositional data

    Compositional data analysis is carried out either by neglecting the compositional constraint and applying standard multivariate data analysis, or by transforming the data using the logs of the ratios of the components. In this work we examine a more general transformation which includes both approaches as special cases. It is a power transformation and involves a single parameter, α. The transformation has two equivalent versions. The first is the stay-in-the-simplex version, which is the power transformation as defined by Aitchison in 1986. The second version, which is a linear transformation of the power transformation, is a Box-Cox type transformation. We discuss a parametric way of estimating the value of α, namely maximization of its profile likelihood (assuming multivariate normality of the transformed data), and exhibit the equivalence between the two versions. Other ways include maximization of the correct classification probability in discriminant analysis and maximization of the pseudo R-squared (as defined by Aitchison in 1986) in linear regression. We examine the relationship between the α-transformation, the raw data approach and the isometric log-ratio transformation. Furthermore, we define a suitable family of metrics corresponding to the family of α-transformations and consider the corresponding family of Fréchet means. Comment: Published in the proceedings of the 4th International Workshop on Compositional Data Analysis. http://congress.cimne.com/codawork11/frontal/default.as
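
    As a small illustration, the sketch below computes the Fréchet mean associated with the α-metric as the back-transformed average of the power-transformed compositions, with the closed geometric mean as the α = 0 limit. This reading of the definition is an assumption consistent with the abstract, not the authors' code.

        # Hedged sketch: Frechet mean under the alpha-metric family.
        frechet_mean <- function(x, a) {     # x: matrix, rows are compositions
          if (abs(a) < 1e-12) {              # a = 0: closed geometric mean
            g <- exp(colMeans(log(x)))
            return(g / sum(g))
          }
          m <- colMeans(x^a / rowSums(x^a))  # average in the power-transformed space
          m <- m^(1 / a)
          m / sum(m)                         # map back onto the simplex
        }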