
    ProbCD: enrichment analysis accounting for categorization uncertainty

    As in many other areas of science, systems biology makes extensive use of statistical association and significance estimates in contingency tables, a type of categorical data analysis known in this field as enrichment (also over-representation or enhancement) analysis. In spite of efforts to create probabilistic annotations, especially in the Gene Ontology context, or to deal with uncertainty in high throughput-based datasets, current enrichment methods largely ignore this probabilistic information since they are mainly based on variants of the Fisher Exact Test. We developed an open-source R package to deal with probabilistic categorical data analysis, ProbCD, that does not require a static contingency table. The contingency table for the enrichment problem is built using the expectation of a Bernoulli Scheme stochastic process given the categorization probabilities. An on-line interface was created to allow usage by non-programmers and is available at: http://xerad.systemsbiology.net/ProbCD/. We present an analysis framework and software tools to address the issue of uncertainty in categorical data analysis. In particular, concerning the enrichment analysis, ProbCD can accommodate: (i) the stochastic nature of the high-throughput experimental techniques and (ii) probabilistic gene annotation
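
The idea of building a contingency table from categorization probabilities, rather than from fixed counts, can be illustrated with a minimal Python sketch. This is not the ProbCD package or its API (ProbCD is an R package). The function name, the argument names, and the assumption that the two memberships are independent for each gene are all illustrative choices made here.

```python
def expected_contingency_table(p_selected, p_annotated):
    """Expected 2x2 table when both memberships are uncertain.

    p_selected[i]  : probability that gene i belongs to the study list
    p_annotated[i] : probability that gene i carries the annotation

    Assumption (not from the source): for each gene the two memberships
    are independent Bernoulli events, so each gene contributes the product
    of its probabilities to the expected count of each cell.
    """
    if len(p_selected) != len(p_annotated):
        raise ValueError("both probability lists must have one entry per gene")

    in_and_annotated = in_not_annotated = 0.0
    out_and_annotated = out_not_annotated = 0.0
    for p, q in zip(p_selected, p_annotated):
        in_and_annotated += p * q
        in_not_annotated += p * (1.0 - q)
        out_and_annotated += (1.0 - p) * q
        out_not_annotated += (1.0 - p) * (1.0 - q)

    return [[in_and_annotated, in_not_annotated],
            [out_and_annotated, out_not_annotated]]
```

When every probability is 0 or 1, the expected table reduces to the ordinary count table used by a Fisher-type test. With fractional probabilities the cells are no longer integers but still sum to the number of genes.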

    CATDAP Manual

    CATDAP (CATegorical Data Analysis Program) is an AIC-based program published by Katsura, K. and Sakamoto, Y. (1980). It is based on the contingency table analysis method proposed by Prof. Sakamoto of the Institute of Statistical Mathematics. This article is a usage manual for an enhanced version of CATDAP. This version handles not only the case of a categorical object variable but also the case of a continuous object variable. In this sense, it is no longer a CAT. We would like to call it TIGERDAP, The Integrated GEneRal Data Analysis Program. Katsura, K. and Sakamoto, Y. (1980); CATDAP, A categorical data analysis program; Computer Science Monographs, 14, The Institute of Statistical Mathematics, Toky
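
To give a feel for AIC-based contingency table analysis, the sketch below compares a dependence (saturated) model with an independence model for a two-way table of counts. It is a generic textbook calculation, not CATDAP's code, and the function name is hypothetical. It assumes multinomial sampling and that every row and column total is positive.

```python
import math


def aic_difference(table):
    """AIC(dependence model) - AIC(independence model) for a two-way table.

    Negative values favor the dependence model, that is, the column variable
    carries information about the row variable. Cells with a zero count
    contribute nothing to the log-likelihood ratio (0 * log 0 is taken as 0).
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)

    log_ratio = 0.0
    for i, row in enumerate(table):
        for j, count in enumerate(row):
            if count > 0:
                expected_ratio = n * count / (row_totals[i] * col_totals[j])
                log_ratio += count * math.log(expected_ratio)

    # The saturated model has (r - 1)(c - 1) more free parameters.
    extra_params = (len(row_totals) - 1) * (len(col_totals) - 1)
    return -2.0 * log_ratio + 2.0 * extra_params
```

For a perfectly balanced table such as `[[10, 10], [10, 10]]` the log-likelihood ratio is zero, so only the parameter penalty remains and the independence model is preferred.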

    An Overview of Methods in the Analysis of Dependent Ordered Categorical Data: Assumptions and Implications

    Subjective assessments of pain, quality of life, ability, etc., measured by rating scales and questionnaires are common in clinical research. The resulting responses are categorical with an ordered structure, and the statistical methods must take account of this type of data structure. In this paper we give an overview of methods for the analysis of dependent ordered categorical data and compare standard models and measures with the nonparametric augmented rank measures proposed by Svensson. We focus on the assumptions and issues behind model specifications and data, as well as the implications of the methods. First we summarise some fundamental models for categorical data and the two main approaches for repeated ordinal data: marginal and cluster-specific models. We then describe models and measures for application in agreement studies and finally give a summary of the approach of Svensson. The paper concludes with a summary of important aspects. Keywords: Dependent ordinal data; GEE; GLMM; Logit; modelling

    The spectral analysis of nonstationary categorical time series using local spectral envelope

    Most classical methods for spectral analysis assume that the time series is stationary. However, many time series in practical problems show nonstationary behavior. Data from some fields are large, with a variance and spectrum that change over time. Sometimes we are interested in the cyclic behavior of a categorical-valued time series, such as EEG sleep state data or a DNA sequence. The general method is to scale the data, that is, to assign numerical values to the categories and then use the periodogram to find the cyclic behavior. But there are numerous possible scalings. If we arbitrarily assign numerical values to the categories and proceed with a spectral analysis, the results will depend on the particular assignment. We would like to find all possible scalings that bring out the interesting features in the data. There have been many approaches to overcoming these problems in spectral analysis. Our goal is to develop a statistical methodology for analyzing nonstationary categorical time series in the frequency domain. In this dissertation, the spectral envelope methodology is introduced for the spectral analysis of categorical time series. It provides a general framework for the spectral analysis of categorical time series and summarizes information from the spectrum matrix. To apply this method to nonstationary processes, I used TBAS (Tree-Based Adaptive Segmentation) and the local spectral envelope, based on a piecewise stationary process. In this dissertation, TBAS with a distance function based on the Kullback-Leibler divergence was proposed to find the best segmentation
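
The dependence on the chosen scaling can be seen in a small Python sketch. It only illustrates the scaling problem described in the abstract and does not implement the spectral envelope or TBAS. The toy sequence, the two scalings, and the function names are assumptions made for this example.

```python
import cmath


def periodogram(values):
    """Raw periodogram of a mean-centered series at the Fourier frequencies.

    Returns a list of (frequency, power) pairs for frequencies k/n,
    k = 1, ..., n // 2.
    """
    n = len(values)
    mean = sum(values) / n
    centered = [v - mean for v in values]
    result = []
    for k in range(1, n // 2 + 1):
        dft = sum(x * cmath.exp(-2j * cmath.pi * k * t / n)
                  for t, x in enumerate(centered))
        result.append((k / n, abs(dft) ** 2 / n))
    return result


def peak_frequency(sequence, scaling):
    """Scale a categorical sequence and return the frequency of peak power."""
    values = [scaling[symbol] for symbol in sequence]
    return max(periodogram(values), key=lambda pair: pair[1])[0]


# Toy categorical series with an exact period of four symbols.
sequence = "ACGT" * 8

# Two arbitrary numerical assignments for the same categories.
scaling_one = {"A": 1.0, "C": 0.0, "G": 1.0, "T": 0.0}
scaling_two = {"A": 1.0, "C": 0.0, "G": -1.0, "T": 0.0}
```

Under `scaling_one` the scaled series alternates with period two, so all of its power sits at frequency 0.5. Under `scaling_two` the same sequence has its power at frequency 0.25. A method that considers all scalings at once avoids having to pick one arbitrarily.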

    Effects of censoring on parameter estimates and power in genetic modeling.

    Genetic and environmental influences on variance in phenotypic traits may be estimated with normal theory Maximum Likelihood (ML). However, when the assumption of multivariate normality is not met, this method may result in biased parameter estimates and incorrect likelihood ratio tests. We simulated multivariate normally distributed twin data under the assumption of three different genetic models. Genetic model fitting was performed in six data sets: multivariate normal data, discrete uncensored data, censored data, square root transformed censored data, normal scores of censored data, and categorical data. Estimates were obtained with normal theory ML (data sets 1-5) and with categorical data analysis (data set 6). Statistical power was examined by fitting reduced models to the data. When fitting an ACE model to censored data, an unbiased estimate of the additive genetic effect was obtained. However, the common environmental effect was underestimated and the unique environmental effect was overestimated. Transformations did not remove this bias. When fitting an ADE model, the additive genetic effect was underestimated while the dominant and unique environmental effects were overestimated. In all models, the correct parameter estimates were recovered with categorical data analysis. However, with categorical data analysis, the statistical power decreased. The analysis of L-shaped distributed data with normal theory ML results in biased parameter estimates. Unbiased parameter estimates are obtained with categorical data analysis, but the power decreases

    Session 3h: Learning Progressions of Elementary Data and Measurement

    The foundation for formal statistical and probability concepts is laid in the early grades through investigating both categorical and measurement data. Join us for a quick trip through the progression of categorical data and measurement data analysis from kindergarten through fifth grade. Example tasks will be provided

    Rotation in Multiple Correspondence Analysis: a planar rotation iterative procedure

    Multiple Correspondence Analysis (MCA) is a well-known multivariate method for the statistical description of categorical data (see for instance Greenacre and Blasius, 2006). As in Principal Component Analysis (PCA) and Factor Analysis, the MCA solution can be rotated to increase the simplicity of the components. The idea behind a rotation is to find subsets of variables which coincide more clearly with the rotated components. This implies that maximizing component simplicity can help in factor interpretation and in variable clustering. In PCA, probably the most famous rotation criterion is the varimax criterion introduced by Kaiser (1958). In addition, Kiers (1991) proposed a rotation criterion in his method named PCAMIX, developed for the analysis of both numerical and categorical data and including PCA and MCA as special cases. When the data are purely categorical, this criterion is a varimax-based one relying on the correlation ratio between the categorical variables and the MCA numerical components. The optimization of this criterion is then reached by the algorithm of De Leeuw and Pruzansky (1978). In this paper, we give the analytic expression of the optimal angle of planar rotation for this criterion. If more than two principal components are to be retained, then, as Kaiser (1958) did for PCA, this planar solution is computed in a practical algorithm that applies successive pairwise planar rotations to optimize the rotation criterion. A simulation study is used to illustrate the analytic expression of the angle for planar rotation. The proposed procedure is also applied to a real data set to show the possible benefits of using rotation in MCA. Keywords: categorical data, multiple correspondence analysis, correlation ratio, rotation, varimax criterion

    A new development cycle of the Statistical Toolkit

    The Statistical Toolkit is an open source system specialized in the statistical comparison of distributions. It addresses requirements common to different experimental domains, such as simulation validation (e.g. comparison of experimental and simulated distributions), regression testing in the course of the software development process, and detector performance monitoring. Various sets of statistical tests have been added to the existing collection to deal with the one sample problem (i.e. the comparison of a data distribution to a function, including tests for normality, categorical analysis and the estimate of randomness). Improved algorithms and software design contribute to the robustness of the results. A simple user layer dealing with primitive data types facilitates the use of the toolkit both in standalone analyses and in large scale experiments. Comment: To be published in the Proc. of CHEP (Computing in High Energy Physics) 201