808 research outputs found

    Recent advances in directional statistics

    Mainstream statistical methodology is generally applicable to data observed in Euclidean space. There are, however, numerous contexts of considerable scientific interest in which the natural supports for the data under consideration are Riemannian manifolds like the unit circle, torus, sphere and their extensions. Typically, such data can be represented using one or more directions, and directional statistics is the branch of statistics that deals with their analysis. In this paper we provide a review of the many recent developments in the field since the publication of Mardia and Jupp (1999), still the most comprehensive text on directional statistics. Many of those developments have been stimulated by interesting applications in fields as diverse as astronomy, medicine, genetics, neurology, aeronautics, acoustics, image analysis, text mining, environmetrics, and machine learning. We begin by considering developments for the exploratory analysis of directional data before progressing to distributional models, general approaches to inference, hypothesis testing, regression, nonparametric curve estimation, methods for dimension reduction, classification and clustering, and the modelling of time series, spatial and spatio-temporal data. An overview of currently available software for analysing directional data is also provided, and potential future developments are discussed. Comment: 61 pages
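
To illustrate why directional data need their own methods, the following minimal sketch (not taken from the review; the function name `circular_mean` is a hypothetical helper) compares the arithmetic mean of two angles with their mean direction on the unit circle, assuming angles are given in radians.

```python
import numpy as np

def circular_mean(angles):
    """Return the mean direction (radians) and mean resultant length.

    The angles are mapped to unit vectors, the vectors are averaged,
    and the direction and length of the average vector are returned.
    """
    angles = np.asarray(angles, dtype=float)
    c = np.cos(angles).mean()
    s = np.sin(angles).mean()
    return np.arctan2(s, c), np.hypot(c, s)

# Two directions just either side of 0 degrees (illustrative data only).
angles = np.deg2rad([350.0, 10.0])

mean_dir, resultant_length = circular_mean(angles)
arithmetic_mean = angles.mean()  # ignores the wrap-around at 360 degrees

print("mean direction (deg):", np.rad2deg(mean_dir))
print("arithmetic mean (deg):", np.rad2deg(arithmetic_mean))
```

The arithmetic mean points in the opposite direction to both observations, whereas the mean direction lies between them; the mean resultant length indicates how concentrated the sample is.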

    Shape Theoretic and Machine Learning Based Methods for Automatic Clustering and Classification of Cardiomyocytes Based on Action Potential Morphology

    Stem cells have been a hot topic in the cardiology community for the last decade and a half. Ever since we learned how to differentiate cardiomyocytes from embryonic and induced pluripotent stem cells, there has been a lot of research devoted to the potential of utilizing these cardiomyocytes for regenerative medicine, drug model studies, and arrhythmogenesis analysis. However, while cardiomyocyte purification methods have advanced significantly, methods for the identification and isolation of specific types of cardiomyocytes, such as ventricular or pacemaking cells, have not seen the same progress. Among the different avenues for accomplishing this task, the electrophysiological one is of particular interest because every cardiomyocyte type generates a distinct signature known as an action potential. The current standard for analyzing the action potential of a cardiomyocyte is an expert-level subjective thresholding of specific features, such as action potential duration. However, this approach does not transfer across datasets and does not scale with the increasing populations of cardiomyocytes. In this thesis, ideas from the machine learning and shape analysis communities are explored to develop new, automated methods for the analysis of cardiomyocytes based on their action potentials. These methods allow us to identify subpopulations of similar cardiomyocytes based on their action potential morphology, hypothesize the eventual chamber-specific fate of newly differentiated cardiomyocytes, and make effective comparisons between cardiomyocytes in drug and cell-line studies. The objective, scalable methods presented in this thesis offer a new paradigm for high-throughput analysis of cardiomyocytes via action potential morphology, and could be of great benefit to the cardiology and biology communities.

    Development of statistical methods for the analysis of single-cell RNA-seq data

    Single-cell RNA-sequencing profiles the transcriptome of cells from diverse populations. A popular intermediate data format is a large count matrix of genes x cells. This type of data brings several analytical challenges. Here, I present three projects that I worked on during my PhD that address particular aspects of working with such datasets:
    - The large number of cells in the count matrix is a challenge for fitting gamma-Poisson generalized linear models with existing tools. I developed a new R package called glmGamPoi to address this gap. I optimized the overdispersion estimation procedure to be quick and robust for datasets with many cells and small counts. I compared the performance against two popular tools (edgeR and DESeq2) and found that my inference is 6x to 13x faster and achieves a higher likelihood for a majority of the genes in four single-cell datasets.
    - The variance of single-cell RNA-seq counts depends on their mean, but many existing statistical tools perform best when the variance is uniform. Accordingly, variance-stabilizing transformations are applied to unlock the large number of methods with such a requirement. I compared four approaches to variance-stabilizing the data, based on the delta method, model residuals, inferred latent expression state, or count factor analysis. I describe their theoretical strengths and weaknesses, and compare their empirical performance in a benchmark on simulated and real single-cell data. I find that none of the mathematically more sophisticated transformations consistently outperforms the simple log(y/s+1) transformation.
    - Multi-condition single-cell data offers the opportunity to find differentially expressed genes for individual cell subpopulations. However, the prevalent approach to analyzing such data is to start by dividing the cells into discrete populations and then test for differential expression within each group. The results are interpretable but may miss interesting cases by (1) choosing the cluster size too small and lacking power to detect effects or (2) choosing the cluster size too large and obscuring interesting effects apparent on a smaller scale. I developed a new statistical framework for the analysis of multi-condition single-cell data that avoids this premature discretization. The approach performs regression on the latent subspaces occupied by the cells in each condition. The method is implemented as an R package called lemur.
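
The log(y/s+1) transformation mentioned in the second project can be sketched as follows. This is a minimal illustration, not the thesis code: the function name `shifted_log_transform` is hypothetical, and the default size factors s (each cell's total count divided by the mean total across cells) are an assumption made for this example, since the abstract does not say how s is estimated.

```python
import numpy as np

def shifted_log_transform(counts, size_factors=None):
    """Apply log(y / s + 1) to a genes x cells count matrix.

    counts       : array of shape (n_genes, n_cells)
    size_factors : optional array of shape (n_cells,); if omitted, each
                   cell's total count divided by the mean total is used
                   (an assumption for this sketch). Cells with a total
                   count of zero are not handled here.
    """
    counts = np.asarray(counts, dtype=float)
    if size_factors is None:
        totals = counts.sum(axis=0)           # one total per cell (column)
        size_factors = totals / totals.mean()
    size_factors = np.asarray(size_factors, dtype=float)
    # Broadcasting divides every column by its own size factor.
    return np.log1p(counts / size_factors)

# Toy matrix with 2 genes (rows) and 2 cells (columns); illustrative only.
counts = np.array([[0, 4],
                   [2, 4]])
transformed = shifted_log_transform(counts)
print(transformed)
```

Dividing by the size factor puts cells with different sequencing depths on a comparable scale before the logarithm is taken, and the added 1 keeps zero counts at zero.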

    Singular geodesic coordinates for representing diffeomorphic maps in computational anatomy, with application to the morphometry of early Alzheimer's disease in the medial temporal lobe

    In this work we develop novel algorithms for building one-to-one correspondences between anatomical forms by providing a sparse representation of dense registration information. These sparse parameterizations of complex high-dimensional data allow robustness in the face of noise and anomalies, and a platform for inference that is effective in the face of multiple comparisons. We review background in the theory of generating smooth, invertible transformations (the diffeomorphism group), and build our parameterization as a function supported on surfaces bounding anatomical structures of interest. We show how dimensionality can be reduced even further and still provide a rich family of mappings using principal component analysis or Laplace-Beltrami eigenfunctions supported on the surface. We develop algorithms for surface matching and image matching within this model, and demonstrate the desired robustness by working with published large neuroimaging datasets that include many low-quality examples. Finally, we turn to addressing challenges associated with some specific data types: images with multiple labels, and longitudinal data. We use the mapping tools developed to draw conclusions about the progression of early Alzheimer's disease in the medial temporal lobe.

    Geometric Data Analysis: Advancements of the Statistical Methodology and Applications

    Data analysis has become fundamental to our society and comes in multiple facets and approaches. Nevertheless, in research and applications, the focus has primarily been on data from Euclidean vector spaces. Consequently, the majority of methods that are applied today are not suited for more general data types. Driven by needs from fields like image processing, (medical) shape analysis, and network analysis, more and more attention has recently been given to data from non-Euclidean spaces, particularly (curved) manifolds. This has led to the field of geometric data analysis, whose methods explicitly take the structure (for example, the topology and geometry) of the underlying space into account. This thesis contributes to the methodology of geometric data analysis by generalizing several fundamental notions from multivariate statistics to manifolds. We thereby focus on two different viewpoints. First, we use Riemannian structures to derive a novel regression scheme for general manifolds that relies on splines of generalized Bézier curves. It can accurately model non-geodesic relationships, for example, time-dependent trends with saturation effects or cyclic trends. Since Bézier curves can be evaluated with the constructive de Casteljau algorithm, working with data from manifolds of high dimensions (for example, a hundred thousand or more) is feasible. Relying on the regression, we further develop a hierarchical statistical model for an adequate analysis of longitudinal data in manifolds, and a method to control for confounding variables. Second, we focus on data that is not only manifold- but even Lie group-valued, which is frequently the case in applications. We can only achieve this by endowing the group with an affine connection structure that is generally not Riemannian. Utilizing it, we derive generalizations of several well-known dissimilarity measures between data distributions that can be used for various tasks, including hypothesis testing. Invariance under data translations is proven, and a connection to continuous distributions is given for one measure. A further central contribution of this thesis is that it shows use cases for all notions in real-world applications, particularly in problems from shape analysis in medical imaging and archaeology. We can replicate or further quantify several known findings for shape changes of the femur and the right hippocampus under osteoarthritis and Alzheimer's, respectively. Furthermore, in an archaeological application, we obtain new insights into the construction principles of ancient sundials. Last but not least, we use the geometric structure underlying human brain connectomes to predict cognitive scores. Utilizing a sample selection procedure, we obtain state-of-the-art results.
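
The de Casteljau construction mentioned above can be illustrated on the simplest curved example, the unit sphere, by replacing the straight-line interpolation of the Euclidean algorithm with interpolation along geodesics (great-circle arcs). This is a rough sketch under stated assumptions, not the thesis implementation: the helper names `geodesic` and `de_casteljau` are hypothetical, the control points are assumed to be unit vectors, and consecutive points are assumed not to be antipodal.

```python
import numpy as np

def geodesic(p, q, t):
    """Point at parameter t on the shortest great-circle arc from p to q.

    p and q are unit vectors that are assumed not to be antipodal.
    """
    cos_theta = np.clip(np.dot(p, q), -1.0, 1.0)
    theta = np.arccos(cos_theta)
    if theta < 1e-12:              # p and q coincide numerically
        return p.copy()
    return (np.sin((1.0 - t) * theta) * p + np.sin(t * theta) * q) / np.sin(theta)

def de_casteljau(control_points, t):
    """Evaluate a generalized Bezier curve on the unit sphere at t in [0, 1].

    Consecutive control points are repeatedly joined by geodesics and
    evaluated at t until a single point remains.
    """
    pts = [np.asarray(p, dtype=float) for p in control_points]
    while len(pts) > 1:
        pts = [geodesic(a, b, t) for a, b in zip(pts[:-1], pts[1:])]
    return pts[0]

# Three control points on the sphere (illustrative data only).
control = [np.array([1.0, 0.0, 0.0]),
           np.array([0.0, 1.0, 0.0]),
           np.array([0.0, 0.0, 1.0])]

start = de_casteljau(control, 0.0)
middle = de_casteljau(control, 0.5)
end = de_casteljau(control, 1.0)
print(start, middle, end)
```

Each evaluation only needs geodesics between pairs of points, which is why the construction carries over to manifolds where such geodesics can be computed.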