Recent advances in directional statistics
Mainstream statistical methodology is generally applicable to data observed
in Euclidean space. There are, however, numerous contexts of considerable
scientific interest in which the natural supports for the data under
consideration are Riemannian manifolds like the unit circle, torus, sphere and
their extensions. Typically, such data can be represented using one or more
directions, and directional statistics is the branch of statistics that deals
with their analysis. In this paper we provide a review of the many recent
developments in the field since the publication of Mardia and Jupp (1999),
still the most comprehensive text on directional statistics. Many of those
developments have been stimulated by interesting applications in fields as
diverse as astronomy, medicine, genetics, neurology, aeronautics, acoustics,
image analysis, text mining, environmetrics, and machine learning. We begin by
considering developments for the exploratory analysis of directional data
before progressing to distributional models, general approaches to inference,
hypothesis testing, regression, nonparametric curve estimation, methods for
dimension reduction, classification and clustering, and the modelling of time
series, spatial and spatio-temporal data. An overview of currently available
software for analysing directional data is also provided, and potential future
developments discussed.
Comment: 61 pages
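To illustrate why data on the unit circle need their own methods, the following minimal sketch compares the ordinary arithmetic mean of two angles with their mean direction (the angle of the summed unit vectors). The function names and the example angles are assumptions chosen for illustration; they are not taken from the review.

```python
import math

def circular_mean(angles):
    """Mean direction of angles in radians: the angle of the resultant vector."""
    s = sum(math.sin(a) for a in angles)
    c = sum(math.cos(a) for a in angles)
    return math.atan2(s, c)

def mean_resultant_length(angles):
    """Length of the average unit vector, in [0, 1]; near 1 means concentrated data."""
    s = sum(math.sin(a) for a in angles)
    c = sum(math.cos(a) for a in angles)
    return math.hypot(s, c) / len(angles)

# Hypothetical sample: two directions just either side of 0 degrees.
angles = [math.radians(350), math.radians(10)]

# The arithmetic mean treats angles as points on a line and lands at 180 degrees,
# opposite to both observations.
arithmetic_deg = math.degrees(sum(angles) / len(angles))

# The mean direction respects the circular geometry and lands at (about) 0 degrees.
circular_deg = math.degrees(circular_mean(angles)) % 360.0

resultant = mean_resultant_length(angles)
```

The mean resultant length computed alongside is a common measure of concentration for circular data: it approaches 1 when the directions cluster tightly and 0 when they cancel out.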
Shape Theoretic and Machine Learning Based Methods for Automatic Clustering and Classification of Cardiomyocytes Based on Action Potential Morphology
Stem cells have been a hot topic in the cardiology community for the last decade and a half. Ever since we learned how to differentiate cardiomyocytes from embryonic and induced pluripotent stem cells, there has been a lot of research devoted to the potential of utilizing these cardiomyocytes for regenerative medicine, drug model studies, and arrhythmogenesis analysis. However, while cardiomyocyte purification methods have advanced significantly, methods for the identification and isolation of specific types of cardiomyocytes, such as ventricular or pacemaking cells, have not seen the same progress. Among the different avenues for accomplishing this task, the electrophysiological one is of particular interest because every cardiomyocyte type generates a distinct signature known as an action potential. The current standard for analyzing the action potential of a cardiomyocyte is an expert-level subjective thresholding of specific features, such as action potential duration. However, this approach does not transfer across datasets and does not scale with the increasing populations of cardiomyocytes.
In this thesis, ideas from the machine learning and shape analysis communities are explored to develop new, automated methods for the analysis of cardiomyocytes based on their action potentials. These methods allow us to identify subpopulations of similar cardiomyocytes based on their action potential morphology, hypothesize the eventual chamber-specific fate of newly differentiated cardiomyocytes, and make effective comparisons between cardiomyocytes in drug and cell-line studies. The objective, scalable methods presented in this thesis offer a new paradigm for analysing cardiomyocytes via action potential morphology in high-throughput applications, and could be of large benefit to the cardiology and biology communities.
Taking shape: The data science of elastic shape analysis with practical applications
This thesis was submitted for the award of Doctor of Philosophy and was awarded by Brunel University London.
A mathematical curve can represent many different objects, both physical and abstract,
from the outline curve of an artefact in an image to the weight of a growing animal to
the set of frequencies used in a sound. Regardless of these variations, the curves can
almost always vary non-linearly. One way to study shapes and their potential variations
is elastic shape analysis, a rich theory of which has developed over the past twenty years.
However, methods of elastic shape analysis are seldom utilized in practical applications
on real-world data, especially outside of the mathematical shape analysis community.
Our aim in this thesis is to explore some practical applications of elastic shape analysis.
To do this, we work with various types of shape data, the majority of which are based on
image datasets. As our focus is on two-dimensional curves, it is important to be able to
robustly extract contours from images, before we can apply elastic shape analysis tools.
In order to analyse the shapes in a dataset, we turn to methods of machine learning, to
investigate the applications of elastic shape analysis in classification.
In this thesis, we introduce an anthology of projects, in order to emphasise and
understand the potential of elastic shape analysis in practical applications. There are four main
projects in this thesis: (i) Classification of objects using outlines and the comparisons
between methods of elastic shape analysis, geometric morphometrics, and human experts,
with a focus on ancient Greek vases, (ii) Mussel species identification and a demonstration
that shape may not be enough in some applications, (iii) A novel tool to monitor
the development of kākāpō chicks, and (iv) Classifying individual kiwi based on acoustic
data from their calls.
By combining tools from computer vision and machine learning with methods of elastic
shape analysis, we introduce a practical framework for the application of elastic shape
analysis, through a data science lens.
Development of statistical methods for the analysis of single-cell RNA-seq data
Single-cell RNA-sequencing profiles the transcriptome of cells from diverse populations. A popular intermediate data format is a large count matrix of genes x cells. This type of data brings several analytical challenges. Here, I present three projects that I worked on during my PhD that address particular aspects of working with such datasets:
- The large number of cells in the count matrix is a challenge for fitting gamma-Poisson generalized linear models with existing tools. I developed a new R package called glmGamPoi to address this gap. I optimized the overdispersion estimation procedure to be quick and robust for datasets with many cells and small counts. I compared the performance against two popular tools (edgeR and DESeq2) and found that my inference is 6x to 13x faster and achieves a higher likelihood for a majority of the genes in four single-cell datasets.
- The variance of single-cell RNA-seq counts depends on their mean, but many existing statistical tools have optimal performance when the variance is uniform. Accordingly, variance-stabilizing transformations are applied to unlock the large number of methods with such a requirement. I compared four approaches to variance-stabilize the data based on the delta method, model residuals, inferred latent expression state or count factor analysis. I describe their theoretical strengths and weaknesses, and compare their empirical performance in a benchmark on simulated and real single-cell data. I find that none of the mathematically more sophisticated transformations consistently outperforms the simple log(y/s+1) transformation.
- Multi-condition single-cell data offers the opportunity to find differentially expressed genes for individual cell subpopulations. However, the prevalent approach to analyze such data is to start by dividing the cells into discrete populations and then test for differential expression within each group. The results are interpretable but may miss interesting cases by (1) choosing the cluster size too small and lacking power to detect effects or (2) choosing the cluster size too large and obscuring interesting effects apparent on a smaller scale. I developed a new statistical framework for the analysis of multi-condition single-cell data that avoids the premature discretization. The approach performs regression on the latent subspaces occupied by the cells in each condition. The method is implemented as an R package called lemur.
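The simple log(y/s+1) transformation named in the second item can be sketched as follows, reading y as a raw count and s as a per-cell size factor. The abstract does not define s; the definition used here (a cell's total count divided by the mean total across cells) is an assumption for illustration, as are the function names and the toy matrix. The thesis packages are written in R; Python is used here only to keep the sketch self-contained.

```python
import math

def size_factors(counts):
    """Per-cell size factors for a genes x cells matrix (list of gene rows).

    Assumed definition: cell total divided by the mean cell total,
    so the factors average to 1. Cells with zero total are not handled.
    """
    n_cells = len(counts[0])
    totals = [sum(gene[j] for gene in counts) for j in range(n_cells)]
    mean_total = sum(totals) / n_cells
    return [t / mean_total for t in totals]

def shifted_log(counts, s):
    """Apply log(y / s + 1) to every count y, using the size factor of its cell."""
    return [[math.log(y / s_j + 1.0) for y, s_j in zip(gene, s)] for gene in counts]

# Hypothetical toy data: 2 genes x 3 cells, with cells sequenced at different depths.
counts = [
    [0, 2, 4],
    [10, 18, 36],
]
s = size_factors(counts)
transformed = shifted_log(counts, s)
```

In this toy matrix the second and third cells have the same relative expression of the first gene (2 of 20 counts and 4 of 40), so dividing by the size factor maps them to the same transformed value, while a zero count stays at zero.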
Singular geodesic coordinates for representing diffeomorphic maps in computational anatomy, with application to the morphometry of early Alzheimer's disease in the medial temporal lobe
In this work we develop novel algorithms for building one-to-one correspondences between anatomical forms by providing a sparse representation of dense registration information.
These sparse parameterizations of complex high-dimensional data allow robustness in the face of noise and anomalies, and a platform for inference that is effective in the face of multiple comparisons.
We review background in the theory of generating smooth, invertible transformations (the diffeomorphism group), and build our parameterization as a function supported on surfaces bounding anatomical structures of interest. We show how dimensionality can be reduced even further and still provide a rich family of mappings using principal component analysis or Laplace-Beltrami eigenfunctions supported on the surface.
We develop algorithms for surface matching and image matching within this model, and demonstrate the desired robustness by working with published large neuroimaging datasets that include many low-quality examples.
Finally, we turn to addressing challenges associated with some specific data types: images with multiple labels, and longitudinal data. We use the mapping tools developed to draw conclusions about the progression of early Alzheimer's disease in the medial temporal lobe.
Geometric Data Analysis: Advancements of the Statistical Methodology and Applications
Data analysis has become fundamental to our society and comes in multiple facets and approaches. Nevertheless, in research and applications, the focus was primarily on data from Euclidean vector spaces. Consequently, the majority of methods that are applied today are not suited for more general data types. Driven by needs from fields like image processing, (medical) shape analysis, and network analysis, more and more attention has recently been given to data from non-Euclidean spaces, particularly (curved) manifolds. It has led to the field of geometric data analysis whose methods explicitly take the structure (for example, the topology and geometry) of the underlying space into account.
This thesis contributes to the methodology of geometric data analysis by generalizing several fundamental notions from multivariate statistics to manifolds. We thereby focus on two different viewpoints.
First, we use Riemannian structures to derive a novel regression scheme for general manifolds that relies on splines of generalized Bézier curves. It can accurately model non-geodesic relationships, for example, time-dependent trends with saturation effects or cyclic trends. Since Bézier curves can be evaluated with the constructive de Casteljau algorithm, working with data from manifolds of high dimensions (for example, a hundred thousand or more) is feasible. Relying on the regression, we further develop a hierarchical statistical model for an adequate analysis of longitudinal data in manifolds, and a method to control for confounding variables.
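The de Casteljau evaluation mentioned above can be sketched on one concrete manifold. The example below assumes the unit sphere, where the geodesic between two points is a great-circle arc, and replaces the straight-line interpolation of the Euclidean algorithm with interpolation along that arc. The choice of manifold, the function names, and the control points are assumptions for illustration and are not the thesis's implementation.

```python
import math

def slerp(p, q, t):
    """Point at fraction t along the shortest great-circle arc from p to q.

    p and q are unit vectors; consecutive antipodal points are assumed not to occur,
    because the shortest arc between them is not unique.
    """
    dot = max(-1.0, min(1.0, sum(a * b for a, b in zip(p, q))))
    theta = math.acos(dot)
    if theta < 1e-12:          # p and q coincide numerically
        return tuple(p)
    w_p = math.sin((1.0 - t) * theta) / math.sin(theta)
    w_q = math.sin(t * theta) / math.sin(theta)
    return tuple(w_p * a + w_q * b for a, b in zip(p, q))

def de_casteljau(control_points, t):
    """Evaluate a generalized Bezier curve at t by repeated geodesic interpolation."""
    pts = [tuple(p) for p in control_points]
    while len(pts) > 1:
        # Each pass interpolates neighbouring points and shortens the list by one.
        pts = [slerp(pts[i], pts[i + 1], t) for i in range(len(pts) - 1)]
    return pts[0]

# Hypothetical control points on the unit sphere.
control = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]

start = de_casteljau(control, 0.0)
middle = de_casteljau(control, 0.5)
end = de_casteljau(control, 1.0)
middle_norm = math.sqrt(sum(x * x for x in middle))
```

The recursion uses nothing but interpolation along geodesics between pairs of points, so the same loop carries over to any manifold for which that one operation is available. As in the Euclidean case, the curve starts at the first control point and ends at the last, and every evaluated point stays on the sphere.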
Second, we focus on data that is not only manifold-valued but also Lie group-valued, which is frequently the case in applications. We can only achieve this by endowing the group with an affine connection structure that is generally not Riemannian. Utilizing it, we derive generalizations of several well-known dissimilarity measures between data distributions that can be used for various tasks, including hypothesis testing. Invariance under data translations is proven, and a connection to continuous distributions is given for one measure.
A further central contribution of this thesis is that it shows use cases for all notions in real-world applications, particularly in problems from shape analysis in medical imaging and archaeology. We can replicate or further quantify several known findings for shape changes of the femur and the right hippocampus under osteoarthritis and Alzheimer's, respectively. Furthermore, in an archaeological application, we obtain new insights into the construction principles of ancient sundials. Last but not least, we use the geometric structure underlying human brain connectomes to predict cognitive scores. Utilizing a sample selection procedure, we obtain state-of-the-art results.
- …