2,218 research outputs found

    Model-Based Clustering and Classification of Functional Data

    The problem of complex data analysis is a central topic of modern statistical science and learning systems and is becoming of broader interest with the increasing prevalence of high-dimensional data. The challenge is to develop statistical models and autonomous algorithms that are able to acquire knowledge from raw data for exploratory analysis, which can be achieved through clustering techniques, or to make predictions of future data via classification (i.e., discriminant analysis) techniques. Latent data models, including mixture model-based approaches, are among the most popular and successful approaches in both the unsupervised context (i.e., clustering) and the supervised one (i.e., classification or discrimination). Although traditionally tools of multivariate analysis, they are growing in popularity when considered in the framework of functional data analysis (FDA). FDA is the data analysis paradigm in which the individual data units are functions (e.g., curves, surfaces), rather than simple vectors. In many areas of application, the analyzed data are indeed often available in the form of discretized values of functions or curves (e.g., time series, waveforms) and surfaces (e.g., 2d-images, spatio-temporal data). This functional aspect of the data adds difficulties compared to the case of a classical multivariate (non-functional) data analysis. We review and present approaches for model-based clustering and classification of functional data. We derive well-established statistical models along with efficient algorithmic tools to address problems regarding the clustering and the classification of these high-dimensional data, including their heterogeneity, missing information, and dynamical hidden structure. The presented models and algorithms are illustrated on real-world functional data analysis problems from several application areas.

    Slingshot: cell lineage and pseudotime inference for single-cell transcriptomics.

    Background: Single-cell transcriptomics allows researchers to investigate complex communities of heterogeneous cells. It can be applied to stem cells and their descendants in order to chart the progression from multipotent progenitors to fully differentiated cells. While a variety of statistical and computational methods have been proposed for inferring cell lineages, the problem of accurately characterizing multiple branching lineages remains difficult to solve. Results: We introduce Slingshot, a novel method for inferring cell lineages and pseudotimes from single-cell gene expression data. In previously published datasets, Slingshot correctly identifies the biological signal for one to three branching trajectories. Additionally, our simulation study shows that Slingshot infers more accurate pseudotimes than other leading methods. Conclusions: Slingshot is a uniquely robust and flexible tool which combines the highly stable techniques necessary for noisy single-cell data with the ability to identify multiple trajectories. Accurate lineage inference is a critical step in the identification of dynamic temporal gene expression

    Automatic pharynx and larynx cancer segmentation framework (PLCSF) on contrast enhanced MR images

    A novel and effective pharynx and larynx cancer segmentation framework (PLCSF) is presented for automatic base of tongue and larynx cancer segmentation from gadolinium-enhanced T1-weighted magnetic resonance images (MRI). The aim of the proposed PLCSF is to assist clinicians in radiotherapy treatment planning. The initial processing of MRI data in PLCSF includes cropping of the region of interest, reduction of artefacts, and detection of the throat region for the location prior. Further, a modified fuzzy c-means clustering is developed to robustly separate candidate cancer pixels from other tissue types. In addition, a region-based level set method is evolved to ensure spatial smoothness of the final segmentation boundary after noise removal using non-linear and morphological filtering. A validation study of PLCSF on 102 axial MRI slices demonstrates a mean Dice similarity coefficient of 0.79 and a mean modified Hausdorff distance of 2.2 mm when compared with manual segmentations. Comparison of PLCSF with other algorithms validates the robustness of the PLCSF. Inter- and intra-variability calculations from manual segmentations suggest that PLCSF can help to reduce the human subjectivity
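    The Dice similarity coefficient reported above measures the overlap between an automatic and a manual segmentation as 2|A ∩ B| / (|A| + |B|). The following is a minimal sketch of that formula on toy binary masks; the function name and the example masks are illustrative assumptions and are not taken from the PLCSF implementation.

```python
def dice_coefficient(mask_a, mask_b):
    """Dice similarity coefficient between two binary masks.

    Masks are nested lists (rows of 0/1 values) with the same number of
    pixels. Returns 2|A ∩ B| / (|A| + |B|).
    """
    flat_a = [bool(v) for row in mask_a for v in row]
    flat_b = [bool(v) for row in mask_b for v in row]
    if len(flat_a) != len(flat_b):
        raise ValueError("masks must have the same number of pixels")
    intersection = sum(a and b for a, b in zip(flat_a, flat_b))
    total = sum(flat_a) + sum(flat_b)
    # Two empty masks are treated as agreeing perfectly (a convention).
    return 1.0 if total == 0 else 2.0 * intersection / total


# Toy masks (hypothetical data): 4 foreground pixels each, 3 shared.
manual = [[0, 1, 1, 0],
          [0, 1, 1, 0],
          [0, 0, 0, 0]]
automatic = [[0, 1, 1, 0],
             [0, 0, 1, 1],
             [0, 0, 0, 0]]

score = dice_coefficient(manual, automatic)
print(f"Dice = {score:.2f}")  # 2 * 3 / (4 + 4) = 0.75
```

    A value of 1 means the two masks coincide exactly and 0 means they share no foreground pixels.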

    Quantitative Classification of Somatostatin-Positive Neocortical Interneurons Identifies Three Interneuron Subtypes

    Deciphering the circuitry of the neocortex requires knowledge of its components, making a systematic classification of neocortical neurons necessary. GABAergic interneurons contribute most of the morphological, electrophysiological and molecular diversity of the cortex, yet interneuron subtypes are still not well defined. To quantitatively identify classes of interneurons, 59 GFP-positive interneurons from a somatostatin-positive mouse line were characterized by whole-cell recordings and anatomical reconstructions. For each neuron, we measured a series of physiological and morphological variables and analyzed these data using unsupervised classification methods. PCA and cluster analysis of morphological variables revealed three groups of cells: one comprised of Martinotti cells, and two other groups of interneurons with short asymmetric axons targeting layers 2/3 and bending medially. PCA and cluster analysis of electrophysiological variables also revealed the existence of these three groups of neurons, particularly with respect to action potential time course. These different morphological and electrophysiological characteristics could make each of these three interneuron subtypes particularly suited for a different function within the cortical circuit

    Spike sorting for large, dense electrode arrays

    Developments in microfabrication technology have enabled the production of neural electrode arrays with hundreds of closely spaced recording sites, and electrodes with thousands of sites are under development. These probes in principle allow the simultaneous recording of very large numbers of neurons. However, use of this technology requires the development of techniques for decoding the spike times of the recorded neurons from the raw data captured from the probes. Here we present a set of tools to solve this problem, implemented in a suite of practical, user-friendly, open-source software. We validate these methods on data from the cortex, hippocampus and thalamus of rat, mouse, macaque and marmoset, demonstrating error rates as low as 5%

    Benchmark data and model independent event classification for the large hadron collider

    We describe the outcome of a data challenge conducted as part of the Dark Machines (https://www.darkmachines.org) initiative and the Les Houches 2019 workshop on Physics at TeV colliders. The challenge aims to detect signals of new physics at the Large Hadron Collider (LHC) using unsupervised machine learning algorithms. First, we propose how an anomaly score could be implemented to define model-independent signal regions in LHC searches. We define and describe a large benchmark dataset, consisting of > 1 billion simulated LHC events corresponding to 10 fb⁻¹ of proton-proton collisions at a center-of-mass energy of 13 TeV. We then review a wide range of anomaly detection and density estimation algorithms, developed in the context of the data challenge, and we measure their performance in a set of realistic analysis environments. We draw a number of useful conclusions that will aid the development of unsupervised new physics searches during the third run of the LHC, and provide our benchmark dataset for future studies at https://www.phenoMLdata.org. Code to reproduce the analysis is provided at https://github.com/bostdiek/DarkMachines-UnsupervisedChallenge
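    One plausible way to turn an anomaly score into a signal region is to cut on the score at a value chosen from simulated background alone, so that a fixed fraction of background events passes. The sketch below illustrates that idea only; the function names, the background-efficiency parameter, and the toy scores are assumptions for illustration and may differ from the definition used in the challenge.

```python
import math


def score_threshold(background_scores, background_efficiency):
    """Score cut that keeps roughly `background_efficiency` of background events.

    Higher scores are assumed to mean "more anomalous".
    """
    if not 0.0 < background_efficiency <= 1.0:
        raise ValueError("background_efficiency must be in (0, 1]")
    ranked = sorted(background_scores, reverse=True)
    # Keep at least one event so the threshold is always defined.
    n_keep = max(1, math.floor(background_efficiency * len(ranked)))
    return ranked[n_keep - 1]


def signal_region(event_scores, threshold):
    """Indices of events whose anomaly score reaches the threshold."""
    return [i for i, s in enumerate(event_scores) if s >= threshold]


# Toy scores (hypothetical): simulated background and observed events.
background = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
observed = [0.05, 0.95, 0.5, 0.99]

cut = score_threshold(background, background_efficiency=0.2)
selected = signal_region(observed, cut)
print(cut, selected)
```

    Because the cut depends only on the background scores, the resulting region makes no assumption about any particular signal model.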

    Landscape mapping at sub-Antarctic South Georgia provides a protocol for underpinning large-scale marine protected areas

    Global biodiversity is in decline, with the marine environment experiencing significant and increasing anthropogenic pressures. In response, marine protected areas (MPAs) have increasingly been adopted as the flagship approach to marine conservation, many covering enormous areas. At present, however, the lack of biological sampling makes prioritising which regions of the ocean to protect, especially over large spatial scales, particularly problematic. Here we present an interdisciplinary approach to marine landscape mapping at the sub-Antarctic island of South Georgia as an effective protocol for underpinning large-scale (10⁵–10⁶ km²) MPA designations. We have developed a new high-resolution (100 m) digital elevation model (DEM) of the region and integrated this DEM with bathymetry-derived parameters, modelled oceanographic data, and satellite primary productivity data. These interdisciplinary datasets were used to apply an objective statistical approach to hierarchically partition and map the benthic environment into physical habitat types. We assess the potential application of physical habitat classifications as proxies for biological structuring and the application of the landscape mapping for informing on marine spatial planning.

    Robust Detection of Hierarchical Communities from Escherichia coli Gene Expression Data

    Determining the functional structure of biological networks is a central goal of systems biology. One approach is to analyze gene expression data to infer a network of gene interactions on the basis of their correlated responses to environmental and genetic perturbations. The inferred network can then be analyzed to identify functional communities. However, commonly used algorithms can yield unreliable results due to experimental noise, algorithmic stochasticity, and the influence of arbitrarily chosen parameter values. Furthermore, the results obtained typically provide only a simplistic view of the network partitioned into disjoint communities and provide no information on the relationship between communities. Here, we present methods to robustly detect coregulated and functionally enriched gene communities and demonstrate their application and validity for Escherichia coli gene expression data. Applying a recently developed community detection algorithm to the network of interactions identified with the context likelihood of relatedness (CLR) method, we show that a hierarchy of network communities can be identified. These communities significantly enrich for gene ontology (GO) terms, consistent with them representing biologically meaningful groups. Further, analysis of the most significantly enriched communities identified several candidate new regulatory interactions. The robustness of our methods is demonstrated by showing that a core set of functional communities is reliably found when artificial noise, modeling experimental noise, is added to the data. We find that noise mainly acts conservatively, increasing the relatedness required for a network link to be reliably assigned and decreasing the size of the core communities, rather than causing association of genes into new communities.
    Comment: Due to appear in PLoS Computational Biology. Supplementary Figure S1 was not uploaded but is available by contacting the author. 27 pages, 5 figures, 15 supplementary files
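    As commonly described, the context likelihood of relatedness (CLR) method scores a gene pair by comparing its similarity (usually mutual information) against the background distribution of each gene's similarities to all other genes. The sketch below follows that general description under stated assumptions: it takes a precomputed symmetric similarity matrix, excludes the diagonal from each background, and combines the two clipped z-scores as the square root of the sum of their squares. The function name and the toy matrix are hypothetical, and the details may differ from the implementation used in the paper.

```python
import math
import statistics


def clr_scores(similarity):
    """CLR-style relatedness scores from a symmetric similarity matrix.

    For each gene i, the similarities to all other genes form a background
    with mean mu_i and standard deviation sigma_i. Each pair (i, j) gets
    z_i = max(0, (s_ij - mu_i) / sigma_i), likewise z_j, and the score
    sqrt(z_i**2 + z_j**2).
    """
    n = len(similarity)
    means, stds = [], []
    for i in range(n):
        background = [similarity[i][j] for j in range(n) if j != i]
        means.append(statistics.fmean(background))
        stds.append(statistics.pstdev(background))

    def z(i, j):
        # A constant background carries no contrast, so score it as 0.
        if stds[i] == 0:
            return 0.0
        return max(0.0, (similarity[i][j] - means[i]) / stds[i])

    scores = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            scores[i][j] = scores[j][i] = math.hypot(z(i, j), z(j, i))
    return scores


# Toy similarity matrix (hypothetical values) for four genes.
mi = [[0.0, 0.9, 0.1, 0.1],
      [0.9, 0.0, 0.1, 0.2],
      [0.1, 0.1, 0.0, 0.1],
      [0.1, 0.2, 0.1, 0.0]]

scores = clr_scores(mi)
for row in scores:
    print([round(v, 2) for v in row])
```

    Thresholding the resulting score matrix gives the interaction network on which community detection can then be run.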