79 research outputs found

    Diffusion Maps, Spectral Clustering and Eigenfunctions of Fokker-Planck operators

    This paper presents a diffusion based probabilistic interpretation of spectral clustering and dimensionality reduction algorithms that use the eigenvectors of the normalized graph Laplacian. Given the pairwise adjacency matrix of all points, we define a diffusion distance between any two data points and show that the low dimensional representation of the data by the first few eigenvectors of the corresponding Markov matrix is optimal under a certain mean squared error criterion. Furthermore, assuming that data points are random samples from a density p(x) = e^{-U(x)} we identify these eigenvectors as discrete approximations of eigenfunctions of a Fokker-Planck operator in a potential 2U(x) with reflecting boundary conditions. Finally, applying known results regarding the eigenvalues and eigenfunctions of the continuous Fokker-Planck operator, we provide a mathematical justification for the success of spectral clustering and dimensional reduction algorithms based on these first few eigenvectors. This analysis elucidates, in terms of the characteristics of diffusion processes, many empirical findings regarding spectral clustering algorithms. Comment: submitted to NIPS 200
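The embedding described in this abstract can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' code: the function names, the Gaussian kernel used to build the adjacency matrix, the bandwidth `eps`, and the diffusion time `t` are all assumptions introduced here. The sketch row-normalizes the adjacency matrix into a Markov matrix and uses its first few non-trivial eigenvectors, scaled by their eigenvalues, as low-dimensional coordinates.

```python
import numpy as np


def gaussian_adjacency(X, eps):
    """Pairwise adjacency W_ij = exp(-|x_i - x_j|^2 / eps).

    The Gaussian kernel and the bandwidth eps are illustrative assumptions.
    """
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq_dists / eps)


def diffusion_map(W, n_coords=2, t=1):
    """Embed points using eigenvectors of the Markov matrix M = D^{-1} W.

    Returns the leading non-trivial eigenvalues and the coordinates
    lambda_j**t * psi_j(i), where psi_j are right eigenvectors of M.
    """
    d = W.sum(axis=1)
    # S = D^{-1/2} W D^{-1/2} is symmetric and shares its eigenvalues with M.
    S = W / np.sqrt(np.outer(d, d))
    vals, vecs = np.linalg.eigh(S)
    order = np.argsort(vals)[::-1]          # sort eigenvalues descending
    vals, vecs = vals[order], vecs[:, order]
    # If S v = lambda v, then psi = D^{-1/2} v satisfies M psi = lambda psi.
    psi = vecs / np.sqrt(d)[:, None]
    # Drop the trivial constant eigenvector (eigenvalue 1).
    lead_vals = vals[1:n_coords + 1]
    return lead_vals, psi[:, 1:n_coords + 1] * lead_vals ** t
```

For two well-separated groups of points, the sign of the first returned coordinate typically splits the data into the two groups, which is the spectral clustering behaviour the abstract interprets. Euclidean distances between rows of the returned array approximate the diffusion distance only up to the normalization convention and the truncation to `n_coords` eigenvectors.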

    Variable-free exploration of stochastic models: a gene regulatory network example

    Finding coarse-grained, low-dimensional descriptions is an important task in the analysis of complex, stochastic models of gene regulatory networks. This task involves (a) identifying observables that best describe the state of these complex systems and (b) characterizing the dynamics of the observables. In a previous paper [13], we assumed that good observables were known a priori, and presented an equation-free approach to approximate coarse-grained quantities (i.e., effective drift and diffusion coefficients) that characterize the long-time behavior of the observables. Here we use diffusion maps [9] to extract appropriate observables ("reduction coordinates") in an automated fashion; these involve the leading eigenvectors of a weighted Laplacian on a graph constructed from network simulation data. We present lifting and restriction procedures for translating between physical variables and these data-based observables. These procedures allow us to perform equation-free, coarse-grained computations characterizing the long-term dynamics through the design and processing of short bursts of stochastic simulation initialized at appropriate values of the data-based observables. Comment: 26 pages, 9 figures

    Genetic Drivers of Heterogeneity in Type 2 Diabetes Pathophysiology

    Type 2 diabetes (T2D) is a heterogeneous disease that develops through diverse pathophysiological processes [1,2] and molecular mechanisms that are often specific to cell type [3,4]. Here, to characterize the genetic contribution to these processes across ancestry groups, we aggregate genome-wide association study data from 2,535,601 individuals (39.7% not of European ancestry), including 428,452 cases of T2D. We identify 1,289 independent association signals at genome-wide significance (P < 5 × 10^-8) that map to 611 loci, of which 145 loci are, to our knowledge, previously unreported. We define eight non-overlapping clusters of T2D signals that are characterized by distinct profiles of cardiometabolic trait associations. These clusters are differentially enriched for cell-type-specific regions of open chromatin, including pancreatic islets, adipocytes, endothelial cells and enteroendocrine cells. We build cluster-specific partitioned polygenic scores [5] in a further 279,552 individuals of diverse ancestry, including 30,288 cases of T2D, and test their association with T2D-related vascular outcomes. Cluster-specific partitioned polygenic scores are associated with coronary artery disease, peripheral artery disease and end-stage diabetic nephropathy across ancestry groups, highlighting the importance of obesity-related processes in the development of vascular outcomes. Our findings show the value of integrating multi-ancestry genome-wide association study data with single-cell epigenomics to disentangle the aetiological heterogeneity that drives the development and progression of T2D. This might offer a route to optimize global access to genetically informed diabetes care.

    Preparedness of the CTSA's Structural and Scientific Assets to Support the Mission of the National Center for Advancing Translational Sciences (NCATS)

    The formation of the National Center for Advancing Translational Sciences (NCATS) brings new promise for moving basic discoveries to clinical practice, ultimately improving the health of the nation. The CTSA sites, now housed within NCATS, are organized and prepared to support this endeavor. The CTSAs provide a foundation for capitalizing on such promise by providing a disease-agnostic infrastructure devoted to clinical and translational (C&T) science, maintaining training programs designed for the C&T investigators of the future, incentivizing institutional reorganization, and cultivating institutional support.

    Mudança organizacional: uma abordagem preliminar [Organizational change: a preliminary approach]


    The prediction error in CLS and PLS: the importance of feature selection prior to multivariate calibration

    This paper presents an asymptotically exact mathematical analysis of the mean squared error of prediction of CLS and PLS under the linear mixture model commonly assumed in spectroscopy. For CLS regression with a very large calibration set, the root mean squared error is approximately equal to the noise per wavelength divided by the length of the net analyte signal vector. It is shown, however, that for a finite training set with n samples in p dimensions there are additional error terms that depend on r, where r is the noise level per co-ordinate.

    Partial Least Squares

    In this paper we analyze the PLS algorithm under a specific probabilistic model for the relation between x and y. Following Beer's law, we assume a linear mixture model in which each data sample (x, y) is a random realization from a joint probability distribution where x is the sum of k components multiplied by their respective characteristic responses, and each of these components is a random variable. We analyze PLS on this model under two idealized settings: one is the ideal case of noise-free samples and the other is the case of an infinite number of noisy training samples. In the noise-free case we prove that, as expected, the regression vector computed by PLS is, up to normalization, the net analyte signal. We prove that PLS computes this vector after at most k iterations, where k is the total number of components. In the case of an infinite training set corrupted by unstructured noise, we show that PLS computes a final regression vector which is not in general purely proportional to the net analyte signal vector, but has the important property of being optimal under a mean squared error of prediction criterion. This result can be viewed as an asymptotic optimality of PLS in the limit of a very large but finite training set. Copyright © 2005 John Wiley & Sons, Ltd.
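The noise-free claim in this abstract can be checked numerically with a small sketch. The code below is an illustration under stated assumptions, not the paper's implementation: it uses the standard NIPALS form of PLS1 without mean-centering, and the names `pls1`, `net_analyte_signal`, and the matrix `V` of characteristic responses are hypothetical helpers introduced here. The net analyte signal of a component is taken to be the part of its response vector orthogonal to the responses of all other components.

```python
import numpy as np


def pls1(X, y, n_components):
    """NIPALS PLS1 regression vector (no mean-centering, an assumption here)."""
    X = np.array(X, dtype=float)
    y = np.array(y, dtype=float)
    p = X.shape[1]
    W = np.zeros((p, n_components))   # weight vectors
    P = np.zeros((p, n_components))   # X loadings
    q = np.zeros(n_components)        # y loadings
    for a in range(n_components):
        w = X.T @ y
        w /= np.linalg.norm(w)
        t = X @ w                     # score vector
        tt = t @ t
        p_a = X.T @ t / tt
        q[a] = y @ t / tt
        X -= np.outer(t, p_a)         # deflate X
        y -= q[a] * t                 # deflate y
        W[:, a], P[:, a] = w, p_a
    return W @ np.linalg.solve(P.T @ W, q)


def net_analyte_signal(V, j):
    """Part of response V[:, j] orthogonal to the other columns of V."""
    others = np.delete(V, j, axis=1)
    projector = others @ np.linalg.pinv(others)
    return V[:, j] - projector @ V[:, j]
```

With noise-free data `X = C @ V.T` built from `k` components with random concentrations `C`, and `y = C[:, 0]`, running `pls1(X, y, n_components=k)` should give a vector whose cosine with `net_analyte_signal(V, 0)` is numerically 1, matching the "up to normalization" statement above.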