
    Medoidshift clustering applied to genomic bulk tumor data.

    Despite the enormous medical impact of cancers and intensive study of their biology, detailed characterization of tumor growth and development remains elusive. This difficulty occurs in large part because of enormous heterogeneity in the molecular mechanisms of cancer progression, both tumor-to-tumor and cell-to-cell in single tumors. Advances in genomic technologies, especially at the single-cell level, are improving the situation, but these approaches are held back by limitations of the biotechnologies for gathering genomic data from heterogeneous cell populations and the computational methods for making sense of those data. One popular way to gain the advantages of whole-genome methods without the cost of single-cell genomics has been the use of computational deconvolution (unmixing) methods to reconstruct clonal heterogeneity from bulk genomic data. These methods, too, are limited by the difficulty of inferring genomic profiles of rare or subtly varying clonal subpopulations from bulk data, a problem that can be computationally reduced to that of reconstructing the geometry of point clouds of tumor samples in a genome space. Here, we present a new method to improve that reconstruction by better identifying subspaces corresponding to tumors produced from mixtures of distinct combinations of clonal subpopulations. We develop a nonparametric clustering method based on medoidshift clustering for identifying subgroups of tumors expected to correspond to distinct trajectories of evolutionary progression. We show on synthetic and real tumor copy-number data that this new method substantially improves our ability to resolve discrete tumor subgroups, a key step in the process of accurately deconvolving tumor genomic data and inferring clonal heterogeneity from bulk data.
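The key algorithmic ingredient named above is medoidshift clustering, a mode-seeking method in which every sample repeatedly jumps to the data point (medoid) that minimizes a kernel-weighted sum of distances in its neighbourhood, so no cluster count or parametric cluster shape is assumed. The sketch below is a generic Gaussian-kernel medoidshift on a toy two-blob "genome space", not the paper's exact formulation; the bandwidth `h`, the Euclidean distance, and the helper name `medoidshift_labels` are illustrative assumptions.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def medoidshift_labels(X, h=1.0):
    """Generic medoidshift: every point repeatedly jumps to the data point
    (medoid) minimising the kernel-weighted sum of distances in its
    neighbourhood; points that end on the same medoid share a cluster."""
    D = squareform(pdist(X))                 # pairwise Euclidean distances
    W = np.exp(-(D ** 2) / (2 * h ** 2))     # Gaussian kernel weights
    # For point j, the shift target is argmin_i sum_k W[j, k] * D[i, k];
    # because W is symmetric this is the column-wise argmin of D @ W.
    parent = (D @ W).argmin(axis=0)
    labels = np.arange(len(X))
    for _ in range(len(X)):                  # pointer chains are at most n long
        nxt = parent[labels]
        if np.array_equal(nxt, labels):
            break
        labels = nxt
    _, labels = np.unique(labels, return_inverse=True)  # relabel modes 0..k-1
    return labels

# Toy usage: two well-separated blobs standing in for a 2-D "genome space".
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.3, (30, 2)), rng.normal(3.0, 0.3, (30, 2))])
print(medoidshift_labels(X, h=1.0))
```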

    Applying unmixing to gene expression data for tumor phylogeny inference

    Background: While in principle a seemingly infinite variety of combinations of mutations could result in tumor development, in practice it appears that most human cancers fall into a relatively small number of "sub-types," each characterized by a roughly equivalent sequence of mutations by which it progresses in different patients. There is currently great interest in identifying the common sub-types and applying them to the development of diagnostics or therapeutics. Phylogenetic methods have shown great promise for inferring common patterns of tumor progression, but suffer from the limits of the technologies available for assaying differences between and within tumors. One approach to tumor phylogenetics uses differences between single cells within tumors, gaining valuable information about intra-tumor heterogeneity but allowing only a few markers per cell. An alternative approach uses tissue-wide measures of whole tumors to provide a detailed picture of averaged tumor state, but at the cost of losing information about intra-tumor heterogeneity.
    Results: The present work applies "unmixing" methods, which separate complex data sets into combinations of simpler components, to attempt to gain the advantages of both tissue-wide and single-cell approaches to cancer phylogenetics. We develop an unmixing method to infer recurring cell states from microarray measurements of tumor populations and use the inferred mixtures of states in individual tumors to identify possible evolutionary relationships among tumor cells. Validation on simulated data shows that the method can accurately separate small numbers of cell states and infer phylogenetic relationships among them. Application to a lung cancer dataset shows that the method can identify cell states corresponding to common lung tumor types and suggest possible evolutionary relationships among them that show good correspondence with our current understanding of lung tumor development.
    Conclusions: Unmixing methods provide a way to make use of both intra-tumor heterogeneity and large probe sets for tumor phylogeny inference, establishing a new avenue towards the construction of detailed, accurate portraits of common tumor sub-types and the mechanisms by which they develop. These reconstructions are likely to have future value in discovering and diagnosing novel cancer sub-types and in identifying targets for therapeutic development.
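The unmixing idea described here, decomposing bulk expression profiles into a small set of recurring cell states plus per-tumor mixture fractions, can be illustrated with a plain non-negative matrix factorization. This is only a rough, hedged stand-in for the paper's own unmixing method; the simulated matrix sizes and NMF settings below are arbitrary choices for demonstration.

```python
import numpy as np
from sklearn.decomposition import NMF

# Simulated bulk expression: each tumor is a mixture of k hidden cell states.
rng = np.random.default_rng(1)
n_genes, n_tumors, k = 200, 40, 3
states = rng.gamma(shape=2.0, scale=1.0, size=(n_genes, k))      # cell-state profiles
fractions = rng.dirichlet(alpha=np.ones(k), size=n_tumors).T     # k x tumors mixture fractions
X = states @ fractions + rng.normal(0, 0.01, (n_genes, n_tumors)).clip(min=0)

# Unmix: X ≈ (inferred cell-state profiles) @ (inferred mixture fractions)
model = NMF(n_components=k, init="nndsvda", max_iter=500, random_state=0)
inferred_states = model.fit_transform(X)        # n_genes x k
inferred_fractions = model.components_          # k x n_tumors

# Normalise each tumor's fractions to sum to 1, mirroring the interpretation
# of mixture proportions over cell states.
inferred_fractions /= inferred_fractions.sum(axis=0, keepdims=True)
print(inferred_fractions[:, :5].round(2))
```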

    Topological signal processing over simplicial complexes

    The goal of this paper is to establish the fundamental tools to analyze signals defined over a topological space, i.e. a set of points along with a set of neighborhood relations. This setup does not require the definition of a metric and is therefore especially useful for dealing with signals defined over non-metric spaces. We focus on signals defined over simplicial complexes. Graph Signal Processing (GSP) represents a special case of Topological Signal Processing (TSP), referring to the situation where the signals are associated only with the vertices of a graph. Even though the theory can be applied to signals of any order, we focus on signals defined over the edges of a graph and show how building a simplicial complex of order two, i.e. including triangles, yields benefits in the analysis of edge signals. After reviewing the basic principles of algebraic topology, we derive a sampling theory for signals of any order and emphasize the interplay between signals of different orders. Then we propose a method to infer the topology of a simplicial complex from data. We conclude with applications to real edge signals and to the analysis of discrete vector fields to illustrate the benefits of the proposed methodologies.
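The workhorse for edge signals on a simplicial complex of order two is the first-order Hodge Laplacian built from the node-edge incidence matrix B1 and the edge-triangle incidence matrix B2; since B1·B2 = 0, the images of B1ᵀ and B2 are orthogonal and any edge signal splits into gradient, curl, and harmonic parts. The sketch below is a minimal illustration on a hand-built complex (a filled triangle plus one dangling edge, chosen purely for demonstration), not code from the paper.

```python
import numpy as np

# Toy simplicial complex: nodes {0,1,2,3}, edges (0,1),(0,2),(1,2),(1,3),
# and one filled triangle (0,1,2). Orientation: edges point low -> high node.
edges = [(0, 1), (0, 2), (1, 2), (1, 3)]
triangles = [(0, 1, 2)]
n_nodes = 4

# Node-to-edge incidence matrix B1 (n_nodes x n_edges)
B1 = np.zeros((n_nodes, len(edges)))
for j, (u, v) in enumerate(edges):
    B1[u, j], B1[v, j] = -1.0, 1.0

# Edge-to-triangle incidence matrix B2 (n_edges x n_triangles)
B2 = np.zeros((len(edges), len(triangles)))
for t, (a, b, c) in enumerate(triangles):
    for sign, e in ((1.0, (a, b)), (-1.0, (a, c)), (1.0, (b, c))):
        B2[edges.index(e), t] = sign

# First-order Hodge Laplacian and Hodge decomposition of an edge signal
L1 = B1.T @ B1 + B2 @ B2.T
s = np.array([1.0, -0.5, 2.0, 0.3])                      # signal on the 4 edges
s_grad = B1.T @ np.linalg.lstsq(B1.T, s, rcond=None)[0]  # gradient (curl-free) part
s_curl = B2 @ np.linalg.lstsq(B2, s, rcond=None)[0]      # curl (solenoidal) part
s_harm = s - s_grad - s_curl                              # harmonic residual
print(np.round([s_grad, s_curl, s_harm], 3))
```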

    Dynamical systems defined on simplicial complexes: symmetries, conjugacies, and invariant subspaces

    We consider the general model for dynamical systems defined on a simplicial complex. We describe the conjugacy classes of these systems and show how symmetries in a given simplicial complex manifest in the dynamics defined thereon, especially with regard to invariant subspaces in the dynamics.
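As one hedged, concrete instance of such a model, the sketch below equips a small complex (four nodes, four edges, one filled triangle) with diffusive pairwise coupling on the edges and a product-form coupling on the triangle, then checks numerically that the synchronization diagonal {x_0 = x_1 = x_2 = x_3} is an invariant subspace. The specific coupling functions are illustrative choices, not the paper's general model.

```python
import numpy as np
from scipy.integrate import solve_ivp

# Toy complex: four nodes, four edges, and one filled triangle (0, 1, 2).
edges = [(0, 1), (0, 2), (1, 2), (1, 3)]
triangles = [(0, 1, 2)]

def rhs(_, x, eps1=0.5, eps2=0.2):
    """One admissible instance of dynamics on a simplicial complex:
    local decay, diffusive pairwise coupling on each edge, and a
    product-form triadic coupling on each filled triangle."""
    dx = -x.copy()                              # local node dynamics
    for i, j in edges:                          # 1-simplex interactions
        dx[i] += eps1 * (x[j] - x[i])
        dx[j] += eps1 * (x[i] - x[j])
    for i, j, k in triangles:                   # 2-simplex interactions
        dx[i] += eps2 * (x[j] * x[k] - x[i] ** 2)
        dx[j] += eps2 * (x[i] * x[k] - x[j] ** 2)
        dx[k] += eps2 * (x[i] * x[j] - x[k] ** 2)
    return dx

# Starting on the diagonal {x_0 = x_1 = x_2 = x_3}, every coupling term
# vanishes, so the synchronised subspace is invariant under the flow.
sol = solve_ivp(rhs, (0.0, 5.0), np.full(4, 0.7))
print(np.ptp(sol.y[:, -1]))     # spread across nodes remains ~0
```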

    Factor analysis of dynamic PET images

    Thanks to its ability to evaluate metabolic functions in tissues from the temporal evolution of a previously injected radiotracer, dynamic positron emission tomography (PET) has become a ubiquitous analysis tool to quantify biological processes. Several quantification techniques from the PET imaging literature require a previous estimation of global time-activity curves (TACs) (herein called "factors") representing the concentration of tracer in a reference tissue or blood over time. To this end, factor analysis has often appeared as an unsupervised learning solution for the extraction of factors and their respective fractions in each voxel. Inspired by the hyperspectral unmixing literature, this manuscript addresses two main drawbacks of general factor analysis techniques applied to dynamic PET. The first one is the assumption that the elementary response of each tissue to tracer distribution is spatially homogeneous. Even though this homogeneity assumption has proven its effectiveness in several factor analysis studies, it may not always provide a sufficient description of the underlying data, in particular when abnormalities are present. To tackle this limitation, the models proposed here introduce an additional degree of freedom to the factors related to specific binding. To this end, a spatially-variant perturbation affects a nominal and common TAC representative of the high-uptake tissue. This variation is spatially indexed and constrained with a dictionary that is either previously learned or explicitly modelled with convolutional nonlinearities affecting non-specific binding tissues. The second drawback is related to the noise distribution in PET images. Even though the positron decay process can be described by a Poisson distribution, the actual noise in reconstructed PET images is not expected to be simply described by Poisson or Gaussian distributions. Therefore, we propose to consider a popular and quite general loss function, called the β-divergence, that is able to generalize conventional loss functions such as the least-squares distance and the Kullback-Leibler and Itakura-Saito divergences, respectively corresponding to Gaussian, Poisson, and Gamma distributions. This loss function is applied to three factor analysis models in order to evaluate its impact on dynamic PET images with different reconstruction characteristics.
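The β-divergence referred to above is a one-parameter family of element-wise losses that recovers the half squared Euclidean distance at β = 2, the Kullback-Leibler divergence at β = 1, and the Itakura-Saito divergence at β = 0, matching Gaussian, Poisson, and Gamma noise respectively. A minimal sketch of that standard definition (the function name and test values are illustrative):

```python
import numpy as np

def beta_divergence(x, y, beta):
    """Element-wise beta-divergence d_beta(x | y), summed over all entries.
    beta = 2 -> half squared Euclidean, beta = 1 -> Kullback-Leibler,
    beta = 0 -> Itakura-Saito (Gaussian, Poisson, Gamma noise respectively)."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    if beta == 1:                       # Kullback-Leibler divergence
        return np.sum(x * np.log(x / y) - x + y)
    if beta == 0:                       # Itakura-Saito divergence
        return np.sum(x / y - np.log(x / y) - 1)
    return np.sum((x**beta + (beta - 1) * y**beta - beta * x * y**(beta - 1))
                  / (beta * (beta - 1)))

x = np.array([1.0, 2.0, 0.5])
y = np.array([1.2, 1.8, 0.6])
print(beta_divergence(x, y, 2), 0.5 * np.sum((x - y) ** 2))  # identical for beta = 2
```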

    New approaches for unsupervised transcriptomic data analysis based on Dictionary learning

    The era of high-throughput data generation enables new access to biomolecular profiles and their exploitation. However, the analysis of such biomolecular data, for example transcriptomic data, suffers from the so-called "curse of dimensionality". This occurs in the analysis of datasets with a significantly larger number of variables than data points. As a consequence, overfitting and the unintentional learning of process-independent patterns can appear, which can lead to results of little significance in application. A common way of counteracting this problem is the application of dimension reduction methods and subsequent analysis of the resulting low-dimensional representation, which has a smaller number of variables. In this thesis, two new methods for the analysis of transcriptomic datasets are introduced and evaluated. Our methods are based on the concepts of Dictionary learning, an unsupervised dimension reduction approach. Unlike many dimension reduction approaches that are widely applied for transcriptomic data analysis, Dictionary learning does not impose constraints on the components that are to be derived. This allows for great flexibility when adjusting the representation to the data. Further, Dictionary learning belongs to the class of sparse methods. The result of sparse methods is a model with few non-zero coefficients, which is often preferred for its simplicity and ease of interpretation. Sparse methods exploit the fact that the analysed datasets are highly structured. Indeed, transcriptomic data are particularly characterized by such structure, which arises, for example, from the connections between genes and pathways. Nonetheless, the application of Dictionary learning in medical data analysis has so far been mainly restricted to image analysis. Another advantage of Dictionary learning is that it is an interpretable approach. Interpretability is a necessity in biomolecular data analysis to gain a holistic understanding of the investigated processes. Our two new transcriptomic data analysis methods are each designed for one main task: (1) identification of subgroups for samples from mixed populations, and (2) temporal ordering of samples from dynamic datasets, also referred to as "pseudotime estimation". Both methods are evaluated on simulated and real-world data and compared to other methods that are widely applied in transcriptomic data analysis. Our methods achieve high performance and overall outperform the comparison methods.
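The shared computational core of both methods is the sparse dictionary decomposition itself: expression matrix ≈ sparse code × dictionary, with each dictionary atom playing the role of an expression program. The sketch below shows that generic step with scikit-learn's DictionaryLearning on simulated data; the matrix sizes, component count, and sparsity penalty are illustrative assumptions, not the thesis' tailored algorithms.

```python
import numpy as np
from sklearn.decomposition import DictionaryLearning

# Toy "transcriptomic" matrix: 60 samples x 300 genes, built from 4 hidden
# expression programs so the sparse decomposition has something to recover.
rng = np.random.default_rng(2)
programs = rng.normal(size=(4, 300))
usage = rng.exponential(size=(60, 4)) * rng.binomial(1, 0.4, size=(60, 4))
X = usage @ programs + rng.normal(scale=0.1, size=(60, 300))

# Dictionary learning: X ≈ code @ dictionary, with a sparse code per sample.
# n_components and the alpha penalties are illustrative choices.
dl = DictionaryLearning(n_components=4, alpha=1.0, max_iter=200,
                        transform_algorithm="lasso_lars", transform_alpha=1.0,
                        random_state=0)
code = dl.fit_transform(X)          # samples x components, mostly zeros
dictionary = dl.components_         # components x genes ("expression programs")

print("non-zero coefficients per sample:", (code != 0).sum(axis=1).mean())
```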

    Untangling hotel industry’s inefficiency: An SFA approach applied to a renowned Portuguese hotel chain

    The present paper explores the technical efficiency of four hotels from the Teixeira Duarte Group, a renowned Portuguese hotel chain. An efficiency ranking is established for these four hotel units, located in Portugal, using Stochastic Frontier Analysis. This methodology makes it possible to discriminate between measurement error and systematic inefficiency in the estimation process, enabling investigation of the main causes of inefficiency. Several suggestions for efficiency improvement are made for each hotel studied.
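Stochastic Frontier Analysis models log output as a deterministic frontier plus a composed error: symmetric noise v (measurement error) minus a one-sided inefficiency term u. The sketch below fits the standard normal/half-normal production-frontier model by maximum likelihood on simulated data; the single-input Cobb-Douglas form, variable names, and starting values are assumptions for illustration, not the paper's specification.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

# Simulated single-input production data: log output = b0 + b1*log input + v - u,
# with v ~ N(0, sigma_v^2) noise and u ~ |N(0, sigma_u^2)| inefficiency.
rng = np.random.default_rng(3)
n = 200
logx = rng.normal(2.0, 0.5, n)
v = rng.normal(0.0, 0.15, n)
u = np.abs(rng.normal(0.0, 0.30, n))
logy = 1.0 + 0.6 * logx + v - u

def neg_loglik(theta):
    """Negative log-likelihood of the normal / half-normal frontier model."""
    b0, b1, log_sv, log_su = theta
    sv, su = np.exp(log_sv), np.exp(log_su)
    sigma = np.hypot(sv, su)                      # total standard deviation
    lam = su / sv                                 # inefficiency-to-noise ratio
    eps = logy - b0 - b1 * logx                   # composed residual v - u
    ll = (np.log(2.0 / sigma) + norm.logpdf(eps / sigma)
          + norm.logcdf(-eps * lam / sigma))
    return -ll.sum()

res = minimize(neg_loglik, x0=[0.0, 0.5, np.log(0.2), np.log(0.2)],
               method="Nelder-Mead")
b0, b1, log_sv, log_su = res.x
print(f"frontier: {b0:.2f} + {b1:.2f} * log(input), "
      f"sigma_v={np.exp(log_sv):.2f}, sigma_u={np.exp(log_su):.2f}")
```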

    Exploring and controlling for underlying structure in genome and microbiome case-control association studies

    Case-control association studies in human genetics and the microbiome pave the way to personalized medicine by enabling personalized risk assessment, improved prognosis, or early diagnosis. However, confounding due to population structure, or other unobserved factors, can produce spurious findings or mask true associations if not detected and corrected for. As a consequence, improperly accounted-for underlying structure could explain the lack of power or the unsuccessful replications observed in some case-control association studies. Moreover, points considered outliers are commonly removed in such studies, although they do not always correspond to technical errors. A wealth of methods exists to determine structure in genetic and microbiome association studies. However, there are few systematic comparisons between these methods in the context of genetic or microbiome association studies, and even fewer attempts to apply robust methods, which produce stable estimates of confounding underlying structure and are able to incorporate information from outliers without degrading estimate quality. Consequently, the aim of this thesis was to detect and robustly control for underlying confounding structure in genetic and microbiome data, by systematically comparing the most relevant standard and robust forms of principal component analysis (PCA)- and multidimensional scaling (MDS)-based methods, and by contributing new robust methods. The contributions include the robustification of existing methods, their adaptation to the genetic or microbiome framework, and a dimensionality exploration and reduction method, nSimplices. The analysed datasets include a first synthetic example with a low-variance two-group confounding structure, a second synthetic example with a simple linear underlying structure, genome-wide single nucleotide polymorphism (SNP) data from 860 case and control individuals enrolled in the European Prospective Investigation into Cancer and Nutrition (EPIC prostate), and, finally, 2,255 microbiome samples from the Human Microbiome Project (HMP). Synthetic or real outliers were added in the second example and in the EPIC and HMP datasets. All meaningful existing and contributed methods were applied to the EPIC and HMP datasets, while a restricted set was applied to the synthetic, illustrative examples. The 10 principal components or top axes resulting from each method were kept for further analysis. The quality of a method was assessed by how well these axes summarized the underlying structure (using Akaike's information criterion (AIC) from the regression of the 10 axes on the known underlying structure in the data), and by how robust the estimates remained in the presence of outliers (adjusted R² from the regression of each outlier-disturbed axis on the original axis). In synthetic example 1, only independent component analysis (ICA) was able to uncover the low-variance confounding structure, whereas PCA and MDS failed to do so, in agreement with the fact that these methods detect large rather than small variance or distance components. In synthetic example 2, non-metric MDS remained the most representative and robust method when distance outliers were included, while nSimplices combined with classical MDS was the only method to stay representative and robust when contextual outliers were present. In the EPIC dataset, Eigenstrat was the most representative method (AIC of 782.8), whereas sample ancestry was best captured by the new method gMCD (unbiased genetic relatedness estimates used in a minimum covariance determinant procedure).
The methods gMCD, spherical PCA, IBS (MDS on identity-by-state estimates), and nSimplices were more robust than Eigenstrat, with a small to moderate loss of representativity (AIC between 789.6 and 864.9). Association testing yielded p-values comparable to published values for candidate SNPs. Further SNPs with the lowest p-values (rs8071475, rs3799631, rs2589118) were identified, whose known roles in other disorders could point to an indirect link with prostate cancer. In the HMP dataset, the new method nSimplices combined with the data-driven normalization method qMDS best mirrored the underlying structure. The most robust method was qMDS (with nSimplices or alone), followed by CSS and MDS. Lastly, the original method nSimplices performed at least comparably in all settings (except for ancestry in EPIC), and in some cases considerably better than other methods, while remaining tractable and fast on high-dimensional datasets. The improved performance of gMCD and qMDS agrees with the fact that these methods use adapted measures (genetic relatedness and a selected model distribution, respectively) and recognized robust approaches (minimum covariance determinant and quantiles). Conversely, wMDS is likely to have failed because variance is not an adequate parameter for microbiome data. More generally, different methods report the underlying structure differently and are advantageous in different settings; for example, PCA or non-metric MDS were best in some settings but failed in others. Finally, the original method nSimplices proved useful or markedly better in a variety of settings, with the exception of highly noisy datasets, and provided that distance outliers were corrected. Current genetic case-control association studies tend to integrate several types of data, for example clinical and SNP data, or several omics datasets. These approaches are promising but could be subject to increased inaccuracies or replication issues simply through the combination of several sources of data. This motivates a reinforced use of robust methods that are able to mirror genetic information accurately and stably, such as gMCD, nSimplices, or spherical PCA. Nevertheless, the results on Eigenstrat show that it remains a reasonable method. Results on the microbiome confirmed that MDS based on proportions is a suboptimal method and suggested that the exponential distribution should be considered instead of multinomial-based distributions, likely because the exponential better represents the inherent competition between phylogenies in the microbiome. Moreover, illustrative and real-world examples showed that methods can capture relevant but different information, encouraging the application of several complementary methods when starting to explore a dataset. In particular, a low-variance confounder could remain undetected by some methods. Additionally, methods based on least absolute residuals revealed several shortcomings in spite of their utility in a univariate framework, but their expected benefit in a multivariate setting should motivate the development of more tractable implementations. Finally, SPH, IBS, and gMCD are recommended methods for a genetic SNP dataset, while Eigenstrat should perform best if no more than 2% outliers are present. To mirror structure in a microbiome dataset, nSimplices (combined with qMDS, or with CSS) can be expected to perform best, whereas MDS on proportions is likely to underperform.
The nSimplices method proved beneficial or markedly better in a variety of situations and should therefore be considered for analysing datasets including, but not limited to, genetic SNP and microbiome abundance data.
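The evaluation loop described above, summarizing samples with the top 10 axes of a structure-detection method and then scoring how well those axes reflect the known underlying structure via regression and AIC, can be sketched generically. The example below uses plain PCA on a simulated two-population SNP matrix as a stand-in for the compared methods, and sums per-axis AICs from regressing each axis on the known group labels; the simulation parameters and the way per-axis AICs are combined are assumptions for illustration, not the thesis' exact protocol.

```python
import numpy as np
import statsmodels.api as sm
from sklearn.decomposition import PCA

# Simulated SNP matrix with a two-population confounding structure:
# two groups with slightly shifted allele frequencies.
rng = np.random.default_rng(4)
n_per_group, n_snps = 100, 500
freqs = rng.uniform(0.1, 0.9, n_snps)
shift = rng.normal(0, 0.05, n_snps)
geno = np.vstack([
    rng.binomial(2, freqs, (n_per_group, n_snps)),
    rng.binomial(2, np.clip(freqs + shift, 0.01, 0.99), (n_per_group, n_snps)),
]).astype(float)
group = np.repeat([0.0, 1.0], n_per_group)   # the known underlying structure

# Step 1: summarise the samples with the top 10 axes of one structure-detection
# method (plain PCA here, standing in for the methods compared in the thesis).
axes = PCA(n_components=10).fit_transform(geno)

# Step 2: regress each axis on the known structure and read off the AIC;
# lower summed AIC means the axes better capture the known structure.
aics = []
for k in range(axes.shape[1]):
    fit = sm.OLS(axes[:, k], sm.add_constant(group)).fit()
    aics.append(fit.aic)
print("summed AIC of the 10 axes regressed on known structure:",
      round(sum(aics), 1))
```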