Finding submatrices of maximal sum : applications to the analysis of gene expression data
An important aspect of cancer research is the development of better tools to understand underlying cellular processes. These tools are crucial as they help clinicians choose the best treatment strategy for each patient or develop new treatment strategies. Gene expression data is typically represented as a large matrix of gene expression levels across various samples. The study of such data is a valuable tool to improve the understanding of biological processes. Therefore, grouping genes according to their expression under certain conditions, or grouping conditions based on the expression of some genes, is a frequent objective of gene expression analysis. Biclustering, also known as co-clustering, is one of the most common approaches for such a task. It identifies specific subsets of rows and columns that jointly form homogeneous entries. However, relevant gene/sample combinations can be missed when they lack the assumed homogeneity of expression values. This is a growing concern as cancer is a heterogeneous disease. Thus, there is an ongoing trend for the study of cellular processes by combining heterogeneous data sources. This thesis is centered around the development of approaches that find patterns of high values in large data matrices. It encompasses the definition of optimization problems and algorithmic solutions to find such patterns. The relevance of these contributions is evaluated through implementation and comparative experiments on biological data. (FSA - Sciences de l'ingénieur) -- UCL, 202
Improving intraspecific allele networks inferred by maximum parsimony
Allele (or haplotype) networks are often used in phylogeographic studies to display genetic variation within a species or a group of closely related species. A global maximum parsimony approach to infer allele networks, arguably the method of choice to display genetic variation at the intraspecific level, consists in inferring all most parsimonious trees from a DNA sequence alignment and combining the corresponding phylograms into a single graph. However, it has been suggested that, while classic phylogenetic programs generate a single phylogram per most parsimonious tree, deriving all possible phylograms from them would allow identifying additional most parsimonious paths among alleles, thereby improving this network inference method. We test this prediction by analysing both simulated and empirical DNA sequence alignments. For this purpose, a computer program, CPN, was developed to implement the entire procedure, starting with a set of most parsimonious trees and combining all derived phylograms into a network. We show that including all possible most parsimonious phylograms indeed often results in finding additional most parsimonious paths in the network graph, thereby improving the search for a global maximum parsimony solution. We highly recommend the use of this approach in future phylogeographic studies, to ensure that all most parsimonious paths are included in the allele network, instead of an arbitrarily selected subset of those.
Identifying gene-specific subgroups: an alternative to biclustering
Background: Transcriptome analysis aims at gaining insight into cellular processes through discovering gene expression patterns across various experimental conditions. Biclustering is a standard approach to discover gene subsets with similar expression across subgroups of samples to be identified. The result is a set of biclusters, each forming a specific submatrix of rows (e.g. genes) and columns (e.g. samples). Relevant biclusters can, however, be missed when, due to the presence of a few outliers, they lack the assumed homogeneity of expression values among a few gene/sample combinations. The Max-Sum SubMatrix problem addresses this issue by looking at highly expressed subsets of genes and of samples, without enforcing such homogeneity. Results: We present here the K-CPGC algorithm to identify K relevant submatrices. Our main contribution is to show that this approach outperforms biclustering algorithms to identify several gene subsets representative of specific subgroups of samples. Experiments are conducted on 35 gene expression datasets from human tissues and yeast samples. We report comparative results with those obtained by several biclustering algorithms, including CCA, xMOTIFs, ISA, QUBIC, Plaid and Spectral. Gene enrichment analysis demonstrates the benefits of the proposed approach to identify more statistically significant gene subsets. The most significant Gene Ontology terms identified with K-CPGC are shown to be consistent with the controlled conditions of each dataset. This analysis supports the biological relevance of the identified gene subsets. An additional contribution is the statistical validation protocol proposed here to assess the relative performances of biclustering algorithms and of the proposed method. It relies on a Friedman test and Hochberg's sequential procedure to report critical differences of ranks among all algorithms.
Conclusions: We propose here the K-CPGC method, a computationally efficient algorithm to identify K max-sum submatrices in a large gene expression matrix. Comparisons show that it identifies more significantly enriched subsets of genes and specific subgroups of samples which are easily interpretable by biologists. Experiments also show its ability to identify more reliable GO terms. These results illustrate the benefits of the proposed approach in terms of interpretability and of biological enrichment quality. An open implementation of this algorithm is available as an R package.
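The validation protocol named in this abstract (a Friedman test followed by Hochberg's procedure) can be illustrated with a minimal pure-Python sketch. It is not the authors' protocol or code: the function names, the "higher score is better" convention, and the input layout (one row of scores per dataset, one column per algorithm) are assumptions made for illustration, and the Friedman statistic below omits the usual tie correction.

```python
def average_ranks(scores):
    """Rank one dataset's scores (1 = highest); tied scores share the average rank."""
    order = sorted(range(len(scores)), key=lambda j: -scores[j])
    ranks = [0.0] * len(scores)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the block while the next score ties with the current one.
        while end + 1 < len(order) and scores[order[end + 1]] == scores[order[pos]]:
            end += 1
        shared = (pos + end) / 2 + 1  # average of the 1-based positions pos+1..end+1
        for j in order[pos:end + 1]:
            ranks[j] = shared
        pos = end + 1
    return ranks


def friedman_statistic(table):
    """Friedman chi-square statistic (no tie correction) for n datasets x k algorithms."""
    n, k = len(table), len(table[0])
    rank_sums = [0.0] * k
    for scores in table:
        for j, r in enumerate(average_ranks(scores)):
            rank_sums[j] += r
    return 12.0 / (n * k * (k + 1)) * sum(r * r for r in rank_sums) - 3.0 * n * (k + 1)


def hochberg_reject(p_values, alpha=0.05):
    """Hochberg step-up procedure: return one reject/keep flag per hypothesis."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff = 0  # number of smallest p-values whose hypotheses are rejected
    for rank in range(m, 0, -1):  # start from the largest p-value
        if p_values[order[rank - 1]] <= alpha / (m - rank + 1):
            cutoff = rank
            break
    rejected = [False] * m
    for i in order[:cutoff]:
        rejected[i] = True
    return rejected
```

In such a protocol the Friedman statistic would first be compared against a chi-square distribution with k - 1 degrees of freedom, and the pairwise p-values would then be passed to `hochberg_reject`.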
Combinatorial Optimization Algorithms to Mine a Sub-Matrix of Maximal Sum
Biclustering techniques have been widely used to identify homogeneous subgroups within large data matrices, such as subsets of genes similarly expressed across subsets of patients. Mining a max-sum sub-matrix is a related but distinct problem for which one looks for a (not necessarily contiguous) rectangular sub-matrix with a maximal sum of its entries. Le Van et al. (Ranked Tiling, 2014) already illustrated its applicability to gene expression analysis and addressed it with a constraint programming (CP) approach combined with large neighborhood search (LNS). In this work, we exhibit some key properties of this NP-hard problem and define a bounding function such that larger problems can be solved in reasonable time. The use of these properties results in an improved CP-LNS implementation evaluated here. Two additional algorithms are also proposed in order to exploit the highlighted characteristics of the problem: a CP approach with a global constraint (CPGC) and a mixed integer linear programming (MILP) formulation. Practical experiments conducted both on synthetic and real gene expression data exhibit the characteristics of these approaches and their relative benefits over the CP-LNS method. Overall, the CPGC approach tends to be the fastest to produce a good solution. Yet, the MILP formulation is arguably the easiest to formulate and can also be competitive.
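To make the max-sum sub-matrix objective concrete, here is a small exhaustive sketch in Python. It is not CP-LNS, CPGC, or the MILP approach described above, and it only scales to a handful of columns. It relies on one elementary observation: once the columns are fixed, the best choice of rows is exactly the set of rows whose sum over those columns is positive. The function names are hypothetical.

```python
from itertools import combinations


def best_rows_for(matrix, cols):
    """For fixed columns, keep the rows whose sum over those columns is positive."""
    rows, total = [], 0.0
    for i, row in enumerate(matrix):
        s = sum(row[j] for j in cols)
        if s > 0:
            rows.append(i)
            total += s
    return rows, total


def max_sum_submatrix(matrix):
    """Enumerate every non-empty column subset; exponential in the number of columns."""
    n_cols = len(matrix[0]) if matrix else 0
    best = ([], [], 0.0)  # (rows, cols, sum); the empty selection has sum 0
    for k in range(1, n_cols + 1):
        for cols in combinations(range(n_cols), k):
            rows, total = best_rows_for(matrix, cols)
            if total > best[2]:
                best = (rows, list(cols), total)
    return best


# Toy matrix: the selected rows and columns need not be contiguous.
M = [[2, -1, 3],
     [-4, 1, -2],
     [1, -3, 2]]
rows, cols, total = max_sum_submatrix(M)
```

On this toy matrix the best selection is rows 0 and 2 with columns 0 and 2, skipping the middle row and column.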
Mining a sub-matrix of maximal sum
Biclustering techniques have been widely used to identify homogeneous subgroups within large data matrices, such as subsets of genes similarly expressed across subsets of patients. Mining a max-sum sub-matrix is a related but distinct problem for which one looks for a (not necessarily contiguous) rectangular sub-matrix with a maximal sum of its entries. Le Van et al. (Ranked Tiling, 2014) already illustrated its applicability to gene expression analysis and addressed it with a constraint programming (CP) approach combined with large neighborhood search (CP-LNS). In this work, we exhibit some key properties of this NP-hard problem and define a bounding function such that larger problems can be solved in reasonable time. Two different algorithms are proposed in order to exploit the highlighted characteristics of the problem: a CP approach with a global constraint (CPGC) and a mixed integer linear programming (MILP) formulation. Practical experiments conducted both on synthetic and real gene expression data exhibit the characteristics of these approaches and their relative benefits over the original CP-LNS method. Overall, the CPGC approach tends to be the fastest to produce a good solution. Yet, the MILP formulation is arguably the easiest to formulate and can also be competitive.
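As an illustration of why a MILP model is easy to state for this problem, one possible formulation is sketched below. It uses the textbook linearization of a product of binary variables and is not claimed to be the formulation used in the paper. The symbols are assumptions: $M_{ij}$ is the matrix entry, $r_i$ and $c_j$ indicate whether row $i$ and column $j$ are selected, and $x_{ij}$ indicates whether entry $(i,j)$ belongs to the selected sub-matrix.

```latex
\begin{align*}
\max \quad & \sum_{i}\sum_{j} M_{ij}\, x_{ij} \\
\text{s.t.} \quad
  & x_{ij} \le r_i            && \forall i, j \\
  & x_{ij} \le c_j            && \forall i, j \\
  & x_{ij} \ge r_i + c_j - 1  && \forall i, j \\
  & r_i,\ c_j,\ x_{ij} \in \{0, 1\} && \forall i, j
\end{align*}
```

The three inequalities force $x_{ij} = r_i c_j$, so the objective equals the sum of the entries at the intersection of the selected rows and columns.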