80 research outputs found

    Identification of coherent patterns in gene expression data using an efficient biclustering algorithm and parallel coordinate visualization

    Get PDF
    2007-2008 > Academic research: refereed > Publication in refereed journalVersion of RecordPublishe

    Identification of coherent patterns in gene expression data using an efficient biclustering algorithm and parallel coordinate visualization

    Get PDF
    <p>Abstract</p> <p>Background</p> <p>The DNA microarray technology allows the measurement of expression levels of thousands of genes under tens/hundreds of different conditions. In microarray data, genes with similar functions usually co-express under certain conditions only <abbrgrp><abbr bid="B1">1</abbr></abbrgrp>. Thus, biclustering which clusters genes and conditions simultaneously is preferred over the traditional clustering technique in discovering these coherent genes. Various biclustering algorithms have been developed using different bicluster formulations. Unfortunately, many useful formulations result in NP-complete problems. In this article, we investigate an efficient method for identifying a popular type of biclusters called additive model. Furthermore, parallel coordinate (PC) plots are used for bicluster visualization and analysis.</p> <p>Results</p> <p>We develop a novel and efficient biclustering algorithm which can be regarded as a greedy version of an existing algorithm known as pCluster algorithm. By relaxing the constraint in homogeneity, the proposed algorithm has polynomial-time complexity in the worst case instead of exponential-time complexity as in the pCluster algorithm. Experiments on artificial datasets verify that our algorithm can identify both additive-related and multiplicative-related biclusters in the presence of overlap and noise. Biologically significant biclusters have been validated on the yeast cell-cycle expression dataset using Gene Ontology annotations. Comparative study shows that the proposed approach outperforms several existing biclustering algorithms. We also provide an interactive exploratory tool based on PC plot visualization for determining the parameters of our biclustering algorithm.</p> <p>Conclusion</p> <p>We have proposed a novel biclustering algorithm which works with PC plots for an interactive exploratory analysis of gene expression data. Experiments show that the biclustering algorithm is efficient and is capable of detecting co-regulated genes. The interactive analysis enables an optimum parameter determination in the biclustering algorithm so as to achieve the best result. In future, we will modify the proposed algorithm for other bicluster models such as the coherent evolution model.</p

    DNA Microarray Data Analysis: A New Survey on Biclustering

    Get PDF
    There are subsets of genes that have similar behavior under subsets of conditions, so we say that they coexpress, but behave independently under other subsets of conditions. Discovering such coexpressions can be helpful to uncover genomic knowledge such as gene networks or gene interactions. That is why, it is of utmost importance to make a simultaneous clustering of genes and conditions to identify clusters of genes that are coexpressed under clusters of conditions. This type of clustering is called biclustering.Biclustering is an NP-hard problem. Consequently, heuristic algorithms are typically used to approximate this problem by finding suboptimal solutions. In this paper, we make a new survey on biclustering of gene expression data, also called microarray data

    DeBi: Discovering Differentially Expressed Biclusters using a Frequent Itemset Approach

    Get PDF
    <p>Abstract</p> <p>Background</p> <p>The analysis of massive high throughput data via clustering algorithms is very important for elucidating gene functions in biological systems. However, traditional clustering methods have several drawbacks. Biclustering overcomes these limitations by grouping genes and samples simultaneously. It discovers subsets of genes that are co-expressed in certain samples. Recent studies showed that biclustering has a great potential in detecting marker genes that are associated with certain tissues or diseases. Several biclustering algorithms have been proposed. However, it is still a challenge to find biclusters that are significant based on biological validation measures. Besides that, there is a need for a biclustering algorithm that is capable of analyzing very large datasets in reasonable time.</p> <p>Results</p> <p>Here we present a fast biclustering algorithm called DeBi (Differentially Expressed BIclusters). The algorithm is based on a well known data mining approach called frequent itemset. It discovers maximum size homogeneous biclusters in which each gene is strongly associated with a subset of samples. We evaluate the performance of DeBi on a yeast dataset, on synthetic datasets and on human datasets.</p> <p>Conclusions</p> <p>We demonstrate that the DeBi algorithm provides functionally more coherent gene sets compared to standard clustering or biclustering algorithms using biological validation measures such as Gene Ontology term and Transcription Factor Binding Site enrichment. We show that DeBi is a computationally efficient and powerful tool in analyzing large datasets. The method is also applicable on multiple gene expression datasets coming from different labs or platforms.</p

    Development of Biclustering Techniques for Gene Expression Data Modeling and Mining

    Get PDF
    The next-generation sequencing technologies can generate large-scale biological data with higher resolution, better accuracy, and lower technical variation than the arraybased counterparts. RNA sequencing (RNA-Seq) can generate genome-scale gene expression data in biological samples at a given moment, facilitating a better understanding of cell functions at genetic and cellular levels. The abundance of gene expression datasets provides an opportunity to identify genes with similar expression patterns across multiple conditions, i.e., co-expression gene modules (CEMs). Genomescale identification of CEMs can be modeled and solved by biclustering, a twodimensional data mining technique that allows clustering of rows and columns in a gene expression matrix, simultaneously. Compared with traditional clustering that targets global patterns, biclustering can predict local patterns. This unique feature makes biclustering very useful when applied to big gene expression data since genes that participate in a cellular process are only active in specific conditions, thus are usually coexpressed under a subset of all conditions. The combination of biclustering and large-scale gene expression data holds promising potential for condition-specific functional pathway/network analysis. However, existing biclustering tools do not have satisfied performance on high-resolution RNA-Seq data, majorly due to the lack of (i) a consideration of high sparsity of RNA-Seq data, especially for scRNA-Seq data, and (ii) an understanding of the underlying transcriptional regulation signals of the observed gene expression values. QUBIC2, a novel biclustering algorithm, is designed for large-scale bulk RNA-Seq and single-cell RNA-seq (scRNA-Seq) data analysis. Critical novelties of the algorithm include (i) used a truncated model to handle the unreliable quantification of genes with low or moderate expression; (ii) adopted the Gaussian mixture distribution and an information-divergency objective function to capture shared transcriptional regulation signals among a set of genes; (iii) utilized a Dual strategy to expand the core biclusters, aiming to save dropouts from the background; and (iv) developed a statistical framework to evaluate the significances of all the identified biclusters. Method validation on comprehensive data sets suggests that QUBIC2 had superior performance in functional modules detection and cell type classification. The applications of temporal and spatial data demonstrated that QUBIC2 could derive meaningful biological information from scRNA-Seq data. Also presented in this dissertation is QUBICR. This R package is characterized by an 82% average improved efficiency compared to the source C code of QUBIC. It provides a set of comprehensive functions to facilitate biclustering-based biological studies, including the discretization of expression data, query-based biclustering, bicluster expanding, biclusters comparison, heatmap visualization of any identified biclusters, and co-expression networks elucidation. In the end, a systematical summary is provided regarding the primary applications of biclustering for biological data and more advanced applications for biomedical data. It will assist researchers to effectively analyze their big data and generate valuable biological knowledge and novel insights with higher efficiency

    A biclustering algorithm based on a Bicluster Enumeration Tree: application to DNA microarray data

    Get PDF
    <p>Abstract</p> <p>Background</p> <p>In a number of domains, like in DNA microarray data analysis, we need to cluster simultaneously rows (genes) and columns (conditions) of a data matrix to identify groups of rows coherent with groups of columns. This kind of clustering is called <it>biclustering</it>. Biclustering algorithms are extensively used in DNA microarray data analysis. More effective biclustering algorithms are highly desirable and needed.</p> <p>Methods</p> <p>We introduce <it>BiMine</it>, a new enumeration algorithm for biclustering of DNA microarray data. The proposed algorithm is based on three original features. First, <it>BiMine </it>relies on a new evaluation function called <it>Average Spearman's rho </it>(ASR). Second, <it>BiMine </it>uses a new tree structure, called <it>Bicluster Enumeration Tree </it>(BET), to represent the different biclusters discovered during the enumeration process. Third, to avoid the combinatorial explosion of the search tree, <it>BiMine </it>introduces a parametric rule that allows the enumeration process to cut tree branches that cannot lead to good biclusters.</p> <p>Results</p> <p>The performance of the proposed algorithm is assessed using both synthetic and real DNA microarray data. The experimental results show that <it>BiMine </it>competes well with several other biclustering methods. Moreover, we test the biological significance using a gene annotation web-tool to show that our proposed method is able to produce biologically relevant biclusters. The software is available upon request from the authors to academic users.</p

    Discernment of possible mechanisms of hepatotoxicity via biological processes over-represented by co-expressed genes

    Get PDF
    <p>Abstract</p> <p>Background</p> <p>Hepatotoxicity is a form of liver injury caused by exposure to stressors. Genomic-based approaches have been used to detect changes in transcription in response to hepatotoxicants. However, there are no straightforward ways of using co-expressed genes anchored to a phenotype or constrained by the experimental design for discerning mechanisms of a biological response.</p> <p>Results</p> <p>Through the analysis of a gene expression dataset containing 318 liver samples from rats exposed to hepatotoxicants and leveraging alanine aminotransferase (ALT), a serum enzyme indicative of liver injury as the phenotypic marker, we identified biological processes and molecular pathways that may be associated with mechanisms of hepatotoxicity. Our analysis used an approach called Coherent Co-expression Biclustering (cc-Biclustering) for clustering a subset of genes through a coherent (consistency) measure within each group of samples representing a subset of experimental conditions. Supervised biclustering identified 87 genes co-expressed and correlated with ALT in all the samples exposed to the chemicals. None of the over-represented pathways related to liver injury. However, biclusters with subsets of samples exposed to one of the 7 hepatotoxicants, but not to a non-toxic isomer, contained co-expressed genes that represented pathways related to a stress response. Unsupervised biclustering of the data resulted in 1) four to five times more genes within the bicluster containing all the samples exposed to the chemicals, 2) biclusters with co-expression of genes that discerned 1,4 dichlorobenzene (a non-toxic isomer at low and mid doses) from the other chemicals, pathways and biological processes that underlie liver injury and 3) a bicluster with genes up-regulated in an early response to toxic exposure.</p> <p>Conclusion</p> <p>We obtained clusters of co-expressed genes that over-represented biological processes and molecular pathways related to hepatotoxicity in the rat. The mechanisms involved in the response of the liver to the exposure to 1,4-dichlorobenzene suggest non-genotoxicity whereas the exposure to the hepatotoxicants could be DNA damaging leading to overall genomic instability and activation of cell cycle check point signaling. In addition, key pathways and biological processes representative of an inflammatory response, energy production and apoptosis were impacted by the hepatotoxicant exposures that manifested liver injury in the rat.</p
    corecore