23 research outputs found

    Co-clustering algorithm for the identification of cancer subtypes from gene expression data

    Cancer is a heterogeneous genetic disease comprising various subtypes that can be distinguished from gene expression data. Early diagnosis and prognosis of cancer type have become essential in cancer informatics research because they support the clinical treatment of patients. In addition, gene network interaction, which is significant for understanding the cellular and progressive mechanisms of cancer, has barely been considered in current research. Hence, the application of machine learning methods has become an important area for researchers to explore in order to categorize cancer genes into high- and low-risk groups or subtypes. Co-clustering is currently an extensively used data mining technique for analyzing gene expression data. This paper presents an improved network-assisted co-clustering method for the identification of cancer subtypes (iNCIS), which combines gene network information with gene expression data to obtain co-clusters. The effectiveness of iNCIS was evaluated on large-scale Breast Cancer (BRCA) and Glioblastoma Multiforme (GBM) data. The weighted co-clustering approach in iNCIS delivers a distinctive result by integrating the gene network into the clustering procedure.

    DNA Microarray Data Analysis: A New Survey on Biclustering

    There are subsets of genes that behave similarly under some subsets of conditions, so we say that they coexpress, but behave independently under other subsets of conditions. Discovering such coexpressions can help uncover genomic knowledge such as gene networks or gene interactions. That is why it is of utmost importance to cluster genes and conditions simultaneously, in order to identify clusters of genes that are coexpressed under clusters of conditions. This type of clustering is called biclustering. Biclustering is an NP-hard problem. Consequently, heuristic algorithms are typically used to approximate it by finding suboptimal solutions. In this paper, we present a new survey on biclustering of gene expression data, also called microarray data.

    Approximation Algorithms for Bregman Co-clustering and Tensor Clustering

    In the past few years powerful generalizations to the Euclidean k-means problem have been made, such as Bregman clustering [7], co-clustering (i.e., simultaneous clustering of rows and columns of an input matrix) [9,18], and tensor clustering [8,34]. Like k-means, these more general problems also suffer from the NP-hardness of the associated optimization. Researchers have developed approximation algorithms of varying degrees of sophistication for k-means, k-medians, and more recently also for Bregman clustering [2]. However, there seem to be no approximation algorithms for Bregman co- and tensor clustering. In this paper we derive the first (to our knowledge) guaranteed methods for these increasingly important clustering settings. Going beyond Bregman divergences, we also prove an approximation factor for tensor clustering with arbitrary separable metrics. Through extensive experiments we evaluate the characteristics of our method, and show that it also has practical impact. Comment: 18 pages; improved metric cas

    Pattern Detection of Economic and Pandemic Vulnerability Index in Indonesia Using Bi-Cluster Analysis

    Bi-clustering is an extension of clustering that groups data simultaneously in two directions. The Iterative Signature Algorithm (ISA) is a bi-clustering algorithm that works iteratively to find the most correlated bi-cluster. Detecting economic and pandemic vulnerability using bi-cluster analysis is essential for obtaining spatial patterns and an overview of Indonesia's economic and pandemic vulnerability characteristics. Bi-clustering using ISA requires setting row and column thresholds, which here form seventy threshold combinations. The best combination is chosen based on the average value of the mean square residue to volume ratios. In addition, the similarity of the best bi-cluster to the others is assessed using the Liu and Wang index values. The combination of a -1.0 row threshold and a -1.0 column threshold was selected and produced the best bi-cluster, with the smallest average value of mean square residue to volume ratios (0.00141). Based on the Liu and Wang index values, it has more than 95% similarity with the combination of -1.0 row and -0.9 column thresholds and with the combination of -0.9 row and -1.0 column thresholds. These selected threshold combinations produce three bi-clusters with five types of spatial patterns and different characteristics because of the overlap between the three bi-clusters.

    Bicluster Analysis of Cheng and Church's Algorithm to Identify Patterns of People's Welfare in Indonesia

    Biclustering is a method of grouping numerical data in which rows and columns are grouped simultaneously. The Cheng and Church (CC) algorithm is a bi-clustering algorithm that tries to find the maximum bi-cluster with high similarity, as measured by the MSR (Mean Square Residue). A set of rows and columns is called a bi-cluster if its MSR is lower than a predetermined threshold value (delta). Detection of people's welfare in Indonesia using bi-clustering is essential to get an overview of the characteristics of people's interest in each province in Indonesia. Bi-clustering using the CC algorithm requires a threshold value (delta), which is determined by first finding the MSR value of the actual data. The threshold value (delta) must be smaller than the MSR of the actual data. The threshold values used in this study are 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, and 0.8. After evaluating the candidate deltas by considering the MSR value and the bi-clusters formed, the optimum delta is found to be 0.1, which yields 4 bi-clusters.
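The MSR criterion described in this abstract can be illustrated with a minimal sketch. It assumes the standard Cheng and Church definition, in which the residue of an entry is the entry minus its row mean minus its column mean plus the overall mean of the submatrix. The function names and the toy matrix are hypothetical and are not taken from the study.

```python
from typing import Sequence


def mean_squared_residue(
    data: Sequence[Sequence[float]],
    rows: Sequence[int],
    cols: Sequence[int],
) -> float:
    """Mean squared residue of the submatrix selected by rows and cols."""
    sub = [[data[i][j] for j in cols] for i in rows]
    n_rows, n_cols = len(rows), len(cols)
    row_means = [sum(r) / n_cols for r in sub]
    col_means = [sum(sub[i][j] for i in range(n_rows)) / n_rows
                 for j in range(n_cols)]
    overall = sum(row_means) / n_rows
    total = sum(
        (sub[i][j] - row_means[i] - col_means[j] + overall) ** 2
        for i in range(n_rows)
        for j in range(n_cols)
    )
    return total / (n_rows * n_cols)


def is_delta_bicluster(data, rows, cols, delta: float) -> bool:
    """A submatrix counts as a bi-cluster when its MSR is below delta."""
    return mean_squared_residue(data, rows, cols) < delta


# Hypothetical toy matrix: the top-left 3x3 block is purely additive
# (each row is a shifted copy of the first), so its MSR is close to zero.
data = [
    [1.0, 2.0, 3.0, 9.0],
    [2.0, 3.0, 4.0, 0.0],
    [4.0, 5.0, 6.0, 5.0],
    [7.0, 1.0, 8.0, 2.0],
]

block_msr = mean_squared_residue(data, [0, 1, 2], [0, 1, 2])
full_msr = mean_squared_residue(data, [0, 1, 2, 3], [0, 1, 2, 3])
print(block_msr, full_msr)
print(is_delta_bicluster(data, [0, 1, 2], [0, 1, 2], delta=0.1))
```

The MSR of the full matrix plays the role of the "MSR of the actual data" mentioned above: any candidate delta would be chosen below that value.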

    BICLUSTERING APPLICATION IN INDONESIAN ECONOMIC AND PANDEMIC VULNERABILITY

    Biclustering is an analytical tool for grouping data along two dimensions simultaneously. The analysis was first introduced by Hartigan (1972) and applied by Cheng and Church (2000) to the gene expression matrix. The Cheng and Church (CC) algorithm is a popular biclustering algorithm and has been widely applied outside the field of biological data in recent years. Applying this algorithm to economic and COVID-19 pandemic vulnerability is of interest because it gives an overview of the spatial pattern and characteristics of the biclusters of economic and COVID-19 pandemic vulnerability in Indonesia. This study uses secondary data from several ministries. Forming a bicluster using the CC algorithm requires a delta threshold, so several candidate delta thresholds are tried and the best (optimum) one is chosen by evaluating the average value of mean square residue (MSR) to volume ratios. The similarity of the optimum bicluster to the others is also assessed using the Liu and Wang index values. The 0.01 delta threshold is chosen as the optimum threshold because it produces the smallest average value of MSR to volume ratios (0.00032). Based on the Liu and Wang index values, the optimum threshold has a similarity level below 50% with the other delta thresholds, so it is the best unique threshold. The optimum threshold resulted in six biclusters (six spatial patterns). Most regions in Indonesia (11 provinces) tend to have low economic and COVID-19 pandemic vulnerability in the characteristic variables of the first spatial pattern.
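The threshold selection described in this abstract can be sketched as follows. This is only an illustration under stated assumptions: the volume of a bicluster is taken to be its number of rows times its number of columns, and the score of a delta threshold is taken to be the mean of MSR divided by volume over the biclusters it produces. The function name and all numbers are hypothetical and are not the study's data.

```python
from statistics import mean
from typing import Dict, List, Tuple

# Each bicluster is summarised as (msr, n_rows, n_cols).
Bicluster = Tuple[float, int, int]


def avg_msr_to_volume(biclusters: List[Bicluster]) -> float:
    """Average of MSR / volume, with volume assumed to be n_rows * n_cols."""
    return mean(msr / (n_rows * n_cols) for msr, n_rows, n_cols in biclusters)


# Hypothetical results for three candidate delta thresholds.
candidates: Dict[float, List[Bicluster]] = {
    0.01: [(0.008, 5, 4), (0.009, 6, 3)],
    0.05: [(0.040, 6, 4), (0.045, 4, 4)],
    0.10: [(0.090, 7, 4)],
}

scores = {delta: avg_msr_to_volume(bcs) for delta, bcs in candidates.items()}

# The optimum threshold is the one with the smallest average ratio.
best_delta = min(scores, key=scores.get)
print(best_delta, scores[best_delta])
```

A smaller ratio favours biclusters that are both coherent (low MSR) and large (high volume), which is why the ratio is used rather than the MSR alone.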

    Techniques for clustering gene expression data

    Many clustering techniques have been proposed for the analysis of gene expression data obtained from microarray experiments. However, the choice of suitable method(s) for a given experimental dataset is not straightforward. Common approaches do not translate well and fail to take account of the data profile. This review paper surveys state-of-the-art applications that recognise these limitations and implement procedures to overcome them. It provides a framework for the evaluation of clustering in gene expression analyses. The nature of microarray data is discussed briefly. Selected examples are presented for the clustering methods considered.

    Extending bicluster analysis to annotate unclassified ORFs and predict novel functional modules using expression data

    Background: Microarrays have the capacity to measure the expressions of thousands of genes in parallel over many experimental samples. The unsupervised classification technique of bicluster analysis has been employed previously to uncover gene expression correlations over subsets of samples with the aim of providing a more accurate model of the natural gene functional classes. This approach also has the potential to aid functional annotation of unclassified open reading frames (ORFs). Until now this aspect of biclustering has been under-explored. In this work we illustrate how bicluster analysis may be extended into a 'semi-supervised' ORF annotation approach referred to as BALBOA. Results: The efficacy of the BALBOA ORF classification technique is first assessed via cross validation and compared to a multi-class k-Nearest Neighbour (kNN) benchmark across three independent gene expression datasets. BALBOA is then used to assign putative functional annotations to unclassified yeast ORFs. These predictions are evaluated using existing experimental and protein sequence information. Lastly, we employ a related semi-supervised method to predict the presence of novel functional modules within yeast. Conclusion: In this paper we demonstrate how unsupervised classification methods, such as bicluster analysis, may be extended using available annotations to form semi-supervised approaches within the gene expression analysis domain. We show that such methods have the potential to improve upon supervised approaches and shed new light on the functions of unclassified ORFs and their co-regulation.