287 research outputs found

    FLAME, a novel fuzzy clustering method for the analysis of DNA microarray data

    BACKGROUND: Data clustering analysis has been extensively applied to extract information from gene expression profiles obtained with DNA microarrays. To this aim, existing clustering approaches, mainly developed in computer science, have been adapted to microarray data analysis. However, previous studies revealed that microarray datasets have very diverse structures, some of which may not be correctly captured by current clustering methods. We therefore approached the problem from a new starting point, and developed a clustering algorithm designed to capture dataset-specific structures at the beginning of the process. RESULTS: The clustering algorithm is named Fuzzy clustering by Local Approximation of MEmbership (FLAME). Distinctive elements of FLAME are: (i) definition of the neighborhood of each object (gene or sample) and identification of objects with "archetypal" features named Cluster Supporting Objects, around which to construct the clusters; (ii) assignment to each object of a fuzzy membership vector approximated from the memberships of its neighboring objects, by an iterative converging process in which membership spreads from the Cluster Supporting Objects through their neighbors. Comparative analysis with K-means, hierarchical, fuzzy C-means and fuzzy self-organizing maps (SOM) showed that data partitions generated by FLAME are not superimposable on those of other methods and, although different types of datasets are better partitioned by different algorithms, FLAME displays the best overall performance. FLAME is implemented, together with all the above-mentioned algorithms, in a C++ program with a graphical interface for Linux and Windows named Gene Expression Data Analysis Studio (GEDAS), which can handle very large datasets and is freely available under the GNU General Public License.
CONCLUSION: The FLAME algorithm has intrinsic advantages, such as the ability to capture non-linear relationships and non-globular clusters, the automated definition of the number of clusters, and the identification of cluster outliers, i.e. genes that are not assigned to any cluster. As a result, clusters are more internally homogeneous and more distinct from each other, and provide better partitioning of biological functions. The clustering algorithm can be easily extended to applications other than gene expression analysis.
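
The two steps described in this abstract (finding Cluster Supporting Objects, then spreading fuzzy membership through neighbors) can be illustrated with a rough sketch. This is not the GEDAS implementation: the function name `flame_sketch`, the parameter `k`, the inverse-distance density and neighbor weights, and the omission of the outlier group are all simplifying assumptions made here for illustration.

```python
import math


def flame_sketch(points, k=3, max_iter=100, tol=1e-6):
    """Simplified FLAME-style fuzzy clustering (illustrative assumptions only)."""
    n = len(points)
    dist = [[math.dist(p, q) for q in points] for p in points]

    # Step (i): neighborhood of each object = its k nearest neighbors.
    knn = [
        sorted((j for j in range(n) if j != i), key=lambda j: dist[i][j])[:k]
        for i in range(n)
    ]
    # Assumed density measure: inverse of the mean distance to the neighbors.
    density = [k / (sum(dist[i][j] for j in knn[i]) + 1e-12) for i in range(n)]
    # Cluster Supporting Objects: denser than every one of their neighbors.
    cso = [i for i in range(n) if all(density[i] > density[j] for j in knn[i])]
    if not cso:
        raise ValueError("no Cluster Supporting Object found; try another k")
    c = len(cso)

    # Step (ii): each CSO is fixed to its own cluster, the rest start uniform.
    member = [[1.0 / c] * c for _ in range(n)]
    for cluster, i in enumerate(cso):
        member[i] = [1.0 if m == cluster else 0.0 for m in range(c)]

    # Assumed neighbor weights: inverse distance, normalised to sum to 1.
    weights = []
    for i in range(n):
        raw = [1.0 / (dist[i][j] + 1e-12) for j in knn[i]]
        total = sum(raw)
        weights.append([w / total for w in raw])

    fixed = set(cso)
    for _ in range(max_iter):
        new = [row[:] for row in member]
        change = 0.0
        for i in range(n):
            if i in fixed:
                continue
            # Membership is approximated by a weighted mean of the neighbors'.
            new[i] = [
                sum(w * member[j][m] for w, j in zip(weights[i], knn[i]))
                for m in range(c)
            ]
            change = max(change, max(abs(a - b) for a, b in zip(new[i], member[i])))
        member = new
        if change < tol:
            break

    # Hard labels are taken from the largest membership value.
    labels = [max(range(c), key=lambda m: member[i][m]) for i in range(n)]
    return cso, member, labels


if __name__ == "__main__":
    pts = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5),
           (10, 10), (10, 11), (11, 10), (11, 11), (10.5, 10.5)]
    cso, member, labels = flame_sketch(pts, k=3)
    print("CSO indices:", cso)
    print("labels:", labels)
```

Because the number of clusters equals the number of Cluster Supporting Objects, it is not supplied by the caller, which matches the "automated definition of the number of clusters" mentioned in the abstract.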

    Techniques for clustering gene expression data

    Many clustering techniques have been proposed for the analysis of gene expression data obtained from microarray experiments. However, the choice of a suitable method or methods for a given experimental dataset is not straightforward. Common approaches do not translate well and fail to take account of the data profile. This review paper surveys state-of-the-art applications that recognise these limitations and implement procedures to overcome them. It provides a framework for the evaluation of clustering in gene expression analyses. The nature of microarray data is discussed briefly. Selected examples are presented for the clustering methods considered.

    Using emergent clustering methods to analyse short time series gene expression data from childhood leukemia treated with glucocorticoids

    Acute lymphoblastic leukemia (ALL) causes the highest number of deaths from cancer in children aged between one and fourteen. The most common treatment for children with ALL is chemotherapy, a cancer treatment that uses drugs to kill cancer cells or stop cell division. The drug and dosage combinations may vary for each child. Unfortunately, chemotherapy treatments may cause serious side effects. Glucocorticoids (GCs) have been used as therapeutic agents for children with ALL for more than 50 years. Common and widely used drugs in this class include prednisolone and dexamethasone. Childhood leukemia now has a survival rate of 80% (Pui, Robison, & Look, 2008). The key clinical question is identifying those children who will not respond well to established therapy strategies. GCs regulate diverse biological processes, for example, metabolism, development, differentiation, cell survival and immunity. GCs induce apoptosis and G1 cell cycle arrest in lymphoid cells. However, not much is known about the molecular mechanism of GC sensitivity and resistance or about GC-induced apoptotic signal transduction pathways, and there are many controversial hypotheses about both the genes regulated by GCs and the potential molecular mechanism of GC-induced apoptosis. Therefore, understanding the mechanism of this drug should lead to better prognostic factors (treatment response), more targeted therapies and prevention of side effects. GC-induced apoptosis has been studied using microarray technology in vivo and in vitro on samples consisting of GC-treated ALL cell lines, mouse thymocytes and/or ALL patients. However, time series datasets of GC-treated childhood ALL are currently extremely limited. DNA microarrays are essential tools for analysing the expression of many genes simultaneously. Gene expression data show the level of activity of several genes under experimental conditions. Genes with similar expression patterns could belong to the same pathway or have similar function.
DNA microarray data analysis has been carried out using statistical analysis as well as machine learning and data mining approaches. There are many microarray analysis tools; this study aims to combine emergent clustering methods to gain meaningful biological insights into the mechanisms underlying GC-induced apoptosis. In this study, microarray data from childhood ALL samples (B-lineage and T-lineage) treated with prednisolone, a glucocorticoid (Schmidt et al., 2006), and collected at 6 and 24 hours after treatment are analysed using four methods: self-organizing maps (SOMs), emergent self-organizing maps (ESOM) (Ultsch & Morchen, 2005), the Short Time series Expression Miner (STEM) (Ernst & Bar-Joseph, 2006) and Fuzzy clustering by Local Approximation of MEmbership (FLAME) (Fu & Medico, 2007). The results revealed intrinsic biological patterns underlying the GC time series data: there are at least five different gene activities happening during the three time points; GC-induced apoptotic genes were identified; and genes active at both time points or only at 6 hours or 24 hours were determined. Also, interesting gene clusters with membership in already known pathways were found, thereby providing promising candidate genes for further inferring GC-induced apoptotic gene regulatory networks.

    Clustering Algorithms: Their Application to Gene Expression Data

    Gene expression data hide vital information required to understand the biological processes that take place in a particular organism in relation to its environment. Deciphering the hidden patterns in gene expression data offers a great opportunity to strengthen the understanding of functional genomics. The complexity of biological networks and the number of genes involved increase the challenges of comprehending and interpreting the resulting mass of data, which consists of millions of measurements; these data also exhibit vagueness, imprecision, and noise. Therefore, the use of clustering techniques is a first step toward addressing these challenges, and is essential in the data mining process to reveal natural structures and identify interesting patterns in the underlying data. The clustering of gene expression data has proven useful in revealing the natural structure inherent in gene expression data, understanding gene functions, cellular processes, and subtypes of cells, mining useful information from noisy data, and understanding gene regulation. Another benefit of clustering gene expression data is the identification of homology, which is very important in vaccine design. This review examines the various clustering algorithms applicable to gene expression data in order to discover and provide useful knowledge of the appropriate clustering technique that will guarantee stability and a high degree of accuracy in its analysis procedure.

    Visual analytics for relationships in scientific data

    Domain scientists hope to address grand scientific challenges by exploring the abundance of data generated and made available through modern high-throughput techniques. Typical scientific investigations can make use of novel visualization tools that enable dynamic formulation and fine-tuning of hypotheses to aid the process of evaluating sensitivity of key parameters. These general tools should be applicable to many disciplines: allowing biologists to develop an intuitive understanding of the structure of coexpression networks and discover genes that reside in critical positions of biological pathways, intelligence analysts to decompose social networks, and climate scientists to model and extrapolate future climate conditions. By using a graph as a universal data representation of correlation, our novel visualization tool employs several techniques that, when used in an integrated manner, provide innovative analytical capabilities. Our tool integrates techniques such as graph layout, qualitative subgraph extraction through a novel 2D user interface, quantitative subgraph extraction using graph-theoretic algorithms or by querying an optimized B-tree, dynamic level-of-detail graph abstraction, and template-based fuzzy classification using neural networks. We demonstrate our system using real-world workflows from several large-scale studies. Parallel coordinates has proven to be a scalable visualization and navigation framework for multivariate data. However, when data with thousands of variables are at hand, we do not have a comprehensive solution to select the right set of variables and order them to uncover important or potentially insightful patterns. We present algorithms to rank axes based upon the importance of bivariate relationships among the variables and showcase the efficacy of the proposed system by demonstrating autonomous detection of patterns in a modern large-scale dataset of time-varying climate simulation.

    A comprehensive evaluation of module detection methods for gene expression data

    A critical step in the analysis of large genome-wide gene expression datasets is the use of module detection methods to group genes into co-expression modules. Because of limitations of classical clustering methods, numerous alternative module detection methods have been proposed, which improve upon clustering by handling co-expression in only a subset of samples, modelling the regulatory network, and/or allowing overlap between modules. In this study we use known regulatory networks to perform a comprehensive and robust evaluation of these different methods. Overall, decomposition methods outperform all other strategies, while we do not find a clear advantage of biclustering and network inference-based approaches on large gene expression datasets. Using our evaluation workflow, we also investigate several practical aspects of module detection, such as parameter estimation and the use of alternative similarity measures, and conclude with recommendations for the further development of these methods.

    A Survey of Feature Selection Strategies for DNA Microarray Classification

    Classification tasks in bioinformatics, which use DNA microarray technology to predict or diagnose disease in patients at an early stage, are difficult and challenging. The crucial characteristics of DNA microarray data are a large number of features and small sample sizes, which means classification confronts a "dimensional curse": the computational cost is high and biomarkers are difficult to discover. Feature selection algorithms can reduce the dimensionality by finding the significant features without affecting the performance of classification tasks. Feature selection helps decrease computational time by removing irrelevant and redundant features from the data. The study aims to briefly survey popular feature selection methods for classifying DNA microarray data, such as filter, wrapper, embedded, and hybrid approaches. Furthermore, this study describes the steps of the feature selection process used to accomplish classification tasks and their relationships to other components such as datasets, cross-validation, and classifier algorithms. In the case study, we applied four different feature selection methods to two DNA microarray datasets to evaluate and discuss their performance, namely classification accuracy, stability, and the subset size of selected features. Keywords: Brief survey; DNA microarray data; feature selection; filter methods; wrapper methods; embedded methods; and hybrid methods. DOI: 10.7176/CEIS/14-2-01 Publication date: March 31st 202

    An Experimental Study on Microarray Expression Data from Plants under Salt Stress by using Clustering Methods

    Current genome-wide advances in gene chip technology give "omics" research (genomics, proteomics and transcriptomics) an opportunity to analyze the expression levels of thousands of genes across multiple experiments. In this regard, many machine learning approaches have been proposed to deal with this deluge of information. Clustering methods are among these approaches. They group data (gene profiles) into homogeneous clusters using distance measurements. Various clustering techniques are applied, but there is no consensus on the best one. In this context, seven clustering algorithms were compared and tested on the gene expression datasets of three model plants under salt stress. These techniques are evaluated by internal and relative validity measures. It appears that the AGNES algorithm is the best one on internal validity measures for the three plant datasets. K-Means also shows a favorable trend on relative validity measures for these datasets.
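
To make the idea of an internal validity measure concrete, here is a minimal sketch of the mean silhouette width, one widely used internal measure. The abstract does not say which measures the study used, so the choice of silhouette, the function name `silhouette_score` and the toy data are assumptions for illustration only.

```python
import math


def silhouette_score(points, labels):
    """Mean silhouette width over all points; needs at least two clusters."""
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    scores = []
    for i, lab in enumerate(labels):
        own = [j for j in clusters[lab] if j != i]
        if not own:
            scores.append(0.0)  # common convention for singleton clusters
            continue
        # a: mean distance to the other members of the point's own cluster.
        a = sum(math.dist(points[i], points[j]) for j in own) / len(own)
        # b: smallest mean distance to the members of any other cluster.
        b = min(
            sum(math.dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)


if __name__ == "__main__":
    profiles = [(0, 0), (0, 1), (10, 10), (10, 11)]
    print(silhouette_score(profiles, [0, 0, 1, 1]))  # compact, separated clusters
    print(silhouette_score(profiles, [0, 1, 0, 1]))  # clusters that mix the groups
```

A score near 1 indicates compact, well-separated clusters, while a negative score suggests points sit closer to another cluster than to their own. Comparing such scores across algorithms on the same dataset is one way a ranking like the one reported here could be produced.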

    Importance of Similarity Measure in Gene Expression Data-A Survey

    The use of data mining techniques in research fields of computational biology, including gene finding, genome assembly and prediction of gene expression, is very promising because of the large amount of data involved in these fields. These techniques aim to disclose unknown knowledge and relationships. Among the available data sources, DNA microarray is a technology that enables researchers to investigate and address issues that are otherwise not traceable. DNA microarray experiments generate thousands of gene expression measurements and provide a simple way of collecting huge amounts of data in a short time. Microarray data analysis allows the identification of the most relevant genes for a target disease and of groups of genes with similar patterns under different experimental conditions. Clustering methods are widely used on gene expression data to categorize genes with similar expression profiles. The goal of clustering in microarray technology is to group genes or experiments into clusters according to a similarity measure. In this paper we introduce the concept of microarray technology and clustering on gene expression data, and survey similarity measures. Finally, we conclude that the similarity measure plays an important role when clustering, one of the data mining techniques, is applied to gene expression data.
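
A small sketch can show why the choice of similarity measure matters. The two measures below (Euclidean distance and a Pearson correlation distance) and the toy expression profiles are assumptions chosen for illustration; the abstract does not list the measures the survey covers.

```python
import math


def euclidean_distance(x, y):
    """Distance that is sensitive to the absolute expression level."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def pearson_distance(x, y):
    """1 - Pearson correlation: small when two profiles share the same shape."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return 1.0 - cov / (norm_x * norm_y)


if __name__ == "__main__":
    gene_a = [1, 2, 3, 4]      # rising profile, low expression
    gene_b = [10, 20, 30, 40]  # same rising shape, ten times the level
    gene_c = [4, 3, 2, 1]      # falling profile, same level as gene_a

    for name, other in (("gene_b", gene_b), ("gene_c", gene_c)):
        print(name,
              "euclidean:", euclidean_distance(gene_a, other),
              "pearson:", pearson_distance(gene_a, other))
```

With Euclidean distance `gene_a` is closer to `gene_c`, whose level is similar but whose trend is opposite. With the correlation-based distance it is closer to `gene_b`, which follows the same trend at a different level. A clustering algorithm would therefore group these genes differently depending on the measure.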