246 research outputs found

    Applying Gene Ontology to Microarray Gene Expression Data Analysis

    Conference type: International; conference dates: 2010-07-01 to 2010-07-03; call for papers: Yes; conference location: Taipei, Taiwan

    Classification approaches for microarray gene expression data analysis

    Microarray technology is among the vital technological advancements in bioinformatics. Microarray data are typically noisy and high-dimensional, so finely tuned data are a requirement for microarray data analysis. Classification of biological samples is the most commonly performed analysis on microarray data. This study focuses on determining the confidence level used for classifying a sample of an unknown gene based on microarray data. A support vector machine (SVM) classifier was applied, and its results were compared with those of other classifiers, including K-nearest neighbor (KNN) and neural network (NN). Four microarray datasets were used in the research: leukemia, prostate, colon, and breast. The study also analyzed two SVM kernels, radial and linear. The analysis was conducted by varying the proportions in which each dataset was split into training and test sets, in order to make sure that the best positive sets of data provided the best results. The 10-fold cross-validation method (LOOCV) and L1 and L2 regularization were used to address over-fitting and to perform feature selection in classification. A ROC curve and a confusion matrix were used for performance assessment. The KNN and NN classifiers were trained on the same datasets, and the results were compared. The results showed that the SVM outperformed the other classifiers in accuracy. For each dataset, the SVM with the linear kernel was the best-performing method, since it yielded better results than the other methods. On the colon data, the highest accuracy was 83% with the SVM classifier, compared with 77% for NN and 72% for KNN. On the leukemia data, the highest accuracy was 97% with SVM, compared with 85% for NN and 91% for KNN. On the breast data, the highest accuracy was 73% with SVM-L2, compared with 56% for NN and 47% for KNN. Finally, on the prostate data, the highest accuracy was 80% with SVM-L1, compared with 75% for NN and 66% for KNN. The SVM showed the highest accuracy as well as the highest area under the curve compared with KNN and NN in the three different tests.
    Master of Science (MSc) in Computational Science
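The evaluation protocol named in this abstract (k-fold cross-validation, a confusion matrix, and KNN as one of the compared classifiers) can be illustrated with a minimal sketch. It is not the thesis's code: the function names, the toy two-feature data, and the choice of k = 3 neighbours are all assumptions made for illustration, and only the Python standard library is used.

```python
import math
from collections import Counter
import random


def knn_predict(train_x, train_y, sample, k=3):
    """Predict a label by majority vote among the k nearest training samples."""
    nearest = sorted(
        (math.dist(x, sample), label) for x, label in zip(train_x, train_y)
    )[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


def k_fold_confusion(xs, ys, folds=10, k=3, seed=0):
    """Run k-fold cross-validation; return a Counter keyed by (true, predicted)."""
    order = list(range(len(xs)))
    random.Random(seed).shuffle(order)
    matrix = Counter()
    for fold in range(folds):
        # Every `folds`-th shuffled index forms the held-out test fold.
        test_idx = order[fold::folds]
        held_out = set(test_idx)
        train_idx = [i for i in order if i not in held_out]
        train_x = [xs[i] for i in train_idx]
        train_y = [ys[i] for i in train_idx]
        for i in test_idx:
            matrix[(ys[i], knn_predict(train_x, train_y, xs[i], k))] += 1
    return matrix


def accuracy(matrix):
    """Fraction of samples on the diagonal of the confusion matrix."""
    total = sum(matrix.values())
    correct = sum(n for (true, pred), n in matrix.items() if true == pred)
    return correct / total


# Hypothetical toy data: two well-separated groups of "expression profiles".
xs = [[0.1 * i, 0.2 * i] for i in range(20)] + [
    [10 + 0.1 * i, 10 + 0.2 * i] for i in range(20)
]
ys = ["normal"] * 20 + ["tumor"] * 20

matrix = k_fold_confusion(xs, ys, folds=10, k=3)
print(dict(matrix), accuracy(matrix))
```

Each sample is predicted exactly once, by a model that never saw it during training, so the confusion matrix summarizes out-of-sample performance. Real microarray data would have thousands of features per sample, which is where the regularization and feature selection mentioned in the abstract come in.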

    Clustering analysis for gene expression data: a methodological review

    Clustering is one of the most useful tools for microarray gene expression data analysis. Although there have been many reviews and surveys in the literature, many good and effective clustering ideas have not yet been collected in a systematic way. In this paper, we review five clustering families representing five clustering concepts rather than five algorithms. We also review some clustering validation methods and collect a list of benchmark gene expression datasets.

    A novel computational framework for fast, distributed computing and knowledge integration for microarray gene expression data analysis

    The healthcare burden and suffering due to life-threatening diseases such as cancer would be significantly reduced by the design and refinement of computational interpretation of micro-molecular data collected by bioinformaticians. Rapid technological advancements in the field of microarray analysis, an important component in the design of in-silico molecular medicine methods, have generated enormous amounts of such data, a trend that has been increasing exponentially over the last few years. However, the analysis and handling of these data have become one of the major bottlenecks in the utilization of the technology. The rate of collection of these data has far surpassed our ability to analyze them for novel, non-trivial, and important knowledge. High-performance computing platforms, and algorithms that utilize their embedded computing capacity, have emerged as a leading technology that can handle such data-intensive knowledge discovery applications. In this dissertation, we present a novel framework to achieve fast, robust, and accurate (biologically significant) multi-class classification of gene expression data using distributed knowledge discovery and integration computational routines, specifically for cancer genomics applications. The research presents a unique computational paradigm for the rapid, accurate, and efficient selection of relevant marker genes, while providing parametric controls to ensure flexibility of its application.
The proposed paradigm consists of the following key computational steps: (a) preprocess and normalize the gene expression data; (b) discretize the data for the knowledge mining application; (c) partition the data using two proposed methods: partitioning with overlapped windows and adaptive selection; (d) perform knowledge discovery on the partitioned data spaces for association rule discovery; (e) integrate association rules from the partitioned data and knowledge spaces on distributed processor nodes using a novel knowledge integration algorithm; and (f) perform post-analysis and functional elucidation of the discovered gene rule sets. The framework is implemented in a shared-memory multiprocessor supercomputing environment, and several experimental results are presented to evaluate the algorithms. We conclude with a functional interpretation of the computational discovery routines for enhanced biological and physiological discovery from cancer genomics datasets, while suggesting some directions for future research.
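Step (c) mentions partitioning with overlapped windows. The abstract does not specify the window scheme, so the following is only a generic sketch of the idea: the function name `overlapped_windows` and its `window` and `overlap` parameters are hypothetical, not taken from the dissertation.

```python
def overlapped_windows(items, window, overlap):
    """Split items into consecutive windows that share `overlap` items
    with the preceding window. The last window may be shorter."""
    if not 0 <= overlap < window:
        raise ValueError("overlap must be non-negative and smaller than window")
    step = window - overlap
    parts = []
    start = 0
    while start < len(items):
        parts.append(items[start:start + window])
        # Stop once the current window already reaches the end of the list.
        if start + window >= len(items):
            break
        start += step
    return parts


# Hypothetical gene identifiers, partitioned for independent rule mining.
genes = [f"gene_{i}" for i in range(10)]
for part in overlapped_windows(genes, window=4, overlap=2):
    print(part)
```

With ten genes, a window of four, and an overlap of two, this yields four partitions. Because neighbouring partitions share genes, a rule involving genes near a boundary can still be found within a single partition, which is one plausible motivation for overlapping before distributing the partitions across processor nodes.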

    Elephant Search with Deep Learning for Microarray Data Analysis

    Even though there is a plethora of research in microarray gene expression data analysis, it remains challenging for researchers to analyze large and complex gene expression data effectively and efficiently. Feature (gene) selection is of paramount importance for understanding the differences in biological and non-biological variation between samples. To address this problem, a novel elephant search (ES) based optimization is proposed to select the best gene expressions from a large volume of microarray data. Further, a machine learning method is used to extract hidden patterns from such high-dimensional and complex microarray datasets in order to make meaningful predictions and accurate classifications. In particular, stochastic gradient descent based deep learning (DL) with a softmax activation function is applied to the reduced features (genes) to better classify different samples according to their gene expression levels. The experiments are carried out on nine popular cancer microarray gene selection datasets obtained from the UCI Machine Learning Repository. The empirical results obtained by the proposed elephant search based deep learning (ESDL) approach are compared with those of the most recently published article to assess its suitability for future bioinformatics research.
    Comment: 12 pages, 5 Tables
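The classification stage described here combines stochastic gradient descent with a softmax output. As a rough illustration only, the sketch below trains a single softmax layer (multinomial logistic regression) with per-sample SGD. It is a stand-in for the paper's deep network, not a reproduction of it, and it omits the elephant search step entirely. All names, hyperparameters, and the toy data are assumptions.

```python
import math
import random


def softmax(scores):
    """Convert raw scores to probabilities; subtracting the max avoids overflow."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def train_softmax_sgd(xs, ys, n_classes, lr=0.1, epochs=200, seed=0):
    """Fit one softmax layer with per-sample stochastic gradient descent."""
    rng = random.Random(seed)
    n_features = len(xs[0])
    weights = [[0.0] * n_features for _ in range(n_classes)]
    biases = [0.0] * n_classes
    order = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(order)  # visit samples in a fresh random order each epoch
        for i in order:
            probs = softmax([
                sum(w * v for w, v in zip(weights[c], xs[i])) + biases[c]
                for c in range(n_classes)
            ])
            for c in range(n_classes):
                # Gradient of the cross-entropy loss w.r.t. the score of class c.
                grad = probs[c] - (1.0 if c == ys[i] else 0.0)
                for j in range(n_features):
                    weights[c][j] -= lr * grad * xs[i][j]
                biases[c] -= lr * grad
    return weights, biases


def predict(weights, biases, x):
    """Return the index of the class with the highest score."""
    scores = [sum(w * v for w, v in zip(ws, x)) + b for ws, b in zip(weights, biases)]
    return max(range(len(scores)), key=scores.__getitem__)


# Hypothetical toy data: three classes over two already-selected "genes".
xs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9], [-1.0, -1.0], [-0.9, -1.1]]
ys = [0, 0, 1, 1, 2, 2]

weights, biases = train_softmax_sgd(xs, ys, n_classes=3)
print([predict(weights, biases, x) for x in xs])
```

In a pipeline like the one the abstract describes, the inputs to such a classifier would be the reduced set of genes chosen by the feature selection step rather than the full expression matrix.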

    R/BHC: fast Bayesian hierarchical clustering for microarray data

    BACKGROUND: Although the use of clustering methods has rapidly become one of the standard computational approaches in the literature of microarray gene expression data analysis, little attention has been paid to uncertainty in the results obtained. RESULTS: We present an R/Bioconductor port of a fast novel algorithm for Bayesian agglomerative hierarchical clustering and demonstrate its use in clustering gene expression microarray data. The method performs bottom-up hierarchical clustering, using a Dirichlet Process (infinite mixture) to model uncertainty in the data and Bayesian model selection to decide at each step which clusters to merge. CONCLUSION: Biologically plausible results are presented from a well-studied data set: expression profiles of A. thaliana subjected to a variety of biotic and abiotic stresses. Our method avoids several limitations of traditional methods, for example having to decide how many clusters there should be and how to choose a principled distance metric.

    SMART: Unique splitting-while-merging framework for gene clustering

    Copyright © 2014 Fa et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
    Successful clustering algorithms are highly dependent on parameter settings. Clustering performance degrades significantly unless the parameters are properly set, and yet it is difficult to set these parameters a priori. To address this issue, we propose a unique splitting-while-merging clustering framework, named “splitting merging awareness tactics” (SMART), which does not require any a priori knowledge of either the number of clusters or even the possible range of this number. Unlike existing self-splitting algorithms, which over-cluster the dataset into a large number of clusters and then merge some similar clusters, our framework is able to split and merge clusters automatically during the process and produces the most reliable clustering results by intrinsically integrating many clustering techniques and tasks. The SMART framework is implemented with two distinct clustering paradigms in two algorithms: competitive learning and finite mixture model. Nevertheless, within the proposed SMART framework, many other algorithms can be derived for different clustering paradigms. The minimum message length algorithm is integrated into the framework as the clustering selection criterion. The usefulness of the SMART framework and its algorithms is tested on demonstration datasets and simulated gene expression datasets. Moreover, two real microarray gene expression datasets are studied using this approach. Across many performance metrics, all numerical results show that SMART is superior to the existing self-splitting algorithms and traditional algorithms it was compared with. Three main properties of the proposed SMART framework are summarized as follows: (1) it needs no parameters dependent on the respective dataset and no a priori knowledge about the datasets, (2) it is extendible to many different applications, and (3) it offers superior performance compared with counterpart algorithms.
    National Institute for Health Research

    Asterias: a parallelized web-based suite for the analysis of expression and aCGH data

    Asterias (http://www.asterias.info) is an integrated collection of freely accessible web tools for the analysis of gene expression and aCGH data. Most of the tools use parallel computing (via MPI). Most of our applications allow the user to obtain additional information for user-selected genes through clickable links in tables and/or figures. Our tools include: normalization of expression and aCGH data; conversion between different types of gene/clone and protein identifiers; filtering and imputation; finding differentially expressed genes related to patient class and survival data; searching for models of class prediction; using random forests to search for minimal models for class prediction or for large subsets of genes with predictive capacity; searching for molecular signatures and predictive genes with survival data; and detecting regions of genomic DNA gain or loss. The capability to send results between different applications, access to additional functional information, and parallelized computation make our suite unique and exploit features only available to web-based applications.
    Comment: web-based application; 3 figures