246 research outputs found
Applying Gene Ontology to Microarray Gene Expression Data Analysis
[[conferencetype]]國際[[conferencedate]]20100701~20100703[[iscallforpapers]]Y[[conferencelocation]]Taipei, Taiwa
Classification approaches for microarray gene expression data analysis
The technology of Microarray is among the vital technological advancements in bioinformatics. Usually, microarray data is characterized by noisiness as well as increased dimensionality. Therefore, data, that is finely tuned, is a requirement for conducting the microarray data analysis. Classification of biological samples represents the most performed analysis on microarray data. This study is focused on the determination of the confidence level used for the classification of a sample of an unknown gene based on microarray data. A support vector machine classifier (SVM) was applied, and the results compared with other classifiers including K-nearest neighbor (KNN) and neural network (NN). Four datasets of microarray data including leukemia data set, prostate dataset, colon dataset, and breast dataset were used in the research. Additionally, the study analyzed two different kernels of SVM. These were radial kernel and linear kernels. The analysis was conducted by varying percentages of dataset distribution coupled with training and test datasets in order to make sure that the best positive sets of data provided the best results. The 10-fold cross validation method (LOOCV) and the L1 L2 techniques of regularization were used to get solutions for the over-fitting issues as well as feature selection in classification. The ROC curve and a confusion matrix were applied in performance assessment. K-nearest neighbor and neural network classifiers were trained with similar sets of data and comparison of the results was done. The results showed that the SVM exceeded the performance and accuracy compared to other classifiers. For each set of data, support vector machine was the best functional method based on the linear kernel since it yielded better results than the other methods. The highest accuracy of colon data was 83% with SVM classifier, while the accuracy of NN with the same data was 77% and KNN was 72%. Leukemia data had the highest accuracy of 97% with SVM, 85% with NN, and 91% with KNN. For breast data, the highest accuracy was 73% with SVM-L2, while the accuracy was 56% with NN and 47% with KNN. Finally, the highest accuracy of prostate data was 80% with SVM-L1, while the accuracy was 75% with NN and 66% with KNN. It showed the highest accuracy as well as the area under curve compared to k-nearest neighbor and neural network in the three different tests.Master of Science (MSc) in Computational Science
Clustering analysis for gene expression data: a methodological review
Clustering is one of most useful tools for the microarray gene expression data analysis. Although there have been many reviews and surveys in the literature, many good and effective clustering ideas have not been collected in a systematic way for some reasons. In this paper, we review five clustering families representing five clustering concepts rather than five algorithms. We also review some clustering validations and collect a list of benchmark gene expression datasets
A novel computational framework for fast, distributed computing and knowledge integration for microarray gene expression data analysis
The healthcare burden and suffering due to life-threatening diseases such as cancer would be significantly reduced by the design and refinement of computational interpretation of micro-molecular data collected by bioinformaticians. Rapid technological advancements in the field of microarray analysis, an important component in the design of in-silico molecular medicine methods, have generated enormous amounts of such data, a trend that has been increasing exponentially over the last few years. However, the analysis and handling of these data has become one of the major bottlenecks in the utilization of the technology. The rate of collection of these data has far surpassed our ability to analyze the data for novel, non-trivial, and important knowledge. The high-performance computing platform, and algorithms that utilize its embedded computing capacity, has emerged as a leading technology that can handle such data-intensive knowledge discovery applications.
In this dissertation, we present a novel framework to achieve fast, robust, and accurate (biologically-significant) multi-class classification of gene expression data using distributed knowledge discovery and integration computational routines, specifically for cancer genomics applications. The research presents a unique computational paradigm for the rapid, accurate, and efficient selection of relevant marker genes, while providing parametric controls to ensure flexibility of its application.
The proposed paradigm consists of the following key computational steps: (a) preprocess, normalize the gene expression data; (b) discretize the data for knowledge mining application; (c) partition the data using two proposed methods: partitioning with overlapped windows and adaptive selection; (d) perform knowledge discovery on the partitioned data-spaces for association rule discovery; (e) integrate association rules from partitioned data and knowledge spaces on distributed processor nodes using a novel knowledge integration algorithm; and (f) post-analysis and functional elucidation of the discovered gene rule sets. The framework is implemented on a shared-memory multiprocessor supercomputing environment, and several experimental results are demonstrated to evaluate the algorithms. We conclude with a functional interpretation of the computational discovery routines for enhanced biological physiological discovery from cancer genomics datasets, while suggesting some directions for future research
Elephant Search with Deep Learning for Microarray Data Analysis
Even though there is a plethora of research in Microarray gene expression
data analysis, still, it poses challenges for researchers to effectively and
efficiently analyze the large yet complex expression of genes. The feature
(gene) selection method is of paramount importance for understanding the
differences in biological and non-biological variation between samples. In
order to address this problem, a novel elephant search (ES) based optimization
is proposed to select best gene expressions from the large volume of microarray
data. Further, a promising machine learning method is envisioned to leverage
such high dimensional and complex microarray dataset for extracting hidden
patterns inside to make a meaningful prediction and most accurate
classification. In particular, stochastic gradient descent based Deep learning
(DL) with softmax activation function is then used on the reduced features
(genes) for better classification of different samples according to their gene
expression levels. The experiments are carried out on nine most popular Cancer
microarray gene selection datasets, obtained from UCI machine learning
repository. The empirical results obtained by the proposed elephant search
based deep learning (ESDL) approach are compared with most recent published
article for its suitability in future Bioinformatics research.Comment: 12 pages, 5 Tabl
R/BHC: fast Bayesian hierarchical clustering for microarray data
BACKGROUND: Although the use of clustering methods has rapidly become one of the standard computational approaches in the literature of microarray gene expression data analysis, little attention has been paid to uncertainty in the results obtained. RESULTS: We present an R/Bioconductor port of a fast novel algorithm for Bayesian agglomerative hierarchical clustering and demonstrate its use in clustering gene expression microarray data. The method performs bottom-up hierarchical clustering, using a Dirichlet Process (infinite mixture) to model uncertainty in the data and Bayesian model selection to decide at each step which clusters to merge. CONCLUSION: Biologically plausible results are presented from a well studied data set: expression profiles of A. thaliana subjected to a variety of biotic and abiotic stresses. Our method avoids several limitations of traditional methods, for example how many clusters there should be and how to choose a principled distance metric
SMART: Unique splitting-while-merging framework for gene clustering
Copyright @ 2014 Fa et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted
use, distribution, and reproduction in any medium, provided the original author and source are credited.Successful clustering algorithms are highly dependent on parameter settings. The clustering performance degrades significantly unless parameters are properly set, and yet, it is difficult to set these parameters a priori. To address this issue, in this paper, we propose a unique splitting-while-merging clustering framework, named “splitting merging awareness tactics” (SMART), which does not require any a priori knowledge of either the number of clusters or even the possible range of this number. Unlike existing self-splitting algorithms, which over-cluster the dataset to a large number of clusters and then merge some similar clusters, our framework has the ability to split and merge clusters automatically during the process and produces the the most reliable clustering results, by intrinsically integrating many clustering techniques and tasks. The SMART framework is implemented with two distinct clustering paradigms in two algorithms: competitive learning and finite mixture model. Nevertheless, within the proposed SMART framework, many other algorithms can be derived for different clustering paradigms. The minimum message length algorithm is integrated into the framework as the clustering selection criterion. The usefulness of the SMART framework and its algorithms is tested in demonstration datasets and simulated gene expression datasets. Moreover, two real microarray gene expression datasets are studied using this approach. Based on the performance of many metrics, all numerical results show that SMART is superior to compared existing self-splitting algorithms and traditional algorithms. Three main properties of the proposed SMART framework are summarized as: (1) needing no parameters dependent on the respective dataset or a priori knowledge about the datasets, (2) extendible to many different applications, (3) offering superior performance compared with counterpart algorithms.National Institute for Health Researc
Asterias: a parallelized web-based suite for the analysis of expression and aCGH data
Asterias (\url{http://www.asterias.info}) is an integrated collection of
freely-accessible web tools for the analysis of gene expression and aCGH data.
Most of the tools use parallel computing (via MPI). Most of our applications
allow the user to obtain additional information for user-selected genes by
using clickable links in tables and/or figures. Our tools include:
normalization of expression and aCGH data; converting between different types
of gene/clone and protein identifiers; filtering and imputation; finding
differentially expressed genes related to patient class and survival data;
searching for models of class prediction; using random forests to search for
minimal models for class prediction or for large subsets of genes with
predictive capacity; searching for molecular signatures and predictive genes
with survival data; detecting regions of genomic DNA gain or loss. The
capability to send results between different applications, access to additional
functional information, and parallelized computation make our suite unique and
exploit features only available to web-based applications.Comment: web based application; 3 figure
- …