19,195 research outputs found
A cDNA Microarray Gene Expression Data Classifier for Clinical Diagnostics Based on Graph Theory
Despite great advances in discovering cancer molecular profiles, the proper application of microarray technology to routine clinical diagnostics is still a challenge. Current practices in the classification of microarrays' data show two main limitations: the reliability of the training data sets used to build the classifiers, and the classifiers' performances, especially when the sample to be classified does not belong to any of the available classes. In this case, state-of-the-art algorithms usually produce a high rate of false positives that, in real diagnostic applications, are unacceptable. To address this problem, this paper presents a new cDNA microarray data classification algorithm based on graph theory and is able to overcome most of the limitations of known classification methodologies. The classifier works by analyzing gene expression data organized in an innovative data structure based on graphs, where vertices correspond to genes and edges to gene expression relationships. To demonstrate the novelty of the proposed approach, the authors present an experimental performance comparison between the proposed classifier and several state-of-the-art classification algorithm
NBLDA: Negative Binomial Linear Discriminant Analysis for RNA-Seq Data
RNA-sequencing (RNA-Seq) has become a powerful technology to characterize
gene expression profiles because it is more accurate and comprehensive than
microarrays. Although statistical methods that have been developed for
microarray data can be applied to RNA-Seq data, they are not ideal due to the
discrete nature of RNA-Seq data. The Poisson distribution and negative binomial
distribution are commonly used to model count data. Recently, Witten (2011)
proposed a Poisson linear discriminant analysis for RNA-Seq data. The Poisson
assumption may not be as appropriate as negative binomial distribution when
biological replicates are available and in the presence of overdispersion
(i.e., when the variance is larger than the mean). However, it is more
complicated to model negative binomial variables because they involve a
dispersion parameter that needs to be estimated. In this paper, we propose a
negative binomial linear discriminant analysis for RNA-Seq data. By Bayes'
rule, we construct the classifier by fitting a negative binomial model, and
propose some plug-in rules to estimate the unknown parameters in the
classifier. The relationship between the negative binomial classifier and the
Poisson classifier is explored, with a numerical investigation of the impact of
dispersion on the discriminant score. Simulation results show the superiority
of our proposed method. We also analyze four real RNA-Seq data sets to
demonstrate the advantage of our method in real-world applications
A graph-based representation of Gene Expression profiles in DNA microarrays
This paper proposes a new and very flexible data model, called gene expression graph (GEG), for genes expression analysis and classification. Three features differentiate GEGs from other available microarray data representation structures: (i) the memory occupation of a GEG is independent of the number of samples used to built it; (ii) a GEG more clearly expresses relationships among expressed and non expressed genes in both healthy and diseased tissues experiments; (iii) GEGs allow to easily implement very efficient classifiers. The paper also presents a simple classifier for sample-based classification to show the flexibility and user-friendliness of the proposed data structur
Integrative Model-based clustering of microarray methylation and expression data
In many fields, researchers are interested in large and complex biological
processes. Two important examples are gene expression and DNA methylation in
genetics. One key problem is to identify aberrant patterns of these processes
and discover biologically distinct groups. In this article we develop a
model-based method for clustering such data. The basis of our method involves
the construction of a likelihood for any given partition of the subjects. We
introduce cluster specific latent indicators that, along with some standard
assumptions, impose a specific mixture distribution on each cluster. Estimation
is carried out using the EM algorithm. The methods extend naturally to multiple
data types of a similar nature, which leads to an integrated analysis over
multiple data platforms, resulting in higher discriminating power.Comment: Published in at http://dx.doi.org/10.1214/11-AOAS533 the Annals of
Applied Statistics (http://www.imstat.org/aoas/) by the Institute of
Mathematical Statistics (http://www.imstat.org
maigesPack: A Computational Environment for Microarray Data Analysis
Microarray technology is still an important way to assess gene expression in
molecular biology, mainly because it measures expression profiles for thousands
of genes simultaneously, what makes this technology a good option for some
studies focused on systems biology. One of its main problem is complexity of
experimental procedure, presenting several sources of variability, hindering
statistical modeling. So far, there is no standard protocol for generation and
evaluation of microarray data. To mitigate the analysis process this paper
presents an R package, named maigesPack, that helps with data organization.
Besides that, it makes data analysis process more robust, reliable and
reproducible. Also, maigesPack aggregates several data analysis procedures
reported in literature, for instance: cluster analysis, differential
expression, supervised classifiers, relevance networks and functional
classification of gene groups or gene networks
Correcting for selection bias via cross-validation in the classification of microarray data
There is increasing interest in the use of diagnostic rules based on
microarray data. These rules are formed by considering the expression levels of
thousands of genes in tissue samples taken on patients of known classification
with respect to a number of classes, representing, say, disease status or
treatment strategy. As the final versions of these rules are usually based on a
small subset of the available genes, there is a selection bias that has to be
corrected for in the estimation of the associated error rates. We consider the
problem using cross-validation. In particular, we present explicit formulae
that are useful in explaining the layers of validation that have to be
performed in order to avoid improperly cross-validated estimates.Comment: Published in at http://dx.doi.org/10.1214/193940307000000284 the IMS
Collections (http://www.imstat.org/publications/imscollections.htm) by the
Institute of Mathematical Statistics (http://www.imstat.org
Application of Volcano Plots in Analyses of mRNA Differential Expressions with Microarrays
Volcano plot displays unstandardized signal (e.g. log-fold-change) against
noise-adjusted/standardized signal (e.g. t-statistic or -log10(p-value) from
the t test). We review the basic and an interactive use of the volcano plot,
and its crucial role in understanding the regularized t-statistic. The joint
filtering gene selection criterion based on regularized statistics has a curved
discriminant line in the volcano plot, as compared to the two perpendicular
lines for the "double filtering" criterion. This review attempts to provide an
unifying framework for discussions on alternative measures of differential
expression, improved methods for estimating variance, and visual display of a
microarray analysis result. We also discuss the possibility to apply volcano
plots to other fields beyond microarray.Comment: 8 figure
Over-optimism in bioinformatics: an illustration
In statistical bioinformatics research, different optimization mechanisms potentially lead to "over-optimism" in published papers. The present empirical study illustrates these mechanisms through a concrete example from an active research field. The investigated sources of over-optimism include the optimization of the data sets, of the settings, of the competing methods and, most importantly, of the methodâs characteristics. We consider a "promising" new classification algorithm that turns out to yield disappointing results in terms of error rate, namely linear discriminant analysis incorporating prior knowledge on gene functional groups through an appropriate shrinkage of the within-group covariance matrix. We quantitatively demonstrate that this disappointing method can artificially seem superior to existing approaches if we "fish for significanceâ. We conclude that, if the improvement of a quantitative criterion such as the error rate is the main contribution of a paper, the superiority of new algorithms should be validated using "fresh" validation data sets
- âŠ