A unified framework for finding differentially expressed genes from microarray experiments
Background: This paper presents a unified framework for finding differentially expressed genes (DEGs) from microarray data. The proposed framework has three interrelated modules: (i) gene ranking, (ii) significance analysis of genes and (iii) validation. The first module ranks the genes using two gene selection algorithms, namely (a) two-way clustering and (b) combined adaptive ranking. The second module converts the gene ranks into p-values using an R-test and fuses the two sets of p-values using Fisher's omnibus criterion. The DEGs are then selected using false discovery rate (FDR) analysis. The third module performs a three-fold validation of the obtained DEGs. The robustness of the proposed unified framework in gene selection is first illustrated using FDR analysis. In addition, clustering-based validation of the DEGs is performed by applying an adaptive subspace-based clustering algorithm to the training and test datasets. Finally, a projection-based visualization is performed to validate the DEGs obtained using the unified framework.
Results: The performance of the unified framework is compared with well-known ranking algorithms such as t-statistics, Significance Analysis of Microarrays (SAM), Adaptive Ranking, Combined Adaptive Ranking and Two-way Clustering. The performance curves obtained using 50 simulated microarray datasets, each following two different distributions, indicate the superiority of the unified framework over the other reported algorithms. Further analyses on 3 real cancer datasets and 3 Parkinson's datasets show a similar improvement in performance. First, a three-fold validation process is provided for the two-sample cancer datasets. In addition, the analysis of the 3 sets of Parkinson's data demonstrates the scalability of the proposed method to multi-sample microarray datasets.
Conclusion: This paper presents a unified framework for the robust selection of genes from two-sample as well as multi-sample microarray experiments. The two different ranking methods used in module 1 bring diversity to the selection of genes. The conversion of ranks to p-values, the fusion of p-values and the FDR analysis aid in identifying significant genes, which cannot be judged from gene ranking alone. The three-fold validation, namely robustness of gene selection under FDR analysis, clustering, and visualization, demonstrates the relevance of the DEGs. Empirical analyses on 50 artificial datasets and 6 real microarray datasets illustrate the efficacy of the proposed approach. The analyses on 3 cancer datasets demonstrate the utility of the approach on microarray datasets with two classes of samples. Its scalability to multi-sample microarray datasets (more than two sample classes) is addressed using three sets of Parkinson's data. Empirical analyses show that the unified framework outperformed other gene selection methods in selecting differentially expressed genes from microarray data.
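The second module's fusion and selection steps can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes the two p-values per gene are independent, uses the standard closed form of the chi-square tail for Fisher's statistic, and assumes the Benjamini-Hochberg step-up rule for the FDR analysis, which the abstract does not specify. The function names and the example p-values are hypothetical.

```python
import math

def fisher_combine(pvalues):
    """Fuse k independent p-values with Fisher's omnibus criterion.

    The statistic X = -2 * sum(ln p_i) follows a chi-square distribution
    with 2k degrees of freedom under the null; for even degrees of freedom
    the upper tail has the closed form used below.
    """
    k = len(pvalues)
    # half_stat is X / 2; p-values are clamped to avoid log(0).
    half_stat = -sum(math.log(max(p, 1e-300)) for p in pvalues)
    tail_terms = sum(half_stat ** j / math.factorial(j) for j in range(k))
    return math.exp(-half_stat) * tail_terms

def benjamini_hochberg(pvalues, q=0.05):
    """Return sorted indices of hypotheses rejected at FDR level q."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    cutoff = 0
    for rank, idx in enumerate(order, start=1):
        # Step-up rule: keep the largest rank whose p-value is under its threshold.
        if pvalues[idx] <= q * rank / m:
            cutoff = rank
    return sorted(order[:cutoff])

# Hypothetical p-values for five genes from the two ranking methods.
p_clustering = [0.001, 0.60, 0.020, 0.30, 0.90]
p_ranking = [0.004, 0.70, 0.010, 0.05, 0.40]

fused = [fisher_combine([a, b]) for a, b in zip(p_clustering, p_ranking)]
selected = benjamini_hochberg(fused, q=0.05)
print(selected)
```

Fusing before thresholding lets a gene that is only moderately significant under each ranking method still be selected when both methods agree.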
Diverse correlation structures in gene expression data and their utility in improving statistical inference
It is well known that correlations in microarray data represent a serious
nuisance deteriorating the performance of gene selection procedures. This paper
is intended to demonstrate that the correlation structure of microarray data
provides a rich source of useful information. We discuss distinct correlation
substructures revealed in microarray gene expression data by an appropriate
ordering of genes. These substructures include stochastic proportionality of
expression signals in a large percentage of all gene pairs, negative
correlations hidden in ordered gene triples, and a long sequence of weakly
dependent random variables associated with ordered pairs of genes. The reported
striking regularities are of general biological interest and they also have
far-reaching implications for theory and practice of statistical methods of
microarray data analysis. We illustrate the latter point with a method for
testing differential expression of nonoverlapping gene pairs. While designed
for testing a different null hypothesis, this method provides an order of
magnitude more accurate control of type 1 error rate compared to conventional
methods of individual gene expression profiling. In addition, this method is
robust to the technical noise. Quantitative inference of the correlation
structure has the potential to extend the analysis of microarray data far
beyond currently practiced methods. Comment: Published at http://dx.doi.org/10.1214/07-AOAS120 in the Annals of
Applied Statistics (http://www.imstat.org/aoas/) by the Institute of
Mathematical Statistics (http://www.imstat.org
Sequential stopping for high-throughput experiments
In high-throughput experiments, the sample size is typically chosen informally. Most formal sample-size calculations depend critically on prior knowledge. We propose a sequential strategy that, by updating knowledge when new data are available, depends less critically on prior assumptions. Experiments are stopped or continued based on the potential benefits of obtaining additional data. The underlying decision-theoretic framework guarantees that the design proceeds in a coherent fashion. We propose intuitively appealing, easy-to-implement utility functions. As in most sequential design problems, an exact solution is computationally prohibitive. We propose a simulation-based approximation that uses decision boundaries. We apply the method to RNA-seq, microarray, and reverse-phase protein array studies and show its potential advantages. The approach has been added to the Bioconductor package gaga.
Understanding pathways
The challenge with today's microarray experiments is to infer biological conclusions
from them. Two crucial difficulties must be surmounted: (1) the lack of a suitable
biological repository that can be easily integrated into computational algorithms, and
(2) the inability of contemporary microarray analysis algorithms to draw consistent
biological results from diverse datasets of the same disease.
To deal with the first difficulty, we believe a core database that unifies available
biological repositories is important. Towards this end, we create a unified biological
database from three popular biological repositories (KEGG, Ingenuity and Wikipathways).
This database provides computer scientists the flexibility of easily integrating
biological information using simple API calls or SQL queries.
To deal with the second difficulty of deriving consistent biological results from the
experiments, we first conceptualize the notion of “subnetworks”, which refers to a
connected portion in a biological pathway. Then we propose a method that identifies
subnetworks that are consistently expressed by patients of the same disease phenotype.
We test our technique on independent datasets of several diseases, including ALL,
DMD and lung cancer. For each of these diseases, we obtain two independent microarray
datasets produced by distinct labs on distinct platforms. In each case, our technique
consistently produces overlapping lists of significant nontrivial subnetworks from two
independent sets of microarray data. The gene-level agreement of these significant
subnetworks is between 66.67% and 91.87%. In contrast, when the same pairs of
microarray datasets were analysed using GSEA and t-test, this percentage fell between
37% and 55.75% (GSEA) and between 2.55% and 19.23% (t-test). Furthermore, the genes
selected using GSEA and t-test do not form subnetworks of substantial size. Thus
it is more probable that the subnetworks selected by our technique can provide the
researcher with more descriptive information on the portions of the pathway that
are actually associated with the disease.
Keywords: pathway analysis, microarra
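The abstract does not say how gene-level agreement is computed. As a rough illustration only, the sketch below assumes it is the share of genes common to both datasets' significant lists, taken over the union of the two lists. The function name and gene labels are hypothetical.

```python
def gene_agreement(genes_a, genes_b):
    """Percentage of genes shared by two lists, relative to their union.

    Assumed definition (overlap over union); the source abstract does not
    define the metric it reports.
    """
    a, b = set(genes_a), set(genes_b)
    union = a | b
    if not union:
        return 0.0  # two empty lists: report no agreement rather than divide by zero
    return 100.0 * len(a & b) / len(union)

# Hypothetical significant-gene lists from two independent datasets.
lab_one = ["G1", "G2", "G3"]
lab_two = ["G2", "G3", "G4"]
print(gene_agreement(lab_one, lab_two))
```

A different denominator, such as the size of the smaller list, would give higher percentages for the same overlap, so the definition should be fixed before comparing methods.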
An Automated Bayesian Framework for Integrative Gene Expression Analysis and Predictive Medicine
Motivation: This work constructs a closed-loop Bayesian Network framework for predictive medicine via integrative analysis of publicly available gene expression findings pertaining to various diseases. Results: An automated pipeline was successfully constructed. Integrative models were built from gene expression data obtained from GEO experiments relating to four different diseases using Bayesian statistical methods. Many of these models demonstrated a high level of accuracy and predictive ability. The approach described in this paper can be applied to any complex disorder and can include any number and type of genome-scale studies.
A statistical framework for the analysis of microarray probe-level data
In microarray technology, a number of critical steps are required to convert
the raw measurements into the data relied upon by biologists and clinicians.
These data manipulations, referred to as preprocessing, influence the quality
of the ultimate measurements and studies that rely upon them. Standard
operating procedure for microarray researchers is to use preprocessed data as
the starting point for the statistical analyses that produce reported results.
This has prevented many researchers from carefully considering their choice of
preprocessing methodology. Furthermore, the fact that the preprocessing step
affects the stochastic properties of the final statistical summaries is often
ignored. In this paper we propose a statistical framework that permits the
integration of preprocessing into the standard statistical analysis flow of
microarray data. This general framework is relevant in many microarray
platforms and motivates targeted analysis methods for specific applications. We
demonstrate its usefulness by applying the idea in three different applications
of the technology. Comment: Published at http://dx.doi.org/10.1214/07-AOAS116 in the Annals of
Applied Statistics (http://www.imstat.org/aoas/) by the Institute of
Mathematical Statistics (http://www.imstat.org
Next station in microarray data analysis: GEPAS
The Gene Expression Profile Analysis Suite (GEPAS) has been running for more than four years. During this time it has evolved to keep pace with new interests and trends in the still-changing world of microarray data analysis. GEPAS has been designed to provide an intuitive yet powerful web-based interface that offers diverse analysis options, from the early step of preprocessing (normalization of Affymetrix and two-colour microarray experiments and other preprocessing options) to the final step of functional annotation of the experiment (using Gene Ontology, pathways, PubMed abstracts etc.), and includes different possibilities for clustering, gene selection, class prediction and array-comparative genomic hybridization management. GEPAS is used extensively by researchers in many countries, and its records indicate an average usage rate of 400 experiments per day. The web-based pipeline for microarray gene expression data, GEPAS, is available at