
    Making Informed Choices about Microarray Data Analysis

    This article describes the typical stages in the analysis of microarray data for non-specialist researchers in systems biology and medicine. Particular attention is paid to significant data analysis issues that are commonly encountered among practitioners, some of which need wider airing. The issues addressed include experimental design, quality assessment, normalization, and summarization of multiple-probe data. This article is based on the ISMB 2008 tutorial on microarray data analysis. An expanded version of the material in this article and the slides from the tutorial can be found at http://www.people.vcu.edu/~mreimers/OGMDA/index.html

    caCORRECT2: Improving the accuracy and reliability of microarray data in the presence of artifacts

    Background. In previous work, we reported the development of caCORRECT, a novel microarray quality control system built to identify and correct spatial artifacts commonly found on Affymetrix arrays. We have made recent improvements to caCORRECT, including the development of a model-based data-replacement strategy and integration with typical microarray workflows via caCORRECT's web portal and caBIG grid services. In this report, we demonstrate that caCORRECT improves the reproducibility and reliability of experimental results across several common Affymetrix microarray platforms. caCORRECT represents an advance over state-of-the-art quality control methods such as Harshlighting, and improves gene expression calculation techniques such as PLIER, RMA, and MAS5.0, because it incorporates spatial information into outlier detection as well as outlier information into probe normalization. The ability of caCORRECT to recover accurate gene expression from low-quality probe intensity data is assessed using a combination of real and synthetic artifacts with PCR follow-up confirmation and the affycomp spike-in data. The caCORRECT tool can be accessed at http://cacorrect.bme.gatech.edu.

    Results. We demonstrate that (1) caCORRECT's artifact-aware normalization avoids the undesirable global data warping that occurs when damaged chips are processed without caCORRECT; (2) when used upstream of RMA, PLIER, or MAS5.0, caCORRECT's data imputation generally improves the accuracy of microarray gene expression in the presence of artifacts more than Harshlighting or no quality control at all; and (3) biomarkers selected from artifactual microarray data that have undergone caCORRECT's quality control procedures are more likely to be reliable, as shown by both spike-in and PCR validation experiments. Finally, we present a case study of the use of caCORRECT to reliably identify biomarkers for renal cell carcinoma, yielding two diagnostic biomarkers with potential clinical utility, PRKAB1 and NNMT.

    Conclusions. caCORRECT is shown to improve the accuracy of gene expression and the reproducibility of experimental results in clinical application. This study suggests that caCORRECT will be useful for cleaning up possible artifacts in new as well as archived microarray data. (DOI: 10.1186/1471-2105-12-383)
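    To make the spatial-artifact idea concrete, here is a minimal sketch (not caCORRECT's actual algorithm or parameters) of flagging probes whose intensity deviates strongly from their physical neighborhood on the chip; the 5x5 window and 3-sigma cutoff are illustrative assumptions.

```python
# A minimal sketch of spatial-artifact flagging in the spirit of caCORRECT:
# probes whose log-intensity deviates strongly from their local neighborhood
# on the chip surface are marked as outliers. Window size and cutoff are
# illustrative choices, not caCORRECT's actual parameters.
import numpy as np
from scipy.ndimage import median_filter

def flag_spatial_artifacts(chip: np.ndarray, window: int = 5, n_sigma: float = 3.0) -> np.ndarray:
    """Return a boolean mask of probes flagged as spatial outliers.

    chip -- 2D array of log2 probe intensities laid out by physical position.
    """
    smoothed = median_filter(chip, size=window)   # local background estimate
    residual = chip - smoothed                    # deviation from neighborhood
    # Robust scale estimate via the median absolute deviation (MAD).
    mad = np.median(np.abs(residual - np.median(residual)))
    sigma = 1.4826 * mad                          # MAD -> SD under Gaussian noise
    return np.abs(residual) > n_sigma * sigma

# Flagged probes could then be down-weighted or imputed (e.g., from replicate
# chips) before summarization with RMA, PLIER, or MAS5.0.
```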

    PathEx: a novel multi factors based datasets selector web tool

    Background: Microarray experiments have become very popular in life science research. However, if such experiments are only considered independently, the possibilities for analysis and interpretation of many life science phenomena are reduced. The accumulation of publicly available data provides biomedical researchers with a valuable opportunity to either discover new phenomena or improve the interpretation and validation of phenomena that are partially understood or well known. This can only be achieved by intelligently exploiting this rich mine of information.

    Description: Considering that technologies like microarrays remain prohibitively expensive for researchers with limited means to order their own experimental chips, it would be beneficial to re-use previously published microarray data. For researchers interested in finding gene groups (which requires many replicates), there is a great need for tools that help them select appropriate datasets for analysis. Such tools can be effective only if they are able to re-use previously deposited experiments or to create new experiments not initially envisioned by the depositors. However, the generation of new experiments requires that all published microarray data be completely annotated, which is not currently the case. Thus, we propose the PathEx approach.

    Conclusion: This paper presents PathEx, a human-focused web solution built around a two-component system: a database component, enriched with relevant biological information (expression array, omics data, literature) from different sources, and a component comprising sophisticated web interfaces that allow users to perform complex dataset-building queries on the contents integrated into the PathEx database.

    Methods for Joint Normalization and Comparison of Hi-C data

    The development of chromatin conformation capture technology has opened new avenues of study into the 3D structure and function of the genome. Chromatin structure is known to influence gene regulation, and differences in structure are now emerging as a mechanism of regulation between, e.g., cell differentiation and disease vs. normal states. Hi-C sequencing technology now provides a way to study the 3D interactions of chromatin over the whole genome. However, like all sequencing technologies, Hi-C suffers from several forms of bias stemming from both the technology and the DNA sequence itself. Several normalization methods have been developed for normalizing individual Hi-C datasets, but little work has been done on joint normalization methods for comparing two or more Hi-C datasets. To make full use of Hi-C data, joint normalization and statistical comparison techniques are needed to identify regions where chromatin structure differs between conditions.

    We develop methods for the joint normalization and comparison of two Hi-C datasets, which we then extend to more complex experimental designs. Our normalization method is novel in that it makes use of the distance-dependent nature of chromatin interactions. Our modification of the Minus vs. Average (MA) plot to the Minus vs. Distance (MD) plot allows for a nonparametric, data-driven normalization technique using loess smoothing. Additionally, we present a simple statistical method using Z-scores for detecting differentially interacting regions between two datasets. Our initial method was published as the Bioconductor R package HiCcompare (http://bioconductor.org/packages/HiCcompare/).

    We then further extended our normalization and comparison methods for use in complex Hi-C experiments with more than two datasets and optional covariates. The extended normalization method jointly normalizes any number of Hi-C datasets by applying a cyclic loess procedure on the MD plot; this technique removes between-dataset biases efficiently and effectively even when several datasets are analyzed at one time. Our extended comparison method implements a generalized linear model-based approach for comparing complex Hi-C experiments, which may have more than two groups and additional covariates. The extended methods are available as the Bioconductor R package multiHiCcompare (http://bioconductor.org/packages/multiHiCcompare/). Finally, we demonstrate the use of HiCcompare and multiHiCcompare in several test cases on real data, in addition to comparing them to other similar methods (https://doi.org/10.1002/cpbi.76).
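    HiCcompare itself is an R/Bioconductor package; the sketch below only illustrates the MD-plot idea it describes, in Python, under assumed column names ('distance', 'IF1', 'IF2') and an assumed loess span: compute M = log2(IF2/IF1) per interaction, remove the loess trend of M over distance to jointly normalize, and z-score the residuals to flag candidate differential interactions.

```python
# A minimal sketch of MD-plot joint normalization, not HiCcompare's code.
import numpy as np
import pandas as pd
from statsmodels.nonparametric.smoothers_lowess import lowess

def md_normalize(df: pd.DataFrame, span: float = 0.3) -> pd.DataFrame:
    """df columns (assumed): 'distance', 'IF1', 'IF2' for matched interactions."""
    M = np.log2((df["IF2"] + 1) / (df["IF1"] + 1))        # Minus: log ratio
    D = df["distance"].to_numpy()                          # Distance
    trend = lowess(M, D, frac=span, return_sorted=False)   # loess fit of M on D
    M_adj = M - trend                                      # joint normalization
    z = (M_adj - M_adj.mean()) / M_adj.std()               # simple Z-scores
    return df.assign(M_adj=M_adj, Z=z)

# Interactions with |Z| above a chosen cutoff (e.g. 2) are candidate
# differential regions; HiCcompare's actual procedure stratifies by distance.
```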

    Mining SOM expression portraits: Feature selection and integrating concepts of molecular function

    Background: Self-organizing maps (SOM) enable the straightforward portraying of high-dimensional data from large sample collections in terms of sample-specific images. The analysis of their texture provides so-called spot clusters of co-expressed genes, which require subsequent significance filtering and functional interpretation. We address feature selection in terms of the gene-ranking problem and the interpretation of the obtained spot-related lists using concepts of molecular function.

    Results: Different expression scores, based either on simple fold-change measures or on regularized Student's t-statistics, are applied to spot-related gene lists and compared, with special emphasis on the error characteristics of microarray expression data. The spot clusters are analyzed using different methods of gene set enrichment analysis, focusing on overexpression and/or overrepresentation of predefined sets of genes. Metagene-related overrepresentation of selected gene sets was mapped into the SOM images to assign gene function to different regions. Alternatively, we estimated set-related overexpression profiles over all samples studied using a gene set enrichment score, which was also applied to the spot clusters to generate lists of enriched gene sets. We used the tissue body index data set, a collection of expression data of human tissues, as an illustrative example. We found that tissue-related spots typically contain enriched populations of gene sets corresponding well to molecular processes in the respective tissues. In addition, we display special sets of housekeeping genes and of consistently weakly and highly expressed genes using SOM data filtering.

    Conclusions: The presented methods allow the comprehensive downstream analysis of SOM-transformed expression data in terms of cluster-related gene lists and enriched gene sets for functional interpretation. SOM clustering implies the ability either to define new gene sets using selected SOM spots or to verify and/or amend existing ones.
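    As an illustration of the overrepresentation component described above, the following sketch computes the one-sided hypergeometric tail probability that a predefined gene set overlaps a spot-derived gene list more than expected by chance; the function and its inputs are illustrative, not the authors' implementation.

```python
# A minimal sketch of the overrepresentation test used for spot-cluster
# interpretation: given a spot-derived gene list, ask whether a predefined
# gene set is enriched in it, via the hypergeometric tail probability
# (equivalent to a one-sided Fisher test).
from scipy.stats import hypergeom

def overrepresentation_p(spot_genes: set, gene_set: set, universe: set) -> float:
    """P(overlap >= observed) under random sampling from the universe."""
    N = len(universe)                # all genes on the array
    K = len(gene_set & universe)     # gene-set members in the universe
    n = len(spot_genes & universe)   # size of the spot cluster
    k = len(spot_genes & gene_set)   # observed overlap
    # Survival function at k-1 gives P(X >= k) for X ~ Hypergeom(N, K, n).
    return hypergeom.sf(k - 1, N, K, n)
```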

    Risk-conscious correction of batch effects: maximising information extraction from high-throughput genomic datasets

    Contains additional information and discussion on gPCA (Reese et al., 2013). Table S1 demonstrates the inverse proportionality between the gPCA p-value and the associated 'delta' score, which reflects the unadjusted relative magnitude of batch effects (Reese et al., 2013); the table shows the scores for all three datasets. Figure S1 contains an illustration to further help interpret gPCA p-value vs. preserved-data-variance plots. (DOCX 60 kb)
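    For readers unfamiliar with gPCA, the sketch below is a hedged reconstruction of the 'delta' statistic as described by Reese et al. (2013): the variance captured by the first batch-guided principal component relative to the first unguided one, with a permutation p-value obtained by shuffling batch labels. It is an illustrative reading of the method, not the authors' reference code.

```python
# A hedged sketch of the gPCA 'delta' statistic (Reese et al., 2013):
# guided PCA decomposes the batch-summarized data Y'X, and delta compares
# the variance along the first guided loading with that along the first
# unguided loading. Larger delta suggests a stronger batch effect.
import numpy as np

def gpca_delta(X: np.ndarray, batch: np.ndarray, n_perm: int = 1000, seed: int = 0):
    """X: samples x features. batch: integer batch labels per sample."""
    rng = np.random.default_rng(seed)
    X = X - X.mean(axis=0)                     # column-center the data
    batch = np.asarray(batch)

    def delta_for(labels):
        # One-hot batch indicator matrix Y (samples x batches).
        Y = (labels[:, None] == np.unique(labels)[None, :]).astype(float)
        v_g = np.linalg.svd(Y.T @ X, full_matrices=False)[2][0]  # guided loading
        v_u = np.linalg.svd(X, full_matrices=False)[2][0]        # unguided loading
        return np.var(X @ v_g) / np.var(X @ v_u)

    delta = delta_for(batch)
    perms = [delta_for(rng.permutation(batch)) for _ in range(n_perm)]
    p = np.mean([d >= delta for d in perms])   # permutation p-value
    return delta, p
```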

    How to do quantile normalization correctly for gene expression data analyses.

    Quantile normalization is an important normalization technique commonly used in high-dimensional data analysis. However, it is susceptible to class-effect proportion effects (the proportion of class-correlated variables in a dataset) and batch effects (the presence of potentially confounding technical variation) when applied blindly to whole datasets, resulting in higher false-positive and false-negative rates. We evaluate five strategies for performing quantile normalization and demonstrate that good performance in terms of batch-effect correction and statistical feature selection can be readily achieved by first splitting the data by sample class labels and then performing quantile normalization independently on each split ("Class-specific"). Via simulations with both real and simulated batch effects, we demonstrate that the "Class-specific" strategy (and others relying on similar principles) readily outperforms whole-data quantile normalization and is robust, preserving useful signals even during the combined analysis of separately normalized datasets. Quantile normalization is a commonly used procedure, but when carelessly applied to whole datasets without first considering class-effect proportions and batch effects, it can result in poor performance. If quantile normalization must be used, we recommend the "Class-specific" strategy.
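    A minimal sketch of the "Class-specific" strategy described above: quantile-normalize the samples of each class independently and then recombine, rather than normalizing the whole matrix at once. The quantile_normalize helper is the standard rank-and-average construction; all names here are illustrative.

```python
# Class-specific quantile normalization: split by class label, normalize
# each split independently, recombine in the original sample order.
import numpy as np
import pandas as pd

def quantile_normalize(df: pd.DataFrame) -> pd.DataFrame:
    """Rows are features, columns are samples."""
    # Reference distribution: mean of each row after sorting every column.
    ref = pd.DataFrame(np.sort(df.values, axis=0)).mean(axis=1).values
    ranks = df.rank(method="first").astype(int) - 1    # 0-based ranks per sample
    out = df.copy()
    for c in df.columns:
        out[c] = ref[ranks[c].values]                  # map values to reference quantiles
    return out

def class_specific_qn(df: pd.DataFrame, labels: pd.Series) -> pd.DataFrame:
    """labels: sample name -> class label; normalize each class independently."""
    parts = [quantile_normalize(df[labels.index[labels == g]]) for g in labels.unique()]
    return pd.concat(parts, axis=1)[df.columns]        # restore original order
```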

    Assessing the Impact of Sample Heterogeneity on Transcriptome Analysis of Human Diseases Using MDP Webtool

    Transcriptome analyses have increased our understanding of the molecular mechanisms underlying human diseases. Most approaches aim to identify significant genes by comparing their expression values between healthy subjects and a group of patients with a certain disease. Given that studies normally contain few samples, the heterogeneity among individuals caused by environmental factors or undetected illnesses can impact gene expression analyses. We present a systematic analysis of sample heterogeneity in a variety of gene expression studies relating to inflammatory and infectious diseases and show that novel immunological insights may arise once heterogeneity is addressed. The perturbation score of samples is quantified using nonperturbed subjects (i.e., healthy subjects) as a reference group. Such a score allows us to detect outlying samples and subgroups of diseased patients and even assess the molecular perturbation of single cells infected with viruses. We also show how removal of outlying samples can improve the “signal” of the disease and impact detection of differentially expressed genes. The method is made available via the mdp Bioconductor R package and as a user-friendly webtool, webMDP, available at http://mdp.sysbio.tools
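    The mdp package itself is in R; the sketch below only approximates the perturbation-score idea described above: gene-wise z-scores are computed against the healthy reference group, and each sample is scored by its average absolute z over the most perturbed genes. The top-fraction parameter is an assumption, not mdp's exact default.

```python
# An approximation of a perturbation score in the spirit of mdp/webMDP:
# z-score every gene against the healthy reference group, then score each
# sample by its mean absolute z over the most perturbed genes.
import numpy as np
import pandas as pd

def perturbation_scores(expr: pd.DataFrame, healthy: list, top_frac: float = 0.25) -> pd.Series:
    """expr: genes x samples. healthy: column names of the reference group."""
    ref = expr[healthy]
    z = expr.sub(ref.mean(axis=1), axis=0).div(ref.std(axis=1) + 1e-8, axis=0)
    # Rank genes by average perturbation across all samples; keep the top fraction.
    top = z.abs().mean(axis=1).nlargest(int(len(z) * top_frac)).index
    return z.loc[top].abs().mean(axis=0)    # one score per sample

# Samples whose score exceeds, e.g., mean + 2 SD of the healthy samples'
# scores could be flagged as outliers and inspected before DE analysis.
```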