59 research outputs found

    A New Approach to Intensity-Dependent Normalization of Two-Channel Microarrays

    A two-channel microarray measures the relative expression levels of thousands of genes from a pair of biological samples. In order to reliably compare gene expression levels between and within arrays, it is necessary to remove systematic errors that distort the biological signal of interest. The standard for accomplishing this is smoothing MA-plots to remove intensity-dependent dye bias and array-specific effects. However, MA methods require strong assumptions. We review these assumptions and derive several practical scenarios in which they fail. The dye-swap normalization method has been much less frequently used because it requires two arrays per pair of samples. We show that a dye-swap is accurate under general assumptions, even under intensity-dependent dye bias, and that a dye-swap provides the minimal information required for removing dye bias from a pair of samples in general. Based on a flexible model of the relationship between mRNA amount and single channel fluorescence intensity, we demonstrate the general applicability of a dye-swap approach. We then propose a common array dye-swap (CADS) method for the normalization of two-channel microarrays. We show that CADS removes both dye bias and array-specific effects, and preserves the true differential expression signal for every gene. Finally, we discuss some possible extensions of CADS that circumvent the need to use two arrays per pair of samples.
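The dye-swap idea in the abstract above can be illustrated with a minimal sketch. This is the classical gene-wise dye-swap average, not the CADS method itself, and it assumes a simple additive model on the log scale in which each gene's dye bias is the same on the forward and the swapped array. The function names and the toy numbers are hypothetical.

```python
import math

def log_ratio(red, green):
    """M value for one spot: log2 of red over green intensity."""
    return math.log2(red / green)

def dye_swap_estimate(m_forward, m_swapped):
    """Gene-wise dye-swap average.

    Assumed model (an illustration, not the paper's model):
      forward array: M1 =  d + b   (sample A in red, sample B in green)
      swapped array: M2 = -d + b   (dyes exchanged)
    where d is the true log ratio and b is the dye bias.
    Then (M1 - M2) / 2 = d, so the bias cancels.
    """
    return [(mf - ms) / 2 for mf, ms in zip(m_forward, m_swapped)]

# Toy data: hypothetical true log ratios and gene-specific dye biases.
true_log_ratios = [1.0, 0.0, -2.0]
dye_bias = [0.3, -0.5, 0.8]

m_forward = [d + b for d, b in zip(true_log_ratios, dye_bias)]
m_swapped = [-d + b for d, b in zip(true_log_ratios, dye_bias)]

estimates = dye_swap_estimate(m_forward, m_swapped)
print(estimates)
```

Because the bias term enters both arrays with the same sign while the signal flips sign, no smoothing over intensity is needed in this toy model; the cost is the second array per sample pair that the abstract mentions.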

    Optimal Feature Selection for Nearest Centroid Classifiers, With Applications to Gene Expression Microarrays

    Nearest centroid classifiers have recently been successfully employed in high-dimensional applications. A necessary step when building a classifier for high-dimensional data is feature selection. Feature selection is typically carried out by computing univariate statistics for each feature individually, without consideration for how a subset of features performs as a whole. For subsets of a given size, we characterize the optimal choice of features, corresponding to those yielding the smallest misclassification rate. Furthermore, we propose an algorithm for estimating this optimal subset in practice. Finally, we investigate the applicability of shrinkage ideas to nearest centroid classifiers. We use gene-expression microarrays for our illustrative examples, demonstrating that our proposed algorithms can improve the performance of a nearest centroid classifier.
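For orientation, here is a minimal sketch of the conventional baseline that the abstract above contrasts itself with: a two-class nearest centroid classifier whose features are ranked one at a time by a univariate score. It is not the authors' optimal-subset algorithm. The score (absolute centroid difference divided by the within-class standard deviation), the function names, and the toy data are all assumptions made for illustration.

```python
import math
from statistics import mean

def class_centroids(X, y):
    """Per-class mean of every feature."""
    rows_by_class = {}
    for row, label in zip(X, y):
        rows_by_class.setdefault(label, []).append(row)
    return {label: [mean(col) for col in zip(*rows)]
            for label, rows in rows_by_class.items()}

def univariate_scores(X, y, cents):
    """Score each feature on its own (two classes assumed)."""
    a, b = sorted(cents)
    scores = []
    for j in range(len(X[0])):
        # Within-class standard deviation of feature j around its class centroid.
        resid = [(row[j] - cents[label][j]) ** 2 for row, label in zip(X, y)]
        s = math.sqrt(sum(resid) / len(resid))
        scores.append(abs(cents[a][j] - cents[b][j]) / (s + 1e-9))
    return scores

def select_top_k(scores, k):
    """Indices of the k highest-scoring features."""
    return sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)[:k]

def predict(x, cents, selected):
    """Assign x to the class with the closest centroid on the selected features."""
    def sq_dist(label):
        return sum((x[j] - cents[label][j]) ** 2 for j in selected)
    return min(cents, key=sq_dist)

# Toy data: feature 0 separates the classes, features 1 and 2 do not.
X = [[1.0, 5.0, 0.2], [1.2, 4.8, 0.9], [3.0, 5.1, 0.1], [3.2, 4.9, 0.8]]
y = ["a", "a", "b", "b"]

cents = class_centroids(X, y)
selected = select_top_k(univariate_scores(X, y, cents), k=1)
print(selected, predict([2.9, 5.0, 0.5], cents, selected))
```

Ranking features individually ignores how they perform jointly, which is the limitation the abstract targets when it characterizes the optimal subset of a given size.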

    Normalization of two-channel microarrays accounting for experimental design and intensity-dependent relationships

    eCADS is a new method for multiple array normalization of two-channel microarrays that takes into account general experimental designs and intensity-dependent relationships and allows for a more efficient dye-swap design that requires only one array per sample pair.

    Liquid Chromatography Mass Spectrometry-Based Proteomics: Biological and Technological Aspects

    Mass spectrometry-based proteomics has become the tool of choice for identifying and quantifying the proteome of an organism. Though recent years have seen a tremendous improvement in instrument performance and the computational tools used, significant challenges remain, and there are many opportunities for statisticians to make important contributions. In the most widely used "bottom-up" approach to proteomics, complex mixtures of proteins are first subjected to enzymatic cleavage; the resulting peptide products are then separated based on chemical or physical properties and analyzed using a mass spectrometer. The two fundamental challenges in the analysis of bottom-up MS-based proteomics are as follows: (1) identifying the proteins that are present in a sample, and (2) quantifying the abundance levels of the identified proteins. Both of these challenges require knowledge of the biological and technological context that gives rise to observed data, as well as the application of sound statistical principles for estimation and inference. We present an overview of bottom-up proteomics and outline the key statistical issues that arise in protein identification and quantification. Comment: Published at http://dx.doi.org/10.1214/10-AOAS341 in the Annals of Applied Statistics (http://www.imstat.org/aoas/) by the Institute of Mathematical Statistics (http://www.imstat.org).

    Normalization and missing value imputation for label-free LC-MS analysis

    Shotgun proteomic data are affected by a variety of known and unknown systematic biases as well as high proportions of missing values. Typically, normalization is performed in an attempt to remove systematic biases from the data before statistical inference, sometimes followed by missing value imputation to obtain a complete matrix of intensities. Here we discuss several approaches to normalization and dealing with missing values, some initially developed for microarray data and some developed specifically for mass spectrometry-based data.

    Optimality Driven Nearest Centroid Classification from Genomic Data

    Nearest-centroid classifiers have recently been successfully employed in high-dimensional applications, such as in genomics. A necessary step when building a classifier for high-dimensional data is feature selection. Feature selection is frequently carried out by computing univariate scores for each feature individually, without consideration for how a subset of features performs as a whole. We introduce a new feature selection approach for high-dimensional nearest-centroid classifiers that instead is based on the theoretically optimal choice of a given number of features, which we determine directly here. This allows us to develop a new greedy algorithm to estimate this optimal nearest-centroid classifier with a given number of features. In addition, whereas the centroids are usually formed from maximum likelihood estimates, we investigate the applicability of high-dimensional shrinkage estimates of centroids. We apply the proposed method to clinical classification based on gene-expression microarrays, demonstrating that the proposed method can outperform existing nearest-centroid classifiers.

    Diet Complexity and Estrogen Receptor β Status Affect the Composition of the Murine Intestinal Microbiota

    Intestinal microbial dysbiosis contributes to the dysmetabolism of luminal factors, including steroid hormones (sterones) that affect the development of chronic gastrointestinal inflammation and the incidence of sterone-responsive cancers of the breast, prostate, and colon. Little is known, however, about the role of specific host sterone nucleoreceptors, including estrogen receptor β (ERβ), in microbiota maintenance. Herein, we test the hypothesis that ERβ status affects microbiota composition and determine if such compositionally distinct microbiota respond differently to changes in diet complexity that favor Proteobacteria enrichment. To this end, conventionally raised female ERβ +/+ and ERβ −/− C57BL/6J mice (mean age of 27 weeks) were initially reared on 8604, a complex diet containing estrogenic isoflavones, and then fed AIN-76, an isoflavone-free semisynthetic diet, for 2 weeks. 16S rRNA gene surveys revealed that the fecal microbiota of 8604-fed mice and AIN-76-fed mice differed, as expected. The relative diversity of Proteobacteria, especially the Alphaproteobacteria and Gammaproteobacteria, increased significantly following the transition to AIN-76. Distinct patterns for beneficial Lactobacillales were exclusive to and highly abundant among 8604-fed mice, whereas several Proteobacteria were exclusive to AIN-76-fed mice. Interestingly, representative orders of the phyla Proteobacteria, Bacteroidetes, and Firmicutes, including the Lactobacillales, also differed as a function of murine ERβ status. Overall, these interactions suggest that sterone nucleoreceptor status and diet complexity may play important roles in microbiota maintenance. Furthermore, we envision that this model for gastrointestinal dysbiosis may be used to identify novel probiotics, prebiotics, nutritional strategies, and pharmaceuticals for the prevention and resolution of Proteobacteria-rich dysbiosis.

    An Introspective Comparison of Random Forest-Based Classifiers for the Analysis of Cluster-Correlated Data by Way of RF++

    Many mass spectrometry-based studies, as well as other biological experiments, produce cluster-correlated data. Failure to account for correlation among observations may result in a classification algorithm overfitting the training data and producing overoptimistic estimated error rates and may make subsequent classifications unreliable. Current common practice for dealing with replicated data is to average each subject replicate sample set, reducing the dataset size and incurring loss of information. In this manuscript we compare three approaches to dealing with cluster-correlated data: unmodified Breiman's Random Forest (URF), forest grown using subject-level averages (SLA), and RF++ with subject-level bootstrapping (SLB). RF++, a novel Random Forest-based algorithm implemented in C++, handles cluster-correlated data through a modification of the original resampling algorithm and accommodates subject-level classification. Subject-level bootstrapping is an alternative sampling method that obviates the need to average or otherwise reduce each set of replicates to a single independent sample. Our experiments show nearly identical median classification and variable selection accuracy for SLB forests and URF forests when applied to both simulated and real datasets. However, the run-time estimated error rate was severely underestimated for URF forests. Predictably, SLA forests were found to be more severely affected by the reduction in sample size, which led to poorer classification and variable selection accuracy. Perhaps most importantly, our results suggest that it is reasonable to utilize URF for the analysis of cluster-correlated data. Two caveats should be noted: first, correct classification error rates must be obtained using a separate test dataset, and second, an additional post-processing step is required to obtain subject-level classifications. RF++ is shown to be an effective alternative for classifying both clustered and non-clustered data.
Source code and stand-alone compiled versions of command-line and easy-to-use graphical user interface (GUI) versions of RF++ for Windows and Linux as well as a user manual (Supplementary File S2) are available for download at: http://sourceforge.org/projects/rfpp/ under the GNU public license.
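The resampling idea behind subject-level bootstrapping can be sketched as follows. This is only an illustration of drawing whole subjects with replacement so that replicates stay together; it is not the RF++ implementation, and the function name, the out-of-bag definition, and the toy subject IDs are assumptions.

```python
import random

def subject_level_bootstrap(subject_ids, rng):
    """Draw a bootstrap sample of subjects, keeping each subject's replicates together.

    subject_ids[i] is the subject that observation i belongs to.
    Returns (in_bag, out_of_bag) lists of observation indices.
    """
    indices_by_subject = {}
    for idx, sid in enumerate(subject_ids):
        indices_by_subject.setdefault(sid, []).append(idx)

    subjects = list(indices_by_subject)
    # Sample subjects, not individual observations, with replacement.
    drawn = [rng.choice(subjects) for _ in subjects]
    drawn_set = set(drawn)

    in_bag = [i for sid in drawn for i in indices_by_subject[sid]]
    # A subject is out-of-bag only if none of its replicates were drawn,
    # so correlated replicates never sit on both sides of the split.
    out_of_bag = [i for sid in subjects if sid not in drawn_set
                  for i in indices_by_subject[sid]]
    return in_bag, out_of_bag

# Toy data: three subjects with 2, 3, and 1 replicate observations.
subject_ids = ["s1", "s1", "s2", "s2", "s2", "s3"]
in_bag, out_of_bag = subject_level_bootstrap(subject_ids, random.Random(0))
print(in_bag, out_of_bag)
```

Sampling at the observation level instead would let replicates of one subject appear both in-bag and out-of-bag, which is consistent with the abstract's report that the run-time error estimate was severely underestimated for unmodified forests.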

    Genome wide association mapping for arabinoxylan content in a collection of tetraploid wheats

    BACKGROUND: Arabinoxylans (AXs) are major components of plant cell walls in bread wheat and are important in bread-making and starch extraction. Furthermore, arabinoxylans are components of soluble dietary fibre that has potential health-promoting effects in human nutrition. Despite their high value for human health, few studies have been carried out on the genetics of AX content in durum wheat. RESULTS: The genetic variability of AX content was investigated in a set of 104 tetraploid wheat genotypes and regions attributable to AX content were identified through a genome wide association study (GWAS). The amount of arabinoxylan, expressed as percentage (w/w) of the dry weight of the kernel, ranged from 1.8% to 5.5% with a mean value of 4.0%. The GWAS revealed a total of 37 significant marker-trait associations (MTA), identifying 19 quantitative trait loci (QTL) associated with AX content. The highest number of MTAs was identified on chromosome 5A (seven), where three QTL regions were associated with AX content, while the lowest number of MTAs was detected on chromosomes 2B and 4B, where only one MTA identified a single locus. Conservation of synteny between SNP marker sequences and the annotated genes and proteins in Brachypodium distachyon, Oryza sativa and Sorghum bicolor allowed the identification of nine QTL coincident with candidate genes. These included a glycosyl hydrolase GH35, which encodes Gal7 and a glucosyltransferase GT31 on chromosome 1A; a cluster of GT1 genes on chromosome 2B that includes TaUGT1 and cisZog1; a glycosyl hydrolase that encodes a CelC gene on chromosome 3A; Ugt12887 and TaUGT1 genes on chromosome 5A; a (1,3)-β-D-glucan synthase (Gsl12 gene) and a glucosyl hydrolase (Cel8 gene) on chromosome 7A. CONCLUSIONS: This study identifies significant MTAs for the AX content in the grain of tetraploid wheat genotypes.
We propose that these may be used for molecular breeding of durum wheat varieties with higher soluble fibre content. Ilaria Marcotuli, Kelly Houston, Robbie Waugh, Geoffrey B. Fincher, Rachel A. Burton, Antonio Blanco, Agata Gadalet