X-ray analysis of fatigue damage in copper
X-ray analysis of copper deformed in fatigue for average coherently diffracting domain size and root mean square strain distribution
Statistical analysis and significance testing of serial analysis of gene expression data using a Poisson mixture model
Background:
Serial analysis of gene expression (SAGE) is used to obtain quantitative snapshots of the transcriptome. These profiles are count-based and are assumed to follow a Binomial or Poisson distribution. However, tag counts observed across multiple libraries (for example, one or more groups of biological replicates) have additional variance that cannot be accommodated by this assumption alone. Several models have been proposed to account for this effect, all of which utilize a continuous prior distribution to explain the excess variance. Here, a Poisson mixture model, which assumes excess variability arises from sampling a mixture of distinct components, is proposed and the merits of this model are discussed and evaluated.
Results:
The goodness of fit of the Poisson mixture model on 15 sets of biological SAGE replicates is compared to the previously proposed hierarchical gamma-Poisson (negative binomial) model, and a substantial improvement is seen. In further support of the mixture model, there is observed: 1) an increase in the number of mixture components needed to fit the expression of tags representing more than one transcript; and 2) a tendency for components to cluster libraries into the same groups. A confidence score is presented that can identify tags that are differentially expressed between groups of SAGE libraries. Several examples where this test outperforms those previously proposed are highlighted.
Conclusion:
The Poisson mixture model performs well as a) a method to represent SAGE data from biological replicates, and b) a basis to assign significance when testing for differential expression between multiple groups of replicates. Code for the R statistical software package is included to assist investigators in applying this model to their own data.
Robust Detection of Hierarchical Communities from Escherichia coli Gene Expression Data
Determining the functional structure of biological networks is a central goal
of systems biology. One approach is to analyze gene expression data to infer a
network of gene interactions on the basis of their correlated responses to
environmental and genetic perturbations. The inferred network can then be
analyzed to identify functional communities. However, commonly used algorithms
can yield unreliable results due to experimental noise, algorithmic
stochasticity, and the influence of arbitrarily chosen parameter values.
Furthermore, the results obtained typically provide only a simplistic view of
the network partitioned into disjoint communities and provide no information about
the relationship between communities. Here, we present methods to robustly
detect coregulated and functionally enriched gene communities and demonstrate
their application and validity for Escherichia coli gene expression data.
Applying a recently developed community detection algorithm to the network of
interactions identified with the context likelihood of relatedness (CLR)
method, we show that a hierarchy of network communities can be identified.
These communities significantly enrich for gene ontology (GO) terms, consistent
with them representing biologically meaningful groups. Further, analysis of the
most significantly enriched communities identified several candidate new
regulatory interactions. The robustness of our methods is demonstrated by
showing that a core set of functional communities is reliably found when
artificial noise, modeling experimental noise, is added to the data. We find
that noise mainly acts conservatively, increasing the relatedness required for
a network link to be reliably assigned and decreasing the size of the core
communities, rather than causing association of genes into new communities.
Comment: Due to appear in PLoS Computational Biology. Supplementary Figure S1
was not uploaded but is available by contacting the author. 27 pages, 5
figures, 15 supplementary files
Empirical bayes analysis of sequencing-based transcriptional profiling without replicates
Background:
Recent technological advancements have made high throughput sequencing an increasingly popular approach for transcriptome analysis. Advantages of sequencing-based transcriptional profiling over microarrays have been reported, including lower technical variability. However, advances in technology do not remove biological variation between replicates and this variation is often neglected in many analyses.
Results:
We propose an empirical Bayes method, titled Analysis of Sequence Counts (ASC), to detect differential expression based on sequencing technology. ASC borrows information across sequences to establish a prior distribution of sample variation, so that biological variation can be accounted for even when replicates are not available. Compared to current approaches that simply test for equality of proportions in two samples, ASC is less biased towards highly expressed sequences and can identify more genes with a greater log fold change at lower overall abundance.
Conclusions:
ASC unifies the biological and statistical significance of differential expression by estimating the posterior mean of log fold change and estimating false discovery rates based on the posterior mean. The implementation in R is available at http://www.stat.brown.edu/Zwu/research.aspx
A machine learning pipeline for quantitative phenotype prediction from genotype data
Background:
Quantitative phenotypes emerge everywhere in systems biology and biomedicine, due either to a direct interest in quantitative traits or to high individual variability that makes it hard or impossible to classify samples into distinct categories, as is often the case with complex common diseases. Machine learning approaches to genotype-phenotype mapping may significantly improve Genome-Wide Association Studies (GWAS) results by explicitly focusing on predictivity and optimal feature selection in a multivariate setting. It is however essential that stringent and well documented Data Analysis Protocols (DAP) are used to control sources of variability and ensure reproducibility of results. We present a genome-to-phenotype pipeline of machine learning modules for quantitative phenotype prediction. The pipeline can be applied for the direct use of whole-genome information in functional studies. As a realistic example, the problem of fitting complex phenotypic traits in heterogeneous stock mice from single nucleotide polymorphisms (SNPs) is considered here.
Methods:
The core element in the pipeline is the L1L2 regularization method based on the naïve elastic net. The method gives at the same time a regression model and a dimensionality reduction procedure suitable for correlated features. Model and SNP markers are selected through a DAP originally developed in the MAQC-II collaborative initiative of the U.S. FDA for the identification of clinical biomarkers from microarray data. The L1L2 approach is compared with standard Support Vector Regression (SVR) and with Recursive Jump Monte Carlo Markov Chain (MCMC). Algebraic indicators of stability of partial lists are used for model selection; the final panel of markers is obtained by a procedure at the chromosome scale, termed 'saturation', to recover SNPs in Linkage Disequilibrium with those selected.
Results:
With respect to both MCMC and SVR, comparable accuracies are obtained by the L1L2 pipeline. Good agreement is also found between SNPs selected by the L1L2 algorithms and candidate loci previously identified by a standard GWAS. The combination of L1L2-based feature selection with a saturation procedure tackles the issue of neglecting highly correlated features that affects many feature selection algorithms.
Conclusions:
The L1L2 pipeline has proven effective in terms of marker selection and prediction accuracy. This study indicates that machine learning techniques may support quantitative phenotype prediction, provided that adequate DAPs are employed to control bias in model selection.
Integrated genomic analyses of ovarian carcinoma
A catalogue of molecular aberrations that cause ovarian cancer is critical for developing and deploying therapies that will improve patients’ lives. The Cancer Genome Atlas project has analysed messenger RNA expression, microRNA expression, promoter methylation and DNA copy number in 489 high-grade serous ovarian adenocarcinomas and the DNA sequences of exons from coding genes in 316 of these tumours. Here we report that high-grade serous ovarian cancer is characterized by TP53 mutations in almost all tumours (96%); low prevalence but statistically recurrent somatic mutations in nine further genes including NF1, BRCA1, BRCA2, RB1 and CDK12; 113 significant focal DNA copy number aberrations; and promoter methylation events involving 168 genes. Analyses delineated four ovarian cancer transcriptional subtypes, three microRNA subtypes, four promoter methylation subtypes and a transcriptional signature associated with survival duration, and shed new light on the impact that tumours with BRCA1/2 (BRCA1 or BRCA2) and CCNE1 aberrations have on survival. Pathway analyses suggested that homologous recombination is defective in about half of the tumours analysed, and that NOTCH and FOXM1 signalling are involved in serous ovarian cancer pathophysiology.
National Institutes of Health (U.S.) grants: U54HG003067, U54HG003273, U54HG003079, U24CA126543, U24CA126544, U24CA126546, U24CA126551, U24CA126554, U24CA126561, U24CA126563, U24CA143882, U24CA143731, U24CA143835, U24CA143845, U24CA143858, U24CA144025, U24CA143866, U24CA143867, U24CA143848, U24CA143843, R21CA135877
Dormancy within Staphylococcus epidermidis biofilms: a transcriptomic analysis by RNA-seq
The proportion of dormant bacteria within Staphylococcus epidermidis biofilms may determine its inflammatory profile. Previously, we have shown that S. epidermidis biofilms with higher proportions of dormant bacteria have reduced activation of murine macrophages. RNA-sequencing was used to identify the major transcriptomic differences between S. epidermidis biofilms with different proportions of dormant bacteria. To accomplish this goal, we used an in vitro model where magnesium allowed modulation of the proportion of dormant bacteria within S. epidermidis biofilms. Significant differences were found in the expression of 147 genes. A detailed analysis of the results was performed based on direct and functional gene interactions. Biological processes among the differentially expressed genes were mainly related to oxidation-reduction processes and acetyl-CoA metabolic processes. Gene set enrichment revealed that the translation process is related to the proportion of dormant bacteria. Transcription of mRNAs involved in oxidation-reduction processes was associated with higher proportions of dormant bacteria within S. epidermidis biofilm. Moreover, the pH of the culture medium did not change after the addition of magnesium, and genes related to magnesium transport did not seem to impact entrance of bacterial cells into dormancy.
The authors thank Stephen Lorry at Harvard Medical School for providing CLC Genomics software. This work was funded by Fundacao para a Ciencia e a Tecnologia (FCT) and COMPETE grants PTDC/BIA-MIC/113450/2009, FCOMP-01-0124-FEDER-014309, FCOMP-01-0124-FEDER-022718 (FCT PEst-C/SAU/LA0002/2011), QOPNA research unit (project PEst-C/QUI/UI0062/2011), and CENTRO-07-ST24-FEDER-002034. The following authors had an individual FCT fellowship: VC (SFRH/BD/78235/2011) and AF (2SFRH/BD/62359/2009).
Reproducible Cancer Biomarker Discovery in SELDI-TOF MS Using Different Pre-Processing Algorithms
Background:
There has been much interest in differentiating diseased and normal samples using biomarkers derived from mass spectrometry (MS) studies. However, biomarker identification for specific diseases has been hindered by irreproducibility. Specifically, a peak profile extracted from a dataset for biomarker identification depends on a data pre-processing algorithm. Until now, no widely accepted agreement has been reached.
Results:
In this paper, we investigated the consistency of biomarker identification using differentially expressed (DE) peaks from peak profiles produced by three widely used average spectrum-dependent pre-processing algorithms based on SELDI-TOF MS data for prostate and breast cancers. Our results revealed two important factors that affect the consistency of DE peak identification using different algorithms. One factor is that some DE peaks selected from one peak profile were not detected as peaks in other profiles, and the second factor is that the statistical power of identifying DE peaks in large peak profiles with many peaks may be low due to the large scale of the tests and small number of samples. Furthermore, we demonstrated that the DE peak detection power in large profiles could be improved by the stratified false discovery rate (FDR) control approach and that the reproducibility of DE peak detection could thereby be increased.
Conclusions:
Comparing and evaluating pre-processing algorithms in terms of reproducibility can elucidate the relationship among different algorithms and also help in selecting a pre-processing algorithm. The DE peaks selected from small peak profiles with few peaks for a dataset tend to be reproducibly detected in large peak profiles, which suggests that a suitable pre-processing algorithm should be able to produce peaks sufficient for identifying useful and reproducible biomarkers.