190 research outputs found
Bias correction and Bayesian analysis of aggregate counts in SAGE libraries
<p>Abstract</p> <p>Background</p> <p>Tag-based techniques, such as SAGE, are commonly used to sample the mRNA pool of an organism's transcriptome. Incomplete digestion during the tag formation process may allow for multiple tags to be generated from a given mRNA transcript. The probability of forming a tag varies with its relative location. As a result, the observed tag counts represent a biased sample of the actual transcript pool. In SAGE this bias can be avoided by ignoring all but the 3' most tag but will discard a large fraction of the observed data. Taking this bias into account should allow more of the available data to be used leading to increased statistical power.</p> <p>Results</p> <p>Three new hierarchical models, which directly embed a model for the variation in tag formation probability, are proposed and their associated Bayesian inference algorithms are developed. These models may be applied to libraries at both the tag and aggregate level. Simulation experiments and analysis of real data are used to contrast the accuracy of the various methods. The consequences of tag formation bias are discussed in the context of testing differential expression. A description is given as to how these algorithms can be applied in that context.</p> <p>Conclusions</p> <p>Several Bayesian inference algorithms that account for tag formation effects are compared with the DPB algorithm providing clear evidence of superior performance. The accuracy of inferences when using a particular non-informative prior is found to depend on the expression level of a given gene. The multivariate nature of the approach easily allows both univariate and joint tests of differential expression. Calculations demonstrate the potential for false positive and negative findings due to variation in tag formation probabilities across samples when testing for differential expression.</p
Statistical analysis and significance testing of serial analysis of gene expression data using a Poisson mixture model
<p>Abstract</p> <p>Background</p> <p>Serial analysis of gene expression (SAGE) is used to obtain quantitative snapshots of the transcriptome. These profiles are count-based and are assumed to follow a Binomial or Poisson distribution. However, tag counts observed across multiple libraries (for example, one or more groups of biological replicates) have additional variance that cannot be accommodated by this assumption alone. Several models have been proposed to account for this effect, all of which utilize a continuous prior distribution to explain the excess variance. Here, a Poisson mixture model, which assumes excess variability arises from sampling a mixture of distinct components, is proposed and the merits of this model are discussed and evaluated.</p> <p>Results</p> <p>The goodness of fit of the Poisson mixture model on 15 sets of biological SAGE replicates is compared to the previously proposed hierarchical gamma-Poisson (negative binomial) model, and a substantial improvement is seen. In further support of the mixture model, there is observed: 1) an increase in the number of mixture components needed to fit the expression of tags representing more than one transcript; and 2) a tendency for components to cluster libraries into the same groups. A confidence score is presented that can identify tags that are differentially expressed between groups of SAGE libraries. Several examples where this test outperforms those previously proposed are highlighted.</p> <p>Conclusion</p> <p>The Poisson mixture model performs well as a) a method to represent SAGE data from biological replicates, and b) a basis to assign significance when testing for differential expression between multiple groups of replicates. Code for the R statistical software package is included to assist investigators in applying this model to their own data.</p
Multicentric validation of proteomic biomarkers in urine specific for diabetic nephropathy
Background: Urine proteome analysis is rapidly emerging as a tool for diagnosis and prognosis in disease states. For diagnosis of diabetic nephropathy (DN), urinary proteome analysis was successfully applied in a pilot study. The validity of the previously established proteomic biomarkers with respect to the diagnostic and prognostic potential was assessed on a separate set of patients recruited at three different European centers. In this case-control study of 148 Caucasian patients with diabetes mellitus type 2 and duration >= 5 years, cases of DN were defined as albuminuria >300 mg/d and diabetic retinopathy (n = 66). Controls were matched for gender and diabetes duration (n = 82).
Methodology/Principal Findings: Proteome analysis was performed blinded using high-resolution capillary electrophoresis coupled with mass spectrometry (CE-MS). Data were evaluated employing the previously developed model for DN. Upon unblinding, the model for DN showed 93.8% sensitivity and 91.4% specificity, with an AUC of 0.948 (95% CI 0.898-0.978). Of 65 previously identified peptides, 60 were significantly different between cases and controls of this study. In <10% of cases and controls classification by proteome analysis not entirely resulted in the expected clinical outcome. Analysis of patient's subsequent clinical course revealed later progression to DN in some of the false positive classified DN control patients.
Conclusions: These data provide the first independent confirmation that profiling of the urinary proteome by CE-MS can adequately identify subjects with DN, supporting the generalizability of this approach. The data further establish urinary collagen fragments as biomarkers for diabetes-induced renal damage that may serve as earlier and more specific biomarkers than the currently used urinary albumin
Modeling SAGE tag formation and its effects on data interpretation within a Bayesian framework
<p>Abstract</p> <p>Background</p> <p>Serial Analysis of Gene Expression (SAGE) is a high-throughput method for inferring mRNA expression levels from the experimentally generated sequence based tags. Standard analyses of SAGE data, however, ignore the fact that the probability of generating an observable tag varies across genes and between experiments. As a consequence, these analyses result in biased estimators and posterior probability intervals for gene expression levels in the transcriptome.</p> <p>Results</p> <p>Using the yeast <it>Saccharomyces cerevisiae </it>as an example, we introduce a new Bayesian method of data analysis which is based on a model of SAGE tag formation. Our approach incorporates the variation in the probability of tag formation into the interpretation of SAGE data and allows us to derive exact joint and approximate marginal posterior distributions for the mRNA frequency of genes detectable using SAGE. Our analysis of these distributions indicates that the frequency of a gene in the tag pool is influenced by its mRNA frequency, the cleavage efficiency of the anchoring enzyme (AE), and the number of informative and uninformative AE cleavage sites within its mRNA.</p> <p>Conclusion</p> <p>With a mechanistic, model based approach for SAGE data analysis, we find that inter-genic variation in SAGE tag formation is large. However, this variation can be estimated and, importantly, accounted for using the methods we develop here. As a result, SAGE based estimates of mRNA frequencies can be adjusted to remove the bias introduced by the SAGE tag formation process.</p
BRCA2 inhibition enhances cisplatin-mediated alterations in tumor cell proliferation, metabolism, and metastasis
Tumor cells have unstable genomes relative to non-tumor cells. Decreased DNA integrity resulting from tumor cell instability is important in generating favorable therapeutic indices, and intact DNA repair mediates resistance to therapy. Targeting DNA repair to promote the action of anti-cancer agents is therefore an attractive therapeutic strategy. BRCA2 is involved in homologous recombination repair. BRCA2 defects increase cancer risk but, paradoxically, cancer patients with BRCA2 mutations have better survival rates. We queried TCGA data and found that BRCA2 alterations led to increased survival in patients with ovarian and endometrial cancer. We developed a BRCA2-targeting second-generation antisense oligonucleotide (ASO), which sensitized human lung, ovarian, and breast cancer cells to cisplatin by as much as 60%. BRCA2 ASO treatment overcame acquired cisplatin resistance in head and neck cancer cells, but induced minimal cisplatin sensitivity in non-tumor cells. BRCA2 ASO plus cisplatin reduced respiration as an early event preceding cell death, concurrent with increased glucose uptake without a difference in glycolysis. BRCA2 ASO and cisplatin decreased metastatic frequency invivo by 77%. These results implicate BRCA2 as a regulator of metastatic frequency and cellular metabolic response following cisplatin treatment. BRCA2 ASO, in combination with cisplatin, is a potential therapeutic anti-cancer agent
Robust Detection of Hierarchical Communities from Escherichia coli Gene Expression Data
Determining the functional structure of biological networks is a central goal
of systems biology. One approach is to analyze gene expression data to infer a
network of gene interactions on the basis of their correlated responses to
environmental and genetic perturbations. The inferred network can then be
analyzed to identify functional communities. However, commonly used algorithms
can yield unreliable results due to experimental noise, algorithmic
stochasticity, and the influence of arbitrarily chosen parameter values.
Furthermore, the results obtained typically provide only a simplistic view of
the network partitioned into disjoint communities and provide no information of
the relationship between communities. Here, we present methods to robustly
detect coregulated and functionally enriched gene communities and demonstrate
their application and validity for Escherichia coli gene expression data.
Applying a recently developed community detection algorithm to the network of
interactions identified with the context likelihood of relatedness (CLR)
method, we show that a hierarchy of network communities can be identified.
These communities significantly enrich for gene ontology (GO) terms, consistent
with them representing biologically meaningful groups. Further, analysis of the
most significantly enriched communities identified several candidate new
regulatory interactions. The robustness of our methods is demonstrated by
showing that a core set of functional communities is reliably found when
artificial noise, modeling experimental noise, is added to the data. We find
that noise mainly acts conservatively, increasing the relatedness required for
a network link to be reliably assigned and decreasing the size of the core
communities, rather than causing association of genes into new communities.Comment: Due to appear in PLoS Computational Biology. Supplementary Figure S1
was not uploaded but is available by contacting the author. 27 pages, 5
figures, 15 supplementary file
Relative impact of key sources of systematic noise in Affymetrix and Illumina gene-expression microarray experiments
<p>Abstract</p> <p>Background</p> <p>Systematic processing noise, which includes batch effects, is very common in microarray experiments but is often ignored despite its potential to confound or compromise experimental results. Compromised results are most likely when re-analysing or integrating datasets from public repositories due to the different conditions under which each dataset is generated. To better understand the relative noise-contributions of various factors in experimental-design, we assessed several Illumina and Affymetrix datasets for technical variation between replicate hybridisations of Universal Human Reference (UHRR) and individual or pooled breast-tumour RNA.</p> <p>Results</p> <p>A varying degree of systematic noise was observed in each of the datasets, however in all cases the relative amount of variation between standard control RNA replicates was found to be greatest at earlier points in the sample-preparation workflow. For example, 40.6% of the total variation in reported expressions were attributed to replicate extractions, compared to 13.9% due to amplification/labelling and 10.8% between replicate hybridisations. Deliberate probe-wise batch-correction methods were effective in reducing the magnitude of this variation, although the level of improvement was dependent on the sources of noise included in the model. Systematic noise introduced at the chip, run, and experiment levels of a combined Illumina dataset were found to be highly dependant upon the experimental design. Both UHRR and pools of RNA, which were derived from the samples of interest, modelled technical variation well although the pools were significantly better correlated (4% average improvement) and better emulated the effects of systematic noise, over all probes, than the UHRRs. The effect of this noise was not uniform over all probes, with low GC-content probes found to be more vulnerable to batch variation than probes with a higher GC-content.</p> <p>Conclusions</p> <p>The magnitude of systematic processing noise in a microarray experiment is variable across probes and experiments, however it is generally the case that procedures earlier in the sample-preparation workflow are liable to introduce the most noise. Careful experimental design is important to protect against noise, detailed meta-data should always be provided, and diagnostic procedures should be routinely performed prior to downstream analyses for the detection of bias in microarray studies.</p
Empirical bayes analysis of sequencing-based transcriptional profiling without replicates
Background:
Recent technological advancements have made high throughput sequencing an increasingly popular approach for transcriptome analysis. Advantages of sequencing-based transcriptional profiling over microarrays have been reported, including lower technical variability. However, advances in technology do not remove biological variation between replicates and this variation is often neglected in many analyses.
Results:
We propose an empirical Bayes method, titled Analysis of Sequence Counts (ASC), to detect differential expression based on sequencing technology. ASC borrows information across sequences to establish prior distribution of sample variation, so that biological variation can be accounted for even when replicates are not available. Compared to current approaches that simply tests for equality of proportions in two samples, ASC is less biased towards highly expressed sequences and can identify more genes with a greater log fold change at lower overall abundance.
Conclusions:
ASC unifies the biological and statistical significance of differential expression by estimating the posterior mean of log fold change and estimating false discovery rates based on the posterior mean. The implementation in R is available at http://www.stat.brown.edu/Zwu/research.aspx
Public Availability of Published Research Data in High-Impact Journals
BACKGROUND: There is increasing interest to make primary data from published research publicly available. We aimed to assess the current status of making research data available in highly-cited journals across the scientific literature. METHODS AND RESULTS: We reviewed the first 10 original research papers of 2009 published in the 50 original research journals with the highest impact factor. For each journal we documented the policies related to public availability and sharing of data. Of the 50 journals, 44 (88%) had a statement in their instructions to authors related to public availability and sharing of data. However, there was wide variation in journal requirements, ranging from requiring the sharing of all primary data related to the research to just including a statement in the published manuscript that data can be available on request. Of the 500 assessed papers, 149 (30%) were not subject to any data availability policy. Of the remaining 351 papers that were covered by some data availability policy, 208 papers (59%) did not fully adhere to the data availability instructions of the journals they were published in, most commonly (73%) by not publicly depositing microarray data. The other 143 papers that adhered to the data availability instructions did so by publicly depositing only the specific data type as required, making a statement of willingness to share, or actually sharing all the primary data. Overall, only 47 papers (9%) deposited full primary raw data online. None of the 149 papers not subject to data availability policies made their full primary data publicly available. CONCLUSION: A substantial proportion of original research papers published in high-impact journals are either not subject to any data availability policies, or do not adhere to the data availability instructions in their respective journals. This empiric evaluation highlights opportunities for improvement
- …