678 research outputs found

    MGMR: leveraging RNA-Seq population data to optimize expression estimation

    Background: RNA-Seq is a technique that uses Next Generation Sequencing to identify transcripts and estimate transcription levels. When applying this technique for quantification, one must contend with reads that align to multiple positions in the genome (multireads). Previous efforts to resolve multireads have shown that RNA-Seq expression estimation can be improved using probabilistic allocation of reads to genes. These methods use a probabilistic generative model for data generation and resolve ambiguity using likelihood-based approaches. In many instances, RNA-Seq experiments are performed in the context of a population. The generative models of current methods do not take into account such population information, and it is an open question whether this information can improve quantification of the individual samples. Results: In order to explore the contribution of population-level information in RNA-Seq quantification, we apply a hierarchical probabilistic generative model, which assumes that expression levels of different individuals are sampled from a Dirichlet distribution with parameters specific to the population, and reads are sampled from the distribution of expression levels. We introduce an optimization procedure for the estimation of the model parameters, and use HapMap data and simulated data to demonstrate that the model yields a significant improvement in the accuracy of expression levels of paralogous genes. Conclusions: We provide a proof of principle of the benefit of drawing on population commonalities to estimate expression. The results of our experiments demonstrate that this approach can be beneficial, primarily for estimation at the gene level.
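The two-level sampling scheme described in the MGMR abstract (per-individual expression levels drawn from a population-specific Dirichlet distribution, then reads drawn from those expression levels) can be illustrated with a minimal sketch. This is not the authors' implementation: the parameter values, function names, and read counts below are hypothetical, and the sketch leaves out multiread ambiguity and the optimization procedure entirely.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw one probability vector from Dirichlet(alpha) by normalizing gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def simulate_population(alpha, n_individuals, n_reads, seed=0):
    """Simulate (expression, read counts) for each individual in a population."""
    rng = random.Random(seed)
    genes = range(len(alpha))
    population = []
    for _ in range(n_individuals):
        # Level 1: this individual's expression levels come from the population Dirichlet.
        expression = sample_dirichlet(alpha, rng)
        # Level 2: each read is assigned to a gene according to those expression levels.
        reads = rng.choices(genes, weights=expression, k=n_reads)
        counts = [0] * len(alpha)
        for gene in reads:
            counts[gene] += 1
        population.append((expression, counts))
    return population


# Hypothetical population parameters for three genes (not taken from the paper).
ALPHA = [8.0, 3.0, 1.0]
population = simulate_population(ALPHA, n_individuals=4, n_reads=1000)
for expression, counts in population:
    print([round(x, 3) for x in expression], counts)
```

Because every individual shares the same Dirichlet parameters, the simulated expression vectors vary between individuals but cluster around a common population profile, which is the commonality the abstract proposes to exploit.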

    Exposures to Air Pollutants during Pregnancy and Preterm Delivery

    The association between preterm delivery (PTD) and exposure to air pollutants has recently become a major concern. We investigated this relationship in Incheon, Republic of Korea, using spatial and temporal modeling to better infer individual exposures. The birth cohort consisted of 52,113 singleton births in 2001–2002, and data included residential address, gestational age, sex, birth date and order, and parental age and education. We used a geographic information system and kriging methods to construct spatial and temporal exposure models. Associations between exposure and PTD were evaluated using univariate and multivariate log-binomial regressions. Given the gestational age, birth date, and the mother’s residential address, we estimated each mother’s potential exposure to air pollutants during critical periods of the pregnancy. The adjusted risk ratios for PTD in the highest quartiles of the first trimester exposure were 1.26 [95% confidence interval (CI), 1.11–1.44] for carbon monoxide, 1.27 (95% CI, 1.04–1.56) for particulate matter with aerodynamic diameter ≤ 10 μm, 1.24 (95% CI, 1.09–1.41) for nitrogen dioxide, and 1.21 (95% CI, 1.04–1.42) for sulfur dioxide. The relationships between PTD and exposures to CO, NO2, and SO2 were dose dependent (p < 0.001, p < 0.02, p < 0.02, respectively). In addition, the results of our study indicated a significant association between air pollution and PTD during the third trimester of pregnancy. In conclusion, our study showed that relatively low concentrations of air pollution under current air quality standards during pregnancy may contribute to an increased risk of PTD. A biologic mechanism through increased prostaglandin levels that are triggered by inflammatory mediators during exposure periods is discussed.

    rnaSeqMap: a Bioconductor package for RNA sequencing data exploration

    BACKGROUND: The throughput of commercially available sequencers has recently increased significantly. It has reached the point where measuring RNA expression by the depth of coverage has become feasible even for the largest genomes. The development of software tools constantly follows the progress of biological hardware. In particular, genome browsers, exon junction tools, and statistical tools operating on counts of reads in predefined regions can all be regarded as RNA sequencing software. The library rnaSeqMap, freely available via Bioconductor, is an RNA sequencing software package that is independent of any biological hardware platform. It is based upon standard Bioconductor infrastructure for sequencing data and includes several novel features focused on deeper understanding of coverage expression profiles and discovery of novel transcription regions. RESULTS: rnaSeqMap is a toolbox for analyses that may be performed with the use of gene annotations or, alternatively, in an unsupervised mode on any genomic region to find novel or non-standard transcripts. The data back-end may be a MySQL database or a set of files in standard BAM format. The processing in R can be run on a machine without any particular hardware requirements, and scales linearly with the number of genomic loci and the number of samples analyzed. The main features of rnaSeqMap include coverage operations, discovery of irreducible regions of high expression, significance search, and splicing analyses with nucleotide granularity. CONCLUSIONS: This software may be used for a range of applications related to RNA sequencing by building customized analysis pipelines. Its applicability and precision are expected to increase in parallel with the progress of genome coverage in sequencers.

    Gene dispersion is the key determinant of the read count bias in differential expression analysis of RNA-seq data

    Background: In differential expression analysis of RNA-sequencing (RNA-seq) read count data for two sample groups, it is known that highly expressed genes (or longer genes) are more likely to be differentially expressed, which is called read count bias (or gene length bias). This bias has a great effect on the downstream Gene Ontology over-representation analysis. However, such a bias has not been systematically analyzed for different replicate types of RNA-seq data. Results: We show that the dispersion coefficient of a gene in the negative binomial modeling of read counts is the critical determinant of the read count bias (and gene length bias) by mathematical inference and tests for a number of simulated and real RNA-seq datasets. We demonstrate that the read count bias is mostly confined to data with small gene dispersions (e.g., technical replicates and some genetically identical replicates such as cell lines or inbred animals), and many biological replicate data from unrelated samples do not suffer from such a bias except for genes with some small counts. It is also shown that the sample-permuting GSEA method yields a considerable number of false positives caused by the read count bias, while the preranked method does not. Conclusion: We showed for the first time that small gene variance (similarly, dispersion) is the main cause of read count bias (and gene length bias), and analyzed the read count bias for different replicate types of RNA-seq data and its effect on gene-set enrichment analysis.
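One way to see why dispersion could govern the read count bias is to look at relative noise under the commonly used negative binomial parameterization, variance = mean + dispersion × mean². The abstract does not state this formula, so the parameterization, the function names, and the example values below are assumptions for illustration only.

```python
def nb_variance(mean, dispersion):
    """Variance of a negative binomial count, assuming var = mean + dispersion * mean^2."""
    return mean + dispersion * mean ** 2


def squared_cv(mean, dispersion):
    """Squared coefficient of variation; algebraically equal to 1/mean + dispersion."""
    return nb_variance(mean, dispersion) / mean ** 2


# Compare a low-count gene (mean 10) with a high-count gene (mean 1000)
# under a small and a large dispersion (hypothetical values).
for dispersion in (0.001, 0.5):
    low = squared_cv(10, dispersion)
    high = squared_cv(1000, dispersion)
    print(f"dispersion={dispersion}: CV^2 low-count={low:.4f}, "
          f"high-count={high:.4f}, ratio={low / high:.1f}")
```

With a small dispersion, the 1/mean term dominates, so the high-count gene is far less noisy in relative terms than the low-count gene and is easier to call differentially expressed. With a large dispersion, both genes have nearly the same relative noise, which is consistent with the abstract's observation that the bias is mostly confined to data with small gene dispersions.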

    Characterization and Comparison of the Leukocyte Transcriptomes of Three Cattle Breeds

    In this study, mRNA-Seq was used to characterize and compare the leukocyte transcriptomes from two taurine breeds (Holstein and Jersey), and one indicine breed (Cholistani). At the genomic level, we identified breed-specific base changes in protein coding regions. Among 7,793,425 coding bases, only 165 differed between Holstein and Jersey, and 3,383 (0.04%) differed between Holstein and Cholistani, 817 (25%) of which resulted in amino acid changes in 627 genes. At the transcriptional level, we assembled transcripts and estimated their abundances including those from more than 3,000 unannotated intergenic regions. Differential gene expression analysis showed a high similarity between Holstein and Jersey, and a much greater difference between the taurine breeds and the indicine breed. We identified gene ontology pathways that were systematically altered, including the electron transport chain and immune response pathways that may contribute to different levels of heat tolerance and disease resistance in taurine and indicine breeds. At the post-transcriptional level, sequencing mRNA allowed us to identify a number of genes undergoing differential alternative splicing among different breeds. This study provided a high-resolution survey of the variation between bovine transcriptomes at different levels and may provide important biological insights into the phenotypic differentiation among cattle breeds.

    Optimizing a Massive Parallel Sequencing Workflow for Quantitative miRNA Expression Analysis

    BACKGROUND: Massive Parallel Sequencing methods (MPS) can extend and improve the knowledge obtained by conventional microarray technology, both for mRNAs and short non-coding RNAs, e.g. miRNAs. The processing methods used to extract and interpret the information are an important aspect of dealing with the vast amounts of data generated from short read sequencing. Although the number of computational tools for MPS data analysis is constantly growing, their strengths and weaknesses as part of a complex analytical pipeline have not yet been well investigated. PRIMARY FINDINGS: A benchmark MPS miRNA dataset, resembling a situation in which miRNAs are spiked in biological replication experiments, was assembled by merging a publicly available MPS spike-in miRNA data set with MPS data derived from healthy donor peripheral blood mononuclear cells. Using this data set, we observed that short-read counts are strongly underestimated for duplicated miRNAs if the whole genome is used as the reference. Furthermore, the sensitivity of miRNA detection depends strongly on the primary tool used in the analysis. Among the six aligners tested, specifically devoted to miRNA detection, SHRiMP and MicroRazerS show the highest sensitivity. Differential expression estimation is quite efficient. Among the five tools investigated, two (DESeq, baySeq) show very good specificity and sensitivity in the detection of differential expression. CONCLUSIONS: The results provided by our analysis allow the definition of a clear and simple optimized analytical workflow for digital quantitative analysis of miRNAs.

    RSEM: accurate transcript quantification from RNA-Seq data with or without a reference genome

    Background: RNA-Seq is revolutionizing the way transcript abundances are measured. A key challenge in transcript quantification from RNA-Seq data is the handling of reads that map to multiple genes or isoforms. This issue is particularly important for quantification with de novo transcriptome assemblies in the absence of sequenced genomes, as it is difficult to determine which transcripts are isoforms of the same gene. A second significant issue is the design of RNA-Seq experiments, in terms of the number of reads, read length, and whether reads come from one or both ends of cDNA fragments. Results: We present RSEM, a user-friendly software package for quantifying gene and isoform abundances from single-end or paired-end RNA-Seq data. RSEM outputs abundance estimates, 95% credibility intervals, and visualization files and can also simulate RNA-Seq data. In contrast to other existing tools, the software does not require a reference genome. Thus, in combination with a de novo transcriptome assembler, RSEM enables accurate transcript quantification for species without sequenced genomes. On simulated and real data sets, RSEM has superior or comparable performance to quantification methods that rely on a reference genome. Taking advantage of RSEM's ability to effectively use ambiguously-mapping reads, we show that accurate gene-level abundance estimates are best obtained with large numbers of short single-end reads. On the other hand, estimates of the relative frequencies of isoforms within single genes may be improved through the use of paired-end reads, depending on the number of possible splice forms for each gene. Conclusions: RSEM is an accurate and user-friendly software tool for quantifying transcript abundances from RNA-Seq data. As it does not rely on the existence of a reference genome, it is particularly useful for quantification with de novo transcriptome assemblies. In addition, RSEM has enabled valuable guidance for cost-efficient design of quantification experiments with RNA-Seq, which is currently relatively expensive.

    Evaluating methods for ranking differentially expressed genes applied to microArray quality control data

    Background: Statistical methods for ranking differentially expressed genes (DEGs) from gene expression data should be evaluated with regard to high sensitivity, specificity, and reproducibility. In our previous studies, we evaluated eight gene ranking methods applied to only Affymetrix GeneChip data. A more general evaluation that also includes other microarray platforms, such as the Agilent or Illumina systems, is desirable for determining which methods are suitable for each platform and which method has better inter-platform reproducibility. Results: We compared the eight gene ranking methods using the MicroArray Quality Control (MAQC) datasets produced by five manufacturers: Affymetrix, Applied Biosystems, Agilent, GE Healthcare, and Illumina. The area under the curve (AUC) was used as a measure for both sensitivity and specificity. Although the highest AUC values can vary with the definition of "true" DEGs, the best methods were, in most cases, either the weighted average difference (WAD), rank products (RP), or intensity-based moderated t statistic (ibmT). The percentages of overlapping genes (POGs) across different test sites were mainly evaluated as a measure for both intra- and inter-platform reproducibility. The POG values for WAD were the highest overall, irrespective of the choice of microarray platform. The high intra- and inter-platform reproducibility of WAD was also observed at a higher biological function level. Conclusion: These results for the five microarray platforms were consistent with our previous ones based on 36 real experimental datasets measured using the Affymetrix platform. Thus, recommendations made using the MAQC benchmark data might be universally applicable.
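The two evaluation measures named in this abstract, AUC and the percentage of overlapping genes (POG), can be sketched as follows. This is a simplified illustration under assumed definitions: AUC is computed as the probability that a true DEG outscores a non-DEG, and POG as the share of genes common to the top-n lists from two test sites. The paper's exact definitions may differ, and the scores and gene names are made up.

```python
def auc(scores, is_true_deg):
    """Probability that a random true DEG outscores a random non-DEG (ties count half)."""
    pos = [s for s, t in zip(scores, is_true_deg) if t]
    neg = [s for s, t in zip(scores, is_true_deg) if not t]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def pog(ranked_a, ranked_b, top_n):
    """Percentage of genes shared by the top_n entries of two ranked gene lists."""
    shared = set(ranked_a[:top_n]) & set(ranked_b[:top_n])
    return 100.0 * len(shared) / top_n


# Hypothetical ranking scores and "true" DEG labels for five genes.
scores = [0.9, 0.8, 0.4, 0.3, 0.2]
truth = [True, True, False, True, False]
print(f"AUC = {auc(scores, truth):.3f}")

# Hypothetical ranked gene lists from two test sites.
site_1 = ["g1", "g2", "g3", "g4", "g5"]
site_2 = ["g2", "g1", "g5", "g6", "g3"]
print(f"POG (top 3) = {pog(site_1, site_2, top_n=3):.1f}%")
```

A ranking method with high AUC places true DEGs near the top at a single site, while a high POG indicates that the same genes reach the top across sites or platforms, so the two measures capture accuracy and reproducibility separately.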