376 research outputs found

    MGMR: leveraging RNA-Seq population data to optimize expression estimation

    Background: RNA-Seq is a technique that uses Next Generation Sequencing to identify transcripts and estimate transcription levels. When applying this technique for quantification, one must contend with reads that align to multiple positions in the genome (multireads). Previous efforts to resolve multireads have shown that RNA-Seq expression estimation can be improved using probabilistic allocation of reads to genes. These methods use a probabilistic generative model for data generation and resolve ambiguity using likelihood-based approaches. In many instances, RNA-seq experiments are performed in the context of a population. The generative models of current methods do not take into account such population information, and it is an open question whether this information can improve quantification of the individual samples. Results: In order to explore the contribution of population level information in RNA-seq quantification, we apply a hierarchical probabilistic generative model, which assumes that expression levels of different individuals are sampled from a Dirichlet distribution with parameters specific to the population, and reads are sampled from the distribution of expression levels. We introduce an optimization procedure for the estimation of the model parameters, and use HapMap data and simulated data to demonstrate that the model yields a significant improvement in the accuracy of expression levels of paralogous genes. Conclusions: We provide a proof of principle of the benefit of drawing on population commonalities to estimate expression. The results of our experiments demonstrate this approach can be beneficial, primarily for estimation at the gene level.
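    The hierarchical model described in this abstract can be illustrated with a small forward simulation. The sketch below is not the MGMR implementation and does not include the authors' optimization procedure. It only draws each individual's expression proportions from a shared Dirichlet distribution and then draws reads from those proportions. The function names, the example parameters, and the use of gene indices as read labels are all assumptions made for illustration.

```python
import random
from collections import Counter


def sample_dirichlet(alpha, rng):
    """Draw one point from Dirichlet(alpha) by normalising independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def simulate_population(alpha, n_individuals, reads_per_individual, seed=0):
    """Simulate (expression proportions, read counts per gene) for each individual.

    alpha is the hypothetical population-level Dirichlet parameter vector,
    one positive entry per gene.
    """
    rng = random.Random(seed)
    genes = list(range(len(alpha)))
    population = []
    for _ in range(n_individuals):
        # Individual expression levels are sampled from the population Dirichlet.
        theta = sample_dirichlet(alpha, rng)
        # Reads are sampled from the individual's expression distribution.
        reads = rng.choices(genes, weights=theta, k=reads_per_individual)
        counts = Counter(reads)
        population.append((theta, [counts.get(g, 0) for g in genes]))
    return population


if __name__ == "__main__":
    for theta, counts in simulate_population([5.0, 2.0, 1.0], 3, 1000):
        print([round(t, 3) for t in theta], counts)
```

    Larger entries in alpha make individuals resemble one another more closely, which is the population commonality the abstract proposes to exploit.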

    FusionFinder: A Software Tool to Identify Expressed Gene Fusion Candidates from RNA-Seq Data

    The hallmarks of many haematological malignancies and solid tumours are chromosomal translocations, which may lead to gene fusions. Recently, next-generation sequencing techniques at the transcriptome level (RNA-Seq) have been used to verify known and discover novel transcribed gene fusions. We present FusionFinder, a Perl-based software designed to automate the discovery of candidate gene fusion partners from single-end (SE) or paired-end (PE) RNA-Seq read data. FusionFinder was applied to data from a previously published analysis of the K562 chronic myeloid leukaemia (CML) cell line. Using FusionFinder we successfully replicated the findings of this study and detected additional previously unreported fusion genes in their dataset, which were confirmed experimentally. These included two isoforms of a fusion involving the genes BRK1 and VHL, whose co-deletion has previously been associated with the prevalence and severity of renal-cell carcinoma. FusionFinder is made freely available for non-commercial use and can be downloaded from the project website (http://bioinformatics.childhealthresearch.org.au/software/fusionfinder/)

    The Echinococcus canadensis (G7) genome: A key knowledge of parasitic platyhelminth human diseases

    Background: The parasite Echinococcus canadensis (G7) (phylum Platyhelminthes, class Cestoda) is one of the causative agents of echinococcosis. Echinococcosis is a worldwide chronic zoonosis affecting humans as well as domestic and wild mammals, which has been reported as a prioritized neglected disease by the World Health Organisation. No genomic data, comparative genomic analyses or efficient therapeutic and diagnostic tools are available for this severe disease. The information presented in this study will help to understand the peculiar biological characters and to design species-specific control tools. Results: We sequenced, assembled and annotated the 115-Mb genome of E. canadensis (G7). Comparative genomic analyses using whole genome data of three Echinococcus species not only confirmed the status of E. canadensis (G7) as a separate species but also demonstrated a high nucleotide sequences divergence in relation to E. granulosus (G1). The E. canadensis (G7) genome contains 11,449 genes with a core set of 881 orthologs shared among five cestode species. Comparative genomics revealed that there are more single nucleotide polymorphisms (SNPs) between E. canadensis (G7) and E. granulosus (G1) than between E. canadensis (G7) and E. multilocularis. This result was unexpected since E. canadensis (G7) and E. granulosus (G1) were considered to belong to the species complex E. granulosus sensu lato. We described SNPs in known drug targets and metabolism genes in the E. canadensis (G7) genome. Regarding gene regulation, we analysed three particular features: CpG island distribution along the three Echinococcus genomes, DNA methylation system and small RNA pathway. The results suggest the occurrence of yet unknown gene regulation mechanisms in Echinococcus. Conclusions: This is the first work that addresses Echinococcus comparative genomics. 
The resources presented here will promote the study of mechanisms of parasite development as well as new tools for drug discovery. The availability of a high-quality genome assembly is critical for fully exploring the biology of a pathogenic organism. The E. canadensis (G7) genome presented in this study provides a unique opportunity to address the genetic diversity among the genus Echinococcus and its particular developmental features. At present, there is no unequivocal taxonomic classification of Echinococcus species; however, the genome-wide SNPs analysis performed here revealed the phylogenetic distance among these three Echinococcus species. Additional cestode genomes need to be sequenced to be able to resolve their phylogeny.
    Affiliation: Maldonado, Lucas Luciano. Consejo Nacional de Investigaciones Científicas y Técnicas. Oficina de Coordinación Administrativa Houssay. Instituto de Investigaciones en Microbiología y Parasitología Médica. Universidad de Buenos Aires. Facultad de Medicina. Instituto de Investigaciones en Microbiología y Parasitología Médica; Argentina
    Affiliation: Assis, Juliana. Fundación Oswaldo Cruz; Brasil
    Affiliation: Gomes Araújo, Flávio M. Fundación Oswaldo Cruz; Brasil
    Affiliation: Salim, Anna C. M. Fundación Oswaldo Cruz; Brasil
    Affiliation: Macchiaroli, Natalia. Consejo Nacional de Investigaciones Científicas y Técnicas. Oficina de Coordinación Administrativa Houssay. Instituto de Investigaciones en Microbiología y Parasitología Médica. Universidad de Buenos Aires. Facultad de Medicina. Instituto de Investigaciones en Microbiología y Parasitología Médica; Argentina
    Affiliation: Cucher, Marcela Alejandra. Consejo Nacional de Investigaciones Científicas y Técnicas. Oficina de Coordinación Administrativa Houssay. Instituto de Investigaciones en Microbiología y Parasitología Médica. Universidad de Buenos Aires. Facultad de Medicina. Instituto de Investigaciones en Microbiología y Parasitología Médica; Argentina
    Affiliation: Camicia, Federico. Consejo Nacional de Investigaciones Científicas y Técnicas. Oficina de Coordinación Administrativa Houssay. Instituto de Investigaciones en Microbiología y Parasitología Médica. Universidad de Buenos Aires. Facultad de Medicina. Instituto de Investigaciones en Microbiología y Parasitología Médica; Argentina
    Affiliation: Fox, Adolfo. Consejo Nacional de Investigaciones Científicas y Técnicas. Oficina de Coordinación Administrativa Houssay. Instituto de Investigaciones en Microbiología y Parasitología Médica. Universidad de Buenos Aires. Facultad de Medicina. Instituto de Investigaciones en Microbiología y Parasitología Médica; Argentina
    Affiliation: Rosenzvit, Mara Cecilia. Consejo Nacional de Investigaciones Científicas y Técnicas. Oficina de Coordinación Administrativa Houssay. Instituto de Investigaciones en Microbiología y Parasitología Médica. Universidad de Buenos Aires. Facultad de Medicina. Instituto de Investigaciones en Microbiología y Parasitología Médica; Argentina
    Affiliation: Oliveira, Guilherme. Instituto Tecnológico Vale; Brasil. Fundación Oswaldo Cruz; Brasil
    Affiliation: Kamenetzky, Laura. Consejo Nacional de Investigaciones Científicas y Técnicas. Oficina de Coordinación Administrativa Houssay. Instituto de Investigaciones en Microbiología y Parasitología Médica. Universidad de Buenos Aires. Facultad de Medicina. Instituto de Investigaciones en Microbiología y Parasitología Médica; Argentina

    Improving gene-set enrichment analysis of RNA-Seq data with small replicates

    Deregulated pathways identified from transcriptome data of two sample groups have played a key role in many genomic studies. Gene-set enrichment analysis (GSEA) has been commonly used for pathway or functional analysis of microarray data, and it is also being applied to RNA-seq data. However, most RNA-seq data so far have only a small number of replicates. This forces the use of the gene-permuting GSEA method (or preranked GSEA), which results in a great number of false positives due to the inter-gene correlation in each gene-set. We demonstrate that incorporating the absolute gene statistic in one-tailed GSEA considerably improves the false-positive control and the overall discriminatory ability of the gene-permuting GSEA methods for RNA-seq data. To test the performance, a simulation method to generate correlated read counts within a gene-set was newly developed, and a dozen currently available RNA-seq enrichment analysis methods were compared, where the proposed methods outperformed others that do not account for the inter-gene correlation. Analysis of real RNA-seq data also supported the proposed methods in terms of false positive control, ranks of true positives and biological relevance. An efficient R package (AbsFilterGSEA) coded with C++ (Rcpp) is available from CRAN.
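    The idea of a one-tailed, gene-permuting test on absolute gene statistics can be sketched in a few lines. The example below is not the AbsFilterGSEA algorithm. It replaces the GSEA running-sum statistic with a plain mean of absolute scores, and the function name, arguments, and toy data are hypothetical. It only illustrates how taking absolute values lets a gene set with mixed up- and down-regulation score highly against randomly drawn gene sets of the same size.

```python
import random


def abs_gene_permutation_pvalue(scores, gene_set, n_perm=2000, seed=0):
    """One-tailed gene-permutation p-value for the mean absolute gene statistic.

    scores maps gene name -> signed statistic (for example a t-like score).
    gene_set is an iterable of gene names; genes without a score are ignored.
    """
    rng = random.Random(seed)
    abs_scores = {gene: abs(value) for gene, value in scores.items()}
    members = [gene for gene in gene_set if gene in abs_scores]
    if not members:
        raise ValueError("gene set shares no genes with the scored genes")
    k = len(members)
    observed = sum(abs_scores[g] for g in members) / k

    all_genes = list(abs_scores)
    hits = 0
    for _ in range(n_perm):
        # Gene permutation: draw a random gene set of the same size.
        sample = rng.sample(all_genes, k)
        null_value = sum(abs_scores[g] for g in sample) / k
        if null_value >= observed:
            hits += 1
    # Add-one correction so the estimated p-value is never exactly zero.
    return (hits + 1) / (n_perm + 1)


if __name__ == "__main__":
    toy = {f"g{i}": (5.0 if i < 5 else 0.1) * (-1) ** i for i in range(100)}
    print(abs_gene_permutation_pvalue(toy, [f"g{i}" for i in range(5)]))
```

    Because genes rather than samples are permuted, this kind of test remains usable with very few replicates, which is the setting the abstract addresses. It does not by itself correct for inter-gene correlation.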

    CodingQuarry: Highly accurate hidden Markov model gene prediction in fungal genomes using RNA-seq transcripts

    Background: The impact of gene annotation quality on functional and comparative genomics makes gene prediction an important process, particularly in non-model species, including many fungi. Sets of homologous protein sequences are rarely complete with respect to the fungal species of interest and are often small or unreliable, especially when closely related species have not been sequenced or annotated in detail. In these cases, protein homology-based evidence fails to correctly annotate many genes, or significantly improve ab initio predictions. Generalised hidden Markov models (GHMM) have proven to be invaluable tools in gene annotation and, recently, RNA-seq has emerged as a cost-effective means to significantly improve the quality of automated gene annotation. As these methods do not require sets of homologous proteins, improving gene prediction from these resources is of benefit to fungal researchers. While many pipelines now incorporate RNA-seq data in training GHMMs, there has been relatively little investigation into additionally combining RNA-seq data at the point of prediction, and room for improvement in this area motivates this study. Results: CodingQuarry is a highly accurate, self-training GHMM fungal gene predictor designed to work with assembled, aligned RNA-seq transcripts. RNA-seq data informs annotations both during gene-model training and in prediction. Our approach capitalises on the high quality of fungal transcript assemblies by incorporating predictions made directly from transcript sequences. Correct predictions are made despite transcript assembly problems, including those caused by overlap between the transcripts of adjacent gene loci. Stringent benchmarking against high-confidence annotation subsets showed CodingQuarry predicted 91.3% of Schizosaccharomyces pombe genes and 90.4% of Saccharomyces cerevisiae genes perfectly. 
These results are 4-5% better than those of AUGUSTUS, the next best performing RNA-seq driven gene predictor tested. Comparisons against whole genome Sc. pombe and S. cerevisiae annotations further substantiate a 4-5% improvement in the number of correctly predicted genes. Conclusions: We demonstrate the success of a novel method of incorporating RNA-seq data into GHMM fungal gene prediction. This shows that a high quality annotation can be achieved without relying on protein homology or a training set of genes. CodingQuarry is freely available (https://sourceforge.net/projects/codingquarry/), and suitable for incorporation into genome annotation pipelines

    SeqGene: a comprehensive software solution for mining exome- and transcriptome- sequencing data

    Background: The popularity of massively parallel exome and transcriptome sequencing projects demands new data mining tools with a comprehensive set of features to support a wide range of analysis tasks. Results: SeqGene, a new data mining tool, supports mutation detection and annotation, dbSNP and 1000 Genome data integration, RNA-Seq expression quantification, mutation and coverage visualization, allele specific expression (ASE), differentially expressed genes (DEGs) identification, copy number variation (CNV) analysis, and gene expression quantitative trait loci (eQTLs) detection. We also developed novel methods for testing the association between SNP and expression and identifying genotype-controlled DEGs. We showed that the results generated from SeqGene compare favourably to other existing methods in our case studies. Conclusion: SeqGene is designed as a general-purpose software package. It supports both paired-end reads and single reads generated on most sequencing platforms; it runs on all major types of computers; it supports arbitrary genome assemblies for arbitrary organisms; and it scales well to support both large and small scale sequencing projects. The software homepage is http://seqgene.sourceforge.net.

    Field pathogenomics reveals the emergence of a diverse wheat yellow rust population

    BACKGROUND: Emerging and re-emerging pathogens imperil public health and global food security. Responding to these threats requires improved surveillance and diagnostic systems. Despite their potential, genomic tools have not been readily applied to emerging or re-emerging plant pathogens such as the wheat yellow (stripe) rust pathogen Puccinia striiformis f. sp. tritici (PST). This is due largely to the obligate parasitic nature of PST, as culturing PST isolates for DNA extraction remains slow and tedious. RESULTS: To counteract the limitations associated with culturing PST, we developed and applied a field pathogenomics approach by transcriptome sequencing infected wheat leaves collected from the field in 2013. This enabled us to rapidly gain insights into this emerging pathogen population. We found that the PST population across the United Kingdom (UK) underwent a major shift in recent years. Population genetic structure analyses revealed four distinct lineages that correlated to the phenotypic groups determined through traditional pathology-based virulence assays. Furthermore, the genetic diversity between members of a single population cluster for all 2013 PST field samples was much higher than that displayed by historical UK isolates, revealing a more diverse population of PST. CONCLUSIONS: Our field pathogenomics approach uncovered a dramatic shift in the PST population in the UK, likely due to a recent introduction of a diverse set of exotic PST lineages. The methodology described herein accelerates genetic analysis of pathogen populations and circumvents the difficulties associated with obligate plant pathogens. In principle, this strategy can be widely applied to a variety of plant pathogens

    Noisy Splicing Drives mRNA Isoform Diversity in Human Cells

    While the majority of multiexonic human genes show some evidence of alternative splicing, it is unclear what fraction of observed splice forms is functionally relevant. In this study, we examine the extent of alternative splicing in human cells using deep RNA sequencing and de novo identification of splice junctions. We demonstrate the existence of a large class of low abundance isoforms, encompassing approximately 150,000 previously unannotated splice junctions in our data. Newly-identified splice sites show little evidence of evolutionary conservation, suggesting that the majority are due to erroneous splice site choice. We show that sequence motifs involved in the recognition of exons are enriched in the vicinity of unconserved splice sites. We estimate that the average intron has a splicing error rate of approximately 0.7% and show that introns in highly expressed genes are spliced more accurately, likely due to their shorter length. These results implicate noisy splicing as an important property of genome evolution

    Comparative analysis of neural transcriptomes and functional implication of unannotated intronic expression

    Background: The transcriptome and its regulation bridge the genome and the phenome. Recent RNA-seq studies unveiled complex transcriptomes with previously unknown transcripts and functions. To investigate the characteristics of neural transcriptomes and possible functions of previously unknown transcripts, we analyzed and compared nine recent RNA-seq datasets corresponding to tissues/organs ranging from stem cell, embryonic brain cortex to adult whole brain. Results: We found that the neural and stem cell transcriptomes share global similarity in both gene and chromosomal expression, but are quite different from those of liver or muscle. We also found an unusually high level of unannotated expression in mouse embryonic brains. The intronic unannotated expression was found to be strongly associated with genes annotated for neurogenesis, axon guidance, negative regulation of transcription, and neural transmission. These functions are the hallmarks of the late embryonic stage cortex, and crucial for synaptogenesis and neural circuit formation. Conclusions: Our results revealed unique global and local landscapes of neural transcriptomes. It also suggested potential functional roles for previously unknown transcripts actively expressed in the developing brain cortex. Our findings provide new insights into potentially novel genes, gene functions and regulatory mechanisms in early brain development.

    Population Differences in Transcript-Regulator Expression Quantitative Trait Loci

    Gene expression quantitative trait loci (eQTL) are useful for identifying single nucleotide polymorphisms (SNPs) associated with diseases. At times, a genetic variant may be associated with a master regulator involved in the manifestation of a disease. The downstream target genes of the master regulator are typically co-expressed and share biological function. Therefore, it is practical to screen for eQTLs by identifying SNPs associated with the targets of a transcript-regulator (TR). We used a multivariate regression with the gene expression of known targets of TRs and SNPs to identify TReQTLs in European (CEU) and African (YRI) HapMap populations. A nominal p-value of <1×10⁻⁶ revealed 234 SNPs in CEU and 154 in YRI as TReQTLs. These represent 36 independent (tag) SNPs in CEU and 39 in YRI affecting the downstream targets of 25 and 36 TRs respectively. At a false discovery rate (FDR) = 45%, one cis-acting tag SNP (within 1 kb of a gene) in each population was identified as a TReQTL. In CEU, the SNP (rs16858621) in Pcnxl2 was found to be associated with the genes regulated by CREM whereas in YRI, the SNP (rs16909324) was linked to the targets of miRNA hsa-miR-125a. To infer the pathways that regulate expression, we ranked TReQTLs by connectivity within the structure of biological process subtrees. One TReQTL SNP (rs3790904) in CEU maps to Lphn2 and is associated (nominal p-value = 8.1×10⁻⁷) with the targets of the X-linked breast cancer suppressor Foxp3. The structure of the biological process subtree and a gene interaction network of the TReQTL revealed that tumor necrosis factor, NF-kappaB and variants in G-protein coupled receptors signaling may play a central role as communicators in Foxp3 functional regulation. The potential pleiotropic effect of the Foxp3 TReQTLs was gleaned from integrating mRNA-Seq data and SNP-set enrichment into the analysis