78 research outputs found
Computational methods for transcriptome annotation and quantification using RNA-seq
High-throughput RNA sequencing (RNA-seq) promises a comprehensive picture of the transcriptome, allowing for the complete annotation and quantification of all genes and their isoforms across samples. Realizing this promise requires increasingly complex computational methods. These computational challenges fall into three main categories: (i) read mapping, (ii) transcriptome reconstruction and (iii) expression quantification. Here we explain the major conceptual and practical challenges, and the general classes of solutions for each category. Finally, we highlight the interdependence between these categories and discuss the benefits for different biological applications.
Metatranscriptomics captures dynamic shifts in mycorrhizal coordination in boreal forests
Carbon storage and cycling in boreal forests, the largest terrestrial carbon store, is moderated by complex interactions between trees and soil microorganisms. However, existing methods limit our ability to predict how changes in environmental conditions will alter these associations and the essential ecosystem services they provide. To address this, we developed a metatranscriptomic approach to analyze the impact of nutrient enrichment on Norway spruce fine roots and the community structure, function, and tree-microbe coordination of over 350 root-associated fungal species. In response to altered nutrient status, host trees redefined their relationship with the fungal community by reducing sugar efflux carriers and enhancing defense processes. This resulted in a profound restructuring of the fungal community and a collapse in functional coordination between the tree and the dominant Basidiomycete species, and an increase in functional coordination with versatile Ascomycete species. As such, there was a functional shift in community dominance from Basidiomycetes species, with important roles in enzymatically cycling recalcitrant carbon, to Ascomycete species that have melanized cell walls that are highly resistant to degradation. These changes were accompanied by prominent shifts in transcriptional coordination between over 60 predicted fungal effectors, with more than 5,000 Norway spruce transcripts, providing mechanistic insight into the complex molecular dialogue coordinating host trees and their fungal partners. The host-microbe dynamics captured by this study functionally inform how these complex and sensitive biological relationships may mediate the carbon storage potential of boreal soils under changing nutrient conditions.
Exploiting Nucleotide Composition to Engineer Promoters
The choice of promoter is a critical step in optimizing the efficiency and stability of recombinant protein production in mammalian cell lines. Artificial promoters that provide stable expression across cell lines and can be designed to the desired strength constitute an alternative to the use of viral promoters. Here, we show how the nucleotide characteristics of highly active human promoters can be modelled via the genome-wide frequency distribution of short motifs: by overlapping motifs that occur infrequently in the genome, we constructed contiguous sequence that is rich in GC and CpGs, both features of known promoters, but lacking homology to real promoters. We show that snippets from this sequence, at 100 base pairs or longer, drive gene expression in vitro in a number of mammalian cells, and are thus candidates for use in protein production. We further show that expression is driven by the general transcription factors TFIIB and TFIID, both being ubiquitously present across cell types, which results in less tissue- and species-specific regulation compared to the viral promoter SV40. We lastly found that the strength of a promoter can be tuned up and down by modulating the counts of GC and CpGs in localized regions. These results constitute a "proof-of-concept" for custom-designing promoters that are suitable for biotechnological and medical applications.
Serendipitous meta-transcriptomics : the fungal community of Norway Spruce (Picea abies)
After performing de novo transcript assembly of >1 billion RNA-Sequencing reads obtained
from 22 samples of different Norway spruce (Picea abies) tissues that were not surface sterilized,
we found that assembled sequences captured a mix of plant, lichen, and fungal transcripts.
The latter were likely expressed by endophytic and epiphytic symbionts, indicating
that these organisms were present, alive, and metabolically active. Here, we show that these
serendipitously sequenced transcripts need not be considered merely as contamination, as is
common, but that they provide insight into the plant's phyllosphere. Notably, we could classify
these transcripts as originating predominantly from Dothideomycetes and Leotiomycetes species,
with functional annotation of gene families indicating active growth and metabolism, with
particular regard to glucose intake and processing, as well as gene regulation.
S1 Fig. Samples collected from Norway spruce. For each sample a brief description and sample
ID are shown below a representative image of the associated plant tissue, while the sampling
date is shown above.
S2 Fig. Bioinformatics workflow of RNA data processing. We assembled reads from all samples
into a single assembly (left column), computed Tau scores, GC content, and mapped the
transcripts to the genome as well as to the UniRef90 protein database. To enrich for fungal
transcripts (right column), we applied GC content and expression breadth filters to the reads
and assembly respectively, clustered sequences by similarity, and performed functional annotation
as well as phylogenetic analyses.
S3 Fig. Putative taxonomic characterization of transcripts via protein alignments. Bar plot
showing the number of transcripts by taxonomy (super)kingdoms. Parent summarises taxa
hierarchically higher than the represented (super)kingdoms, NA summarises transcripts with
no sequence similarity in the UniRef90 database. The number of transcripts is indicated at the
top of every bar.
S4 Fig. Taxonomic class and phylum of the fungal transcripts. (a) Number of transcripts per
fungal phylum. The phyla are sorted by abundance top to bottom with Ascomycota
(n = 81,181) and Basidiomycota (n = 4,839) being the most represented; the remaining phyla
varying from n = 11 to n = 2. (b) A graph of the taxonomic hierarchy from species to phylum
of the fungal transcripts, showing the broad species diversity of the largest clusters: Ascomycota
(bottom) and Basidiomycota (top). (c) Similar to (a) for the fungal classes, with the Eurotiomycetes
and Dothideomycetes classes being over-represented among the fungal transcripts. (d)
Similar to (b) for the fungal classes (n = 24).
S5 Fig. Characterisation of transcripts lacking taxonomic assignment by their GMAP alignments
to the P. abies genome. (a) Boxplot of the tau scores for the no taxon transcripts split
based on their GMAP alignments to the P. abies genome. The tau score ranges from 1 for complete
specificity to 0 for equal expression in all samples. The transcripts having a GMAP alignment
in the genome (99% of the GMAP hits cover 80% of the transcripts with at least a 90%
identity) show a wide tau score distribution indicative of the presence of ubiquitously expressed
transcripts as well as that of more tissue-specific transcripts. The transcripts having no GMAP
alignment show a distribution typical of only tissue-specific expression (mean tau score of 0.98). (b) Percentage GC density distribution of the no taxon transcripts split based on their
GMAP alignments to the P. abies genome. Transcripts having a GMAP alignment to the
genome present a GC distribution typical of the P. abies transcripts. The transcripts without a
GMAP alignment show a distribution enriched for higher percentage GC, similar to that of
fungi. The shoulder observed under the peak of transcripts with GMAP alignments may indicate
transcripts where the assembly contained gaps or created chimeras. (c) Scatterplot of log2
FPKM expression values vs. the percentage GC content for the transcripts with a GMAP alignment.
Colouring indicates density, which is shaded from yellow (high) to blue (low). The
expression of transcripts with a GMAP alignment resembles that of the Embryophyta phylum.
(d) Scatterplot of log2 FPKM expression values vs. the percentage GC content for transcripts
without a GMAP alignment. Colouring as in (c). The expression of transcripts with no GMAP
alignment resembled that of the fungal kingdom.
S6 Fig. Phylogeny built on four nuclear genes. Shown are maximum-likelihood phylogenies
based on fungal nucleotide sequences assembled from the spruce samples in context of known
sequences, with highest sequence similarity to: (a) phosphoenolpyruvate carboxykinase; (b)
NADP-dependent medium chain alcohol dehydrogenase; (c) beta lactamase; and (d) unspecific
lipid transporter. Only branches with support values > 0.9 are shown. While clusters with more
representative sequences yield better branch support (a, b), placement of clusters with fewer
sequences is less certain (c, d). However, in all cases, at least one sequence is grouped with
Dothideomycetes, and for (a, b) with Leotiomycetes.
S1 Table. Sample IDs, description, and ENA submission IDs. Correspondence between the
sample IDs as described in Nystedt et al. (2013), this manuscript and the ENA are shown in
columns one to three. The fourth column contains a succinct description of the samples; refer
to Nystedt et al. (2013) for full details.
http://www.plosone.org
An Improved Canine Genome and a Comprehensive Catalogue of Coding Genes and Non-Coding Transcripts
The domestic dog, Canis familiaris, is a well-established model system for mapping trait and disease loci. While the original draft sequence was of good quality, gaps were abundant particularly in promoter regions of the genome, negatively impacting the annotation and study of candidate genes. Here, we present an improved genome build, canFam3.1, which includes 85 MB of novel sequence and now covers 99.8% of the euchromatic portion of the genome. We also present multiple RNA-Sequencing data sets from 10 different canine tissues to catalog ~175,000 expressed loci. While about 90% of the coding genes previously annotated by EnsEMBL have measurable expression in at least one sample, the number of transcript isoforms detected by our data expands the EnsEMBL annotations by a factor of four. Syntenic comparison with the human genome revealed an additional ~3,000 loci that are characterized as protein coding in human and were also expressed in the dog, suggesting that those were previously not annotated in the EnsEMBL canine gene set. In addition to ~20,700 high-confidence protein coding loci, we found ~4,600 antisense transcripts overlapping exons of protein coding genes, ~7,200 intergenic multi-exon transcripts without coding potential, likely candidates for long intergenic non-coding RNAs (lincRNAs) and ~11,000 transcripts were reported by two different library construction methods but did not fit any of the above categories. Of the lincRNAs, about 6,000 have no annotated orthologs in human or mouse. Functional analysis of two novel transcripts with shRNA in a mouse kidney cell line altered cell morphology and motility. All in all, we provide a much-improved annotation of the canine genome and suggest regulatory functions for several of the novel non-coding transcripts.
Genomic Analysis of the Basal Lineage Fungus Rhizopus oryzae Reveals a Whole-Genome Duplication
Rhizopus oryzae is the primary cause of mucormycosis, an emerging, life-threatening infection characterized by rapid angioinvasive growth with an overall mortality rate that exceeds 50%. As a representative of the paraphyletic basal group of the fungal kingdom called "zygomycetes," R. oryzae is also used as a model to study fungal evolution. Here we report the genome sequence of R. oryzae strain 99-880, isolated from a fatal case of mucormycosis. The highly repetitive 45.3 Mb genome assembly contains abundant transposable elements (TEs), comprising approximately 20% of the genome. We predicted 13,895 protein-coding genes not overlapping TEs, many of which are paralogous gene pairs. The order and genomic arrangement of the duplicated gene pairs and their common phylogenetic origin provide evidence for an ancestral whole-genome duplication (WGD) event. The WGD resulted in the duplication of nearly all subunits of the protein complexes associated with respiratory electron transport chains, the V-ATPase, and the ubiquitin-proteasome systems. The WGD, together with recent gene duplications, resulted in the expansion of multiple gene families related to cell growth and signal transduction, as well as secreted aspartic protease and subtilase protein families, which are known fungal virulence factors. The duplication of the ergosterol biosynthetic pathway, especially the major azole target, lanosterol 14α-demethylase (ERG11), could contribute to the variable responses of R. oryzae to different azole drugs, including voriconazole and posaconazole. Expanded families of cell-wall synthesis enzymes, essential for fungal cell integrity but absent in mammalian hosts, reveal potential targets for novel and R. oryzae-specific diagnostic and therapeutic treatments.
- …