Abstract

<p>These data files accompany the bioRxiv preprint "The genome of the tardigrade Hypsibius dujardini".</p>
<p>Edinburgh genome assembly and annotation<br> ========================================</p>
<p>1. nHd.2.3.abv500.fna.gz - Edinburgh (EDI) genome assembly version 2.3. Reads were first assembled as single-end with CLC to calculate the insert size distributions of the libraries and to check for contaminants. Insert size distributions were calculated by mapping the reads back to the assembly with CLC. The insert size distribution of the MP library was not normal. The single-end assembly was checked for contamination using the blobtools software package, which creates a TAGC plot. Inspection of the TAGC plot revealed multiple contaminants with distinct coverage and GC content that did not have a reference genome in public databases. The PE reads were normalised with one-pass khmer and assembled with Velvet using a k-mer size of 55. Contaminants in the Velvet assembly were identified based on the coverage and GC content of the scaffolds. The non-normalised reads were mapped to the assembly using CLC, and reads were removed if either read of a pair mapped to a contig identified as a contaminant. The process was repeated two more times, since newly assembled contaminants could be identified. Gaps in the final assembly were filled using GapFiller. Finally, the MP library was used to scaffold the gap-filled assembly with SSPACE, accepting only the information from reads mapping 2 kb from the ends of the scaffolds. The final assembly spans 140 megabases (Mb) with a median coverage of 86X.</p>
<p>2. nHd.2.3.1.aug.gff.gz - Gene model GFF file as predicted by Augustus for the nHd.2.3 genome assembly. Augustus was run as a second-pass annotation (using the transcriptome assembly as evidence) after a first-pass MAKER run (see below).</p>
<p>3. nHd.2.3.1.aug.proteins.fasta.gz - Protein fasta file generated by Augustus for the nHd.2.3 genome assembly.</p>
<p>4.
nHd.2.3.1.aug.transcripts.fasta.gz - Transcript CDS fasta file generated by Augustus for the nHd.2.3 genome assembly.</p>
<p><br> Edinburgh genome assembly and annotation - intermediate files<br> =============================================================</p>
<p>1. nHd.1.0.contigs.cov.fna.gz - Preliminary assembly of all data, without any contamination screening.</p>
<p>2. maker1.gff3.gz - Gene model GFF file generated by MAKER, run as a first pass to generate enough genes to train gene finders more thoroughly.</p>
<p>3. all.maker.proteins.edit.fasta.gz - Protein fasta file generated by MAKER run as a first pass.</p>
<p>4. all.maker.transcripts.edit.fasta.gz - Transcript CDS file generated by MAKER run as a first pass.</p>
<p>Blob plots<br> ==========</p>
<p>1. nHd.2.3.nHd_lib350-cov.BlobDB.json.gz - A blobDB (a JSON file generated using the blobtools package) which contains mapping, assembly and taxonomic information for the Edinburgh assembly and our read data. http://drl.github.io/blobtools/</p>
<p>2. nHd.1.0.BlobDB.json.gz - A blobDB (a JSON file generated using the blobtools package) which contains mapping, assembly and taxonomic information for the Edinburgh preliminary assembly nHd.1.0 and Edinburgh read data. http://drl.github.io/blobtools/</p>
<p>3. unc.TG-cov.BlobDB.json.gz - A blobDB (a JSON file generated using the blobtools package) which contains mapping, assembly and taxonomic information for the UNC assembly and their read data. http://drl.github.io/blobtools/</p>
<p>4. unc.nHd-cov.uniref.nt.BlobDB.json.gz - A blobDB (a JSON file generated using the blobtools package) which contains mapping, assembly and taxonomic information for the UNC assembly and the Edinburgh read data. http://drl.github.io/blobtools/</p>
<p>5. tardi_RNASeq.vs.unc.bam.reads_cov.catcolour.txt.gz - Space-delimited text file classifying each UNC scaffold by the average per-base coverage of polyA-selected RNA-Seq reads.</p>
<p>6.
tardi_RNASeq.vs.nHd.2.3.bam.reads_cov.catcolour.txt.gz - Space-delimited text file classifying each Edinburgh scaffold by the average per-base coverage of polyA-selected RNA-Seq reads.</p>
<p>H. dujardini transcriptome data<br> ===============================</p>
<p>1. Trinity.fasta.c99.gz - Preliminary transcriptome assembly by Itai Yanai's lab. Please do not use in any publications without checking with yanailab.technion.ac.il first.</p>
<p>Abstract of bioRxiv paper at http://dx.doi.org/10.1101/033464</p>
<p>====================================== <br> The genome of the tardigrade Hypsibius dujardini <br> ======================================</p>
<p>Background: Tardigrades are meiofaunal ecdysozoans that may be key to understanding the origins of Arthropoda. Many species of Tardigrada can survive extreme conditions through adoption of a cryptobiotic state. A recent high-profile paper suggested that the genome of a model tardigrade, Hypsibius dujardini, has been shaped by unprecedented levels of horizontal gene transfer (HGT) encompassing 17% of protein-coding genes, and speculated that this was likely formative in the evolution of stress resistance. We tested these findings using an independently sequenced and assembled genome of H. dujardini, derived from the same original culture isolate.</p>
<p>Results: Whole-organism sampling of meiofaunal species will perforce include gut and surface microbiotal contamination, and our raw data contained bacterial and algal sequences. Careful filtering generated a cleaned H. dujardini genome assembly, validated and annotated with GSSs, ESTs and RNA-Seq data, with superior assembly metrics compared to the published, HGT-rich assembly. A small amount of additional microbial contamination likely remains in our 135 Mb assembly. Our assembly length fits well with multiple empirical measurements of H. dujardini genome size, and is 120 Mb shorter than the HGT-rich version.
Among 23,021 protein-coding gene predictions we found 216 genes (0.9%) with similarity to prokaryotes, 196 of which were expressed, suggestive of HGT. We also identified ~400 genes (<2%) that could be HGT from other non-metazoan eukaryotes. Cross-comparison of the assemblies, using raw read and RNA-Seq data, confirmed that the overwhelming majority of the putative HGT candidates in the previous genome were predicted from scaffolds at very low coverage and were not transcribed. Crucially, much of the natural contamination in both projects was non-overlapping, confirming it as foreign to the shared target animal genome.</p>
<p>Conclusions: We find no support for massive horizontal gene transfer into the genome of H. dujardini. Many of the bacterial sequences in the previously published genome were not present in our raw reads. In construction of our assembly we removed most, but still not all, contamination with approaches derived from metagenomics, which we show are very appropriate for meiofaunal species. We conclude that HGT into H. dujardini accounts for 1-2% of genes and that the proposal that 17% of tardigrade genes originate from HGT events is an artefact of undetected contamination.</p>
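
The read-filtering rule described for the nHd.2.3 assembly (a read pair is discarded if either of its reads maps to a contig flagged as a contaminant) can be illustrated with a minimal Python sketch. This is not the authors' pipeline, which used CLC mappings. The function name `filter_read_pairs`, the tuple layout of the input, and the contig names in the example are all hypothetical and chosen only for illustration.

```python
from typing import Iterable, List, Optional, Set, Tuple

# Hypothetical input record: (pair_id, contig hit by read 1, contig hit by read 2).
# None means that read did not map to any contig.
PairMapping = Tuple[str, Optional[str], Optional[str]]


def filter_read_pairs(
    pairs: Iterable[PairMapping],
    contaminant_contigs: Set[str],
) -> List[str]:
    """Return IDs of pairs in which neither read maps to a contaminant contig."""
    kept = []
    for pair_id, contig1, contig2 in pairs:
        # Drop the whole pair if either mate hits a flagged contig.
        if contig1 in contaminant_contigs or contig2 in contaminant_contigs:
            continue
        kept.append(pair_id)
    return kept


# Toy example with made-up contig names; "bact7" stands in for a contig
# flagged as contaminant from its coverage and GC content.
example_pairs = [
    ("p1", "c1", "c1"),
    ("p2", "c1", "bact7"),
    ("p3", None, "c2"),
    ("p4", "bact7", None),
]
kept = filter_read_pairs(example_pairs, {"bact7"})
print(kept)
```

In the workflow described above, the retained reads would then be reassembled and the screening repeated, because contaminant contigs that only become visible after reassembly can be flagged in a later round.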

oai:figshare.com:article/6357127 (last updated on 8/13/2018)

This paper was published in FigShare.
