biomedical and healthCAre Data Discovery Index Ecosystem
Title: The genome of the tardigrade Hypsibius dujardini      
dateReleased:
01-25-2016
privacy:
information not available
aggregation:
instance of dataset
dateCreated:
01-25-2016
refinement:
raw
ID:
doi:10.5281/ZENODO.45162
creators:
Koutsovoulos, Georgios
Blaxter, Mark
Kumar, Sujai
Laetsch, Dominik R
Stevens, Lewis
Daub, Jennifer
Conlon, Claire
Maroon, Habib
Thomas, Fran
Aboobaker, Aziz
availability:
available
types:
other
description:
These data files accompany the bioRxiv preprint "The genome of the tardigrade Hypsibius dujardini"

Edinburgh genome assembly and annotation
========================================
1. nHd.2.3.abv500.fna.gz - Edinburgh (EDI) genome assembly version 2.3. Reads were assembled as single-end with CLC to calculate the insert size distributions of the libraries and check for contaminants. Insert size distributions are calculated by mapping the reads back to the assembly with CLC. The MP library insert distribution wasn't normally distributed. The single-end assembly is checked for contamination using the blobtools software package which creates a TAGC plot. Inspection of the TAGC plot revealed multiple contaminations with distinct coverage and GC content that did not have a reference genome in public databases. The PE reads were normalised with one-pass khmer and were assembled with Velvet using a k-mer size of 55. Contaminants in the Velvet assembly were identified based on the coverage and GC of the scaffolds. The non-normalised reads were mapped to the assembly using CLC and reads were removed if either pair mapped to a contig identified as contaminant. The process was repeated two more times since newly assembled contaminants could be identified. Gaps were filled in the final assembly using GapFiller. Finally the MP library was used to scaffold the gap-filled assembly with SSPACE, accepting only the information from reads mapping 2 kb from the ends of the scaffolds. The final assembly spans 140 megabases (Mb) with median coverage of 86X.
2. nHd.2.3.1.aug.gff.gz - Gene model GFF file as predicted by Augustus for nHd.2.3 genome assembly. This is Augustus run as a second pass annotation (using transcriptome assembly as evidence) after a first pass Maker (see below)
3. nHd.2.3.1.aug.proteins.fasta.gz - Protein fasta file generated by Augustus for nHd.2.3 genome assembly.
4. nHd.2.3.1.aug.transcripts.fasta.gz - Transcript CDS fasta file generated by Augustus for nHd.2.3 genome assembly.

Edinburgh genome assembly and annotation - intermediate files
=============================================================
1. nHd.1.0.contigs.cov.fna.gz - Preliminary assembly of all data, without any contamination screening
2. maker1.gff3.gz - Gene model GFF file as generated by MAKER run as a first pass to generate enough genes to train genefinders more thoroughly
3. all.maker.proteins.edit.fasta.gz - Protein fasta file generated by MAKER run as a first pass.
4. all.maker.transcripts.edit.fasta.gz - Transcript CDS file generated by MAKER run as a first pass.

Blob plots
==========
1. nHd.2.3.nHd_lib350-cov.BlobDB.json.gz - A blobDB (a JSON file generated using the blobtools package) which contains mapping, assembly and taxonomic information for the Edinburgh assembly and our read data. http://drl.github.io/blobtools/
2. nHd.1.0.BlobDB.json.gz - A blobDB (a JSON file generated using the blobtools package) which contains mapping, assembly and taxonomic information for the Edinburgh preliminary assembly nHd.1.0 and Edinburgh read data. http://drl.github.io/blobtools/
3. unc.TG-cov.BlobDB.json.gz - A blobDB (a JSON file generated using the blobtools package) which contains mapping, assembly and taxonomic information for the UNC assembly and their read data. http://drl.github.io/blobtools/
4. unc.nHd-cov.uniref.nt.BlobDB.json.gz - A blobDB (a JSON file generated using the blobtools package) which contains mapping, assembly and taxonomic information for the UNC assembly and the Edinburgh read data. http://drl.github.io/blobtools/
5. tardi_RNASeq.vs.unc.bam.reads_cov.catcolour.txt.gz - Space delimited text file with classification of each UNC scaffold by avg coverage of each base by PolyA-selected RNAseq reads
6. tardi_RNASeq.vs.nHd.2.3.bam.reads_cov.catcolour.txt.gz - Space delimited text file with classification of each Edinburgh scaffold by avg coverage of each base by PolyA-selected RNAseq reads

H dujardini transcriptome data
==============================
1. Trinity.fasta.c99.gz - Preliminary transcriptome assembly by Itai Yanai's lab. Please do not use in any publications without checking with yanailab.technion.ac.il first

Abstract of bioRxiv paper at http://dx.doi.org/10.1101/033464
======================================
The genome of the tardigrade Hypsibius dujardini
======================================
Background: Tardigrades are meiofaunal ecdysozoans that may be key to understanding the origins of Arthropoda. Many species of Tardigrada can survive extreme conditions through adoption of a cryptobiotic state. A recent high profile paper suggested that the genome of a model tardigrade, Hypsibius dujardini, has been shaped by unprecedented levels of horizontal gene transfer (HGT) encompassing 17% of protein coding genes, and speculated that this was likely formative in the evolution of stress resistance. We tested these findings using an independently sequenced and assembled genome of H. dujardini, derived from the same original culture isolate.

Results: Whole-organism sampling of meiofaunal species will perforce include gut and surface microbiotal contamination, and our raw data contained bacterial and algal sequences. Careful filtering generated a cleaned H. dujardini genome assembly, validated and annotated with GSSs, ESTs and RNA-Seq data, with superior assembly metrics compared to the published, HGT-rich assembly. A small amount of additional microbial contamination likely remains in our 135 Mb assembly. Our assembly length fits well with multiple empirical measurements of H. dujardini genome size, and is 120 Mb shorter than the HGT-rich version. Among 23,021 protein coding gene predictions we found 216 genes (0.9%) with similarity to prokaryotes, 196 of which were expressed, suggestive of HGT. We also identified ~400 genes (<2%) that could be HGT from other non-metazoan eukaryotes. Cross-comparison of the assemblies, using raw read and RNA-Seq data, confirmed that the overwhelming majority of the putative HGT candidates in the previous genome were predicted from scaffolds at very low coverage and were not transcribed. Crucially much of the natural contamination in both projects was non-overlapping, confirming it as foreign to the shared target animal genome.

Conclusions: We find no support for massive horizontal gene transfer into the genome of H. dujardini. Many of the bacterial sequences in the previously published genome were not present in our raw reads. In construction of our assembly we removed most, but still not all, contamination with approaches derived from metagenomics, which we show are very appropriate for meiofaunal species. We conclude that HGT into H. dujardini accounts for 1-2% of genes and that the proposal that 17% of tardigrade genes originate from HGT events is an artefact of undetected contamination.
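
Illustrative sketch (not part of the deposited files): the description of the nHd.2.3 assembly says that contaminant contigs were identified from coverage and GC content, and that read pairs were removed if either mate mapped to a contaminant contig. The Python below is one minimal way that filtering logic could look. It is not the authors' pipeline. The rectangular GC/coverage window, the threshold values, and the names `Contig`, `flag_contaminants` and `keep_clean_pairs` are all assumptions made for illustration, and the read-to-contig mappings are toy data rather than output from CLC.

```python
from typing import Iterable, List, NamedTuple, Optional, Set, Tuple


class Contig(NamedTuple):
    name: str
    gc: float        # GC fraction of the contig, between 0 and 1
    coverage: float  # mean read coverage of the contig


def flag_contaminants(
    contigs: Iterable[Contig],
    gc_range: Tuple[float, float],
    cov_range: Tuple[float, float],
) -> Set[str]:
    """Return names of contigs falling outside the assumed target window.

    Assumption: a simple rectangular window in (GC, coverage) space stands in
    for the inspection of blob plots described in the text.
    """
    gc_lo, gc_hi = gc_range
    cov_lo, cov_hi = cov_range
    return {
        c.name
        for c in contigs
        if not (gc_lo <= c.gc <= gc_hi and cov_lo <= c.coverage <= cov_hi)
    }


def keep_clean_pairs(
    pairs: Iterable[Tuple[str, Optional[str], Optional[str]]],
    contaminants: Set[str],
) -> List[str]:
    """Keep read pairs where neither mate maps to a contaminant contig.

    Each pair is (pair_id, contig_of_mate1, contig_of_mate2); None means the
    mate did not map anywhere.
    """
    kept = []
    for pair_id, mate1_contig, mate2_contig in pairs:
        if mate1_contig in contaminants or mate2_contig in contaminants:
            continue  # drop the whole pair if either mate hits a contaminant
        kept.append(pair_id)
    return kept


# Toy data with hypothetical thresholds (not values from the study).
contigs = [
    Contig("c1", gc=0.45, coverage=90.0),
    Contig("c2", gc=0.68, coverage=12.0),  # high GC, low coverage
    Contig("c3", gc=0.44, coverage=80.0),
]
contaminants = flag_contaminants(contigs, gc_range=(0.35, 0.55), cov_range=(40.0, 150.0))
pairs = [
    ("r1", "c1", "c3"),
    ("r2", "c1", "c2"),
    ("r3", "c2", None),
    ("r4", "c3", None),
]
clean = keep_clean_pairs(pairs, contaminants)
print(sorted(contaminants), clean)
```

In the workflow described above this screening was repeated after reassembly, because newly assembled contaminants could then be identified. In this sketch that would correspond to reassembling the kept reads and calling both functions again on the new contigs.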
accessURL: https://doi.org/10.5281/ZENODO.45162
storedIn:
Zenodo
qualifier:
not compressed
format:
HTML
accessType:
landing page
authentication:
none
authorization:
none
abbreviation:
ZENODO
homePage: https://zenodo.org/
ID:
SCR:004129
name:
ZENODO
