Automated download and clean-up of family-specific databases for kmer-based virus identification.

Author: Allesøe Rosa L
Clausen Philip TLC
Cotten Matthew
Florensa Alfred F
Koopmans Marion PG
Lemvigh Camilla K
Lund Ole
Phan My VT
Publication venue: Oxford University Press (OUP)
Publication date: 01/01/2021

SUMMARY: Here, we present an automated pipeline for Download Of NCBI Entries (DONE) and continuous updating of a local sequence database based on user-specified queries. The database can be created with either protein or nucleotide sequences containing all entries or complete genomes only. The pipeline can automatically clean the database by removing entries with matches to a database of user-specified sequence contaminants. The default contamination entries include sequences from the UniVec database of plasmids, marker genes and sequencing adapters from NCBI, an E. coli genome, rRNA sequences, vectors and satellite sequences. Furthermore, duplicates are removed and the database is automatically screened for sequences from green fluorescent protein, luciferase and antibiotic resistance genes that might be present in some GenBank viral entries, and could lead to false positives in virus identification. For utilizing the database, we present a useful opportunity for dealing with possible human contamination. We show the applicability of DONE by downloading a virus database comprising 37 virus families. We observed an average increase of 16 776 new entries downloaded per month for the 37 families. In addition, we demonstrate the utility of a custom database compared to a standard reference database for classifying both simulated and real sequence data. AVAILABILITY AND IMPLEMENTATION: The DONE pipeline for downloading and cleaning is deposited in a publicly available repository (https://bitbucket.org/genomicepidemiology/done/src/master/). SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
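The clean-up steps named in the summary (removing duplicates and dropping entries that match user-specified contaminants) can be illustrated with a minimal Python sketch. This is not the DONE implementation: the abstract does not say how matches are detected, so exact k-mer sharing is used here as a stand-in, and the function names, the k-mer length `k` and the `max_shared` threshold are all hypothetical choices made for illustration.

```python
from typing import Dict, Iterable, Set


def parse_fasta(text: str) -> Dict[str, str]:
    """Parse FASTA text into {header: sequence}, with sequences upper-cased."""
    records: Dict[str, str] = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:]
            records[header] = ""
        elif header is not None:
            records[header] += line.upper()
    return records


def kmers(seq: str, k: int) -> Set[str]:
    """Return the set of all length-k substrings of seq (empty if seq is shorter than k)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def clean_database(
    entries: Dict[str, str],
    contaminants: Iterable[str],
    k: int = 8,             # hypothetical k-mer length
    max_shared: float = 0.5,  # hypothetical threshold, not taken from the paper
) -> Dict[str, str]:
    """Drop exact duplicate sequences and entries sharing too many k-mers with contaminants."""
    contaminant_kmers: Set[str] = set()
    for seq in contaminants:
        contaminant_kmers |= kmers(seq, k)

    seen: Set[str] = set()
    kept: Dict[str, str] = {}
    for header, seq in entries.items():
        if seq in seen:
            continue  # exact duplicate of an earlier entry
        seen.add(seq)
        entry_kmers = kmers(seq, k)
        if entry_kmers:
            shared = len(entry_kmers & contaminant_kmers) / len(entry_kmers)
            if shared > max_shared:
                continue  # looks like a contaminant under this toy criterion
        kept[header] = seq
    return kept
```

Given a FASTA string and a list of contaminant sequences, `clean_database(parse_fasta(text), contaminants)` returns only the first copy of each distinct sequence that falls below the sharing threshold. A production pipeline would more likely rely on an alignment or k-mer mapping tool than on exact set intersection.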

Crossref

LSHTM Research Online

Copenhagen University Research Information System

EUR Research Repository

Enlighten

Online Research Database In Technology

Critical Assessment of Metagenome Interpretation: the second round of challenges.

Author: Alser Mohammed
Antipov Dmitry
Beghini Francesco
Bertrand Denis
Brito Jaqueline J
Brown C Titus
Buchmann Jan
Buluç Aydin
Chen Bo
Chikhi Rayan
Clausen Philip TLC
Cristian Alexandru
Dabrowski Piotr Wojciech
Darling Aaron E
Deng Zhi-Luo
Egan Rob
Eskin Eleazar
Fritz Adrian
Garrido-Oter Ruben
Gastmeier Petra
Georganas Evangelos
Goltsman Eugene
Gray Melissa A
Gurevich Alexey
Hacquard Stephane
Hansen Lars Hestbjerg
Hofmeyr Steven
Huang Pingqin
Häußler Susanne
Irber Luiz
Jia Huijue
Jørgensen Tue Sparholt
Khaledi Ariane
Kieser Silas D
Klemetsen Terje
Kola Axel
Kolmogorov Mikhail
Korobeynikov Anton
Koslicki David
Kwan Jason
LaPierre Nathan
Lemaitre Claire
Lesker Till Robin
Li Chenhao
Limasset Antoine
Maechler Friederike
Malcher-Miranda Fabio
Mangul Serghei
Marcelino Vanessa R
Marchet Camille
Marijon Pierre
Meleshko Dmitry
Mende Daniel R
Mesny Fantin
Meyer Fernando
Milanese Alessio
Nagarajan Niranjan
Nissen Jakob
Nurk Sergey
Oliker Leonid
Paoli Lucas
Peterlongo Pierre
Piro Vitor C
Porter Jacob S
Radutoiu Simona
Rasmussen Simon
Rees Evan R
Reinert Knut
Renard Bernhard
Robertsen Espen Mikal
Robertson Gary
Rosen Gail L
Ruscheweyh Hans-Joachim
Sarwal Varuni
Schulze-Lefert Paul
Segata Nicola
Seiler Enrico
Shi Lizhen
Smit Nathiana
Strowig Till
Sun Fengzhu
Sunagawa Shinichi
Sørensen Søren Johannes
Thomas Ashleigh
Tong Chengxuan
Trajkovski Mirko
Tremblay Julien
Uritskiy Gherman
Vicedomini Riccardo
Wang Zhengyang
Wang Zhong
Wang Ziye
Warren Andrew
Willassen Nils Peder
Yelick Katherine
You Ronghui
Zeller Georg
Zhao Zhengqiao
Zhu Jie
Zhu Shanfeng
Publication venue: eScholarship, University of California
Publication date: 01/04/2022

Evaluating metagenomic software is key for optimizing metagenome interpretation and the focus of the Initiative for the Critical Assessment of Metagenome Interpretation (CAMI). The CAMI II challenge engaged the community to assess methods on realistic and complex datasets with long- and short-read sequences, created computationally from around 1,700 new and known genomes, as well as 600 new plasmids and viruses. Here we analyze 5,002 results by 76 program versions. Substantial improvements were seen in assembly, some due to long-read data. Related strains still were challenging for assembly and genome recovery through binning, as was assembly quality for the latter. Profilers markedly matured, with taxon profilers and binners excelling at higher bacterial ranks, but underperforming for viruses and Archaea. Clinical pathogen detection results revealed a need to improve reproducibility. Runtime and memory usage analyses identified efficient programs, including top performers with other metrics. The results identify challenges and guide researchers in selecting methods for analyses.

eScholarship - University of California