
    Sequence data mining and characterisation of unclassified microbial diversity

    In the last two decades, sequencing has become increasingly affordable and a routine tool to study the microbial community of a given environment. Metagenomics has revolutionised the way microbes are identified and studied in this age of biological data science because it provides a relatively unbiased view of the composition of the microbial communities we interact with every day, which are integral to our ecosystem. These technological advances have led to an exponential growth of raw data repositories that save, distribute and archive these metagenomic datasets. Since metagenomics presents the ultimate opportunity to capture, explore and identify uncultivated microbial genomic sequences, these metagenomic datasets harbour a large proportion of unknown sequences that do not bear any similarity to known sequences readily available in the standard sequence data repositories. The aim of this thesis was to systematically catalogue, quantify and potentially characterise the unknown sequences embedded within the metagenomic datasets. To this end, a comprehensive, portable, modular framework called UnXplore was developed to determine the proportion of unknown sequences included in human microbiome datasets. UnXplore was applied to a range of different human microbiomes and showed that on average 2% of assembled sequences were categorised as unknown, meaning that they did not bear any sequence similarity to known sequences. A third of the unknown sequences were shown to contain large open reading frames, indicating the coding potential and biological origin of the unknowns. Furthermore, a small proportion of these potentially coding sequences were shown to have functional similarities, as they were deemed to contain known protein domain signatures. These results indicated that unknown sequences captured through the UnXplore framework were not artefacts and were indeed of biological origin.
To test this formally, supervised k-mer-based machine learning models were devised, tested and validated. These models are currently distributed in a package called TetraPredX that can accurately predict whether a sequence originated from bacteria, archaea, virus or plasmid. TetraPredX models were applied to the unknown sequence dataset and revealed that the majority of unknown sequences are of biological origin. Furthermore, TetraPredX results demonstrated that >70% of all long unknown sequences (i.e. >1 kb) are likely to be of virus origin, indicating an unexplored diversity of viruses that is yet to be fully characterised and classified. In order to catalogue the diversity of virus sequences in the human microbiome samples analysed here, an extensive virus discovery analysis was carried out on the contigs assembled through UnXplore. This helped to characterise a vast diversity of prokaryotic, eukaryotic and unclassified virus sequences captured in a range of human microbiomes. The results obtained here demonstrate the need to systematically interrogate metagenomic datasets to fully comprehend and compile the presence of both known and unknown uncultivated microbes within them. A comprehensive survey of metagenomic datasets carried out in this manner would provide a more complete picture of the known and unknown organisms that surround us.
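The abstract above mentions supervised k-mer-based models but does not say which features or classifiers TetraPredX uses. As an illustration only, the sketch below computes a normalised k-mer frequency vector for a DNA sequence, the kind of feature such models commonly consume. The default `k = 4` (tetranucleotides) is an assumption suggested by the package name, and the function name `kmer_frequencies` is hypothetical and not part of TetraPredX.

```python
from collections import Counter
from itertools import product


def kmer_frequencies(sequence: str, k: int = 4) -> dict[str, float]:
    """Return relative frequencies of all 4**k DNA k-mers in `sequence`.

    Hypothetical helper for illustration; k = 4 is an assumption.
    Windows containing characters other than A, C, G or T are skipped.
    """
    alphabet = "ACGT"
    seq = sequence.upper()
    counts: Counter[str] = Counter()

    # Slide a window of length k along the sequence, one base at a time.
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer) <= set(alphabet):
            counts[kmer] += 1

    total = sum(counts.values())
    # Emit every possible k-mer so that all sequences share one feature space.
    all_kmers = ("".join(p) for p in product(alphabet, repeat=k))
    return {kmer: (counts[kmer] / total if total else 0.0) for kmer in all_kmers}
```

A vector like this has a fixed length of 4**k regardless of contig length, which is why such features can be passed to a standard classifier; sequences shorter than k simply yield an all-zero vector.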

    Quantifying and cataloguing unknown sequences within human microbiomes

    Advances in genome sequencing technologies and lower costs have enabled the exploration of a multitude of known and novel environments and microbiomes. This has led to an exponential growth in the raw sequence data that are deposited in online repositories. Metagenomic and metatranscriptomic data sets are typically analysed with regard to a specific biological question. However, it is widely acknowledged that these data sets include a proportion of sequences that bear no similarity to any currently known biological sequence, and this so-called “dark matter” is often excluded from downstream analyses. In this study, a systematic framework was developed to assemble, identify, and measure the proportion of unknown sequences present in distinct human microbiomes. This framework was applied to 40 distinct studies, comprising 963 samples, and covering 10 different human microbiomes including fecal, oral, lung, skin, and circulatory system microbiomes. We found that while the human microbiome is one of the most extensively studied, on average 2% of assembled sequences have not yet been taxonomically defined. However, this proportion varied extensively among different microbiomes and was as high as 25% for skin and oral microbiomes that have more interactions with the environment. From the taxonomically unknown sequences discovered in this study, we calculated a rate of taxonomic characterization of 1.64% of unknown sequences per month. A cross-study comparison led to the identification of similar unknown sequences in different samples and/or microbiomes. Both our computational framework and the novel unknown sequences produced are publicly available for future cross-referencing. Our approach led to the discovery of several novel viral genomes that bear no similarity to sequences in the public databases. Some of these are widespread, as they have been found in different microbiomes and studies.
Hence, our study illustrates how the systematic characterization of unknown sequences can help the discovery of novel microbes, and we call on the research community to systematically collate and share the unknown sequences from metagenomic studies to increase the rate at which the unknown sequence space can be classified.
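The proportions quoted above (an average across samples, with variation among microbiomes) can be illustrated with a small sketch. It assumes each sample is summarised as a tuple of microbiome name, number of unknown contigs and number of assembled contigs. The function names and this input layout are hypothetical and are not taken from the published framework.

```python
from collections import defaultdict
from statistics import mean


def unknown_percentage(n_unknown: int, n_assembled: int) -> float:
    """Percentage of assembled contigs with no similarity to known sequences."""
    if n_assembled <= 0:
        raise ValueError("n_assembled must be positive")
    return 100.0 * n_unknown / n_assembled


def mean_unknown_by_microbiome(
    samples: list[tuple[str, int, int]],
) -> dict[str, float]:
    """Average the per-sample unknown percentage within each microbiome.

    `samples` holds (microbiome, n_unknown, n_assembled) tuples; this
    layout is an assumption made for the example.
    """
    grouped: dict[str, list[float]] = defaultdict(list)
    for microbiome, n_unknown, n_assembled in samples:
        grouped[microbiome].append(unknown_percentage(n_unknown, n_assembled))
    return {name: mean(values) for name, values in grouped.items()}
```

Averaging per-sample percentages, rather than pooling all contigs, keeps deeply sequenced samples from dominating the estimate; which of the two the study used is not stated in the abstract.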

    Metaviromics reveals unknown viral diversity in the biting midge Culicoides impunctatus

    Biting midges (Culicoides species) are vectors of arboviruses and were responsible for the emergence and spread of Schmallenberg virus (SBV) in Europe in 2011 and are likely to be involved in the emergence of other arboviruses in Europe. Improved surveillance and better understanding of risks require a better understanding of the circulating viral diversity in these biting insects. In this study, we expand the sequence space of RNA viruses by identifying a number of novel RNA viruses from Culicoides impunctatus (biting midge) using a meta-transcriptomic approach. A novel metaviromic pipeline called MetaViC was developed specifically to identify novel virus sequence signatures from high-throughput sequencing (HTS) datasets in the absence of a known host genome. MetaViC is a protein-centric pipeline that looks for specific protein signatures in the reads and contigs generated as part of the pipeline. Several novel viruses, including an alphanodavirus with both segments, a novel relative of the Hubei sobemo-like virus 49, two rhabdo-like viruses and a chuvirus, were identified in the Scottish midge samples. The newly identified viruses were found to be phylogenetically distinct from those previously known. These findings expand our current knowledge of viral diversity in arthropods and especially in these understudied disease vectors.

    Chapparvoviruses occur in at least three vertebrate classes and have a broad biogeographic distribution

    Chapparvoviruses are a highly divergent group of parvoviruses (family Parvoviridae) that have recently been identified via metagenomic sampling of animal faeces. Here we report the sequences of six novel chapparvoviruses identified through both metagenomic sampling of bat tissues and in silico screening of published vertebrate genome assemblies. The novel chapparvoviruses share several distinctive genomic features, and group together as a robustly supported monophyletic clade in phylogenetic trees. Our data indicate that chapparvoviruses have a broad host range in vertebrates and a global distribution.

    ViCTree: an automated framework for taxonomic classification from protein sequences

    Motivation: The increasing rate of submission of genetic sequences into public databases is providing a growing resource for classifying the organisms that these sequences represent. To aid viral classification, we have developed ViCTree, which automatically integrates the relevant sets of sequences in NCBI GenBank and transforms them into an interactive maximum likelihood phylogenetic tree that can be updated automatically. ViCTree incorporates ViCTreeView, which is a JavaScript-based visualisation tool that enables the tree to be explored interactively in the context of pairwise distance data. Results: To demonstrate utility, ViCTree was applied to subfamily Densovirinae of family Parvoviridae. This led to the identification of six new species of insect virus. Availability: ViCTree is open-source and can be run on any Linux- or Unix-based computer or cluster. A tutorial, the documentation and the source code are available under a GPL3 license, and can be accessed at http://bioinformatics.cvr.ac.uk/victree_web/

    Discovery of novel astrovirus and calicivirus identified in ruddy turnstones in Brazil

    Birds are the natural reservoir of viruses with zoonotic potential, as well as contributing to the evolution, emergence, and dissemination of novel viruses. In this study, we applied a high-throughput screening approach to identify the diversity of viruses in 118 samples of birds captured between October 2006 and October 2010 in the North and Northeast regions of Brazil. We found nearly complete genomes of novel species of astrovirus and calicivirus in cloacal swabs of ruddy turnstones (Arenaria interpres) collected in Coroa do Avião islet, Pernambuco State. These viruses are positive-sense single-stranded RNA viruses with a genome of ~7 to 8 kb, and were designated as Ruddy turnstone astrovirus (RtAstV) and Ruddy turnstone calicivirus (RTCV), respectively. Phylogenetic analysis showed that RtAstV and RTCV grouped in a monophyletic clade with viruses identified from poultry samples (i.e., chicken, goose, and turkey), including viruses associated with acute nephritis in chickens. Attempts at viral propagation in monkey and chicken cell lines were unsuccessful for both viruses. We also found genomes related to viral families that infect invertebrates and plants, suggesting that they might have been ingested in the birds' diet. In sum, these findings shed new light on the diversity of viruses in migratory birds with the notable characterization of a novel astrovirus and calicivirus.

    A novel Hepacivirus in wild rodents from South America

    The Hepacivirus genus comprises single-stranded positive-sense RNA viruses within the family Flaviviridae. Several hepaciviruses have been identified in different mammals, including multiple rodent species in Africa, Asia, Europe, and North America. To date, no rodent hepacivirus has been identified in the South American continent. Here, we describe a previously unknown hepacivirus discovered during a metagenomic screen in Akodon montensis, Calomys tener, Oligoryzomys nigripes, Necromys lasiurus, and Mus musculus from São Paulo State, Brazil. Molecular detection of this novel hepacivirus by RT-PCR showed a frequency of 11.11% (2/18) in Oligoryzomys nigripes. This is the first identification of a hepacivirus in sigmodontine rodents and in rodents from South America. In sum, our results expand the host range, viral diversity, and geographical distribution of the Hepacivirus genus.

    Novel orthohepeviruses in wild rodents from São Paulo State, Brazil

    The Hepeviridae comprise single-stranded positive-sense RNA viruses classified into two genera, Orthohepevirus and Piscihepevirus. Orthohepeviruses have a wide host range that includes rodents, but previous studies had been restricted to rodents of the Muridae family. In this study, we applied a high-throughput sequencing approach to examine the presence of orthohepeviruses in rodents from São Paulo State, Brazil. We also used RT-PCR to determine the frequency of orthohepeviruses in our sampled population. We identified novel orthohepeviruses in blood samples derived from Necromys lasiurus (1.19%) and Calomys tener (3.66%). Our results therefore expand the host range and viral diversity of the Hepeviridae family.

    Novel parvoviruses from wild and domestic animals in Brazil provide new insights into parvovirus distribution and diversity

    Parvoviruses (family Parvoviridae) are small, single-stranded DNA viruses. Many parvoviral pathogens of medical, veterinary and ecological importance have been identified. In this study, we used high-throughput sequencing (HTS) to investigate the diversity of parvoviruses infecting wild and domestic animals in Brazil. We identified 21 parvovirus sequences (including twelve nearly complete genomes and nine partial genomes) in samples derived from rodents, bats, opossums, birds and cattle in Pernambuco, São Paulo, Paraná and Rio Grande do Sul states. These sequences were investigated using phylogenetic and distance-based approaches and were thereby classified into eight parvovirus species (six of which have not been described previously), representing six distinct genera in the subfamily Parvovirinae. Our findings extend the known biogeographic range of previously characterized parvovirus species and the known host range of three parvovirus genera (Dependovirus, Aveparvovirus and Tetraparvovirus). Moreover, our investigation provides a window into the ecological dynamics of parvovirus infections in vertebrates, revealing that many parvovirus genera contain well-defined sub-lineages that circulate widely throughout the world within particular taxonomic groups of hosts.

    Pingu virus : a new picornavirus in penguins from Antarctica

    The family Picornaviridae comprises single-stranded, positive-sense RNA viruses distributed into forty-seven genera. Picornaviruses have a broad host range and are geographically distributed across all continents. In this study, we applied a high-throughput sequencing approach to examine the presence of picornaviruses in penguins from King George Island, Antarctica. We discovered and characterized a novel picornavirus from cloacal swab samples of gentoo penguins (Pygoscelis papua), which we tentatively named Pingu virus. Using RT-PCR, we also detected this virus in 12.9 per cent of cloacal swabs derived from P. papua, but not in samples from adelie penguins (Pygoscelis adeliae) or chinstrap penguins (Pygoscelis antarcticus). Attempts to isolate the virus in a chicken cell line and in embryonated chicken eggs were unsuccessful. Our results expand the viral diversity, host range, and geographical distribution of the Picornaviridae.
    This work was supported by the Fundação de Amparo à Pesquisa do Estado de São Paulo, Brazil (Grant no. 13/14929-1, and Scholarships nos. 17/13981-0; 12/24150-9; 15/05778-5; 14/20851-8; 16/01414-1; 06/00572-0). P.R.M. was supported by the Medical Research Council of the UK (Grant no. MC_UU_120/14/9