thesis

Development of bioinformatics tools for the rapid and sensitive detection of known and unknown pathogens from next generation sequencing data

Abstract

Infectious diseases remain one of the main causes of death across the globe. Despite huge advances in clinical diagnostics, establishing a clear etiology remains impossible in a proportion of cases. Since the emergence of next-generation sequencing (NGS), a multitude of new research fields based on this technology have evolved. In particular, its application in metagenomics, the study of genomic material taken directly from its environment, has led to a rapid development of new applications. Metagenomic NGS has proven to be a promising tool in the field of pathogen-related research and diagnostics. In this thesis, I present different approaches for the detection of known and the discovery of unknown pathogens from NGS data. These contributions subdivide into three newly developed methods and one publication on a real-world use case of the methodology we developed and the data analysis based on it.

First, I present LiveKraken, a real-time read classification tool based on the core algorithm of Kraken. LiveKraken uses streams of raw data from Illumina sequencers to classify reads taxonomically. This way, we are able to produce results identical to those of Kraken the moment the sequencer finishes. We are furthermore able to provide comparable results in early stages of a sequencing run, which can save up to a week of sequencing time. While the number of classified reads grows over time, false classifications appear in negligible numbers, and the proportions of identified taxa are only affected to a minor extent.

In the second project, we designed and implemented PathoLive, a real-time diagnostics pipeline which allows the detection of pathogens from clinical samples before the sequencing procedure is finished. We adapted the core algorithm of HiLive, a real-time read mapper, and enhanced its accuracy for our use case. Furthermore, sequences that are probably irrelevant are automatically marked. The results are visualized in an interactive taxonomic tree that provides an intuitive overview and detailed metrics regarding the relevance of each identified pathogen. Testing PathoLive on the sequencing of a real plasma sample spiked with viruses, we showed that it ranked the results more accurately throughout the complete sequencing run than any other tested tool did at the end of the sequencing run. With PathoLive, we shift the focus of NGS-based diagnostics from read quantification towards a more meaningful assessment of results with an unprecedented turnaround time.

The third project aims at the detection of novel pathogens from NGS data. We developed RAMBO-K, a tool which allows rapid and sensitive removal of unwanted host sequences from NGS datasets. RAMBO-K is faster than any other tool we tested, while showing a consistently high sensitivity and specificity across different datasets. RAMBO-K rapidly and reliably separates reads from different species. It is suitable as a straightforward standard solution for workflows dealing with mixed datasets.

In the fourth project, we used RAMBO-K as well as several other data analyses to discover Berlin squirrelpox virus, a divergent new poxvirus establishing a new genus of Poxviridae. Near Berlin, Germany, several juvenile red squirrels (Sciurus vulgaris) were found with moist, crusty skin lesions. Histology, electron microscopy, and cell culture isolation revealed an orthopoxvirus-like infection. After standard workflows yielded no significant results, poxviral reads were assigned using RAMBO-K, enabling the assembly of the genome of the novel virus.

With these projects, we established three new application-related methods, each of which closes a different research gap. Taken together, we enhance the available repertoire of NGS-based pathogen-related research tools and facilitate and accelerate a variety of research projects.
