Antimicrobial resistance (AMR) is a persistent and growing threat to global health. Whole genome sequencing (WGS) has the potential to dramatically improve our ability to detect, understand, and monitor AMR. However, the diversity and complexity of microbes mean that analysing and interpreting their genomes is challenging. In this thesis, I explore applications of de Bruijn graphs (DBGs) to the analysis of these data.
First, I present a tool, Mykrobe predictor, that uses DBGs to rapidly identify species and AMR from WGS data. I show that it is accurate, flexible, and efficient.
Next, I explore an extension of Mykrobe predictor to long-read sequencing of direct clinical samples of M. tuberculosis. In doing so, I show that one could reduce the turn-around time for susceptibility testing of an M. tuberculosis isolate from 2 weeks to 12 hours.
Finally, I explore the challenges of DNA search in very large collections (millions) of microbial data sets. In particular, I address the super-linear scaling of existing k-mer indexing tools and present a novel representation and implementation of a probabilistic coloured de Bruijn graph, the "Coloured Bloom Graph" (CBG). I demonstrate its scalability by building a CBG of all publicly accessible microbial WGS data (almost half a million samples) and use it to run millisecond searches in these data.</p