Dynamic, adaptive sampling during nanopore sequencing using Bayesian experimental design
Nanopore sequencers can select which DNA molecules to sequence, rejecting a molecule after analysis of a small initial part. Currently, selection is based on predetermined regions of interest that remain constant throughout an experiment. Sequencing efforts, thus, cannot be re-focused on molecules likely contributing most to experimental success. Here we present BOSS-RUNS, an algorithmic framework and software to generate dynamically updated decision strategies. We quantify uncertainty at each genome position with real-time updates from data already observed. For each DNA fragment, we decide whether the expected decrease in uncertainty that it would provide warrants fully sequencing it, thus optimizing information gain. BOSS-RUNS mitigates coverage bias between and within members of a microbial community, leading to improved variant calling; for example, low-coverage sites of a species at 1% abundance were reduced by 87.5%, with 12.5% more single-nucleotide polymorphisms detected. Such data-driven updates to molecule selection are applicable to many sequencing scenarios, such as enriching for regions with increased divergence or low coverage, reducing time-to-answer
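The decision rule described in this abstract (sequence a fragment only if the expected decrease in uncertainty warrants it) can be illustrated with a minimal sketch. This is not the BOSS-RUNS implementation: the Dirichlet-style pseudo-count `prior`, the use of Shannon entropy as the uncertainty measure, the assumption that the fragment's covered interval is known, and the `threshold` parameter are all assumptions made for illustration.

```python
import math

def entropy(probs):
    """Shannon entropy (bits) of a discrete distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def posterior(counts, prior=1.0):
    """Posterior mean base probabilities from observed counts plus pseudo-counts."""
    total = sum(counts) + prior * len(counts)
    return [(c + prior) / total for c in counts]

def expected_entropy_drop(counts, prior=1.0):
    """Expected entropy reduction at one site from observing one more base."""
    current = posterior(counts, prior)
    h_after = 0.0
    for i, p in enumerate(current):
        updated = list(counts)
        updated[i] += 1  # hypothetical next observation of base i
        h_after += p * entropy(posterior(updated, prior))
    return entropy(current) - h_after

def should_sequence(site_counts, start, end, threshold):
    """Accept a fragment covering [start, end) if its summed expected benefit
    reaches the (hypothetical) threshold."""
    benefit = sum(expected_entropy_drop(c) for c in site_counts[start:end])
    return benefit >= threshold

# Five uncovered sites followed by five sites with 30 concordant observations.
site_counts = [[0, 0, 0, 0]] * 5 + [[30, 0, 0, 0]] * 5
low_cov = should_sequence(site_counts, 0, 5, threshold=0.2)
high_cov = should_sequence(site_counts, 5, 10, threshold=0.2)
```

In this toy setting a fragment from the uncovered region is accepted and one from the well-covered region is rejected, because each additional base changes the posterior far less once many concordant observations exist.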
Stability of SARS-CoV-2 phylogenies.
Funder: Alfred P. Sloan Foundation; funder-id: http://dx.doi.org/10.13039/100000879
Funder: European Molecular Biology Laboratory (EMBL)
The SARS-CoV-2 pandemic has led to unprecedented, nearly real-time genetic tracing due to the rapid community sequencing response. Researchers immediately leveraged these data to infer the evolutionary relationships among viral samples and to study key biological questions, including whether host viral genome editing and recombination are features of SARS-CoV-2 evolution. This global sequencing effort is inherently decentralized and must rely on data collected by many labs using a wide variety of molecular and bioinformatic techniques. There is thus a strong possibility that systematic errors associated with lab- or protocol-specific practices affect some sequences in the repositories. We find that some recurrent mutations in reported SARS-CoV-2 genome sequences have been observed predominantly or exclusively by single labs, co-localize with commonly used primer binding sites and are more likely to affect the protein-coding sequences than other similarly recurrent mutations. We show that their inclusion can affect phylogenetic inference on scales relevant to local lineage tracing, and make it appear as though there has been an excess of recurrent mutation or recombination among viral lineages. We suggest how samples can be screened and problematic variants removed, and we plan to regularly inform the scientific community with our updated results as more SARS-CoV-2 genome sequences are shared (https://virological.org/t/issues-with-sars-cov-2-sequencing-data/473 and https://virological.org/t/masking-strategies-for-sars-cov-2-alignments/480). We also develop tools for comparing and visualizing differences among very large phylogenies and we show that consistent clade- and tree-based comparisons can be made between phylogenies produced by different groups. These will facilitate evolutionary inferences and comparisons among phylogenies produced for a wide array of purposes. Building on the SARS-CoV-2 Genome Browser at UCSC, we present a toolkit to compare, analyze and combine SARS-CoV-2 phylogenies, find and remove potential sequencing errors and establish a widely shared, stable clade structure for a more accurate scientific inference and discourse.
Development of a novel tool to uncover mobile genetic element diversity and trace the invasion of DNA transposons
Transposable elements (TEs) are selfish DNA sequences that multiply within host genomes. They are present in most species investigated so far at varying degrees of abundance and sequence diversity. The TE composition may not only vary between but also within species and could have important biological implications. Variation in prevalence among populations may for example indicate a recent TE invasion, whereas sequence variation could indicate the presence of hyperactive or inactive forms. Gaining unbiased estimates of TE composition is thus vital for understanding the evolutionary dynamics of transposons.
To this end we developed DeviaTE, a tool to analyze and visualize TE abundance using Illumina or Sanger reads. Our program only requires sequencing reads and consensus sequences of TEs. Thus, it works in an assembly-free manner, increasing its applicability to non-model organisms for which a high-quality assembly is not available yet. It generates a table and a visual representation of TE composition and provides unbiased estimates of TE abundance. Using published data we demonstrate that DeviaTE can be used to study TE composition within samples, identify clinal variation in TEs or compare TE diversity among species. We also present careful validation with simulated data.
Moreover, we describe a model of DNA transposon invasions and an approach to reconstruct the history of such invasions using our novel tool. We propose that an invasion leaves unique fingerprints within populations, which consist of non-autonomous, internally deleted variants of TEs. Using these TE remnants, we show that the sequence of the P-element invasion in North American and European Drosophila melanogaster populations can be retraced. In particular, we find that patterns of internally deleted variants recover the geographic distribution of sampled populations. Additionally, we identify potential origins and routes of the invasion on both continents. With the development of DeviaTE we hope to catalyze future progress in our understanding of TE invasion dynamics and other diverse phenomena in which TEs play a central role.
Computational optimisation strategies for targeted DNA sequencing using nanopores
Long-read DNA sequencing is causing a generational shift in genome sequencing productivity and is revolutionising many aspects of biological discovery. One of the technologies behind this transformation is sequencing using nanopores that act as biosensors measuring fluctuations of an ionic current caused by traversing polynucleotides. By partially blocking the flow of ions, distinct patterns of current are generated as nucleotides pass through the narrowest constriction; these are subsequently bioinformatically demixed into nucleobase sequences.
This approach for sequencing has multiple advantages, including reading native molecules that can be up to megabases long without prior amplification, and the ability to analyse data in real time. This allows for ultra-fast time-to-answer investigations, and also manipulation of ongoing experiments by processing the generated data and feeding instructions back to the sequencing machine; a unique feature not realisable with any other existing sequencing technology.
Currently, a few methods exist that make use of this by analysing nascent fragments of DNA and testing whether they originate from predetermined areas of a genome marked as targets. Otherwise, the voltage bias across the membrane, into which the nanopores are embedded, can be reversed and the molecule will be ejected from the nanopore, allowing another one to be sequenced in its place with the aim of saving time and enriching for on-target sequences. This process has been termed 'adaptive sampling'; but prior to the work presented in my thesis, these methods were based entirely on static instructions. In other words, target regions of a genome were defined before an experiment and remained constant throughout.
With this thesis, I extend adaptive sampling such that decisions about molecule rejections can incorporate information obtained during an ongoing experiment. One of the main motivations is to address the current need for oversampling genomes many-fold to ensure a minimum coverage across a sequenced genome high enough for downstream analyses, which can be wasteful. Similarly, sequencing mixed samples without wasteful oversampling might lead to underrepresented or missing rare species. This thesis describes two approaches for dynamic extensions to adaptive sampling to address these issues, by implementing more versatile real-time analysis and control of sequencing experiments.
The first approach is intended for resequencing experiments, where reference sequences of the studied sample are available. For this, I implement an algorithmic framework and software that generates dynamically adapting decision strategies that are continuously updated to steer an active sequencing run. More specifically, this method quantifies uncertainty at each position in a genome and for each novel DNA fragment decides whether the expected decrease in uncertainty warrants fully sequencing it. This way, sequencing can be focused on molecules from areas with the highest uncertainty, e.g. regions of low coverage, thus optimising the information gain. I illustrate the effectiveness of the method by mitigating coverage bias between and within members of a microbial mixture sample. In particular, it adapts to the differential abundances without prior knowledge about sample composition, thereby reducing the interspecies bias and effectively redistributing coverage within species.
In some scenarios, the need for reference genomes poses a limitation, e.g. when sample content is unknown. In this case previous implementations are not useful, since underrepresented species cannot be targeted. A second approach I develop in the thesis aims to overcome this limitation by exploring how rejection decisions can be made while simultaneously creating a genome assembly from the fragments read so far. Here, the method rejects molecules from regions of genomes that are already well-represented and instead focuses on sequence that either helps to extend a species' assembly or is entirely unknown. I show how refocusing sequencing in this way is useful to increase the detection limit for rare organisms in a mixed sample, leads to higher quality assemblies, and allows for true de novo enrichment of unknown species for the first time.
Overall, the data-driven approaches to targeted sequencing with nanopores that I have created expand the applicability of adaptive sampling and could be applied to many other sequencing scenarios. The resulting reduction in the time-to-answer or increased information gain might be critical in clinical settings or for pathogen surveillance.
This work was supported by EMBL and the EMBL International PhD Programme.
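The idea of rejecting molecules from regions that are already well represented, while keeping unknown sequence, can be sketched as follows. This is a simplified illustration and not the thesis software: the `CoverageTracker` class, the `target` depth, the use of the minimum depth over the mapped window, and the assumption that a read's mapped interval is known at decision time are all hypothetical.

```python
class CoverageTracker:
    """Toy coverage-driven accept/reject rule for a single reference or contig."""

    def __init__(self, length, target):
        self.coverage = [0] * length  # per-position depth observed so far
        self.target = target          # hypothetical desired minimum depth

    def decide(self, hit):
        """hit is a (start, end) interval of the mapped read, or None if unmapped."""
        if hit is None:
            # Unknown sequence may extend the assembly or be a new species: keep it.
            return "accept"
        start, end = hit
        window = self.coverage[start:end]
        if window and min(window) >= self.target:
            # Every position in the window already has enough depth.
            return "reject"
        for pos in range(start, end):
            self.coverage[pos] += 1   # record the accepted read
        return "accept"

tracker = CoverageTracker(length=100, target=2)
decisions = [
    tracker.decide((0, 50)),   # region empty
    tracker.decide((0, 50)),   # region at depth 1
    tracker.decide((0, 50)),   # region at target depth
    tracker.decide((40, 60)),  # partly uncovered
    tracker.decide(None),      # unmapped read
]
```

Because the coverage array is updated as reads are accepted, the same interval can be accepted early in a run and rejected later, which is the difference from a static target list.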
phastSim: Efficient simulation of sequence evolution for pandemic-scale datasets.
Funder: European Molecular Biology Laboratory
Funder: Schmidt Futures Foundation
Sequence simulators are fundamental tools in bioinformatics, as they allow us to test data processing and inference tools, and are an essential component of some inference methods. The ongoing surge in available sequence data is however testing the limits of our bioinformatics software. One example is the large number of SARS-CoV-2 genomes available, which are beyond the processing power of many methods, and simulating such large datasets is also proving difficult. Here, we present a new algorithm and software for efficiently simulating sequence evolution along extremely large trees (e.g. >100,000 tips) when the branches of the tree are short, as is typical in genomic epidemiology. Our algorithm is based on the Gillespie approach, and it implements an efficient multi-layered search tree structure that provides high computational efficiency by taking advantage of the fact that only a small proportion of the genome is likely to mutate at each branch of the considered phylogeny. Our open-source software allows easy integration with other Python packages as well as a variety of evolutionary models, including indel models and new hypermutability models that we developed to more realistically represent SARS-CoV-2 genome evolution.
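For readers unfamiliar with the Gillespie approach mentioned above, the following minimal sketch simulates substitutions along a single branch. It does not use phastSim or reproduce its multi-layered search tree; the function name `evolve_branch`, the uniform per-site rate, and the equal-probability choice among the three other bases (a Jukes-Cantor-like assumption) are illustrative choices only.

```python
import random

BASES = "ACGT"

def evolve_branch(sequence, branch_length, rate_per_site=1.0, rng=None):
    """Gillespie-style simulation of substitutions along one branch.

    Waiting times between events are exponential with the total rate;
    each event hits a uniformly chosen site and changes its base.
    Returns the evolved sequence and the number of substitution events.
    """
    rng = rng or random.Random()
    seq = list(sequence)
    total_rate = rate_per_site * len(seq)
    if total_rate <= 0:
        return sequence, 0
    events = 0
    t = rng.expovariate(total_rate)          # time of the first event
    while t < branch_length:
        site = rng.randrange(len(seq))       # all sites share the same rate here
        seq[site] = rng.choice([b for b in BASES if b != seq[site]])
        events += 1
        t += rng.expovariate(total_rate)     # time of the next event
    return "".join(seq), events

ancestor = "ACGT" * 5
unchanged, n_zero = evolve_branch(ancestor, 0.0, rng=random.Random(1))
evolved, n_long = evolve_branch(ancestor, 50.0, rng=random.Random(1))
```

On short branches the loop usually runs zero or a few times, so the cost scales with the number of mutation events rather than with genome length. Drawing the site uniformly only works because every site has the same rate in this sketch; with heterogeneous rates, locating the mutating site is the step where a search tree over rates, as described in the abstract, becomes useful.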
High-speed volumetric imaging of neuronal activity in freely moving rodents
Thus far, optical recording of neuronal activity in freely behaving animals has been limited to a thin axial range. We present a head-mounted miniaturized light-field microscope (MiniLFM) capable of capturing neuronal network activity within a volume of 700 × 600 × 360 µm³ at 16 Hz in the hippocampus of freely moving mice. We demonstrate that neurons separated by as little as ~15 µm and at depths up to 360 µm can be discriminated.