    An Average-Case Sublinear Exact Li and Stephens Forward Algorithm

    Hidden Markov models of haplotype inheritance such as the Li and Stephens model allow for computationally tractable probability calculations using the forward algorithm as long as the representative reference panel used in the model is sufficiently small. Specifically, the monoploid Li and Stephens model and its variants are linear in reference panel size unless heuristic approximations are used. However, sequencing projects numbering in the thousands to hundreds of thousands of individuals are underway, and others numbering in the millions are anticipated. To make the Li and Stephens forward algorithm computationally tractable for these datasets, we have created a numerically exact version of the algorithm with observed average-case O(nk^{0.35}) runtime in number of genetic sites n and reference panel size k. This avoids any tradeoff between runtime and model complexity. We demonstrate that our approach also provides a succinct data structure for general-purpose haplotype data storage. We discuss generalizations of our algorithmic techniques to other hidden Markov models.
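    For context, the quantity being computed is the standard Li and Stephens forward likelihood, whose textbook recursion costs O(nk) time per panel. Below is a minimal sketch of that baseline recursion; the constant per-site recombination probability rho, mutation probability mu, and 0/1 allele encoding are illustrative assumptions rather than the thesis's notation.

```python
import numpy as np

def li_stephens_forward(obs, panel, rho=1e-3, mu=1e-3):
    """Naive O(nk) forward algorithm for the monoploid Li and Stephens model.

    obs   : length-n array of observed alleles (0/1)
    panel : (k, n) array of reference haplotype alleles (0/1)
    rho   : per-site recombination (switch) probability (assumed constant here)
    mu    : per-site mutation (mismatch emission) probability
    Returns the total likelihood P(obs | panel).
    """
    k, n = panel.shape
    emit = lambda j: np.where(panel[:, j] == obs[j], 1.0 - mu, mu)
    f = emit(0) / k  # uniform prior over which panel haplotype is copied
    for j in range(1, n):
        # either stay on the current haplotype or recombine uniformly to any haplotype
        f = ((1.0 - rho) * f + rho * f.sum() / k) * emit(j)
    return f.sum()
```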

    Highly Scalable Short Read Alignment with the Burrows-Wheeler Transform and Cloud Computing

    Improvements in DNA sequencing have both broadened its utility and dramatically increased the size of sequencing datasets. Sequencing instruments are now used regularly as sources of high-resolution evidence for genotyping, methylation profiling, DNA-protein interaction mapping, and characterizing gene expression in the human genome and in other species. With existing methods, the computational cost of aligning short reads from the Illumina instrument to a mammalian genome can be very large: on the order of many CPU months for one human genotyping project. This thesis presents a novel application of the Burrows-Wheeler Transform that enables the alignment of short DNA sequences to mammalian genomes at a rate much faster than existing hash-table-based methods. The thesis also presents an extension of the technique that exploits the scalability of Cloud Computing to perform the equivalent of one human genotyping project in hours.
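    The FM-index operation at the heart of BWT-based read alignment is backward search, which narrows a suffix-array interval one pattern character at a time. The following toy sketch of exact-match backward search assumes a precomputed counts array and an occ(c, i) rank function; it illustrates the idea only and is not the aligner's actual implementation, which additionally handles mismatches and base qualities.

```python
def bwt_backward_search(pattern, bwt, counts, occ):
    """Count exact occurrences of `pattern` using FM-index backward search.

    bwt    : Burrows-Wheeler transform of the text (with terminator appended)
    counts : dict mapping symbol c -> number of text symbols lexicographically
             smaller than c (the classic "C" array)
    occ    : function occ(c, i) -> number of occurrences of c in bwt[:i]
    Returns the size of the suffix-array interval that matches `pattern`.
    """
    lo, hi = 0, len(bwt)            # interval for the empty suffix
    for c in reversed(pattern):     # extend the match by one character to the left
        lo = counts[c] + occ(c, lo)
        hi = counts[c] + occ(c, hi)
        if lo >= hi:                # interval became empty: no exact match
            return 0
    return hi - lo
```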

    Reconstruction and Scalable Detection and Tracking of 3D Objects

    The task of detecting objects in images is essential for an autonomous system to categorize, comprehend, and eventually navigate or manipulate its environment. Since many applications demand not only the detection of objects but also the estimation of their exact poses, 3D CAD models can prove helpful, as they provide means for feature extraction and hypothesis refinement. This work therefore explores two paths: first, we look into methods to create richly textured and geometrically accurate models of real-life objects. Using these reconstructions as a basis, we investigate how to improve 3D object detection and pose estimation, focusing especially on scalability, i.e., the problem of dealing with multiple objects simultaneously.

    Algorithms and High Performance Computing Approaches for Sequencing-Based Comparative Genomics

    As cost and throughput of second-generation sequencers continue to improve, even modestly resourced research laboratories can now perform DNA sequencing experiments that generate hundreds of billions of nucleotides of data, enough to cover the human genome dozens of times over, in about a week for a few thousand dollars. Such data are now being generated rapidly by research groups across the world, and large-scale analyses of these data appear often in high-profile publications such as Nature, Science, and The New England Journal of Medicine. But with these advances comes a serious problem: growth in per-sequencer throughput (currently about 4x per year) is drastically outpacing growth in computer speed (about 2x every 2 years). As the throughput gap widens over time, sequence analysis software is becoming a performance bottleneck, and the costs associated with building and maintaining the needed computing resources are burdensome for research laboratories. This thesis proposes two methods and describes four open-source software tools that help to address these issues using novel algorithms and high-performance computing techniques. The proposed approaches build primarily on two insights. First, that the Burrows-Wheeler Transform and the FM Index, previously used for data compression and exact string matching, can be extended to facilitate fast and memory-efficient alignment of DNA sequences to long reference genomes such as the human genome. Second, that these algorithmic advances can be combined with MapReduce and cloud computing to solve comparative genomics problems in a manner that is scalable, fault-tolerant, and usable even by small research groups.
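    To illustrate the second insight, the sketch below shows the map/shuffle/reduce dataflow for a toy coverage computation: the map step aligns reads and emits (position, 1) pairs, and the reduce step sums them per position. This is a single-process illustration of the pattern only; the align callback and position-keyed output are assumptions made for the example, not the interface of the actual tools.

```python
from collections import defaultdict

def map_phase(reads, align):
    """Map step: align each read and emit (reference position, 1) pairs."""
    for read in reads:
        pos = align(read)          # any aligner, e.g. an FM-index-based search
        if pos is not None:
            yield pos, 1

def reduce_phase(pairs):
    """Reduce step: sum the counts for each reference position (coverage)."""
    coverage = defaultdict(int)
    for pos, count in pairs:
        coverage[pos] += count
    return dict(coverage)

# In an actual MapReduce/Hadoop run, many nodes execute the map and reduce
# phases in parallel and the framework shuffles pairs by key; this
# single-process version only shows the dataflow.
coverage = reduce_phase(map_phase(reads=[], align=lambda read: None))
```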

    Topics in Lattice Sieving


    Using Pan-Genomic Data Structures to Incorporate Diversity into Genomic Analyses

    The alignment of sequencing reads to the reference genome is a process subject to reference bias, a phenomenon where reads containing alternative alleles are less likely to align to the reference than reads that are more similar to it. Because the human reference genome is largely composed of the genomic sequence of a single individual, modifying the representation of the reference genome to incorporate diversity from other individuals can reduce reference bias. We discuss methods for alleviating reference bias through the use of novel text indexing data structures and algorithms that can incorporate such diversity. First, we present data structures built on top of the Run-Length FM Index that can be used to index and query a pan-genome, i.e., a representation of the genome that incorporates known variation within the species. Then, we use pan-genome indexes in a workflow for constructing a personalized genome from a set of sequencing reads. This personalized genome can be used in lieu of the reference genome during alignment in order to alleviate reference bias. We also discuss how alignments against personalized genomes can be used in downstream analyses by "lifting" these alignments back over to the reference genome.
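    The run-length idea behind the Run-Length FM Index can be sketched briefly: the BWT of many similar haplotypes contains long runs of identical symbols, so storing (symbol, run length) pairs is far smaller than the plain string. The toy class below answers rank queries by scanning runs; real RLFM indexes replace this linear scan with succinct rank structures, and the class name and interface here are illustrative assumptions, not the thesis's implementation.

```python
from bisect import bisect_right
from itertools import groupby

class RunLengthBWT:
    """Toy run-length representation of a BWT string with rank support."""

    def __init__(self, bwt):
        # collapse the BWT into (symbol, run length) pairs
        self.runs = [(sym, len(list(group))) for sym, group in groupby(bwt)]
        # prefix sums of run lengths, so we can locate the run covering position i
        self.starts = [0]
        for _, length in self.runs:
            self.starts.append(self.starts[-1] + length)

    def rank(self, c, i):
        """Number of occurrences of symbol c in bwt[:i] (linear scan over runs)."""
        r = bisect_right(self.starts, i) - 1        # run containing position i
        total = sum(length for sym, length in self.runs[:r] if sym == c)
        if r < len(self.runs) and self.runs[r][0] == c:
            total += i - self.starts[r]             # partial contribution of run r
        return total
```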

    Development and application of efficient ab initio molecular dynamics simulations of ground and excited states

    Ab initio molecular dynamics simulates the movement of nuclei on a potential energy surface generated by ab initio methods. These simulations give access to an entire series of chemically relevant properties, such as vibrational spectra and free energies, and have thus become indispensable in, for example, biochemistry and materials science. They are, however, computationally demanding, due to the expensive quantum-chemical calculations required at every step. To overcome some of these limitations, this thesis presents steps towards efficient but still accurate ab initio molecular dynamics simulations, combining recent progress in different fields of computational chemistry. The time-consuming two-electron integral evaluations are conducted on graphics processing units. Their massively parallel architecture leads to speed-ups (with respect to calculations on central processing units), and strong scaling is observed. Expensive electronic structure calculations are circumvented using parametrized methods, such as the corrected small-basis-set Hartree-Fock method or simplified time-dependent density functional theory. From the field of molecular dynamics, the extended Lagrangian method is adopted to stabilize the trajectories and to accelerate the convergence of the self-consistent field algorithm. Finally, couplings between electronic states are approximated with a finite-differences approach to avoid the time-consuming analytical evaluations at the time-dependent density functional theory level. As a result of these approaches, large molecular systems become accessible at comparably low computational cost. This is demonstrated for several illustrative applications. Excited-state dynamics are used to explore the relaxation pathway of the rhodopsin protein and four newly designed rotary molecular motors using the same Schiff base motif. Ground-state simulations deliver vibrational spectra of medium-sized molecules and liquid water. They are also used to determine free energy differences of molecular transformations, for which a novel scheme is introduced that delivers deeper insights into the underlying process.
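    As a rough illustration of why the per-step quantum-chemical call dominates the cost, the sketch below shows a single velocity Verlet step of the nuclear propagation, with compute_forces standing in for the electronic-structure gradient evaluation. The function signature and names are assumptions made for illustration, not the integrator used in the thesis.

```python
import numpy as np

def velocity_verlet_step(pos, vel, forces, masses, dt, compute_forces):
    """Advance nuclear positions and velocities by one velocity Verlet step.

    compute_forces(pos) is a placeholder for the expensive electronic-structure
    call (e.g. a Hartree-Fock or TDDFT gradient) that must run at every step;
    GPU integral evaluation and extended-Lagrangian SCF acceleration aim to
    make exactly this call cheaper and its convergence faster.
    """
    acc = forces / masses[:, None]
    pos_new = pos + vel * dt + 0.5 * acc * dt ** 2    # update positions
    forces_new = compute_forces(pos_new)              # new quantum-chemical gradient
    acc_new = forces_new / masses[:, None]
    vel_new = vel + 0.5 * (acc + acc_new) * dt        # average old and new accelerations
    return pos_new, vel_new, forces_new
```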