    Demystifying our Grandparent's De Bruijn Sequences with Concatenation Trees

    Some of the most interesting de Bruijn sequences can be constructed in seemingly unrelated ways. In particular, the "Granddaddy" and "Grandmama" can be understood by joining necklace cycles into a tree using simple parent rules, or by concatenating smaller strings (e.g., Lyndon words) in lexicographic orders. These constructions are elegant, but their equivalences seem to come out of thin air, and the community has had limited success in finding others of the same ilk. We aim to demystify the connection between cycle-joining trees and concatenation schemes by introducing "concatenation trees". These structures combine binary trees and ordered trees, and traversals yield concatenation schemes for their sequences. In this work, we focus on the four simplest cycle-joining trees using the pure cycling register (PCR): "Granddaddy" (PCR1), "Grandmama" (PCR2), "Granny" (PCR3), and "Grandpa" (PCR4). In particular, we formally prove a previously observed correspondence for PCR3 and we unravel the mystery behind PCR4. More broadly, this work lays the foundation for translating cycle-joining trees to known concatenation constructions for a variety of underlying feedback functions including the complementing cycling register (CCR), pure summing register (PSR), complementing summing register (CSR), and pure run-length register (PRR)
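
    As a small illustration of the concatenation view mentioned in this abstract, the sketch below builds the lexicographically smallest de Bruijn sequence (commonly called the "Granddaddy") by concatenating, in lexicographic order, the Lyndon words whose length divides n. It is not the concatenation-tree construction of the paper. The function names are hypothetical, and the Lyndon words are generated with Duval's algorithm, which is an assumption of this sketch rather than something the abstract specifies.

```python
def lyndon_words(n, k=2):
    """Yield all Lyndon words of length <= n over {0..k-1} in lexicographic
    order (Duval's algorithm)."""
    w = [-1]
    while w:
        w[-1] += 1                 # increment the last symbol
        yield list(w)
        m = len(w)
        while len(w) < n:          # extend periodically up to length n
            w.append(w[-m])
        while w and w[-1] == k - 1:
            w.pop()                # drop trailing maximal symbols


def granddaddy(n, k=2):
    """Concatenate the Lyndon words whose length divides n, in lex order."""
    return "".join(
        "".join(map(str, w)) for w in lyndon_words(n, k) if n % len(w) == 0
    )


def is_de_bruijn(seq, n, k=2):
    """Check that every length-n window of the cyclic string seq is distinct
    and that there are exactly k**n of them."""
    cyclic = seq + seq[: n - 1]
    windows = {cyclic[i : i + n] for i in range(len(seq))}
    return len(seq) == k ** n and len(windows) == k ** n


if __name__ == "__main__":
    print(granddaddy(4))
```

    For n = 4 the concatenated Lyndon words are 0, 0001, 0011, 01, 0111, 1.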

    Complete characterization of structure of rule 54

    The dynamics of the rule 54 one-dimensional two-state cellular automaton (CA) are a discrete analog of the space-time dynamics of excitations in a nonlinear active medium with mutual inhibition. A cell switches from state 0 to state 1 if at least one of its two neighbors is in state 1 (propagation of a perturbation), and a cell remains in state 1 only if both of its neighbors are in state 0. Lateral inhibition arises because a 1-state neighbor causes a 1-state cell to switch to state 0. The rule produces a rich spectrum of space-time dynamics, including gliders and glider guns, from just four primitive gliders. We construct a catalogue of gliders and describe them by tiles. We calculate a subset of regular expressions $\Psi_{R54}$ to encode gliders. The regular expressions are derived from de Bruijn diagrams, tile-based representations of gliders, and, in some cases, cycle diagrams. We construct an abstract machine that recognizes regular expressions of gliders in rule 54 and validate $\Psi_{R54}$. We also propose a way to code initial configurations of gliders to depict any type of collision between the gliders and explore self-organization of gliders, formation of larger tiles, and soliton-like interactions of gliders and computable devices
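
    The local rule described in this abstract can be sketched as a lookup into the binary expansion of 54 (Wolfram numbering). The function name and the periodic (ring) boundary are assumptions made for this illustration; the abstract does not state a boundary condition.

```python
RULE = 54  # binary 00110110, indexed by the neighborhood (left, centre, right)


def rule54_step(cells):
    """Apply one synchronous update of rule 54 to a ring of 0/1 cells."""
    n = len(cells)
    out = []
    for i in range(n):
        left, centre, right = cells[i - 1], cells[i], cells[(i + 1) % n]
        idx = (left << 2) | (centre << 1) | right   # neighborhood as 3-bit number
        out.append((RULE >> idx) & 1)               # read the matching bit of 54
    return out


if __name__ == "__main__":
    row = [0, 0, 0, 1, 0, 0, 0]
    for _ in range(4):
        print("".join(map(str, row)))
        row = rule54_step(row)
```

    A single excited cell first spreads to both neighbors; in the next step the three adjacent 1-state cells inhibit one another while the perturbation keeps moving outward.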

    A method for constructing decodable de Bruijn sequences

    In this paper we present two related methods of construction for de Bruijn sequences, both based on interleaving "smaller" de Bruijn sequences. Sequences obtained using these construction methods have the advantage that they can be "decoded" very efficiently, i.e., the position within the sequence of any particular "window" can be found very simply. Sequences with simple decoding algorithms are of considerable practical importance in position location applications

    Novel graph based algorithms for transcriptome sequence analysis

    RNA-sequencing (RNA-seq) is one of the most widely used techniques in molecular biology. A key bioinformatics task in any RNA-seq workflow is the assembly of the reads. As the size of transcriptomics data sets is constantly increasing, scalable and accurate assembly approaches have to be developed. Here, we propose several approaches to improve the assembly of RNA-seq data generated by second-generation sequencing technologies. We demonstrate that the systematic removal of irrelevant reads from a high-coverage dataset prior to assembly reduces runtime and improves the quality of the assembly. Further, we propose a novel RNA-seq assembly workflow comprised of read error correction, normalization, assembly with informed parameter selection, and transcript-level expression computation. In recent years, the popularity of third-generation sequencing technologies has increased, as long reads allow for accurate isoform quantification and gene-fusion detection, which is essential for biomedical research. We present a sequence-to-graph alignment method to detect and quantify transcripts for third-generation sequencing data. Also, we propose the first gene-fusion prediction tool that is specifically tailored towards long-read data and hence achieves accurate expression estimation even on complex data sets. Moreover, our method predicted experimentally verified fusion events along with some novel events, which can be validated in the future

    Relations as a program development language

    Pan-genome Search and Storage

    Holley G. Pan-genome Search and Storage. Bielefeld: Universität Bielefeld; 2018. High Throughput Sequencing (HTS) technologies are constantly improving and making genome sequencing more affordable. However, HTS sequencers can only produce short overlapping genome fragments that are erroneous and cover the sequenced genomes unevenly. These genome fragments are assembled based on their overlaps to produce larger contiguous sequences. Since de novo genome assembly is computationally intensive, some species have a reference genome used as a guide for assembling genome fragments from the same species or as a basis for comparative genomics methods. Yet, assembling a genome is an error-prone process depending on the quality of the sequencing data and the heuristics used during the assembly. Furthermore, analyses based on a reference are biased towards the reference. Finally, a single reference cannot reflect the dynamics and diversity of a population of genomes. Overcoming these issues requires moving away from the single-genome reference-centric paradigm and taking advantage of the multiple sequenced genomes available for each species. For this purpose, pan-genomes were introduced as sets of genomes from different strains of the same species. A pan-genome is represented by a multi-genome index exploiting the similarity and redundancy of the genomes it contains. Still, pan-genomes are more difficult to analyze than single genomes because of the large amount of data to be stored and indexed. Current data structures for pan-genome indexing do not fulfill all requirements for pan-genome analysis. Indeed, these data structures are often immutable while the size of a pan-genome grows constantly with newly sequenced genomes. Frequently, these data structures consider only assemblies as input, while unassembled genome fragments abound in databases.
Also, indexing variants and similarities between the genomes of a pan-genome usually requires time- and memory-consuming algorithms such as sequence alignments. Sometimes, pan-genome analysis tools just assume variants and similarities are provided as input. While data structures already exist for pan-genome indexing, no solution is currently proposed for genome fragment compression in a pan-genome context. Indeed, it is often of interest to transmit and store all genome fragments of a pan-genome. However, HTS-specific compression tools are not dynamic and cannot update a compressed archive of genome fragments with new fragments of a genome without decompression. Hence, those tools are poorly adapted to the transmission and storage of genome fragments in a pan-genome context. In this thesis, we aim to provide scalable solutions for pan-genome indexing and storage. We first address the problem of pan-genome indexing by proposing a new alignment-free, reference-free and incremental data structure that considers genome fragments as well as assemblies as input: the Bloom Filter Trie (BFT). The BFT is a tree data structure representing a colored de Bruijn graph in which k-mers, words of length k from the input genomes, are associated with sets of colors representing the genomes in which they occur. The BFT makes extensive use of Bloom filters to navigate in the tree and optimize the graph traversal. A "bursting" method is employed to perform an efficient path and level compaction of the tree. We show that the BFT outperforms a data structure that has similar features but is based on an approximation of the set of indexed k-mers. Secondly, we address the problem of genome fragment compression in a pan-genome context by proposing a new abstract data structure, the guided de Bruijn graph. It augments the de Bruijn graph with k-mer partitions such that the graph traversal is guided to reconstruct exactly the genome fragments when decompressing.
Different techniques are proposed to optimize the storage of fragments in the graph and the partition encoding. We show that the BFT described previously has all features required to index a guided de Bruijn graph and is used in the implementation of our compression method named DARRC. The evaluation of DARRC on a large pan-genome dataset compared to state-of-the-art HTS-specific and general-purpose compression tools shows a 30% compression ratio improvement over the second-best-performing tool of this evaluation
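
    To make the notion of a colored de Bruijn graph in this abstract concrete, the sketch below maps each k-mer to the set of genomes ("colors") in which it occurs, using a plain Python dictionary. This is only an illustration of the abstract data model. It is not the Bloom Filter Trie, it uses no Bloom filters or compaction, and the function names and toy sequences are hypothetical.

```python
from collections import defaultdict


def build_colored_kmer_index(genomes, k):
    """Map every k-mer to the set of genome identifiers (colors) containing it.

    genomes: dict of {color: DNA string}. Exact, uncompressed, in-memory.
    """
    index = defaultdict(set)
    for color, seq in genomes.items():
        for i in range(len(seq) - k + 1):
            index[seq[i : i + k]].add(color)
    return index


def successors(index, kmer, alphabet="ACGT"):
    """Return the indexed k-mers that overlap kmer by k-1 characters,
    i.e. its outgoing neighbors in the de Bruijn graph."""
    return [kmer[1:] + b for b in alphabet if kmer[1:] + b in index]


if __name__ == "__main__":
    idx = build_colored_kmer_index({"g1": "ACGTA", "g2": "CGTAC"}, k=3)
    for kmer in sorted(idx):
        print(kmer, sorted(idx[kmer]), successors(idx, kmer))
```

    Adding a newly sequenced genome only requires inserting its k-mers with a new color, which mirrors the incremental property the abstract asks of a pan-genome index.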

    Isolation and characterization of bacteriophages with therapeutic potential

