111 research outputs found

    Fast and accurate genome anchoring using fuzzy hash maps

    Although hash-based approaches to sequence alignment and genome assembly are long established, their utility is predicated on the rapid identification of exact k-mers from a hash-map or similar data structure. We describe how a fuzzy hash-map can be applied to quickly and accurately align a prokaryotic genome to the reference genome of a related species. Using this technique, a draft genome of Mycoplasma genitalium, sampled at 1X coverage, was accurately anchored against the genome of Mycoplasma pneumoniae. The fuzzy approach to alignment ordered and orientated more than 65% of the reads from the draft genome in under 10 seconds, with an error rate of <1.5%. Without sacrificing execution speed, fuzzy hash-maps also provide a mechanism for error tolerance and variability in k-mer-centric sequence alignment and assembly applications.
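To make the idea of tolerant k-mer lookup concrete, here is a minimal Python sketch. It is not the authors' fuzzy hash-map: the abstract does not describe their data structure, so this example assumes a plain dictionary index and simulates fuzziness by also probing every k-mer at Hamming distance 1. The names `build_index`, `neighbours`, and `fuzzy_lookup` are hypothetical.

```python
from collections import defaultdict

ALPHABET = "ACGT"


def build_index(reference, k):
    """Map every k-mer in the reference to the positions where it occurs."""
    index = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)
    return index


def neighbours(kmer):
    """Yield the k-mer itself and every k-mer at Hamming distance 1."""
    yield kmer
    for i, base in enumerate(kmer):
        for alt in ALPHABET:
            if alt != base:
                yield kmer[:i] + alt + kmer[i + 1:]


def fuzzy_lookup(index, kmer):
    """Return sorted reference positions matching with at most one mismatch."""
    hits = set()
    for candidate in neighbours(kmer):
        # .get() avoids inserting empty entries into the defaultdict.
        hits.update(index.get(candidate, ()))
    return sorted(hits)


# Toy data (assumed for illustration only).
reference = "ACGTACGTTAGC"
index = build_index(reference, 4)
exact = fuzzy_lookup(index, "ACGT")    # exact k-mer present in the reference
one_off = fuzzy_lookup(index, "ACGA")  # one substitution away from "ACGT"
```

Probing neighbours costs 3k + 1 dictionary lookups per query, so this stays practical only for small mismatch budgets. A purpose-built fuzzy hash would presumably avoid that enumeration.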

    Combining AI and AM - Improving Approximate Matching through Transformer Networks

    Approximate matching (AM) is a concept in digital forensics to determine the similarity between digital artifacts. An important use case of AM is the reliable and efficient detection of case-relevant data structures on a blacklist, if only fragments of the original are available. For instance, if only a cluster of indexed malware is still present during the digital forensic investigation, the AM algorithm shall be able to assign the fragment to the blacklisted malware. However, traditional AM functions like TLSH and ssdeep fail to detect files based on their fragments if the presented piece is relatively small compared to the overall file size. A second well-known issue with traditional AM algorithms is the lack of scaling due to the ever-increasing lookup databases. We propose an improved matching algorithm based on transformer models from the field of natural language processing. We call our approach Deep Learning Approximate Matching (DLAM). As a concept from artificial intelligence (AI), DLAM learns characteristic blacklisted patterns during its training phase. DLAM is then able to detect the patterns in a typically much larger file; that is, DLAM focuses on the use case of fragment detection. We reveal that DLAM has three key advantages compared to the prominent conventional approaches TLSH and ssdeep. First, it makes obsolete the tedious extraction of known-bad parts, which until now has been necessary before any search for them with AM algorithms. This allows efficient classification of files on a much larger scale, which is important due to exponentially increasing data to be investigated. Second, depending on the use case, DLAM achieves a similar or even significantly higher accuracy in recovering fragments of blacklisted files. Third, we show that DLAM enables the detection of file correlations in the output of TLSH and ssdeep even for small fragment sizes. Comment: Published at DFRWS USA 2023 as a conference paper

    Genome assembly and quality control for non-model organisms

    This thesis presents my work in genome assembly between 2010 and 2019. Chapter 1 is an introduction to the status of the field, presenting the challenges and opportunities in generating de novo genome assemblies. Chapter 2 presents the development of k-mer spectra validation for assembly completeness, from its beginnings as unique sequence coverage analyses, through its implementation in the Kmer Analysis Toolkit, up to its use to assess consensus accuracy on hybrid assemblies. Chapter 3 describes a series of objective-guided de novo assembly strategies applied to non-model genomes, starting with the assembly of the medicinal plant C. roseus to investigate its biosynthesis pathways, continuing with the chromosome-scale assembly of the ash dieback fungus during the UK outbreak, and concluding with my work assembling the hexaploid wheat genome from whole genome shotgun short read data. Chapter 4 describes the creation of haplotype-collapsed assemblies for 16 specimens of Heliconius butterflies to enable evolutionary analyses, and presents the Sequence Distance Graph framework to work with genome graphs and multi-technology data integration as a step towards haplotype-specific assemblies. Finally, Chapter 5 discusses this research and its impact in the context of the present and future of the field.
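The k-mer spectra validation mentioned above can be illustrated with a small Python sketch. This is not the Kmer Analysis Toolkit's implementation or interface. It is a simplified, assumed version that counts k-mers in reads, builds the multiplicity histogram (the spectrum), and lists well-supported read k-mers that are absent from an assembly. Reverse complements are ignored for brevity, and all function names are hypothetical.

```python
from collections import Counter


def kmer_counts(sequences, k):
    """Count every k-mer across a collection of sequences (forward strand only)."""
    counts = Counter()
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts


def kmer_spectrum(counts):
    """Histogram: multiplicity -> number of distinct k-mers seen that many times."""
    return Counter(counts.values())


def missing_from_assembly(read_counts, assembly_counts, min_multiplicity=2):
    """Read k-mers with enough support that never appear in the assembly."""
    return {
        kmer
        for kmer, n in read_counts.items()
        if n >= min_multiplicity and kmer not in assembly_counts
    }


# Toy data (assumed for illustration only).
reads = ["ACGTAC", "CGTACG", "ACGTAC"]
assembly = ["ACGTA"]
read_counts = kmer_counts(reads, 3)
spectrum = kmer_spectrum(read_counts)
missing = missing_from_assembly(read_counts, kmer_counts(assembly, 3))
```

Well-supported k-mers that are missing from the assembly hint at incomplete content, which is the intuition behind using spectra to judge assembly completeness.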

    Multiple Biological Sequence Alignment: Scoring Functions, Algorithms, and Evaluations

    Aligning multiple biological sequences such as protein sequences or DNA/RNA sequences is a fundamental task in bioinformatics and sequence analysis. These alignments may contain invaluable information that scientists need to predict the sequences' structures, determine the evolutionary relationships between them, or discover drug-like compounds that can bind to the sequences. Unfortunately, multiple sequence alignment (MSA) is NP-Complete. In addition, the lack of a reliable scoring method makes it very hard to align the sequences reliably and to evaluate the alignment outcomes. In this dissertation, we have designed a new scoring method for use in multiple sequence alignment. Our scoring method encapsulates stereo-chemical properties of sequence residues and their substitution probabilities into a tree-structure scoring scheme. This new technique provides a reliable scoring scheme with low computational complexity. In addition to the new scoring scheme, we have designed an overlapping sequence clustering algorithm to use in our three new multiple sequence alignment algorithms. One of our alignment algorithms uses a dynamic weighted guidance tree to perform multiple sequence alignment in progressive fashion. The use of a dynamic weighted tree allows errors in the early alignment stages to be corrected in the subsequent stages. The other two algorithms utilize sequence knowledge-bases and sequence consistency to produce biologically meaningful sequence alignments. To improve the speed of the multiple sequence alignment, we have developed a parallel algorithm that can be deployed on reconfigurable computer models. Analytically, our parallel algorithm is the fastest progressive multiple sequence alignment algorithm.

    Graphical pangenomics

    Completely sequencing genomes is expensive, and to save costs we often analyze new genomic data in the context of a reference genome. This approach distorts our image of the inferred genome, an effect which we describe as reference bias. To mitigate reference bias, I repurpose graphical models previously used in genome assembly and alignment to serve as a reference system in resequencing. To do so I formalize the concept of a variation graph to link genomes to a graphical model of their mutual alignment that is capable of representing any kind of genomic variation, both small and large. As this model combines both sequence and variation information in one structure it serves as a natural basis for resequencing. By indexing the topology, sequence space, and haplotype space of these graphs and developing generalizations of sequence alignment suitable to them, I am able to use them as reference systems in the analysis of a wide array of genomic systems, from large vertebrate genomes to microbial pangenomes. To demonstrate the utility of this approach, I use my implementation to solve resequencing and alignment problems in the context of Homo sapiens and Saccharomyces cerevisiae. I use graph visualization techniques to explore variation graphs built from a variety of sources, including diverged human haplotypes, a gut microbiome, and a freshwater viral metagenome. I find that variation-aware read alignment can eliminate reference bias at known variants, and this is of particular importance in the analysis of ancient DNA, where existing approaches result in significant bias towards the reference genome and concomitant distortion of population genetics results. I validate that the variation graph model can be applied to align RNA sequencing data to a splicing graph. Finally, I show that a classical pangenomic inference problem in microbiology can be solved using a resequencing approach based on variation graphs. Wellcome Trust PhD fellowship
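As a rough illustration of how a variation graph combines sequence and variation in one structure, the following Python sketch stores sequence-labelled nodes, directed edges, and named paths (walks) that spell out individual genomes. It is an assumed, simplified model and not the thesis implementation: strand orientation is ignored, and the class and method names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class VariationGraph:
    nodes: dict = field(default_factory=dict)  # node id -> DNA sequence
    edges: set = field(default_factory=set)    # (from_id, to_id) pairs
    paths: dict = field(default_factory=dict)  # path name -> list of node ids

    def add_node(self, node_id, sequence):
        self.nodes[node_id] = sequence

    def add_edge(self, src, dst):
        if src not in self.nodes or dst not in self.nodes:
            raise KeyError("both endpoints must be existing nodes")
        self.edges.add((src, dst))

    def add_path(self, name, node_ids):
        """Register a genome as a walk; every step must follow an edge."""
        for src, dst in zip(node_ids, node_ids[1:]):
            if (src, dst) not in self.edges:
                raise ValueError(f"no edge {src} -> {dst}")
        self.paths[name] = list(node_ids)

    def path_sequence(self, name):
        """Spell out the sequence embedded by a named path."""
        return "".join(self.nodes[n] for n in self.paths[name])


# Toy example: a single-base substitution (T vs. A) forms a bubble.
graph = VariationGraph()
for node_id, seq in [(1, "ACG"), (2, "T"), (3, "A"), (4, "GGC")]:
    graph.add_node(node_id, seq)
for src, dst in [(1, 2), (1, 3), (2, 4), (3, 4)]:
    graph.add_edge(src, dst)
graph.add_path("ref", [1, 2, 4])
graph.add_path("alt", [1, 3, 4])
```

Both haplotypes share nodes 1 and 4 and differ only in the middle node, so a read carrying either allele can match a path exactly. This is the intuition behind reducing reference bias at known variants.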

    The Data Science Design Manual


    New Fundamental Technologies in Data Mining

    The progress of data mining technology and its large public popularity establish a need for a comprehensive text on the subject. The series of books entitled "Data Mining" addresses this need by presenting in-depth descriptions of novel mining algorithms and many useful applications. In addition to helping readers understand each section deeply, the two books present useful hints and strategies for solving the problems in the following chapters. The contributing authors have highlighted many future research directions that will foster multi-disciplinary collaborations and hence will lead to significant development in the field of data mining.

    Analysis of the early events in the interaction between Venturia inaequalis and the susceptible Golden Delicious apple (Malus x domestica Borkh.)

    Philosophiae Doctor - PhD. Apple (Malus x domestica) production in the Western Cape, South Africa, is one of the major contributors to the gross domestic product (GDP) of the region. The production of apples is affected by a number of diseases. One of the economically important diseases is apple scab, which is caused by the pathogenic fungus Venturia inaequalis. Research to introduce disease resistance ranges from traditional plant breeding through to genetic manipulation. Parallel disease management regimes are also implemented to combat the disease; however, such strategies are becoming increasingly ineffective because some fungal strains have become resistant to fungicides. The recently sequenced apple genome has opened the door to studying the plant-pathogen interaction at a molecular level. This study reports on proteomic and transcriptomic analyses of apple seedlings infected with Venturia inaequalis. In the proteomic analysis, two-dimensional gel electrophoresis (2-DE) in combination with mass spectrometry (MS) was used to separate, visualise and identify apple leaf proteins extracted from infected and uninfected apple seedlings. Using Melanie™ 2-DE Gel Analysis Software version 7.0 (Genebio, Geneva, Switzerland), a comparative analysis of leaf proteome expression patterns between the uninfected and infected apple leaves was conducted. The results indicated proteins with similar expression profiles as well as qualitative and quantitative differences between the two leaf proteomes. Thirty proteins from the apple leaf proteome were identified as differentially expressed. These were selected for analysis using a combination of MALDI-TOF and MALDI-TOF-TOF MS, followed by database searching.
Of these spots, 28 were positively identified with known functions in photosynthesis and carbon metabolism (61%), protein destination and storage (11%), as well as those involved in redox/response to stress, followed by proteins involved in protein synthesis and disease/defence (7%), nucleotide and transport (3%). RNA-Seq was used to identify differentially expressed genes in response to the fungal infection over five time points, namely Day 0, 2, 4, 8 and 12. cDNA libraries were constructed and sequenced using Illumina HiScan SQ™ and MiSeq™ instruments. Nucleotide reads were analysed by aligning them to the apple genome using the TopHat splice-aware aligner, followed by analysis with limma/voom and edgeR, R statistical packages for finding differentially expressed genes. These results showed that 398 genes were differentially expressed in response to fungal infection over the five time points. These mapped to 1164 transcripts in the apple transcripts database, which were submitted to BLAST2GO. Eighty-six percent of the genes obtained a BLAST hit, and 77% of the BLAST hits were assigned GO terms. These were classed into three ontology categories, i.e. biological processes, molecular function and cellular components. By focussing on the host responsive genes, modulation of genes involved in signal perception, transcription, stress/detoxification, defence related proteins, transport and secondary metabolites was observed. A comparative analysis was performed between the Day 4 proteomic and Day 4 transcriptomic data. In the infected and uninfected apple leaf proteome of Day 4, we found that 9 proteins responsive to fungal infection were up-regulated. From the transcriptome data of Day 4, 162 genes were extracted, which mapped to 395 transcripts in the apple transcripts database. These were submitted to BLAST2GO for functional annotation. Proteins encoded by the up-regulated transcripts were functionally categorised.
Pathways affected by the up-regulated genes are carbon metabolism, protein synthesis, defence, and redox/response to stress. Up-regulated genes were involved in signal perception, transcription factors, stress/detoxification, defence related proteins, disease resistance proteins, transport and secondary metabolites. We found that the same pathways, including energy, disease/defence and redox/response to stress, were affected in the comparative analysis. The results of this study can be used as a starting point for targeting host responsive genes in genetic manipulation of apple cultivars.