
    Efficient Computation of Maximal Exact Matches Between Genomic Sequences

    Sequence alignment is one of the most widely used methods in bioinformatics, being crucial for determining similarities between sequences, from finding genes to predicting functions. The computation of Maximal Exact Matches (MEMs) plays a fundamental part in several algorithms for sequence alignment. MEMs between a reference and a query genome are often used as seeds in a genome aligner to increase its efficiency. MEM computation is a time-consuming step in the sequence alignment process, and improving its performance significantly speeds up the alignment as a whole. Many programs are available today for computing MEMs, from algorithms based on full-text indexes, like essaMEM, to more efficient ones, such as E-MEM, copMEM and bfMEM. However, none of the available programs exploits the high similarity between closely related sequences. In this study, we propose E-MEM2, an improved version of the well-known MEM computation software E-MEM. Trading memory for time, the improved version runs faster than its predecessor, with very large improvements when comparing closely related sequences.
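
    To make the seed-and-extend idea concrete, here is a hedged sketch of MEM detection via a toy k-mer index: every shared k-mer is extended left and right until the genomes disagree, and the resulting non-extendable match is reported once. This illustrates the definition only; E-MEM's actual algorithm and data structures are far more refined, and the function name and parameters below are illustrative.

```python
def find_mems(ref: str, qry: str, k: int = 5, min_len: int = 8):
    """Yield (ref_pos, qry_pos, length) triples of MEMs of length >= min_len."""
    # Index every k-mer of the reference by its starting positions.
    index = {}
    for i in range(len(ref) - k + 1):
        index.setdefault(ref[i:i + k], []).append(i)

    seen = set()
    for j in range(len(qry) - k + 1):
        for i in index.get(qry[j:j + k], []):
            li, lj = i, j
            # Extend the seed to the left while the preceding bases agree.
            while li > 0 and lj > 0 and ref[li - 1] == qry[lj - 1]:
                li, lj = li - 1, lj - 1
            length = (i - li) + k
            # Extend to the right while the following bases agree.
            while li + length < len(ref) and lj + length < len(qry) \
                    and ref[li + length] == qry[lj + length]:
                length += 1
            if length >= min_len and (li, lj) not in seen:
                seen.add((li, lj))   # many seeds land inside the same MEM
                yield li, lj, length

for mem in find_mems("ACGTACGTTGACCA", "TTACGTACGTTGAA"):
    print(mem)   # -> (0, 2, 11)
```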

    Data mining for high performance compression of genomic reads and sequences

    University of Technology Sydney, Faculty of Engineering and Information Technology. The rapid development of next-generation sequencing (NGS) technologies has revolutionized almost all fields of genetics. However, the massive amount of genomic data produced by NGS presents great challenges to data storage, transmission and analysis. Among the various NGS-related big-data challenges, this thesis focuses on short-read data compression, assembled-genome compression and maximal exact match (MEM) detection. First, we propose a new compression algorithm for short-read data. The method utilizes minimizers to exploit the redundant information present in reads: minimizers are used both to group reads and to search for suffix-prefix overlap similarity between two contigs. Our experiments show that the proposed method achieves a better compression ratio than existing methods. Furthermore, we present a high-performance reference-based genome compression algorithm based on a 2-bit encoding scheme and an advanced greedy-matching search on a global hash table. The compression ratio of our method is at least 1.9 times better than that of the best competing algorithm in its best case, and our compression speed is at least 2.9 times faster. Finally, we introduce a method to detect all MEMs between pairs of large genomes. The method conducts fixed k-mer sampling on the query sequence, and the indexed k-mers are filtered from the reference sequence via a Bloom filter. Experiments on large genomes demonstrate that our method is at least 1.8 times faster than the best existing algorithm. Overall, this thesis develops efficient algorithms for pattern discovery in, and compression of, very large genomic sequences.
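
    As a hedged illustration of the minimizer technique used for grouping, the sketch below computes (w, k)-minimizers: the lexicographically smallest of the w consecutive k-mers in each window. Overlapping reads tend to share minimizers, which is what makes them useful grouping keys. The parameters and the lexicographic ordering here are illustrative, not the thesis's exact configuration.

```python
def minimizers(seq: str, w: int, k: int):
    """Return the distinct (position, k-mer) minimizers of seq."""
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    out = []
    for start in range(len(kmers) - w + 1):
        window = kmers[start:start + w]
        pos = start + min(range(w), key=window.__getitem__)
        if not out or out[-1][0] != pos:   # adjacent windows often agree
            out.append((pos, kmers[pos]))
    return out

print(minimizers("ACGTTGCATGACGT", w=4, k=5))
```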

    Allowing mutations in maximal matches boosts genome compression performance.

    Motivation: A maximal match between two genomes is a contiguous, non-extendable subsequence common to the two genomes. DNA bases mutate very often from the genome of one individual to another. When a mutation occurs inside a maximal match, it breaks the match into shorter segments, and the coding cost of using these broken segments for reference-based genome compression is much higher than that of using a maximal match that is allowed to contain mutations. Results: We present memRGC, a novel reference-based genome compression algorithm that leverages mutation-containing matches (MCMs) for genome encoding. MemRGC detects maximal matches between two genomes using a coprime double-window k-mer sampling search scheme; it then extends these matches to cover mismatches (mutations) and their neighbouring maximal matches, forming long MCMs. Experiments reveal that memRGC boosts compression performance by an average of 27% in reference-based genome compression. MemRGC is also better than the best state-of-the-art methods on all of the benchmark datasets, sometimes by as much as 50%. Moreover, memRGC uses much less memory and fewer decompression resources, while providing comparable compression speed. These advantages are of significant benefit to genome data storage and transmission.
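
    To illustrate why mutation-containing matches compress better, the hedged sketch below merges exact matches separated by isolated single-base substitutions into one long match plus a list of (offset, base) mutations; one long record is cheaper to encode than many short segments. This shows the principle only, not memRGC's coprime double-window scheme, and the SNP-only assumption and names are hypothetical.

```python
def merge_across_snps(ref: str, qry: str, i: int, j: int, length: int):
    """Extend a right-maximal exact match ref[i:i+length] == qry[j:j+length]
    across isolated single-base substitutions; return (length, mutations)."""
    mutations = []
    while True:
        ri, qj = i + length, j + length
        # Stop at the sequence ends, or unless the lone mismatching base
        # is immediately followed by bases that match again.
        if ri + 1 >= len(ref) or qj + 1 >= len(qry) \
                or ref[ri + 1] != qry[qj + 1]:
            return length, mutations
        mutations.append((length, qry[qj]))   # record offset + new base
        length += 1
        while i + length < len(ref) and j + length < len(qry) \
                and ref[i + length] == qry[j + length]:
            length += 1

ref = "ACGTACGTAACCGGTT"
qry = "ACGTACGTGACCGGTT"   # one substitution at offset 8 (A -> G)
print(merge_across_snps(ref, qry, 0, 0, 8))   # -> (16, [(8, 'G')])
```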

    Advanced Methods for Real-time Metagenomic Analysis of Nanopore Sequencing Data

    Whole-genome shotgun metagenomic sequencing allows researchers to retrieve information about all organisms in a complex sample. This method enables microbiologists to detect pathogens in clinical samples, study the microbial diversity in various environments, and detect abundance differences of certain microbes under different living conditions. The emergence of nanopore sequencing has offered many new possibilities for clinical and environmental microbiologists. In particular, the portability of the small nanopore sequencing devices and the ability to selectively sequence only DNA from organisms of interest are expected to make a significant contribution to the field. However, both options require memory-efficient methods that perform real-time data analysis on commodity hardware such as ordinary laptops. In this thesis, I present new methods for real-time analysis of nanopore sequencing data in a metagenomic context. These methods are based on optimized algorithmic approaches that query the sequenced data against large sets of reference sequences. The main goal of these contributions is to improve the sequencing and analysis of underrepresented organisms in complex metagenomic samples and to enable this analysis in low-resource settings in the field. First, I introduce ReadBouncer, a new tool for nanopore adaptive sampling that can reject uninteresting DNA molecules during the sequencing process. ReadBouncer improves read classification compared to other adaptive sampling tools and has lower memory requirements. These improvements enable a higher enrichment of underrepresented sequences while performing adaptive sampling in the field. I further show that, besides host sequence removal and enrichment of low-abundance microbes, adaptive sampling can enrich underrepresented plasmid sequences in bacterial samples. These plasmids play a crucial role in the dissemination of antibiotic resistance genes, but their characterization requires expensive and time-consuming lab protocols. I describe how adaptive sampling can be used as a cheap method for the enrichment of plasmids, which can make a significant contribution to point-of-care sequencing of bacterial pathogens. Finally, I introduce a novel memory- and space-efficient algorithm for real-time taxonomic profiling of nanopore reads, implemented in Taxor. It improves the taxonomic classification of nanopore reads compared to other taxonomic profiling tools and tremendously reduces the memory footprint. The resulting database index for thousands of microbial species is small enough to fit into the memory of a small laptop, enabling real-time metagenomic analysis of nanopore sequencing data with large reference databases in the field.
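
    A minimal, hedged sketch of the classification decision behind adaptive sampling: score the sequenced prefix of a read by the fraction of its k-mers found in a reference k-mer set, then keep sequencing or eject the molecule. Real tools, ReadBouncer among them, use compact probabilistic filters rather than a Python set; k and the threshold here are illustrative.

```python
def build_kmer_set(references, k):
    """Collect all k-mers of the reference sequences into a set."""
    return {r[i:i + k] for r in references for i in range(len(r) - k + 1)}

def matches_reference(prefix, kmer_set, k, threshold=0.3):
    """True if enough k-mers of the read prefix occur in the reference set."""
    n = len(prefix) - k + 1
    if n <= 0:
        return False
    hits = sum(prefix[i:i + k] in kmer_set for i in range(n))
    return hits / n >= threshold

# Host depletion: eject molecules whose prefix matches the host k-mer set.
host = build_kmer_set(["ACGTACGTTGACCAGT"], k=7)
for prefix in ["ACGTACGTTGACC", "TTTTTTTTTTTTT"]:
    action = "eject" if matches_reference(prefix, host, k=7) else "sequence"
    print(prefix, "->", action)
```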

    Improving the Compact Bit-Sliced Signature Index COBS for Large Scale Genomic Data

    In this thesis we investigate the potential for improving the Compact Bit-Sliced Signature Index (COBS) [BBGI19] for large-scale genomic data. COBS, developed by Bingmann et al., is an inverted text index based on Bloom filters. It can be used to index the k-mers of DNA samples or the q-grams of plain-text data, and is queried using approximate pattern matching based on the k-mer (or q-gram) profile of a query. In their work, Bingmann et al. demonstrated several advantages COBS has over other state-of-the-art approximate k-mer-based indices, including extraordinarily fast query and construction times, as well as the fact that COBS can be constructed and queried even if the index does not fit into main memory. This is one of the reasons we decided to look more closely at areas in which COBS could be improved. Our main goal is to make COBS more scalable. Scalability is a crucial factor when handling DNA-related data, because the amount of sequenced data stored in publicly available archives nearly doubles every year, making it difficult to handle from the perspective of resources alone. We focus on two main areas of improvement: index compression through clustering, and distribution. The thesis presents our findings and the improvements achieved in these areas.
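
    A hedged sketch of the signature-index query idea behind COBS: one Bloom filter per document; a query k-mer "hits" a document if all of its filter bits are set, and a document matches if at least a fraction theta of the query's k-mers hit. Sizes and hash counts are toy values, and real COBS bit-slices the filters so that one memory access yields one bit for every document at once, which this sketch does not attempt.

```python
import hashlib

class Bloom:
    """Toy Bloom filter; the index keeps one per document."""
    def __init__(self, m: int = 1024, h: int = 3):
        self.m, self.h, self.bits = m, h, 0
    def _positions(self, item: str):
        for i in range(self.h):
            d = hashlib.blake2b(f"{i}:{item}".encode(), digest_size=8).digest()
            yield int.from_bytes(d, "big") % self.m
    def add(self, item: str):
        for p in self._positions(item):
            self.bits |= 1 << p
    def __contains__(self, item: str):
        return all(self.bits >> p & 1 for p in self._positions(item))

def index_docs(docs: dict, k: int = 9) -> dict:
    """Build one Bloom filter over the k-mers of each document."""
    filters = {}
    for name, text in docs.items():
        bf = Bloom()
        for i in range(len(text) - k + 1):
            bf.add(text[i:i + k])
        filters[name] = bf
    return filters

def query(filters: dict, pattern: str, k: int = 9, theta: float = 0.8):
    """Return (document, score) pairs whose k-mer hit fraction >= theta."""
    kmers = [pattern[i:i + k] for i in range(len(pattern) - k + 1)]
    hits = []
    for name, bf in filters.items():
        score = sum(km in bf for km in kmers) / len(kmers)
        if score >= theta:   # approximate k-mer profile match
            hits.append((name, score))
    return hits

filters = index_docs({"sampleA": "ACGTACGTTGACCAGTACGT",
                      "sampleB": "TTGGCCAATTGGCCAATTGG"})
print(query(filters, "ACGTACGTTGAC"))   # -> [('sampleA', 1.0)]
```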

    Sparse and skew hashing of k-mers

    Motivation: A dictionary of k-mers is a data structure that stores a set of n distinct k-mers and supports membership queries. This data structure is at the heart of many important tasks in computational biology. High-throughput DNA sequencing can produce very large k-mer sets, on the order of billions of strings; in such cases, the memory consumption and query efficiency of the data structure become a concrete challenge. Results: To tackle this problem, we describe a compressed and associative dictionary for k-mers, that is, a data structure in which strings are represented in compact form and each is associated with a unique integer identifier in the range [0, n). We show that some statistical properties of k-mer minimizers can be exploited by minimal perfect hashing to substantially improve the space/time trade-off of the dictionary compared to the best-known solutions.
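
    A hedged sketch of the dictionary abstraction described above: each of the n distinct k-mers of a sequence gets a unique identifier in [0, n), and a lookup returns that identifier or -1. A plain dict stands in for the minimal perfect hash function; SSHash additionally stores the k-mers in compact form and exploits minimizer statistics, which this sketch omits entirely.

```python
def build_kmer_dict(seq: str, k: int) -> dict:
    """Assign each distinct k-mer of seq an id in [0, n)."""
    ids = {}
    for i in range(len(seq) - k + 1):
        ids.setdefault(seq[i:i + k], len(ids))   # id = order of first sight
    return ids

def lookup(ids: dict, kmer: str) -> int:
    """Membership query returning the associated id, or -1 on a miss."""
    return ids.get(kmer, -1)

d = build_kmer_dict("ACGTACGTTG", k=4)
print(len(d), lookup(d, "ACGT"), lookup(d, "AAAA"))   # -> 6 0 -1
```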

    Indexing and analysis of large sequencing collections via k-mer matrices

    The 21st century is bringing a tsunami of data in many fields, especially bioinformatics. This paradigm shift requires the development of new processing methods capable of scaling to such data. This work mainly considers massive, tera-scale datasets from genomic sequencing. A common way to process these data is to represent them as a set of words of fixed size, called k-mers. k-mers are widely used as building blocks by many sequencing data analysis techniques. The challenge is to represent the k-mers and their abundances across a large number of datasets. One possibility is the k-mer matrix, in which each row is a k-mer associated with a vector of abundances and each column corresponds to a sample. Some k-mers are erroneous due to sequencing errors and must be discarded; the usual technique is to discard low-abundance k-mers. On complex datasets such as metagenomes, such a filter is not effective and discards too many k-mers. The holistic view of abundances across samples afforded by the matrix representation also enables a new error-detection procedure for such datasets. In summary, we explore the concept of the k-mer matrix, show its scalability in various applications from indexing to analysis, and propose different tools for this purpose. On the indexing side, our tools allowed us to index a large metagenomic dataset from the Tara Oceans project while keeping rare k-mers that are usually discarded by classical k-mer filtering; an important next step is to make the index publicly available. On the analysis side, our matrix construction technique speeds up the differential k-mer analysis of a state-of-the-art tool by an order of magnitude.
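
    A hedged sketch of a k-mer matrix and of the cross-sample rescue idea: a k-mer that is rare in every sample is likely a sequencing error, but one that is rare in one sample yet seen in several others is kept. The thresholds and tiny inputs are illustrative, not those of the thesis.

```python
from collections import Counter

def kmer_matrix(samples: list, k: int) -> dict:
    """Map each k-mer to its abundance vector across the samples."""
    counts = [Counter(s[i:i + k] for i in range(len(s) - k + 1))
              for s in samples]
    return {km: [c[km] for c in counts]
            for km in sorted(set().union(*counts))}

def keep(row, min_abund=2, min_samples=2):
    # An abundance-only filter would just be: max(row) >= min_abund.
    return max(row) >= min_abund or sum(v > 0 for v in row) >= min_samples

matrix = kmer_matrix(["ACGTACGT", "ACGTACTT", "ACGTACGT"], k=5)
for km, row in matrix.items():
    print(km, row, "keep" if keep(row) else "drop")
```

    On this toy input, the k-mers introduced by the error in the second sample appear in one sample only and are dropped, while k-mers shared across samples survive even at abundance 1, which is the behaviour the matrix view enables.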

    Efficient Alignment Algorithms for DNA Sequencing Data

    DNA Next-Generation Sequencing (NGS) technologies produce data at low cost, enabling their application to many ambitious fields such as cancer research, disease control and personalized medicine. However, even after a decade of research, modern aligners and assemblers are far from providing efficient and error-free genome alignments and assemblies, owing to the many inherent complexities of the genome alignment and assembly problems. Many algorithms have been proposed over the years, but there is still much scope for improvement in this research space. New genome alignment algorithms continue to appear, and one of the key differentiators among them is the efficiency of the alignment process. I present a new algorithm for efficiently finding Maximal Exact Matches (MEMs) between two genomes: E-MEM (Efficient computation of maximal exact matches for very large genomes). Computing MEMs is one of the most time-consuming steps of the alignment process; MEMs can be used as seeds in a genome aligner to increase its efficiency. The E-MEM program is, as of today, the most efficient algorithm for computing MEMs, surpassing all competition by large margins. Many genome assembly algorithms are available, but none produces perfect assemblies, so it is important that the assemblies they produce are evaluated accurately and efficiently. This is necessary to make the right choice of genome assembler for all downstream research and analysis; a fast genome assembly evaluator is also a key asset when developing a new assembler, to quickly evaluate the outcome of the algorithm. I present a fast and efficient genome assembly evaluator called LASER (Large genome ASsembly EvaluatoR), which is based on the leading evaluator QUAST but is significantly more efficient in both memory and run time. NGS technologies limit the potential of genome assembly algorithms because of short read lengths and non-uniform coverage. Recently proposed third-generation sequencing technologies promise very long reads and uniform coverage, but come with their own drawback: a high error rate of 10-15%, consisting mostly of indels. Long-read sequencing data are useful only after error correction, obtained using self read alignment (or read overlapping) techniques. I propose a new self read alignment algorithm for Pacific Biosciences sequencing data: HISEA (Hierarchical SEed Aligner), which has very high sensitivity and precision compared to other state-of-the-art aligners. HISEA is also integrated into the Canu assembly pipeline; Canu+HISEA produces better assemblies than Canu with its default aligner, MHAP, at a much lower coverage.
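
    Evaluators in the QUAST family summarize assemblies with contig-length statistics, of which N50 is the best known. The standard N50 computation is sketched below to give a flavour of what such evaluators report; it is not LASER's implementation, and the function name is illustrative.

```python
def n50(contig_lengths):
    """Largest length L such that contigs of length >= L together cover
    at least half of the total assembly length."""
    total = sum(contig_lengths)
    covered = 0
    for length in sorted(contig_lengths, reverse=True):
        covered += length
        if 2 * covered >= total:
            return length

print(n50([100, 80, 60, 40, 20]))   # 100 + 80 >= 300 / 2, so N50 = 80
```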

    Developing bioinformatics approaches for the analysis of influenza virus whole genome sequence data

    Influenza viruses represent a major public health burden worldwide, resulting in an estimated 500,000 deaths per year, with the potential for devastating pandemics. Considerable effort is expended in the surveillance of influenza, including major World Health Organization (WHO) initiatives such as the Global Influenza Surveillance and Response System (GISRS). To this end, whole-genome sequencing (WGS), and corresponding bioinformatics pipelines, have emerged as powerful tools. However, due to the inherent diversity of influenza genomes, circulation in several different host species, and noise in short-read data, several pitfalls can appear during bioinformatics processing and analysis. Results: Conventional mapping approaches can be insufficient when a sub-optimal reference strain is chosen. For short-read datasets simulated from human-origin influenza H1N1 HA sequences, read recovery after single-reference mapping was routinely as low as 90% for human-origin influenza sequences, and often lower than 10% for those from avian hosts. To this end, I developed VAPOR, a software tool that uses de Bruijn graphs (DBGs) for classification of influenza WGS datasets. In real-data benchmarking using 257 WGS read sets with corresponding de novo assemblies, VAPOR provided classifications for all samples with a mean of >99.8% identity to assembled contigs. This resulted in an increase in the number of mapped reads of 6.8% on average, up to a maximum of 13.3%. Additionally, using simulations, I demonstrate that classification from reads may be applied to the detection of reassorted strains. Conclusions: The approach used in this study has the potential to simplify bioinformatics pipelines for surveillance, providing a novel method for detecting influenza strains of human and non-human origin directly from reads, minimizing the data loss and bias associated with conventional mapping, and facilitating alignments that would otherwise require slow de novo assembly. Whilst these pitfalls can largely be avoided with expertise and time, pre-classification remedies them in a single step. Furthermore, this algorithm could be adapted in future to the surveillance of other RNA viruses. VAPOR is available at https://github.com/connor-lab/vapor. Lastly, VAPOR could be improved by future implementation in C++, and should employ more efficient methods for DBG representation.
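
    A hedged sketch of pre-classification by k-mer sharing: score each candidate reference by the fraction of read k-mers it contains, then map against the best-scoring one. VAPOR itself builds and walks a de Bruijn graph of the references; this set-based scoring only illustrates the reference-selection idea, and all sequences below are toy data.

```python
def score_reference(ref: str, reads: list, k: int = 7) -> float:
    """Fraction of read k-mers that occur in the candidate reference."""
    ref_kmers = {ref[i:i + k] for i in range(len(ref) - k + 1)}
    read_kmers = [r[i:i + k] for r in reads for i in range(len(r) - k + 1)]
    return sum(km in ref_kmers for km in read_kmers) / len(read_kmers)

refs = {"strainA": "ACGTACGTTGACCAGTAC", "strainB": "TTGGCCAATTGGCCAATTGG"}
reads = ["ACGTACGTTG", "TGACCAGTAC"]
best = max(refs, key=lambda name: score_reference(refs[name], reads))
print(best)   # -> strainA: its k-mers cover the simulated reads
```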