
    Byte-Aligned Pattern Matching in Encoded Genomic Sequences

    In this article, we propose a novel pattern matching algorithm, called BAPM, that searches directly in encoded genomic sequences. The algorithm works at the level of single bytes and achieves sublinear performance on average. The preprocessing phase is linear in the pattern length m. A simple O(m)-space data structure stores all factors (of a fixed length) of the searched pattern; these factors are then located during the searching phase, which ensures sublinear time on average. Our algorithm significantly outperforms state-of-the-art pattern matching algorithms in locate time on medium-length and long patterns. Furthermore, it cooperates easily with the block q-gram inverted index; together, they achieve locate times superior to current index data structures for less frequent patterns. We present experimental results on real genomic data that demonstrate the efficiency of our algorithm.
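
    The abstract does not spell out the byte encoding, so the following sketch is only an illustration under a common assumption: four bases are packed into one byte (2 bits each), and the search compares whole bytes, so one comparison covers four characters. For each of the four ways a pattern can straddle byte boundaries, it locates the pattern's fully covered "core" bytes in the packed text and verifies the ragged ends against the plain text; BAPM's factor data structure and average-case analysis are not reproduced here.

        #include <cstdint>
        #include <iostream>
        #include <string>
        #include <vector>

        // Pack A,C,G,T two bits each, four bases per byte: base k lands in
        // byte k/4 at bit offset 2*(k%4).
        static uint8_t code(char c) {
            switch (c) { case 'A': return 0; case 'C': return 1;
                         case 'G': return 2; default:  return 3; }  // 'T'
        }

        static std::string pack(const std::string& s) {
            std::string out((s.size() + 3) / 4, '\0');
            for (size_t k = 0; k < s.size(); ++k)
                out[k / 4] = (char)(out[k / 4] | (code(s[k]) << (2 * (k % 4))));
            return out;
        }

        // Byte-level search: 'off' is the number of pattern characters that
        // precede the first fully covered byte. Hits are grouped by
        // alignment, not sorted. Needs pattern length >= 7 so that every
        // alignment owns at least one full core byte.
        std::vector<size_t> bytewise_find(const std::string& txt,
                                          const std::string& pat) {
            std::vector<size_t> hits;
            const std::string packed = pack(txt);
            for (size_t off = 0; off < 4; ++off) {
                if (pat.size() < off + 4) continue;
                const std::string core =
                    pack(pat.substr(off, (pat.size() - off) / 4 * 4));
                for (size_t j = packed.find(core); j != std::string::npos;
                     j = packed.find(core, j + 1)) {
                    if (4 * j < off) continue;
                    size_t p = 4 * j - off;           // candidate start in txt
                    if (p + pat.size() <= txt.size() &&
                        txt.compare(p, pat.size(), pat) == 0)
                        hits.push_back(p);
                }
            }
            return hits;
        }

        int main() {
            for (size_t p : bytewise_find("ACGTACGTTACGTACG", "GTACGTT"))
                std::cout << p << '\n';               // prints 2
        }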

    Finite Automata Implementations Considering CPU Cache

    Finite automata are mathematical models of finite-state systems. The more general model is the nondeterministic finite automaton (NFA), which cannot be run directly; it is usually transformed into a deterministic finite automaton (DFA), which then runs in time O(n), where n is the size of the input text. We present two main approaches to the practical implementation of DFAs with the CPU cache in mind. The first approach (represented by the Table Driven and Hard Coded implementations) is suitable for automata that are run very frequently and typically contain cycles. The other approach is suitable for a collection of automata from which individual automata are retrieved and then run; automata of this second kind are expected to be cycle-free.
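
    As a rough illustration of the first approach, here is a minimal table-driven DFA in C++ (the example automaton and all names are ours, not from the paper): the transition function sits in one flat states x 256 array, so the hot loop is a single dependent load per input byte. A hard-coded implementation would instead compile each state into a switch or jump-threaded code, trading the table's data-cache footprint for instruction-cache pressure.

        #include <cstdint>
        #include <iostream>
        #include <string>
        #include <vector>

        // Table-driven DFA: a dense [state][byte] transition table in one
        // flat allocation keeps rows contiguous and cache-friendly.
        struct TableDFA {
            int states;
            std::vector<int> delta;            // states * 256 entries
            std::vector<uint8_t> accepting;    // 1 if the state accepts

            bool run(const std::string& text) const {   // O(n) in text size
                int q = 0;
                for (unsigned char c : text)
                    q = delta[q * 256 + c];
                return accepting[q] != 0;
            }
        };

        // Example (ours): accepts exactly the strings containing "ab".
        TableDFA contains_ab() {
            TableDFA d{3, std::vector<int>(3 * 256, 0), {0, 0, 1}};
            for (int c = 0; c < 256; ++c) d.delta[2 * 256 + c] = 2;  // sink
            d.delta[0 * 256 + (unsigned char)'a'] = 1;
            d.delta[1 * 256 + (unsigned char)'a'] = 1;
            d.delta[1 * 256 + (unsigned char)'b'] = 2;
            return d;
        }

        int main() {
            TableDFA d = contains_ab();
            std::cout << d.run("xxabyy") << d.run("aaa") << '\n';  // prints 10
        }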

    MergedTrie: Efficient textual indexing

    The accessing and processing of textual information (i.e. the storing and querying of a set of strings) is especially important for many current applications (e.g. information retrieval and social networks), particularly in the fields of Big Data and IoT, which require the handling of very large string dictionaries. Typical data structures for textual indexing are Hash Tables and variants of Tries such as the Double Trie (DT). In this paper, we propose an extension of the DT that we call MergedTrie. It improves the DT's compression by merging both Tries into a single one and by segmenting the indexed term into two fixed-length parts in order to balance the new Trie; a higher overlap of both prefixes and suffixes is thus obtained. Moreover, we propose a new implementation of Tries that achieves better compression rates than the Double-Array representation usually chosen for implementing Tries. Our proposal also overcomes the limitation of static implementations, which do not allow insertions and updates in their compact representations. Finally, our MergedTrie implementation experimentally outperforms Hash Tables, DTs, the Double-Array, the Crit-bit, Directed Acyclic Word Graphs (DAWG), and Acyclic Deterministic Finite Automata (ADFA), while requiring less space than the original text to be indexed. This study has been partially funded by the SEQUOIA-UA (TIN2015-63502-C3-3-R) and RESCATA (TIN2015-65100-R) projects of the Spanish Ministry of Economy and Competitiveness (MINECO).
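
    The paper's actual node layout is not given in the abstract; the sketch below (all names ours) only illustrates the stated idea: split each term into two fixed-length halves, insert the front half and the reversed back half into one shared trie so that common prefixes and common suffixes overlap in the same node pool, and identify the term with the pair of trie nodes reached. The real MergedTrie additionally fixes the split length across the dictionary to balance the trie and uses a compact dynamic representation rather than per-node maps.

        #include <iostream>
        #include <map>
        #include <set>
        #include <string>
        #include <utility>
        #include <vector>

        // One shared trie holds front halves as-is and back halves reversed,
        // so common prefixes AND common suffixes share nodes.
        struct Trie {
            std::vector<std::map<char, int>> kids{1};   // node 0 is the root
            int insert(const std::string& s) {
                int v = 0;
                for (char c : s) {
                    auto it = kids[v].find(c);
                    if (it != kids[v].end()) { v = it->second; continue; }
                    int u = (int)kids.size();
                    kids[v][c] = u;
                    kids.emplace_back();
                    v = u;
                }
                return v;
            }
            int walk(const std::string& s) const {      // -1 if path missing
                int v = 0;
                for (char c : s) {
                    auto it = kids[v].find(c);
                    if (it == kids[v].end()) return -1;
                    v = it->second;
                }
                return v;
            }
        };

        struct MergedTrieSketch {
            Trie trie;
            std::set<std::pair<int, int>> terms;  // (front node, back node)

            static std::pair<std::string, std::string> split(const std::string& w) {
                size_t h = w.size() / 2;                // fixed split point
                return {w.substr(0, h), std::string(w.rbegin(), w.rend() - h)};
            }
            void add(const std::string& w) {
                auto [f, b] = split(w);
                terms.insert({trie.insert(f), trie.insert(b)});
            }
            bool contains(const std::string& w) const {
                auto [f, b] = split(w);
                int a = trie.walk(f), c = trie.walk(b);
                return a >= 0 && c >= 0 && terms.count({a, c}) > 0;
            }
        };

        int main() {
            MergedTrieSketch mt;
            mt.add("interface");
            mt.add("interfere");   // shares the front-half chain "inte"
            std::cout << mt.contains("interface")
                      << mt.contains("interfaces") << '\n';   // prints 10
        }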

    Proceedings of the Scientific Data Compression Workshop

    Continuing advances in space and Earth science require increasing amounts of data to be gathered by spaceborne sensors. NASA expects to launch sensors during the next two decades which, operated simultaneously, would produce an aggregate of 1500 Megabits per second. Such high data rates stress all aspects of end-to-end data systems, and technologies and techniques are needed to relieve that stress. Potential solutions to the massive data rate problem include data editing, greater transmission bandwidth, higher-density and faster media, and data compression. Through four subpanels on Science Payload Operations, Multispectral Imaging, Microwave Remote Sensing, and Science Data Management, recommendations were made for research in data compression and scientific data applications on space platforms.

    Implementation and analysis of alignment algorithms for Next Generation Sequencing (NGS) data

    The cost of sequencing a living being's genome has been greatly reduced over the last ten years due to the advent of Next Generation Sequencing (NGS) techniques. This situation has led to the appearance of many new aligners that can find the position of a given NGS sequence in a reference genome. However, it is difficult to choose the aligner that best adapts to a problem, given the shortage of fair comparisons between aligners in terms of alignment effectiveness and computational cost. Besides, another problem commonly faced by bioinformaticians is the correct adjustment of each aligner's metaparameters. These difficulties are compounded by the fact that aligners are commonly used as black boxes, owing to their complexity and the lack of detailed descriptions and analyses of them. The objective of this thesis is to provide a theoretical analysis and implementation of three aligners in order to compare them, showing which algorithm is best for each type of NGS sequencing problem. Additionally, this work analyses the influence of each aligner's metaparameters on its performance. In order to cover both long- and short-sequence aligners, as well as aligners tolerating only mutations (mismatches) and aligners tolerating both mutations and gaps, three aligners, namely Bowtie, BWA, and BWT-SW, have been selected for the analysis. These aligners share a common structure, the FM-Index, which allows optimal searching in the reference genome with low memory consumption by using the Burrows-Wheeler transform and its properties. This thesis offers a description of the hidden details of every algorithm, made possible by a thorough study of the scientific papers in which these algorithms were proposed. Both the FM-Index and the aligners were implemented in C++, and their proper functioning was verified by black-box and white-box unit testing. Once the aligners were implemented, the ART software was used to simulate NGS reads. This software receives as parameters the NGS technology to simulate and other values that control sequence length, the number of mismatches, and the gap probability. The behaviour of the three aligners for different values of these parameters has been compared in terms of execution time and hit rate, varying each aligner's metaparameters as well. The outcomes of this thesis are a detailed study of some of the most widely used alignment algorithms based on the FM-Index, which is intended to complement the existing literature; a simple implementation of these aligners, which favours their comprehension and comparison; and a quantitative comparative analysis, which allows us to conclude when each aligner is more suitable than the others for a specific sequencing problem.
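
    No code from the thesis is reproduced here; as background, this is a minimal sketch of the FM-Index backward search that Bowtie, BWA, and BWT-SW share, with deliberately naive construction choices (rotation sorting instead of suffix-array construction, a full occurrence table instead of a sampled one).

        #include <algorithm>
        #include <iostream>
        #include <numeric>
        #include <string>
        #include <vector>

        // Minimal FM-Index with backward search. The BWT is built by sorting
        // all rotations (quadratic; real aligners build a suffix array), and
        // occ is a full 256 x (n+1) table (real implementations sample it).
        // Bowtie and BWA layer inexact matching on top of this exact search.
        struct FMIndex {
            std::string bwt;
            std::vector<long> C;                // C[c] = #chars in text < c
            std::vector<std::vector<long>> occ; // occ[c][i] = #c in bwt[0..i)

            explicit FMIndex(std::string text) {
                text += '$';                    // unique, smallest sentinel
                const long n = (long)text.size();
                std::vector<long> rot(n);
                std::iota(rot.begin(), rot.end(), 0L);
                std::sort(rot.begin(), rot.end(), [&](long a, long b) {
                    return text.substr(a) + text.substr(0, a)
                         < text.substr(b) + text.substr(0, b);
                });
                for (long r : rot) bwt += text[(r + n - 1) % n];
                C.assign(256, 0);
                occ.assign(256, std::vector<long>(n + 1, 0));
                for (long i = 0; i < n; ++i) {
                    for (int c = 0; c < 256; ++c) occ[c][i + 1] = occ[c][i];
                    ++occ[(unsigned char)bwt[i]][i + 1];
                    ++C[(unsigned char)text[i]];
                }
                long sum = 0;
                for (int c = 0; c < 256; ++c) { long k = C[c]; C[c] = sum; sum += k; }
            }

            // Backward search: scan the pattern right to left, maintaining
            // the interval [lo, hi) of rotations prefixed by the suffix of
            // the pattern matched so far.
            long count(const std::string& pat) const {
                long lo = 0, hi = (long)bwt.size();
                for (auto it = pat.rbegin(); it != pat.rend() && lo < hi; ++it) {
                    const unsigned char c = (unsigned char)*it;
                    lo = C[c] + occ[c][lo];
                    hi = C[c] + occ[c][hi];
                }
                return hi > lo ? hi - lo : 0;
            }
        };

        int main() {
            FMIndex fm("GATTACAGATTA");
            std::cout << fm.count("ATTA") << ' '
                      << fm.count("CAT") << '\n';   // prints 2 0
        }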

    Assessment of Alignment Algorithms, Variant Discovery and Genotype Calling Strategies in Exome Sequencing Data

    Advances in next generation sequencing (NGS) technologies over the past half decade have enabled many novel genomic applications and have generated unprecedented amounts of new knowledge that is quickly changing how biomedical research is conducted, as well as how we view human diseases and diversity. As the methods, algorithms, and software used to process NGS data are constantly being developed and improved, performing analyses and determining the validity of the results become complex. Moreover, as sequencing moves from a research tool to a clinical diagnostic tool, understanding the performance and limitations of bioinformatics pipelines and the results they produce becomes imperative. This thesis assesses the performance of nine bioinformatics pipelines for sequence read alignment, variant calling, and genotyping in a Mendelian inherited disease, parent-trio exome sequencing design. A well-characterized reference variant call set from the National Institute of Standards and Technology and the Genome in a Bottle Consortium is used to produce and compare the analytical performance of each pipeline on the GRCh37 and GRCh38 human references.

    Binary Decision Diagrams for sequences: definitions, properties and algorithms

    Binary Decision Diagrams (BDDs) are a particular kind of graph for representing boolean functions: rooted directed acyclic graphs in which every node represents a binary decision, or equivalently a branch on a certain boolean variable. There is a wide range of problems for which BDDs are suitable because of what is called symbolic analysis: it is possible to encode the parameters of a system with boolean variables and, in general, to encode a problem as a boolean function. Many flavours of decision diagrams exist in the literature. One of them, the Sequence BDD, has recently been used for representing sets of strings, with interesting results. The study of this kind of BDD led to the discovery of new algorithms, mainly conceived for their reduction, which have been studied both theoretically and experimentally; for the latter purpose, a small BDD package was developed. As a practical application, the problem of indexing substrings has been studied in more depth.
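
    The abstract fixes no notation, so the following C++ sketch is only a plausible reading of a sequence BDD: a node (letter, then, else) denotes the strings of 'then' each prefixed with the letter, together with the set 'else', with letters strictly increasing along else-chains. Reduction keeps the diagram canonical by sharing equal nodes through a unique table and by suppressing any node whose 'then' edge is the empty set, in the spirit of zero-suppressed BDDs.

        #include <iostream>
        #include <map>
        #include <string>
        #include <tuple>
        #include <vector>

        // Sequence-BDD sketch. Terminal 0 is the empty set; terminal 1 is
        // the set holding only the empty string.
        struct SeqBDD {
            struct Node { char letter; int then_, else_; };
            std::vector<Node> nodes{{0, -1, -1}, {0, -1, -1}};  // terminals
            std::map<std::tuple<char, int, int>, int> uniq;

            int make(char a, int t, int e) {
                if (t == 0) return e;        // empty 'then': node vanishes
                auto key = std::make_tuple(a, t, e);
                auto it = uniq.find(key);
                if (it != uniq.end()) return it->second;  // share equal node
                nodes.push_back({a, t, e});
                return uniq[key] = (int)nodes.size() - 1;
            }

            int single(const std::string& w) {   // the singleton set {w}
                int v = 1;
                for (auto it = w.rbegin(); it != w.rend(); ++it)
                    v = make(*it, v, 0);
                return v;
            }

            int unite(int p, int q) {   // set union, structural recursion
                if (p == 0 || p == q) return q;
                if (q == 0) return p;
                if (p == 1 || (q != 1 && nodes[q].letter < nodes[p].letter))
                    std::swap(p, q);    // make p the smaller-letter node
                if (q != 1 && nodes[p].letter == nodes[q].letter)
                    return make(nodes[p].letter,
                                unite(nodes[p].then_, nodes[q].then_),
                                unite(nodes[p].else_, nodes[q].else_));
                return make(nodes[p].letter, nodes[p].then_,
                            unite(nodes[p].else_, q));
            }

            bool contains(int v, const std::string& w) const {
                size_t i = 0;
                while (v > 1) {
                    if (i < w.size() && nodes[v].letter == w[i]) {
                        v = nodes[v].then_; ++i;
                    } else v = nodes[v].else_;
                }
                return v == 1 && i == w.size();
            }
        };

        int main() {
            SeqBDD f;
            int s = f.unite(f.single("car"),
                            f.unite(f.single("cat"), f.single("art")));
            // "cat" and "art" share the ('t', 1, 0) node via the unique table.
            std::cout << f.contains(s, "cat")
                      << f.contains(s, "ca") << '\n';   // prints 10
        }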

    The Many Qualities of a New Directly Accessible Compression Scheme

    We present a new variable-length computation-friendly encoding scheme, named SFDC (Succinct Format with Direct aCcessibility), that supports direct and fast access to any element of the compressed sequence and achieves compression ratios often higher than those offered by other solutions in the literature. The SFDC scheme provides a flexible and simple representation geared towards either practical efficiency or compression ratio, as required. For a text of length $n$ over an alphabet of size $\sigma$ and a fixed parameter $\lambda$, the access time of the proposed encoding is proportional to the length of the character's codeword, plus an expected $\mathcal{O}((F_{\sigma-\lambda+3}-3)/F_{\sigma+1})$ overhead, where $F_j$ is the $j$-th Fibonacci number. Overall it uses $N + \mathcal{O}\big(n(\lambda - (F_{\sigma+3}-3)/F_{\sigma+1})\big) = N + \mathcal{O}(n)$ bits, where $N$ is the length of the encoded string. Experimental results show that the performance of our scheme is, in some respects, comparable with that of DACs and Wavelet Trees, which are among the most efficient schemes. In addition, our scheme qualifies as a computation-friendly compression scheme, as it has several features that make it very effective in text processing tasks. For the string matching problem, which we take as a case study, we experimentally show that the new scheme is up to 29 times faster than standard string-matching techniques on plain texts.
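
    The codeword format is not defined in the abstract; the Fibonacci numbers in the bounds suggest Fibonacci-style codes, so purely as background, here is the classic Fibonacci (Zeckendorf) code, whose self-delimiting "11"-terminated codewords are what make decoding, and hence direct access, possible without a separate length table. The lambda-parameterized SFDC layout itself is not implemented here.

        #include <cstdint>
        #include <iostream>
        #include <string>
        #include <vector>

        // Classic Fibonacci coding (NOT the SFDC layout). A positive integer
        // is written as a sum of non-consecutive Fibonacci numbers
        // (Zeckendorf form); bits are listed low-to-high and a final '1' is
        // appended, so every codeword ends in "11".
        std::string fib_encode(uint64_t n) {                 // requires n >= 1
            std::vector<uint64_t> fib{1, 2};
            while (fib.back() <= n)
                fib.push_back(fib[fib.size() - 1] + fib[fib.size() - 2]);
            std::string bits(fib.size() - 1, '0');
            for (size_t i = fib.size() - 2; ; --i) {         // greedy picks
                if (fib[i] <= n) { bits[i] = '1'; n -= fib[i]; }
                if (i == 0) break;
            }
            return bits + '1';                               // ends in "11"
        }

        std::vector<uint64_t> fib_decode(const std::string& stream) {
            std::vector<uint64_t> out, fib{1, 2};
            uint64_t n = 0;
            size_t i = 0;
            bool prev = false;
            for (char b : stream) {
                if (b == '1' && prev) {             // "11" closes a codeword
                    out.push_back(n);
                    n = 0; i = 0; prev = false;
                    continue;
                }
                while (fib.size() <= i)
                    fib.push_back(fib[fib.size() - 1] + fib[fib.size() - 2]);
                if (b == '1') { n += fib[i]; prev = true; } else prev = false;
                ++i;
            }
            return out;
        }

        int main() {
            std::string s = fib_encode(4) + fib_encode(1) + fib_encode(11);
            std::cout << s << '\n';        // 101111001011  (1011 | 11 | 001011)
            for (uint64_t v : fib_decode(s)) std::cout << v << ' ';  // 4 1 11
            std::cout << '\n';
        }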

    On Provable Security for Complex Systems

    We investigate the contribution of cryptographic proofs of security to a systematic security engineering process. To this end, we study how to model and prove security for concrete applications in three practical domains: computer networks, data outsourcing, and electronic voting. We conclude that cryptographic proofs of security can benefit a security engineering process by helping to formulate requirements, influence design, and identify constraints for the implementation.