34 research outputs found
Byte-Aligned Pattern Matching in Encoded Genomic Sequences
In this article, we propose a novel pattern matching algorithm, called BAPM, that performs searching in encoded genomic sequences. The algorithm works at the level of single bytes and achieves sublinear performance on average. The preprocessing phase of the algorithm is linear with respect to the size m of the searched pattern. A simple O(m)-space data structure is used to store all factors (of a defined length) of the searched pattern. These factors are looked up during the searching phase, which ensures sublinear time on average. Our algorithm significantly outperforms state-of-the-art pattern matching algorithms in locate time on medium and long patterns. Furthermore, it cooperates very easily with the block q-gram inverted index; together, the block q-gram inverted index and our pattern matching algorithm achieve locate times superior to those of current index data structures for less frequent patterns. We present experimental results on real genomic data that demonstrate the efficiency of our algorithm.
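The factor-based filtering idea described above can be sketched as follows. This is a simplified illustration of the principle, not the actual BAPM algorithm (the factor length q=3 and the shift rule are choices made for the example): all fixed-length factors of the pattern are stored in a set, and any window whose q-gram is not a pattern factor can be skipped entirely, which yields sublinear behaviour on average.

```python
# Simplified factor-filtering search in the spirit of factor-based
# matchers: all length-q factors of the pattern are kept in a set
# (O(m) space); windows containing a q-gram that is not a pattern
# factor are skipped, so most text positions are never inspected.
def factor_search(text, pattern, q=3):
    m = len(pattern)
    if m < q:
        raise ValueError("pattern shorter than factor length q")
    factors = {pattern[i:i + q] for i in range(m - q + 1)}
    hits = []
    pos = 0
    while pos + m <= len(text):
        j = m - q                      # scan q-grams right to left
        while j >= 0 and text[pos + j:pos + j + q] in factors:
            j -= q
        if j < 0:                      # window consists of factors:
            if text[pos:pos + m] == pattern:   # verify the candidate
                hits.append(pos)
            pos += 1
        else:
            pos += j + 1               # safe shift past the non-factor
    return hits
```

The shift is safe because any occurrence starting at or before the offending q-gram would have to contain it as a factor of the pattern.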
Finite Automata Implementations Considering CPU Cache
Finite automata are mathematical models of finite-state systems. The more general model is the nondeterministic finite automaton (NFA), which cannot be run directly; it is usually transformed into a deterministic finite automaton (DFA), which then runs in time O(n), where n is the size of the input text. We present two main approaches to the practical implementation of DFAs that take the CPU cache into account. The first approach (represented by the Table-Driven and Hard-Coded implementations) is suitable for automata that are run very frequently and typically contain cycles. The other approach is suitable for a collection of automata from which individual automata are retrieved and then run; automata of this second kind are expected to be cycle-free.
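The Table-Driven approach mentioned above can be illustrated with a minimal sketch (the two-state automaton and the 256-entry rows are invented for the example): the transition function is stored as a dense table, so running the DFA costs one array lookup per input symbol, and cache behaviour is governed by the table's memory layout.

```python
# Table-driven DFA run: the transition function is a dense 2-D table
# indexed by (state, input symbol), so each step is one array lookup.
# Keeping each state's row contiguous in memory is cache-friendly.
def run_dfa(table, accepting, text, start=0):
    state = start
    for ch in text:
        state = table[state][ord(ch)]
    return state in accepting

# Example DFA over {'a', 'b'} accepting strings with an even number
# of 'a's (states: 0 = even seen so far, 1 = odd seen so far).
table = [[0] * 256 for _ in range(2)]
table[0][ord('a')] = 1
table[1][ord('a')] = 0
table[0][ord('b')] = 0
table[1][ord('b')] = 1
accepting = {0}
```

A Hard-Coded implementation would instead compile each state into branching code, trading table lookups for instruction-cache pressure.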
MergedTrie: Efficient textual indexing
The accessing and processing of textual information (i.e., the storing and querying of a set of strings) is especially important for many current applications (e.g., information retrieval and social networks), particularly in the fields of Big Data and IoT, which require the handling of very large string dictionaries. Typical data structures for textual indexing are Hash Tables and some variants of Tries, such as the Double Trie (DT). In this paper, we propose an extension of the DT that we have called MergedTrie. It improves the DT compression by merging both Tries into a single one and by segmenting the indexed term into two fixed-length parts in order to balance the new Trie; thus, a higher overlapping of both prefixes and suffixes is obtained. Moreover, we propose a new implementation of Tries that achieves better compression rates than the Double-Array representation usually chosen for implementing Tries. Our proposal also overcomes the limitation of static implementations that do not allow insertions and updates in their compact representations. Finally, our MergedTrie implementation experimentally improves on the efficiency of the Hash Table, DT, Double-Array, Crit-bit, Directed Acyclic Word Graph (DAWG), and Acyclic Deterministic Finite Automaton (ADFA) data structures, requiring less space than the original text to be indexed. This study has been partially funded by the SEQUOIA-UA (TIN2015-63502-C3-3-R) and RESCATA (TIN2015-65100-R) projects of the Spanish Ministry of Economy and Competitiveness (MINECO).
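The splitting idea can be illustrated with a toy sketch. This shows only the prefix/suffix-overlap principle, not the paper's actual MergedTrie representation (the class names and the half-way split point are choices made for the example): each term is cut in two, the back half is reversed before insertion, and both halves share one trie, so common prefixes and common suffixes both overlap.

```python
# Toy illustration of the MergedTrie splitting idea: front halves are
# inserted forwards and back halves reversed into a single trie, and a
# term is a member iff its (front-node, back-node) pair was recorded.
class TrieNode:
    def __init__(self):
        self.children = {}

class SplitTrie:
    def __init__(self):
        self.root = TrieNode()
        self.pairs = set()            # (front-node id, back-node id)

    def _insert(self, s):
        node = self.root
        for ch in s:
            node = node.children.setdefault(ch, TrieNode())
        return id(node)

    def add(self, term):
        half = len(term) // 2
        front = self._insert(term[:half])
        back = self._insert(term[half:][::-1])   # reversed back half
        self.pairs.add((front, back))

    def _walk(self, s):
        node = self.root
        for ch in s:
            node = node.children.get(ch)
            if node is None:
                return None
        return id(node)

    def __contains__(self, term):
        half = len(term) // 2
        front = self._walk(term[:half])
        back = self._walk(term[half:][::-1])
        return front is not None and (front, back) in self.pairs
```

Because "indexing" and "indexed" share both the front path "ind…" and the reversed suffix letters, their trie paths overlap more than in a plain forward trie.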
Proceedings of the Scientific Data Compression Workshop
Continuing advances in space and Earth science require increasing amounts of data to be gathered from spaceborne sensors. NASA expects to launch sensors during the next two decades capable of producing an aggregate of 1500 Megabits per second if operated simultaneously. Such high data rates stress all aspects of end-to-end data systems, and technologies and techniques are needed to relieve these stresses. Potential solutions to the massive data rate problems are: data editing, greater transmission bandwidths, higher-density and faster media, and data compression. Through four subpanels on Science Payload Operations, Multispectral Imaging, Microwave Remote Sensing, and Science Data Management, recommendations were made for research in data compression and scientific data applications to space platforms.
Implementation and Analysis of Alignment Algorithms for Next Generation Sequencing (NGS) Data
The cost of sequencing a living being’s genome has been greatly reduced due to the advent
of Next Generation Sequencing (NGS) techniques. This situation has led to the emergence of
many new aligners that can find the position of a given NGS sequence in a reference genome.
However, it is difficult to choose the aligner that best adapts to a problem given the shortage of
fair comparisons between aligners in terms of alignment effectiveness and computational cost.
In addition, another problem commonly faced by bioinformaticians is the correct adjustment of
the aligners’ metaparameters. These difficulties are even greater considering that these aligners are
commonly used as black-boxes due to their complexity and the lack of a detailed description
and analysis of them.
The objective of this Master's Thesis is to provide a theoretical analysis and implementation
of three aligners that will serve to compare them, showing which algorithm is best for each
type of NGS sequencing problem. Additionally, this work includes an analysis to determine
the influence of aligners’ metaparameters in their performance. In order to cover both long
and short sequence aligners as well as aligners considering either only mutations (mismatches)
or both mutations and gaps, three different aligners, namely Bowtie, BWA and BWT-SW,
have been selected for the analysis. These aligners have a common structure, the FM-Index,
that allows optimal searching in the reference genome with low memory consumption by using
the Burrows-Wheeler transform and its properties. This Master's Thesis offers a description of
the hidden details of every algorithm, which has been possible through the thorough study of
scientific papers where these algorithms are proposed. Both the FM-Index and the aligners were
implemented in C++, and their proper functioning was verified by black-box and white-box unit
testing.
Once the aligners were implemented, the ART software was used to simulate NGS sequences.
This software receives as parameters the NGS technology to simulate and other values to control
sequences’ length, number of mismatches, and gap probability. The behaviour of these three
aligners for different values of these parameters has been compared in terms of execution time
and hit rate, varying every aligner’s metaparameters as well.
The outcomes of this Master's Thesis are a detailed study of some of the most widely used
alignment algorithms based on the FM-Index, which intends to complement the existing literature;
a simple implementation of these aligners, which favors their comprehension and comparison; and
a quantitative comparative analysis, which allows us to conclude when each aligner is more
suitable than the others for a specific sequencing problem.
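The FM-Index backward search that underlies all three aligners can be sketched as follows. This is a minimal illustrative version for exact counting only: it rebuilds rank information by scanning the BWT string on every step, whereas real aligners use sampled rank structures to make each step constant-time, and they extend this core with mismatch and gap handling.

```python
# Minimal FM-Index backward search (illustrative only): count the
# occurrences of a pattern using the Burrows-Wheeler transform.
def bwt(text):
    text += "$"                       # unique sentinel, smallest symbol
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(r[-1] for r in rotations)

def fm_count(bwt_str, pattern):
    # C[c] = number of symbols in the text strictly smaller than c
    counts = {}
    for ch in bwt_str:
        counts[ch] = counts.get(ch, 0) + 1
    C, total = {}, 0
    for ch in sorted(counts):
        C[ch] = total
        total += counts[ch]
    lo, hi = 0, len(bwt_str)          # current suffix-array interval
    for ch in reversed(pattern):      # extend the match one char left
        if ch not in C:
            return 0
        lo = C[ch] + bwt_str[:lo].count(ch)   # LF-mapping via rank
        hi = C[ch] + bwt_str[:hi].count(ch)
        if lo >= hi:
            return 0
    return hi - lo                    # size of the matching interval
```

Each loop iteration narrows the suffix-array interval of suffixes prefixed by the already-matched pattern suffix, which is why the search runs backwards through the pattern.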
Assessment of Alignment Algorithms, Variant Discovery and Genotype Calling Strategies in Exome Sequencing Data
Advances in next generation sequencing (NGS) technologies over the past half decade have enabled many novel genomic applications and have generated unprecedented amounts of new knowledge that is quickly changing how biomedical research is being conducted, as well as how we view human diseases and diversity. As the methods, algorithms, and software used to process NGS data are constantly being developed and improved, performing analyses and determining the validity of the results become complex. Moreover, as sequencing moves from being a research tool to a clinical diagnostic tool, understanding the performance and limitations of bioinformatics pipelines and the results they produce becomes imperative. This thesis aims to assess the performance of nine bioinformatics pipelines for sequence read alignment, variant calling, and genotyping in a Mendelian inherited disease, parent-trio exome sequencing design. A well-characterized reference variant call set from the National Institute of Standards and Technology and the Genome in a Bottle Consortium is used for producing and comparing the analytical performance of each pipeline on the GRCh37 and GRCh38 human references.
Binary Decision Diagrams for sequences: definitions, properties and algorithms
Binary Decision Diagrams (BDDs) are a particular kind of graph for representing boolean functions. In particular, they are rooted directed acyclic graphs in which every node represents a binary decision, or equivalently a branch on a certain boolean variable.
There is a wide range of problems for which BDDs are suitable because of what is called symbolic analysis: it is in fact possible to encode the parameters of a system with boolean variables and, in general, to encode a problem as a boolean function.
Many flavours of decision diagrams exist in the literature. One of them, the Sequence BDD, has recently been used for representing sets of strings, with interesting results.
The study of this kind of BDD led to the discovery of new algorithms, mainly conceived for their reduction, which have been studied both theoretically and experimentally; for the latter purpose, a small BDD package was developed. As a practical application, the problem of indexing substrings has been studied in more depth.
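The node-sharing idea behind such diagrams can be illustrated with a toy hash-consed store for sets of strings. This is a simplified sketch in the spirit of Sequence BDDs, not the package developed in this work (the node encoding and the reduction rule shown are the standard textbook ones): every node branches on one symbol, identical subgraphs are shared through a unique table, and dead branches are eliminated, so the structure is a rooted DAG rather than a tree.

```python
# Toy hash-consed Sequence-BDD-style store: node (sym, lo, hi) denotes
# the set L(lo) UNION { sym.w : w in L(hi) }; identical (sym, lo, hi)
# triples are shared via a unique table, making the structure a DAG.
EMPTY, EPSILON = 0, 1                 # terminals: {} and {""}
_unique = {}                          # (sym, lo, hi) -> node id
_nodes = {}                           # node id -> (sym, lo, hi)

def node(sym, lo, hi):
    if hi == EMPTY:                   # reduction rule: dead hi-branch
        return lo
    key = (sym, lo, hi)
    if key not in _unique:
        nid = len(_nodes) + 2
        _unique[key] = nid
        _nodes[nid] = key
    return _unique[key]

def singleton(word):
    """Chain of nodes accepting exactly one string."""
    n = EPSILON
    for ch in reversed(word):
        n = node(ch, EMPTY, n)
    return n

def union(a, b):
    if a == EMPTY: return b
    if b == EMPTY: return a
    if a == b: return a
    if a == EPSILON or b == EPSILON:
        if b == EPSILON: a, b = b, a  # keep EPSILON in `a`
        sym, lo, hi = _nodes[b]
        return node(sym, union(EPSILON, lo), hi)
    sa, la, ha = _nodes[a]
    sb, lb, hb = _nodes[b]
    if sa == sb:
        return node(sa, union(la, lb), union(ha, hb))
    if sa < sb:                       # keep lo-chains symbol-ordered
        return node(sa, union(la, b), ha)
    return node(sb, union(a, lb), hb)

def contains(n, word):
    for ch in word:
        while n > EPSILON and _nodes[n][0] != ch:
            n = _nodes[n][1]          # skip along lo-edges
        if n <= EPSILON:
            return False
        n = _nodes[n][2]              # consume ch via the hi-edge
    while n > EPSILON:                # "" accepted iff EPSILON is
        n = _nodes[n][1]              # reachable along lo-edges
    return n == EPSILON
```

Adding "car", "cat", and "cart" to one diagram shares the common prefix "ca" and the terminal suffix nodes; a production package would add memoized operations and garbage collection on top of this skeleton.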
The Many Qualities of a New Directly Accessible Compression Scheme
We present a new variable-length computation-friendly encoding scheme, named
SFDC (Succinct Format with Direct aCcessibility), that supports direct and fast
access to any element of the compressed sequence and achieves compression
ratios often higher than those offered by other solutions in the literature.
The SFDC scheme provides a flexible and simple representation geared towards
either practical efficiency or compression ratios, as required. For a text of
length over an alphabet of size and a fixed parameter , the access time of
the proposed encoding is proportional to the length of the character's
code-word, plus an expected overhead, where is the -th number of the
Fibonacci sequence. Overall, it uses bits, where is the length of the
encoded string.
Experimental results show that the performance of our scheme is, in some
respects, comparable with the performance of DACs and Wavelet Trees, which are
among the most efficient schemes. In addition, our scheme qualifies as a
\emph{computation-friendly compression} scheme, as it offers several features
that make it very effective in text processing tasks. In the string matching
problem, which we take as a case study, we experimentally show that the new
scheme enables searches that are up to 29 times faster than standard
string-matching techniques on plain texts.
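The general notion of direct accessibility in variable-length codes can be illustrated in the style of the DACs the abstract compares against. This sketches only the chunk-and-continuation-bit idea, not the SFDC format itself (the chunk width B and the list-based layout are choices made for the example): each integer is split into fixed-width chunks stored in per-level arrays, so the i-th element is recovered by a few level-wise lookups rather than by scanning the sequence.

```python
# Sketch of directly accessible variable-length codes (DAC-style, not
# the SFDC format): chunks of each value live in per-level arrays and
# a continuation bit says whether the value extends to the next level.
B = 2                                  # chunk width in bits (assumed)

def encode(values):
    levels = []                        # per level: (chunks, cont_bits)
    current = list(enumerate(values))
    while current:
        chunks, cont, nxt = [], [], []
        for pos, v in current:
            chunks.append(v & ((1 << B) - 1))
            rest = v >> B
            cont.append(1 if rest else 0)
            if rest:
                nxt.append((pos, rest))
        levels.append((chunks, cont))
        current = nxt
    return levels

def access(levels, i):
    """Recover the i-th value; rank on cont bits maps between levels."""
    value, shift = 0, 0
    for chunks, cont in levels:
        value |= chunks[i] << shift
        if not cont[i]:
            return value
        i = sum(cont[:i])              # rank: index at the next level
        shift += B
    return value
```

A real implementation would replace the `sum(cont[:i])` scan with an o(n)-space rank structure so that each level lookup is constant time.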
On Provable Security for Complex Systems
We investigate the contribution of cryptographic proofs of security to a systematic security engineering process. To this end, we study how to model and prove security for concrete applications in three practical domains: computer networks, data outsourcing, and electronic voting. We conclude that cryptographic proofs of security can benefit a security engineering process in formulating requirements, influencing design, and identifying constraints for the implementation.