1,240 research outputs found
Efficient approximate string matching techniques for sequence alignment
One of the outstanding milestones achieved in recent years in the field of biotechnology research has been the development of high-throughput sequencing (HTS). Due to the fact that at the moment it is technically impossible to decode the genome as a whole, HTS technologies read billions of relatively short chunks of a genome at random locations. Such reads then need to be located within a reference for the species being studied (that is aligned or mapped to the genome): for each read one identifies in the reference regions that share a large sequence similarity with it, therefore indicating what the read¿s point or points of origin may be. HTS technologies are able to re-sequence a human individual (i.e. to establish the differences between his/her individual genome and the reference genome for the human species) in a very short period of time. They have also paved the way for the development of a number of new protocols and methods, leading to novel insights in genomics and biology in general. However, HTS technologies also pose a challenge to traditional data analysis methods; this is due to the sheer amount of data to be processed and the need for improved alignment algorithms that can generate accurate results quickly.
This thesis tackles the problem of sequence alignment as a step within the analysis of HTS data. Its contributions focus on both the methodological aspects and the algorithmic challenges towards efficient, scalable, and accurate HTS mapping. From a methodological standpoint, this thesis strives to establish a comprehensive framework able to assess the quality of HTS mapping results. In order to be able to do so one has to understand the source and nature of mapping conflicts, and explore the accuracy limits inherent in how sequence alignment is performed for current HTS technologies. From an algorithmic standpoint, this work introduces state-of-the-art index structures and approximate string matching algorithms. They contribute novel insights that can be used in practical applications towards efficient and accurate read mapping. More in detail, first we present methods able to reduce the storage space taken by indexes for genome-scale references, while still providing fast query access in order to support effective search algorithms. Second, we describe novel filtering techniques that vastly reduce the computational requirements of sequence mapping, but are nonetheless capable of giving strict algorithmic guarantees on the completeness of the results. Finally, this thesis presents new incremental algorithmic techniques able to combine several approximate string matching algorithms; this leads to efficient and flexible search algorithms allowing the user to reach arbitrary search depths.
All algorithms and methodological contributions of this thesis have been implemented as components of a production aligner, the GEM-mapper, which is publicly available, widely used worldwide and cited by a sizeable body of literature. It offers flexible and accurate sequence mapping while outperforming other HTS mappers both as to running time and to the quality of the results it produces.Uno de los avances más importantes de los últimos años en el campo de la biotecnología ha sido el desarrollo de las llamadas técnicas de secuenciación de alto rendimiento (high-throughput sequencing, HTS). Debido a las limitaciones técnicas para secuenciar un genoma, las técnicas de alto rendimiento secuencian individualmente billones de pequeñas partes del genoma provenientes de regiones aleatorias. Posteriormente, estas pequeñas secuencias han de ser localizadas en el genoma de referencia del organismo en cuestión. Este proceso se denomina alineamiento - o mapeado - y consiste en identificar aquellas regiones del genoma de referencia que comparten una alta similaridad con las lecturas producidas por el secuenciador. De esta manera, en cuestión de horas, la secuenciación de alto rendimiento puede secuenciar un individuo y establecer las diferencias de este con el resto de la especie. En última instancia, estas tecnologías han potenciado nuevos protocolos y metodologías de investigación con un profundo impacto en el campo de la genómica, la medicina y la biología en general. La secuenciación alto rendimiento, sin embargo, supone un reto para los procesos tradicionales de análisis de datos. Debido a la elevada cantidad de datos a analizar, se necesitan nuevas y mejoradas técnicas algorítmicas que puedan escalar con el volumen de datos y producir resultados precisos. Esta tesis aborda dicho problema. Las contribuciones que en ella se realizan se enfocan desde una perspectiva metodológica y otra algorítmica que propone el desarrollo de nuevos algoritmos y técnicas que permitan alinear secuencias de manera eficiente, precisa y escalable. Desde el punto de vista metodológico, esta tesis analiza y propone un marco de referencia para evaluar la calidad de los resultados del alineamiento de secuencias. Para ello, se analiza el origen de los conflictos durante la alineación de secuencias y se exploran los límites alcanzables en calidad con las tecnologías de secuenciación de alto rendimiento. Desde el punto de vista algorítmico, en el contexto de la búsqueda aproximada de patrones, esta tesis propone nuevas técnicas algorítmicas y de diseño de índices con el objetivo de mejorar la calidad y el desempeño de las herramientas dedicadas a alinear secuencias. En concreto, esta tesis presenta técnicas de diseño de índices genómicos enfocados a obtener un acceso más eficiente y escalable. También se presentan nuevas técnicas algorítmicas de filtrado con el fin de reducir el tiempo de ejecución necesario para alinear secuencias. Y, por último, se proponen algoritmos incrementales y técnicas híbridas para combinar métodos de alineamiento y mejorar el rendimiento en búsquedas donde el error esperado es alto. Todo ello sin degradar la calidad de los resultados y con garantías formales de precisión. Para concluir, es preciso apuntar que todos los algoritmos y metodologías propuestos en esta tesis están implementados y forman parte del alineador GEM. Este versátil alineador ofrece resultados de alta calidad en entornos de producción siendo varias veces más rápido que otros alineadores. En la actualidad este software se ofrece gratuitamente, tiene una amplia comunidad de usuarios y ha sido citado en numerosas publicaciones científicas.Postprint (published version
Efficient approximate string matching techniques for sequence alignment
One of the outstanding milestones achieved in recent years in the field of biotechnology research has been the development of high-throughput sequencing (HTS). Due to the fact that at the moment it is technically impossible to decode the genome as a whole, HTS technologies read billions of relatively short chunks of a genome at random locations. Such reads then need to be located within a reference for the species being studied (that is aligned or mapped to the genome): for each read one identifies in the reference regions that share a large sequence similarity with it, therefore indicating what the read¿s point or points of origin may be. HTS technologies are able to re-sequence a human individual (i.e. to establish the differences between his/her individual genome and the reference genome for the human species) in a very short period of time. They have also paved the way for the development of a number of new protocols and methods, leading to novel insights in genomics and biology in general. However, HTS technologies also pose a challenge to traditional data analysis methods; this is due to the sheer amount of data to be processed and the need for improved alignment algorithms that can generate accurate results quickly.
This thesis tackles the problem of sequence alignment as a step within the analysis of HTS data. Its contributions focus on both the methodological aspects and the algorithmic challenges towards efficient, scalable, and accurate HTS mapping. From a methodological standpoint, this thesis strives to establish a comprehensive framework able to assess the quality of HTS mapping results. In order to be able to do so one has to understand the source and nature of mapping conflicts, and explore the accuracy limits inherent in how sequence alignment is performed for current HTS technologies. From an algorithmic standpoint, this work introduces state-of-the-art index structures and approximate string matching algorithms. They contribute novel insights that can be used in practical applications towards efficient and accurate read mapping. More in detail, first we present methods able to reduce the storage space taken by indexes for genome-scale references, while still providing fast query access in order to support effective search algorithms. Second, we describe novel filtering techniques that vastly reduce the computational requirements of sequence mapping, but are nonetheless capable of giving strict algorithmic guarantees on the completeness of the results. Finally, this thesis presents new incremental algorithmic techniques able to combine several approximate string matching algorithms; this leads to efficient and flexible search algorithms allowing the user to reach arbitrary search depths.
All algorithms and methodological contributions of this thesis have been implemented as components of a production aligner, the GEM-mapper, which is publicly available, widely used worldwide and cited by a sizeable body of literature. It offers flexible and accurate sequence mapping while outperforming other HTS mappers both as to running time and to the quality of the results it produces.Uno de los avances más importantes de los últimos años en el campo de la biotecnología ha sido el desarrollo de las llamadas técnicas de secuenciación de alto rendimiento (high-throughput sequencing, HTS). Debido a las limitaciones técnicas para secuenciar un genoma, las técnicas de alto rendimiento secuencian individualmente billones de pequeñas partes del genoma provenientes de regiones aleatorias. Posteriormente, estas pequeñas secuencias han de ser localizadas en el genoma de referencia del organismo en cuestión. Este proceso se denomina alineamiento - o mapeado - y consiste en identificar aquellas regiones del genoma de referencia que comparten una alta similaridad con las lecturas producidas por el secuenciador. De esta manera, en cuestión de horas, la secuenciación de alto rendimiento puede secuenciar un individuo y establecer las diferencias de este con el resto de la especie. En última instancia, estas tecnologías han potenciado nuevos protocolos y metodologías de investigación con un profundo impacto en el campo de la genómica, la medicina y la biología en general. La secuenciación alto rendimiento, sin embargo, supone un reto para los procesos tradicionales de análisis de datos. Debido a la elevada cantidad de datos a analizar, se necesitan nuevas y mejoradas técnicas algorítmicas que puedan escalar con el volumen de datos y producir resultados precisos. Esta tesis aborda dicho problema. Las contribuciones que en ella se realizan se enfocan desde una perspectiva metodológica y otra algorítmica que propone el desarrollo de nuevos algoritmos y técnicas que permitan alinear secuencias de manera eficiente, precisa y escalable. Desde el punto de vista metodológico, esta tesis analiza y propone un marco de referencia para evaluar la calidad de los resultados del alineamiento de secuencias. Para ello, se analiza el origen de los conflictos durante la alineación de secuencias y se exploran los límites alcanzables en calidad con las tecnologías de secuenciación de alto rendimiento. Desde el punto de vista algorítmico, en el contexto de la búsqueda aproximada de patrones, esta tesis propone nuevas técnicas algorítmicas y de diseño de índices con el objetivo de mejorar la calidad y el desempeño de las herramientas dedicadas a alinear secuencias. En concreto, esta tesis presenta técnicas de diseño de índices genómicos enfocados a obtener un acceso más eficiente y escalable. También se presentan nuevas técnicas algorítmicas de filtrado con el fin de reducir el tiempo de ejecución necesario para alinear secuencias. Y, por último, se proponen algoritmos incrementales y técnicas híbridas para combinar métodos de alineamiento y mejorar el rendimiento en búsquedas donde el error esperado es alto. Todo ello sin degradar la calidad de los resultados y con garantías formales de precisión. Para concluir, es preciso apuntar que todos los algoritmos y metodologías propuestos en esta tesis están implementados y forman parte del alineador GEM. Este versátil alineador ofrece resultados de alta calidad en entornos de producción siendo varias veces más rápido que otros alineadores. En la actualidad este software se ofrece gratuitamente, tiene una amplia comunidad de usuarios y ha sido citado en numerosas publicaciones científicas
On Polynomially Many Queries to NP or QMA Oracles
We study the complexity of problems solvable in deterministic polynomial time
with access to an NP or Quantum Merlin-Arthur (QMA)-oracle, such as
and , respectively. The former allows one to classify problems more
finely than the Polynomial-Time Hierarchy (PH), whereas the latter
characterizes physically motivated problems such as Approximate Simulation
(APX-SIM) [Ambainis, CCC 2014]. In this area, a central role has been played by
the classes and , defined identically to
and , except that only logarithmically many oracle queries are
allowed. Here, [Gottlob, FOCS 1993] showed that if the adaptive queries made by
a machine have a "query graph" which is a tree, then this computation
can be simulated in .
In this work, we first show that for any verification class
, any machine with a query
graph of "separator number" can be simulated using deterministic time
and queries to a -oracle. When (which
includes the case of -treewidth, and thus also of trees), this gives an
upper bound of , and when , this yields bound
(QP meaning quasi-polynomial time). We next show how to
combine Gottlob's "admissible-weighting function" framework with the
"flag-qubit" framework of [Watson, Bausch, Gharibian, 2020], obtaining a
unified approach for embedding computations directly into APX-SIM
instances in a black-box fashion. Finally, we formalize a simple no-go
statement about polynomials (c.f. [Krentel, STOC 1986]): Given a multi-linear
polynomial specified via an arithmetic circuit, if one can "weakly
compress" so that its optimal value requires bits to represent, then
can be decided with only queries to an NP-oracle.Comment: 46 pages pages, 5 figures, to appear in ITCS 202
Fine-Grained Complexity: Exploring Reductions and their Properties
Η σχεδίαση αλγορίθμων αποτελεί ένα απο τα κύρια θέματα ενδιαφέροντος για τον τομέα της Πληροφορικής. Παρά τα πολλά αποτελέσματα σε ορισμένους τομείς, η προσέγγιση αυτή έχει πετύχει κάποια πρακτικά αδιέξοδα που έχουν αποδειχτεί προβληματικά στην πρόοδο του τομέα. Επίσης, οι κλασικές πρακτικές Υπολογιστικής Πολυπλοκότητας δεν ήταν σε θέση να παρακάμψουν αυτά τα εμπόδια. Η κατανόηση της δυσκολίας του κάθε προβλήματος δεν είναι τετριμμένη. Η Ραφιναρισμένη Πολυπλοκότητα παρέχει νέες προ-οπτικές για τα κλασικά προβλήματα, με αποτέλεσμα σταθερούς δεσμούς μεταξύ γνωστών εικασιών στην πολυπλοκότητα και την σχεδίαση αλγορίθμων. Χρησιμεύει επίσης ως εργα-λείο για να αποδείξει τα υπο όρους κατώτατα όρια για προβλήματα πολυωνυμικής χρονικής πολυπλοκότητας, ένα πεδίο που έχει σημειώσει πολύ λίγη πρόοδο μέχρι τώρα. Οι δημοφι-λείς υποθέσεις/παραδοχές όπως το SETH, το OVH, το 3SUM, και το APSP, δίνουν πολλά φράγματα που δεν έχουν ακόμα αποδειχθεί με κλασικές τεχνικές και παρέχουν μια νέα κατανόηση της δομής και της εντροπίας των προβλημάτων γενικά. Σκοπός αυτής της εργασίας είναι να συμβάλει στην εδραίωση του πλαισίου για αναγωγές από κάθε εικασία και να διερευνήσει την διαρθρωτική διαφορά μεταξύ των προβλημάτων σε κάθε περίπτωση.Algorithmic design has been one of the main subjects of interest for Computer science. While very effective in some areas, this approach has been met with some practical dead ends that have been very problematic in the progress of the field. Classical Computational Complexity practices have also not been able to bypass these blocks. Understanding the hardness of each problem is not trivial. Fine-Grained Complexity provides new perspectives on classic problems, resulting to solid links between famous conjectures in Complexity, and Algorithmic design. It serves as a tool to prove conditional lower bounds for problems with polynomial time complexity, a field that had seen very little progress until now. Popular conjectures such as SETH, k-OV, 3SUM, and APSP, imply many bounds that have yet to be proven using classic techniques, and provide a new understanding of the structure and entropy of problems in general. The aim of this thesis is to contribute towards solidifying the framework for reductions from each conjecture, and to explore the structural difference between the problems in each cas
Expressive and Efficient Memory Representation for Bounded Model Checking of C programs
Ensuring memory safety in programs has been an important yet difficult topic of research. Most static analysis approaches rely on the theory of arrays to model memory
access. The limitation of the theory of arrays in terms of scalability and compatibility
with SAT/SMT solvers is well-known, and there has been many attempts at optimizing
either the theory itself or memory encodings based on theory of arrays.
In this thesis, we demonstrate that existing arrays-based memory encodings miss potential optimization opportunities by omitting language specific properties such as alignment and pointer arithmetic in C. We present SeaM, a new memory representation for C programs built around a more expressive First-order Theory: the Theory of Memory.
We show that by preserving more C language specific rules and properties, the Theory
of Memory allows for more thorough optimization methods during eager rewriting of sequences of stores. We introduce two such optimization methods in this thesis. First, we over-approximate pointer comparison with an abstract interpretation-like approach called AddressRangeMap. Second, we compress sequences of stores with Store-Map for faster
address offset look-ups.
The new memory representation is implemented in SeaBmc, a new BMC tool for
LLVM. We evaluate our approach on real-world bounded model checking tasks from the
aws-c-common library and Sv-Comp benchmarks and compare it against two existing
memory representations in SeaBmc. Our results show that SeaM outperforms the theory
of array based representation and is comparable with the λ based representation
Sublinear Computation Paradigm
This open access book gives an overview of cutting-edge work on a new paradigm called the “sublinear computation paradigm,” which was proposed in the large multiyear academic research project “Foundations of Innovative Algorithms for Big Data.” That project ran from October 2014 to March 2020, in Japan. To handle the unprecedented explosion of big data sets in research, industry, and other areas of society, there is an urgent need to develop novel methods and approaches for big data analysis. To meet this need, innovative changes in algorithm theory for big data are being pursued. For example, polynomial-time algorithms have thus far been regarded as “fast,” but if a quadratic-time algorithm is applied to a petabyte-scale or larger big data set, problems are encountered in terms of computational resources or running time. To deal with this critical computational and algorithmic bottleneck, linear, sublinear, and constant time algorithms are required. The sublinear computation paradigm is proposed here in order to support innovation in the big data era. A foundation of innovative algorithms has been created by developing computational procedures, data structures, and modelling techniques for big data. The project is organized into three teams that focus on sublinear algorithms, sublinear data structures, and sublinear modelling. The work has provided high-level academic research results of strong computational and algorithmic interest, which are presented in this book. The book consists of five parts: Part I, which consists of a single chapter on the concept of the sublinear computation paradigm; Parts II, III, and IV review results on sublinear algorithms, sublinear data structures, and sublinear modelling, respectively; Part V presents application results. The information presented here will inspire the researchers who work in the field of modern algorithms
Proceedings of the 22nd Conference on Formal Methods in Computer-Aided Design – FMCAD 2022
The Conference on Formal Methods in Computer-Aided Design (FMCAD) is an annual conference on the theory and applications of formal methods in hardware and system verification. FMCAD provides a leading forum to researchers in academia and industry for presenting and discussing groundbreaking methods, technologies, theoretical results, and tools for reasoning formally about computing systems. FMCAD covers formal aspects of computer-aided system design including verification, specification, synthesis, and testing
- …