
    Efficient approximate string matching techniques for sequence alignment

    One of the outstanding milestones achieved in recent years in the field of biotechnology research has been the development of high-throughput sequencing (HTS). Because it is currently technically impossible to decode a genome as a whole, HTS technologies read billions of relatively short chunks of a genome at random locations. Such reads then need to be located within a reference for the species being studied (that is, aligned or mapped to the genome): for each read, one identifies regions in the reference that share a large sequence similarity with it, thereby indicating what the read's point or points of origin may be. HTS technologies are able to re-sequence a human individual (i.e. to establish the differences between an individual genome and the reference genome for the human species) in a very short period of time. They have also paved the way for the development of a number of new protocols and methods, leading to novel insights in genomics and biology in general. However, HTS technologies also pose a challenge to traditional data analysis methods, due to the sheer amount of data to be processed and the need for improved alignment algorithms that can generate accurate results quickly. This thesis tackles the problem of sequence alignment as a step within the analysis of HTS data. Its contributions focus on both the methodological aspects and the algorithmic challenges of efficient, scalable, and accurate HTS mapping. From a methodological standpoint, this thesis strives to establish a comprehensive framework for assessing the quality of HTS mapping results. Doing so requires understanding the source and nature of mapping conflicts, and exploring the accuracy limits inherent in how sequence alignment is performed for current HTS technologies. From an algorithmic standpoint, this work introduces state-of-the-art index structures and approximate string matching algorithms, contributing novel insights that can be applied in practice towards efficient and accurate read mapping. In more detail: first, we present methods that reduce the storage space taken by indexes for genome-scale references while still providing fast query access, in order to support effective search algorithms. Second, we describe novel filtering techniques that vastly reduce the computational requirements of sequence mapping while nonetheless giving strict algorithmic guarantees on the completeness of the results. Finally, this thesis presents new incremental techniques that combine several approximate string matching algorithms, leading to efficient and flexible search algorithms that allow the user to reach arbitrary search depths. All algorithms and methodological contributions of this thesis have been implemented as components of a production aligner, the GEM-mapper, which is publicly available, widely used worldwide, and cited by a sizeable body of literature. It offers flexible and accurate sequence mapping while outperforming other HTS mappers both in running time and in the quality of the results it produces.
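
    The completeness guarantee of filtering techniques such as those described above typically rests on the pigeonhole principle: if a read is split into k+1 disjoint pieces, any occurrence with at most k errors must contain at least one piece exactly, so exact seed look-ups followed by local verification lose no true matches. A minimal Python sketch of this idea follows (illustrative only; the function names and the naive dictionary index are assumptions for this sketch, not the GEM-mapper's actual algorithms or data structures):

        def matches_prefix(pattern: str, text: str, k: int) -> bool:
            """True if pattern matches some prefix of text with <= k edits
            (start anchored, end free; classic dynamic programming)."""
            n = min(len(text), len(pattern) + k)   # longer prefixes cannot help
            prev = list(range(n + 1))              # row for the empty pattern
            for i in range(1, len(pattern) + 1):
                cur = [i] + [0] * n
                for j in range(1, n + 1):
                    cur[j] = min(prev[j] + 1,       # delete a pattern char
                                 cur[j - 1] + 1,    # insert a text char
                                 prev[j - 1] + (pattern[i - 1] != text[j - 1]))
                prev = cur
            return min(prev) <= k

        def map_read(read: str, reference: str, k: int) -> list:
            """Report start positions where read matches reference with <= k edits."""
            piece_len = len(read) // (k + 1)        # assumes len(read) >= k + 1
            index = {}                              # naive exact-seed index
            for pos in range(len(reference) - piece_len + 1):
                index.setdefault(reference[pos:pos + piece_len], []).append(pos)
            hits = set()
            for p in range(k + 1):                  # pigeonhole: one piece is exact
                piece = read[p * piece_len:(p + 1) * piece_len]
                for pos in index.get(piece, []):
                    anchor = pos - p * piece_len    # implied read start in reference
                    for s in range(max(0, anchor - k), anchor + k + 1):
                        if matches_prefix(read, reference[s:], k):
                            hits.add(s)
            return sorted(hits)

        # Example: "ACGT" matches "TTACTTGG" at position 2 with one substitution.
        assert map_read("ACGT", "TTACTTGG", 1) == [2]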

    On Polynomially Many Queries to NP or QMA Oracles

    We study the complexity of problems solvable in deterministic polynomial time with access to an NP or Quantum Merlin-Arthur (QMA) oracle, such as $P^{NP}$ and $P^{QMA}$, respectively. The former allows one to classify problems more finely than the Polynomial-Time Hierarchy (PH), whereas the latter characterizes physically motivated problems such as Approximate Simulation (APX-SIM) [Ambainis, CCC 2014]. In this area, a central role has been played by the classes $P^{NP[\log]}$ and $P^{QMA[\log]}$, defined identically to $P^{NP}$ and $P^{QMA}$, except that only logarithmically many oracle queries are allowed. Here, [Gottlob, FOCS 1993] showed that if the adaptive queries made by a $P^{NP}$ machine have a "query graph" which is a tree, then this computation can be simulated in $P^{NP[\log]}$. In this work, we first show that for any verification class $C \in \{NP, MA, QCMA, QMA, QMA(2), NEXP, QMA_{\exp}\}$, any $P^{C}$ machine with a query graph of "separator number" $s$ can be simulated using deterministic time $\exp(s \log n)$ and $s \log n$ queries to a $C$-oracle. When $s \in O(1)$ (which includes the case of $O(1)$-treewidth, and thus also of trees), this gives an upper bound of $P^{C[\log]}$, and when $s \in O(\log^k(n))$, this yields the bound $QP^{C[\log^{k+1}]}$ (QP meaning quasi-polynomial time). We next show how to combine Gottlob's "admissible-weighting function" framework with the "flag-qubit" framework of [Watson, Bausch, Gharibian, 2020], obtaining a unified approach for embedding $P^{C}$ computations directly into APX-SIM instances in a black-box fashion. Finally, we formalize a simple no-go statement about polynomials (cf. [Krentel, STOC 1986]): given a multi-linear polynomial $p$ specified via an arithmetic circuit, if one can "weakly compress" $p$ so that its optimal value requires $m$ bits to represent, then $P^{NP}$ can be decided with only $m$ queries to an NP-oracle. Comment: 46 pages, 5 figures, to appear in ITCS 2022
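
    Restating the abstract's first result in display form (a paraphrase; the notation $P^{C}_{\mathrm{sep}(s)}$ for a $P^C$ machine whose query graph has separator number $s$ is introduced here for brevity and is not the paper's):

        \[
          P^{C}_{\mathrm{sep}(s)} \;\subseteq\; \mathrm{TIME}\big(\exp(s \log n)\big)
          \quad \text{using } O(s \log n) \text{ queries to a } C\text{-oracle},
        \]
        \[
          s \in O(1) \;\Longrightarrow\; P^{C}_{\mathrm{sep}(s)} \subseteq P^{C[\log]},
          \qquad
          s \in O(\log^{k} n) \;\Longrightarrow\; P^{C}_{\mathrm{sep}(s)} \subseteq QP^{C[\log^{k+1}]}.
        \]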

    Fine-Grained Complexity: Exploring Reductions and their Properties

    Algorithmic design has been one of the main subjects of interest in Computer Science. While very effective in some areas, this approach has met practical dead ends that have proven problematic for the progress of the field, and classical Computational Complexity practices have not been able to bypass these blocks; understanding the hardness of each problem is not trivial. Fine-Grained Complexity provides new perspectives on classic problems, resulting in solid links between famous conjectures in Complexity and algorithmic design. It serves as a tool to prove conditional lower bounds for problems with polynomial time complexity, a field that had seen very little progress until now. Popular conjectures such as SETH, k-OV, 3SUM, and APSP imply many bounds that have yet to be proven using classic techniques, and provide a new understanding of the structure and entropy of problems in general. The aim of this thesis is to contribute towards solidifying the framework for reductions from each conjecture, and to explore the structural difference between the problems in each case.
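
    As a concrete instance of these conjectures, the Orthogonal Vectors (OV) problem has an obvious brute-force quadratic algorithm, and the OV hypothesis asserts that no O(n^(2-eps)) algorithm exists; since SETH implies the OV hypothesis, fine-grained reductions from OV transfer conditional quadratic lower bounds to many polynomial-time problems. A minimal sketch of the baseline algorithm (illustrative, not from the thesis):

        def has_orthogonal_pair(A, B):
            """Brute-force Orthogonal Vectors: given two lists of n Boolean
            vectors of dimension d, decide in O(n^2 * d) time whether some
            a in A and b in B satisfy <a, b> = 0. The OV hypothesis asserts
            that no O(n^(2-eps)) algorithm exists for any eps > 0."""
            return any(all(x * y == 0 for x, y in zip(a, b))  # <a, b> = 0
                       for a in A for b in B)

        # Example: (1,0,1) and (0,1,0) share no coordinate where both are 1.
        assert has_orthogonal_pair([(1, 0, 1)], [(0, 1, 0), (1, 1, 0)])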

    Symbolic Execution for Advanced Program Reasoning

    Ph.D. thesis (Doctor of Philosophy)

    Expressive and Efficient Memory Representation for Bounded Model Checking of C programs

    Ensuring memory safety in programs has been an important yet difficult topic of research. Most static analysis approaches rely on the theory of arrays to model memory access. The limitations of the theory of arrays in terms of scalability and compatibility with SAT/SMT solvers are well known, and there have been many attempts at optimizing either the theory itself or memory encodings based on it. In this thesis, we demonstrate that existing array-based memory encodings miss potential optimization opportunities by omitting language-specific properties such as alignment and pointer arithmetic in C. We present SeaM, a new memory representation for C programs built around a more expressive first-order theory: the Theory of Memory. We show that, by preserving more C-specific rules and properties, the Theory of Memory allows for more thorough optimization during eager rewriting of sequences of stores. We introduce two such optimization methods in this thesis. First, we over-approximate pointer comparison with an abstract-interpretation-like approach called AddressRangeMap. Second, we compress sequences of stores with Store-Map for faster address-offset look-ups. The new memory representation is implemented in SeaBmc, a new BMC tool for LLVM. We evaluate our approach on real-world bounded model checking tasks from the aws-c-common library and SV-COMP benchmarks, and compare it against two existing memory representations in SeaBmc. Our results show that SeaM outperforms the theory-of-arrays-based representation and is comparable with the λ-based representation.
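
    For context, the classic theory-of-arrays encoding mentioned above resolves reads through writes via the read-over-write axiom, and eager rewriting applies it repeatedly along a chain of stores. The following Python sketch illustrates that axiom with concrete indices only (an illustration of the standard theory, not of SeaM's Theory of Memory; with symbolic indices each miss instead yields an if-then-else term, which is what makes long store sequences expensive):

        from dataclasses import dataclass
        from typing import Union

        @dataclass
        class Base:
            name: str               # an uninterpreted base array

        @dataclass
        class Store:
            array: "Array"          # the array being updated
            index: int              # concrete index, to keep the sketch simple
            value: int

        Array = Union[Base, Store]

        def select(a: Array, j: int):
            """Eagerly rewrite select(a, j) down a chain of stores:
            select(store(a, i, v), j) = v if i == j, else select(a, j)."""
            while isinstance(a, Store):
                if a.index == j:             # hit: this store wrote cell j
                    return a.value
                a = a.array                  # miss: look further down the chain
            return f"select({a.name}, {j})"  # symbolic read from the base array

        # Example: a two-store chain modelling two memory writes.
        mem = Store(Store(Base("m0"), 8, 42), 16, 7)
        assert select(mem, 16) == 7
        assert select(mem, 8) == 42
        assert select(mem, 0) == "select(m0, 0)"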

    Sublinear Computation Paradigm

    This open access book gives an overview of cutting-edge work on a new paradigm called the “sublinear computation paradigm,” which was proposed in the large multiyear academic research project “Foundations of Innovative Algorithms for Big Data,” run from October 2014 to March 2020 in Japan. To handle the unprecedented explosion of big data sets in research, industry, and other areas of society, there is an urgent need for novel methods and approaches for big data analysis. To meet this need, innovative changes in algorithm theory for big data are being pursued. For example, polynomial-time algorithms have thus far been regarded as “fast,” but if a quadratic-time algorithm is applied to a petabyte-scale or larger big data set, problems are encountered in terms of computational resources or running time. To deal with this critical computational and algorithmic bottleneck, linear-, sublinear-, and constant-time algorithms are required. The sublinear computation paradigm is proposed here in order to support innovation in the big data era. A foundation of innovative algorithms has been created by developing computational procedures, data structures, and modelling techniques for big data. The project is organized into three teams focusing on sublinear algorithms, sublinear data structures, and sublinear modelling. The work has provided high-level academic research results of strong computational and algorithmic interest, which are presented in this book. The book consists of five parts: Part I is a single chapter on the concept of the sublinear computation paradigm; Parts II, III, and IV review results on sublinear algorithms, sublinear data structures, and sublinear modelling, respectively; Part V presents application results. The information presented here will inspire researchers who work in the field of modern algorithms.
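
    As a textbook illustration of the paradigm (a standard sampling example, not one taken from the book), the following sketch estimates a global statistic of a data set with a sample size that is independent of the input length:

        import math
        import random

        def estimate_fraction(data, eps=0.05, delta=0.01):
            """Estimate the fraction of 1s in data by random sampling.
            By Hoeffding's inequality, m = ceil(ln(2/delta) / (2*eps^2))
            samples suffice for |estimate - truth| <= eps with probability
            >= 1 - delta -- a sample size independent of len(data)."""
            m = math.ceil(math.log(2 / delta) / (2 * eps ** 2))
            hits = sum(data[random.randrange(len(data))] for _ in range(m))
            return hits / m

        # Example: a million-element 0/1 list with about 30% ones is
        # estimated from roughly 1,060 samples, not 10^6 reads.
        big = [1 if random.random() < 0.3 else 0 for _ in range(10 ** 6)]
        print(estimate_fraction(big))   # about 0.30 +/- 0.05 w.p. >= 0.99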

    Proceedings of the 22nd Conference on Formal Methods in Computer-Aided Design – FMCAD 2022

    The Conference on Formal Methods in Computer-Aided Design (FMCAD) is an annual conference on the theory and applications of formal methods in hardware and system verification. FMCAD provides a leading forum for researchers in academia and industry to present and discuss groundbreaking methods, technologies, theoretical results, and tools for reasoning formally about computing systems. FMCAD covers formal aspects of computer-aided system design, including verification, specification, synthesis, and testing.