
    Indices and Applications in High-Throughput Sequencing

    Recent advances in sequencing technology make it possible to produce billions of base pairs per day in the form of reads of length 100 bp and longer, and current developments promise the personal $1,000 genome within a few years. The analysis of these unprecedented amounts of data demands efficient data structures and algorithms. One such data structure is the substring index, which represents all substrings, or all substrings up to a certain length, contained in a given text. In this thesis we propose three substring indices, which we extend to be applicable to millions of sequences. We devise internal and external memory construction algorithms and a uniform framework for accessing the generalized suffix tree. Additionally, we propose several index-based applications, e.g. exact and approximate pattern matching and several repeat search algorithms. Second, we present the read mapping tool RazerS, which aligns millions of single or paired-end reads of arbitrary lengths to their potential genomic origin using either Hamming or edit distance. Our tool can work either losslessly or, at higher speed, with a user-defined loss rate. Given the loss rate, we present a novel approach that guarantees not to lose more reads than specified. This enables the user to adapt to the problem at hand and provides a seamless tradeoff between sensitivity and running time. We compare RazerS with other state-of-the-art read mappers and show that it has the highest sensitivity and comparable performance on various real-world datasets. Finally, we propose a general approach to frequency-based string mining, which has many applications, e.g. in contrast data mining. Our contribution is a novel and lightweight algorithm that is faster and uses less memory than the best available algorithms. We show its applicability for mining multiple databases with a variety of frequency constraints. In particular, we use the notion of entropy from information theory to generalize the emerging substring mining problem to multiple databases. To demonstrate the improvement of our algorithm, we compare it to recent approaches in real-world experiments on various string domains, e.g. natural language, DNA, and protein sequences.
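    The substring index idea at the heart of this thesis can be illustrated with a minimal suffix array in Python (a sketch only; the indices proposed in the thesis are generalized to millions of sequences and constructed far more efficiently): sorting all suffix start positions lets exact pattern matching proceed by binary search.

        # Minimal suffix-array sketch: not the thesis's construction algorithm,
        # just an illustration of index-based exact pattern matching.
        def build_suffix_array(text):
            # Naive O(n^2 log n) construction; practical indices build this in O(n).
            return sorted(range(len(text)), key=lambda i: text[i:])

        def find_occurrences(text, sa, pattern):
            # Binary search for the leftmost suffix starting with the pattern.
            lo, hi = 0, len(sa)
            while lo < hi:
                mid = (lo + hi) // 2
                if text[sa[mid]:sa[mid] + len(pattern)] < pattern:
                    lo = mid + 1
                else:
                    hi = mid
            # Collect the contiguous run of suffixes that begin with the pattern.
            hits = []
            while lo < len(sa) and text[sa[lo]:sa[lo] + len(pattern)] == pattern:
                hits.append(sa[lo])
                lo += 1
            return sorted(hits)

        genome = "ACGTACGTGACG"
        sa = build_suffix_array(genome)
        print(find_occurrences(genome, sa, "ACG"))  # [0, 4, 9]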

    Towards a table top quantum computer

    Thesis (S.M.)--Massachusetts Institute of Technology, School of Architecture and Planning, Program in Media Arts and Sciences, 1999. Includes bibliographical references (leaves 135-139). In the early 1990s, quantum computing proved to be an enticing theoretical possibility but an extremely difficult experimental challenge. Two advances have made experimental quantum computing demonstrable: quantum error correction, and bulk, thermal quantum computing using nuclear magnetic resonance (NMR). Simple algorithms have been implemented on large, commercial NMR spectrometers that are expensive and cumbersome. The goal of this project is to construct a table-top quantum computer that can match and eventually exceed the performance of commercial machines. This computer should be an inexpensive, easy-to-use machine that can be considered more a computer than its "supercomputer" counterparts. For this thesis, the goal is to develop a low-cost, table-top quantum computer that is capable of implementing the simple quantum algorithms demonstrated thus far in the community, but is also amenable to the many scaling issues of practical quantum computing. Understanding these scaling issues requires developing a theoretical understanding of the signal enhancement techniques and fundamental noise sources of this powerful but delicate system. Complementary to quantum computing, this high-performance but low-cost NMR machine will be useful for a number of medical, low-cost sensing, and tagging applications due to the unique properties of NMR: the ability to sense and manipulate the information content of materials on macroscopic and microscopic scales. Yael G. Maguire. S.M.

    LASSO – an observatorium for the dynamic selection, analysis and comparison of software

    Mining software repositories at the scale of 'big code' (i.e., big data) is a challenging activity. As well as finding a suitable software corpus and making it programmatically accessible through an index or database, researchers and practitioners have to establish an efficient analysis infrastructure and precisely define the metrics and data extraction approaches to be applied. Moreover, for analysis results to be generalisable, these tasks have to be applied at a large enough scale to have statistical significance, and if they are to be repeatable, the artefacts need to be carefully maintained and curated over time. Today, however, a lot of this work is still performed by human beings on a case-by-case basis, with the level of effort involved often having a significant negative impact on the generalisability and repeatability of studies, and thus on their overall scientific value. The general-purpose 'code mining' repositories and infrastructures that have emerged in recent years represent a significant step forward because they automate many software mining tasks at an ultra-large scale and allow researchers and practitioners to focus on defining the questions they would like to explore at an abstract level. However, they are currently limited to static analysis and data extraction techniques, and thus cannot support (i.e., help automate) any studies which involve the execution of software systems. This includes experimental validations of techniques and tools that hypothesise about the behaviour (i.e., semantics) of software, or data analysis and extraction techniques that aim to measure dynamic properties of software. In this thesis a platform called LASSO (Large-Scale Software Observatorium) is introduced that overcomes this limitation by automating the collection of dynamic (i.e., execution-based) information about software alongside static information. It features a single, ultra-large-scale corpus of executable software systems, created by amalgamating existing Open Source software repositories, and a dedicated DSL for defining abstract selection and analysis pipelines. Its key innovations are integrated capabilities for searching for and selecting software systems based on their exhibited behaviour, and an 'arena' that allows their responses to software tests to be compared in a purely data-driven way. We call the platform a 'software observatorium' since it is a place where the behaviour of large numbers of software systems can be observed, analysed and compared.
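    The arena concept of comparing software purely by observed behaviour can be sketched in a few lines of Python (a hypothetical illustration only, not LASSO's DSL or API): execute every candidate implementation on the same test inputs and cluster the candidates by their output signatures.

        # Hypothetical arena-style comparison: candidates are grouped purely by
        # the outputs (or exceptions) they produce on shared test inputs.
        from collections import defaultdict

        def arena(candidates, test_inputs):
            clusters = defaultdict(list)
            for name, func in candidates.items():
                signature = []
                for args in test_inputs:
                    try:
                        signature.append(repr(func(*args)))
                    except Exception as exc:  # a crash is observable behaviour too
                        signature.append(type(exc).__name__)
                clusters[tuple(signature)].append(name)
            return dict(clusters)

        # Three retrieved "implementations"; two behave identically on the tests.
        candidates = {
            "impl_a": lambda xs: sorted(xs),
            "impl_b": lambda xs: sorted(xs, reverse=True),
            "impl_c": lambda xs: list(sorted(xs)),
        }
        tests = [([3, 1, 2],), ([],)]
        for outputs, names in arena(candidates, tests).items():
            print(names, "->", outputs)  # impl_a and impl_c cluster together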

    Multiparty session types for dynamic verification of distributed systems

    In large-scale distributed systems, each application is realised through interactions among distributed components. To guarantee safe communication (no deadlocks and no communication mismatches), we need programming languages and tools that structure, manage, and policy-check these interactions. Multiparty session types (MPST), a typing discipline for structured interactions between communicating processes, offer a promising approach. To date, however, session type applications have been limited to static verification, which is not always feasible and is often restrictive in terms of programming API and specifying policies. This thesis investigates the design and implementation of a runtime verification framework, ensuring conformance between programs and specifications. Specifications are written in Scribble, a protocol description language formally founded on MPST. The central idea of the approach is a dynamic monitor, which takes the form of a communicating finite state machine, automatically generated from Scribble specifications, and a communication runtime stipulating a message format. We extend and apply Scribble-based runtime verification in manifold ways. First, we implement a Python library equipped with session primitives and a verification runtime. We integrate the library into a large cyber-infrastructure project for oceanography. Second, we examine multiple communication patterns, which reveal and motivate two novel extensions: asynchronous interrupts for the verification of exception handling behaviours, and time constraints for the enforcement of real-time protocols. Third, we apply the verification framework to actor programming by augmenting an actor library in Python with protocol annotations. For both implementations, measurements show that Scribble-based dynamic checking incurs minimal overhead and allows expressive specifications. Finally, we explore a static analysis of Scribble specifications to efficiently compute a safe global state from which a monitored system of interacting processes can be recovered after a failure. We provide an implementation of a verification framework for recovery in Erlang. Benchmarks show that our recovery strategy outperforms Erlang's built-in static recovery strategy on a number of use cases.
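    The dynamic monitor at the centre of this approach can be illustrated with a small finite state machine in Python (an illustrative sketch with assumed message labels, not code generated by the Scribble toolchain): each message either follows a protocol transition or is reported as a violation.

        # Sketch of an MPST-style dynamic monitor: a finite state machine
        # derived from a protocol checks each observed message label.
        class Monitor:
            def __init__(self, transitions, start):
                self.transitions = transitions  # (state, label) -> next state
                self.state = start

            def check(self, label):
                key = (self.state, label)
                if key not in self.transitions:
                    raise RuntimeError(
                        f"protocol violation: {label!r} in state {self.state!r}")
                self.state = self.transitions[key]

        # Hypothetical protocol: a request, zero or more data messages, then done.
        fsm = {
            ("start", "request"): "serving",
            ("serving", "data"): "serving",
            ("serving", "done"): "end",
        }
        m = Monitor(fsm, "start")
        for msg in ["request", "data", "data", "done"]:
            m.check(msg)  # this trace conforms to the protocol
        print("final state:", m.state)  # final state: end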

    The ciao prolog system

    Ciao is a public domain, next-generation multi-paradigm programming environment with a unique set of features: Ciao offers a complete Prolog system, supporting ISO-Prolog, but its novel modular design allows both restricting and extending the language. As a result, it allows working with fully declarative subsets of Prolog and also extending these subsets (or ISO-Prolog) both syntactically and semantically. Most importantly, these restrictions and extensions can be activated separately on each program module, so that several extensions can coexist in the same application for different modules. Ciao also supports (through such extensions) programming with functions, higher-order constructs (with predicate abstractions), constraints, and objects, as well as feature terms (records), persistence, several control rules (breadth-first search, iterative deepening, ...), concurrency (threads/engines), a good base for distributed execution (agents), and parallel execution. Libraries also support WWW programming, sockets, external interfaces (C, Java, Tcl/Tk, relational databases, etc.), and more. Ciao offers support for programming in the large with a robust module/object system, module-based separate/incremental compilation (automatic, with no need for makefiles), an assertion language for declaring (optional) program properties (including types and modes, but also determinacy, non-failure, cost, etc.), automatic static inference and static/dynamic checking of such assertions, etc. Ciao also offers support for programming in the small, producing small executables (including only those builtins used by the program), and support for writing scripts in Prolog. The Ciao programming environment includes a classical top level and a rich Emacs interface with an embeddable source-level debugger and a number of execution visualization tools. The Ciao compiler (which can be run outside the top-level shell) generates several forms of architecture-independent and stand-alone executables, which run with a speed, efficiency, and executable size that are very competitive with other commercial and academic Prolog/CLP systems. Library modules can be compiled into compact bytecode or C source files, and linked statically, dynamically, or autoloaded. The novel modular design of Ciao enables, in addition to modular program development, effective global program analysis and static debugging and optimization via source-to-source program transformation. These tasks are performed by the Ciao preprocessor (ciaopp, distributed separately). The Ciao programming environment also includes lpdoc, an automatic documentation generator for LP/CLP programs. It processes Prolog files adorned with (Ciao) assertions and machine-readable comments and generates manuals in many formats, including PostScript, PDF, texinfo, info, HTML, and man pages, as well as on-line help, ASCII README files, and entries for indices of manuals (info, WWW, ...), and it maintains WWW distribution sites.
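    As a loose analogy for the dynamic side of Ciao's assertion checking (written in Python rather than Ciao's actual assertion syntax, which is not reproduced here), optional pre- and postconditions can be declared separately from a procedure and verified at call time:

        # Loose Python analogy, not Ciao syntax: declared pre/postconditions
        # checked dynamically, in the spirit of Ciao's assertion checking.
        def assertion(pre, post):
            def wrap(func):
                def checked(*args):
                    assert pre(*args), f"precondition of {func.__name__} failed"
                    result = func(*args)
                    assert post(result), f"postcondition of {func.__name__} failed"
                    return result
                return checked
            return wrap

        @assertion(pre=lambda xs: all(isinstance(x, int) for x in xs),
                   post=lambda ys: ys == sorted(ys))
        def int_sort(xs):
            return sorted(xs)

        print(int_sort([3, 1, 2]))  # both checks pass: [1, 2, 3]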

    LIPIcs, Volume 244, ESA 2022, Complete Volume

    LIPIcs, Volume 244, ESA 2022, Complete Volume