
    Parallel gene upstream comparison via multi-level hash tables on GPU

    The region of DNA immediately in front of a gene body (also called the upstream region) contains short (8-20 base) sequence motifs that help control when that gene is turned on and off. Unfortunately, these motifs are generally unknown and commonly degenerate. In this work, a motif-finding framework is proposed that, given a set of gene upstream regions, performs all-to-all pairwise comparison and identifies all motifs of length k (k-mers) that are shared between any pair of upstream regions, either exactly or with at most d mismatches. The framework stores the k-mers found in each gene in a multi-level hash table. The hash table design optimizes hash table comparison (rather than hash table insertion or lookup), is highly parallelizable, and maps easily onto GPUs. Four GPU kernels are proposed for pairwise hash table comparison, each leveraging a distinct parallelization approach. A study of the factors that affect the performance of the implementations is included: the hash function, the number of buckets, and additional implementation-specific parameters. Experiments on an average-size yeast genome show that our fastest GPU kernel outperforms an 8-thread, cache-efficient CPU implementation by a factor of approximately 52x.
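
    A minimal CPU-side sketch of the idea described above, not the paper's GPU kernels: each gene's upstream k-mers go into a bucketed (multi-level) hash table, and two genes are compared bucket by bucket so only k-mers hashing to the same bucket are checked against each other. The function names, the built-in hash, and the parameter values are illustrative assumptions; mismatch (d > 0) handling is omitted.

    from collections import defaultdict

    def kmers(seq, k=8):
        """All k-mers of a sequence (exact matching only in this sketch)."""
        return (seq[i:i + k] for i in range(len(seq) - k + 1))

    def build_table(seq, k=8, num_buckets=64):
        """Bucketed hash table for one gene: bucket id -> set of k-mers."""
        table = defaultdict(set)
        for mer in kmers(seq, k):
            table[hash(mer) % num_buckets].add(mer)
        return table

    def count_shared_kmers(table_a, table_b):
        """Compare two tables bucket by bucket; only same-bucket k-mers are intersected."""
        shared = 0
        for bucket, mers_a in table_a.items():
            mers_b = table_b.get(bucket)
            if mers_b:
                shared += len(mers_a & mers_b)
        return shared

    # Toy all-to-all comparison over a few upstream regions.
    upstreams = {"geneA": "ACGTACGTGGTTACGTCCAA", "geneB": "TTACGTACGTCCAGGACGTA"}
    tables = {g: build_table(s) for g, s in upstreams.items()}
    names = list(tables)
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            print(names[i], names[j], count_shared_kmers(tables[names[i]], tables[names[j]]))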

    Homology sequence analysis using GPU acceleration

    A number of problems in the bioinformatics, systems biology, and computational biology fields require abstracting physical entities to mathematical or computational models. In such studies, the computational paradigms often involve algorithms executed on the Central Processing Unit (CPU). Historically, those algorithms benefited from advancements in the serial processing capabilities of individual CPU cores. However, that growth has slowed in recent years, as scaling out CPUs has proven both cost-prohibitive and insecure. To overcome this problem, parallel computing approaches that employ the Graphics Processing Unit (GPU) have gained attention as complements to or replacements for traditional CPU approaches. The premise of this research is to investigate the applicability of various parallel computing platforms to several problems in the detection and analysis of homology in biological sequences. I hypothesize that by exploiting sheer amounts of computational power and sequencing data, it is possible to deduce information from raw sequences without supplying prior knowledge of the underlying biology. I have developed tools to perform analyses at scales that are traditionally unattainable with general-purpose CPU platforms. I have developed a method to accelerate sequence alignment on the GPU, and I used it to investigate whether the Operational Taxonomic Unit (OTU) classification problem can be improved with such computational power. I have developed a method to accelerate pairwise k-mer comparison on the GPU, and I used it to further develop PolyHomology, a framework that scaffolds shared sequence motifs across large numbers of genomes to illuminate the structure of the regulatory network in yeasts. The results suggest that such an approach to heterogeneous computing can help answer questions in biology and is a viable path to new discoveries in the present and the future. Includes bibliographical reference

    Hardening High-Assurance Security Systems with Trusted Computing

    We are living in the time of the digital revolution in which the world we know changes beyond recognition every decade. The positive aspect is that these changes also drive progress in the quality and availability of digital assets crucial to our societies. To name a few examples, these are broadly available communication channels allowing quick exchange of knowledge over long distances, systems controlling the automatic sharing and distribution of renewable energy across international power grids, easily accessible applications for early disease detection enabling self-examination without burdening the health service, and governmental systems that help citizens settle official matters without leaving their homes. Unfortunately, digitalization also opens opportunities for malicious actors to threaten our societies if they gain control over these assets after successfully exploiting vulnerabilities in the complex computing systems on which they are built. Protecting these systems, which are called high-assurance security systems, is therefore of utmost importance. For decades, humanity has struggled to find methods to protect high-assurance security systems. Advancements in the computing systems security domain have led to the popularization of hardware-assisted security techniques, nowadays available in commodity computers, that open perspectives for building more sophisticated defense mechanisms at lower cost. However, none of these techniques is a silver bullet. Each one targets particular use cases, suffers from limitations, and is vulnerable to specific attacks. I argue that some of these techniques are synergistic and, when used together, help overcome limitations and mitigate specific attacks. My reasoning is supported by regulations that legally bind the owners of high-assurance security systems to provide strong security guarantees. These requirements can be fulfilled with the help of diverse technologies that have been standardized in recent years. In this thesis, I introduce new techniques for hardening high-assurance security systems that execute in remote execution environments, such as public and hybrid clouds. I implemented these techniques as part of a framework that provides technical assurance that high-assurance security systems execute in a specific data center, on top of a trustworthy operating system, in a virtual machine controlled by a trustworthy hypervisor, or in strong isolation from other software. I demonstrated the practicality of my approach by leveraging the framework to harden real-world applications, such as machine learning applications in the eHealth domain. The evaluation shows that the framework is practical. It induces low performance overhead (<6%), supports software updates, requires no changes to the legacy application's source code, and can be tailored to individual trust boundaries with the help of security policies. The framework consists of a decentralized monitoring system that offers better scalability than traditional centralized monitoring systems. Each monitored machine runs a piece of code that verifies that the machine's integrity and geolocation conform to the given security policy.
    This piece of code, which serves as a trusted anchor on that machine, executes inside a trusted execution environment, i.e., Intel SGX, to protect itself from the untrusted host, and uses trusted computing techniques, such as the trusted platform module, secure boot, and the integrity measurement architecture, to attest to the load-time and runtime integrity of the surrounding operating system running on a bare-metal machine or inside a virtual machine. The trusted anchor implements my novel, formally proven protocol, which enables detection of the TPM cuckoo attack. The framework also implements a key distribution protocol that, depending on the individual security requirements, shares cryptographic keys only with high-assurance security systems executing in predefined security settings, i.e., inside trusted execution environments or inside an integrity-enforced operating system. Such an approach is particularly appealing in the context of machine learning systems, where some algorithms, like machine learning model training, require temporary access to large amounts of computing power. These algorithms can execute inside a dedicated, trusted data center at higher performance because they are not limited by the security features required in a shared execution environment. The evaluation of the framework showed that training a machine learning model on real-world datasets achieved 0.96x of native execution performance on the GPU and a speedup of up to 1560x compared to the state-of-the-art SGX-based system. Finally, I tackled the problem of software updates, which make the operating system's integrity monitoring unreliable due to false positives, i.e., software updates move the updated system to an unknown (untrusted) state that is reported as an integrity violation. I solved this problem by introducing a proxy to a software repository that sanitizes software packages so that they can be safely installed. The sanitization consists of predicting and certifying the operating system's future state (after the specific updates are installed). The evaluation of this approach showed that it supports 99.76% of the packages available in the Alpine Linux main and community repositories. The framework proposed in this thesis is a step forward in verifying and enforcing that high-assurance security systems execute in an environment compliant with regulations. I anticipate that the framework might be further integrated with industry-standard security information and event management tools, as well as other security monitoring mechanisms, to provide a comprehensive solution for hardening high-assurance security systems.
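
    As a rough illustration of the monitoring idea, the sketch below checks a machine's reported integrity measurements and geolocation against a simple security policy. It is not the framework's actual agent or attestation protocol; the report fields, policy format, and verify_report function are assumptions made for this example.

    from dataclasses import dataclass

    @dataclass
    class Report:
        """Attestation report as seen by the monitor (fields are illustrative)."""
        measurements: dict   # component name -> reported hash digest
        country: str         # claimed data-center location

    # Example policy: expected digests per component plus permitted locations.
    POLICY = {
        "expected": {"kernel": "ab12...", "initrd": "cd34...", "app": "ef56..."},
        "allowed_countries": {"DE", "FR"},
    }

    def verify_report(report, policy):
        """Return a list of policy violations; an empty list means the machine is trusted."""
        violations = []
        for component, digest in policy["expected"].items():
            actual = report.measurements.get(component)
            if actual != digest:
                violations.append(f"integrity mismatch for {component}: {actual}")
        if report.country not in policy["allowed_countries"]:
            violations.append(f"geolocation {report.country} not permitted")
        return violations

    # A key-distribution service would release secrets only if verify_report(...) is empty.
    print(verify_report(Report({"kernel": "ab12...", "initrd": "zz99...", "app": "ef56..."}, "US"), POLICY))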

    Associative Pattern Recognition for Biological Regulation Data

    In the last decade, bioinformatics data has been accumulated at an unprecedented rate, thanks to advances in sequencing technologies. Such rapid development poses both challenges and promising research topics. In this dissertation, we propose a series of associative pattern recognition algorithms for biological regulation studies. In particular, we emphasize efficiently recognizing associative patterns between genes, transcription factors, histone modifications, and functional labels using heterogeneous data sources (numeric data, sequences, time series, and textual labels). In protein-DNA associative pattern recognition, we introduce an efficient algorithm for the affinity test that searches for over-represented DNA sequences using a hash function and modulo addition, substantially improving the efficiency of next-generation sequencing data analysis. In gene regulatory network inference, we propose a framework for refining weak networks based on transcription factor binding sites, improving the precision of predicted edges by up to 52%. In histone modification code analysis, we propose an approach to genome-wide combinatorial pattern recognition for histone-code-to-function associative patterns and achieve an improvement of up to 38.1%. We also propose a novel shape-based modification pattern analysis approach and use it to successfully predict sub-classes of genes in the flowering-time category. We further propose a combination-to-combination associative pattern recognition method and achieve better performance than multi-label classification and bidirectional associative memory methods. Our proposed approaches recognize associative patterns from different types of data efficiently and provide a useful toolbox for biological regulation analysis. This dissertation presents a road map to associative pattern recognition at the genome-wide level.
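
    A hedged sketch of the over-representation idea named above: k-mers are encoded with a 2-bit-per-base rolling code and counted in a fixed-size table addressed by the code modulo the table size. The encoding, table size, and fold-change test are illustrative assumptions, not the dissertation's exact affinity-test algorithm.

    BASE = {"A": 0, "C": 1, "G": 2, "T": 3}

    def count_kmers(seq, k=8, table_size=4096):
        """Count k-mers using a rolling 2-bit encoding and modulo addressing."""
        counts = [0] * table_size
        code, mask = 0, (1 << (2 * k)) - 1
        for i, base in enumerate(seq):
            code = ((code << 2) | BASE[base]) & mask   # code of the last k bases
            if i >= k - 1:
                counts[code % table_size] += 1         # modulo addressing into the table
        return counts

    def over_represented(bound, background, fold=2.0):
        """Table slots whose count in bound regions is at least `fold` times the background."""
        return [i for i, (b, g) in enumerate(zip(bound, background)) if g and b / g >= fold]

    # Toy usage: compare protein-bound sequences against background sequences.
    bound_counts = count_kmers("ACGTACGTAACGTTACGTAC")
    background_counts = count_kmers("TTTTGGGGCCCCAAAATTTT")
    print(over_represented(bound_counts, background_counts))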

    GENETIC ALGORITHM OPTIMIZED PACKET FILTERING

    In our research, we present a method to optimize packet filtering using a genetic algorithm. Packet filtering in our work consists of packet capturing and firewall rule reordering. The genetic algorithm is used to automate rule reordering and to discover the optimal combination of packet capture configuration, within the PF_RING platform, and rule ordering. Our method has been tested under different network traffic loads. The genetic algorithm evolves configurations based on the recorded throughput rates: the higher the throughput, the better the solution. The results obtained indicate the effectiveness of the approach.
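
    A minimal, self-contained sketch of the optimization loop described above. The fitness function here is a stand-in for measured throughput (the actual work measures PF_RING capture rates under real traffic), and the rule set, operators, and parameters are illustrative assumptions.

    import random

    # Toy rule set: (rule id, probability a packet matches the rule, matching cost).
    RULES = [(i, random.random(), random.uniform(0.5, 2.0)) for i in range(20)]

    def fitness(order):
        """Stand-in for throughput: cheap, frequently hit rules placed early score higher."""
        expected_cost, remaining = 0.0, 1.0
        for rule_id in order:
            _, hit_prob, cost = RULES[rule_id]
            expected_cost += remaining * cost
            remaining *= 1.0 - hit_prob
        return 1.0 / expected_cost

    def mutate(order):
        """Swap two rule positions to produce a child ordering."""
        a, b = random.sample(range(len(order)), 2)
        child = order[:]
        child[a], child[b] = child[b], child[a]
        return child

    def evolve(generations=200, pop_size=30):
        population = [random.sample(range(len(RULES)), len(RULES)) for _ in range(pop_size)]
        for _ in range(generations):
            population.sort(key=fitness, reverse=True)
            survivors = population[: pop_size // 2]
            population = survivors + [mutate(random.choice(survivors)) for _ in survivors]
        return max(population, key=fitness)

    print(evolve())   # best rule ordering found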

    SUFFIX TREE, MINWISE HASHING AND STREAMING ALGORITHMS FOR BIG DATA ANALYSIS IN BIOINFORMATICS

    In this dissertation, we worked on several algorithmic problems in bioinformatics using mainly three approaches: (a) a streaming model, (b) suffix-tree based indexing, and (c) minwise hashing (minhash) and locality-sensitive hashing (LSH). Streaming models are useful for large-data problems where a good approximation needs to be achieved with limited space usage. We developed an approximation algorithm (Kmer-Estimate) using the streaming approach to obtain a better estimate of the frequency of k-mer counts. A k-mer, a subsequence of length k, plays an important role in many bioinformatics analyses such as genome distance estimation. We also developed new methods that use the suffix tree, a trie data structure, for alignment-free, non-pairwise algorithms for the conserved non-coding sequence (CNS) identification problem. We provide two different algorithms: STAG-CNS to identify exactly matched CNSs and DiCE to identify CNSs with mismatches. Using our algorithms, CNSs among various grass species were identified. A different approach was employed for the identification of longer CNSs (≥100 bp, mostly found in animals). In our new method (MinCNE), the minhash approach was used to estimate the Jaccard similarity. Also using LSH, k-mers extracted from genomic sequences were clustered and CNSs were identified. Another new algorithm (MinIsoClust), which also uses minhash and LSH techniques, was developed for the isoform clustering problem. Isoforms are generated from the same gene but by alternative splicing. Because isoform sequences share some exons but in different combinations, regular sequence clustering methods do not work well. Our algorithm generates clusters for isoform sequences based on their shared minhash signatures. Finally, we discuss de novo transcriptome assembly algorithms and how to improve assembly accuracy using ensemble approaches. First, we performed a comprehensive performance analysis of different transcriptome assemblers using simulated benchmark datasets. Then, we developed a new ensemble approach (Minsemble) for the de novo transcriptome assembly problem that integrates isoform clustering using the minhash technique to identify potentially correct transcripts from various de novo transcriptome assemblers. Minsemble identified more correctly assembled transcripts as well as genes compared to other de novo and ensemble methods. Adviser: Jitender S. Deogu
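
    A short, generic sketch of the minhash idea used in the MinCNE and MinIsoClust style of clustering (a textbook variant, not the dissertation's implementation): each sequence's k-mer set is summarized by its minimum hash value under several salted hash functions, and the fraction of matching signature entries estimates the Jaccard similarity of the underlying k-mer sets.

    import random

    def kmer_set(seq, k=8):
        return {seq[i:i + k] for i in range(len(seq) - k + 1)}

    def make_hashers(num=64, seed=1):
        """Salted hash functions (illustrative; real systems use stronger hash families)."""
        rng = random.Random(seed)
        return [lambda x, s=s: hash((s, x)) for s in (rng.getrandbits(64) for _ in range(num))]

    def minhash_signature(kmers, hashers):
        return [min(h(m) for m in kmers) for h in hashers]

    def estimated_jaccard(sig_a, sig_b):
        return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

    # Toy usage: compare two sequences via signatures instead of full k-mer sets.
    hashers = make_hashers()
    sig1 = minhash_signature(kmer_set("ACGTACGTGGTTACGTACGTAACC"), hashers)
    sig2 = minhash_signature(kmer_set("ACGTACGTGGTAACGTACGTAACC"), hashers)
    print(estimated_jaccard(sig1, sig2))   # LSH would then bucket similar signatures together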

    Transcript assembly and abundance estimation with high-throughput RNA sequencing

    We present algorithms and statistical methods for the reconstruction and abundance estimation of transcript sequences from high-throughput RNA sequencing ("RNA-Seq"). We evaluate these approaches through large-scale experiments on a well-studied model of muscle development. We begin with an overview of sequencing assays and outline why the short read alignment problem is fundamental to the analysis of these assays. We then describe two approaches to the contiguous alignment problem, one of which uses massively parallel graphics hardware to accelerate alignment, and one of which exploits an indexing scheme based on the Burrows-Wheeler transform. We then turn to the spliced alignment problem, which is fundamental to RNA-Seq, and present an algorithm, TopHat. TopHat is the first algorithm that can align the reads from an entire RNA-Seq experiment to a large genome without the aid of reference gene models. In the second part of the thesis, we present the first comparative RNA-Seq assembly algorithm, Cufflinks, which is adapted from a constructive proof of Dilworth's Theorem, a classic result in combinatorics. We evaluate Cufflinks by assembling the transcriptome from a time-course RNA-Seq experiment of developing skeletal muscle cells. The assembly contains 13,689 known transcripts and 3,724 novel ones. Of the novel transcripts, 62% were strongly supported by earlier sequencing experiments or by homologous transcripts in other organisms. We further validated interesting genes with isoform-specific RT-PCR. We then present a statistical model for RNA-Seq, included in Cufflinks, with which we estimate the abundances of transcripts from RNA-Seq data. Simulation studies demonstrate that the model is highly accurate. We apply this model to the muscle data and track the abundances of individual isoforms over development. Finally, we present significance tests for changes in relative and absolute abundances between time points, which we employ to uncover differential expression and differential regulation. By testing for relative abundance changes within and between transcripts sharing a transcription start site, we find significant shifts in the rates of alternative splicing and promoter preference in hundreds of genes, including those believed to regulate muscle development.
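
    The Dilworth connection mentioned above can be made concrete with the classical reduction from minimum path cover of a directed acyclic graph to maximum bipartite matching: every matched edge merges two paths, so the minimum number of vertex-disjoint paths equals the number of nodes minus the matching size. The sketch below is a generic illustration of that reduction, with an invented toy overlap graph, and is not Cufflinks itself.

    def min_path_cover(num_nodes, edges):
        """Minimum path cover of a DAG = num_nodes - size of a maximum bipartite matching."""
        adj = [[] for _ in range(num_nodes)]
        for u, v in edges:
            adj[u].append(v)
        match_to = [-1] * num_nodes            # right-side copy of v -> matched left-side u

        def augment(u, seen):
            """Kuhn's augmenting-path search from left-side node u."""
            for v in adj[u]:
                if v not in seen:
                    seen.add(v)
                    if match_to[v] == -1 or augment(match_to[v], seen):
                        match_to[v] = u
                        return True
            return False

        matching = sum(augment(u, set()) for u in range(num_nodes))
        return num_nodes - matching

    # Toy fragment-overlap DAG: minimum cover is 2 paths, e.g. 0 -> 1 -> 3 and node 2 alone.
    print(min_path_cover(4, [(0, 1), (1, 3), (2, 3)]))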