217 research outputs found

    Protein Threading for Genome-Scale Structural Analysis

    Protein structure prediction is a necessary tool in bioinformatic analysis. It is a non-trivial process that can add a great deal of information to a genome annotation. This dissertation deals with protein structure prediction through protein fold recognition and outlines several strategies for improving protein threading techniques. To improve threading performance, the dissertation begins with an outline of sequence/structure alignment energy functions. A technique called Violated Inequality Minimization is used to adapt quickly to the changing energy landscape as new energy functions are added. To further improve alignment accuracy and fold recognition, new formulations of energy functions are used to create the sequence/structure alignment. These energies include a gap penalty that depends on sequence characteristics, in contrast to the traditional constant penalty. Another proposed energy depends on conserved structural patterns found during threading. These structural patterns have been employed to refine the sequence/structure alignment in my research. The section on the Linear Programming Algorithm for protein structure alignment deals with optimizing an alignment using additional residue-pair energy functions. In the original version of the model, all cores had to be aligned to the target sequence. Our research outlines an expansion of the original threading model that permits a more flexible alignment by allowing core deletions. Aside from improvements in fold recognition and alignment accuracy, these techniques must also scale to the computational demands of genome-level structure prediction. A heuristic decision-making process has been designed to automate the classification and preparation of proteins for prediction. A graph analysis has been applied to the integration of the different tools involved in the pipeline. Analysis of the data dependency graph allows for automatic parallelization of genome structure prediction. Together, these contributions improve the overall performance of protein threading and distribute computations across a large set of computers, making genome-scale protein structure prediction practically feasible.
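The abstract mentions deriving parallelism from a data dependency graph. A minimal sketch of that general idea follows, using Python's standard `graphlib`; the stage names are hypothetical placeholders and this is not the dissertation's pipeline. Stages in the same batch have no unmet dependencies on each other, so in principle they could run concurrently.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def parallel_batches(deps):
    """Group tasks into batches whose members can run concurrently.

    `deps` maps each task to the set of tasks it depends on.
    """
    sorter = TopologicalSorter(deps)
    sorter.prepare()  # raises CycleError if the graph has a cycle
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # tasks with all dependencies done
        batches.append(ready)
        sorter.done(*ready)  # mark the whole batch as finished
    return batches


# Hypothetical stage names, chosen only for illustration.
pipeline = {
    "classify": {"parse_genome"},
    "build_profile": {"classify"},
    "predict_secondary_structure": {"classify"},
    "threading": {"build_profile", "predict_secondary_structure"},
}

for batch in parallel_batches(pipeline):
    print(batch)
```

Here the two stages that depend only on `classify` land in the same batch, which is the point at which a scheduler could dispatch them to different machines.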

    DynamO: A free O(N) general event-driven molecular-dynamics simulator

    Molecular-dynamics algorithms for systems of particles interacting through discrete or "hard" potentials are fundamentally different to the methods for continuous or "soft" potential systems. Although many software packages have been developed for continuous potential systems, software for discrete potential systems based on event-driven algorithms is relatively scarce and specialized. We present DynamO, a general event-driven simulation package which displays the optimal O(N) asymptotic scaling of the computational cost with the number of particles N, rather than the O(N log(N)) scaling found in most standard algorithms. DynamO provides reference implementations of the best available event-driven algorithms. These techniques allow the rapid simulation of both complex and large (>10^6 particles) systems for long times. The performance of the program is benchmarked for elastic hard sphere systems, homogeneous cooling and sheared inelastic hard spheres, and equilibrium Lennard-Jones fluids. This software and its documentation are distributed under the GNU General Public License and can be freely downloaded from http://marcusbannerman.co.uk/dynamo
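For hard potentials, an event-driven simulator advances from one collision to the next rather than in fixed time steps, so its basic building block is predicting when two particles will touch. The sketch below shows that calculation for two equal hard spheres in free flight. It is an illustration only and is not taken from DynamO; the function and parameter names are assumptions.

```python
import math


def collision_time(r1, v1, r2, v2, sigma):
    """Time until two hard spheres of diameter `sigma` first touch.

    Solves |r + v*t| = sigma for the smallest t > 0, where r and v are
    the relative position and velocity. Returns math.inf if they never
    collide. Assumes the spheres do not overlap at t = 0.
    """
    r = [a - b for a, b in zip(r1, r2)]
    v = [a - b for a, b in zip(v1, v2)]
    b = sum(x * y for x, y in zip(r, v))  # r . v
    if b >= 0:
        return math.inf  # moving apart or at rest relative to each other
    vv = sum(x * x for x in v)  # |v|^2, strictly positive here since b < 0
    c = sum(x * x for x in r) - sigma ** 2
    disc = b * b - vv * c
    if disc < 0:
        return math.inf  # trajectories miss each other
    return (-b - math.sqrt(disc)) / vv


# Head-on approach along x: centres 3 apart, closing speed 2, diameter 1.
t = collision_time((0, 0, 0), (1, 0, 0), (3, 0, 0), (-1, 0, 0), sigma=1.0)
print(t)  # 1.0
```

A full simulator would keep such predicted times in a priority queue and repeatedly process the earliest event; that bookkeeping is omitted here.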

    Efficient Storage of Genomic Sequences in High Performance Computing Systems

    ABSTRACT: In this dissertation, we address the challenges of genomic data storage in high performance computing systems. In particular, we focus on developing a referential compression approach for Next Generation Sequence data stored in FASTQ format files. The amount of genomic data available for researchers to process has increased exponentially, bringing enormous challenges for its efficient storage and transmission. General-purpose compressors can only offer limited performance for genomic data, hence the need for specialized compression solutions. Two trends have emerged as alternatives to harness the particular properties of genomic data: non-referential and referential compression. Non-referential compressors offer higher compression ratios than general-purpose compressors, but still below what a referential compressor could theoretically achieve. However, the effectiveness of referential compression depends on selecting a good reference and on having enough computing resources available. This thesis presents one of the first referential compressors for FASTQ files. We first present a comprehensive analytical and experimental evaluation of the most relevant tools for genomic raw data compression, which led us to identify the main needs and opportunities in this field. As a consequence, we propose a novel compression workflow that aims at improving the usability of referential compressors. Subsequently, we discuss the implementation and performance evaluation for the core of the proposed workflow: a referential compressor for reads in FASTQ format that combines local read-to-reference alignments with a specialized binary-encoding strategy. The compression algorithm, named UdeACompress, achieved very competitive compression ratios when compared to the best compressors in the current state of the art, while showing reasonable execution times and memory use. In particular, UdeACompress outperformed all competitors when compressing long reads, typical of the newest sequencing technologies. Finally, we study the main aspects of the data-level parallelism in the Intel AVX-512 architecture, in order to develop a parallel version of the UdeACompress algorithms to reduce the runtime. Through the use of SIMD programming, we managed to significantly accelerate the main bottleneck found in UdeACompress, the Suffix Array Construction.
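The abstract names suffix array construction as the main bottleneck. For readers unfamiliar with the structure, here is a deliberately naive sketch of what a suffix array is and how it can locate a substring in a reference. It sorts whole suffixes, which is far slower than the construction algorithms a production tool would use, and it is not UdeACompress code; all names are illustrative.

```python
def suffix_array(text):
    """Start positions of all suffixes of `text`, in lexicographic order.

    Naive construction: sorts suffix slices directly, so it is only
    suitable for short illustrative inputs.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


def find(text, sa, pattern):
    """All positions where `pattern` occurs in `text`, via binary search on `sa`."""
    lo, hi = 0, len(sa)
    while lo < hi:  # lower bound: first suffix whose prefix is >= pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + len(pattern)] < pattern:
            lo = mid + 1
        else:
            hi = mid
    matches = []
    while lo < len(sa) and text.startswith(pattern, sa[lo]):
        matches.append(sa[lo])  # matching suffixes are contiguous in sa
        lo += 1
    return sorted(matches)


reference = "ACGTACGGACGT"  # toy reference sequence
sa = suffix_array(reference)
print(find(reference, sa, "ACG"))  # [0, 4, 8]
```

Because every occurrence of a pattern corresponds to a contiguous run in the sorted suffix order, lookups cost a binary search plus the number of hits, which is why building the array quickly matters.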

    GNNFlow: A Distributed Framework for Continuous Temporal GNN Learning on Dynamic Graphs

    Graph Neural Networks (GNNs) play a crucial role in various fields. However, most existing deep graph learning frameworks assume pre-stored static graphs and do not support training on graph streams. In contrast, many real-world graphs are dynamic and contain time-domain information. We introduce GNNFlow, a distributed framework that enables efficient continuous temporal graph representation learning on dynamic graphs on multi-GPU machines. GNNFlow introduces an adaptive time-indexed block-based data structure that effectively balances memory usage with graph update and sampling operation efficiency. It features a hybrid GPU-CPU graph data placement for rapid GPU-based temporal neighborhood sampling and kernel optimizations for enhanced sampling processes. A dynamic GPU cache for node and edge features is developed to maximize cache hit rates through reuse and restoration strategies. GNNFlow supports distributed training across multiple machines with static scheduling to ensure load balance. We implement GNNFlow based on DGL and PyTorch. Our experimental results show that GNNFlow provides up to 21.1x faster continuous learning than existing systems.
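To make "temporal neighborhood sampling" concrete, the following is a small CPU-only sketch that keeps each node's edges sorted by timestamp and returns the most recent neighbors seen strictly before a query time. It is an assumption-laden illustration of the general operation, not GNNFlow's block-based structure or its API; the class and method names are hypothetical.

```python
import bisect
from collections import defaultdict


class TemporalGraph:
    """Toy dynamic graph with per-node, time-sorted adjacency lists."""

    def __init__(self):
        self._times = defaultdict(list)  # node -> sorted edge timestamps
        self._nbrs = defaultdict(list)   # node -> neighbors, same order

    def add_edge(self, src, dst, t):
        # Insert at the position that keeps timestamps sorted, so edges
        # may arrive out of order.
        idx = bisect.bisect_right(self._times[src], t)
        self._times[src].insert(idx, t)
        self._nbrs[src].insert(idx, dst)

    def recent_neighbors(self, node, t, k):
        """Up to `k` most recent (neighbor, time) pairs with time < t."""
        times = self._times.get(node, [])
        nbrs = self._nbrs.get(node, [])
        end = bisect.bisect_left(times, t)  # first index with time >= t
        start = max(0, end - k)
        return list(zip(nbrs[start:end], times[start:end]))


g = TemporalGraph()
g.add_edge(0, 1, 1.0)
g.add_edge(0, 3, 3.0)
g.add_edge(0, 2, 2.0)  # out-of-order arrival is handled by the sorted insert
print(g.recent_neighbors(0, t=3.0, k=2))  # [(1, 1.0), (2, 2.0)]
```

Restricting the sample to edges before the query time is what prevents a model from seeing future interactions during training.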

    Homology Modeling of Toll-Like Receptor Ligand-Binding Domains

    Toll-like receptors (TLRs) are on the front line during the initiation of an innate immune response against invading pathogens. TLRs are type I transmembrane proteins that are expressed on the surface of immune system cells. They are evolutionarily conserved between insects and vertebrates. To date, 13 groups of mammalian TLRs have been identified, ten in humans and 13 in mice. They share a modular structure that consists of a leucine-rich repeat (LRR) ectodomain, a single transmembrane helix and a cytoplasmic Toll/interleukin-1 receptor (TIR) domain. Most TLRs have been shown to recognize pathogen-associated molecular patterns (PAMPs) from a wide range of invading agents and initiate intracellular signal transduction pathways to trigger expression of genes, the products of which can control innate immune responses. The TLR signaling pathways, however, must be under tight negative regulation to maintain immune balance because over-activation of immune responses in the body can cause autoimmune diseases. The TLR ectodomains are highly variable and are directly involved in ligand recognition. So far, crystal structures are missing for most TLR ectodomains because structure determination by X-ray diffraction or nuclear magnetic resonance (NMR) spectroscopy experiments remains time-consuming, and sometimes the crystallization of a protein can be very difficult. Computational modeling enables initial predictions of three-dimensional structures for the investigation of receptor-ligand interaction mechanisms. Computational methods are also helpful for developing new TLR agonists and antagonists that have therapeutic significance for diseases. In this dissertation, an LRR template assembly approach for homology modeling of TLR ligand-binding domains is discussed. To facilitate the modeling work, two databases, TollML and LRRML, have been established. With this LRR template assembly approach, the ligand-binding domains of human TLR5-10 and mouse TLR11-13 were modeled. Based on the models of human TLR7, 8 and 9, we predicted potential ligand-binding residues and possible configurations of the receptor-ligand complex using a combined procedure. In addition, we modeled the cytoplasmic TIR domains of TLR4 and 7, the TLR adaptor protein MyD88 (myeloid differentiation primary response protein 88) and the TLR inhibitor SIGIRR (Single immunoglobulin interleukin-1 receptor-related molecule) to investigate the structural mechanism of TLR negative regulation.

    Space Efficient Sequence Alignment for SRAM-Based Computing: X-Drop on the Graphcore IPU

    Dedicated accelerator hardware has become essential for processing AI-based workloads, leading to the rise of novel accelerator architectures. Furthermore, fundamental differences in memory architecture and parallelism have made these accelerators targets for scientific computing. The sequence alignment problem is fundamental in bioinformatics; we have implemented the X-Drop algorithm, a heuristic method for pairwise alignment that reduces search space, on the Graphcore Intelligence Processor Unit (IPU) accelerator. The X-Drop algorithm has an irregular computational pattern, which makes it difficult to accelerate due to load balancing. Here, we introduce a graph-based partitioning and queue-based batch system to improve load balancing. Our implementation achieves 10× speedup over a state-of-the-art GPU implementation and up to 4.65× compared to CPU. In addition, we introduce a memory-restricted X-Drop algorithm that reduces memory footprint by 55× and efficiently uses the IPU's limited low-latency SRAM. This optimization further improves the strong scaling performance by 3.6×.
    Comment: 12 pages, 7 figures, 2 tables
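The X-drop idea is to stop extending an alignment once its running score falls more than X below the best score seen so far, which is also why the amount of work per sequence pair is irregular. The sketch below shows only the simplest, ungapped form of that termination rule. It is a simplification for illustration, not the algorithm variant or the IPU implementation described in the paper; the scoring values and names are assumptions.

```python
def xdrop_extend(a, b, x, match=1, mismatch=-1):
    """Ungapped extension of two sequences from their first characters.

    Extends position by position and stops once the running score has
    dropped more than `x` below the best score so far.
    Returns (best_score, length_of_best_extension).
    """
    best = score = 0
    best_len = 0
    for i in range(min(len(a), len(b))):
        score += match if a[i] == b[i] else mismatch
        if score > best:
            best, best_len = score, i + 1  # new best extension ends here
        elif best - score > x:
            break  # X-drop condition: give up on this extension
    return best, best_len


# Eight matching bases followed by a divergent tail.
print(xdrop_extend("ACGTACGTTTTT", "ACGTACGTAAAA", x=2))  # (8, 8)
```

A larger `x` tolerates longer runs of mismatches before giving up, at the cost of exploring more positions per pair.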

    Alternative Splicing and Protein Structure Evolution

    In recent years, there has been a dramatic increase in available experimental data across many areas of biology. For the first time, these data allow a detailed analysis of how cellular components such as genes and proteins function, of how they are linked in cellular networks, and of their evolutionary history. Bioinformatics in particular plays an important role in preparing these data and interpreting them biologically. This doctoral thesis examines two important areas of current bioinformatics research: the analysis of protein structure evolution and of similarities between protein structures, and the analysis of alternative splicing, an integral process in eukaryotic cells that contributes to functional diversity. In particular, this work introduces the idea of a combined analysis of the two mechanisms (structure evolution and splicing). We show that a combined view yields new insights into how structure evolution and alternative splicing, as well as a coupling of both mechanisms, contribute to functional and structural complexity in higher organisms. The methods, hypotheses and results presented in this thesis can contribute to our understanding of how structure evolution and alternative splicing operate in the emergence of complex organisms, so that these two traditionally separate areas of bioinformatics can benefit from each other in the future.