217 research outputs found
Protein Threading for Genome-Scale Structural Analysis
Protein structure prediction is a necessary tool in bioinformatic analysis. It is a non-trivial process that can add a great deal of information to a genome annotation. This dissertation deals with protein structure prediction through protein fold recognition and outlines several strategies for improving protein threading techniques. To improve protein threading performance, the dissertation begins with an outline of sequence/structure alignment energy functions. A technique called Violated Inequality Minimization is used to adapt quickly to the changing energy landscape as new energy functions are added. To further improve alignment accuracy and fold recognition, new formulations of energy functions are used to create the sequence/structure alignment. These energies include a gap penalty that depends on sequence characteristics, in contrast to the traditional constant penalty. Another proposed energy depends on conserved structural patterns found during threading; these patterns have been employed to refine the sequence/structure alignment. The section on the linear programming algorithm for protein structure alignment deals with optimizing an alignment using additional residue-pair energy functions. In the original version of the model, all cores had to be aligned to the target sequence. Our research expands the original threading model to permit core deletions, which allows a more flexible alignment. Aside from improvements in fold recognition and alignment accuracy, these techniques must also scale to the computational demands of genome-level structure prediction. A heuristic decision-making process has been designed to automate the classification and preparation of proteins for prediction.
A graph analysis has been applied to the integration of the different tools in the pipeline. Analysis of the data dependency graph allows genome structure prediction to be parallelized automatically. Together, these contributions improve the overall performance of protein threading and distribute computations across a large set of computers, making genome-scale protein structure prediction practically feasible.
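As a rough illustration of how a data dependency graph can drive automatic parallelization, the sketch below groups pipeline steps into levels whose members do not depend on one another and could therefore run concurrently. The step names in `PIPELINE` are hypothetical placeholders, not the tools used in the dissertation, and the level-by-level scheduling is one simple strategy among several.

```python
from collections import defaultdict


def parallel_levels(deps):
    """Group tasks into levels; tasks inside one level are mutually independent.

    `deps` maps each task name to the set of tasks it depends on.
    """
    indegree = {task: len(prereqs) for task, prereqs in deps.items()}
    dependents = defaultdict(list)
    for task, prereqs in deps.items():
        for prereq in prereqs:
            dependents[prereq].append(task)
            indegree.setdefault(prereq, 0)  # prerequisite listed only as a dependency

    ready = sorted(task for task, count in indegree.items() if count == 0)
    levels = []
    scheduled = 0
    while ready:
        levels.append(ready)
        scheduled += len(ready)
        next_ready = []
        for task in ready:
            for child in dependents[task]:
                indegree[child] -= 1
                if indegree[child] == 0:
                    next_ready.append(child)
        ready = sorted(next_ready)

    if scheduled != len(indegree):
        raise ValueError("dependency cycle detected")
    return levels


# Hypothetical pipeline: names are illustrative assumptions only.
PIPELINE = {
    "sequence_filter": set(),
    "secondary_structure": {"sequence_filter"},
    "profile_search": {"sequence_filter"},
    "threading": {"secondary_structure", "profile_search"},
    "model_building": {"threading"},
}
```

Calling `parallel_levels(PIPELINE)` places `secondary_structure` and `profile_search` in the same level, which is where a scheduler could dispatch them to separate machines.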
DynamO: A free O(N) general event-driven molecular-dynamics simulator
Molecular-dynamics algorithms for systems of particles interacting through
discrete or "hard" potentials are fundamentally different to the methods for
continuous or "soft" potential systems. Although many software packages have
been developed for continuous potential systems, software for discrete
potential systems based on event-driven algorithms is relatively scarce and
specialized. We present DynamO, a general event-driven simulation package which
displays the optimal O(N) asymptotic scaling of the computational cost with the
number of particles N, rather than the O(N log(N)) scaling found in most
standard algorithms. DynamO provides reference implementations of the best
available event-driven algorithms. These techniques allow the rapid simulation
of both complex and large (>10^6 particles) systems for long times. The
performance of the program is benchmarked for elastic hard sphere systems,
homogeneous cooling and sheared inelastic hard spheres, and equilibrium
Lennard-Jones fluids. This software and its documentation are distributed under
the GNU General Public License and can be freely downloaded from
http://marcusbannerman.co.uk/dynamo
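To make the event-driven idea concrete, here is a minimal sketch of the basic step for elastic hard spheres: solve for the time at which two spheres first touch, then pick the earliest such event from a priority queue. It is not DynamO's code. In particular, the all-pairs loop below costs O(N^2) per call and omits the neighbour lists and event calendars a production simulator would need to reach the scaling described in the abstract. The function names are assumptions made for illustration.

```python
import heapq
import math


def collision_time(r1, v1, r2, v2, diameter):
    """Time until two equal-diameter hard spheres touch, or None if they never do."""
    dr = [b - a for a, b in zip(r1, r2)]  # relative position
    dv = [b - a for a, b in zip(v1, v2)]  # relative velocity
    b = sum(x * y for x, y in zip(dr, dv))
    if b >= 0:
        return None  # spheres are not approaching each other
    dv2 = sum(x * x for x in dv)
    dr2 = sum(x * x for x in dr)
    disc = b * b - dv2 * (dr2 - diameter ** 2)
    if disc < 0:
        return None  # closest approach is larger than one diameter
    return (-b - math.sqrt(disc)) / dv2


def next_collision(positions, velocities, diameter):
    """Return the earliest (time, i, j) collision event, or None if there is none."""
    events = []
    n = len(positions)
    for i in range(n):
        for j in range(i + 1, n):
            t = collision_time(positions[i], velocities[i],
                               positions[j], velocities[j], diameter)
            if t is not None:
                heapq.heappush(events, (t, i, j))
    return heapq.heappop(events) if events else None
```

An event-driven simulator advances the system directly to the returned time, applies the collision rule to the pair involved, and recomputes only the events that the collision invalidated, rather than stepping through fixed time increments.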
Efficient Storage of Genomic Sequences in High Performance Computing Systems
In this dissertation, we address the challenges of genomic data storage in high performance computing systems. In particular, we focus on developing a referential compression approach for Next Generation Sequencing data stored in FASTQ format files. The amount of genomic data available for researchers to process has increased exponentially, bringing enormous challenges for its efficient storage and transmission. General-purpose compressors offer only limited performance for genomic data, hence the need for specialized compression solutions. Two trends have emerged as alternatives that harness the particular properties of genomic data: non-referential and referential compression. Non-referential compressors offer higher compression ratios than general-purpose compressors, but still below what a referential compressor could theoretically achieve. However, the effectiveness of referential compression depends on selecting a good reference and on having enough computing resources available. This thesis presents one of the first referential compressors for FASTQ files. We first present a comprehensive analytical and experimental evaluation of the most relevant tools for genomic raw data compression, which led us to identify the main needs and opportunities in this field. As a consequence, we propose a novel compression workflow that aims at improving the usability of referential compressors. Subsequently, we discuss the implementation and performance evaluation of the core of the proposed workflow: a referential compressor for reads in FASTQ format that combines local read-to-reference alignments with a specialized binary-encoding strategy. The compression algorithm, named UdeACompress, achieved very competitive compression ratios when compared to the best compressors in the current state of the art, while showing reasonable execution times and memory use.
In particular, UdeACompress outperformed all competitors when compressing long reads, typical of the newest sequencing technologies. Finally, we study the main aspects of data-level parallelism in the Intel AVX-512 architecture in order to develop a parallel version of the UdeACompress algorithms and reduce the runtime. Through the use of SIMD programming, we managed to significantly accelerate the main bottleneck found in UdeACompress, the suffix array construction.
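The core idea of referential compression can be sketched in a few lines: instead of storing a read's bases, store where it aligns on the reference and only the bases that differ. The example below is an illustrative simplification and not the UdeACompress encoding. It assumes the alignment position is already known (for instance from an aligner), handles substitutions only, and leaves out the binary packing stage described in the abstract.

```python
def encode_read(read, reference, position):
    """Encode a read as (position, length, mismatches) relative to a reference.

    `mismatches` lists (offset, base) pairs where the read differs from the reference.
    """
    window = reference[position:position + len(read)]
    if len(window) != len(read):
        raise ValueError("read does not fit inside the reference at this position")
    mismatches = [
        (offset, base)
        for offset, (base, ref_base) in enumerate(zip(read, window))
        if base != ref_base
    ]
    return position, len(read), mismatches


def decode_read(encoded, reference):
    """Rebuild the original read from its referential encoding."""
    position, length, mismatches = encoded
    bases = list(reference[position:position + length])
    for offset, base in mismatches:
        bases[offset] = base
    return "".join(bases)


# Toy data, invented for illustration.
REFERENCE = "ACGTACGTTAGC"
READ = "ACGATAG"
ENCODED = encode_read(READ, REFERENCE, 4)
```

The better the read matches the reference, the shorter the mismatch list, which is why the choice of reference matters so much for the achievable compression ratio.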
GNNFlow: A Distributed Framework for Continuous Temporal GNN Learning on Dynamic Graphs
Graph Neural Networks (GNNs) play a crucial role in various fields. However,
most existing deep graph learning frameworks assume pre-stored static graphs
and do not support training on graph streams. In contrast, many real-world
graphs are dynamic and contain time domain information. We introduce GNNFlow, a
distributed framework that enables efficient continuous temporal graph
representation learning on dynamic graphs on multi-GPU machines. GNNFlow
introduces an adaptive time-indexed block-based data structure that effectively
balances memory usage with graph update and sampling operation efficiency. It
features a hybrid GPU-CPU graph data placement for rapid GPU-based temporal
neighborhood sampling and kernel optimizations for enhanced sampling processes.
A dynamic GPU cache for node and edge features is developed to maximize cache
hit rates through reuse and restoration strategies. GNNFlow supports
distributed training across multiple machines with static scheduling to ensure
load balance. We implement GNNFlow based on DGL and PyTorch. Our experimental
results show that GNNFlow provides up to 21.1x faster continuous learning than
existing systems.
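Temporal neighborhood sampling, mentioned above, can be illustrated with a small CPU-only sketch: each node keeps its edges ordered by timestamp, so the most recent neighbors before a query time are found with a binary search. This is a plain Python stand-in under simplifying assumptions, not GNNFlow's block-based GPU data structure, and the class and method names are hypothetical.

```python
import bisect
from collections import defaultdict


class TemporalGraph:
    """Per-node adjacency lists kept sorted by edge timestamp."""

    def __init__(self):
        self._times = defaultdict(list)      # node -> sorted timestamps
        self._neighbors = defaultdict(list)  # node -> neighbors, same order

    def add_edge(self, src, dst, timestamp):
        # Insert at the position that keeps the timestamps sorted,
        # so edges may arrive slightly out of order.
        index = bisect.bisect_right(self._times[src], timestamp)
        self._times[src].insert(index, timestamp)
        self._neighbors[src].insert(index, dst)

    def recent_neighbors(self, node, before, k):
        """Return up to k (neighbor, timestamp) pairs with timestamp < before, newest first."""
        times = self._times.get(node, [])
        neighbors = self._neighbors.get(node, [])
        end = bisect.bisect_left(times, before)
        start = max(0, end - k)
        return list(zip(neighbors[start:end], times[start:end]))[::-1]
```

Restricting the sample to edges strictly earlier than the query time keeps a model from seeing interactions that had not yet happened when the prediction is made.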
Homology Modeling of Toll-Like Receptor Ligand-Binding Domains
Toll-like receptors (TLRs) are on the front line during the initiation of an innate immune response against invading pathogens. TLRs are type I transmembrane proteins that are expressed on the surface of immune system cells. They are evolutionarily conserved between insects and vertebrates. To date, 13 groups of mammalian TLRs have been identified, 10 in humans and 13 in mice. They share a modular structure that consists of a leucine-rich repeat (LRR) ectodomain, a single transmembrane helix and a cytoplasmic Toll/interleukin-1 receptor (TIR) domain. Most TLRs have been shown to recognize pathogen-associated molecular patterns (PAMPs) from a wide range of invading agents and to initiate intracellular signal transduction pathways that trigger the expression of genes whose products control innate immune responses. The TLR signaling pathways, however, must be under tight negative regulation to maintain immune balance, because over-activation of immune responses in the body can cause autoimmune diseases.
The TLR ectodomains are highly variable and are directly involved in ligand recognition. So far, crystal structures are missing for most TLR ectodomains because structure determination by X-ray diffraction or nuclear magnetic resonance (NMR) spectroscopy experiments remains time-consuming, and sometimes the crystallization of a protein can be very difficult. Computational modeling enables initial predictions of three-dimensional structures for the investigation of receptor-ligand interaction mechanisms. Computational methods are also helpful to develop new TLR agonists and antagonists that have therapeutic significance for diseases.
In this dissertation, an LRR template assembly approach for homology modeling of TLR ligand-binding domains is discussed. To facilitate the modeling work, two databases, TollML and LRRML, have been established. With this LRR template assembly approach, the ligand-binding domains of human TLR5-10 and mouse TLR11-13 were modeled. Based on the models of human TLR7, 8 and 9, we predicted potential ligand-binding residues and possible configurations of the receptor-ligand complex using a combined procedure. In addition, we modeled the cytoplasmic TIR domains of TLR4 and 7, the TLR adaptor protein MyD88 (myeloid differentiation primary response protein 88) and the TLR inhibitor SIGIRR (single immunoglobulin interleukin-1 receptor-related molecule) to investigate the structural mechanism of TLR negative regulation.
Space Efficient Sequence Alignment for SRAM-Based Computing: X-Drop on the Graphcore IPU
Dedicated accelerator hardware has become essential for processing AI-based
workloads, leading to the rise of novel accelerator architectures. Furthermore,
fundamental differences in memory architecture and parallelism have made these
accelerators targets for scientific computing.
The sequence alignment problem is fundamental in bioinformatics; we have
implemented the X-Drop algorithm, a heuristic method for pairwise alignment
that reduces search space, on the Graphcore Intelligence Processing Unit (IPU)
accelerator. The X-Drop algorithm has an irregular computational pattern,
which makes it difficult to accelerate because of load-balancing problems.
Here, we introduce a graph-based partitioning and queue-based batch system to
improve load balancing. Our implementation achieves a speedup over a
state-of-the-art GPU implementation and over a CPU implementation. In
addition, we introduce a memory-restricted X-Drop algorithm that reduces the
memory footprint and efficiently uses the IPU's limited
low-latency SRAM. This optimization further improves the strong scaling
performance.
Comment: 12 pages, 7 figures, 2 tables
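For readers unfamiliar with X-Drop, the following sketch shows the general pruning idea on a CPU: the alignment matrix is filled one antidiagonal at a time, and any cell whose score falls more than `x` below the best score seen so far is discarded, so the computation stops early once the two sequences diverge. This is a simplified textbook-style formulation with assumed linear scoring parameters (`match`, `mismatch`, `gap`). It is not the memory-restricted IPU implementation from the paper.

```python
def xdrop_extend(a, b, x, match=1, mismatch=-1, gap=-1):
    """Gapped X-drop extension from the start of both sequences.

    Returns (best_score, cells_evaluated).
    """
    best = 0
    evaluated = 0
    prev2 = {}       # surviving cells on antidiagonal k-2, keyed by row index i
    prev1 = {0: 0}   # antidiagonal 0 holds only the origin cell (0, 0)
    for k in range(1, len(a) + len(b) + 1):
        if not prev1 and not prev2:
            break    # every path has been pruned
        # Cells reachable from survivors of the two previous antidiagonals.
        rows = {i for i in prev1} | {i + 1 for i in prev1} | {i + 1 for i in prev2}
        current = {}
        for i in sorted(rows):
            j = k - i
            if i > len(a) or j > len(b):
                continue
            evaluated += 1
            options = []
            if i - 1 in prev2:   # diagonal move from (i-1, j-1)
                sub = match if a[i - 1] == b[j - 1] else mismatch
                options.append(prev2[i - 1] + sub)
            if i - 1 in prev1:   # gap move from (i-1, j)
                options.append(prev1[i - 1] + gap)
            if i in prev1:       # gap move from (i, j-1)
                options.append(prev1[i] + gap)
            score = max(options)
            if score >= best - x:   # X-drop test: keep only near-best cells
                current[i] = score
        if current:
            best = max(best, max(current.values()))
        prev2, prev1 = prev1, current
    return best, evaluated
```

A smaller `x` prunes more aggressively and evaluates fewer cells, at the risk of missing an alignment that recovers after a poorly matching stretch. The number of surviving cells per antidiagonal also varies with the input, which is the source of the irregular workload that the abstract identifies as a load-balancing challenge.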
Alternative Splicing and Protein Structure Evolution
In recent years, the amount of experimental data available across many areas of biology has increased dramatically. For the first time, these data allow a detailed analysis of how cellular components such as genes and proteins function, of how they are connected in cellular networks, and of their evolutionary history. Bioinformatics in particular plays an important role here in preparing the data and interpreting them biologically.
This doctoral thesis examines two important areas of current bioinformatics research: the analysis of protein structure evolution and of similarities between protein structures, and the analysis of alternative splicing, an integral process in eukaryotic cells that contributes to functional diversity. In particular, this work introduces the idea of a combined analysis of the two mechanisms (structure evolution and splicing). We show that a combined view yields new insights into how structure evolution, alternative splicing, and a coupling of the two mechanisms contribute to functional and structural complexity in higher organisms.
The methods, hypotheses and results presented in this work can contribute to our understanding of how structure evolution and alternative splicing operate in the emergence of complex organisms, so that these two traditionally separate areas of bioinformatics can benefit from each other in the future.