
    High Performance Computing Algorithms for Accelerating Peptide Identification from Mass-Spectrometry Data Using Heterogeneous Supercomputers

    Fast and accurate identification of peptides and proteins from mass spectrometry (MS) data is a critical problem in modern systems biology. Database peptide search is the most commonly used computational method for identifying peptide sequences from MS data. In this method, gigabytes of experimentally generated MS data are compared against terabyte-sized databases of theoretically simulated MS data, a compute- and data-intensive problem that requires days or weeks of computation on desktop machines. Existing serial and high-performance computing (HPC) algorithms strive to accelerate the search and improve its computational efficiency, but exhibit sub-optimal performance due to inefficient parallelization models, low resource utilization, and high overhead costs.
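
A minimal sketch of the scoring kernel at the heart of such a database search, assuming a simple shared-peak-count score: each experimental spectrum is compared against candidate theoretical spectra by counting fragment peaks that agree within a mass tolerance. All names and the tolerance value are illustrative, not taken from any particular search engine.

```python
# A minimal sketch of the core search kernel, assuming a simple
# shared-peak-count score. Names and tolerance are illustrative.
from bisect import bisect_left
from typing import List

def shared_peak_score(exp_peaks: List[float],
                      theo_peaks: List[float],
                      tol: float = 0.02) -> int:
    """Count experimental peaks with a theoretical match within +/- tol Da.

    Both peak lists must be sorted in ascending m/z order.
    """
    matched = 0
    for mz in exp_peaks:
        i = bisect_left(theo_peaks, mz - tol)
        if i < len(theo_peaks) and theo_peaks[i] <= mz + tol:
            matched += 1
    return matched

def best_match(exp_peaks: List[float],
               database: List[List[float]]) -> int:
    """Return the index of the best-scoring theoretical spectrum."""
    scores = [shared_peak_score(exp_peaks, theo) for theo in database]
    return max(range(len(database)), key=scores.__getitem__)
```

In a real engine this inner loop runs billions of times, which is why the parallelization model and memory layout dominate overall performance.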

    Computational Strategies for Proteogenomics Analyses

    Proteogenomics is an area of proteomics concerning the detection of novel peptides and peptide variants nominated by genomics and transcriptomics experiments. While the term primarily refers to studies utilizing a customized protein database derived from select sequencing experiments, proteogenomics methods can also be applied in the quest to identify previously unobserved, or missing, proteins in a reference protein database. The identification of novel peptides is difficult, and results can be dominated by false positives if conventional computational and statistical approaches for shotgun proteomics are applied directly, without consideration of the challenges involved in proteogenomics analyses. In this dissertation, I systematically distill the sources of false positives in peptide identification and present potential remedies, including the computational strategies necessary to make these approaches feasible for large datasets. In the first part, I analyze high-scoring decoys, which are false identifications with high assigned confidences, using multiple peptide identification strategies to understand how they are generated and to develop strategies for reducing false positives. I also demonstrate that modified peptides can violate the target-decoy assumptions, a cornerstone of error-rate estimation in shotgun proteomics, leading to potential underestimation of the number of false positives. Second, I address computational bottlenecks in proteogenomics workflows through the development of two database search engines: EGADS and MSFragger. EGADS addresses the large sequence space involved in proteogenomics studies by using graphics processing units to accelerate both in-silico digestion and similarity scoring. MSFragger implements a novel fragment ion index and searching algorithm that vastly speeds up spectral similarity calculations. For the identification of modified peptides using the open search strategy, MSFragger is over 150X faster than conventional database search tools. Finally, I discuss refinements to the open search strategy for detecting modified peptides, along with tools for improved collation and annotation. Using the speed afforded by MSFragger, I perform open searching on several large-scale proteomics experiments, identifying modified peptides on an unprecedented scale and demonstrating the strategy's utility in diverse proteomics applications. The ability to rapidly and comprehensively identify modified peptides allows for the reduction of false positives in proteogenomics. It also has implications for discovery proteomics, allowing the detection of both common and rare (including novel) biological modifications that are often not considered in large-scale proteomics experiments. The ability to account for all chemically modified peptides may also improve protein abundance estimates in quantitative proteomics.
    PhD, Bioinformatics, University of Michigan, Horace H. Rackham School of Graduate Studies
    https://deepblue.lib.umich.edu/bitstream/2027.42/138581/1/andykong_1.pd
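
The fragment-ion-index idea can be sketched as follows, under the assumption of a simple bin-by-m/z design: theoretical fragments from all candidate peptides are binned once up front, so scoring an experimental spectrum becomes a series of bin lookups rather than per-peptide spectrum comparisons. Bin width, names, and the counting score are illustrative assumptions, not MSFragger's actual implementation.

```python
# An illustrative sketch of a fragment ion index: theoretical fragments
# from all candidate peptides are binned by m/z once, so scoring a
# spectrum becomes bin lookups instead of per-peptide comparisons.
from collections import defaultdict

BIN_WIDTH = 0.02  # Da; doubles as the match tolerance here (assumption)

def build_fragment_index(peptide_fragments):
    """peptide_fragments: {peptide_id: [fragment m/z, ...]}"""
    index = defaultdict(list)
    for pep_id, fragments in peptide_fragments.items():
        for mz in fragments:
            index[int(mz / BIN_WIDTH)].append(pep_id)
    return index

def score_spectrum(exp_peaks, index):
    """Tally matched fragments per candidate peptide via bin lookups."""
    counts = defaultdict(int)
    for mz in exp_peaks:
        b = int(mz / BIN_WIDTH)
        for neighbor in (b - 1, b, b + 1):  # cover matches near bin edges
            for pep_id in index.get(neighbor, ()):
                counts[pep_id] += 1
    return counts  # e.g. max(counts, key=counts.get) gives the top candidate
```

Because the index is built once and shared across all query spectra, the per-spectrum cost no longer grows with the number of candidate peptides' full spectra, which is what makes open searches over wide precursor-mass windows tractable.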

    Graphics Processing Units: Abstract Modelling and Applications in Bioinformatics

    The Graphical Processing Unit is a specialised piece of hardware containing many low-powered cores, available on both the consumer and industrial markets. The original Graphical Processing Units were designed for processing high-quality graphical images for presentation to the screen, and were therefore marketed to the computer-games market segment. More recently, frameworks such as CUDA and OpenCL have allowed the specialised, highly parallel architecture of the Graphical Processing Unit to be used not just for graphical operations but for general computation. This is known as General Purpose Programming on Graphical Processing Units, and it has attracted interest from the scientific community as a way to exploit a highly parallel environment that is cheaper and more accessible than traditional High Performance Computing platforms such as the supercomputer. This interest in developing algorithms that exploit the parallel architecture of the Graphical Processing Unit has highlighted the need for scientists to be able to analyse proposed algorithms, just as they do proposed sequential algorithms. In this thesis, we study the abstract modelling of computation on the Graphical Processing Unit and the application of Graphical Processing Unit-based algorithms in bioinformatics, the field that uses computational algorithms to solve biological problems. We show that existing abstract models for analysing parallel algorithms on the Graphical Processing Unit cannot model everything that is required with sufficient accuracy. We propose a new abstract model, called the Abstract Transferring Graphical Processing Unit Model, which provides more accurate analysis of Graphical Processing Unit-based algorithms than existing abstract models by capturing the data transfer between the Central Processing Unit and the Graphical Processing Unit. We demonstrate the accuracy and applicability of our model on several computational problems, verifying experimentally that it provides greater accuracy than the existing models. We also contribute novel Graphical Processing Unit-based solutions to two bioinformatics problems, DNA sequence alignment and protein spectral identification, demonstrating promising improvements over sequential Central Processing Unit implementations.
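
The model's central point, that host-device transfer must be charged alongside kernel work, can be illustrated with a toy cost estimator. The formula and default constants below are illustrative assumptions, not the Abstract Transferring Graphical Processing Unit Model itself.

```python
# A toy transfer-aware cost model in the spirit of the thesis's argument:
# runtime is charged for host-device data movement as well as for the
# parallel kernel work. Constants and the formula are assumptions.
def predicted_runtime(bytes_to_device: int,
                      bytes_to_host: int,
                      work_items: int,
                      ops_per_item: int,
                      cores: int,
                      secs_per_op: float = 1e-9,
                      bytes_per_sec: float = 12e9) -> float:
    """Estimated runtime (s) = PCIe transfer time + parallel compute time."""
    transfer = (bytes_to_device + bytes_to_host) / bytes_per_sec
    rounds = -(-work_items // cores)  # ceiling division: full parallel rounds
    compute = rounds * ops_per_item * secs_per_op
    return transfer + compute

# A kernel that does little work per byte moved is transfer-bound:
# print(predicted_runtime(1 << 30, 1 << 30, 1 << 20, 10, 2048))
```

A model that omits the transfer term would predict near-zero runtime for such transfer-bound algorithms, which is exactly the mismeasurement the thesis argues against.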

    Data-independent acquisition mass spectrometry for human gut microbiota metaproteome analysis

    The human digestive-tract microbiota is a diverse community of microorganisms with complex interactions between the microbes and the human host. Observing the functions carried out by microbes is essential for understanding the role of the gut microbiota in human health and its associations with disease. New methods and tools are needed to acquire functional information from complex microbial samples. Metagenomic approaches focus on taxonomy or gene-based functional potential, but lack power to discover the functions actually carried out by the microbes; metaproteomic methods are required to uncover these functions. Current high-throughput metaproteomics methods are based on mass spectrometry, which can identify and quantify ionized protein fragments, called peptides. Proteins can be inferred from the peptides, and the functions associated with protein expression can be determined using protein databases. The most widely used approach, data-dependent acquisition (DDA), records only the most intense ions in a semi-stochastic manner, which reduces reproducibility and produces incomplete records that impair quantification. The alternative, data-independent acquisition (DIA), systematically records all ions and has been proposed as a replacement for DDA. However, recording all ions produces highly convoluted spectra from multiple peptides, and it has therefore been unclear whether and how DIA can be applied to metaproteomics, where the number of different peptides is high. This thesis introduces the DIA method for metaproteomic data analysis. The method is shown to achieve high reproducibility, enabling a single analysis per sample where DDA requires several. An easy-to-use open-source software package, DIAtools, was developed for the analysis. Finally, the DIA analysis method was applied to study the human gut microbiota and the carbohydrate-active enzymes expressed in members of the gut microbiota.
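
To see why DIA complicates identification, consider the targeted matching step: each DIA spectrum multiplexes fragments from every peptide co-isolated in a window, so identification must query a spectral library against the window rather than assume one peptide per spectrum. The sketch below is a minimal illustration under that assumption; names and the fraction-found score are hypothetical, not DIAtools internals.

```python
# A minimal illustration of targeted DIA matching: every library peptide
# whose precursor falls in the isolation window is scored against the
# window's multiplexed fragment spectrum. Names and score are hypothetical.
from bisect import bisect_left

def score_window(window_lo, window_hi, dia_peaks, library, tol=0.02):
    """dia_peaks: sorted fragment m/z values from one DIA spectrum.
    library: [(peptide, precursor_mz, [fragment_mz, ...]), ...]
    Returns {peptide: fraction of its library fragments found}.
    """
    def present(mz):
        i = bisect_left(dia_peaks, mz - tol)
        return i < len(dia_peaks) and dia_peaks[i] <= mz + tol

    scores = {}
    for peptide, precursor, fragments in library:
        if window_lo <= precursor < window_hi and fragments:
            scores[peptide] = sum(present(f) for f in fragments) / len(fragments)
    return scores
```

In a metaproteome, the number of library peptides per window is far larger than in a single-organism proteome, which is the scaling challenge the thesis addresses.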

    In Silico Simulation of DUSP-YIV906 Protein-Ligand Interactions and DUSP3-ERK Protein-Peptide Interactions

    The dual-specificity phosphatases (DUSPs) are a heterogeneous group of protein enzymes that modulate several critical cellular signaling pathways by dephosphorylating phosphotyrosine and phosphoserine/phosphothreonine residues within their substrate proteins. One of the best-characterized sub-groups of DUSPs is the mitogen-activated protein kinase phosphatases (MKPs), which act as antagonists of associated signaling cascades, including the extracellular signal-regulated kinase (ERK) pathways. Accumulating evidence has highlighted the therapeutic value of DUSPs, as deletion or inhibition of some DUSPs can increase the phosphorylation level of ERKs and cause cancer cell death. In this study, multi-scale molecular modeling simulations were first performed to investigate the mechanism of action of YIV-906, an herbal formulation used in cancer treatment that targets DUSP-ERK1/2 pathways. In total, MD simulations and binding free energy calculations were performed for 99 DUSP-ligand complexes. Our results demonstrate that the sulfate and carboxyl moieties of the advantageous ligands, either original herbal chemicals or human metabolites of YIV-906, can occupy the enzymes' catalytic sites, mimicking the endogenous phosphate substrates of DUSPs. Second, to improve the accuracy of protein-peptide docking between DUSP3 and a peptide fragment of ERK1/2, a new receptor residue mapping (RR mapping) algorithm was developed to identify hotspot residues on the surface of DUSP3 and improve peptide docking scoring. By performing all-atom molecular dynamics (MD) simulations with the receptor soaked in a water box containing 0.5 moles of capped dipeptides of the 20 natural amino acids (AAs) plus 3 phosphorylated non-standard AAs, RR maps giving the probabilities of AAs interacting with DUSP3's surface residues were obtained. With these interaction probabilities incorporated, the ERK peptide binding models produced by protein-peptide docking can be re-ranked to give more accurate predictions. We have demonstrated that multi-scale molecular modeling techniques can elucidate molecular mechanisms involving complex molecular systems. Finally, our modeling study provides useful insights into the rational design of highly potent anti-cancer drugs targeting DUSPs, and the new RR mapping algorithm is a promising tool that can be universally applied to the characterization of protein-protein interactions (PPIs).
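
How MD-derived interaction probabilities might be folded into pose re-ranking can be sketched as follows. The map format, the log-probability bonus, and all names are illustrative assumptions rather than the published RR mapping algorithm.

```python
# An illustrative sketch of re-ranking docked peptide poses with an
# MD-derived interaction-probability map. Format and names are assumptions.
import math

def rerank_poses(poses, rr_map, weight=1.0, floor=1e-6):
    """poses: [(pose_id, dock_score, [(receptor_res, peptide_aa), ...])],
    with lower dock_score assumed better.
    rr_map: {(receptor_res, peptide_aa): interaction probability}.
    Returns poses sorted best-first by the combined score.
    """
    def combined(pose):
        _, dock_score, contacts = pose
        # Probable contacts give log-terms near 0; improbable ones are
        # strongly negative and thus penalize the pose.
        bonus = sum(math.log(rr_map.get(c, floor)) for c in contacts)
        return dock_score - weight * bonus
    return sorted(poses, key=combined)
```

The design choice here is the standard one for combining heterogeneous evidence: converting probabilities to log space makes the map's contribution additive with the docking score, with a single weight controlling the balance.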

    A Framework for the Design and Analysis of High-Performance Applications on FPGAs using Partial Reconfiguration

    The field-programmable gate array (FPGA) is a dynamically reconfigurable digital logic chip used to implement custom hardware. The high densities of modern FPGAs and the capability of on-the-fly reconfiguration have made the FPGA a viable alternative to fixed-logic hardware chips such as the ASIC. In high-performance computing, FPGAs are used as co-processors to speed up computationally intensive processes or as autonomous systems that realize a complete hardware application. However, due to the limited capacity of FPGA logic resources, denser FPGAs must be purchased if more logic resources are required to realize all the functions of a complex application. Alternatively, partial reconfiguration (PR) can be used to swap idle components of the application with active components on demand. This research uses PR to swap components to improve application performance within the limited logic resources of smaller but more economical FPGAs; the swap is called "resource-sharing PR". In a pipelined design of multiple hardware modules (pipeline stages), resource-sharing PR uses PR to relieve pipeline bottlenecks: other pipeline stages, typically those sitting idle waiting for data from a bottleneck, are reconfigured into an additional parallel bottleneck module. The target pipeline of this research is a two-stage "slow-to-fast" pipeline, in which data traversing the pipeline transitions from a relatively slow bottleneck stage to a fast stage. A two-stage pipeline combining FPGA-based hardware implementations of two well-known bioinformatics search algorithms, the X! Tandem algorithm and the Smith-Waterman algorithm, was implemented for this research and demonstrates these characteristics. The experimental results show that, in a database of unknown peptide spectra, when matching spectra with 388 peaks or more, performing resource-sharing PR to instantiate a parallel X! Tandem module is worth the cost of PR. In addition, from timings gathered during the experiments, a general formula was derived for determining the value of performing PR upon a fast module.
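
The break-even logic behind such a formula can be illustrated with a toy calculation: swapping an idle fast stage for a second bottleneck module pays off only if the time saved by doubling bottleneck throughput exceeds the one-off reconfiguration cost. Variable names and the expression below are illustrative, not the thesis's derived formula.

```python
# A toy version of the break-even rule such a formula captures: reconfigure
# an idle fast stage into a second bottleneck module only if halving the
# bottleneck's remaining work saves more time than reconfiguration costs.
def pr_is_worth_it(items_queued: int,
                   secs_per_item: float,
                   reconfig_secs: float) -> bool:
    """items_queued: work units waiting at the bottleneck stage.
    secs_per_item: bottleneck processing time per work unit.
    reconfig_secs: total partial-reconfiguration overhead (swap in and out).
    """
    baseline = items_queued * secs_per_item      # one slow module
    with_pr = reconfig_secs + baseline / 2.0     # two slow modules in parallel
    return with_pr < baseline
```

This also shows why the benefit appears only above a workload threshold (such as the 388-peak spectra reported): for small queues the fixed reconfiguration overhead dominates the halved processing time.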

    Interrogation of Dynamic Proteins to Expand the Druggable Proteome

    The human proteome is vastly complex, and our understanding of it is constantly evolving. There are roughly 20,000 protein-coding genes in the human genome, yet only about 10% of the resultant proteins are deemed "druggable" targets, and only half of those have disease relevance. Thus, the druggable proteome is surprisingly narrow, consisting largely of structured proteins with defined binding pockets. With so many disease signatures residing in the "undruggable" portion of the proteome, there is much work to be done to expand the druggable landscape. An area rich with disease relevance is dynamic protein-protein interactions (PPIs), which underpin many regulatory cellular functions in both healthy and diseased states. However, devoid of the typical binding pockets that enable traditional drug discovery approaches (i.e. substrate mimicry), dynamic PPIs occur over large, flat surface areas, which is why they have remained "undrugged." A disproportionate number of dynamic proteins can be found in transcriptional regulation, which therefore provides an interesting avenue for chemical probe development and therapeutic intervention. For instance, hallmarks of cancerous cells are rampant growth and proliferation, with many proteins being overexpressed. While many research efforts have focused on targeting the overexpressed proteins themselves, halting the overexpression at the transcriptional level could stop the disease progression at its initiation. This dissertation works towards expanding the druggable proteome by establishing principles of molecular recognition that guide native PPIs. Primarily using molecular dynamics simulations, with complementary biophysical experimentation, I dissect coactivators and establish rules of activator recognition and engagement. In doing so, I demonstrate the utility of disorder in transcriptional regulation. In particular, I identify ways in which allostery manifests in dynamic coactivator proteins. Further, I explore how inhibition or enhancement of particular PPIs can be achieved using small molecules that attenuate fluctuations and disrupt binding allosterically.
    PhD, Chemical Biology, University of Michigan, Horace H. Rackham School of Graduate Studies
    http://deepblue.lib.umich.edu/bitstream/2027.42/169881/1/apeiffer_1.pd
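
As one concrete example of how fluctuations like those discussed above are quantified from MD simulations, a per-residue root-mean-square fluctuation (RMSF) can be computed as sketched below. The input layout is an assumed convention; production analyses typically use packages such as MDAnalysis or cpptraj.

```python
# A minimal sketch of per-residue RMSF from an MD trajectory, one common
# way to measure conformational fluctuations. Input layout is an assumption.
import numpy as np

def rmsf(traj: np.ndarray) -> np.ndarray:
    """traj: (n_frames, n_residues, 3) coordinates, already aligned to a
    reference frame. Returns an (n_residues,) array of RMSF values."""
    mean_pos = traj.mean(axis=0)                    # (n_residues, 3)
    sq_dev = ((traj - mean_pos) ** 2).sum(axis=2)   # (n_frames, n_residues)
    return np.sqrt(sq_dev.mean(axis=0))
```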

    Machine learning approaches for computer aided drug discovery

    Pharmaceutical drug discovery is expensive, time-consuming, and scientifically challenging. To increase the efficiency of the pre-clinical drug discovery pipeline, computational drug discovery methods and, most recently, machine learning-based methods are increasingly used as powerful tools to aid early-stage drug discovery. In this thesis, I present three complementary computer-aided drug discovery methods, with a focus on aiding hit discovery and hit-to-lead optimization. The thesis particularly focuses on the molecular representations used to featurise machine learning models, exploring how best to capture valuable information about proteins, ligands, and 3D protein-ligand complexes in order to build more robust, more interpretable, and more accurate machine learning models. First, I developed ligand-based models using a Gaussian Process (GP) as an easy-to-implement tool to guide the exploration of chemical space for the optimization of protein-ligand binding affinity. I explored different topological fingerprint and autoencoder representations for Bayesian optimisation (BO) and showed that BO is a powerful tool to help medicinal chemists prioritise which new compounds to make, for single-target as well as multi-target optimisation. The algorithm achieved high enrichment of top compounds for both single-target and multi-objective optimisation when tested on a well-known benchmark dataset for the drug target matrix metalloproteinase-12 and on a real, ongoing drug optimisation dataset targeting four bacterial metallo-β-lactamases. Next, I present a knowledge-based approach to drug design that combines new protein-ligand interaction fingerprints with fragment-based drug discovery to understand SARS-CoV-2 Mpro substrate specificity and to design novel small-molecule inhibitors in silico; I show how this interaction-fingerprint-driven approach can reveal fruitful fragment-growth design strategies. Lastly, I expand on the knowledge-based contact fingerprints to create a ligand-shaped molecular graph representation (Protein Ligand Interaction Graphs, PLIGs) for developing novel graph-based deep learning protein-ligand binding affinity scoring functions. PLIGs encode all intermolecular interactions in a protein-ligand complex within the node features of the graph and are therefore simple and fully interpretable. I explore a variety of Graph Neural Network (GNN) architectures in combination with PLIGs and find that Graph Attention Networks perform slightly better than other GNN architectures, ranking amongst the best known protein-ligand binding affinity scoring functions.
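
The ligand-based loop described above can be sketched with a small exact Gaussian Process over binary fingerprints, ranking candidates by expected improvement. The Tanimoto kernel and all hyperparameters are common choices assumed here for illustration, not the thesis's exact setup.

```python
# An illustrative sketch of the ligand-based BO loop: an exact Gaussian
# Process with a Tanimoto kernel over binary fingerprints, with candidates
# ranked by expected improvement. Kernel and hyperparameters are assumptions.
import numpy as np
from scipy.stats import norm

def tanimoto_kernel(A: np.ndarray, B: np.ndarray) -> np.ndarray:
    """A: (n, d), B: (m, d) binary fingerprints -> (n, m) kernel matrix."""
    shared = A @ B.T
    on_a = A.sum(axis=1)[:, None]
    on_b = B.sum(axis=1)[None, :]
    return shared / (on_a + on_b - shared + 1e-9)

def gp_posterior(X, y, X_new, noise=1e-4):
    """Exact GP regression: posterior mean and std at X_new."""
    K = tanimoto_kernel(X, X) + noise * np.eye(len(X))
    K_s = tanimoto_kernel(X, X_new)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    mu = K_s.T @ alpha
    v = np.linalg.solve(L, K_s)
    var = np.clip(np.diag(tanimoto_kernel(X_new, X_new)) - (v * v).sum(axis=0),
                  1e-12, None)
    return mu, np.sqrt(var)

def expected_improvement(mu, sigma, best_y):
    """EI for maximising affinity (higher y is better)."""
    z = (mu - best_y) / sigma
    return (mu - best_y) * norm.cdf(z) + sigma * norm.pdf(z)

# One BO step: score untested fingerprints and pick the most promising.
# mu, sigma = gp_posterior(X_train, y_train, X_candidates)
# next_idx = np.argmax(expected_improvement(mu, sigma, y_train.max()))
```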