382 research outputs found

    Computational Labeling, Partitioning, and Balancing of Molecular Networks

    Get PDF
    Recent advances in high throughput techniques enable large-scale molecular quantification with high accuracy, including mRNAs, proteins and metabolites. Differential expression of these molecules in case and control samples provides a way to select phenotype-associated molecules with statistically significant changes. However, given the significance ranking list of molecular changes, how those molecules work together to drive phenotype formation is still unclear. In particular, the changes in molecular quantities are insufficient to interpret the changes in their functional behavior. My study is aimed at answering this question by integrating molecular network data to systematically model and estimate the changes of molecular functional behaviors. We build three computational models to label, partition, and balance molecular networks using modern machine learning techniques. (1) Due to the incompleteness of protein functional annotation, we develop AptRank, an adaptive PageRank model for protein function prediction on bilayer networks. By integrating Gene Ontology (GO) hierarchy with protein-protein interaction network, our AptRank outperforms four state-of-the-art methods in a comprehensive evaluation using benchmark datasets. (2) We next extend our AptRank into a network partitioning method, BioSweeper, to identify functional network modules in which molecules share similar functions and also densely connect to each other. Compared to traditional network partitioning methods using only network connections, BioSweeper, which integrates the GO hierarchy, can automatically identify functionally enriched network modules. (3) Finally, we conduct a differential interaction analysis, namely difFBA, on protein-protein interaction networks by simulating protein fluxes using flux balance analysis (FBA). We test difFBA using quantitative proteomic data from colon cancer, and demonstrate that difFBA offers more insights into functional changes in molecular behavior than does protein quantity changes alone. We conclude that our integrative network model increases the observational dimensions of complex biological systems, and enables us to more deeply understand the causal relationships between genotypes and phenotypes

    Doctor of Philosophy

    Get PDF
    dissertationWith the tremendous growth of data produced in the recent years, it is impossible to identify patterns or test hypotheses without reducing data size. Data mining is an area of science that extracts useful information from the data by discovering patterns and structures present in the data. In this dissertation, we will largely focus on clustering which is often the first step in any exploratory data mining task, where items that are similar to each other are grouped together, making downstream data analysis robust. Different clustering techniques have different strengths, and the resulting groupings provide different perspectives on the data. Due to the unsupervised nature i.e., the lack of domain experts who can label the data, validation of results is very difficult. While there are measures that compute "goodness" scores for clustering solutions as a whole, there are few methods that validate the assignment of individual data items to their clusters. To address these challenges we focus on developing a framework that can generate, compare, combine, and evaluate different solutions to make more robust and significant statements about the data. In the first part of this dissertation, we present fast and efficient techniques to generate and combine different clustering solutions. We build on some recent ideas on efficient representations of clusters of partitions to develop a well founded metric that is spatially aware to compare clusterings. With the ability to compare clusterings, we describe a heuristic to combine different solutions to produce a single high quality clustering. We also introduce a Markov chain Monte Carlo approach to sample different clusterings from the entire landscape to provide the users with a variety of choices. In the second part of this dissertation, we build certificates for individual data items and study their influence on effective data reduction. We present a geometric approach by defining regions of influence for data items and clusters and use this to develop adaptive sampling techniques to speedup machine learning algorithms. This dissertation is therefore a systematic approach to study the landscape of clusterings in an attempt to provide a better understanding of the data

    NOVEL ALGORITHMS AND TOOLS FOR LIGAND-BASED DRUG DESIGN

    Get PDF
    Computer-aided drug design (CADD) has become an indispensible component in modern drug discovery projects. The prediction of physicochemical properties and pharmacological properties of candidate compounds effectively increases the probability for drug candidates to pass latter phases of clinic trials. Ligand-based virtual screening exhibits advantages over structure-based drug design, in terms of its wide applicability and high computational efficiency. The established chemical repositories and reported bioassays form a gigantic knowledgebase to derive quantitative structure-activity relationship (QSAR) and structure-property relationship (QSPR). In addition, the rapid advance of machine learning techniques suggests new solutions for data-mining huge compound databases. In this thesis, a novel ligand classification algorithm, Ligand Classifier of Adaptively Boosting Ensemble Decision Stumps (LiCABEDS), was reported for the prediction of diverse categorical pharmacological properties. LiCABEDS was successfully applied to model 5-HT1A ligand functionality, ligand selectivity of cannabinoid receptor subtypes, and blood-brain-barrier (BBB) passage. LiCABEDS was implemented and integrated with graphical user interface, data import/export, automated model training/ prediction, and project management. Besides, a non-linear ligand classifier was proposed, using a novel Topomer kernel function in support vector machine. With the emphasis on green high-performance computing, graphics processing units are alternative platforms for computationally expensive tasks. A novel GPU algorithm was designed and implemented in order to accelerate the calculation of chemical similarities with dense-format molecular fingerprints. Finally, a compound acquisition algorithm was reported to construct structurally diverse screening library in order to enhance hit rates in high-throughput screening

    Design of selective peptide inhibitors of anti-apoptotic Bfl-1 using experimental screening, structure-based design, and data-driven modeling

    Get PDF
    Thesis: Ph. D., Massachusetts Institute of Technology, Department of Biology, 2018.Cataloged from PDF version of thesis.Includes bibliographical references.Protein-protein interactions are central to all biological processes. Designer reagents that selectively bind to proteins and inhibit their interactions can be used to probe protein interaction networks, discover druggable targets, and generate potential therapeutic leads. Current technology makes it possible to engineer proteins and peptides with desirable interaction profiles using carefully selected sets of experiments that are customized for each design objective. There is great interest in improving the protein design pipeline to create protein binders more efficiently and against a wider array of targets. In this thesis, I describe the design and development of selective peptide inhibitors of anti-apoptotic BcI-2 family proteins, with an emphasis on targeting Bfl-1. Anti-apoptotic Bcl-2 family proteins bind to short, pro-apoptotic BH3 motifs to support cellular survival. Overexpression of BfI-1 has been shown to promote cancer cell survival and the development of chemoresistance. Prior work suggests that selective inhibition of Bfl-1 can induce cell death in Bfl-1 overexpressing cancer cells without compromising healthy cells that also rely on anti-apoptotic BcI-2 proteins for survival. Thus, Bfl-1-selective BH3 mimetic peptides are potentially valuable for diagnosing Bfl-1 dependence and can serve as leads for therapeutic development. In this thesis, I describe three distinct approaches to designing potent and selective Bfl-1 inhibitors. First, I describe the design and screening of libraries of variants of BH3 peptides. I show that peptides from this screen bind in a previously unobserved BH3 binding mode and have large margins of specificity for Bfl-1 when tested in vitro and in cultured cells. Second, I describe a computational model of the specificity landscape of three anti-apoptotic Bcl-2 proteins including Bfl-1. This model was derived from high-throughput affinity measurement of thousands of peptides from BH3 libraries. I show that this model is useful for designing peptides with desirable interaction profiles within a family of related proteins. Third, I describe the use of a scoring potential built on the amino acid frequencies from well-defined structural motifs complied from the Protein Data Bank to design novel BH3 peptides targeting Bfl-1.by Justin Michael Jenson.Ph. D

    Application and Development of Computational Methods for Ligand-Based Virtual Screening

    Get PDF
    The detection of novel active compounds that are able to modulate the biological function of a target is the primary goal of drug discovery. Different screening methods are available to identify hit compounds having the desired bioactivity in a large collection of molecules. As a computational method, virtual screening (VS) is used to search compound libraries in silico and identify those compounds that are likely to exhibit a specific activity. Ligand-based virtual screening (LBVS) is a subdiscipline that uses the information of one or more known active compounds in order to identify new hit compounds. Different LBVS methods exist, e.g. similarity searching and support vector machines (SVMs). In order to enable the application of these computational approaches, compounds have to be described numerically. Fingerprints derived from the two-dimensional compound structure, called 2D fingerprints, are among the most popular molecular descriptors available. This thesis covers the usage of 2D fingerprints in the context of LBVS. The first part focuses on a detailed analysis of 2D fingerprints. Their performance range against a wide range of pharmaceutical targets is globally estimated through fingerprint-based similarity searching. Additionally, mechanisms by which fingerprints are capable of detecting structurally diverse active compounds are identified. For this purpose, two different feature selection methods are applied to find those fingerprint features that are most relevant for the active compounds and distinguish them from other compounds. Then, 2D fingerprints are used in SVM calculations. The SVM methodology provides several opportunities to include additional information about the compounds in order to direct LBVS search calculations. In a first step, a variant of the SVM approach is applied to the multi-class prediction problem involving compounds that are active against several related targets. SVM linear combination is used to recover compounds with desired activity profiles and deprioritize compounds with other activities. Then, the SVM methodology is adopted for potency-directed VS. Compound potency is incorporated into the SVM approach through potencyoriented SVM linear combination and kernel function design to direct search calculations to the preferential detection of potent hit compounds. Next, SVM calculations are applied to address an intrinsic limitation of similarity-based methods, i.e., the presence of similar compounds having large differences in their potency. An especially designed SVM approach is introduced to predict compound pairs forming such activity cliffs. Finally, the impact of different training sets on the recall performance of SVM-based VS is analyzed and caveats are identified

    Machine learning for large and small data biomedical discovery

    Get PDF
    In modern biomedicine, the role of computation becomes more crucial in light of the ever-increasing growth of biological data, which requires effective computational methods to integrate them in a meaningful way and unveil previously undiscovered biological insights. In this dissertation, we introduce a series of machine learning algorithms for biomedical discovery. Focused on protein functions in the context of system biology, these machine learning algorithms learn representations of protein sequences, structures, and networks in both the small- and large-data scenarios. First, we present a deep learning model that learns evolutionary contexts integrated representations of protein sequence and assists to discover protein variants with enhanced functions in protein engineering. Second, we describe a geometric deep learning model that learns representations of protein and compound structures to inform the prediction of protein-compound binding affinity. Third, we introduce a machine learning algorithm to integrate heterogeneous networks by learning compact network representations and to achieve drug repurposing by predicting novel drug-target interaction. We also present new scientific discoveries enabled by these machine learning algorithms. Taken together, this dissertation demonstrates the potential of machine learning to address the small- and large-data challenges of biomedical data and transform data into actionable insights and new discoveries

    Coarse-grained modeling for molecular discovery:Applications to cardiolipin-selectivity

    Get PDF
    The development of novel materials is pivotal for addressing global challenges such as achieving sustainability, technological progress, and advancements in medical technology. Traditionally, developing or designing new molecules was a resource-intensive endeavor, often reliant on serendipity. Given the vast space of chemically feasible drug-like molecules, estimated between 106 - 10100 compounds, traditional in vitro techniques fall short.Consequently, in silico tools such as virtual screening and molecular modeling have gained increasing recognition. However, the computational cost and the limited precision of the utilized molecular models still limit computational molecular design.This thesis aimed to enhance the molecular design process by integrating multiscale modeling and free energy calculations. Employing a coarse-grained model allowed us to efficiently traverse a significant portion of chemical space and reduce the sampling time required by molecular dynamics simulations. The physics-informed nature of the applied Martini force field and its level of retained structural detail make the model a suitable starting point for the focused learning of molecular properties.We applied our proposed approach to a cardiolipin bilayer, posing a relevant and challenging problem and facilitating reasonable comparison to experimental measurements.We identified promising molecules with defined properties within the resolution limit of a coarse-grained representation. Furthermore, we were able to bridge the gap from in silico predictions to in vitro and in vivo experiments, supporting the validity of the theoretical concept. The findings underscore the potential of multiscale modeling and free-energy calculations in enhancing molecular discovery and design and offer a promising direction for future research

    Coarse-grained modeling for molecular discovery:Applications to cardiolipin-selectivity

    Get PDF
    The development of novel materials is pivotal for addressing global challenges such as achieving sustainability, technological progress, and advancements in medical technology. Traditionally, developing or designing new molecules was a resource-intensive endeavor, often reliant on serendipity. Given the vast space of chemically feasible drug-like molecules, estimated between 106 - 10100 compounds, traditional in vitro techniques fall short.Consequently, in silico tools such as virtual screening and molecular modeling have gained increasing recognition. However, the computational cost and the limited precision of the utilized molecular models still limit computational molecular design.This thesis aimed to enhance the molecular design process by integrating multiscale modeling and free energy calculations. Employing a coarse-grained model allowed us to efficiently traverse a significant portion of chemical space and reduce the sampling time required by molecular dynamics simulations. The physics-informed nature of the applied Martini force field and its level of retained structural detail make the model a suitable starting point for the focused learning of molecular properties.We applied our proposed approach to a cardiolipin bilayer, posing a relevant and challenging problem and facilitating reasonable comparison to experimental measurements.We identified promising molecules with defined properties within the resolution limit of a coarse-grained representation. Furthermore, we were able to bridge the gap from in silico predictions to in vitro and in vivo experiments, supporting the validity of the theoretical concept. The findings underscore the potential of multiscale modeling and free-energy calculations in enhancing molecular discovery and design and offer a promising direction for future research
    • …
    corecore