54 research outputs found

    Estimating the Rate-Distortion Function by Wasserstein Gradient Descent

    In the theory of lossy compression, the rate-distortion (R-D) function R(D) describes how much a data source can be compressed (in bit-rate) at any given level of fidelity (distortion). Obtaining R(D) for a given data source establishes the fundamental performance limit for all compression algorithms. We propose a new method to estimate R(D) from the perspective of optimal transport. Unlike the classic Blahut–Arimoto algorithm, which fixes the support of the reproduction distribution in advance, our Wasserstein gradient descent algorithm learns the support of the optimal reproduction distribution by moving particles. We prove its local convergence and analyze the sample complexity of our R-D estimator based on a connection to entropic optimal transport. Experimentally, we obtain comparable or tighter bounds than state-of-the-art neural network methods on low-rate sources while requiring considerably less tuning and computation effort. We also highlight a connection to maximum-likelihood deconvolution and introduce a new class of sources that can be used as test cases with known solutions to the R-D problem.
    Comment: Accepted as conference paper at NeurIPS 202
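
    For contrast with the particle-based method, the classic Blahut–Arimoto iteration with a fixed reproduction support can be sketched in a few lines of NumPy. The binary-source Hamming example at the end is purely illustrative:

```python
import numpy as np

def blahut_arimoto(p_x, dist, s, n_iter=200):
    """Classic Blahut-Arimoto iteration for a discrete source.

    p_x  : source distribution over the alphabet (shape [n]).
    dist : distortion matrix d(x, y) (shape [n, m]); the reproduction
           support (the columns) is fixed in advance, as noted above.
    s    : Lagrange multiplier trading rate against distortion.
    Returns one (rate, distortion) point on the R-D curve, in nats.
    """
    n, m = dist.shape
    q_y = np.full(m, 1.0 / m)                  # reproduction marginal
    for _ in range(n_iter):
        # Conditional Q(y|x) proportional to q(y) * exp(-s * d(x, y))
        log_Q = np.log(q_y)[None, :] - s * dist
        log_Q -= log_Q.max(axis=1, keepdims=True)
        Q = np.exp(log_Q)
        Q /= Q.sum(axis=1, keepdims=True)
        q_y = p_x @ Q                          # update the marginal
    rate = np.sum(p_x[:, None] * Q *
                  (np.log(Q + 1e-30) - np.log(q_y + 1e-30)[None, :]))
    distortion = np.sum(p_x[:, None] * Q * dist)
    return rate, distortion

# Uniform binary source with Hamming distortion
p_x = np.array([0.5, 0.5])
d = 1.0 - np.eye(2)
rate, distortion = blahut_arimoto(p_x, d, s=2.0)
```

    For this symmetric case the iteration recovers the known parametric solution, which is what makes sources with closed-form R(D) useful as test cases.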

    Machine Learning in Compiler Optimization

    In the last decade, machine learning based compilation has moved from an obscure research niche to a mainstream activity. In this article, we describe the relationship between machine learning and compiler optimisation and introduce the main concepts of features, models, training, and deployment. We then provide a comprehensive survey and a road map for the wide variety of different research areas. We conclude with a discussion of open issues in the area and potential research directions. This paper provides both an accessible introduction to the fast-moving area of machine learning based compilation and a detailed bibliography of its main achievements.
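
    As an illustration of the features/models/training/deployment pipeline the article surveys, here is a deliberately tiny sketch using a nearest-neighbour model to pick a loop unroll factor. All features, labels, and numbers are invented, not taken from any real compiler:

```python
import numpy as np

# Hypothetical training set gathered offline: program features
# (loop trip count, loop body size in IR instructions), each paired with
# the unroll factor that benchmarked fastest for that loop.
features = np.array([[1000, 4], [800, 6], [16, 40],
                     [8, 64], [500, 8], [4, 100]], float)
best_unroll = np.array([8, 8, 2, 1, 8, 1])

def predict_unroll(trip_count, body_size):
    """Deployment step of a 1-nearest-neighbour model: map an unseen
    loop's features to the unroll factor of the most similar training loop."""
    x = np.array([trip_count, body_size], float)
    scale = features.max(axis=0)   # normalise so no feature dominates
    d = np.linalg.norm(features / scale - x / scale, axis=1)
    return int(best_unroll[np.argmin(d)])
```

    A production system would use richer features and a stronger model, but the shape of the pipeline is the same: extract features, train offline, predict at compile time.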

    Methods towards precision bioinformatics in single cell era

    Single-cell technology offers unprecedented insight into the molecular landscape of individual cells and is transforming precision medicine. Key to the effective use of single-cell data for disease understanding is the analysis of such information through bioinformatics methods. In this thesis, we examine and address several challenges in single-cell bioinformatics methods for precision medicine. While most current single-cell analytical tools employ statistical and machine learning methods, deep learning has achieved tremendous success in computer science; combined with ensemble learning, it can further improve model performance. Through a review article (Cao et al., 2020), we share recent key developments in this area and their contribution to bioinformatics research. Bioinformatics tools often use simulated data to assess proposed methodologies, but evaluation of the quality of single-cell RNA-sequencing (scRNA-seq) data simulation tools is lacking. We develop a comprehensive framework, SimBench (Cao et al., 2021), that examines a range of aspects, from data properties to the ability to maintain biological signals, scalability, and applicability. While understanding individual patients is key to precision medicine, there is little consensus on the best ways to compress complex single-cell data into summary statistics that represent each individual. We present scFeatures (Cao et al., 2022b), an approach that creates interpretable molecular representations of individuals. Finally, in a case study using multiple COVID-19 scRNA-seq datasets, we utilise scFeatures to generate molecular characterisations of individuals and illustrate the impact of ensemble learning and deep learning on improving disease outcome prediction. Overall, this thesis addresses several gaps in precision bioinformatics in the single-cell field by highlighting research advances, developing methodologies, and illustrating practical uses through experimental datasets and case studies.
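
    The idea of compressing cell-level data into per-individual summary statistics, as scFeatures does, can be illustrated with a toy example. The patient IDs and cell-type labels below are hypothetical, and this is not the scFeatures API, only the underlying idea:

```python
import pandas as pd

# Toy cell-level table: one row per cell, with invented patient IDs
# and cell-type labels from an upstream annotation step.
cells = pd.DataFrame({
    "patient":   ["P1"] * 4 + ["P2"] * 4,
    "cell_type": ["T", "T", "B", "NK", "T", "B", "B", "B"],
})

# One row of summary features per individual: cell-type proportions,
# in the spirit of the molecular representations scFeatures computes.
profile = pd.crosstab(cells["patient"], cells["cell_type"], normalize="index")
```

    Each row of `profile` is now a fixed-length, interpretable feature vector for one individual, ready for downstream disease-outcome modelling.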

    Graph Priors, Optimal Transport, and Deep Learning in Biomedical Discovery

    Recent advances in biomedical data collection allow massive datasets to be assembled, measuring thousands of features in thousands to millions of individual cells. These data have the potential to advance our understanding of biological mechanisms at a previously impossible resolution. However, there are few methods for understanding data of this scale and type. While neural networks have made tremendous progress on supervised learning problems, much work remains to make them useful for discovery in data whose supervision is more difficult to represent. The flexibility and expressiveness of neural networks is sometimes a hindrance in these less supervised domains, as is the case when extracting knowledge from biomedical data. One type of prior knowledge that is more common in biological data comes in the form of geometric constraints. In this thesis, we aim to leverage this geometric knowledge to create scalable and interpretable models for understanding such data. Encoding geometric priors into neural network and graph models allows us to characterize the models' solutions as they relate to the fields of graph signal processing and optimal transport, and these links in turn let us understand and interpret this data type. We divide this work into three sections. The first borrows concepts from graph signal processing to construct more interpretable and performant neural networks by constraining and structuring the architecture. The second borrows from the theory of optimal transport to perform anomaly detection and trajectory inference efficiently and with theoretical guarantees. The third examines how to compare distributions over an underlying manifold, which can be used to understand how different perturbations or conditions relate; for this, we design an efficient approximation of optimal transport based on diffusion over a joint cell graph. Together, these works utilize our prior understanding of the data geometry to create more useful models of the data. We apply these methods to molecular graphs, images, single-cell sequencing, and health record data.
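
    The joint-graph diffusion idea can be caricatured as follows: diffuse two node distributions with a lazy random walk over a shared graph and sum their differences across several scales, so that mass separated by long graph distances stays different for longer. This is an illustrative simplification on an invented toy graph, not the thesis's actual algorithm:

```python
import numpy as np

def diffusion_distance(mu, nu, A, scales=(1, 2, 4, 8)):
    """Compare two distributions over the nodes of a shared graph by
    diffusing both and accumulating L1 differences across scales.

    mu, nu : distributions over nodes (each sums to 1).
    A      : symmetric adjacency matrix of the graph.
    """
    # Lazy random walk (half the mass stays put) avoids parity artifacts
    # on bipartite graphs such as paths.
    P = 0.5 * (np.eye(len(A)) + A / A.sum(axis=1, keepdims=True))
    d, m, n, steps = 0.0, mu.copy(), nu.copy(), 0
    for t in sorted(scales):
        while steps < t:          # diffuse up to scale t
            m, n = m @ P, n @ P
            steps += 1
        d += np.abs(m - n).sum()  # multiscale L1 term
    return d

# Two point masses on a 4-node path graph: the farther apart they sit,
# the larger the multiscale difference.
A = np.array([[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]], float)
e = np.eye(4)
near = diffusion_distance(e[0], e[1], A)
far = diffusion_distance(e[0], e[3], A)
```

    The appeal of this family of methods is cost: each scale is a sparse matrix-vector product, avoiding the expensive coupling computation of exact optimal transport.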

    CompF2: Theoretical Calculations and Simulation Topical Group Report

    This report summarizes the work of the Computational Frontier topical group on theoretical calculations and simulation for Snowmass 2021. We discuss the challenges, potential solutions, and needs facing six diverse but related topical areas that span the subject of theoretical calculations and simulation in high energy physics (HEP): cosmic calculations, particle accelerator modeling, detector simulation, event generators, perturbative calculations, and lattice QCD (quantum chromodynamics). The challenges arise from the next generations of HEP experiments, which will include more complex instruments, provide larger data volumes, and perform more precise measurements; calculations and simulations will need to keep up with these increased requirements. The other aspect of the challenge is the evolution of the computing landscape away from general-purpose computing on CPUs and toward special-purpose accelerators and coprocessors such as GPUs and FPGAs. These newer devices can provide substantial improvements for certain categories of algorithms, at the expense of more specialized programming and memory and data access patterns.
    Comment: Report of the Computational Frontier Topical Group on Theoretical Calculations and Simulation for Snowmass 202

    A new generation of user-friendly and machine learning-accelerated methods for protein pKa calculations

    The ability to sense and react to external and internal pH changes is a survival requirement for any cell. pH homeostasis is tightly regulated, and even minor disruptions can severely impact cell metabolism, function, and survival. The pH dependence of proteins can be attributed to only 7 out of the 20 canonical amino acids, the titratable amino acids that can exchange protons with water in the usual 0-14 pH range. These amino acids make up approximately 31% of all amino acids in the human proteome, meaning that, on average, roughly one-third of each protein is sensitive not only to the medium pH but also to alterations in the electrostatics of its surroundings. Unsurprisingly, protonation switches have been associated with a wide array of protein behaviors, including modulating the binding affinity in protein-protein, protein-ligand, or protein-lipid systems, modifying enzymatic activity and function, and even altering their stability and subcellular location. Despite its importance, our molecular understanding of pH-dependent effects in proteins and other biomolecules is still very limited, particularly in large macromolecular complexes such as protein-protein or membrane protein systems. Over the years, several classes of methods have been developed to provide molecular insights into the protonation preference and dependence of biomolecules. Empirical methods offer cheap and competitive predictions for time- or resource-constrained situations. Albeit more computationally expensive, continuum electrostatics (CE)-based methods are a cost-effective solution for estimating microscopic equilibrium constants, pKhalf, and macroscopic pKa values. To study pH-dependent conformational transitions, constant-pH molecular dynamics (CpHMD) is the appropriate methodology. Unfortunately, given the computational cost and, in many cases, the difficulty associated with using CE-based methods and CpHMD, most researchers overuse empirical methods or neglect the effect of pH in their studies.
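
    The titration behaviour of a single site described above follows the textbook Henderson-Hasselbalch relation, which also makes the pKhalf notion concrete: a site is half-protonated when the pH equals its pKa.

```python
def protonated_fraction(pH, pKa):
    """Henderson-Hasselbalch relation: fraction of a titratable site's
    population that carries the proton at a given pH."""
    return 1.0 / (1.0 + 10.0 ** (pH - pKa))

protonated_fraction(4.0, 4.0)  # -> 0.5: half-protonated at pH = pKa
```

    Real proteins deviate from this single-site curve precisely because of the site-site electrostatic coupling the thesis models, which is why per-residue pKa values must be computed rather than read from a table.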
    Here, we address these issues by proposing multiple pKa predictor methods and tools with different levels of theory, designed to be faster and accessible to more users. First, we introduced PypKa, a flexible tool to predict Poisson-Boltzmann/Monte Carlo-based (PB/MC) pKa values of titratable sites in proteins. It was validated with a large set of experimental values, exhibiting competitive performance. PypKa supports CPU parallel computing and can be used directly on proteins obtained from the Protein Data Bank (PDB) repository or molecular dynamics (MD) simulations. A simple, reusable, and extensible Python API is provided, allowing pKa calculations to be easily added to existing protocols with a few extra lines of code. This capability was exploited in the development of PypKa-MD, an easy-to-use implementation of the stochastic titration CpHMD method. PypKa-MD supports GROMOS and CHARMM force fields, as well as modern versions of GROMACS. Using PypKa's API and the consequent abstraction of PB/MC contributed to its greatly simplified modular architecture, which will serve as the foundation for future developments. The new implementation was validated on alanine-based tetrapeptides with closely interacting titratable residues and four commonly used benchmark proteins, displaying highly similar and correlated pKa predictions compared to a previously validated implementation. As in most structure-based computational studies, the majority of pKa calculations are performed on experimental structures deposited in the PDB. Furthermore, there is an ever-growing imbalance between scarce experimental pKa values and the increasingly higher number of resolved structures. To save countless hours and resources that would be spent on repeated calculations, we have released pKPDB, a database of over 12M theoretical pKa values obtained by running PypKa over 120k protein structures from the PDB.
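
    The PB/MC scheme described above couples per-site energetics with Monte Carlo sampling over protonation microstates. A minimal sketch of that sampling step follows, with invented intrinsic pKa values and coupling energies; this is a conceptual illustration, not PypKa's actual API or energy model:

```python
import numpy as np

rng = np.random.default_rng(0)
LN10 = np.log(10.0)

# Illustrative two-site model with made-up numbers: intrinsic pKa values
# and a pairwise electrostatic penalty (in kT) when both sites are protonated.
pk_int = np.array([4.0, 9.0])
coupling = np.array([[0.0, 1.0], [1.0, 0.0]])

def energy(state, pH):
    """Free energy (kT) of a protonation microstate: a pH-dependent term
    per protonated site plus the site-site interaction, PB/MC style."""
    return np.sum(state * LN10 * (pH - pk_int)) + 0.5 * state @ coupling @ state

def mc_mean_protonation(pH, n_steps=20000):
    """Metropolis sampling over microscopic protonation states; returns
    the average occupancy of each site at the given pH."""
    state = rng.integers(0, 2, size=2)
    total = np.zeros(2)
    for _ in range(n_steps):
        i = rng.integers(2)
        trial = state.copy()
        trial[i] ^= 1                       # flip one site's protonation
        dE = energy(trial, pH) - energy(state, pH)
        if dE <= 0 or rng.random() < np.exp(-dE):
            state = trial
        total += state
    return total / n_steps
```

    Scanning pH and recording the mean occupancy of each site traces a titration curve, from which the microscopic pKhalf values can be read off.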
    The precomputed pKa estimations can be retrieved instantaneously via our web application, the PypKa Server. In case the protein of interest is not in the pKPDB, the user may easily run PypKa in the cloud, either by uploading a custom structure or by submitting an identifier code from the PDB or UniProtKB. It is also possible to use the server to get structures with representative pH-dependent protonation states to be used in other computational methods such as molecular dynamics. The advent of artificial intelligence in biological sciences presented an opportunity to drastically accelerate pKa predictors using our previously generated database of pKa values. With pKAI, we introduced the first deep learning-based predictor of pKa shifts in proteins trained on continuum electrostatics data. By combining a reasonable understanding of the underlying physics, an accuracy comparable to that of physics-based methods, and inference time speedups of more than 1000x, pKAI provided a game-changing solution for fast estimations of macroscopic pKa from ensembles of microscopic values. However, several limitations needed to be addressed before its integration within the CpHMD framework as a replacement for PypKa. Hence, we proposed a new graph neural network for protein pKa predictions suitable for CpHMD, pKAI-MD. This model estimates pH-independent energies to be used in a Monte Carlo routine to sample representative microscopic protonation states. While developing the new model, we explored different graph representations of proteins using multiple electrostatics-driven properties. While there are certainly many new features to be introduced and a multitude of developments to expand upon, the selection of methods and tools presented in this work represents a significant improvement over the alternatives and effectively constitutes a new generation of user-friendly and machine learning-accelerated methods for pKa calculations.
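
    The graph representations explored for pKAI-MD can be illustrated with the simplest variant: a distance-cutoff residue graph. The coordinates below are toy values; a real pipeline would derive them from a PDB structure, and the cutoff is an assumed illustrative choice:

```python
import numpy as np

# Toy C-alpha coordinates in angstroms; a real pipeline would read these
# from a PDB structure rather than hard-coding them.
coords = np.array([[0.0, 0.0, 0.0],
                   [3.8, 0.0, 0.0],
                   [7.6, 0.0, 0.0],
                   [30.0, 0.0, 0.0]])

def residue_graph(coords, cutoff=8.0):
    """Adjacency matrix connecting residues whose C-alpha atoms lie within
    `cutoff` angstroms -- one simple distance-based graph a protein GNN
    could operate on."""
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)
    return (dist < cutoff) & ~np.eye(len(coords), dtype=bool)

adj = residue_graph(coords)  # residues 0-2 form a clique; residue 3 is isolated
```

    A GNN then attaches per-node features (for instance, electrostatics-driven properties as in the work above) and passes messages along these edges.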