139 research outputs found

    Robustness and Interpretability of Neural Networks’ Predictions under Adversarial Attacks

    Deep Neural Networks (DNNs) are powerful predictive models, exceeding human capabilities in a variety of tasks. They learn complex and flexible decision systems from the available data and achieve exceptional performance in multiple machine learning fields, spanning from applications in artificial intelligence, such as image, speech and text recognition, to the more traditional sciences, including medicine, physics and biology. Despite these outstanding achievements, high performance and high predictive accuracy are not sufficient for real-world applications, especially in safety-critical settings, where the usage of DNNs is severely limited by their black-box nature. There is an increasing need to understand how predictions are performed, to provide uncertainty estimates, to guarantee robustness to malicious attacks and to prevent unwanted behaviours. State-of-the-art DNNs are vulnerable to small perturbations in the input data, known as adversarial attacks: maliciously crafted manipulations of the inputs that are perceptually indistinguishable from the original samples but are capable of fooling the model into incorrect predictions. In this work, we prove that such brittleness is related to the geometry of the data manifold and is therefore likely to be an intrinsic feature of DNNs' predictions. This negative result suggests a possible direction for overcoming the limitation: we study the geometry of adversarial attacks in the large-data, overparameterized limit for Bayesian Neural Networks and prove that, in this limit, they are immune to gradient-based adversarial attacks. Furthermore, we propose some training techniques to improve the adversarial robustness of deterministic architectures. In particular, we experimentally observe that ensembles of NNs trained on random projections of the original inputs into lower-dimensional spaces are more resilient to attacks. Next, we focus on the problem of interpretability of NNs' predictions in the setting of saliency-based explanations. We analyze the stability of the explanations under adversarial attacks on the inputs and we prove that, in the large-data and overparameterized limit, Bayesian interpretations are more stable than those provided by deterministic networks. We validate this behaviour in multiple experimental settings in the finite data regime. Finally, we introduce the concept of adversarial perturbations of amino acid sequences for protein Language Models (LMs). Deep Learning models for protein structure prediction, such as AlphaFold2, leverage Transformer architectures and their attention mechanism to capture structural and functional properties of amino acid sequences. Despite the high accuracy of predictions, biologically small perturbations of the input sequences, or even single point mutations, can lead to substantially different 3D structures. On the other hand, protein language models are insensitive to mutations that induce misfolding or dysfunction (e.g. missense mutations). Specifically, predictions of the 3D coordinates do not reveal the structure-disruptive effect of these mutations. Therefore, there is an evident inconsistency between the biological importance of mutations and the resulting change in structural prediction. Inspired by this problem, we introduce the concept of adversarial perturbation of protein sequences in continuous embedding spaces of protein language models. Our method relies on attention scores to detect the most vulnerable amino acid positions in the input sequences. Adversarial mutations are biologically different from their reference sequences and are able to significantly alter the resulting 3D structures.
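To make the notion of a gradient-based adversarial attack concrete, here is a minimal sketch in plain Python. It is not the method of the thesis: it assumes a toy logistic-regression classifier with made-up weights and uses the fast gradient sign method (FGSM), one standard gradient-based attack that the abstract does not name. The names `predict`, `fgsm` and `eps` are illustrative.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(w, b, x):
    """Probability of class 1 under a logistic-regression model (toy stand-in for a DNN)."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)

def fgsm(w, b, x, y, eps):
    """Fast gradient sign method: move each input coordinate by eps in the
    direction that increases the cross-entropy loss for the true label y."""
    p = predict(w, b, x)
    # For logistic regression, d(loss)/d(x_i) = (p - y) * w_i.
    grad = [(p - y) * wi for wi in w]
    sign = lambda g: (g > 0) - (g < 0)
    return [xi + eps * sign(g) for xi, g in zip(x, grad)]

# Hypothetical weights and input, chosen only for illustration.
w, b = [2.0, -3.0, 1.0], 0.0
x, y = [0.3, -0.1, 0.2], 1

x_adv = fgsm(w, b, x, y, eps=0.25)
p_clean = predict(w, b, x)      # above 0.5: classified as class 1
p_adv = predict(w, b, x_adv)    # below 0.5: the bounded perturbation flips the prediction
```

Every coordinate moves by at most `eps`, yet the predicted class changes, which is the kind of fragility the abstract describes.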

    Data mining for important amino acid residues in multiple sequence alignments and protein structures

    Enzymes are highly efficient biocatalysts of interest to industry and medicine. Therefore, a goal of utmost importance in biochemical research is to understand how an enzyme catalyzes a chemical reaction. Here, the computational identification of functionally or structurally important residue positions can be of tremendous help. The datasets that are most informative for the algorithms are the 3D structure of a protein and a multiple sequence alignment (MSA) composed of homologous sequences. For example, an MSA allows for the quantification of residue conservation. Residue conservation at a given position indicates that only one type of amino acid fulfills all constraints imposed by protein structure or function. Furthermore, a detailed analysis of less strictly conserved residue positions may identify pairs of positions whose occupancy is mutually dependent, which gives rise to correlated mutations. Both of these conservation signals are indicative of functionally or structurally important positions. In the first part of this thesis, methods of machine learning were used to identify and classify these residue positions. The aim was to predict, in a mutually exclusive manner, a role in catalysis, ligand binding or protein stability for each residue position of a protein. Unfortunately, for many proteins the 3D structure is unknown. For other proteins, the number of known homologs is not sufficient to compile a meaningful MSA. Therefore, three variants of a classifier were designed and implemented, named CLIPS-1D, CLIPS-3D, and CLIPS-4D. These multi-class support vector machines allow for a classification based on an MSA (CLIPS-1D), a 3D structure (CLIPS-3D), and a combination of both (CLIPS-4D). CLIPS-1D exploits seven sequence-based features, whereas CLIPS-3D utilizes seven structure-based features. CLIPS-4D combines the seven sequence-based features of CLIPS-1D with those two structure-based features that increased its classification performance. A comparison with existing methods and a detailed analysis of a well-studied enzyme confirmed state-of-the-art prediction quality for CLIPS-1D and CLIPS-4D. In the second part of this thesis, an algorithm for the identification of correlated mutations was improved. A common method for the identification of correlated mutations is to deduce the mutual information (MI) of a pair of residue positions from an MSA. The classical MI is based on Shannon's information theory, which utilizes probabilities only. Consequently, these approaches do not consider the similarity of residue pairs, which is a severe limitation. To improve on these algorithms, H2rs was developed for this thesis. In H2rs, the MI values are derived from the von Neumann entropy (vNE), which takes into account amino acid similarities modeled by means of a substitution matrix. To further improve the specificity of H2rs, the significance of the vNE-based MI values was assessed with a bootstrapping approach. The analysis of a large in silico testbed and the detailed assessment of five well-studied enzymes demonstrated state-of-the-art performance.
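The two MSA-derived signals mentioned in this abstract, residue conservation and classical mutual information, can be sketched in a few lines of plain Python. This is only an illustration of the Shannon-based quantities, not the CLIPS or H2rs implementation: the four-sequence alignment is invented, gaps and sequence weighting are ignored, and the function names are assumptions.

```python
import math
from collections import Counter

def column(msa, k):
    """Residues observed at alignment position k (0-based)."""
    return [seq[k] for seq in msa]

def entropy(symbols):
    """Shannon entropy in bits; 0 means a strictly conserved column."""
    n = len(symbols)
    return -sum((c / n) * math.log2(c / n) for c in Counter(symbols).values())

def mutual_information(msa, k, l):
    """Classical MI of two columns: H(k) + H(l) - H(k, l)."""
    a, b = column(msa, k), column(msa, l)
    return entropy(a) + entropy(b) - entropy(list(zip(a, b)))

# Toy alignment (hypothetical sequences, no gaps).
msa = ["ACDK", "ACEK", "AGDR", "AGER"]

h_conserved = entropy(column(msa, 0))         # column 0 is always 'A'
mi_coupled = mutual_information(msa, 1, 3)    # C pairs with K, G pairs with R
mi_independent = mutual_information(msa, 1, 2)
```

Because this MI uses frequencies only, replacing K by a chemically similar or dissimilar residue yields the same score, which is the limitation that the similarity-aware von Neumann entropy is meant to address.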

    Information Theory in Molecular Evolution: From Models to Structures and Dynamics

    This Special Issue collects novel contributions from scientists in the interdisciplinary field of biomolecular evolution. The works listed here have information-theoretic concepts at their core but are tightly integrated with the study of molecular processes. Applications include the analysis of phylogenetic signals to elucidate biomolecular structure and function, the study and quantification of structural dynamics and allostery, as well as models of molecular interaction specificity inspired by evolutionary cues.

    H2rs: Deducing evolutionary and functionally important residue positions by means of an entropy and similarity based analysis of multiple sequence alignments

    Background: The identification of functionally important residue positions is an important task of computational biology. Methods of correlation analysis allow for the identification of pairs of residue positions whose occupancy is mutually dependent due to constraints imposed by protein structure or function. A common measure assessing these dependencies is the mutual information, which is based on Shannon's information theory, which utilizes probabilities only. Consequently, such approaches do not consider the similarity of residue pairs, which may degrade the algorithm's performance. One typical algorithm is H2r, which characterizes each individual residue position k by the conn(k)-value, which is the number of significantly correlated pairs it belongs to. Results: To improve the specificity of H2r, we developed a revised algorithm, named H2rs, which is based on the von Neumann entropy (vNE). To compute the corresponding mutual information, a matrix A is required, which assesses the similarity of residue pairs. We determined A by deducing substitution frequencies from contacting residue pairs observed in the homologs of 35 809 proteins whose structure is known. In analogy to H2r, the enhanced algorithm computes a normalized conn(k)-value. Within the framework of H2rs, only statistically significant vNE values were considered. To decide on significance, the algorithm calculates a p-value by performing a randomization test for each individual pair of residue positions. The analysis of a large in silico testbed demonstrated that specificity and precision were higher for H2rs than for H2r and two other methods of correlation analysis. The gain in prediction quality is further confirmed by a detailed assessment of five well-studied enzymes. The outcome of H2rs and of a method that predicts contacting residue positions (PSICOV) overlapped only marginally. H2rs can be downloaded from http://www-bioinf.uni-regensburg.de. Conclusions: Considering substitution frequencies for residue pairs by means of the von Neumann entropy and a p-value improved the success rate in identifying important residue positions. The integration of proven statistical concepts and normalization allows for an easier comparison of results obtained with different proteins. Comparing the outcome of the local method H2rs and of the global method PSICOV indicates that such methods supplement each other and have different scopes of application.
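The per-pair randomization test can be illustrated with a short sketch. It rests on assumptions the abstract does not confirm: the score here is classical Shannon mutual information rather than the vNE-based quantity, the null model shuffles one column, and the p-value uses the common (hits + 1) / (permutations + 1) estimator. Column contents, `n_perm` and `seed` are hypothetical.

```python
import math
import random
from collections import Counter

def entropy(symbols):
    n = len(symbols)
    return -sum((c / n) * math.log2(c / n) for c in Counter(symbols).values())

def mutual_information(a, b):
    return entropy(a) + entropy(b) - entropy(list(zip(a, b)))

def randomization_p_value(a, b, n_perm=200, seed=0):
    """Fraction of column shuffles whose MI reaches the observed MI.
    Shuffling b destroys any pairing with a but keeps both residue compositions."""
    rng = random.Random(seed)
    observed = mutual_information(a, b)
    shuffled = list(b)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        # Small tolerance guards against floating-point noise in ties.
        if mutual_information(a, shuffled) >= observed - 1e-12:
            hits += 1
    return (1 + hits) / (1 + n_perm)

# Hypothetical alignment columns: one perfectly coupled pair, one independent pair.
coupled_a, coupled_b = list("CCCCCCGGGGGG"), list("KKKKKKRRRRRR")
free_a, free_b = list("CGCGCGCGCGCG"), list("KKRRKKRRKKRR")

p_coupled = randomization_p_value(coupled_a, coupled_b)
p_free = randomization_p_value(free_a, free_b)
```

In a conn(k)-style summary, only pairs whose p-value falls below a chosen threshold would be counted for each position k.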

    Mass & secondary structure propensity of amino acids explain their mutability and evolutionary replacements

    Why is an amino acid replacement in a protein accepted during evolution? The answer given by bioinformatics relies on the frequency of change of each amino acid by another one and the propensity of each to remain unchanged. We propose that these replacement rules are recoverable from the secondary structural trends of amino acids. A distance measure between high-resolution Ramachandran distributions reveals that structurally similar residues coincide with those found in substitution matrices such as BLOSUM: Asn ↔ Asp, Phe ↔ Tyr, Lys ↔ Arg, Gln ↔ Glu, Ile ↔ Val, Met → Leu; with Ala, Cys, His, Gly, Ser, Pro, and Thr as structurally idiosyncratic residues. We also found a high average correlation ($\overline{R} = 0.85$) between thirty amino acid mutability scales and the mutational inertia ($I_X$), which measures the energetic cost weighted by the number of observations at the most probable amino acid conformation. These results indicate that amino acid substitutions follow two optimally efficient principles: (a) amino acid interchangeability privileges secondary structural similarity, and (b) amino acid mutability depends directly on biosynthetic energy cost and inversely on frequency. These two principles are the underlying rules governing the observed amino acid substitutions. © 2017 The Author(s)

    The Era of Radiogenomics in Precision Medicine: An Emerging Approach to Support Diagnosis, Treatment Decisions, and Prognostication in Oncology

    With the rapid development of new technologies, including artificial intelligence and genome sequencing, radiogenomics has emerged as a state-of-the-art science in the field of individualized medicine. Radiogenomics combines a large volume of quantitative data extracted from medical images with individual genomic phenotypes and constructs a prediction model through deep learning to stratify patients, guide therapeutic strategies, and evaluate clinical outcomes. Recent studies of various types of tumors demonstrate the predictive value of radiogenomics. Some of the issues in radiogenomic analysis, and the solutions offered by prior work, are also presented. Although the workflow criteria and internationally agreed guidelines for statistical methods need to be confirmed, radiogenomics represents a repeatable and cost-effective approach for the detection of continuous changes and is a promising surrogate for invasive interventions. Therefore, radiogenomics could facilitate computer-aided diagnosis, treatment, and prediction of the prognosis in patients with tumors in the routine clinical setting. Here, we summarize the integrated process of radiogenomics and introduce the crucial strategies and statistical algorithms involved in current studies.

    Sequence- and structure-based approaches to deciphering enzyme evolution in the Haloalkanoate Dehalogenase superfamily

    Understanding how changes in functional requirements of the cell select for changes in protein sequence and structure is a fundamental challenge in molecular evolution. This dissertation delineates some of the underlying evolutionary forces using the Haloalkanoate Dehalogenase Superfamily (HADSF) as a model system. HADSF members have a unique cap-core architecture, with the Rossmann-fold core domain accessorized by variable cap domain insertions (delineated by length, topology, and point of insertion). To identify the boundaries of variable domain insertions in protein sequences, I have developed a comprehensive computational strategy (CapPredictor or CP) using a novel sequence alignment algorithm in conjunction with a structure-guided sequence profile. Analysis of more than 40,000 HADSF sequences led to the following observations: (i) cap-type classes exhibit similar distributions across different phyla, indicating the existence of all cap types in the last universal common ancestor, and (ii) comparative analysis of the predicted cap type and functional diversity indicated that cap type does not dictate the divergence of substrate recognition and chemical pathway, and hence biological function. By analyzing a unique dataset of core- and cap-domain-only protein structures, I investigated the consequences of the accessory cap domain on the sequence-structure relationship of the core domain. The relationship between sequence and structure divergence in the core fold was shown to be monotonic and independent of the corresponding cap type. However, core domains with the same cap type bore a greater similarity than core domains with different cap types, suggesting coevolution of the cap and core domains. Remarkably, only a few degrees of freedom are needed to describe the structural diversity in the Rossmann fold, accounting for the majority of the observed structural variance. Finally, I examined the location and role of conserved residue positions and co-evolving residue pairs in the core domain in the context of the cap domain. Positions critical for function were conserved, while non-conserved positions mapped to highly mobile regions. Notably, we found an exponential dependence of co-variance on inter-residue distance. Collectively, these novel algorithms and analyses contribute to an improved understanding of enzyme evolution, especially in the context of the use of domain insertions to expand substrate specificity and chemical mechanism.

    Genomic Analysis of Antibiotics Resistance in Pathogens

    The emergence of antibiotic-resistant pathogens currently represents a serious threat to public health and the economy. Due to antibiotic treatments in human and veterinary medicine, prophylactic use and environmental contamination, bacteria are today more frequently exposed to unnatural doses of antibiotics and their selective effect. Antibiotic resistance can be encoded on chromosomes, plasmids, or other mobile genetic elements in bacteria. It may also result from mutations that lead to changes in the affinity of antibiotics for their targets or in the ability of antibiotics to act on bacterial growth or death. Exposure of bacteria, bacterial populations, and microbial communities to antibiotics at different concentrations shapes their genomic dynamics, as does the mobilisation and spread of resistance determinants. It is, therefore, essential to understand the dynamics and mobilisation of genes encoding antibiotic resistance in human, animal, plant, and environmental microbiomes, through genomic and metagenomic approaches and bioinformatics analyses. This Special Issue gathers research publications on the horizontal transfer of antibiotic-resistance genes, their dissemination and epidemiology, their association with bacterial virulence, the relationship between bacterial genotypes and their phenotypes, and other related research topics.

    Adapting the EMPIRIC Approach to Investigate Evolutionary Constraints in Influenza A Virus Surface Proteins

    Controlling influenza A virus (IAV) infections remains a challenge largely due to the high replication and mutation rates of the virus. IAV is a negative-sense RNA virus with two main surface proteins: hemagglutinin (HA) and neuraminidase (NA). HA recognizes and binds sialic acid on host cell receptors to initiate virus entry. NA also recognizes sialic acid on host cell receptors but functions by cleaving sialic acid interactions to release progeny virus. Because both HA and NA interact with sialic acid on the host cell surface with opposing effects, their balance is essential for optimal viral infectivity. However, the evolutionary constraints that maintain HA and NA function, while conserving a functional balance, are not fully understood. I adapted the comprehensive and systematic mutational scanning technology, termed EMPIRIC (Exceedingly Meticulous and Parallel Investigation of Randomized Individual Codons), to investigate the local fitness landscape of regions of HA under standard conditions and under drug pressure. We observed that synonymous substitutions had a higher mean absolute fitness effect in the signal than in a neighboring HA region used as a control. Folding ∆G calculations revealed a hairpin loop that appeared to be differentially enriched between human and swine IAV variants in sequences of circulating strains. However, the molecular mechanism resulting in the observed host species-specific constraints remains undefined. Studying the fitness landscape of the receptor binding site of HA revealed the high sensitivity of this region to mutation. However, modulating the levels of NA activity by mutation and by using the NA inhibitor oseltamivir enabled the identification of HA mutations with adaptive potential under selection pressure by oseltamivir. These results highlight the importance of the HA-NA functional balance in virus replication and in the development of resistance to oseltamivir inhibitors. These studies provide an improved understanding of IAV biology and can inform the development of improved antiviral agents with reduced likelihood of resistance.