
    Robust learning to rank models and their biomedical applications

    There exist many real-world applications such as recommendation systems, document retrieval, and computational biology where the correct ordering of instances is of equal or greater importance than predicting the exact value of some discrete or continuous outcome. Learning-to-Rank (LTR) refers to a group of algorithms that apply machine learning techniques to tackle these ranking problems. Despite their empirical success, most existing LTR models are not built to be robust to errors in labeling or annotation, distributional data shift, or adversarial data perturbations. To fill this gap, we develop four LTR frameworks that are robust to various types of perturbations. First, Pairwise Elastic Net Regression Ranking (PENRR) is an elastic-net-based regression method for drug sensitivity prediction. PENRR infers robust predictors of drug responses from patient genomic information. The special design of this model (comparing each drug with other drugs in the same cell line and comparing that drug with itself in other cell lines) significantly enhances the accuracy of the drug prediction model under limited data. This approach is also able to solve the problem of fitting on the insensitive drugs that is commonly encountered in regression-based models. Second, Regression-based Ranking by Pairwise Cluster Comparisons (RRPCC) is a ridge-regression-based method for ranking clusters of similar protein complex conformations generated by an underlying docking program (i.e., ClusPro). Rather than using regression to predict scores, which would equally penalize deviations for both low-quality and high-quality clusters, we seek to predict the difference of scores for any pair of clusters corresponding to the same complex. RRPCC combines these pairwise assessments to form a ranked list of clusters, from higher to lower quality. We apply RRPCC to clusters produced by the automated docking server ClusPro and, depending on the training/validation strategy, we show improvement by 24%–100% in ranking acceptable or better quality clusters first, and by 15%–100% in ranking medium or better quality clusters first. Third, Distributionally Robust Multi-Output Regression Ranking (DRMRR) is a listwise LTR model that induces robustness into LTR problems using the Distributionally Robust Optimization (DRO) framework. In contrast to existing methods, the scoring function of DRMRR was designed as a multivariate mapping from a feature vector to a vector of deviation scores, which captures local context information and cross-document interactions. DRMRR employs ranking metrics (i.e., NDCG) in its output. In particular, we used the notion of position deviation to define a vector of relevance scores instead of a scalar one. We then adopted the DRO framework to minimize a worst-case expected multi-output loss function over a probabilistic ambiguity set that is defined by the Wasserstein metric. We also presented an equivalent convex reformulation of the DRO problem, which is shown to be tighter than the ones proposed by previous studies. Fourth, Inversion Transformer-based Neural Ranking (ITNR) is a Transformer-based model to predict drug responses using RNAseq gene expression profiles, drug descriptors, and drug fingerprints. It utilizes a Context-Aware-Transformer architecture as its scoring function that ensures the modeling of inter-item dependencies. We also introduced a new loss function using the concept of Inversion and approximate permutation matrices. The accuracy and robustness of these LTR models are verified through three medical applications, namely cluster ranking in protein-protein docking, medical document retrieval, and drug response prediction
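
    The pairwise idea described for RRPCC (regress on score differences between clusters of the same complex, then combine the pairwise predictions into a ranking) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the closed-form ridge solver, and the "sum of predicted margins" aggregation rule are all assumptions made for illustration, and the code uses only the Python standard library.

```python
from itertools import permutations


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


def fit_pairwise_ridge(groups, lam=1.0):
    """Fit w so that w . (x_i - x_j) approximates y_i - y_j.

    groups: list of (features, scores), one entry per complex (hypothetical
    layout). Pairs are formed only within the same complex.
    """
    d = len(groups[0][0][0])
    # Normal equations of ridge regression: (D^T D + lam * I) w = D^T t
    A = [[lam if i == j else 0.0 for j in range(d)] for i in range(d)]
    b = [0.0] * d
    for feats, scores in groups:
        for i, j in permutations(range(len(feats)), 2):
            diff = [a - c for a, c in zip(feats[i], feats[j])]
            target = scores[i] - scores[j]
            for p in range(d):
                b[p] += diff[p] * target
                for q in range(d):
                    A[p][q] += diff[p] * diff[q]
    return solve(A, b)


def rank_clusters(feats, w):
    """Rank clusters of one complex by their summed predicted margins."""
    n = len(feats)

    def margin(i, j):
        return sum(wk * (a - c) for wk, a, c in zip(w, feats[i], feats[j]))

    totals = [sum(margin(i, j) for j in range(n) if j != i) for i in range(n)]
    return sorted(range(n), key=lambda i: -totals[i])
```

    With a purely linear model, summing predicted margins orders clusters the same way as the linear score itself; the explicit pairwise aggregation is kept here only to mirror the description above.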

    Quality assessment of docked protein interfaces using 3D convolution

    2021 Spring. Includes bibliographical references. Proteins play a vital role in most biological processes, most of which occur through interactions between proteins. When proteins interact they form a complex, whose functionality is different from that of the individual proteins in the complex. Therefore, understanding protein interactions and their interfaces is an important problem. Experimental methods for this task are expensive and time consuming, which has led to the development of docking methods for predicting the structures of protein complexes. These methods produce a large number of potential solutions, and the energy functions used in these methods are not good enough to find solutions that are close to the native state of the complex. Deep learning and its ability to model complex problems have opened up the opportunity to model protein complexes and learn from scratch how to rank docking solutions. As a part of this work, we have developed a 3D convolutional network approach that uses raw atomic densities to address this problem. Our method achieves performance which is on par with state-of-the-art methods. We have evaluated our model on docked protein structures simulated from four docking tools, namely ZDOCK, HADDOCK, FRODOCK, and ClusPro, on targets from Docking Benchmark Data version 5 (DBD5)
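
    As a rough illustration of what an "atomic density" input to a 3D convolutional network can look like, the sketch below smears atom positions onto a voxel grid with Gaussians. The abstract does not specify the density definition, grid size, or channel layout, so the Gaussian kernel, the single channel, and every parameter name here are assumptions; practical pipelines typically use one channel per atom type.

```python
import math


def voxelize(atoms, grid_size=8, voxel=1.0, sigma=1.0):
    """Map atom coordinates to a cubic grid of Gaussian densities.

    atoms: list of (x, y, z) tuples, expressed in the grid's own frame so
    that voxel (i, j, k) is centred at ((i+0.5), (j+0.5), (k+0.5)) * voxel.
    Returns a nested list indexed as grid[i][j][k].
    """
    grid = [[[0.0] * grid_size for _ in range(grid_size)]
            for _ in range(grid_size)]
    for i in range(grid_size):
        for j in range(grid_size):
            for k in range(grid_size):
                cx = (i + 0.5) * voxel
                cy = (j + 0.5) * voxel
                cz = (k + 0.5) * voxel
                for x, y, z in atoms:
                    d2 = (x - cx) ** 2 + (y - cy) ** 2 + (z - cz) ** 2
                    # Each atom contributes a Gaussian centred on itself.
                    grid[i][j][k] += math.exp(-d2 / (2.0 * sigma ** 2))
    return grid
```

    A grid like this (one per atom type, stacked as channels) is the kind of tensor a 3D convolution layer would consume.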

    Inner-View of Nanomaterial Incited Protein Conformational Changes: Insights into Designable Interaction.

    Nanoparticle bioreactivity critically depends upon interaction between proteins and nanomaterials (NM). The formation of the "protein corona" (PC) is the effect of such nanoprotein interactions. PC has a wide usage in pharmaceuticals, drug delivery, medicine, and industrial biotechnology. Therefore, a detailed in-vitro, in-vivo, and in-silico understanding of nanoprotein interaction is fundamental and has a genuine contemporary appeal. NM surfaces can modify the protein conformation during interaction, or NMs themselves can lead to self-aggregations. Both phenomena can change the whole downstream bioreactivity of the concerned nanosystem. The main aim of this review is to understand the mechanistic view of NM-protein interaction, to recapitulate the underlying physical chemistry behind the formation of such complicated macromolecular assemblies, and to provide a critical overview of the different models describing NM-induced structural and functional modification of proteins. The review also attempts to point out the current limitations in understanding the field and highlights the future scope, including a plausible proposition of how artificial intelligence could help explore such systems for the prediction and directed design of the desired NM-protein interactions

    Enhancing protein interaction prediction using deep learning and protein language models

    Proteins are large macromolecules that play critical roles in many cellular activities in living organisms. These include catalyzing metabolic reactions, mediating signal transduction, DNA replication, responding to stimuli, and transporting molecules, to name a few. Proteins perform their functions by interacting with other proteins and molecules. As a result, determining the nature of such interactions is critically important in many areas of biology and medicine. The primary structure of a protein refers to its specific sequence of amino acids, while the tertiary structure refers to its unique 3D shape, and the quaternary structure refers to the interaction of multiple protein subunits to form a larger, more complex structure. While the number of experimentally determined tertiary and quaternary structures is limited, databases of protein sequences continue to grow at an unprecedented rate, providing a wealth of information for training and improving sequence-based models. Recent developments in sequence-based models using machine learning and deep learning have shown significant progress toward solving protein-related problems. Specifically, attention-based transformer models, a recent breakthrough in Natural Language Processing (NLP), have shown that large models trained on unlabeled data are able to learn powerful representations of protein sequences and can lead to significant improvements in understanding protein folding, function, and interactions, as well as in drug discovery and protein engineering. The research in this thesis has pursued two objectives using sequence-based modeling. The first is to use deep learning techniques based on NLP to address an important problem in cellular immune system studies, namely, predicting Major Histocompatibility Complex (MHC)-Peptide binding. 
The second is to improve the performance of the ClusPro docking server, a well-known protein-protein docking tool, in three ways: (i) integrating ClusPro with AlphaFold2, a well-known accurate protein structure predictor, for enhanced protein model docking, (ii) predicting distance maps to improve docking accuracy, and (iii) using regression techniques to rank protein clusters for better results

    Plausible blockers of Spike RBD in SARS-CoV2-molecular design and underlying interaction dynamics from high-level structural descriptors

    COVID-19 is characterized by an unprecedented abrupt increase in the viral transmission rate (SARS-CoV-2) relative to its pandemic evolutionary ancestor, SARS-CoV (2003). The complex molecular cascade of events related to the viral pathogenicity is triggered by the Spike protein upon interacting with the ACE2 receptor on human lung cells through its receptor binding domain (RBDSpike). One potential therapeutic strategy to combat COVID-19 could thus be limiting the infection by blocking this key interaction. In this current study, we adopt a protein design approach to predict and propose non-virulent structural mimics of the RBDSpike which can potentially serve as its competitive inhibitors in binding to ACE2. The RBDSpike is an independently foldable protein domain, resilient to conformational changes upon mutations and therefore an attractive target for strategic re-design. Interestingly, in spite of displaying an optimal shape fit between their interacting surfaces (attributed to a consequently high mutual affinity), the RBDSpike-ACE2 interaction appears to have a quasi-stable character due to a poor electrostatic match at their interface. Structural analyses of homologous protein complexes reveal that the ACE2 binding site of RBDSpike has an unusually high degree of solvent-exposed hydrophobic residues, attributed to key evolutionary changes, making it inherently "reaction-prone." The designed mimics, aimed at blocking viral entry by occupying the available binding sites on ACE2, are tested to have signatures of stable high-affinity binding with ACE2 (cross-validated by appropriate free energy estimates), overriding the native quasi-stable feature. The results show the aptness of directly adapting natural examples in rational protein design, wherein homology-based threading coupled with strategic "hydrophobic ↔ polar" mutations serves as a potential breakthrough

    Phosphorylcholine and KR12-Containing Corneal Implants in HSV-1-Infected Rabbit Corneas

    Severe HSV-1 infection can cause blindness due to tissue damage from severe inflammation. Due to the high risk of graft failure in HSV-1-infected individuals, cornea transplantation to restore vision is often contraindicated. We tested the capacity for cell-free biosynthetic implants made from recombinant human collagen type III and 2-methacryloyloxyethyl phosphorylcholine (RHCIII-MPC) to suppress inflammation and promote tissue regeneration in the damaged corneas. To block viral reactivation, we incorporated silica dioxide nanoparticles releasing KR12, the small bioactive core fragment of LL37, an innate cationic host defense peptide produced by corneal cells. KR12 is more reactive and smaller than LL37, so more KR12 molecules can be incorporated into nanoparticles for delivery. Unlike LL37, which was cytotoxic, KR12 was cell-friendly and showed little cytotoxicity at doses that blocked HSV-1 activity in vitro, instead enabling rapid wound closure in cultures of human epithelial cells. Composite implants released KR12 for up to 3 weeks in vitro. The implant was also tested in vivo on HSV-1-infected rabbit corneas where it was grafted by anterior lamellar keratoplasty. Adding KR12 to RHCIII-MPC did not reduce HSV-1 viral loads or the inflammation resulting in neovascularization. Nevertheless, the composite implants reduced viral spread sufficiently to allow stable corneal epithelium, stroma, and nerve regeneration over a 6-month observation period

    Mass & secondary structure propensity of amino acids explain their mutability and evolutionary replacements

    Why is an amino acid replacement in a protein accepted during evolution? The answer given by bioinformatics relies on the frequency of change of each amino acid by another one and the propensity of each to remain unchanged. We propose that these replacement rules are recoverable from the secondary structural trends of amino acids. A distance measure between high-resolution Ramachandran distributions reveals that structurally similar residues coincide with those found in substitution matrices such as BLOSUM: Asn ↔ Asp, Phe ↔ Tyr, Lys ↔ Arg, Gln ↔ Glu, Ile ↔ Val, Met → Leu; with Ala, Cys, His, Gly, Ser, Pro, and Thr as structurally idiosyncratic residues. We also found a high average correlation ($\overline{R} = 0.85$) between thirty amino acid mutability scales and the mutational inertia ($I_X$), which measures the energetic cost weighted by the number of observations at the most probable amino acid conformation. These results indicate that amino acid substitutions follow two optimally-efficient principles: (a) amino acid interchangeability privileges secondary structural similarity, and (b) amino acid mutability depends directly on biosynthetic energy cost, and inversely on frequency. These two principles are the underlying rules governing the observed amino acid substitutions. © 2017 The Author(s)

    Protein contour modelling and computation for complementarity detection and docking

    The aim of this thesis is the development and application of a model that effectively and efficiently integrates the evaluation of geometric and electrostatic complementarity for the protein-protein docking problem. Proteins perform their biological roles by interacting with other biomolecules and forming macromolecular complexes. The structural characterization of protein complexes is important to understand the underlying biological processes. Unfortunately, there are several limitations to the available experimental techniques, leaving the vast majority of these complexes to be determined by means of computational methods such as protein-protein docking. The ultimate goal of the protein-protein docking problem is the in silico prediction of the three-dimensional structure of complexes of two or more interacting proteins, as occurring in living organisms, which can later be verified in vitro or in vivo. These interactions are highly specific and take place due to the simultaneous formation of multiple weak bonds: the geometric complementarity of the contours of the interacting molecules is a fundamental requirement in order to enable and maintain these interactions. However, shape complementarity alone cannot guarantee highly accurate docking predictions, as there are several physicochemical factors, such as Coulomb potentials, van der Waals forces and hydrophobicity, affecting the formation of protein complexes. In order to set up correct and efficient methods for the protein-protein docking, it is necessary to provide a unique representation which integrates geometric and physicochemical criteria in the complementarity evaluation. To this end, a novel local surface descriptor, capable of capturing both the shape and electrostatic distribution properties of macromolecular surfaces, has been designed and implemented. 
The proposed methodology effectively integrates the evaluation of geometrical and electrostatic distribution complementarity of molecular surfaces, while maintaining efficiency in the descriptor comparison phase. The descriptor is based on the 3D Zernike invariants which possess several attractive features, such as a compact representation, rotational and translational invariance and have been shown to adequately capture global and local protein surface shape similarity and naturally represent physicochemical properties on the molecular surface. Locally, the geometric similarity between two portions of protein surface implies a certain degree of complementarity, but the same cannot be stated about electrostatic distributions. Complementarity in electrostatic distributions is more complex to handle, as charges must be matched with opposite ones even if they do not have the same magnitude. The proposed method overcomes this limitation as follows. From a unique electrostatic distribution function, two separate distribution functions are obtained, one for the positive and one for the negative charges, and both functions are normalised in [0, 1]. Descriptors are computed separately for the positive and negative charge distributions, and complementarity evaluation is then done by cross-comparing descriptors of distributions of charges of opposite signs. The proposed descriptor uses a discrete voxel-based representation of the Connolly surface on which the corresponding electrostatic potentials have been mapped. Voxelised surface representations have received a lot of interest in several bioinformatics and computational biology applications as a simple and effective way of jointly representing geometric and physicochemical properties of proteins and other biomolecules by mapping auxiliary information in each voxel. 
Moreover, the voxel grid can be defined at different resolutions, thus giving the means to effectively control the degree of detail in the discrete representation along with the possibility of producing multiple representations of the same molecule at different resolutions. A specific algorithm has been designed for the efficient computation of voxelised macromolecular surfaces at arbitrary resolutions, starting from experimentally-derived structural data (X-ray crystallography, NMR spectroscopy or cryo-electron microscopy). Fast surface generation is achieved by adapting an approximate Euclidean Distance Transform algorithm in the Connolly surface computation step and by exploiting the geometrical relationship between the latter and the Solvent Accessible surface. This algorithm is at the base of VoxSurf (Voxelised Surface calculation program), a tool which can produce discrete representations of macromolecules at very high resolutions starting from the three-dimensional information of their corresponding PDB files. By employing compact data structures and implementing a spatial slicing protocol, the proposed tool can calculate the three main molecular surfaces at high resolutions with limited memory demands. To reduce the surface computation time without affecting the accuracy of the representation, two parallel algorithms for the computation of voxelised macromolecular surfaces, based on a spatial slicing procedure, have been introduced. The molecule is sliced into a user-defined number of parts and the portions of the overall surface can be calculated for each slice in parallel. The molecule is sliced with planes perpendicular to the abscissa axis of the Cartesian coordinate system defined in the molecule's PDB entry. The first algorithm uses an overlapping margin of one probe-sphere radius length among slices in order to guarantee the correctness of the Euclidean Distance Transform. 
Because of this margin, the Connolly surface can be computed nearly independently for each slice. Communications among processes are necessary only during the pocket identification procedure which ensures that pockets spanning through more than one slice are correctly identified and discriminated from solvent-excluded cavities inside the molecule. In the second parallel algorithm the size of the overlapping margin between slices has been reduced to a one-voxel length by adapting a multi-step region-growing Euclidean Distance Transform algorithm. At each step, distance values are first calculated independently for every slice, then, a small portion of the borders' information is exchanged between adjacent slices. The proposed methodologies will serve as a basis for a full-fledged protein-protein docking protocol based on local feature matching. Rigorous benchmark tests have shown that the combined geometric and electrostatic descriptor can effectively identify shape and electrostatic distribution complementarity in the binding sites of protein-protein complexes, by efficiently comparing circular surface patches and significantly decreasing the number of false positives obtained when using a purely-geometric descriptor. In the validation experiments, the contours of the two interacting proteins are divided in circular patches: all possible patch pairs from the two proteins are then evaluated in terms of complementarity and a general ranking is produced. Results show that native patch pairs obtain higher ranks when using the newly proposed descriptor, with respect to the ranks obtained when using the purely-geometric one
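
    The charge-handling step described in this abstract (split one electrostatic distribution into a positive and a negative part, normalise each to [0, 1], then cross-compare descriptors of opposite sign) can be sketched as follows. The thesis uses 3D Zernike invariants as descriptors; the histogram descriptor below is a deliberately simple stand-in, and all function names and the Euclidean distance are assumptions made for illustration.

```python
import math


def split_and_normalise(potential):
    """Split surface potentials into positive and negative parts in [0, 1]."""
    pos = [max(v, 0.0) for v in potential]
    neg = [max(-v, 0.0) for v in potential]

    def norm(vals):
        m = max(vals)
        return [v / m for v in vals] if m > 0 else vals

    return norm(pos), norm(neg)


def histogram_descriptor(values, bins=4):
    """Stand-in descriptor (NOT 3D Zernike): normalised histogram on [0, 1]."""
    counts = [0] * bins
    for v in values:
        counts[min(int(v * bins), bins - 1)] += 1
    return [c / len(values) for c in counts]


def complementarity_distance(pot_a, pot_b):
    """Lower is more complementary: compare +A with -B and -A with +B."""
    pos_a, neg_a = split_and_normalise(pot_a)
    pos_b, neg_b = split_and_normalise(pot_b)
    d1 = math.dist(histogram_descriptor(pos_a), histogram_descriptor(neg_b))
    d2 = math.dist(histogram_descriptor(neg_a), histogram_descriptor(pos_b))
    return d1 + d2
```

    Because each part is normalised separately, a patch and its sign-flipped, rescaled counterpart score as perfectly complementary, which reflects the point that opposite charges should match even when their magnitudes differ.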