88 research outputs found

    Optimization and machine learning methods for Computational Protein Docking

    Full text link
    Computational Protein Docking (CPD) is defined as determining the stable complex of docked proteins given information about two individual partners, called receptor and ligand. The problem is often formulated as an energy/score minimization where the decision variables are the 6 rigid body transformation variables for the ligand in addition to more variables corresponding to flexibilities in the protein structures. The scoring functions used in CPD are highly nonlinear and nonconvex with a very large number of local minima, making the optimization problem particularly challenging. Consequently, most docking procedures employ a multistage strategy of (i) Global Sampling using a coarse scoring function to identify promising areas followed by (ii) a Refinement stage using more accurate scoring functions and possibly allowing more degrees of freedom. In the first part of this work, the problem of local optimization in the refinement stage is addressed. The goal of local optimization is to remove steric clashes between protein partners and obtain more realistic score values. The problem is formulated as optimization on the space of rigid motions of the ligand. Employing a recently introduced representation of the space of rigid motions as a manifold, a new Riemannian metric is introduced that is closely related to the Root Mean Square Deviation (RMSD) distance measure widely used in Protein Docking. It is argued that the new metric puts rotational and translational variables on equal footing as far local changes of RMSD is concerned. The implications and modifications for gradient-based local optimization algorithms are discussed. In the second part, a new methodology for resampling and refinement of ligand conformations is introduced. The algorithm is a refinement method where the inputs to the algorithm are ensembles of ligand conformations and the goal is to generate new ensembles of refined conformations, closer to the native complex. The algorithm builds upon a previous work and introduces multiple new innovations: Clustering the input conformations, performing dimensionality reduction using Principle Component Analysis (PCA), underestimating the scoring function and resampling and refinement of new conformations. The performance of the algorithm on a comprehensive benchmark of protein complexes is reported. The third part of this work focuses on using machine learning framework for addressing two specific problems in Protein Docking: (i) Constructing a machine learning model in order to predict whether a given receptor and ligand pair interact. This is of significant importance for constructing the so-called protein interaction networks, an critical step in the Drug Discovery process. The success of the algorithm is verified on a benchmark for discrimination between Biological and Crystallographic Dimers. (ii) A ranking scheme for output predictions of a protein docking server is devised. The machine learning model employs the features of the docking server predictions to produce a ranked list with the top ranked predictions having higher probability of being close to the native solution. Two state-of-the-art approaches to the ranking problem are presented and compared in detail and the implications of using the superior approach for a structural docking server is discussed

    PhagePro: prophage finding tool

    Get PDF
    Dissertação de mestrado em BioinformáticaBacteriophages are viruses that infect bacteria and use them to reproduce. Their reproductive cycle can be lytic or lysogenic. The lytic cycle leads to the bacteria death, given that the bacteriophage hijacks hosts machinery to produce phage parts necessary to assemble a new complete bacteriophage, until cell wall lyse occurs. On the other hand, the lysogenic reproductive cycle comprises the bacteriophage genetic material in the bacterial genome, becoming a prophage. Sometimes, due to external stimuli, these prophages can be induced to perform a lytic cycle. Moreover, the lysogenic cycle can lead to significant modifications in bacteria, for example, antibiotic resistance. To that end, PhagePro was created. This tool finds and characterises prophages inserted in the bacterial genome. Using 42 features, three datasets were created and five machine learning algorithms were tested. All models were evaluated in two phases, during testing and with real bacterial cases. During testing, all three datasets reached the 98 % F1 score mark in their best result. In the second phase, the results of the models were used to predict real bacterial cases and the results compared to the results of two tools, Prophage Hunter and PHASTER. The best model found 110 zones out of 154 and the model with the best result in dataset 3 had 94 in common. As a final test, Agrobacterium fabrum strC68 was extensively analysed. The results show that PhagePro was capable of detecting more regions with proteins associated with phages than the other two tools. In the ligth of the results obtained, PhagePro has shown great potential in the discovery and characterisation of bacterial alterations caused by prophages.Bacteriófagos são vírus que infetam bactérias usando-as para garantir a manutenção do seu genoma. Este processo pode ser realizado por ciclo lítico ou lipogénico. O ciclo lítico consiste em usar a célula para seu proveito, criar bacteriófagos e lisar a célula. Por outro lado, no ciclo lipogénico o bacteriófago insere o seu código genético no genoma da bactéria, o que pode levar à transferência de genes de interesse, tornando-se importante uma monitorização dos profagos. Assim foi desenvolvido o PhagePro, uma ferramenta capaz de encontrar e caracterizar bacteriófagos em genomas bactérias. Foram criadas features para distinguir profagos de bactérias, criando três datasets e usando algoritmos de aprendizagem de máquina. Os modelos foram avaliados durante duas fases, a fase de teste e a fase de casos reais. Na primeira fase de testes, o melhor modelo do dataset 1 teve 98% de F1 score, dataset 2 teve 98% e do dataset 3 também teve 98%. Todos os modelos, para teste em casos reais, foram comparados com previsões de duas ferramentas Prophage Hunter e PHASTER. O modelo com os melhores resultados obteve 110 de 154 zonas em comum com as duas ferramentas e o modelo do dataset 3 teve 94 zonas. Por fim, foi feita a análise dos resultados da bactéria Agrobacterium fabrum strC68. Os resultados obtidos mostram resultados diferentes, mas válidos, as ferramentas comparadas, visto que o PhagePro consegue detectar zonas com proteínas associadas a fagos que as outras tools não conseguem. Em virtude dos resultados obtidos, PhagePro mostrou que é capaz de encontrar e caracterizar profagos em bactérias.Este estudo contou com o apoio da Fundação para a Ciência e Tecnologia (FCT) portuguesa no âmbito do financiamento estratégico da unidade UIDB/04469/2020. A obra também foi parcialmente financiada pelo Projeto PTDC/SAU-PUB/29182/2017 [POCI-01-0145-FEDER-029182]

    Bioinformatics Applications Based On Machine Learning

    Get PDF
    The great advances in information technology (IT) have implications for many sectors, such as bioinformatics, and has considerably increased their possibilities. This book presents a collection of 11 original research papers, all of them related to the application of IT-related techniques within the bioinformatics sector: from new applications created from the adaptation and application of existing techniques to the creation of new methodologies to solve existing problems

    Image Processing and Simulation Toolboxes of Microscopy Images of Bacterial Cells

    Get PDF
    Recent advances in microscopy imaging technology have allowed the characterization of the dynamics of cellular processes at the single-cell and single-molecule level. Particularly in bacterial cell studies, and using the E. coli as a case study, these techniques have been used to detect and track internal cell structures such as the Nucleoid and the Cell Wall and fluorescently tagged molecular aggregates such as FtsZ proteins, Min system proteins, inclusion bodies and all the different types of RNA molecules. These studies have been performed with using multi-modal, multi-process, time-lapse microscopy, producing both morphological and functional images. To facilitate the finding of relationships between cellular processes, from small-scale, such as gene expression, to large-scale, such as cell division, an image processing toolbox was implemented with several automatic and/or manual features such as, cell segmentation and tracking, intra-modal and intra-modal image registration, as well as the detection, counting and characterization of several cellular components. Two segmentation algorithms of cellular component were implemented, the first one based on the Gaussian Distribution and the second based on Thresholding and morphological structuring functions. These algorithms were used to perform the segmentation of Nucleoids and to identify the different stages of FtsZ Ring formation (allied with the use of machine learning algorithms), which allowed to understand how the temperature influences the physical properties of the Nucleoid and correlated those properties with the exclusion of protein aggregates from the center of the cell. Another study used the segmentation algorithms to study how the temperature affects the formation of the FtsZ Ring. The validation of the developed image processing methods and techniques has been based on benchmark databases manually produced and curated by experts. When dealing with thousands of cells and hundreds of images, these manually generated datasets can become the biggest cost in a research project. To expedite these studies in terms of time and lower the cost of the manual labour, an image simulation was implemented to generate realistic artificial images. The proposed image simulation toolbox can generate biologically inspired objects that mimic the spatial and temporal organization of bacterial cells and their processes, such as cell growth and division and cell motility, and cell morphology (shape, size and cluster organization). The image simulation toolbox was shown to be useful in the validation of three cell tracking algorithms: Simple Nearest-Neighbour, Nearest-Neighbour with Morphology and DBSCAN cluster identification algorithm. It was shown that the Simple Nearest-Neighbour still performed with great reliability when simulating objects with small velocities, while the other algorithms performed better for higher velocities and when there were larger clusters present

    A guide to machine learning for biologists

    Get PDF
    The expanding scale and inherent complexity of biological data have encouraged a growing use of machine learning in biology to build informative and predictive models of the underlying biological processes. All machine learning techniques fit models to data; however, the specific methods are quite varied and can at first glance seem bewildering. In this Review, we aim to provide readers with a gentle introduction to a few key machine learning techniques, including the most recently developed and widely used techniques involving deep neural networks. We describe how different techniques may be suited to specific types of biological data, and also discuss some best practices and points to consider when one is embarking on experiments involving machine learning. Some emerging directions in machine learning methodology are also discussed

    iLearnPlus: a comprehensive and automated machine-learning platform for nucleic acid and protein sequence analysis, prediction and visualization

    Get PDF
    Sequence-based analysis and prediction are fundamental bioinformatic tasks that facilitate understanding of the sequence(-structure)-function paradigm for DNAs, RNAs and proteins. Rapid accumulation of sequences requires equally pervasive development of new predictive models, which depends on the availability of effective tools that support these efforts. We introduce iLearnPlus, the first machine-learning platform with graphical- and web-based interfaces for the construction of machine-learning pipelines for analysis and predictions using nucleic acid and protein sequences. iLearnPlus provides a comprehensive set of algorithms and automates sequence-based feature extraction and analysis, construction and deployment of models, assessment of predictive performance, statistical analysis, and data visualization; all without programming. iLearnPlus includes a wide range of feature sets which encode information from the input sequences and over twenty machine-learning algorithms that cover several deep-learning approaches, outnumbering the current solutions by a wide margin. Our solution caters to experienced bioinformaticians, given the broad range of options, and biologists with no programming background, given the point-and-click interface and easy-to-follow design process. We showcase iLearnPlus with two case studies concerning prediction of long noncoding RNAs (lncRNAs) from RNA transcripts and prediction of crotonylation sites in protein chains. iLearnPlus is an open-source platform available at https://github.com/Superzchen/iLearnPlus/ with the webserver at http://ilearnplus.erc.monash.edu/.Zhen Chen, Pei Zhao, Chen Li, Fuyi Li, Dongxu Xiang, Yong-Zi Chen, Tatsuya Akutsu, Roger J. Daly, Geoffrey I. Webb, Quanzhi Zhao, Lukasz Kurgan, and Jiangning Son

    Tracing Evolution of Gene Transfer Agents Using Comparative Genomics

    Get PDF
    The accumulating evidence suggest that viruses and their components can be domesticated by their hosts, equipping them with convenient molecular toolkits for various functions. One of such domesticated system is Gene Transfer Agents (GTAs) that are produced by some bacteria and archaea. GTAs morphologically resemble small phage-like particles and contain random fragments of their host genome. They are produced only by a small fraction of the microbial population and are released through a lysis of the host cell. Bioinformatic analyses suggest that GTAs are especially abundant in the taxonomic class of Alphaproteobacteria, where they are vertically inherited and evolve as a part of their host genomes. In this work, we extensively analyze evolutionary patterns of alphaproteobacterial GTAs using comparative genomics, phylogenomics and machine learning methods. We initially develop an algorithm that validate the wide presence of GTA elements in alphaproteobacterial genomes, where they are generally mistaken for prophages due to their homology. Furthermore, we demonstrate that GTAs evolve under the selection that reduces the energetic cost of their production, indicating their importance for the conditions of the nutrient depletion. The genome-wide screenings of translational selection and coevolution signatures highlight the significance of GTAs as a stress-response adaptation for the horizontal gene transfer, revealing a set of previously unknown genes that could play a role in the GTA cycle. As production of GTAs leads to the host death, their maintenance is likely to be under a kin or group level selection. By combining our findings with accumulated body of knowledge, this work proposes a conceptual model illustrating the role of GTAs in bacterial populations and their persistence for hundreds of millions of years of evolution

    Modeling the effect of blending multiple components on gasoline properties

    Get PDF
    Global CO2 emissions reached a new historical maximum in 2018 and transportation sector contributed to one fourth of those emissions. Road transport industry has started moving towards more sustainable solutions, however, market penetration for electric vehicles (EV) is still too slow while regulation for biofuels has become stricter due to the risk of inflated food prices and skepticism regarding their sustainability. In spite of this, Europe has ambitious targets for the next 30 years and impending strict policies resulting from these goals will definitely increase the pressure on the oil sector to move towards cleaner practices and products. Although the use of biodiesel is quite extended and bioethanol is already used as a gasoline component, there are no alternative drop-in fuels compatible with spark ignition engines in the market yet. Alternative feedstock is widely available but its characteristics differ from those of crude oil, and lack of homogeneity and substantially lower availability complicate its integration in conventional refining processes. This work explores the possibility of implementing Machine Learning to develop predictive models for auto-ignition properties and to gain a better understanding of the blending behavior of the different molecules that conform commercial gasoline. Additionally, the methodology developed in this study aims to contribute to new characterization methods for conventional and renewable gasoline streams in a simpler, faster and more inexpensive way. To build the models included in this thesis, a palette with seven different compounds was chosen: n-heptane, iso-octane, 1-hexene, cyclopentane, toluene, ethanol and ETBE. A data set containing 243 different combinations of the species in the palette was collected from literature, together with their experimentally measured RON and/or MON. Linear Regression based on Ordinary Least Squares was used as the baseline to compare the performance of more complex algorithms, namely Nearest Neighbors, Support Vector Machines, Decision Trees and Random Forest. The best predictions were obtained with a Support Vector Regression algorithm using a non-linear kernel, able to reproduce synergistic and antagonistic interaction between the seven molecules in the samples

    iFeatureOmega: an integrative platform for engineering, visualization and analysis of features from molecular sequences, structural and ligand data sets

    Get PDF
    The rapid accumulation of molecular data motivates development of innovative approaches to computationally characterize sequences, structures and functions of biological and chemical molecules in an efficient, accessible and accurate manner. Notwithstanding several computational tools that characterize protein or nucleic acids data, there are no one-stop computational toolkits that comprehensively characterize a wide range of biomolecules. We address this vital need by developing a holistic platform that generates features from sequence and structural data for a diverse collection of molecule types. Our freely available and easy-to-use iFeatureOmega platform generates, analyzes and visualizes 189 representations for biological sequences, structures and ligands. To the best of our knowledge, iFeatureOmega provides the largest scope when directly compared to the current solutions, in terms of the number of feature extraction and analysis approaches and coverage of different molecules. We release three versions of iFeatureOmega including a webserver, command line interface and graphical interface to satisfy needs of experienced bioinformaticians and less computer-savvy biologists and biochemists. With the assistance of iFeatureOmega, users can encode their molecular data into representations that facilitate construction of predictive models and analytical studies. We highlight benefits of iFeatureOmega based on three research applications, demonstrating how it can be used to accelerate and streamline research in bioinformatics, computational biology, and cheminformatics areas. The iFeatureOmega webserver is freely available at http://ifeatureomega.erc.monash.edu and the standalone versions can be downloaded from https://github.com/Superzchen/iFeatureOmega-GUI/ and https://github.com/Superzchen/iFeatureOmega-CLI/.Zhen Chen, Xuhan Liu, Pei Zhao, Chen Li, Yanan Wang, Fuyi Li, Tatsuya Akutsu, Chris Bain, Robin B. Gasser, Junzhou Li, Zuoren Yang, Xin Gao, Lukasz Kurgan, and Jiangning Son
    corecore