10 research outputs found

    atomium - A Python structure parser

    Summary: Structural biology relies on specific file formats to convey information about macromolecular structures. Traditionally this has been the PDB format, but newer formats such as PDBML, mmCIF and MMTF are increasingly being used. Here we present atomium, a modern, lightweight Python library for parsing, manipulating and saving PDB, mmCIF and MMTF files. In addition, we provide a web service, pdb2json, which uses atomium to give a consistent JSON representation to the entire Protein Data Bank. Availability and implementation: atomium is implemented in Python and its performance is equivalent to the existing library BioPython. However, it has significant advantages in features and API design. atomium is available from atomium.bioinf.org.uk and pdb2json can be accessed at pdb2json.bioinf.org.uk. Supplementary information: Supplementary data are available at Bioinformatics online
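To illustrate what "a consistent JSON representation" of a structure file can mean, here is a minimal sketch that reads the fixed-width ATOM/HETATM records of the PDB format and emits JSON. It uses only the Python standard library and is not atomium's API or pdb2json's actual schema; the function names (`parse_atom_line`, `pdb_to_json`, `format_atom_line`) and the output keys are assumptions made for illustration.

```python
import json


def parse_atom_line(line):
    """Parse one fixed-width PDB ATOM/HETATM record into a dict.

    Column positions follow the PDB format (0-based slices below).
    The key names are hypothetical, chosen for this sketch.
    """
    return {
        "serial": int(line[6:11]),
        "name": line[12:16].strip(),
        "residue": line[17:20].strip(),
        "chain": line[21].strip(),
        "residue_number": int(line[22:26]),
        "x": float(line[30:38]),
        "y": float(line[38:46]),
        "z": float(line[46:54]),
    }


def pdb_to_json(text):
    """Collect all atom records of a PDB-format string into a JSON document."""
    atoms = [
        parse_atom_line(line)
        for line in text.splitlines()
        if line.startswith(("ATOM", "HETATM"))
    ]
    return json.dumps({"atoms": atoms}, indent=2)


def format_atom_line(serial, name, residue, chain, number, x, y, z):
    """Build a fixed-width ATOM record (helper for the example only)."""
    return (
        f"ATOM  {serial:>5} {name:<4} {residue:>3} {chain}{number:>4}"
        + " " * 4
        + f"{x:8.3f}{y:8.3f}{z:8.3f}"
    )


if __name__ == "__main__":
    # Hypothetical two-line input: one non-atom record and one atom record.
    sample = "HEADER    EXAMPLE\n" + format_atom_line(
        1, "N", "MET", "A", 1, 27.34, 24.43, 2.614
    )
    print(pdb_to_json(sample))
```

A real parser also has to handle alternate locations, insertion codes, multiple models and the other formats named in the abstract, which is where a dedicated library earns its keep.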

    Biotite: a unifying open source computational biology framework in Python

    Background: As molecular biology is creating an increasing amount of sequence and structure data, the multitude of software to analyze this data is also rising. Most of the programs are made for a specific task, hence the user often needs to combine multiple programs in order to reach a goal. This can make the data processing cumbersome, inflexible and even inefficient due to an overhead of read/write operations. Therefore, it is crucial to have a comprehensive, accessible and efficient computational biology framework in a scripting language to overcome these limitations. Results: We have developed the Python package Biotite: a general computational biology framework that represents sequence and structure data based on NumPy ndarrays. Furthermore, the package contains seamless interfaces to biological databases and external software. The source code is freely accessible at https://github.com/biotite-dev/biotite. Conclusions: Biotite is unifying in two ways: First, it bundles popular tasks in sequence analysis and structural bioinformatics in a consistently structured package. Second, it addresses two groups of users: novice programmers get easy access to Biotite due to its simplicity and comprehensive documentation, while advanced users can profit from its high performance and extensibility. They can implement their algorithms upon Biotite, so they can skip writing code for general functionality (like file parsers) and can focus on what makes their software unique

    Sharing data from molecular simulations

    Given the need for modern researchers to produce open, reproducible scientific output, the lack of standards and best practices for sharing data and workflows used to produce and analyze molecular dynamics (MD) simulations has become an important issue in the field. There are now multiple well-established packages to perform molecular dynamics simulations, often highly tuned for exploiting specific classes of hardware, each with strong communities surrounding them, but with very limited interoperability/transferability options. Thus, the choice of the software package often dictates the workflow for both simulation production and analysis. The level of detail in documenting the workflows and analysis code varies greatly in published work, hindering reproducibility of the reported results and the ability for other researchers to build on these studies. An increasing number of researchers are motivated to make their data available, but many challenges remain in order to effectively share and reuse simulation data. To discuss these and other issues related to best practices in the field in general, we organized a workshop in November 2018 (https://bioexcel.eu/events/workshop-on-sharing-data-from-molecular-simulations/). Here, we present a brief overview of this workshop and topics discussed. We hope this effort will spark further conversation in the MD community to pave the way toward more open, interoperable, and reproducible outputs coming from research studies using MD simulations

    Predicting and Characterising Zinc Metal Binding Sites in Proteins

    Zinc is one of the most important biologically active metals. Ten per cent of the human genome is thought to encode a zinc binding protein and its uses encompass catalysis, structural stability, gene expression and immunity. Knowing whether a protein binds to zinc can offer insights into its function, and knowing precisely where it binds zinc can show the mechanism by which it carries out its intended function, as well as provide suggestions as to how pharmaceutical molecules might disrupt or enhance this function where required for medical interventions. At present, there is no specific resource devoted to identifying and presenting all currently known zinc binding sites. This PhD has resulted in the creation of ZincBind: a database of zinc binding sites (ZincBindDB), predictive models of zinc binding at the family level (ZincBindPredict) and a user-friendly, modern website frontend (ZincBindWeb). Both ZincBindDB and ZincBindPredict are also available as GraphQL APIs. The database of zinc binding sites currently contains 38,141 sites, and is automatically updated every week. The predictive models, trained using the Random Forest Machine Learning algorithm, all achieve an MCC ≥ 0.88, recall ≥ 0.93 and precision ≥ 0.91 for the structural models (mean MCC = 0.97), while the sequence models have MCC ≥ 0.64, recall ≥ 0.80 and precision ≥ 0.83 (mean MCC = 0.87), outperforming competing, previous predictive models
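For readers unfamiliar with the metrics quoted above, the following sketch shows how precision, recall and the Matthews correlation coefficient (MCC) are computed from the four cells of a binary confusion matrix. The function name `classification_metrics` and the counts in the example are hypothetical; they are not data from the thesis.

```python
import math


def classification_metrics(tp, fp, tn, fn):
    """Return precision, recall and MCC for a binary confusion matrix.

    tp/fp/tn/fn are counts of true positives, false positives,
    true negatives and false negatives. Undefined ratios are
    reported as 0.0, which is one common convention.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denominator if denominator else 0.0
    return {"precision": precision, "recall": recall, "mcc": mcc}


if __name__ == "__main__":
    # Hypothetical counts: 90 sites found, 10 missed, 10 false alarms.
    print(classification_metrics(tp=90, fp=10, tn=90, fn=10))
```

Unlike precision and recall, MCC uses all four cells, so it stays informative when binding sites are much rarer than non-binding sites.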

    In Silico Design and Selection of CD44 Antagonists: implementation of computational methodologies in drug discovery and design

    Drug discovery (DD) is a process that aims to identify drug candidates through a thorough evaluation of the biological activity of small molecules or biomolecules. Computational strategies (CS) are now necessary tools for speeding up DD. Chapter 1 describes the use of CS throughout the DD process, from the early stages of drug design to the use of artificial intelligence for the de novo design of therapeutic molecules. Chapter 2 describes an in-silico workflow for identifying potential high-affinity CD44 antagonists, ranging from structural analysis of the target to the analysis of ligand-protein interactions and molecular dynamics (MD). In Chapter 3, we tested the shape-guided algorithm on a dataset of macrocycles, identifying the characteristics that need to be improved for the development of new tools for macrocycle sampling and design. In Chapter 4, we describe a detailed reverse docking protocol for identifying potential 4-hydroxycoumarin (4-HC) targets. The strategy described in this chapter is easily transferable to other compounds and protein datasets for overcoming bottlenecks in molecular docking protocols, particularly reverse docking approaches. Finally, Chapter 5 shows how computational methods and experimental results can be used to repurpose compounds as potential COVID-19 treatments. According to our findings, the HCV drug boceprevir could be clinically tested or used as a lead molecule to develop compounds that target COVID-19 or other coronaviral infections. These chapters, in summary, demonstrate the importance, application, limitations, and future of computational methods in the state-of-the-art drug design process

    Development of a programming library for general bioinformatics

    Bioinformatics progresses at an unprecedented pace. At the same time, the programs implementing the essential algorithms are often incompatible with each other in terms of data input and output. As a consequence, it can require substantial effort to establish a workflow that combines different programs. Furthermore, the flexibility of such software is usually limited to a relatively small number of options. These circumstances hamper the adaptation of these programs to new problems. Programming libraries are an alternative to command line programs: they enable the user to apply already implemented algorithms while harnessing the full feature spectrum of a programming language. In this thesis the Python bioinformatics package Biotite is presented. It unifies popular algorithms from sequence and structure analysis into a flexible library, which is applicable to a wide range of biological questions. Furthermore, new algorithms are presented, enhancing the bioinformatician’s toolkit with a novel sequence alignment visualization approach and a universally applicable hydrogen prediction method. Finally, via the application of Biotite this thesis provides new insights into the molecular mechanism of cation channels and novel evaluation methods for sequencing data from SELEX experiments

    Modeling homo- and hetero-oligomers using in silico prediction of protein quaternary structure

    Cellular processes often depend on interactions between proteins and the formation of macromolecular complexes. The impairment of such interactions can lead to deregulation of pathways resulting in disease states, and it is hence crucial to gain insights into the nature of the macromolecular assemblies. Detailed structural knowledge about complexes and protein-protein interactions is growing, but experimentally determined three-dimensional multimeric assemblies are outnumbered by complexes supported by non-structural experimental evidence. In this thesis, we aim to fill this gap by modeling multimeric structures by homology, and we ask which properties of proteins within a family can assist in the prediction of the correct quaternary structure. Specifically, we introduce a description of protein-protein interface conservation as a function of evolutionary distance. This enables us to reduce the noise in deep multiple sequence alignments where sequences of proteins organized in different oligomeric states are interspersed. We also define a distance measure to structurally compare homologous multimeric protein complexes. This allows us to hierarchically cluster protein structures and quantify the diversity of alternative biological assemblies known today in the Protein Data Bank (PDB). We find that a combination of conservation scores, structural clustering, and classical interface descriptors, is able to improve the selection of homologous protein templates leading to reliable models of protein complexes

    MMTF-An efficient file format for the transmission, visualization, and analysis of macromolecular structures.

    Recent advances in experimental techniques have led to a rapid growth in complexity, size, and number of macromolecular structures that are made available through the Protein Data Bank. This creates a challenge for macromolecular visualization and analysis. Macromolecular structure files, such as PDB or PDBx/mmCIF files, can be slow to transfer and parse, and hard to incorporate into third-party software tools. Here, we present a new binary and compressed data representation, the MacroMolecular Transmission Format, MMTF, as well as software implementations in several languages that have been developed around it, which address these issues. We describe the new format and its APIs and demonstrate that it is several times faster to parse, and about a quarter of the file size of the current standard format, PDBx/mmCIF. As a consequence of the new data representation, it is now possible to visualize structures with millions of atoms in a web browser, keep the whole PDB archive in memory or parse it within a few minutes on average computers, which opens up new ways of thinking about how to design and implement efficient algorithms in structural bioinformatics. The PDB archive is available in MMTF file format through web services and data that are updated on a weekly basis
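The abstract does not say how the compression works. As a rough illustration of why structure data compresses well, the sketch below combines delta encoding and run-length encoding, two generic integer-array techniques of the kind compact binary formats rely on. This is an assumption-laden toy, not MMTF's actual codec; all function names are hypothetical.

```python
def delta_encode(values):
    """Store each value as the difference from its predecessor."""
    encoded, previous = [], 0
    for value in values:
        encoded.append(value - previous)
        previous = value
    return encoded


def delta_decode(deltas):
    """Invert delta_encode by accumulating the differences."""
    decoded, total = [], 0
    for delta in deltas:
        total += delta
        decoded.append(total)
    return decoded


def run_length_encode(values):
    """Collapse runs into a flat list of [value, count, value, count, ...]."""
    encoded = []
    for value in values:
        if encoded and encoded[-2] == value:
            encoded[-1] += 1
        else:
            encoded.extend([value, 1])
    return encoded


def run_length_decode(pairs):
    """Invert run_length_encode."""
    decoded = []
    for i in range(0, len(pairs), 2):
        decoded.extend([pairs[i]] * pairs[i + 1])
    return decoded


if __name__ == "__main__":
    # Atom serial numbers are usually consecutive, so their deltas are all 1
    # and the whole array collapses to a single (value, count) pair.
    serials = list(range(1, 1001))
    packed = run_length_encode(delta_encode(serials))
    print(packed)  # [1, 1000]
    assert delta_decode(run_length_decode(packed)) == serials
```

The same idea applies to any column whose neighbouring values are highly regular, such as residue numbers or repeated chain identifiers mapped to integers.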