
    Exploring the potential of 3D Zernike descriptors and SVM for protein–protein interface prediction

    Background: The correct determination of protein–protein interaction interfaces is important for understanding disease mechanisms and for rational drug design. To date, several computational methods for the prediction of protein interfaces have been developed, but the interface prediction problem is still not fully understood. Experimental evidence suggests that the location of binding sites is imprinted in the protein structure, but there are major differences among the interfaces of the various protein types: the characterising properties can vary a lot depending on the interaction type and function. The selection of an optimal set of features characterising the protein interface and the development of an effective method to represent and capture the complex protein recognition patterns are of paramount importance for this task. Results: In this work we investigate the potential of a novel local surface descriptor based on 3D Zernike moments for the interface prediction task. Descriptors invariant to roto-translations are extracted from circular patches of the protein surface enriched with physico-chemical properties from the HQI8 amino acid index set, and are used as samples for a binary classification problem. Support Vector Machines are used as a classifier to distinguish interface local surface patches from non-interface ones. The proposed method was validated on 16 classes of proteins extracted from the Protein–Protein Docking Benchmark 5.0 and compared to other state-of-the-art protein interface predictors (SPPIDER, PrISE and NPS-HomPPI). Conclusions: The 3D Zernike descriptors are able to capture the similarity among patterns of physico-chemical and biochemical properties mapped on the protein surface arising from the various spatial arrangements of the underlying residues, and their usage can be easily extended to other sets of amino acid properties. 
The results suggest that the choice of a proper set of features characterising the protein interface is crucial for the interface prediction task, and that optimality strongly depends on the class of proteins whose interface we want to characterise. We postulate that different protein classes should be treated separately and that it is necessary to identify an optimal set of features for each protein class

    Sequencing, assembly and annotation of the mitochondrial and plastid genomes of Gelidium pristoides (Turner) Kützing from Kenton-on-Sea, South Africa

    The genome is the complete set of an organism's hereditary information that contains all the information necessary for the functioning of that organism. Complete nuclear, mitochondrial and plastid DNA constitute the three main types of genomes which play interconnected roles in an organism. Genome sequencing enables researchers to understand the regulation and expression of the various genes and the proteins they encode. It allows researchers to extract and analyse genes of interest for a variety of studies including molecular, biotechnological, bioinformatics, conservation and evolutionary studies. Genome sequencing of Rhodophyta has received little attention. To date, no published studies have focused on both whole-genome sequencing and sequencing of the organellar genomes of Rhodophyta species found along the South African coastline. This study focused on the sequencing, assembly and annotation of the mitochondrial and plastid genomes of Gelidium pristoides. Gelidium pristoides was collected from Kenton-on-Sea and was morphologically identified at Rhodes University. Its genomic DNA was extracted using the Nucleospin® Plant II kit and quantified using Qubit 2.0, Nanodrop and 1% agarose gel electrophoresis. The Ion Plus Fragment Library kit was used for the preparation of a 600 bp library, which was sequenced in two separate runs through the Ion S5 platform. The produced reads were quality-controlled through the Ion Torrent server version 5.6 and assessed using the FASTQC program. The SPAdes version 3.11.1 assembler was used to assemble the quality-controlled reads, and the resultant genome assembly was quality-assessed using the QUAST 4.1 software. The mitochondrial genome was selected from the produced Gelidium pristoides draft genome using mitochondrial genomes of other Gelidiales as search queries on the local BLAST algorithm of the BioEdit software. 
Contigs matching the organellar genomes were ordered according to the mitochondrial genomes of other Gelidiales using the trial version of Geneious R11.12 software. The plastid genome was also selected following the same approach but using plastid genomes of Gelidium elegans and Gelidium vagum as search queries. Gaps observed in the organellar genomes were closed by amplification of the relevant gap using polymerase chain reaction with newly designed primers and Sanger sequencing. Open reading frames for both organellar genomes were annotated using the NCBI ORF-Finder and alignments obtained from BlastN and BlastX searches from the NCBI database, while the tRNAs and rRNAs were identified using the tRNAscan-SE1.21 vi and the RNAmmer 1.2 servers. The circular physical map of the mitochondrial genome was constructed using the CGView server. Lastly, in silico analysis of cytochrome c oxidase 3 and Heat Shock Protein 70 was performed using the PRIMO and the SWISS-MODEL pipelines respectively. Their phylogenies were analysed through Clustal Omega and the trees viewed on TreeView 1.6.6 software. Qubit and Nanodrop genomic DNA qualification revealed A260/A280 and A230/A260 ratios of 1.81 and 1.52 respectively. The 1% agarose gel electrophoresis further confirmed the good quality of the genomic DNA used for library preparation and sequencing. Pre-assembly quality control of reads resulted in a total of 30 792 074 high-quality reads which were assembled into a total of 94140 contigs, making up an estimated genome length of 217.06 Mb. The largest contig covered up to 13.17 kb of the draft genome, and an N50 statistic value of 3.17 kb was obtained. The G. pristoides mitochondrial genome mapped into a circular molecule of 25012 bp, with an overall GC content of 31.04% and a total of 45 genes distributed into 20 tRNA-coding genes, 2 rRNA-coding genes and 23 protein-coding genes, mostly adopting the modified genetic code of Rhodophyta. The SecY and rps12 genes overlapped by 41 bp. 
This study presents a partial plastid genome composed of 89 (38%) fully annotated genes, of which 71 are protein-coding, and 18 are distributed among 15 tRNA-coding, 2 rRNA-coding and 1 RNaseP RNA-coding genes. Sixty-one (26%) partial protein-coding genes were predicted, while approximately 84 (36%) genes are not yet predicted. In silico analysis of the cytochrome c oxidase and heat shock protein 70 showed that the gene sequences obtained in this study and the resultant transcribed protein have sequences and structures that are similar to those from several other species, thus validating the integrity of the genome sequences. This study provides genomic data necessary for understanding the genomic constituent of G. pristoides and serves as a foundation for studies of individual genes and for resolving evolutionary relationships
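
The two assembly statistics quoted in the abstract above (N50 and GC content) can be illustrated with a minimal Python sketch. It is not the QUAST implementation used in the study; the function names `n50` and `gc_content` and the toy inputs are assumptions chosen for illustration only.

```python
def n50(contig_lengths):
    """Return the N50 of an assembly.

    N50 is the length L such that contigs of length >= L together
    cover at least half of the total assembly length.
    """
    total = sum(contig_lengths)
    running = 0
    # Walk from the longest contig down, accumulating covered length.
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("no contigs given")


def gc_content(sequence):
    """Return the percentage of G and C bases in a DNA sequence."""
    seq = sequence.upper()
    if not seq:
        raise ValueError("empty sequence")
    return 100.0 * (seq.count("G") + seq.count("C")) / len(seq)


if __name__ == "__main__":
    # Hypothetical toy data, not values from the study.
    print(n50([2, 3, 4, 5, 6]))
    print(gc_content("ATGC"))
```

For the toy contig lengths the total is 20, and the two longest contigs (6 and 5) already cover 11, so the N50 is 5.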

    Protein contour modelling and computation for complementarity detection and docking

    The aim of this thesis is the development and application of a model that effectively and efficiently integrates the evaluation of geometric and electrostatic complementarity for the protein-protein docking problem. Proteins perform their biological roles by interacting with other biomolecules and forming macromolecular complexes. The structural characterization of protein complexes is important to understand the underlying biological processes. Unfortunately, there are several limitations to the available experimental techniques, leaving the vast majority of these complexes to be determined by means of computational methods such as protein-protein docking. The ultimate goal of the protein-protein docking problem is the in silico prediction of the three-dimensional structure of complexes of two or more interacting proteins, as occurring in living organisms, which can later be verified in vitro or in vivo. These interactions are highly specific and take place due to the simultaneous formation of multiple weak bonds: the geometric complementarity of the contours of the interacting molecules is a fundamental requirement in order to enable and maintain these interactions. However, shape complementarity alone cannot guarantee highly accurate docking predictions, as there are several physicochemical factors, such as Coulomb potentials, van der Waals forces and hydrophobicity, affecting the formation of protein complexes. In order to set up correct and efficient methods for the protein-protein docking, it is necessary to provide a unique representation which integrates geometric and physicochemical criteria in the complementarity evaluation. To this end, a novel local surface descriptor, capable of capturing both the shape and electrostatic distribution properties of macromolecular surfaces, has been designed and implemented. 
The proposed methodology effectively integrates the evaluation of geometrical and electrostatic distribution complementarity of molecular surfaces, while maintaining efficiency in the descriptor comparison phase. The descriptor is based on the 3D Zernike invariants, which possess several attractive features, such as a compact representation and rotational and translational invariance, and which have been shown to adequately capture global and local protein surface shape similarity and to naturally represent physicochemical properties on the molecular surface. Locally, the geometric similarity between two portions of protein surface implies a certain degree of complementarity, but the same cannot be stated about electrostatic distributions. Complementarity in electrostatic distributions is more complex to handle, as charges must be matched with opposite ones even if they do not have the same magnitude. The proposed method overcomes this limitation as follows. From a unique electrostatic distribution function, two separate distribution functions are obtained, one for the positive and one for the negative charges, and both functions are normalised to [0, 1]. Descriptors are computed separately for the positive and negative charge distributions, and complementarity evaluation is then done by cross-comparing descriptors of distributions of charges of opposite signs. The proposed descriptor uses a discrete voxel-based representation of the Connolly surface on which the corresponding electrostatic potentials have been mapped. Voxelised surface representations have received a lot of interest in several bioinformatics and computational biology applications as a simple and effective way of jointly representing geometric and physicochemical properties of proteins and other biomolecules by mapping auxiliary information in each voxel. 
Moreover, the voxel grid can be defined at different resolutions, thus giving the means to effectively control the degree of detail in the discrete representation along with the possibility of producing multiple representations of the same molecule at different resolutions. A specific algorithm has been designed for the efficient computation of voxelised macromolecular surfaces at arbitrary resolutions, starting from experimentally-derived structural data (X-ray crystallography, NMR spectroscopy or cryo-electron microscopy). Fast surface generation is achieved by adapting an approximate Euclidean Distance Transform algorithm in the Connolly surface computation step and by exploiting the geometrical relationship between the latter and the Solvent Accessible surface. This algorithm is the basis of VoxSurf (Voxelised Surface calculation program), a tool which can produce discrete representations of macromolecules at very high resolutions starting from the three-dimensional information of their corresponding PDB files. By employing compact data structures and implementing a spatial slicing protocol, the proposed tool can calculate the three main molecular surfaces at high resolutions with limited memory demands. To reduce the surface computation time without affecting the accuracy of the representation, two parallel algorithms for the computation of voxelised macromolecular surfaces, based on a spatial slicing procedure, have been introduced. The molecule is sliced into a user-defined number of parts and the portions of the overall surface can be calculated for each slice in parallel. The molecule is sliced with planes perpendicular to the abscissa axis of the Cartesian coordinate system defined in the molecule's PDB entry. The first algorithm uses an overlapping margin of one probe-sphere radius length between slices in order to guarantee the correctness of the Euclidean Distance Transform. 
Because of this margin, the Connolly surface can be computed nearly independently for each slice. Communications among processes are necessary only during the pocket identification procedure which ensures that pockets spanning more than one slice are correctly identified and discriminated from solvent-excluded cavities inside the molecule. In the second parallel algorithm the size of the overlapping margin between slices has been reduced to a one-voxel length by adapting a multi-step region-growing Euclidean Distance Transform algorithm. At each step, distance values are first calculated independently for every slice; then a small portion of the borders' information is exchanged between adjacent slices. The proposed methodologies will serve as a basis for a full-fledged protein-protein docking protocol based on local feature matching. Rigorous benchmark tests have shown that the combined geometric and electrostatic descriptor can effectively identify shape and electrostatic distribution complementarity in the binding sites of protein-protein complexes, by efficiently comparing circular surface patches and significantly decreasing the number of false positives obtained when using a purely geometric descriptor. In the validation experiments, the contours of the two interacting proteins are divided into circular patches: all possible patch pairs from the two proteins are then evaluated in terms of complementarity and a general ranking is produced. Results show that native patch pairs obtain higher ranks when using the newly proposed descriptor, with respect to the ranks obtained when using the purely geometric one
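
The charge-splitting step described in this abstract (one signed electrostatic distribution turned into separate positive and negative distributions, each normalised to [0, 1], then cross-compared between opposite signs) can be sketched in Python as follows. This is only an illustration under stated assumptions: normalisation by the peak magnitude, the Euclidean distance, and the summed mismatch score are choices made here, and the normalised maps stand in for the 3D Zernike descriptors that the thesis actually computes. All function names are hypothetical.

```python
import math


def split_and_normalise(potentials):
    """Split signed potentials into positive and negative magnitude maps,
    each rescaled to [0, 1] by its own peak value (an assumption)."""
    positive = [v if v > 0 else 0.0 for v in potentials]
    negative = [-v if v < 0 else 0.0 for v in potentials]

    def rescale(values):
        peak = max(values, default=0.0)
        return [v / peak for v in values] if peak > 0 else list(values)

    return rescale(positive), rescale(negative)


def distance(a, b):
    """Euclidean distance between two equal-length descriptor vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def electrostatic_mismatch(patch_a, patch_b):
    """Cross-compare opposite-sign descriptors of two patches.

    Each patch is a (positive_descriptor, negative_descriptor) pair.
    A lower score means the charges are more complementary.
    """
    pos_a, neg_a = patch_a
    pos_b, neg_b = patch_b
    return distance(pos_a, neg_b) + distance(neg_a, pos_b)


if __name__ == "__main__":
    # Hypothetical toy patches: b mirrors a with opposite sign and half magnitude.
    a = split_and_normalise([2.0, -4.0, 0.0, 1.0])
    b = split_and_normalise([-1.0, 2.0, 0.0, -0.5])
    print(electrostatic_mismatch(a, b))  # opposite charges line up
    print(electrostatic_mismatch(a, a))  # same-sign charges do not
```

Because each sign is normalised separately, patch `b` matches patch `a` even though its charges have half the magnitude, which is the property the abstract asks for.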

    Preface


    Gaze-Based Human-Robot Interaction by the Brunswick Model

    We present a new paradigm for human-robot interaction based on social signal processing, and in particular on the Brunswick model. Originally, the Brunswick model addresses face-to-face dyadic interaction, assuming that the interactants communicate through a continuous exchange of nonverbal social signals in addition to spoken messages. Social signals have to be interpreted through a proper recognition phase that considers visual and audio information. The Brunswick model makes it possible to quantitatively evaluate the quality of the interaction using statistical tools that measure how effective the recognition phase is. In this paper we apply this theory to the case in which one of the interactants is a robot; in this case, the recognition phases performed by the robot and by the human have to be revised with respect to the original model. The model is applied to Berrick, a recent open-source, low-cost robotic head platform, where gaze is the social signal considered

    From models to data: understanding biodiversity patterns from environmental DNA data

    Integrative patterns of biodiversity, such as the distribution of taxa abundances and the spatial turnover of taxonomic composition, have been under scrutiny from ecologists for a long time, as they offer insight into the general rules governing the assembly of organisms into ecological communities. Thanks to recent progress in high-throughput DNA sequencing, these patterns can now be measured in a fast and standardized fashion through the sequencing of DNA sampled from the environment (e.g. soil or water), instead of relying on tedious fieldwork and rare naturalist expertise. They can also be measured for the whole tree of life, including the vast and previously unexplored diversity of microorganisms. Taking full advantage of this new type of data is challenging, however: DNA-based surveys are indirect, and suffer as such from many potential biases; they also produce large and complex datasets compared to classical censuses. The first goal of this thesis is to investigate how statistical tools and models classically used in ecology or coming from other fields can be adapted to DNA-based data so as to better understand the assembly of ecological communities. The second goal is to apply these approaches to soil DNA data from the Amazonian forest, the Earth's most diverse land ecosystem. Two broad types of mechanisms are classically invoked to explain the assembly of ecological communities: 'neutral' processes, i.e. the random birth, death and dispersal of organisms, and 'niche' processes, i.e. the interaction of the organisms with their environment and with each other according to their phenotype. 
Disentangling the relative importance of these two types of mechanisms in shaping taxonomic composition is a key ecological question, with many implications from estimating global diversity to conservation issues. In the first chapter, this question is addressed across the tree of life by applying the classical analytic tools of community ecology to soil DNA samples collected from various forest plots in French Guiana. The second chapter focuses on the neutral aspect of community assembly. A mathematical model incorporating the key elements of neutral community assembly has been proposed by S.P. Hubbell in 2001, making it possible to infer quantitative measures of dispersal and of regional diversity from the local distribution of taxa abundances. In this chapter, the biases introduced when reconstructing the taxa abundance distribution from environmental DNA data are discussed, and their impact on the estimation of the dispersal and regional diversity parameters is quantified. The third chapter focuses on how non-random differences in taxonomic composition across a group of samples, resulting from various community assembly processes, can be efficiently detected, represented and interpreted. A method originally designed to model the different topics emerging from a set of text documents is applied here to soil DNA data sampled along a grid over a large forest plot in French Guiana. Spatial patterns of soil microorganism diversity are successfully captured, and related to fine variations in environmental conditions across the plot. Finally, the implications of the thesis findings are discussed. In particular, the potential of topic modelling for the modelling of DNA-based biodiversity data is stressed
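
The 'neutral' processes named in this abstract (random birth, death and dispersal) can be illustrated with a small simulation of a local community in the spirit of Hubbell's model. The sketch below is an assumption-laden toy, not the inference method used in the thesis: the function name, the parameter names and the fixed regional abundances are all hypothetical.

```python
import random
from collections import Counter


def simulate_neutral_community(size, migration_rate, regional_abundances,
                               steps, seed=0):
    """Simulate a fixed-size local community under neutral dynamics.

    At every step one randomly chosen individual dies. With probability
    `migration_rate` it is replaced by an immigrant drawn from the regional
    pool (dispersal); otherwise by the offspring of another randomly chosen
    local individual (birth). Species identity never affects the odds.
    """
    if size < 2:
        raise ValueError("size must be at least 2")
    rng = random.Random(seed)
    species = list(regional_abundances)
    weights = [regional_abundances[s] for s in species]
    # Start from a random sample of the regional pool.
    community = rng.choices(species, weights=weights, k=size)
    for _ in range(steps):
        dead = rng.randrange(size)
        if rng.random() < migration_rate:
            community[dead] = rng.choices(species, weights=weights, k=1)[0]
        else:
            # Pick a parent among the other size - 1 individuals.
            parent = rng.randrange(size - 1)
            if parent >= dead:
                parent += 1
            community[dead] = community[parent]
    return community


if __name__ == "__main__":
    # Hypothetical regional pool with three taxa.
    pool = {"taxon_a": 6, "taxon_b": 3, "taxon_c": 1}
    local = simulate_neutral_community(100, 0.05, pool, steps=5000, seed=42)
    # The local taxa abundance distribution is what the abstract infers from.
    print(Counter(local).most_common())
```

The printed counts form a local taxa abundance distribution, the quantity from which dispersal and regional diversity parameters are estimated in the second chapter.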