6 research outputs found

    Extending Graph (Discrete) Derivative Descriptors to N-Tuple Atom-Relations

    In the present manuscript, an extension of the previously defined Graph Derivative Indices (GDIs) is discussed. To this end, the concept of a hypermatrix, built from the frequencies of triple and quadruple atom relations in a set of connected subgraphs, is introduced. This set of subgraphs is generated following a predefined criterion, known as the event (S), which in this particular case is the connectivity among atoms. The triple and quadruple relation frequency matrices serve as a basis for the computation of triple and quadruple discrete derivative indices, respectively. The GDIs are implemented in a computational program called DIVATI (acronym for DIscrete DeriVAtive Type Indices), a module of the TOMOCOMD-CARDD program. Shannon's entropy-based variability analysis demonstrates that the GDIs show greater variability than other indices used in QSAR/QSPR research. In addition, when the indices are extended over n elements of the graph, their quality increases, particularly when they are used in combination. QSPR modeling of the physicochemical properties Log P and Log K of the 2-furylethylene derivatives reveals that the GDIs obtained using the triple and quadruple matrix approaches yield superior performance to the duplex matrix approach. Moreover, the statistical parameters for models obtained with the GDI method are superior to those reported in the literature using other methods. It can therefore be suggested that the GDI method seems to be a promising tool for QSAR/QSPR studies, virtual screening of compound datasets and similarity/dissimilarity evaluations.
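    The entropy-based variability analysis mentioned in this abstract can be illustrated with a minimal sketch. The abstract does not specify the binning scheme, so the equal-width binning, the `bins` parameter and the function name `shannon_entropy` are assumptions made here for illustration only, not the authors' procedure. The idea is that a descriptor whose values spread evenly over many bins has higher entropy, and so discriminates better among compounds, than one whose values cluster in a few bins.

```python
import math
from collections import Counter


def shannon_entropy(values, bins=10):
    """Shannon entropy (in bits) of a descriptor after equal-width binning.

    Hypothetical helper: the binning scheme is an assumption, not taken
    from the abstract.
    """
    lo, hi = min(values), max(values)
    if hi == lo:
        return 0.0  # a constant descriptor carries no information
    width = (hi - lo) / bins
    # Map each value to a bin index; the maximum value goes into the last bin.
    counts = Counter(min(int((v - lo) / width), bins - 1) for v in values)
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())
```

    With this definition the entropy is bounded above by log2 of the number of bins, so descriptors computed on the same compound set with the same `bins` value can be ranked against each other.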

    ProtDCal: A program to compute general-purpose-numerical descriptors for sequences and 3D-structures of proteins

    Background: The exponential growth of protein structural and sequence databases is enabling multifaceted approaches to understanding the long sought sequence-structure-function relationship. Advances in computation now make it possible to apply well-established data mining and pattern recognition techniques to these data to learn models that effectively relate structure and function.

    Exploring general-purpose protein features for distinguishing enzymes and non-enzymes within the twilight zone

    Background: Computational prediction of protein function constitutes one of the more complex problems in Bioinformatics, because of the diversity of functions and mechanisms that proteins exert in nature. The problem is especially acute for proteins that share very low primary or tertiary structure similarity with existing annotated proteomes. New alignment-free (AF) tools are therefore needed to overcome the inherent limitations of classic alignment-based approaches. We have recently introduced AF protein-numerical-encoding programs (TI2BioP and ProtDCal), whose sequence-based features have been successfully applied to detect remote protein homologs, post-translational modifications and antibacterial peptides. Here we aim to demonstrate the applicability of four AF protein descriptor families, implemented in our programs, to the identification of enzyme-like proteins. At the same time, our novel family of 3D-structure-based descriptors is introduced for the first time. The Dobson & Doig (D&D) benchmark dataset is used to evaluate our AF protein descriptors, because its proven structural diversity permits one to emulate an experiment within the twilight zone of alignment-based methods (pair-wise identity <30%). The performance of our sequence-based predictor was further assessed using a subset of formerly uncharacterized proteins that currently represent a benchmark annotation dataset. Results: Four protein descriptor families (sequence-composition-based (0D), linear-topology-based (1D), pseudo-fold-topology-based (2D) and 3D-structure features (3D)) were assessed using the D&D benchmark dataset. We show that only the families of ProtDCal's descriptors (0D, 1D and 3D) encode significant information for discriminating enzymes from non-enzymes. The resulting 3D-structure-based classifier ranked first among several other SVM-based methods assessed on this dataset. Furthermore, the model leveraging 1D descriptors showed a higher success rate than EzyPred on a benchmark annotation dataset from the Shewanella oneidensis proteome. Conclusions: The applicability of ProtDCal as a general-purpose AF protein modelling method is illustrated through the discrimination between two comprehensive protein functional classes. The performances observed on the highly diverse D&D dataset and on the set of formerly uncharacterized (hard-to-annotate) proteins of Shewanella oneidensis place our methodology in the top range of methods to model and predict protein function using alignment-free approaches.

    Anticancer drug discovery using artificial intelligence: an application in pharmacological activity prediction

    Hematological cancers are a heterogeneous family of diseases that can be divided into leukemias, lymphomas, and myelomas, often called "liquid tumors". Since they cannot be surgically removed, chemotherapy represents the mainstay of their treatment. However, chemotherapy still faces several challenges, such as drug resistance and low response rates, and the need for new anticancer agents is compelling. The drug discovery process is lengthy, costly, and prone to high failure rates. With the rapid expansion of biological and chemical "big data", computational techniques such as machine learning tools have been increasingly employed to speed up the whole process and reduce its cost. Machine learning algorithms can create complex models that determine the biological activity of compounds against several targets, based on their chemical properties. These models are known as multi-target Quantitative Structure-Activity Relationship (mt-QSAR) models and can be used to virtually screen small and large chemical libraries for new molecules with anticancer activity. The aim of my Ph.D. project was to employ machine learning techniques to build an mt-QSAR classification model for the prediction of cytotoxic drugs simultaneously active against 43 hematological cancer cell lines. For this purpose, I first constructed a large and diversified dataset of molecules extracted from the ChEMBL database. Then, I compared the performance of different ML classification algorithms and identified Random Forest as the one returning the best predictions. Finally, I used different approaches to maximize the performance of the model, which achieved an accuracy of 88% by correctly classifying 93% of inactive molecules and 72% of active molecules in a validation set. This model was further applied to the virtual screening of a small dataset of molecules tested in our laboratory, where it classified all molecules correctly (100% accuracy). This result is confirmed by our previous in vitro experiments.
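    The three figures quoted for the validation set (overall accuracy, fraction of inactive molecules correctly classified, fraction of active molecules correctly classified) are all derived from one confusion matrix. The sketch below shows how they relate. The function name `classification_metrics` and the counts in the usage note are hypothetical; the abstract does not report the underlying counts.

```python
def classification_metrics(tp, fn, tn, fp):
    """Return accuracy, sensitivity and specificity from confusion-matrix counts.

    tp / fn: active molecules predicted active / inactive
    tn / fp: inactive molecules predicted inactive / active
    """
    total = tp + fn + tn + fp
    return {
        "accuracy": (tp + tn) / total,    # all correct predictions
        "sensitivity": tp / (tp + fn),    # fraction of actives recovered
        "specificity": tn / (tn + fp),    # fraction of inactives recovered
    }


# Made-up counts for illustration only; they are NOT from the thesis.
example = classification_metrics(tp=72, fn=28, tn=279, fp=21)
```

    Because accuracy is a prevalence-weighted average of sensitivity and specificity, a validation set dominated by inactive molecules pulls the overall accuracy towards the specificity, which is why it helps that the abstract reports the per-class rates as well.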

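    Développement de méthodes et d’outils chémoinformatiques pour l’analyse et la comparaison de chimiothèques (Development of chemoinformatics methods and tools for the analysis and comparison of chemical libraries)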

    New fields have emerged at the interface of biology, chemistry and computer science to address the many problems linked to drug discovery. This thesis lies at the interface of several of these fields, grouped under the banner of chemoinformatics. Although recent on a human timescale, this field is already an integral part of pharmaceutical research. Like bioinformatics, its founding pillar remains the storage, representation, management and computational exploitation of chemical data. Chemoinformatics is today used mainly in the upstream phases of drug discovery. By combining methods from various fields (chemistry, computer science, mathematics, machine learning, statistics, etc.), it enables the implementation of software tools adapted to the specific problems and data of chemistry, such as the storage of chemical information in databases, substructure search, data visualisation, and the prediction of physicochemical and biological properties. Within this multidisciplinary framework, the work presented in this thesis addresses two important aspects of chemoinformatics: (1) the development of new methods that facilitate the visualisation, analysis and interpretation of data on sets of molecules, more commonly called chemical libraries, and (2) the development of software tools that implement these methods.