61 research outputs found

    Classification of nuclear receptors based on amino acid composition and dipeptide composition

    Nuclear receptors are key transcription factors that regulate crucial gene networks responsible for cell growth, differentiation, and homeostasis. Nuclear receptors form a superfamily of phylogenetically related proteins and control functions associated with major diseases (e.g. diabetes, osteoporosis, and cancer). In this study, a novel method has been developed for classifying the subfamilies of nuclear receptors. The classification was achieved on the basis of the amino acid and dipeptide composition of the receptor sequences using support vector machines. Training and testing were done on a non-redundant data set of 282 proteins obtained from the NucleaRDB database (1). The performance of all classifiers was evaluated using a 5-fold cross-validation test, in which the data set was randomly partitioned into five equal sets and each set in turn was used for testing while the remaining four sets were used for training. It was found that different subfamilies of nuclear receptors were quite closely correlated in terms of amino acid composition as well as dipeptide composition. The overall accuracies of the amino acid composition-based and dipeptide composition-based classifiers were 82.6% and 97.5%, respectively. Therefore, our results show that different subfamilies of nuclear receptors are predictable with considerable accuracy using amino acid or dipeptide composition. Furthermore, based on the above approach, an online web service, NRpred, was developed, which is available at www.imtech.res.in/raghava/nrpred.
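    The two feature sets named in this abstract can be sketched as follows. This is a minimal illustration, not the NRpred implementation: the function names are hypothetical, and the handling of non-standard residues (skipped rather than counted) is an assumption.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def amino_acid_composition(seq):
    """Return 20 fractions: the share of each standard residue in seq."""
    residues = [r for r in seq.upper() if r in AMINO_ACIDS]
    total = len(residues)
    return [residues.count(a) / total if total else 0.0 for a in AMINO_ACIDS]


def dipeptide_composition(seq):
    """Return 400 fractions: the share of each ordered pair of adjacent residues."""
    seq = seq.upper()
    counts = {}
    total = 0
    for i in range(len(seq) - 1):
        pair = seq[i:i + 2]
        # Assumption: pairs containing a non-standard residue are skipped.
        if pair[0] in AMINO_ACIDS and pair[1] in AMINO_ACIDS:
            counts[pair] = counts.get(pair, 0) + 1
            total += 1
    return [counts.get(a + b, 0) / total if total else 0.0
            for a, b in product(AMINO_ACIDS, repeat=2)]
```

    Each sequence becomes a fixed-length vector (20 or 400 values) regardless of its length, which is what allows a standard classifier such as a support vector machine to be trained on it.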

    The interplay of descriptor-based computational analysis with pharmacophore modeling builds the basis for a novel classification scheme for feruloyl esterases

    One of the most intriguing groups of enzymes, the feruloyl esterases (FAEs), is ubiquitous in both simple and complex organisms. FAEs have gained importance in the biofuel, medicine and food industries due to their capability of acting on a large range of substrates for cleaving ester bonds and synthesizing high-added-value molecules through esterification and transesterification reactions. During the past two decades extensive studies have been carried out on the production and partial characterization of FAEs from fungi, while much less is known about FAEs of bacterial or plant origin. Initial classification studies on FAEs were restricted to sequence similarity and substrate specificity on just four model substrates and considered only a handful of FAEs belonging to the fungal kingdom. This study centers on the descriptor-based classification and structural analysis of experimentally verified and putative FAEs; nevertheless, the framework presented here is applicable to every poorly characterized enzyme family. A total of 365 FAE-related sequences of fungal, bacterial and plant origin were collected and clustered into distinct groups using Self Organizing Maps followed by k-means clustering, based on amino acid composition and physico-chemical composition descriptors derived from the respective amino acid sequences. A Support Vector Machine model was subsequently constructed for the classification of new FAEs into the pre-assigned clusters. The model successfully recognized 98.2% of the training sequences and all the sequences of the blind test. The underlying functionality of the 12 proposed FAE families was validated against a combination of prediction tools and published experimental data. Another important aspect of the present work involves the development of pharmacophore models for the new FAE families for which sufficient information on known substrates existed. Knowing the pharmacophoric features of a small molecule that are essential for binding to the members of a certain family opens a window of opportunities for tailored applications of FAEs.

    A Systematic Prediction of Multiple Drug-Target Interactions from Chemical, Genomic, and Pharmacological Data

    In silico prediction of drug-target interactions from heterogeneous biological data can advance our system-level search for drug molecules and therapeutic targets, an effort that has not yet reached full fruition. In this work, we report a systematic approach that efficiently integrates chemical, genomic, and pharmacological information for drug targeting and discovery on a large scale, based on two powerful methods, Random Forest (RF) and Support Vector Machine (SVM). The performance of the derived models was evaluated and verified with internal five-fold cross-validation and four external independent validations. The optimal models show impressive prediction performance for drug-target interactions, with a concordance of 82.83%, a sensitivity of 81.33%, and a specificity of 93.62%. The consistency of the performance of the RF and SVM models demonstrates the reliability and robustness of the obtained models. In addition, the validated models were employed to systematically predict known and unknown drugs and targets involving enzymes, ion channels, GPCRs, and nuclear receptors, which can be further mapped to functional ontologies such as target-disease associations and target-target interaction networks. This approach is expected to help fill the existing gap between chemical genomics and network pharmacology and thus accelerate the drug discovery process.
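    The three figures reported above can be computed from binary predictions as sketched below. This is an illustrative helper, not code from the study: the function name is hypothetical, and it assumes that "concordance" means overall agreement between predicted and true labels (i.e. accuracy), with 1 marking an interacting drug-target pair and 0 a non-interacting one.

```python
def binary_metrics(y_true, y_pred):
    """Concordance, sensitivity and specificity for 0/1 labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    total = tp + tn + fp + fn
    return {
        # Assumption: concordance is the fraction of all pairs predicted correctly.
        "concordance": (tp + tn) / total if total else None,
        # Fraction of true interactions that were recovered.
        "sensitivity": tp / (tp + fn) if (tp + fn) else None,
        # Fraction of true non-interactions that were rejected.
        "specificity": tn / (tn + fp) if (tn + fp) else None,
    }
```

    Reporting sensitivity and specificity alongside concordance matters when the interacting and non-interacting classes differ in size, since overall agreement alone can hide poor performance on the smaller class.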

    Analysis of class C G-protein coupled receptors using supervised classification methods

    G protein-coupled receptors (GPCRs) are cell membrane proteins with a key role in regulating the function of cells. This is the result of their ability to transmit extracellular signals, which makes them relevant for pharmacology and has led, over the last decade, to active research in the field of proteomics. The current thesis specifically targets class C of GPCRs, which are relevant in therapies for various central nervous system disorders, such as Alzheimer's disease, anxiety, Parkinson's disease and schizophrenia. The investigation of protein functionality often relies on knowledge of crystal three-dimensional (3-D) structures, which determine the receptor's ability for ligand binding responsible for the activation of certain functionalities in the protein. The structural information is therefore paramount, but it is not always known or easily unravelled, as is the case for eukaryotic cell membrane proteins such as GPCRs. In the face of the lack of information about the 3-D structure, research is often bound to the analysis of the primary amino acid sequences of the proteins, which are commonly known and available from curated databases. Much research on sequence analysis has focused on the quantitative analysis of their aligned versions, although, recently, alternative approaches using machine learning techniques for the analysis of alignment-free sequences have been proposed. In this thesis, we focus on the differentiation of class C GPCRs into functionally and structurally related subgroups based on the alignment-free analysis of their sequences using supervised classification models. In the first part of the thesis, the main topic is the construction of supervised classification models for unaligned protein sequences based on physicochemical transformations and n-gram representations of their amino acid sequences. These models are useful to assess the internal data quality of the externally labeled dataset and to manage the label noise problem from a data curation perspective. In its second part, the thesis focuses on the analysis of the sequences to discover subtype- and region-specific sequence motifs. For that, we carry out a systematic analysis of the topological sequence segments with supervised classification models and evaluate the subtype discrimination capability of each region. In addition, we apply different types of feature selection techniques to the n-gram representation of the amino acid sequence segments to find subtype- and region-specific motifs. Finally, we compare the findings of this motif search with the partially known 3-D crystallographic structures of class C GPCRs.

    Subsequence-based feature map for protein function classification

    Automated classification of proteins is indispensable for further in vivo investigation of the excessive number of unknown sequences generated by large-scale molecular biology techniques. This study describes a discriminative system based on feature space mapping, called subsequence profile map (SPMap), for functional classification of protein sequences. SPMap takes into account the information coming from the subsequences of a protein. A group of protein sequences that belong to the same level of classification is decomposed into fixed-length subsequences, which are clustered to obtain a representative feature space mapping. The mapping is defined as the distribution of the subsequences of a protein sequence over these clusters. The resulting feature space representation is used to train discriminative classifiers for functional families. The aim of this approach is to incorporate information coming from important subregions that are conserved over a family of proteins while avoiding the difficult task of explicit motif identification. The performance of the method was assessed through tests on various protein classification tasks. Our results showed that SPMap is capable of high-accuracy classification in most of these tasks. Furthermore, SPMap is fast and scalable enough to handle large datasets. © 2007 Elsevier Ltd. All rights reserved.
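    The mapping step described above (a sequence represented by how its fixed-length subsequences distribute over clusters) can be sketched in simplified form. This is not SPMap itself: the cluster representatives are assumed to be given as plain strings, assignment uses Hamming distance to the nearest representative, and all names are hypothetical. The abstract does not specify the clustering or similarity measure, so both are stand-ins.

```python
def subsequences(seq, k):
    """All overlapping windows of length k in seq."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def feature_map(seq, centers, k):
    """Fraction of seq's length-k windows assigned to each cluster center.

    Assumption: each window goes to the center with the smallest Hamming
    distance; ties go to the center listed first.
    """
    counts = [0] * len(centers)
    windows = subsequences(seq, k)
    for w in windows:
        nearest = min(range(len(centers)), key=lambda j: hamming(w, centers[j]))
        counts[nearest] += 1
    total = len(windows)
    return [c / total if total else 0.0 for c in counts]
```

    For example, with the hypothetical centers `["AAA", "CCC"]` and `k = 3`, the sequence `"AAACCC"` yields four windows split evenly between the two clusters. The resulting vector has one entry per cluster, so sequences of different lengths map to the same feature space.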

    Predicting the Types of J-Proteins Using Clustered Amino Acids

    A Balanced Secondary Structure Predictor

    Secondary structure (SS) refers to the local spatial organization of the polypeptide backbone atoms of a protein. Accurate prediction of SS is a vital clue for resolving the 3D structure of a protein. SS has three components: helix (H), beta (E) and coil (C). Most SS predictors are imbalanced: their accuracy in predicting helix and coil is high but significantly lower for beta. The objective of this thesis is to develop a balanced SS predictor that achieves good accuracy in all three SS components. We propose a novel approach to this problem that combines a genetic algorithm (GA) with a support vector machine. We prepared two test datasets (CB471 and N295) to compare the performance of our predictor with SPINE X. The overall accuracy of our predictor was 76.4% and 77.2% on the CB471 and N295 datasets, respectively, while SPINE X gave 76.5% overall accuracy on both test datasets.
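    The imbalance discussed above becomes visible when accuracy is reported per state rather than only overall. The helper below is a minimal sketch with a hypothetical name; it assumes that true and predicted structures are given as equal-length strings over the letters H, E and C, and it is not the evaluation code used in the thesis.

```python
def per_state_accuracy(true_ss, pred_ss, states="HEC"):
    """Accuracy for each SS state plus overall per-residue accuracy."""
    if len(true_ss) != len(pred_ss):
        raise ValueError("true and predicted strings must have equal length")
    result = {}
    for s in states:
        # Positions whose true state is s.
        idx = [i for i, t in enumerate(true_ss) if t == s]
        correct = sum(1 for i in idx if pred_ss[i] == s)
        result[s] = correct / len(idx) if idx else None
    matches = sum(1 for t, p in zip(true_ss, pred_ss) if t == p)
    result["overall"] = matches / len(true_ss) if true_ss else None
    return result
```

    A predictor can reach a high overall value while the E entry stays low, because beta residues are typically a minority of positions; comparing the three per-state values is one simple way to judge balance.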

    Protein function and inhibitor prediction by statistical learning approach

    Ph.D., Doctor of Philosophy