562 research outputs found

    Predicting receptor-ligand pairs through kernel learning

    Background: Regulation of cellular events is often initiated via extracellular signaling, which occurs when a circulating ligand interacts with one or more membrane-bound receptors. Identification of receptor-ligand pairs is thus an important and specific form of protein-protein interaction (PPI) prediction.
    Results: Given a set of disparate data sources (expression data, domain content, and phylogenetic profile), we seek to predict new receptor-ligand pairs. We create a combined kernel classifier and assess its performance against the Database of Ligand-Receptor Partners (DLRP) "golden standard" as well as the method proposed by Gertz et al. For the tgfβ family, our predictions reconstruct over 76% of the supported edges (0.76 recall and 0.67 precision) of the receptor-ligand bipartite graph defined by the DLRP "golden standard". In addition, for the tgfβ family, the combined kernel classifier improves on the Gertz et al. work by a factor of approximately 1.5: our method has an F-measure of 0.71, while that of Gertz et al. is 0.48.
    Conclusions: The prediction of receptor-ligand pairings is a difficult and complex task. We have demonstrated that kernel learning on multiple data sources provides a stronger alternative to the existing method for this task.
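The F-measure quoted above can be checked from the reported precision and recall. The sketch below assumes the abstract means the balanced F-measure (the harmonic mean of precision and recall, often written F1); the abstract itself does not state which variant was used, and the function name is illustrative.

```python
def f_measure(precision: float, recall: float) -> float:
    """Balanced F-measure: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Values reported in the abstract for the tgfβ family.
f_combined = f_measure(precision=0.67, recall=0.76)

# Ratio of the rounded F-measure to the 0.48 reported for Gertz et al.
relative_gain = round(f_combined, 2) / 0.48

print(round(f_combined, 2))     # 0.71
print(round(relative_gain, 1))  # 1.5
```

Under that assumption the numbers are mutually consistent: 0.67 precision and 0.76 recall give an F-measure of about 0.71, which is roughly 1.5 times 0.48.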

    Prediction and classification of chemokines and their receptors

    Chemokines are low-molecular-mass, cytokine-like proteins that orchestrate a wide range of immune functions, including leukocyte trafficking, T cell differentiation, angiogenesis, hematopoiesis and mast cell degranulation. Chemokines also act as HIV-1 inhibitors and as potent natural adjuvants in antitumor immunotherapy. The receptors for these molecules are all seven-pass transmembrane G-protein-coupled receptors that are intimately involved with chemokines in a wide array of physiological and pathological conditions. These receptors also play a major role as co-receptors for HIV-1 entry into target cells. Chemokine receptors have therefore proven to be excellent targets for small molecules in the pharmaceutical industry. The immense importance of chemokines and their receptors motivated us to develop a support vector machine-based method, ChemoPred, to predict this important class of proteins and further classify them into subfamilies. ChemoPred predicts chemokines and chemokine receptors with an accuracy of 95.08% and 92.19%, respectively. The overall accuracy of classifying chemokines into three subfamilies was 96.00%, and that of classifying chemokine receptors into three families was 92.87%. The ChemoPred server is freely available at www.imtech.res.in/raghava/chemopred

    Developing statistical and bioinformatic analysis of genomic data from tumours

    Previous prognostic signatures for melanoma based on tumour transcriptomic data were developed predominantly on cohorts of AJCC (American Joint Committee on Cancer) stages III and IV melanoma. Since 92% of melanoma patients are diagnosed at AJCC stages I and II, there is an urgent need for better prognostic biomarkers to allow patient stratification for receiving early adjuvant therapies. This study uses genome-wide tumour gene expression levels and clinico-histopathological characteristics of patients from the Leeds Melanoma Cohort (LMC). Several unsupervised and supervised classification approaches were applied to the transcriptomic data, to identify biological classes of melanoma, and to develop prognostic classification models respectively. Unsupervised clustering identified six biologically distinct primary melanoma classes (LMC classes). Unlike previous molecular classes of melanoma, the LMC classes were prognostic in both the whole LMC dataset and in stage I tumours. The prognostic value of the LMC classes was replicated in an independent dataset, but insufficient data were available to replicate in an AJCC stage I subset. Supervised classification using the Random Forest (RF) approach provided improved performances when adjustments were made to deal with class imbalance, while this did not improve performance of the Support Vector Machine (SVM). However, RF and SVM had similar results overall, with RF only marginally better. Combining clinical and transcriptomic information in the RF further improved the performance of the prediction model in comparison to using clinical information alone. Finally, the agnostically derived LMC classes and the supervised RF model showed convergence in their association with outcome in some groups of patients, but not in others. 
In conclusion, this study reports six molecular classes of primary melanoma that have prognostic value both in stage I disease and overall, and a prognostic classification model that predicts outcome in primary melanoma.

    A genotypic method for determining HIV-2 coreceptor usage enables epidemiological studies and clinical decision support

    Background: CCR5-coreceptor antagonists can be used for treating HIV-2 infected individuals. Before initiating treatment with coreceptor antagonists, viral coreceptor usage should be determined to ensure that the virus can use only the CCR5 coreceptor (R5) and cannot evade the drug by using the CXCR4 coreceptor (X4-capable). However, until now, no online tool for the genotypic identification of HIV-2 coreceptor usage had been available. Furthermore, there is a lack of knowledge on the determinants of HIV-2 coreceptor usage. Therefore, we developed a data-driven web service for the prediction of HIV-2 coreceptor usage from the V3 loop of the HIV-2 glycoprotein and used the tool to identify novel discriminatory features of X4-capable variants. Results: Using 10 runs of tenfold cross validation, we selected a linear support vector machine (SVM) as the model for geno2pheno[coreceptor-hiv2], because it outperformed the other SVMs with an area under the ROC curve (AUC) of 0.95. We found that SVMs were highly accurate in identifying HIV-2 coreceptor usage, attaining sensitivities of 73.5% and specificities of 96% during tenfold nested cross validation. The predictive performance of SVMs was not significantly different (p value 0.37) from an existing rules-based approach. Moreover, geno2pheno[coreceptor-hiv2] achieved a predictive accuracy of 100% and outperformed the existing approach on an independent data set containing nine new isolates with corresponding phenotypic measurements of coreceptor usage. geno2pheno[coreceptor-hiv2] could not only reproduce the established markers of CXCR4-usage, but also revealed novel markers: the substitutions 27K, 15G, and 8S were significantly predictive of CXCR4 usage. Furthermore, SVMs trained on the amino-acid sequences of the V1 and V2 loops were also quite accurate in predicting coreceptor usage (AUCs of 0.84 and 0.65, respectively). 
Conclusions: In this study, we developed geno2pheno[coreceptor-hiv2], the first online tool for the prediction of HIV-2 coreceptor usage from the V3 loop. Using our method, we identified novel amino-acid markers of X4-capable variants in the V3 loop and found that HIV-2 coreceptor usage is also influenced by the V1/V2 region. The tool can aid clinicians in deciding whether coreceptor antagonists such as maraviroc are a treatment option and enables epidemiological studies investigating HIV-2 coreceptor usage. geno2pheno[coreceptor-hiv2] is freely available at http://coreceptor-hiv2.geno2pheno.org
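The evaluation protocol described above (10 runs of tenfold cross validation, reported as sensitivity and specificity) can be sketched as follows. This is a minimal illustration, not the geno2pheno[coreceptor-hiv2] code: the function names are hypothetical, the splits are not stratified, the nested (inner) loop used for model selection is omitted, and treating X4-capable as the positive class (label 1) is an assumption.

```python
import random


def repeated_kfold(n_samples, n_folds=10, n_runs=10, seed=0):
    """Yield (run, fold, train_idx, test_idx) for repeated k-fold CV."""
    rng = random.Random(seed)
    for run in range(n_runs):
        order = list(range(n_samples))
        rng.shuffle(order)  # a fresh random partition for every run
        folds = [order[i::n_folds] for i in range(n_folds)]
        for k, test_idx in enumerate(folds):
            train_idx = [i for j, fold in enumerate(folds) if j != k for i in fold]
            yield run, k, train_idx, test_idx


def sensitivity_specificity(y_true, y_pred):
    """Sensitivity and specificity for binary labels (1 = positive class)."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    return sensitivity, specificity


# 10 runs x 10 folds gives 100 train/test splits over 50 toy samples.
splits = list(repeated_kfold(n_samples=50))
print(len(splits))  # 100
```

In a full pipeline, a classifier would be fitted on `train_idx` and scored on `test_idx` for each split, with the per-split sensitivities and specificities averaged at the end.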

    Computational approaches for improving treatment and prevention of viral infections

    The treatment of infections with HIV or HCV is challenging. Thus, novel drugs and new computational approaches that support the selection of therapies are required. This work presents methods that support therapy selection as well as methods that advance novel antiviral treatments. geno2pheno[ngs-freq] identifies drug resistance from HIV-1 or HCV samples that were subjected to next-generation sequencing by interpreting their sequences either via support vector machines or a rules-based approach. geno2pheno[coreceptor-hiv2] determines the coreceptor that is used for viral cell entry by analyzing a segment of the HIV-2 surface protein with a support vector machine. openPrimeR is capable of finding optimal combinations of primers for multiplex polymerase chain reaction by solving a set cover problem and accessing a new logistic regression model for determining amplification events arising from polymerase chain reaction. geno2pheno[ngs-freq] and geno2pheno[coreceptor-hiv2] enable the personalization of antiviral treatments and support clinical decision making. The application of openPrimeR on human immunoglobulin sequences has resulted in novel primer sets that improve the isolation of broadly neutralizing antibodies against HIV-1. The methods that were developed in this work thus constitute important contributions towards improving the prevention and treatment of viral infectious diseases.
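The primer-selection task above is framed as a set cover problem: choose a small set of primers such that every template is amplified by at least one of them. The sketch below uses the standard greedy approximation, which is not necessarily what openPrimeR does (the abstract does not say how the problem is solved, and an exact solver would be needed for truly optimal sets). The primer and template names are made up for illustration.

```python
def greedy_set_cover(universe, candidates):
    """Greedy approximation to set cover.

    candidates maps a name (here: a primer) to the set of elements
    (here: templates) it covers. Returns the chosen names and any
    elements that no candidate can cover.
    """
    uncovered = set(universe)
    chosen = []
    while uncovered:
        # Pick the candidate that covers the most still-uncovered elements.
        best = max(candidates, key=lambda name: len(candidates[name] & uncovered))
        gained = candidates[best] & uncovered
        if not gained:
            break  # the remaining elements cannot be covered at all
        chosen.append(best)
        uncovered -= gained
    return chosen, uncovered


# Hypothetical coverage data: which templates each primer would amplify.
templates = {"t1", "t2", "t3", "t4", "t5"}
primers = {
    "p1": {"t1", "t2", "t3"},
    "p2": {"t2", "t4"},
    "p3": {"t3", "t4"},
    "p4": {"t4", "t5"},
}

chosen, uncovered = greedy_set_cover(templates, primers)
print(chosen)     # ['p1', 'p4']
print(uncovered)  # set()
```

In the setting the abstract describes, the coverage sets would come from a model that predicts whether a primer amplifies a given template, which is the role it assigns to the logistic regression model.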

    Protein sequences classification by means of feature extraction with substitution matrices

    Background: This paper deals with the preprocessing of protein sequences for supervised classification. Motif extraction is one way to address that task. It has been widely used to encode biological sequences into feature vectors so that well-known machine-learning classifiers, which require this format, can be applied. However, designing a suitable feature space for a set of proteins is not a trivial task. For this purpose, we propose a novel encoding method that uses amino-acid substitution matrices to define similarity between motifs during the extraction step.
    Results: To demonstrate the efficiency of this approach, we compare several encoding methods using a number of machine-learning classifiers. The experimental results show that our encoding method outperforms the others in terms of classification accuracy and number of generated attributes. We also compared the classifiers in terms of accuracy. The results indicate that SVM generally outperforms the other classifiers with any encoding method. We show that SVM, coupled with our encoding method, can be an efficient protein classification system. In addition, we studied the effect of varying the substitution matrix on the quality of our method and hence on classification quality. Our method achieves good classification accuracy with all the substitution matrices, and the variance of the accuracies obtained across substitution matrices is small. However, the number of generated features varies from one substitution matrix to another. Furthermore, the use of previously published datasets allowed us to carry out a comparison with several related works.
    Conclusions: The outcomes of our comparative experiments confirm the efficiency of our encoding method for representing protein sequences in classification tasks.

    Exploring the potential of Spherical Harmonics and PCVM for compounds activity prediction

    Biologically active chemical compounds may provide remedies for several diseases. Meanwhile, machine learning techniques applied to drug discovery, which are cheaper and faster than wet-lab experiments, can identify molecules with the expected pharmacological activity more effectively. It is therefore urgent and essential to develop more representative descriptors and reliable classification methods to accurately predict molecular activity. In this paper, we investigate the potential of a novel representation based on Spherical Harmonics fed into a Probabilistic Classification Vector Machines classifier, namely SHPCVM, for the compound activity prediction task. We make use of representation learning to acquire features that describe the molecules as precisely as possible. To verify the performance of SHPCVM, ten-fold cross-validation tests are performed on twenty-one G protein-coupled receptors (GPCRs). Experimental outcomes (accuracy of 0.86), assessed by classification accuracy, precision, recall, Matthews' Correlation Coefficient and Cohen's kappa, reveal that combining our relatively short Spherical Harmonics-based representation with Probabilistic Classification Vector Machines can achieve very satisfactory performance for GPCRs.
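Two of the metrics named above, Matthews' Correlation Coefficient and Cohen's kappa, are less familiar than accuracy, so a small sketch of how both follow from a binary confusion matrix may help. The counts in the example are made up for illustration and are not results from the paper.

```python
import math


def matthews_corrcoef(tp, fp, fn, tn):
    """Matthews' Correlation Coefficient from binary confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


def cohen_kappa(tp, fp, fn, tn):
    """Cohen's kappa: agreement between labels and predictions beyond chance."""
    n = tp + fp + fn + tn
    observed = (tp + tn) / n
    # Chance agreement from the marginal totals of each class.
    expected = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / n ** 2
    return (observed - expected) / (1 - expected) if expected != 1 else 0.0


# Hypothetical counts: 100 compounds, balanced classes, 80% accuracy.
counts = dict(tp=40, fp=10, fn=10, tn=40)
print(matthews_corrcoef(**counts))
print(cohen_kappa(**counts))
```

For these balanced counts both statistics come out at about 0.6, noticeably lower than the 0.8 accuracy, because each discounts the agreement expected by chance.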

    Computational Analysis of RNAi Screening Data to Identify Host Factors Involved in Viral Infection and to Characterize Protein-Protein Interactions

    The study of gene functions across different treatments, cell lines and organisms has been facilitated by RNA interference (RNAi) technology, which tracks the phenotype of cells after particular genes are silenced. In this thesis, I describe two computational approaches developed to analyze the image data from two different RNAi screens. Firstly, I developed an alternative approach to detect host factors (human proteins) that support virus growth and replication in cells infected with the Hepatitis C virus (HCV). To identify the human proteins that are crucial for the efficiency of viral infection, several RNAi experiments on virus-infected cells have been conducted. However, the target lists from different laboratories have shown little overlap. This inconsistency might be caused not only by experimental discrepancies, but also by possibilities of the data analysis that have not been fully explored. Observing only viral intensity readouts from the experiments might be insufficient. In this project, I describe our computational development as a new alternative approach to improve the reliability of host factor identification. Our approach is based on characterizing the clustering of infected cells. The idea is that viral infection is spread by cell-cell contacts, or is at least favored by the proximity of cells. Clustering of HCV-infected cells is therefore observed as the infection spreads. We developed a clustering detection method based on a distance-based point pattern analysis (K-function) to identify knockdown genes for which the clusters of HCV-infected cells were reduced. The approach separated positive and negative controls significantly, and the clustering score correlated well with the intensity readouts from the experimental screens. Compared with another clustering approach, Quadrat analysis, the K-function method was superior.
Statistical normalization approaches were used to identify protein targets from our clustering-based approach and the experimental screens. Integrating the results from our clustering method, the intensity readout analysis and a secondary screen, we finally identified five promising host factors that are suitable candidate targets for drug therapy. Secondly, a machine learning-based approach was developed to characterize protein-protein interactions (PPIs) in a signaling network. The characterization of each PPI is fundamental to our understanding of the complex signaling system of a human cell. Experiments for PPI identification, such as yeast two-hybrid and FRET analysis, are resource-intensive. Computational approaches for analysing large-scale RNAi knockdown screens, which infer functional similarities from the phenotypic similarities of the down-regulated proteins, have therefore become an important pursuit. However, these methods did not provide a more detailed characterization of the PPIs. In this project, I developed a new computational approach based on a machine learning technique that employs the mitotic phenotypes of an RNAi screen. It enables the identification of the nature of a PPI, i.e., whether it is activating or inhibiting. We established a systematic classification using Support Vector Machines (SVMs) that was based on the phenotypic descriptors and used it to classify the interactions that activate or inhibit signal transduction. The machines yielded promising results with good performance when integrating different sets of published descriptors and our own descriptors, calculated from fractions of specific phenotypes, linear classification of phenotypes, and phenotypic distance to distinct proteins. A comprehensive model generated from the machines was used for further predictions.
We investigated the nature of pairs of interacting proteins and generated a consistency score that enhanced the precision of the classification results. We predicted the activating or inhibiting nature of 214 PPIs in signaling pathways with high confidence, which enabled the identification of a new subgroup of chemokine receptors. These findings might facilitate an enhanced understanding of the cellular mechanisms during inflammation and immunologic responses. In summary, two computational approaches were developed to analyze the image data of the different RNAi screens: 1) a clustering-based approach was used to identify the host factors that are crucial for HCV infection; and 2) a machine learning-based approach with various descriptors was employed to characterize PPI activities. The results from the host factor analysis revealed novel target proteins that are involved in the spread of HCV. In addition, the results of the characterization of the PPIs led to a better understanding of the signaling pathways. The two large-scale RNAi data sets were successfully analyzed by our established approaches to obtain new insights into virus biology and cellular signaling.
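The clustering score in the first part of this thesis rests on the K-function (Ripley's K), a distance-based point pattern statistic. A minimal sketch follows, assuming cell positions are 2D coordinates in an image of known area; it uses the naive estimator without edge correction, which is a simplification and not necessarily the estimator used in the thesis. The coordinates are made up.

```python
import math


def ripley_k(points, radius, area):
    """Naive Ripley's K estimate (no edge correction).

    K(r) = area / (n * (n - 1)) * number of ordered pairs (i, j), i != j,
    whose distance is at most r. Under complete spatial randomness,
    K(r) is approximately pi * r**2; larger values indicate clustering.
    """
    n = len(points)
    if n < 2:
        return 0.0
    close_pairs = 0
    for i, (xi, yi) in enumerate(points):
        for j, (xj, yj) in enumerate(points):
            if i != j and math.hypot(xi - xj, yi - yj) <= radius:
                close_pairs += 1
    return area * close_pairs / (n * (n - 1))


# Hypothetical positions of infected cells in a 10 x 10 field of view.
clustered = [(1.0, 1.0), (1.2, 1.0), (1.0, 1.2), (1.2, 1.2)]
dispersed = [(1.0, 1.0), (9.0, 1.0), (1.0, 9.0), (9.0, 9.0)]

print(ripley_k(clustered, radius=1.0, area=100.0))  # 100.0
print(ripley_k(dispersed, radius=1.0, area=100.0))  # 0.0
print(math.pi * 1.0 ** 2)  # reference value under spatial randomness
```

A knockdown that reduces clustering of infected cells would move the estimate from well above the random-pattern reference toward or below it, which is the kind of signal the clustering score is described as capturing.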

    Protein Tertiary Model Assessment Using Granular Machine Learning Techniques

    The automatic prediction of a protein's three-dimensional structure from its amino acid sequence has become one of the most important and most researched fields in bioinformatics. Because models are predictions rather than experimental structures determined with known accuracy, it is vital to estimate model quality. We attempt to solve this problem using machine learning techniques and information from both the sequence and the structure of the protein. The goal is to build a machine that learns from structures in the Protein Data Bank (PDB) and, when given a new model, predicts whether it belongs to the same class as the PDB structures (correct or incorrect protein models). Different subsets of the PDB are considered for evaluating the prediction potential of the machine learning methods. Here we show two such machines, one using support vector machines (SVM) and another using fuzzy decision trees (FDT). First, using a preliminary encoding style, SVM reached around 70% accuracy in protein model quality assessment, and an improved Fuzzy Decision Tree (IFDT) reached above 80% accuracy. To reduce computational overhead, a multiprocessor environment and a basic feature selection method are used in the SVM-based learning algorithm. Next, an enhanced scheme is introduced using a new encoding style. In the new style, information such as the amino acid substitution matrix, polarity, secondary structure and relative distance between alpha carbon atoms is collected by spatially traversing the 3D structure to form training vectors. This ensures that the properties of alpha carbon atoms that are close together in 3D space, and thus interacting, are used in vector formation. With the fuzzy decision tree, we obtained a training accuracy of around 90%. This is a significant improvement over the previous encoding technique in both prediction accuracy and execution time.
This outcome motivates continued exploration of effective machine learning algorithms for accurate protein model quality assessment. Finally, these machines are tested using CASP8 and CASP9 templates and compared with other CASP competitors, with promising results. We further discuss the importance of model quality assessment and other protein information that could be considered for the same purpose.