9 research outputs found

    An improved classification of G-protein-coupled receptors using sequence-derived features

    Background: G-protein-coupled receptors (GPCRs) play a key role in diverse physiological processes and are the targets of almost two-thirds of the marketed drugs. The 3D structures of GPCRs are largely unavailable; however, a large number of GPCR primary sequences are known. To facilitate the identification and characterization of novel receptors, it is therefore very valuable to develop a computational method to accurately predict GPCRs from the protein primary sequences. Results: We propose a new method called PCA-GPCR, to predict GPCRs using a comprehensive set of 1497 sequence-derived features. The principal component analysis is first employed to reduce the dimension of the feature space to 32. Then, the resulting 32-dimensional feature vectors are fed into a simple yet powerful classification algorithm, called intimate sorting, to predict GPCRs at five levels. The prediction at the first level determines whether a protein is a GPCR or a non-GPCR. If it is predicted to be a GPCR, then it will be further predicted into certain family, subfamily, sub-subfamily and subtype by the classifiers at the second, third, fourth, and fifth levels, respectively. To train the classifiers applied at five levels, a non-redundant dataset is carefully constructed, which contains 3178, 1589, 4772, 4924, and 2741 protein sequences at the respective levels. Jackknife tests on this training dataset show that the overall accuracies of PCA-GPCR at five levels (from the first to the fifth) can achieve up to 99.5%, 88.8%, 80.47%, 80.3%, and 92.34%, respectively. We further perform predictions on a dataset of 1238 GPCRs at the second level, and on another two datasets of 167 and 566 GPCRs respectively at the fourth level. The overall prediction accuracies of our method are consistently higher than those of the existing methods to be compared. Conclusions: The comprehensive set of 1497 features is believed to be capable of capturing information about amino acid composition, sequence order as well as various physicochemical properties of proteins. Therefore, high accuracies are achieved when predicting GPCRs at all the five levels with our proposed method.
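The jackknife (leave-one-out) test mentioned in the abstract can be illustrated with a minimal sketch. Two things here are assumptions rather than details from the paper: a plain 1-nearest-neighbour rule stands in for the intimate sorting classifier, whose internals the abstract does not describe, and the toy 2-D vectors stand in for the 32-dimensional PCA features. The function names are hypothetical.

```python
import math

def nearest_neighbor_label(query, samples, labels, skip=None):
    """Return the label of the sample closest to `query` (Euclidean distance)."""
    best_label, best_dist = None, math.inf
    for i, (vec, lab) in enumerate(zip(samples, labels)):
        if i == skip:  # leave the held-out sample out of the reference set
            continue
        d = math.dist(query, vec)
        if d < best_dist:
            best_label, best_dist = lab, d
    return best_label

def jackknife_accuracy(samples, labels):
    """Leave-one-out accuracy: each sample is predicted from all the others."""
    hits = sum(
        nearest_neighbor_label(vec, samples, labels, skip=i) == labels[i]
        for i, vec in enumerate(samples)
    )
    return hits / len(samples)

# Toy 2-D vectors standing in for reduced feature vectors (assumed data).
toy_vectors = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.2),
               (5.0, 5.1), (5.2, 4.9), (4.9, 5.0)]
toy_labels = ["GPCR", "GPCR", "GPCR", "non-GPCR", "non-GPCR", "non-GPCR"]
print(jackknife_accuracy(toy_vectors, toy_labels))
```

In a hierarchical setting such as the one described, the same evaluation would presumably be repeated once per level with that level's labels.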

    Simultaneous modeling of the GPCR protein family hierarchy in a single metric space using deep learning

    Master's thesis, Seoul National University Graduate School, College of Engineering, Department of Computer Science and Engineering, August 2019. Advisor: 김선. G protein-coupled receptors (GPCRs) belong to diverse families of proteins that can be defined at multiple levels. Computational modeling of GPCR families from the sequences has been performed separately at each level of family, sub-family, and sub-subfamily. However, relationships between classes are ignored in these approaches as they process the information in the sequences with a group of disconnected models. In this work, we propose a deep learning network to simultaneously learn representations in the GPCR hierarchy with a unified model and a loss term to express hierarchical relations in terms of distances in a single embedding space. The model introduces a method to learn and construct shared representations across hierarchies of the protein family. In extensive experiments, we showed that hierarchical relations between sequences are successfully captured in our model in both technical and biological aspects. First, we showed that phylogenetic information in the sequences can be inferred from the vectors by constructing a phylogenetic tree using a hierarchical clustering algorithm and by quantitatively analyzing the quality of the clustering results compared to the real label information. Second, inspection of embedding vectors is demonstrated to be an effective first step toward an analysis of GPCR proteins by showing that biologically significant sequence features can be revealed from multiple sequence alignments on clustering results on embedding vectors. Our work showed that simultaneous modeling of protein families with multiple hierarchies is possible. Contents: Abstract. Chapter I. Introduction (1.1 Background; 1.2 Motivation). Chapter II. Methods (2.1 Data Preparation: 2.1.1 Dataset, 2.1.2 Data representation; 2.2 Model architecture: 2.2.1 Feature extractor with CNN, 2.2.2 Embedding layer, 2.2.3 Output layer; 2.3 Loss function: 2.3.1 Softmax loss, 2.3.2 Center loss, 2.3.3 Overall loss; 2.4 Training procedure; 2.5 Evaluation metric: 2.5.1 Silhouette score, 2.5.2 Adjusted mutual information score). Chapter III. Results (3.1 Evaluation on hierarchical structure: 3.1.1 Preservation of distances, 3.1.2 Phylogenetic tree reconstruction, 3.1.3 Quantitative evaluation on clustering results; 3.2 Sequence analysis with embedding vectors: 3.2.1 Technical analysis, 3.2.2 Biological analysis; 3.3 Classification accuracy). Chapter IV. Conclusion. References.
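The thesis outline lists a softmax loss, a center loss, and an overall loss that combines them. As a rough illustration only, the sketch below uses the standard center loss form (half the summed squared distance from each embedding to its class center) and a weighted sum with a softmax loss value. The weight `lam`, the function names, and the toy data are assumptions, and the thesis's hierarchical variant may well differ from this.

```python
def center_loss(embeddings, labels, centers):
    """0.5 * sum of squared distances from each embedding to its class center."""
    total = 0.0
    for vec, lab in zip(embeddings, labels):
        center = centers[lab]
        total += sum((v - c) ** 2 for v, c in zip(vec, center))
    return 0.5 * total

def overall_loss(softmax_loss, embeddings, labels, centers, lam=0.1):
    """Weighted sum of a precomputed softmax loss and the center loss.

    `lam` is a hypothetical balancing weight, not a value from the thesis.
    """
    return softmax_loss + lam * center_loss(embeddings, labels, centers)

# Toy embeddings for two classes (assumed data, for illustration only).
emb = [(1.0, 0.0), (0.0, 1.0), (3.0, 3.0)]
labs = ["A", "A", "B"]
ctrs = {"A": (0.5, 0.5), "B": (3.0, 4.0)}
print(center_loss(emb, labs, ctrs))
print(overall_loss(2.0, emb, labs, ctrs, lam=0.5))
```

Pulling embeddings toward per-class centers is what makes distances in the embedding space meaningful, which is the property the clustering and tree-reconstruction evaluations rely on.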

    Effects of luteinizing hormone receptor expression levels on receptor aggregation and function

    2019 Fall. Includes bibliographical references. Luteinizing hormone receptors (LHR) are G protein-coupled receptors (GPCR) found primarily in female and male reproductive organs where they play a critical role in ovulation and sperm maturation, respectively, as well as maintenance of sex hormone production in both sexes. The role of oligomerization in LHR function is of considerable interest and not well understood. The oligomerization state of LHR has been suggested to have a significant role in signaling, desensitization and internalization of this receptor after activation by either luteinizing hormone (LH) or human chorionic gonadotropin (hCG) [2-8]. Overexpression of membrane proteins such as LHR may result in molecular crowding and may lead to increased protein oligomerization [10]. We hypothesize that LHR are present in the plasma membrane as constitutive small clusters or, alternatively, exist as dimers or a mixture of monomers and dimers in the absence of hormone and then undergo varying degrees of aggregation after binding hCG. These differences in LHR organization depend on expression levels of LHR which may, in turn, affect LHR activity. In this project, we examined the effect of LHR expression levels on receptor oligomerization using polarized homo-transfer fluorescence resonance energy transfer (homo-FRET) methods to evaluate receptor interactions in cell lines stably expressing averages of 10,000 receptors per cell, 32,000 receptors per cell, 123,000 receptors per cell or 560,000 receptors per cell. In addition, we measured levels of cyclic adenosine monophosphate (cAMP), a second messenger involved in signal transduction which is produced in response to activation of LHR. This study demonstrated that the oligomerization state of LHR depends on the expression level of LHR, i.e. the number of receptors per cell or the concentration of LHR per unit area. 
Although LHR appear as dimers or oligomers in the plasma membrane when receptor expression levels are low, it is clear that, with increased expression levels, LHR are found in larger structures exhibiting lower values for initial anisotropy. The effect of hCG binding on LHR was dependent on the expression level of receptors in the absence of bound hormone. The greatest effect of hCG occurred in cells expressing low numbers of LHR per cell where receptors were able to undergo further aggregation in response to hormone binding. The effect of hCG on highly expressed LHR was negligible; LHR, when highly expressed, were already extensively aggregated and did not undergo measurable changes in their aggregation state. Deglycosylated human chorionic gonadotropin (DG-hCG) had modest effects on cells expressing comparatively few LHR per cell since these receptors were already present in small clusters. Depletion of plasma membrane cholesterol using MβCD caused a decrease in intracellular cAMP level accompanied by a decrease in cluster size of LHR as the expression level of LHR increased. Together these results are important to our understanding of the roles that the expression levels of LHR, the oligomerization state of LHR and the plasma membrane play in LHR function. The organization of lipids in the bulk membrane and in membrane microdomains affects the ability of LHR to signal as does protein density, particularly when receptor crowding has occurred. These studies also suggest, more generally, that protein organization in the plasma membrane may function as an important pharmacological target that merits further exploration.

    A computational intelligence analysis of G protein-coupled receptor sequences for pharmacoproteomic applications

    Arguably, drug research has contributed more to the progress of medicine during the past decades than any other scientific factor. One of the main areas of drug research is related to the analysis of proteins. The world of pharmacology is becoming increasingly dependent on the advances in the fields of genomics and proteomics. This dependency brings about the challenge of finding robust methods to analyze the complex data they generate. Such a challenge invites us to go one step further than traditional statistics and resort to approaches under the conceptual umbrella of artificial intelligence, including machine learning (ML), statistical pattern recognition and soft computing methods. Sound statistical principles are essential to trust the evidence base built through the use of such approaches. Statistical ML methods are thus at the core of the current thesis. More than 50% of drugs currently available target only four key protein families, of which almost 30% correspond to the G Protein-Coupled Receptors (GPCR) superfamily. This superfamily regulates the function of most cells in living organisms and is at the centre of the investigations reported in the current thesis. Not much is known about the 3D structure of these proteins. Fortunately, plenty of information regarding their amino acid sequences is readily available. The automatic grouping and classification of GPCRs into families and these into subtypes based on sequence analysis may significantly contribute to ascertaining the pharmaceutically relevant properties of this protein superfamily. There is no biologically relevant manner of representing the symbolic sequences describing proteins using real-valued vectors. This does not preclude the possibility of analyzing them using principled methods. These may come, amongst others, from the field of statistical ML. Particularly, kernel methods can be used for this purpose. 
Moreover, the visualization of high-dimensional protein sequence data can be a key exploratory tool for finding meaningful information that might be obscured by their intrinsic complexity. That is why the objective of the research described in this thesis is twofold: first, the design of adequate visualization-oriented artificial intelligence-based methods for the analysis of GPCR sequential data, and second, the application of the developed methods in relevant pharmacoproteomic problems such as GPCR subtyping and protein alignment-free analysis.

    Reciprocal Informants: Using Fungal Bioinformatics, Genomics, and Ecology to tie Mechanisms to Ecosystems

    University of Minnesota Ph.D. dissertation. August 2019. Major: Plant and Microbial Biology. Advisors: Peter Kennedy, H Kistler. 1 computer file (PDF); viii, iv, 126 pages. Across both wild and human-structured ecosystems, fungi interact with every plant species on earth. From mycorrhizal mutualisms, harmless endophytes, and deadly pathogens, the results of these interactions can mean the difference between a plant’s ability to grow and flourish, or languish and expire. Fungal-host dynamics are not static traits, either over evolutionary time or during the lifetime of individuals where ecological context dependency shapes the outcomes of fungal-host interactions. Understanding the ecological and genetic factors that structure plant-fungal relationships has wide-ranging consequences for ecosystems, agro-ecosystems, and human health. However, it’s not well understood how complex genetic mechanisms and ecological pressures work in concert to structure the outcomes of fungal-host interactions, particularly among fungal mutualists. This dissertation contributes to this understanding by investigating how fungal-host relationships are regulated at two levels: broadly, investigating the ecology of fungal-host systems, and specifically, investigating the genetic and genomic basis of how these interactions are mediated. I begin Chapter 1 from the perspective of fungal ecology, investigating the influence of neighborhood (the surrounding plant community) on host specificity patterns using the host-specialist ectomycorrhizal (ECM) genus Suillus. The number of host species that a given fungal species will associate with, and how closely related these host species are, is the study of fungal host specificity. While some fungi associate with only a single species of host (high host specificity), most associate with tens or hundreds of host species (low host specificity). 
Fungi in the genus Suillus are famous for their high host specificity, primarily associating with plants in the family Pinaceae (particularly White Pines, Red Pines and Larches). Using a combination of field sampling, sequencing, and colonization bioassays, I present evidence that one species, S. subaureus, has undergone a novel host-expansion onto Angiosperms, and argue that neighborhood effects influence ECM colonization outcomes over both space and time. In Chapter 2, I expand from fungal ecology into fungal genomes. Using genome mining and comparative genomics, I look for signatures of ECM host specificity using 19 genome-sequenced Suillus species in relation to 1) other (non-Suillus) ECM fungi and 2) an intrageneric comparison between Suillus that specialize on Red Pine, White Pine or Larch. I present evidence for the involvement of several molecular classes in regulating Suillus host specificity including species-specific small secreted proteins, G-protein coupled receptors, and terpene secondary metabolites. Finally, in Chapter 3, I use the genomic and bioinformatic tool sets developed in Chapters 1 and 2 to expand my analysis across the fungal phylogeny and ask questions about a potential molecular correlate of fungal guild and trophic mode: ribosomal DNA (rDNA) copy number. To do this, I developed a bioinformatic pipeline to estimate rDNA copy number variation from whole genome sequence data, and applied it to a phylogenetically and ecologically diverse set of 91 fungal genomes. I present evidence that rDNA copy number is inversely associated with phylogenetic distance, but displays a high level of variation, spanning an order of magnitude in Suillus alone, with no detectable correlation to guild occupation or genome size. 
Taken together, the work presented here shows that genomic and bioinformatic approaches, used in concert with classical ecological methodologies, offer great potential to expand our understanding of the two-way influence of ecosystem-level processes and gene-level mechanisms in structuring plant-fungal interactions.

    Synthesis and Evaluation of C-10 Nitrogenated Aporphine Alkaloids at Serotonin and Dopamine Receptors

    Aporphine alkaloids, belonging to the isoquinoline class of compounds, have been investigated as a potential source of ligands for Central Nervous System (CNS) receptors. Previous research indicates that the aporphine scaffold may be manipulated to synthesize selective ligands for serotonin and dopamine receptors. Novel aporphine alkaloids containing C10 nitrogen substitutions were synthesized, and their affinities were evaluated at serotonin (5-HT1A, 5-HT1B, 5-HT2A, 5-HT7A) receptors and dopamine (D1, D2, D3, D4, and D5) receptors. Two series of racemic aporphine compounds with C10 nitrogenous functionalities were synthesized and analyzed at the aforementioned receptors. The first series of aporphine alkaloids contain C10 nitro, amine, amide, and methanesulfonamide motifs. Compounds in this C10 monosubstituted series displayed higher affinity at 5-HT1AR and 5-HT7AR and lacked affinity at 5-HT1BR and 5-HT2AR. This series contained compounds with an N6-methyl group and compounds with an N6-propyl group. The N6-methyl substituted C10 nitrogen functionalized aporphine analogs had higher binding affinities at 5-HT7AR versus 5-HT1AR. In contrast, the N6-propyl subset of compounds exhibited a reversal of this selectivity. Compound 103a was the most potent compound and behaved as an antagonist at 5-HT7AR (Ki = 4.5 ± 0.6 nM, IC50 = 1.25 μM), with 10-fold selectivity over 5-HT1AR (Ki = 49 ± 6.3 nM). These monosubstituted analogs lacked significant binding among all dopamine receptor subtypes. C10 analogs with a benzofused aminothiazole moiety showed higher affinity and selectivity for serotonin receptors as compared to the C10 monosubstituted compounds. These compounds displayed high binding affinities for 5-HT1AR and 5-HT7AR; analogs containing an N6-methyl substitution favor binding at 5-HT7AR. Among the benzofused aminothiazole analogs, compound 108a had the best binding affinity at 5-HT7AR (Ki = 6.5 ± 0.8 nM) and functions as an antagonist (IC50 = 0.26 μM). 
These benzofused aminothiazole analogs also lacked affinity for dopamine receptors. Unlike analogs in the C10 monosubstituted subset, compounds with the benzofused aminothiazole moiety with an N6-methyl substitution displayed moderate affinity for 5-HT1BR. The second series of compounds contained a C1,2,10-trisubstitution pattern on the aporphine core. The 1,2,10-trisubstituted series of compounds as a group displayed weak binding affinity at 5-HT1AR and considerably higher binding affinity at 5-HT1BR. These compounds provided moderate affinity at 5-HT2AR and 5-HT7AR. At dopamine receptors, most of the trisubstituted series of compounds failed to show affinity towards D5 receptors, suggesting a lack of tolerability at D5R for C10 N-substituted aporphines with moderate to low affinity at D1R, thus attaining D1R versus D5R selectivity. Compound 128e was the most potent D1R ligand (Ki = 58 nM) and lacked binding affinity at all other dopamine receptor subtypes. Compounds 103a, 108a, and 128e have been identified as three new lead compounds with promising pharmacodynamic properties for further tool and pharmaceutical optimization.