373 research outputs found

    Exploratory visualization of misclassified GPCRs from their transformed unaligned sequences using manifold learning techniques

    Class C G-protein-coupled receptors (GPCRs) are cell membrane proteins of great relevance to biology and pharmacology. Previous research has revealed an upper boundary on the accuracy that can be achieved in their classification into subtypes from the unaligned transformation of their sequences. To investigate this, we focus on sequences that have been misclassified using supervised methods. These are visualized, using a nonlinear dimensionality reduction technique and phylogenetic trees, and then characterized against the rest of the data and, particularly, against the remaining cases of their own subtype. This should help to discriminate between different types of misclassification and to build hypotheses about database quality problems and the extent to which GPCR sequence transformations limit subtype discriminability. The reported experiments provide a proof of concept for the proposed method. Postprint (published version)

    A computational intelligence analysis of G protein-coupled receptor sequences for pharmacoproteomic applications

    Arguably, drug research has contributed more to the progress of medicine during the past decades than any other scientific factor. One of the main areas of drug research is related to the analysis of proteins. The world of pharmacology is becoming increasingly dependent on the advances in the fields of genomics and proteomics. This dependency brings about the challenge of finding robust methods to analyze the complex data they generate. Such a challenge invites us to go one step further than traditional statistics and resort to approaches under the conceptual umbrella of artificial intelligence, including machine learning (ML), statistical pattern recognition and soft computing methods. Sound statistical principles are essential to trust the evidence base built through the use of such approaches. Statistical ML methods are thus at the core of the current thesis. More than 50% of drugs currently available target only four key protein families, of which almost 30% correspond to the G Protein-Coupled Receptors (GPCR) superfamily. This superfamily regulates the function of most cells in living organisms and is at the centre of the investigations reported in the current thesis. Not much is known about the 3D structure of these proteins. Fortunately, plenty of information regarding their amino acid sequences is readily available. The automatic grouping and classification of GPCRs into families, and of these into subtypes, based on sequence analysis may significantly contribute to ascertaining the pharmaceutically relevant properties of this protein superfamily. There is no biologically relevant manner of representing the symbolic sequences describing proteins using real-valued vectors. This does not preclude the possibility of analyzing them using principled methods. These may come, amongst others, from the field of statistical ML. In particular, kernel methods can be used for this purpose.
Moreover, the visualization of high-dimensional protein sequence data can be a key exploratory tool for finding meaningful information that might be obscured by their intrinsic complexity. That is why the objective of the research described in this thesis is twofold: first, the design of adequate visualization-oriented artificial intelligence-based methods for the analysis of GPCR sequential data, and second, the application of the developed methods in relevant pharmacoproteomic problems such as GPCR subtyping and protein alignment-free analysis.

    Visual Characterization of Misclassified Class C GPCRs through Manifold-based Machine Learning Methods


    Bayesian nonparametric clusterings in relational and high-dimensional settings with applications in bioinformatics.

    Recent advances in high throughput methodologies offer researchers the ability to understand complex systems via high dimensional and multi-relational data. One example is the realm of molecular biology where disparate data (such as gene sequence, gene expression, and interaction information) are available for various snapshots of biological systems. This type of high dimensional and multirelational data allows for unprecedented detailed analysis, but also presents challenges in accounting for all the variability. High dimensional data often has a multitude of underlying relationships, each represented by a separate clustering structure, where the number of structures is typically unknown a priori. To address the challenges faced by traditional clustering methods on high dimensional and multirelational data, we developed three feature selection and cross-clustering methods: 1) infinite relational model with feature selection (FIRM) which incorporates the rich information of multirelational data; 2) Bayesian Hierarchical Cross-Clustering (BHCC), a deterministic approximation to Cross Dirichlet Process mixture (CDPM) and to cross-clustering; and 3) randomized approximation (RBHCC), based on a truncated hierarchy. An extension of BHCC, Bayesian Congruence Measuring (BCM), is proposed to measure incongruence between genes and to identify sets of congruent loci with identical evolutionary histories. We adapt our BHCC algorithm to the inference of BCM, where the intended structure of each view (congruent loci) represents consistent evolutionary processes. We consider an application of FIRM on categorizing mRNA and microRNA. The model uses latent structures to encode the expression pattern and the gene ontology annotations. 
We also apply FIRM to recover the categories of ligands and proteins, and to predict unknown drug-target interactions, where latent categorization structure encodes drug-target interaction, chemical compound similarity, and amino acid sequence similarity. BHCC and RBHCC are shown to have improved predictive performance (both in terms of cluster membership and missing value prediction) compared to traditional clustering methods. Our results suggest that these novel approaches to integrating multi-relational information have a promising future in the biological sciences where incorporating data related to varying features is often regarded as a daunting task

    Comparison of digestion and particle-associated bacteria after in situ incubation of different barley varieties in the rumen of cattle

    The chemical composition of barley grain, including the structure of starch, can vary among barley varieties and result in different digestion efficiencies. It is not known if compositional differences in barley can affect the particle-associated bacteria (PAB) involved in digestion. Therefore, the objective of this study was to characterize the in situ rumen digestion and PAB of four barley grain varieties. Three ruminally-cannulated heifers were fed a low grain (60% barley silage, 37% barley grain and 3% supplement) or high grain (37% barley silage, 60% barley grain and 3% supplement) diet. Four different barley varieties (Fibar, Xena, McGwire and Hilose) and corn as a control were included in the experiment. A series of rumen incubations was carried out. One set of bags (3 heifers x 3 bags/time point/treatment; n=9) containing 3 g of ground grain was used to estimate dry matter (DM), starch and crude protein (CP) disappearance. A second set of bags (2 heifers x 3 bags/time point/treatment; n=6) containing 5 g of ground grain was incubated and used for DNA extraction. A third set of bags (2 heifers x 2 bags/time point/treatment; n=4) containing 5 g of ground grain was incubated and examined using scanning electron microscopy (SEM). The same two heifers were used for DNA and SEM bags. Bags to estimate nutrient digestion were incubated for 0, 2, 4, 12, 24 and 48 h, and bags for DNA extraction and SEM for 2, 4 and 12 h. DNA was extracted to characterize PAB via 16S rRNA gene sequencing followed by analysis using QIIME. In the low grain diet, McGwire had the highest effective degradability (ED) of DM (P<0.01), followed by Xena, Fibar, Hilose, and corn, respectively. The ED of starch was highest (P<0.01) for Fibar, McGwire, and Xena, followed by Hilose and corn, while the ED of protein was lower for corn than for the barley grains. For the high grain diet, Fibar and McGwire had the highest ED of DM (P<0.01), followed by Fibar, Hilose and corn, respectively.
The ED of starch was highest (P<0.01) for Xena and Fibar, followed by McGwire, Hilose and corn. The ED of protein was highest (P<0.01) for Fibar (55.0%) and lowest for corn (32.0%). Barley variety did not affect the relative abundance of phyla, but abundance did differ with incubation time in both the low and high grain diets. However, after 12 h of incubation the diversity of bacteria differed from that after 2 and 4 h of incubation in the rumen with both diets. Lactobacillus (approximately 80%) dominated after 12 h of incubation when cattle were fed the low grain diet, with both Prevotella and Lactobacillus being the most abundant genera after 12 h of incubation with the high grain diet. This study found that the diversity of PAB on barley grain was not affected by barley variety, despite there being differences in digestion kinetics. However, time affected PAB, illustrating that the bacterial biofilm involved in the digestion of grains clearly undergoes compositional shifts during ruminal digestion. Moreover, digestibility and the bacterial biofilm differed between corn and barley. This is probably because corn and barley differ in their endosperm structure, especially with regard to the protein matrix, which could affect digestibility and the formation of grain-associated bacterial biofilm

    Cross-species network and transcript transfer

    Metabolic processes, signal transduction, gene regulation, as well as gene and protein expression are largely controlled by biological networks. High-throughput experiments allow the measurement of a wide range of cellular states and interactions. However, networks are often not known in detail for specific biological systems and conditions. Gene and protein annotations are often transferred from model organisms to the species of interest. Therefore, the question arises whether biological networks can be transferred between species or whether they are specific for individual contexts. In this thesis, the following aspects are investigated: (i) the conservation and (ii) the cross-species transfer of eukaryotic protein-interaction and gene regulatory (transcription factor-target) networks, as well as (iii) the conservation of alternatively spliced variants. In the simplest case, interactions can be transferred between species, based solely on the sequence similarity of the orthologous genes. However, such a transfer often results either in the transfer of only a few interactions (medium/high sequence similarity threshold) or in the transfer of many speculative interactions (low sequence similarity threshold). Thus, advanced network transfer approaches also consider the annotations of orthologous genes involved in the interaction transfer, as well as features derived from the network structure, in order to enable a reliable interaction transfer, even between phylogenetically very distant species. In this work, such an approach for the transfer of protein interactions is presented (COIN). COIN uses a sophisticated machine-learning model in order to label transferred interactions as either correctly transferred (conserved) or as incorrectly transferred (not conserved).
The comparison and the cross-species transfer of regulatory networks are more difficult than the transfer of protein interaction networks, as a large fraction of the known regulations is only described in the (not machine-readable) scientific literature. In addition, compared to protein interactions, only a few conserved regulations are known, and regulatory elements appear to be strongly context-specific. In this work, the cross-species analysis of regulatory interaction networks is enabled with software tools and databases for global (ConReg) and thousands of context-specific (CroCo) regulatory interactions that are derived and integrated from the scientific literature, binding site predictions and experimental data. Genes and their protein products are the main players in biological networks. However, to date, the fact that a gene can encode different proteins has been largely neglected. These alternative proteins can differ strongly from each other with respect to their molecular structure, function and their role in networks. The identification of conserved and species-specific splice variants and the integration of variants in network models will allow a more complete cross-species transfer and comparison of biological networks. With ISAR we support the cross-species transfer and comparison of alternative variants by introducing a gene-structure aware (i.e. exon-intron structure aware) multiple sequence alignment approach for variants from orthologous and paralogous genes. The methods presented here and the appropriate databases allow the cross-species transfer of biological networks, the comparison of thousands of context-specific networks, and the cross-species comparison of alternatively spliced variants.
Thus, they can be used as a starting point for the understanding of regulatory and signaling mechanisms in many biological systems.
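
The simplest transfer strategy mentioned in the abstract above, carrying an interaction over to another species based only on the sequence similarity of orthologous genes, can be sketched as follows. This is an illustrative sketch only and not the COIN method: the function name `transfer_interactions`, the shape of the ortholog mapping, and the use of sequence identity as the similarity score are all assumptions made for the example.

```python
def transfer_interactions(interactions, orthologs, min_identity):
    """Transfer pairwise interactions to a target species.

    interactions: iterable of (gene_a, gene_b) pairs in the source species.
    orthologs: dict mapping a source gene to (target_gene, identity), with
        identity in [0, 1] (hypothetical input format).
    min_identity: similarity threshold that both orthologs must reach.
    """
    transferred = set()
    for a, b in interactions:
        # An interaction can only be transferred if both partners have an ortholog.
        if a not in orthologs or b not in orthologs:
            continue
        target_a, identity_a = orthologs[a]
        target_b, identity_b = orthologs[b]
        # Require the weaker of the two orthology links to pass the threshold.
        if min(identity_a, identity_b) >= min_identity:
            transferred.add(tuple(sorted((target_a, target_b))))
    return transferred


source_interactions = [("A", "B"), ("B", "C")]
ortholog_map = {"A": ("a", 0.9), "B": ("b", 0.8), "C": ("c", 0.3)}

strict = transfer_interactions(source_interactions, ortholog_map, 0.5)
lenient = transfer_interactions(source_interactions, ortholog_map, 0.2)
```

With the toy data, the stricter threshold keeps only the well-supported pair, while the lenient one also admits the pair that depends on a weak ortholog, which mirrors the trade-off between few and speculative transfers described in the abstract.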

    Homology sequence analysis using GPU acceleration

    A number of problems in the fields of bioinformatics, systems biology and computational biology require abstracting physical entities to mathematical or computational models. In such studies, the computational paradigms often involve algorithms that can be solved by the Central Processing Unit (CPU). Historically, those algorithms have benefited from advances in the serial processing capabilities of individual CPU cores. However, the growth has slowed down over recent years, as scaling out CPU has been shown to be both cost-prohibitive and insecure. To overcome this problem, parallel computing approaches that employ the Graphics Processing Unit (GPU) have gained attention as complementing or replacing traditional CPU approaches. The premise of this research is to investigate the applicability of various parallel computing platforms to several problems in the detection and analysis of homology in biological sequences. I hypothesize that by exploiting the sheer amount of computation power and sequencing data, it is possible to deduce information from raw sequences without supplying the underlying prior knowledge to come up with an answer. I have developed such tools to perform analysis at scales that are traditionally unattainable with general-purpose CPU platforms. I have developed a method to accelerate sequence alignment on the GPU, and I used the method to investigate whether the Operational Taxonomic Unit (OTU) classification problem can be improved with such a sheer amount of computational power. I have developed a method to accelerate pairwise k-mer comparison on the GPU, and I used the method to further develop PolyHomology, a framework to scaffold shared sequence motifs across large numbers of genomes to illuminate the structure of the regulatory network in yeasts.
The results suggest that such an approach to heterogeneous computing could help to answer questions in biology and is a viable path to new discoveries in the present and the future. Includes bibliographical references
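
For readers unfamiliar with pairwise k-mer comparison, the idea can be illustrated with a small CPU-only sketch. It is not the GPU method of the thesis: the Jaccard index is chosen here purely as an assumed similarity measure, and the names `kmers` and `kmer_jaccard` are hypothetical.

```python
def kmers(seq, k):
    """Return the set of all length-k substrings (k-mers) of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def kmer_jaccard(a, b, k=3):
    """Jaccard similarity between the k-mer sets of two sequences."""
    ka, kb = kmers(a, k), kmers(b, k)
    union = ka | kb
    if not union:
        # Both sequences are shorter than k, so there is nothing to compare.
        return 0.0
    return len(ka & kb) / len(union)


identical = kmer_jaccard("ACGT", "ACGT", k=2)
partial = kmer_jaccard("ACGT", "ACGA", k=2)
disjoint = kmer_jaccard("ACGT", "TTTT", k=2)
```

Each pair of sequences is scored independently of every other pair, which is the property that makes this kind of all-against-all comparison a natural candidate for parallel hardware.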

    Characterization of Acyltransferases and WRINKLED Orthologs Involved in TAG Biosynthesis in Avocado

    Triacylglycerols (TAG) or storage oils in plants are utilized by humans for nutrition and for the production of biomaterials and fuels. Since nonseed tissues comprise the bulk biomass, it is pertinent to understand how to improve their TAG content. Typically, the final step in TAG biosynthesis is catalyzed by diacylglycerol (DAG) acyltransferases (DGAT) and/or phospholipid:diacylglycerol acyltransferases (PDAT), which also determine the content and composition of TAG. Besides enzymatic regulation of TAG synthesis, transcription factors such as WRINKLED1 (WRI1) play a critical role during fatty acid synthesis. In this study, mesocarp of Persea americana, with > 60% TAG by dry weight and oleic acid as the major constituent, was used as a model system to explore TAG synthesis in nonseed tissues. Based on the transcriptome data of avocado, it was hypothesized that both DGAT and PDAT are likely to catalyze the conversion of DAG to TAG, and that orthologs of WRI1 transcription factors regulate fatty acid biosynthesis. Here, with comprehensive in silico analyses, putative PamDGAT1 and 2 (Pam; Persea americana), PamPDAT1, and PamWRI1 and 2 were identified. When the acyltransferases were expressed in a TAG-deficient mutant yeast strain (H1246), only DGAT1 restored TAG synthesis capacity, with a preference for oleic acid. However, in planta, when transiently expressed in Nicotiana benthamiana leaves, PamDGAT1, PamPDAT1, PamWRI1, and PamWRI2 increased lipid contents, whereas PamDGAT2 remained inactive. The data reveal that putative PamDGAT1 and PamPDAT1 are functional and preferred acyltransferases in avocado and that both PamWRI1 and 2 regulate fatty acid synthesis. In conclusion, while nonseed tissue of a basal angiosperm has certain distinct regulatory features, DAG to TAG conversion remains highly conserved

    Unipept: computational exploration of metaproteome data


    Description of two cultivated and two uncultivated new Salinibacter species, one named following the rules of the bacteriological code: Salinibacter grassmerensis sp. nov.; and three named following the rules of the SeqCode: Salinibacter pepae sp. nov., Salinibacter abyssi sp. nov., and Salinibacter pampae sp. nov.

    Current -omics methods allow the collection of a large amount of information that helps in describing the microbial diversity in nature. Here, and as a result of a culturomic approach that rendered the collection of thousands of isolates from 5 different hypersaline sites (in Spain, USA and New Zealand), we obtained 21 strains that represent two new Salinibacter species. For these species we propose the names Salinibacter pepae sp. nov. and Salinibacter grassmerensis sp. nov. (showing average nucleotide identity (ANI) values < 95.09% and 87.08% with Sal. ruber M31T, respectively). Metabolomics revealed species-specific discriminative profiles. Sal. ruber strains were distinguished by a higher percentage of polyunsaturated fatty acids and specific N-functionalized fatty acids; and Sal. altiplanensis was distinguished by an increased number of glycosylated molecules. Based on sequence characteristics and inferred phenotype of metagenome-assembled genomes (MAGs), we describe two new members of the genus Salinibacter. These species dominated in different sites and always coexisted with Sal. ruber and Sal. pepae. Based on the MAGs from three Argentinian lakes in the Pampa region of Argentina and the MAG of the Romanian lake Fără Fund, we describe the species Salinibacter pampae sp. nov. and Salinibacter abyssi sp. nov. respectively (showing ANI values 90.94% and 91.48% with Sal. ruber M31T, respectively). Sal. grassmerensis sp. nov. name was formed according to the rules of the International Code for Nomenclature of Prokaryotes (ICNP), and Sal. pepae, Sal. pampae sp. nov. and Sal. abyssi sp. nov. are proposed following the rules of the newly published Code of Nomenclature of Prokaryotes Described from Sequence Data (SeqCode). 
This work constitutes an example of how classification under ICNP and SeqCode can coexist, and how the official naming of a cultivated organism for which the deposit in public repositories is difficult finds an intermediate solution. This study was funded by the Spanish Ministry of Science, Innovation and Universities projects PGC2018-096956-B-C41, RTC-2017-6405-1 and PID2021-126114NB-C42, which were also supported by the European Regional Development Fund (FEDER). RRM acknowledges the financial support of the sabbatical stay at Georgia Tech and HelmholzZentrum München by the grants PRX18/00048 and PRX21/00043 respectively, also from the Spanish Ministry of Science, Innovation and Universities. This research was carried out within the framework of the activities of the Spanish Government through the “Maria de Maeztu Centre of Excellence” accreditation to IMEDEA (CSIC-UIB) (CEX2021-001198). KTK’s research was supported, in part, by the U.S. National Science Foundation (Award No. 1831582 and No. 2129823). IMG, AC and HLB were financially supported by a grant of the Ministry of Research, Innovation and Digitization, CNCS/CCCDI – UEFISCDI, project number PN-III-P4-ID-PCE-2020-1559, within PNCDI III. HLB acknowledges Ocna Sibiului City Hall (Sibiu County, Romania) for granting access to Fără Fund Lake and A. Baricz and D.F. Bogdan for technical support during sampling and sample preparation. MBS thanks Dominion Salt for their assistance in sampling Lake Grassmere. MELL acknowledges the financial support of the Argentinian National Scientific and Technical Research Council (Grant CONICET-NSFC 2017 N° IF-2018-10102222-APN-GDCT-CONICET) and the National Geographic Society (Grant # NGS 357R-18). BPH was supported by NASA (award 80NSSC18M0027).
TV acknowledges the “Margarita Salas” postdoctoral grant, funded by the Spanish Ministry of Universities, within the framework of Recovery, Transformation and Resilience Plan, and funded by the European Union (NextGenerationEU), with the participation of the University of Balearic Islands (UIB)