30 research outputs found

    Benchmarking network propagation methods for disease gene identification

    In-silico identification of potential target genes for disease is an essential aspect of drug target discovery. Recent studies suggest that successful targets can be found by leveraging genetic, genomic and protein interaction information. Here, we systematically tested the ability of 12 varied algorithms, based on network propagation, to identify genes that have been targeted by any drug, on gene-disease data from 22 common non-cancerous diseases in OpenTargets. We considered two biological networks and six performance metrics, and compared two types of input gene-disease association scores. The impact of the design factors on performance was quantified through additive explanatory models. Standard cross-validation led to over-optimistic performance estimates due to the presence of protein complexes. In order to obtain realistic estimates, we introduced two novel protein complex-aware cross-validation schemes. When seeding biological networks with known drug targets, machine learning and diffusion-based methods found around 2-4 true targets within the top 20 suggestions. Seeding the networks with genes associated with disease by genetics decreased performance below 1 true hit on average. The use of a larger network, although noisier, improved overall performance. We conclude that diffusion-based prioritisers and machine learning applied to diffusion-based features are suited for drug discovery in practice and improve over simpler neighbour-voting methods. We also demonstrate the large impact of choosing an adequate validation strategy and the definition of seed disease genes.
    Peer Reviewed. Postprint (published version)
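
    The abstract above does not spell out its two complex-aware cross-validation schemes, so the following is only a generic illustration of the underlying idea: genes that share a protein complex are kept in the same fold, so a test gene never has a complex partner in the training set. The function name, the toy gene labels and the greedy balancing rule are all assumptions made for this sketch, not the authors' method.

```python
import random
from collections import defaultdict


def complex_aware_folds(genes, complexes, k=3, seed=0):
    """Split genes into k folds so that members of a complex share a fold.

    Hypothetical helper for illustration only; not the published scheme.
    """
    # Union-find structure: merges genes linked through overlapping complexes.
    parent = {g: g for g in genes}

    def find(g):
        while parent[g] != g:
            parent[g] = parent[parent[g]]  # path compression
            g = parent[g]
        return g

    for members in complexes:
        members = [m for m in members if m in parent]
        for other in members[1:]:
            parent[find(members[0])] = find(other)

    # Collect the resulting groups (singletons are genes in no complex).
    groups = defaultdict(list)
    for g in genes:
        groups[find(g)].append(g)

    # Shuffle for randomness, then place the largest groups first,
    # always into the currently smallest fold, to keep folds balanced.
    group_list = list(groups.values())
    random.Random(seed).shuffle(group_list)
    group_list.sort(key=len, reverse=True)
    folds = [[] for _ in range(k)]
    for grp in group_list:
        min(folds, key=len).extend(grp)
    return folds


# Toy example: A-B-C are chained through two overlapping complexes.
toy_genes = ["A", "B", "C", "D", "E", "F"]
toy_complexes = [["A", "B"], ["B", "C"], ["D", "E"]]
toy_folds = complex_aware_folds(toy_genes, toy_complexes, k=2)
```

    Grouping before splitting is what removes the leakage; the same effect can be obtained with any grouped k-fold utility once each gene carries a group label.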

    Non-targeted metabolomics reveals alterations in liver and plasma of gilt-head bream exposed to oxybenzone

    The extensive use of the organic UV filter oxybenzone has led to its ubiquitous occurrence in the aquatic environment, causing an ecotoxicological risk to biota. Although some studies reported adverse effects, such as reproductive toxicity, further research needs to be done in order to assess its molecular effects and mechanism of action. Therefore, in the present work, we investigated metabolic perturbations in juvenile gilt-head bream (Sparus aurata) exposed over 14 days via the water to oxybenzone (50 mg/L). The non-targeted analysis of brain, liver and plasma extracts was performed by means of UHPLC-qOrbitrap MS in positive and negative modes with both C18 and HILIC separation. Although there was no mortality or alterations in general physiological parameters during the experiment, and the metabolic profile of brain was not affected, the results of this study showed that oxybenzone could perturb both liver and plasma metabolome. The pathway enrichment suggested that different pathways in lipid metabolism (fatty acid elongation, α-linolenic acid metabolism, biosynthesis of unsaturated fatty acids and fatty acid metabolism) were significantly altered, as well as metabolites involved in phenylalanine and tyrosine metabolism. Overall, these changes are signs of possible oxidative stress and energy metabolism modification. Therefore, this research indicates that oxybenzone has adverse effects beyond the commonly studied hormonal activity, and demonstrates the sensitivity of metabolomics to assess molecular-level effects of emerging contaminants.
    Peer Reviewed. Postprint (author's final draft)

    Metabolic pathway enrichment of Liquid Chromatography-Mass Spectrometry data through spectral graph analysis

    Liquid Chromatography-Mass Spectrometry (LC/MS) is one of the most widespread experimental techniques in biological research and analytical chemistry. Its output reports on the compounds present in the samples by means of a physical separation technique coupled to a separation by mass-to-charge ratio. Pathway enrichment techniques are valued for processing large data sets, since they translate this compound-level information into metabolic pathways while reducing statistical noise. Metabolic pathways are a source of knowledge because of their close relationship with biological mechanisms. This work proposes a new enrichment technique for LC/MS data, built in two blocks. The first represents the Kyoto Encyclopedia of Genes and Genomes database as interpretable graphs. The second applies heat diffusion and PageRank algorithms on those graphs to carry out the enrichment. These procedures have been applied to a real case, and their results agree with those of the functional validation.
    Peer Reviewed. Postprint (author's final draft)
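
    To make the PageRank step more concrete, here is a minimal sketch of personalised PageRank (a random walk with restart at the input compounds) computed by power iteration. The toy graph, the node names (C = compound, R = reaction, P = pathway), the undirected edges and the damping value of 0.85 are assumptions chosen for illustration; the graphs derived from KEGG in the work above are much larger and their exact construction is not reproduced here.

```python
def personalized_pagerank(adj, seeds, damping=0.85, tol=1e-12, max_iter=1000):
    """Power iteration for PageRank with restarts concentrated on `seeds`.

    adj   : dict mapping each node to a list of its neighbours
    seeds : set of nodes that receive the restart probability
    """
    nodes = sorted(adj)
    restart = {v: (1.0 / len(seeds) if v in seeds else 0.0) for v in nodes}
    score = dict(restart)
    for _ in range(max_iter):
        # Every node first receives its share of the restart mass.
        new = {v: (1.0 - damping) * restart[v] for v in nodes}
        for u in nodes:
            neighbours = adj[u]
            if neighbours:
                # Spread the walking mass evenly over the neighbours.
                share = damping * score[u] / len(neighbours)
                for v in neighbours:
                    new[v] += share
            else:
                # Dangling node: send its mass back to the seeds.
                for v in nodes:
                    new[v] += damping * score[u] * restart[v]
        delta = sum(abs(new[v] - score[v]) for v in nodes)
        score = new
        if delta < tol:
            break
    return score


# Toy graph: two compounds feed reaction R1 in pathway P1;
# a third compound feeds R2 in an unrelated pathway P2.
toy_graph = {
    "C1": ["R1"], "C2": ["R1"], "R1": ["C1", "C2", "P1"], "P1": ["R1"],
    "C3": ["R2"], "R2": ["C3", "P2"], "P2": ["R2"],
}
toy_scores = personalized_pagerank(toy_graph, seeds={"C1", "C2"})
```

    In this toy case the pathway connected to the input compounds (P1) receives a positive score, while P2 sits in a separate component and receives none, which is the behaviour an enrichment method built on propagation relies on.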

    FELLA: an R package to enrich metabolomics data

    Background: Pathway enrichment techniques are useful for understanding experimental metabolomics data. Their purpose is to give context to the affected metabolites in terms of the prior knowledge contained in metabolic pathways. However, the interpretation of a prioritized pathway list is still challenging, as pathways show overlap and cross talk effects. Results: We introduce FELLA, an R package to perform a network-based enrichment of a list of affected metabolites. FELLA builds a hierarchical representation of an organism's biochemistry from the Kyoto Encyclopedia of Genes and Genomes (KEGG), containing pathways, modules, enzymes, reactions and metabolites. In addition to providing a list of pathways, FELLA reports intermediate entities (modules, enzymes, reactions) that link the input metabolites to them. This sheds light on pathway cross talk and potential enzymes or metabolites as targets for the condition under study. FELLA has been applied to six public datasets (three from Homo sapiens, two from Danio rerio and one from Mus musculus) and has reproduced findings from the original studies and from independent literature. Conclusions: The R package FELLA offers an innovative enrichment concept starting from a list of metabolites, based on a knowledge graph representation of the KEGG database that focuses on interpretability. Besides reporting a list of pathways, FELLA suggests intermediate entities that are of interest per se. Its usefulness has been shown at several molecular levels on six public datasets, including human and animal models. The user can run the enrichment analysis through a simple interactive graphical interface or programmatically. FELLA is publicly available in Bioconductor under the GPL-3 license.
    Peer Reviewed. Postprint (published version)

    Amitriptyline at an environmentally relevant concentration alters the profile of metabolites beyond monoamines in gilt-head bream

    The antidepressant amitriptyline is a widely used selective serotonin reuptake inhibitor that is found in the aquatic environment. The present work investigates alterations in the brain and liver metabolome of gilt-head bream (Sparus aurata) following exposure at an environmentally relevant concentration (0.2 µg/L) of amitriptyline for 7 days. Analysis of variance-simultaneous component analysis (ASCA) was used to identify metabolites that distinguished exposed from control animals. Overall, alterations in lipid metabolism suggest the occurrence of oxidative stress in both brain and liver, a common adverse effect of xenobiotics. However, alterations in the amino acid arginine were also observed, likely related to the nitric oxide system, which is known to be associated with the mechanism of action of antidepressants. Additionally, changes in asparagine and methionine levels in brain and in pantothenate, uric acid and formylisoglutamine/N-formimino-L-glutamate levels in liver could indicate alteration of amino acid metabolism in both tissues, and the perturbation of glutamate in liver suggests that the energy metabolism was also affected. These results revealed that environmentally relevant concentrations of amitriptyline perturbed a fraction of the metabolome which is not typically associated with antidepressant exposure in fish.
    Peer Reviewed. Postprint (author's final draft)

    Null diffusion-based enrichment for metabolomics data

    Metabolomics experiments identify metabolites whose abundance varies as the conditions under study change. Pathway enrichment tools help in the identification of key metabolic processes and in building a plausible biological explanation for these variations. Although several methods are available for pathway enrichment using experimental evidence, metabolomics does not yet have a comprehensive overview in a network layout at multiple molecular levels. We propose a novel pathway enrichment procedure for analysing summary metabolomics data based on sub-network analysis in a graph representation of a reference database. Relevant entries are extracted from the database according to statistical measures over a null diffusive process that accounts for network topology and pathway crosstalk. Entries are reported as a sub-pathway network, including not only pathways, but also modules, enzymes, reactions and possibly other compound candidates for further analyses. This provides a richer biological context, suitable for generating new study hypotheses and potential enzymatic targets. Using this method, we report results from cells depleted for an uncharacterised mitochondrial gene using GC and LC-MS data and employing KEGG as a knowledge base. Partial validation is provided with NMR-based tracking of 13C glucose labelling of these cells.
    Peer Reviewed. Postprint (author's final draft)
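
    One way to read "statistical measures over a null diffusive process" is as a z-score: each node's diffusion score is compared with the mean and variance it would have if the input labels were randomly permuted over the nodes. The sketch below assumes that the scores are linear in the input (score = kernel row times label vector) and that the null model is a uniform permutation of all labels; under those assumptions the null moments have a closed form. The function names and the toy kernel are hypothetical, and the exact statistic used in the work above may differ.

```python
import math
from itertools import permutations


def diffusion_z_scores(kernel, y):
    """Z-scores of linear diffusion scores under random label permutation.

    kernel : square list of lists, kernel[i][j] = influence of node j on i
    y      : input labels, one per node (e.g. 1 = affected metabolite)
    """
    n = len(y)
    mean_y = sum(y) / n
    var_y = sum((v - mean_y) ** 2 for v in y) / n  # population variance
    z = []
    for row in kernel:
        raw = sum(k * v for k, v in zip(row, y))
        row_sum = sum(row)
        row_sq = sum(k * k for k in row)
        # Moments of sum_j row[j] * y[perm(j)] for a uniform permutation.
        null_mean = mean_y * row_sum
        null_var = var_y * n / (n - 1) * (row_sq - row_sum ** 2 / n)
        z.append((raw - null_mean) / math.sqrt(null_var) if null_var > 1e-15 else 0.0)
    return z


def brute_force_moments(row, y):
    """Null mean and variance by enumerating every permutation (small n only)."""
    scores = [sum(k * v for k, v in zip(row, p)) for p in permutations(y)]
    mean = sum(scores) / len(scores)
    var = sum((s - mean) ** 2 for s in scores) / len(scores)
    return mean, var


# Toy 4-node kernel (assumed values) with two labelled nodes.
toy_kernel = [
    [2.0, 1.0, 0.5, 0.0],
    [1.0, 3.0, 1.0, 0.5],
    [0.5, 1.0, 2.0, 1.0],
    [0.0, 0.5, 1.0, 4.0],
]
toy_labels = [1.0, 0.0, 0.0, 1.0]
toy_z = diffusion_z_scores(toy_kernel, toy_labels)
```

    The brute-force helper is included only so the closed form can be checked on small inputs; it enumerates n! permutations and is not meant for real networks.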

    Biological Pathway enrichment of Liquid Chromatography - Mass Spectrometry data through spectral graph analysis

    No full text
    Biological research experiments on human metabolism often use LC/MS, Liquid Chromatography-Mass Spectrometry, to obtain information on the compounds in the samples, but this type of data tends to carry statistical noise. Enrichment techniques filter noise and, with the help of a database, make it possible to find an explanation in terms of affected metabolic pathways. This implies a better understanding of the biology and more efficient access to databases. This project covers the design of a package in the R environment with two distinct blocks. The first part converts the KEGG database, Kyoto Encyclopedia of Genes and Genomes, into graphs with biological meaning. As a novelty with respect to alternative state-of-the-art techniques, the graphs used here contain five types of vertices: metabolic pathways, modules, enzymes, reactions and compounds. The second part applies heat diffusion and PageRank algorithms on these graphs, taking as input a set of compounds from LC/MS. Together, the two parts achieve the enrichment of the data. The proposed methods have been run on a real case and validated by experts in the field. They obtained good results, as did state-of-the-art techniques, and went one step further in the interpretation of results. In particular, instead of providing only a list of affected metabolic pathways, a graph has been built that relates them to the modules, enzymes, reactions and compounds involved. Finally, since the graphs and algorithms allow broad customisation, recommendations for future extension have been drawn up.

    Statistical normalisation of network propagation methods for computational biology

    Thesis by compendium of publications. Embargo applied from the defence date until 1 August 2021. UPC extraordinary doctoral award, academic year 2019-2020, field of Industrial Engineering. The advent of high-throughput technologies and their decreasing cost have fostered the creation of a rich ecosystem of public database resources. In an era of affordable data acquisition, the core challenge has shifted to improving data interpretation, in order to understand normal and disease states. To that end, leveraging the current contextual knowledge in the form of annotations and biological networks is a powerful data amplifier to elucidate novel hypotheses. Label propagation and diffusion are the linchpin of the state of the art in network algorithms. In its simplest form, label propagation predicts the labels of a given node (for instance a gene, protein or metabolite) using those of its interactors. More elaborate approaches propagate beyond direct interactors, with robust performance in many computational biology domains. It has been pointed out that the topological structure of biological networks can bias propagation algorithms. Poorly known entities are overlooked and harder to link to experimental findings, which in turn keeps them barely annotated. Some efforts try to break this circularity by statistically normalising the topological bias, but the properties of the bias and the real benefit of its removal are yet to be carefully examined. This thesis covers two blocks. First, a characterisation of the bias in diffusion-based algorithms, with the implementation of statistical normalisations. Second, the application of such normalisation in classical computational biology problems: pathway analysis for metabolomics data and target gene prediction for drug development. In the first block, the presence of the bias is confirmed and linked to the network topology, albeit dependent on which nodes have labels. 
Equivalences are proven between diffusion processes with variations on their definitions, which eases the choice among them. Closed forms for the first and second statistical moments of the null distributions of the diffusion scores are provided and linked to the spectral features of the network. The normalisation can be detrimental if the bias favours nodes with positive labels. An ad-hoc study of the data and the expected properties of the findings is recommended for an optimal choice. To that end, this thesis contributes the diffuStats software package, easing the computation and benchmark of several normalised and unnormalised diffusion scores. The second block starts with pathway analysis for metabolomics data. This choice is driven by the relative lack of computational solutions for metabolomics, whose output still requires an effortful interpretation. Here, a knowledge graph is conceived to connect the metabolites to the biological pathways through intermediate entities, like reactions and enzymes. Given the metabolites of interest, a propagation process is run to prioritise a relevant sub-network, suitable for manual inspection. The statistical normalisation is required due to the network design and properties. The usefulness of this approach is proven not only regarding pathway findings, but also by examining the metabolites and reactions within the suggested sub-networks. The knowledge network construction and the propagation algorithm are distributed in the FELLA software package. The second practical application is the prediction of plausible gene targets in disease. Besides benchmarking the effect of the statistical normalisation, particular care is put into obtaining meaningful performance estimates for practical drug development. Target data is usually known at the protein complex level, which leads to performance over-estimation if ignored. 
Here, this effect is corrected in a varied comparison of prioritisation algorithms, networks, performance metrics and diseases. The results support that the statistical normalisation has a small but negative impact. After correcting for the protein complex structure, network-based algorithms are still deemed useful for drug discovery.
The emergence of high-throughput experimental technologies has fostered a rich environment of databases that gather all kinds of molecular annotations. Given the growing ease of acquiring data at several molecular levels, the central challenge of computational biology has shifted towards the interpretation of that volume of data. Understanding the normal and disease processes involved in the changes observed in experimental studies is the engine that expands the frontier of human knowledge. To that end, it is essential to take advantage of the legacy of prior knowledge, collected in databases in the form of annotations and biological networks, and to mine it for new patterns and hypotheses. The most widespread algorithms for extracting knowledge from biological networks are the so-called propagation and diffusion methods. Their foundation is the guilt-by-association principle, which postulates that biological entities that are related or interact are more likely to share functions and properties. These algorithms exploit known interactions, in network format, to predict properties of nodes (for example genes, proteins or metabolites) from the properties of their interactors. There is evidence that the topological structure of networks biases propagation algorithms, so that the best-described nodes enjoy a systematic advantage. Lesser-known nodes are at a disadvantage: discovering their involvement in experiments is hindered, which in turn perpetuates our poor knowledge of them. The literature offers some studies where this effect is normalised, but the intrinsic properties of the bias and the real benefit of such normalisation require a more detailed study. This thesis has two aims. First, the characterisation of the statistics of the bias in propagation algorithms, the design of statistical normalisations and their distribution as scientific software. Second, the application of such normalisation to classical problems in computational biology, specifically pathway analysis for metabolomics data and the prediction of genes as therapeutic targets in drug development. Both problems can be addressed with propagation techniques and are therefore potentially sensitive to the effect of topological bias. In the first block, the existence of the bias is confirmed, along with its dependence not only on network structure but also on the nodes on which the propagation is defined. Mathematical equivalences are proven between certain variations in the definition of the propagation, which eases the choice among them. Closed-form expressions for the statistical moments of the diffusion are provided, and a connection with the spectral properties of the networks is found. An important point is that normalisation does not always help; its applicability depends on each particular case and on the hypotheses about the topology of the nodes to be discovered. To that end, this thesis delivers diffuStats, software available in a public repository that computes and compares propagation under several variants, with or without normalisation. In the second block, pathway analysis in metabolomics is chosen given the relative youth of metabolomic studies and, hence, their lack of dedicated software tools. Classical pathway analysis starts from a list of metabolites of interest, usually derived from a study, and reports a list of metabolic pathways or processes statistically related to them. Some variants use metabolite networks to provide more biological context, but data interpretation still requires extensive manual effort. The contribution of this thesis is a knowledge network that relates metabolites to pathways through the annotated intermediate entities, such as reactions and enzymes. Propagation algorithms are applied on this network to identify the entities most related to the metabolites of interest. Statistical normalisation is necessary given the structure and characteristics of the network. The coherence of the proposed metabolic pathways is demonstrated, as well as that of the prioritised metabolites and reactions. The publication of the FELLA software makes the construction of the knowledge network and the diffusion algorithm available to the scientific community. FELLA is accompanied by six application cases in human and animal studies. Separately, the problem of predicting genes as therapeutic targets through biological networks is addressed. Besides testing the effect of statistical normalisation, emphasis is placed on estimating the real performance expected in a drug development scenario. Therapeutic target data are usually known not at the protein level but at the level of protein complex or family. Most studies do not take this into account and arrive at optimistic estimates of the expected performance. This thesis proposes an exhaustive study that corrects for the effect of protein complexes, compares propagation algorithms under different performance metrics according to their informativeness, and explores the role of the biological network and of the disease in question. Statistical normalisation is shown to have little effect on performance and, in general, propagation methods remain useful in drug development after the optimistic estimates of their performance are corrected.
    Award-winning. Postprint (published version)

    Balancing data on deep learning-based proteochemometric activity classification

    No full text
    In silico analysis of biological activity data has become an essential technique in pharmaceutical development. Specifically, the so-called proteochemometric models aim to share information between targets in machine learning ligand–target activity prediction models. However, bioactivity data sets used in proteochemometric modeling are usually imbalanced, which could potentially affect the performance of the models. In this work, we explored the effect of different balancing strategies in deep learning proteochemometric target–compound activity classification models while controlling for the compound series bias through clustering. These strategies were (1) no_resampling, (2) resampling_after_clustering, (3) resampling_before_clustering, and (4) semi_resampling. These schemas were evaluated in kinases, GPCRs, nuclear receptors, and proteases from BindingDB. We observed that the predicted proportion of positives was driven by the actual data balance in the test set. Additionally, it was confirmed that data balance had an impact on the performance estimates of the proteochemometric model. We recommend a combination of data augmentation and clustering in the training set (semi_resampling) to mitigate the data imbalance effect in a realistic scenario. The code of this analysis is publicly available at https://github.com/b2slab/imbalance_pcm_benchmark.
    Peer Reviewed. Postprint (published version)