40 research outputs found

    Transcription factor binding site detection through position cross-mutual information variability analysis

    Regulatory sequence detection is a fundamental challenge in computational biology. One key process in protein synthesis starts with the binding of a transcription factor to its binding site. Different sites can bind the same factor, and this variability in binding sequences makes them harder to detect with computational algorithms. In this manuscript, a method for the detection of binding sites is proposed, based on the correlation between binding sequence positions as quantified by information-theoretic measures. The performance of the method is reported as Receiver Operating Characteristic (ROC) curves for the detection of binding sites of different transcription factors in Saccharomyces cerevisiae. We compare our results with those of another known motif detection algorithm, Motif Discovery scan (MDscan). Peer reviewed. Postprint (published version).

    MEET: Motif Elements Estimation Toolkit

    MEET (Motif Elements Estimation Toolkit) is an R package that integrates a set of algorithms for the computational detection of transcription factor binding sites (TFBS). The MEET R package includes five motif search programs: MEME/MAST (Multiple Expectation-Maximization for Motif Elicitation), Q-residuals, MDscan (Motif Discovery scan), ITEME (Information Theory Elements for Motif Estimation), and Match. It also lets the user work with different multiple alignment algorithms: MUSCLE (Multiple Sequence Comparison by Log-Expectation), ClustalW, and MEME. The package can operate in two modes, training and detection. Training mode selects the optimal parameters for the chosen detector. Detection mode, once the parameters have been chosen, scans a genome for binding sites. Both modes can combine the different alignment and detection methods, giving the user a wide range of options. This feature allows the different computational methods to be compared on an equal footing, without any unfair disadvantage arising from the alignment. Postprint (published version).

    Information Theory in Computational Biology: Where We Stand Today

    "A Mathematical Theory of Communication" was published in 1948 by Claude Shannon to address problems in data compression and communication over (noisy) communication channels. Since then, the concepts and ideas developed in Shannon's work have formed the basis of information theory, a cornerstone of statistical learning and inference, and have played a key role in disciplines such as physics and thermodynamics, probability and statistics, computational sciences and biological sciences. In this article we review the basic information-theory-based concepts and describe their key applications in multiple major areas of research in computational biology: gene expression and transcriptomics, alignment-free sequence comparison, sequencing and error correction, genome-wide disease-gene association mapping, metabolic networks and metabolomics, and protein sequence, structure and interaction analysis.

    Beyond hairballs: depicting complexity of a kinase-phosphatase network in the budding yeast

    Kinases and phosphatases (KP) form the largest family of enzymes in living cells. They regulate each other and 60% of the proteome, forming complex kinase-phosphatase networks (KP-Net) essential for cell signaling. Such networks, which have a command-execution character, tend to have a hierarchical structure. Despite extensive study of the KP-Net in the budding yeast, the hierarchical structure and the functional principles of this network are still not known. In this context, this thesis aims to perform an integrative analysis of multi-omics data with the hierarchical structure of a bona fide KP-Net in the budding yeast Saccharomyces cerevisiae, in order to generate hypotheses about the functional principles of each layer in the KP-Net hierarchy. 
Based on a literature curation effort accomplished in this and in other studies, the largest bona fide KP-Net of S. cerevisiae known to date was assembled in this thesis. By assessing the hierarchical level of the KP-Net using the global reaching centrality and by elucidating its hierarchical structure using the vertex-sort (VS) algorithm, we found that the KP-Net has a moderate hierarchical structure made of three disjoint layers (top, core and bottom) resembling a bow tie shape. The large top layer was found enriched for signaling regulation; the core layer, made of a few strongly connected KPs, was found enriched mostly for cell cycle regulation; and the large bottom layer was found enriched for diverse biological processes. On overlaying a wide range of KP biological properties on top of the KP-Net hierarchical structure, the top layer was found enriched for phosphatases and the bottom layer depleted of them, suggesting that phosphatases are less regulated by phosphorylation and dephosphorylation interactions (PDI) than kinases. Moreover, the core layer was found enriched for KPs representing bottlenecks, pathway-shared components and essential genes, and for the KPs most tightly regulated in time and space, implying that KPs playing an essential role in the KP-Net should be firmly controlled. Interestingly, KP proteins in the top layer were found to be more abundant and less noisy than those of the bottom layer, suggesting that the availability of enzymes at an invariable protein expression level at the top of the network might be important to ensure robust signaling. Analysis of the VS algorithm showed that node degrees affect the classification of nodes into the different layers of a network hierarchical structure without biasing the biological results of the sorted network. Robustness analysis of the KP-Net showed that KP-Net layers are moderately stable in noisy networks generated by adding edges to the KP-Net. 
However, layers of these noisy networks overlap significantly with those of the KP-Net. Moreover, topological and biological properties of the KP-Net were retained in the noisy networks to different levels. These findings indicate that, despite the observed partial robustness of our results, they mostly represent our current knowledge about KP-Nets. Finally, enhancing techniques dedicated to identifying KP substrates will improve our understanding of how KP-Nets function. As an example, I describe here a strategy that we devised to help in determining KP-substrate interactions and the regulatory subunits on which these interactions depend. The strategy is based on a protein-fragment complementation assay that uses the optimized yeast cytosine deaminase (OyCD PCA). The OyCD PCA represents a large-scale in vivo screen that promises a substantial improvement in delineating the complex KP-Nets. We applied the strategy to determine substrates of the cyclin-dependent kinase 1 (Cdk1; also called Cdc28) and the cyclins implicated in phosphorylation of these substrates by Cdk1 in S. cerevisiae. The OyCD PCA showed a wide compensatory behavior of cyclins for most of the substrates and the phosphorylation of γ-tubulin specifically by Clb3-Cdk1, thus establishing the timing of the latter event in controlling assembly of the mitotic spindle.

    Gaussian Process in Computational Biology: Covariance Functions for Transcriptomics

    In the field of machine learning, Gaussian process models are a widely used family of stochastic processes for modelling data observed over time, space or both. Gaussian process models are nonparametric, meaning that the models are developed on an infinite-dimensional parameter space. The parameter space is then typically learnt as the set of all possible solutions for a given learning problem. Gaussian process distributions are distributions over functions. The covariance function determines the properties of function samples drawn from the process. Once the decision to model with a Gaussian process has been made, the choice of the covariance function is a central step in modelling. In molecular biology and genetics, a transcription factor is a protein that binds to specific DNA sequences and controls the flow of genetic information from DNA to mRNA. To develop models of cellular processes, quantitative estimation of the regulatory relationship between transcription factors and genes is a basic requirement. Quantitative estimation is complex for various reasons. Many of the transcription factors' activities and their own transcription levels are post-transcriptionally modified; very often the expression levels of the transcription factors are low and noisy. It is therefore useful to infer the activity of the transcription factors from the expression levels of their target genes. Here we developed a Gaussian process based nonparametric regression model to infer the exact transcription factor activities from a combination of mRNA expression levels and DNA-protein binding measurements. Clustering of gene expression time series gives insight into which genes may be coregulated, allowing us to discern the activity of pathways in a given microarray experiment. Of particular interest is how a given group of genes varies with different conditions or genetic backgrounds. 
In this thesis, we developed a new clustering method that allows each cluster to be parametrized according to the behaviour of the genes across conditions, whether they are correlated or anti-correlated. By specifying the correlation between such genes, we gain more information within the cluster about how the genes interrelate. Our study shows the effectiveness of sharing information between replicates and different model conditions while modelling gene expression time series.

    Computation in Complex Networks

    Complex networks are among the most challenging research topics across disciplines including physics, mathematics, biology, medicine, engineering, and computer science, among others. Interest in complex networks is growing because of their ability to model many everyday systems, such as technology networks, the Internet, and communication, chemical, neural, social, political and financial networks. The Special Issue “Computation in Complex Networks” of Entropy offers a multidisciplinary view on how some complex systems behave, providing a collection of original and high-quality papers within the research fields of: • Community detection • Complex network modelling • Complex network analysis • Node classification • Information spreading and control • Network robustness • Social networks • Network medicine

    Statistical methods for gene selection and genetic association studies

    This dissertation includes five chapters, briefly described as follows. In Chapter One, we propose a signed bipartite genotype and phenotype network (GPN) by linking phenotypes and genotypes based on their statistical associations. It provides new insight into the genetic architecture of multiple correlated phenotypes and into where phenotypes might be related at a higher level of cellular and organismal organization. We show that multiple-phenotype association studies improve when the proposed network is used to incorporate genetic information into phenotype clustering. In Chapter Two, we first apply the proposed GPN to GWAS summary statistics. Then, we assess contributions to constructing a well-defined GPN with a clear representation of genetic associations by comparing its network properties, including connectivity, centrality, and community structure, with those of a random network. The network topology annotations based on the sparse representations of the GPN can be used to understand disease heritability for highly correlated phenotypes. In applications to phenome-wide association studies, the proposed GPN can identify more significant pairs of genetic variants and phenotype categories. In Chapter Three, we propose a powerful and computationally efficient gene-based association test that aggregates information from different gene-based association tests and incorporates expression quantitative trait locus information. We show that the proposed method controls type I error rates very well, has higher power in simulation studies, and identifies more significant genes in real data analyses. In Chapter Four, we develop six statistical selection methods based on penalized regression for inferring the target genes of a transcription factor (TF). 
In this study, the proposed selection methods combine statistics, machine learning, and convex optimization approaches, and are highly effective at identifying the true target genes. The methods fill the gap left by the lack of appropriate methods for predicting the target genes of a TF, and are instrumental for validating experimental results from ChIP-seq and DAP-seq and, conversely, for selecting and annotating TFs based on their target genes. In Chapter Five, we propose a gene selection approach that captures gene-level signals in network-based regression for case-control association studies with DNA sequence data or DNA methylation data, inspired by the popular gene-based association tests that use a weighted combination of genetic variants to capture the combined effect of individual genetic variants within a gene. We show that the proposed gene selection approach has higher true positive rates than traditional dimension reduction techniques in simulation studies and selects potentially rheumatoid arthritis-related genes that are missed by existing methods.

    Features of the intratumoral T cell receptor repertoire associated with antigen exposure in cancer patients

    The clinical success of immunotherapies demonstrates the importance of the immune system in tumour control, but the response rates remain low and many biological mechanisms underlying how these therapies work are still uncharacterised. In particular, the specificity of the anti-tumour immune response pre-existing in treatment-naive patients or induced by treatment remains poorly described. In this thesis, I explore how T cell receptor (TCR) sequencing data in multi-omics contexts can be utilised to identify features associated with antigen exposure in cancer patients. In treatment-naive non-small cell lung cancer (NSCLC) patients, multi-region TCR sequencing revealed a pattern of heterogeneity in the TCR repertoire resembling the heterogeneity observed in the mutational profile of these tumours and a range of clonotype frequency values associated with tumour specificity. A novel method was built in order to identify distinct TCR populations that spatially follow the pattern of the well-established clonal/subclonal mutational dichotomy. The impact of immune checkpoint blockade therapy on the TCR repertoire distribution was assessed in advanced renal cell carcinoma in the context of anti-PD1 treatment. TCRs with frequency distribution characteristics similar to what was observed in NSCLC were maintained upon treatment and associated with clinical response. In addition, RNA-sequencing analysis identified a gene expression profile consistent with specific activation of T cells through TCR signalling. Finally, the same methodology was applied to bone marrow samples harvested from B cell acute lymphoblastic leukaemia (B-ALL) patients. A statistical framework was developed in order to efficiently distinguish leukaemic re-arrangements from the non-leukaemic TCR repertoire of B-ALL patients. 
Subsequently, longitudinal analysis revealed TCR distributions that suggested the presence of cytotoxic T cells, which was further characterised in matched single-cell RNA sequencing data.

    Large-scale variational inference for Bayesian joint regression modelling of high-dimensional genetic data

    Genetic association studies have become increasingly important in understanding the molecular bases of complex human traits. The specific analysis of intermediate molecular traits, via quantitative trait locus (QTL) studies, has recently received much attention, prompted by the advance of high-throughput technologies for quantifying gene, protein and metabolite levels. Of great interest is the detection of weak trans-regulatory effects between a genetic variant and a distal gene product. In particular, hotspot genetic variants, which remotely control the levels of many molecular outcomes, may initiate decisive functional mechanisms underlying disease endpoints. This thesis proposes a Bayesian hierarchical approach for joint analysis of QTL data on a genome-wide scale. We consider a series of parallel sparse regressions combined in a hierarchical manner to flexibly accommodate high-dimensional responses (molecular levels) and predictors (genetic variants), and we present new methods for large-scale inference. Existing approaches have limitations. Conventional marginal screening does not account for local dependencies and association patterns common to multiple outcomes and genetic variants, whereas joint modelling approaches are restricted to relatively small datasets by computational constraints. Our novel framework allows information-sharing across outcomes and variants, thereby enhancing the detection of weak trans and hotspot effects, and implements tailored variational inference procedures that allow simultaneous analysis of data for an entire QTL study, comprising hundreds of thousands of predictors, and thousands of responses and samples. The present work also describes extensions to leverage spatial and functional information on the genetic variants, for example, using predictor-level covariates such as epigenomic marks. 
Moreover, we augment variational inference with simulated annealing and parallel expectation-maximisation schemes in order to enhance exploration of highly multimodal spaces and allow efficient empirical Bayes estimation. Our methods, publicly available as packages implemented in R and C++, are extensively assessed in realistic simulations. Their advantages are illustrated in several QTL applications, including a large-scale proteomic QTL study on two clinical cohorts that highlights novel candidate biomarkers for metabolic disorders.