Data Science Methods Applied to the Study of The Signature of Regulatory CD4 T Cells in the Human Thymus and its Modulation by the Chromatin Landscape

Abstract

Dissertation presented as partial requirement for obtaining a Master's degree in Data Science and Advanced Analytics, specialization in Data Science.
This work was supported by: GenomePT project (POCI-01-0145-FEDER-022184), supported by COMPETE 2020 - Operational Programme for Competitiveness and Internationalisation (POCI), Lisboa Portugal Regional Operational Programme (Lisboa2020), Algarve Portugal Regional Operational Programme (CRESC Algarve2020), under the PORTUGAL 2020 Partnership Agreement, through the European Regional Development Fund (ERDF), and by Fundação para a Ciência e a Tecnologia (FCT).
Thymic-derived Regulatory T cells (tTregs) play a central role in maintaining immune homeostasis by suppressing the pro-inflammatory activity of conventional T cells (tTconvs). Disruption of tTreg development and/or function is at the origin of many pathologies, from allergies and autoimmunity to chronic inflammation and cancer. To understand tTreg development, it is necessary to characterise tTreg genes and uncover the regulation of their expression. This dissertation aims to contribute to the characterisation of regulatory CD4 T cells in the human thymus and the regulation of their development by exploring the relationship between differences in transcription factor binding to chromatin and changes in gene expression (differential gene expression). To do this, I used computational biology and data science methodologies to analyse large volumes of epigenomic (ATAC-seq) and transcriptomic (RNA-seq) Next-Generation Sequencing data generated from human tTregs and tTconvs.
In this dissertation I will discuss three steps of this project in which Data Science played an important role: the discovery of a linear relationship between transcription factor accessibility to chromatin and associated gene expression in tTregs; the systematization and standardization of a gene set enrichment analysis (GSEA) protocol to detect signatures of activated biological pathways in ranked datasets of differential gene expression; and the development of systematised k-means clustering of Transcription Factor Binding Sites (TFBS), with heatmap visualisation, to discover relationships between the TFBS landscape and the gene expression profile of tTregs.
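As a rough illustration of the third step, the sketch below runs a plain Lloyd's k-means on a toy matrix in which each row stands for one TFBS and each column for a cell type. It is not the dissertation's pipeline: the function name `kmeans`, the toy scores, the choice of k = 2 and the fixed seed are all assumptions made for illustration only.

```python
import random

def kmeans(rows, k, n_iter=100, seed=0):
    """Plain Lloyd's k-means on a list of equal-length numeric rows."""
    rng = random.Random(seed)
    # Initialise centroids from k distinct rows (an illustrative choice).
    centroids = [list(r) for r in rng.sample(rows, k)]
    labels = None
    for _ in range(n_iter):
        # Assignment step: nearest centroid by squared Euclidean distance.
        new_labels = [
            min(
                range(k),
                key=lambda c: sum((x - m) ** 2 for x, m in zip(row, centroids[c])),
            )
            for row in rows
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so the run has converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [row for row, lab in zip(rows, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids

# Hypothetical accessibility scores: one row per TFBS,
# columns = (tTreg, tTconv). These numbers are made up.
rows = [
    [5.0, 1.0], [4.8, 1.2], [5.2, 0.9],   # more accessible in tTregs
    [1.0, 5.0], [1.1, 4.7], [0.9, 5.3],   # more accessible in tTconvs
]
labels, centroids = kmeans(rows, k=2)

# Order rows by cluster label, as one would before drawing a heatmap.
heatmap_order = sorted(range(len(rows)), key=lambda i: labels[i])
```

Sorting the rows by cluster label before plotting is what makes blocks of similarly behaving binding sites visible in a heatmap; in practice a library implementation of k-means and a dedicated plotting package would likely replace this hand-written version.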
