
    Systematic evaluation of the impact of ChIP-seq read designs on genome coverage, peak identification, and allele-specific binding detection

    Background: Chromatin immunoprecipitation followed by sequencing (ChIP-seq) experiments have revolutionized genome-wide profiling of transcription factors and histone modifications. Although maturing sequencing technologies allow these experiments to be carried out with short (36–50 bp), long (75–100 bp), single-end, or paired-end reads, the impact of these read parameters on downstream data analysis is not well understood. In this paper, we evaluate the effects of different read parameters on genome sequence alignment, coverage of different classes of genomic features, peak identification, and allele-specific binding detection. Results: We generated 101 bp paired-end ChIP-seq data for many transcription factors from human GM12878 and MCF7 cell lines. Systematic evaluations using in silico variations of these data, as well as fully simulated data, revealed a complex interplay between the sequencing parameters and analysis tools, and indicated clear advantages of paired-end designs in several aspects such as alignment accuracy, peak resolution, and, most notably, allele-specific binding detection. Conclusions: Our work elucidates the effect of design on downstream analysis and provides insights to investigators deciding on sequencing parameters for ChIP-seq experiments. We present the first systematic evaluation of the impact of ChIP-seq designs on allele-specific binding detection and highlight the power of paired-end designs in such studies.
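One way to picture the "in silico variations" of the 101 bp paired-end data is to derive shorter or single-end designs from each read pair. The abstract does not describe the exact procedure, so the sketch below is only an illustration under assumptions: reads are trimmed by keeping their 5' prefix, a single-end design is obtained by dropping the second mate, and `derive_design` is a hypothetical helper name.

```python
from typing import Optional, Tuple


def derive_design(read1: str, read2: str, length: int,
                  paired: bool) -> Tuple[str, Optional[str]]:
    """Derive a shorter and/or single-end design from one read pair.

    Assumption: trimming keeps the first `length` bases of each mate,
    and a single-end design simply discards mate 2.
    """
    if length <= 0 or length > min(len(read1), len(read2)):
        raise ValueError("length must be between 1 and the shorter read length")
    trimmed1 = read1[:length]
    trimmed2 = read2[:length] if paired else None
    return trimmed1, trimmed2


# Example: turn a 101 bp pair into a 36 bp single-end read and a 75 bp pair.
mate1 = "ACGT" * 25 + "A"
mate2 = "TGCA" * 25 + "T"
short_single = derive_design(mate1, mate2, 36, paired=False)
long_paired = derive_design(mate1, mate2, 75, paired=True)
```

Deriving every design from the same sequenced fragments keeps the underlying library constant, so differences in alignment or peak calls can be attributed to read length and pairing rather than to the sample.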

    Novel computational methods for studying the role and interactions of transcription factors in gene regulation

    Regulation of which genes are expressed, and when, enables the existence of different cell types sharing the same genetic code in their DNA. Erroneously functioning gene regulation can lead to diseases such as cancer. Gene regulatory programs can malfunction in several ways. Often, if a disease is caused by a defective protein, the cause is a mutation in the gene coding for the protein, rendering the protein unable to perform its functions properly. However, protein-coding genes make up only about 1.5% of the human genome, and the majority of all disease-associated mutations discovered reside outside protein-coding genes. The mechanisms of action of these non-coding disease-associated mutations are far less well understood. Binding of transcription factors (TFs) to DNA controls the rate of transcribing genetic information from the coding DNA sequence to RNA. Binding affinities of TFs to DNA have been extensively measured in vitro, for example with SELEX (systematic evolution of ligands by exponential enrichment) and protein binding microarrays (PBMs), and the genome-wide binding locations and patterns of TFs have been mapped in dozens of cell types. Despite this, our understanding of how TF binding to regulatory regions of the genome, promoters and enhancers, leads to gene expression is not at the level where gene expression could be reliably predicted based on DNA sequence only. In this work, we develop and apply computational tools to analyze and model the effects of TF-DNA binding. We also develop new methods for interpreting and understanding deep learning-based models trained on biological sequence data. In biological applications, the ability to understand how machine learning models make predictions is as important as, or even more important than, raw predictive performance. This has created a demand for approaches that help researchers extract biologically meaningful information from deep learning model predictions. 
We develop a novel computational method for determining TF binding sites genome-wide from recently developed high-resolution ChIP-exo and ChIP-nexus experiments. We demonstrate that our method performs as well as or better than previously published methods while making fewer assumptions about the data. We also describe an improved algorithm for calling allele-specific TF-DNA binding. We utilize deep learning methods to learn features predicting the transcriptional activity of human promoters and enhancers. The deep learning models are trained on massively parallel reporter assay (MPRA) data from human genomic regulatory elements, designed regulatory elements, and promoters and enhancers selected from a completely random pool of synthetic input DNA. This unprecedentedly large set of measurements of human gene regulatory element activities, in total more than 100 times the size of the human genome, allowed us to train models that were able to predict genomic transcription start site positions more accurately than models trained on genomic promoters, and to correctly predict the effects of disease-associated promoter variants. We also found that interactions between promoters and local classical enhancers are non-specific in nature. The MPRA data, integrated with extensive epigenetic measurements, support the existence of three different classes of enhancers: classical enhancers, closed chromatin enhancers, and chromatin-dependent enhancers. We also show that TFs can be divided into four different, non-exclusive classes based on their activities: chromatin opening, enhancing, promoting, and TSS determining TFs. Interpreting the deep learning models of human gene regulatory elements required applying several existing model interpretation tools as well as developing new approaches. Here, we describe two new methods for visualizing features and interactions learned by deep learning models. 
First, we describe an algorithm for testing whether a deep learning model has learned an existing binding motif of a TF. Second, we visualize mutual information between pairwise k-mer distributions in sample inputs selected according to predictions by a machine learning model. This method highlights pairwise and positional dependencies learned by a machine learning model. We demonstrate the use of this model-agnostic approach with classification and regression models trained on DNA, RNA, and amino acid sequences.
Many organisms consist of several different cell types, even though all of their cells carry the same DNA code. Regulation of gene expression makes the different cell types possible. Malfunctioning regulation can lead to diseases, for example the onset of cancer. If a disease is caused by a defective protein, the cause is often a mutation in the gene coding for that protein, which alters the protein so that it can no longer perform its function well enough. However, only 1.5% of the human genome consists of protein-coding genes. Most of the disease-associated mutations discovered so far lie outside these so-called coding regions. The mechanisms of action of non-coding disease-associated mutations are generally much less well understood than those of mutations in coding regions. Binding of transcription factors to DNA regulates transcription, that is, the reading of the genetic information in genes and its conversion into RNA. Transcription factor binding to DNA has been measured extensively in vitro, and the binding sites of many transcription factors have been mapped genome-wide in several cell types. Despite this, our understanding of how transcription factor binding to the regulatory elements of the genome, namely promoters and enhancers, leads to gene expression is not at a level where we could reliably predict gene expression from DNA sequence alone. 
In this work we develop and apply computational tools for analyzing and modeling gene expression resulting from transcription factor binding. We also develop new methods for interpreting deep learning models trained on biological sequence data. In biological applications, the interpretability of the predictions made by a machine learning model is usually as important as, if not more important than, raw predictive accuracy. This has created a need for new methods that help researchers mine biologically meaningful information from the predictions of deep learning models. In this work we developed a new computational tool for determining transcription factor binding sites genome-wide using data from the recently developed high-resolution ChIP-exo and ChIP-nexus experiments. We show that our method performs better than, or at least as well as, previously published methods while making fewer assumptions about the shape of the signal. We also present an improved algorithm for determining allele-specific binding of transcription factors. We use deep learning methods to learn which features predict the activity of human promoter and enhancer elements. These deep learning models were trained on data from massively parallel reporter gene assays of human genomic regulatory elements, as well as of active promoters and enhancers selected from a random pool of synthetic DNA sequences. This unprecedentedly large set of measurements of human regulatory element activity, covering more than a hundred times as much DNA sequence as the human genome, made it possible to predict the positions of transcription start sites in the human genome more accurately than models trained on the human genome. These models also correctly predicted the effects of disease-associated mutations in human promoters. 
Our results showed that interactions between human promoters and classical local enhancers are non-specific. The MPRA data, integrated with extensive epigenetic measurements, made it possible to divide enhancer elements into three classes: classical, closed-chromatin, and chromatin-dependent enhancers. Our study showed that transcription factors can be divided into four partially overlapping classes based on their activities: chromatin-opening, enhancing, promoting, and transcription start site-determining transcription factors. Interpreting deep learning models of human genomic regulatory elements required both applying existing methods and developing new ones. In this work we developed two new methods for visualizing the features learned by deep learning models and the interactions between them. First, we present an algorithm for testing whether a deep learning model has learned the known binding motif of a transcription factor. Second, we visualize the mutual information of position-specific k-mer distributions in sequences selected on the basis of the predictions of a deep learning model. This method reveals the pairwise interactions and positional dependencies learned by the deep learning model. We show that the method is independent of model architecture by applying it to both classifiers and regression models trained on DNA, RNA, or amino acid sequence data.
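The second visualization method rests on the mutual information between k-mers at two sequence positions. The abstract gives no implementation details, so the following is only a minimal sketch under assumptions: the input sequences are taken to have already been selected by model prediction, probabilities are plain frequency estimates without pseudocounts, and `kmer_mutual_information` is a hypothetical name, not the thesis's code.

```python
import math
from collections import Counter
from typing import List


def kmer_mutual_information(seqs: List[str], i: int, j: int, k: int = 1) -> float:
    """Mutual information (in bits) between the k-mers starting at positions i and j.

    Sequences too short to contain both k-mers are skipped.
    """
    pairs = [(s[i:i + k], s[j:j + k]) for s in seqs if len(s) >= max(i, j) + k]
    n = len(pairs)
    if n == 0:
        return 0.0
    joint = Counter(pairs)
    left = Counter(a for a, _ in pairs)
    right = Counter(b for _, b in pairs)
    mi = 0.0
    for (a, b), count in joint.items():
        p_ab = count / n
        # p_ab / (p_a * p_b) simplifies to count * n / (left[a] * right[b]).
        mi += p_ab * math.log2(count * n / (left[a] * right[b]))
    return mi


# Perfectly coupled positions carry 1 bit; independent positions carry 0 bits.
coupled = kmer_mutual_information(["AT", "AT", "CG", "CG"], 0, 1)
independent = kmer_mutual_information(["AA", "AC", "CA", "CC"], 0, 1)
```

Evaluating this quantity for every pair of positions gives a matrix that can be drawn as a heat map, which is one plausible way to display the pairwise and positional dependencies the abstract refers to.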

    Systematic multi-omics profiling of Ewing sarcoma cell lines

    One hundred years after its first description, Ewing sarcoma (EwS), the second most common bone-associated cancer in children and young adults, is still poorly understood. Neither the cell of origin nor the detailed mechanism of expression regulation by the pathognomonic fusion oncogene is known. Similarly, factors causing overt clinical heterogeneity, and advanced or targeted therapeutic strategies for patients with non-localized disease, remain to be identified. An apparent paradox of EwS is its clinical heterogeneity compared to its silent landscape of genomic mutations. The only highly recurrent mutation in EwS is the characteristic fusion oncogene composed of EWSR1 and an ETS transcription factor. Interactions of this single driver with the genome have been described and associated with gene expression regulation several times, but always in a small number of cell line models. This thesis aimed at creating a multidimensional dataset on a large number of EwS cell line models with and without fusion oncogene knockdown, the Ewing Sarcoma Cell Line Atlas (ESCLA), both to enable further investigations of expression regulation in EwS and to model heterogeneity. In 18 well-characterized EwS cell lines, with three distinct fusion types, an inducible shRNA construct targeting the fusion oncogene was stably integrated. The whole genomes of the cell lines were sequenced with relatively long reads (150 bp) and >30× coverage. For the respective fusion and the histone marks H3K27ac, H3K27me3, and H3K4me3, chromatin immunoprecipitation with subsequent next-generation sequencing (ChIP-Seq) was performed. The transcriptome of the cells with and without fusion knockdown was assessed by ClariomD DNA microarrays, the protein expression by mass spectrometry, and the CpG island methylation by MethylationEPIC BeadChip arrays. Whole-genome sequencing enabled genotyping of several polymorphic, potentially fusion-binding microsatellites with GGAA motifs. 
ChIP-Seq data were in line with previous publications and identified 50 additional consensus fusion binding sites. Transcriptome and proteome data correlated strongly with each other and displayed expression rearrangement upon fusion knockdown. Only for CpG methylation was no uniform effect of fusion oncogene knockdown observed. Cell lines with distinct fusion types, EWSR1-FLI1 type 1, EWSR1-FLI1 type 2, and EWSR1-ERG, were for the first time systematically compared to each other. Neither expression regulation nor methylation profile depended on the respective fusion. However, the fusion types differed in their rate of chromoplexy as a developmental process. All EWSR1-ERG fusions and 55% of EWSR1-FLI1 type 1 fusions developed from chromoplexy, whereas all EWSR1-FLI1 type 2 fusions were the result of reciprocal translocation. Binding of the fusion to GGAA motifs appeared to be a multifactorial and still poorly understood process. Among other factors, high numbers of consecutive GGAA motifs, additional nearby motifs and microsatellites, and copy number gains correlated with fusion binding probability. Genes differentially expressed upon fusion knockdown differed from unaffected genes in their distance to the next fusion-bound GGAA mSat, the number of nearby GGAA mSats, and the presence of transcription factor binding sites for NFAT5, NFYC, and E2F2 in their promoters. All these transcription factors were also regulated by the fusion oncogene. A set of 22 genes was identified as being regulated to different extents in the 18 cell line models upon fusion knockdown. This heterogeneity in regulation was in line with heterogeneous expression in patients, which correlated with overall survival. These genes were mainly associated with cell-cycle progression and cell division, and with transcription factors and their targets. 
Yet the evaluated and identified parameters of EWSR1-ETS-mediated gene expression regulation were not sufficient to fully explain inter-cell-line differences in gene regulation. Several previous studies demonstrated an interaction between the fusion oncogene and GGAA microsatellites, but were limited to a few loci. Previous whole-exome sequencing projects missed these relevant regulatory regions. Reporter assays in vitro revealed enhancer activity of GGAA microsatellites, but only in an artificial, mono-allelic approach. Studies and experiments on gene regulation in EwS with only two to three cell lines could hardly model heterogeneity. The ESCLA generated here overcame these obstacles, and supported, refined, and expanded previously elaborated models of fusion oncogene-mediated gene regulation genome-wide. In conclusion, a multidimensional and comprehensive dataset was generated on a collection of EwS cell line models clearly outnumbering those of previous studies. Moreover, the dataset has already enabled first novel insights into the mechanisms and dependencies of fusion-mediated gene regulation and has modelled heterogeneity. The generated cell lines and the ESCLA likely constitute a rich resource for the Ewing sarcoma research community. Additionally, the capability of the dataset to model heterogeneity might foster research on personalized medicine and the development of new treatment strategies for patients with so far incurable advanced disease.

    Developing bioinformatics applications for the analysis of epigenetic next-generation sequencing data

    The application of genomic technologies to cancer and companion diagnostics.

    This thesis describes work undertaken by the author between 1996 and 2014. Genomics is the study of the genome, although it is also often used as a catchall phrase and applied to the transcriptome (study of RNAs) and methylome (study of DNA methylation). As cancer is a disease of the genome, the rapid advances in genomic technology, specifically microarrays and next-generation sequencing, are creating a wave of change in our understanding of its molecular pathology. Molecular pathology and personalised medicine are being driven by discoveries in genomics, and genomics is being driven by the development of faster, better, and cheaper genome sequencing. The next decade is likely to see significant changes in the way cancer is managed for individual patients as next-generation sequencing enters the clinic. In chapter 3 I discuss how ERBB2 amplification testing for breast cancer is currently dominated by immunohistochemistry (a single-gene test), and present the development, by the author, of a semi-quantitative PCR test for ERBB2 amplification. I also show that estimating ERBB2 amplification from microarray copy-number analysis of the genome is possible. In chapter 4 I present a review of microarray comparison studies, and outline the case for careful and considered comparison of technologies when selecting a platform for use in a research study. Similar, indeed more stringent, care needs to be applied when selecting a platform for use in a clinical test. In chapter 5 I present co-authored work on the development of amplicon and exome methods for the detection and quantitation of somatic mutations in circulating tumour DNA, and demonstrate the impact this can have in understanding tumour heterogeneity and evolution during treatment. I also demonstrate how next-generation sequencing technologies may allow multiple genetic abnormalities to be analysed in a single test, including in low-cellularity tumours and/or heterogeneous cancers. 
Keywords: Genome, exome, transcriptome, amplicon, next-generation sequencing, differential gene expression, RNA-seq, ChIP-seq, microarray, ERBB2, companion diagnostic

    Annotating Gene Expression and Regulatory Elements in Tissues from Healthy Thoroughbred Horses and Identifying Candidate Mutations Associated with Perosomus Elumbis in an Angus Calf

    Genome annotation has a direct impact on the success of genomic studies. Transcriptome analyses and chromatin immunoprecipitation followed by sequencing (ChIP-seq) have been used to functionally annotate genomes. These methods can identify protein-coding genes, non-coding transcripts, and cis-regulatory elements across the genome. The primary objective of the first study was to functionally annotate the equine genome through the assessment of nine tissues: adipose, brain, heart, lamina, liver, lung, skeletal muscle, testis, and ovary. In the first project, 150 bp paired-end RNA sequencing (RNA-seq) libraries were generated from stallion tissues and compared to previously generated mare RNA-seq libraries to quantify variation in gene expression due to sex and tissue type. On average, each tissue expressed (> 10 transcripts per million) over 8,000 genes, and adipose, liver, and skeletal muscle each had over 900 genes differentially expressed due to sex (P adj […] The third study examined genomic variation associated with a congenital defect, perosomus elumbis (PE), in Angus cattle. The affected calf was stillborn, displaying lumbar aplasia and arthrogryposis. Whole-genome sequencing of 31 Angus cattle identified a frameshift mutation in PTK7 as a candidate variant for the development of PE in an Angus calf. Despite the implication of PTK7 in similar phenotypes, additional research is needed to verify the etiology of PE in Angus cattle. Advisor: Jessica L. Peterse
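The expression cutoff quoted above is stated in transcripts per million (TPM). As a reminder of how that unit is defined, here is a minimal sketch, assuming simple per-gene read counts and gene lengths. The abstract does not name the quantification tool, and the gene names, counts, and helper names below are purely illustrative.

```python
from typing import Dict, List


def tpm(counts: Dict[str, float], lengths_bp: Dict[str, float]) -> Dict[str, float]:
    """Convert read counts to TPM: length-normalize, then scale so values sum to 1e6."""
    rates = {gene: counts[gene] / lengths_bp[gene] for gene in counts}
    total = sum(rates.values())
    if total == 0:
        return {gene: 0.0 for gene in counts}
    return {gene: rate / total * 1e6 for gene, rate in rates.items()}


def expressed_genes(tpm_values: Dict[str, float], threshold: float = 10.0) -> List[str]:
    """Return genes whose TPM exceeds the threshold (10 TPM, as in the abstract)."""
    return sorted(gene for gene, value in tpm_values.items() if value > threshold)


# Illustrative toy input, not data from the study.
toy_counts = {"geneA": 100.0, "geneB": 300.0, "geneC": 0.0}
toy_lengths = {"geneA": 1000.0, "geneB": 1000.0, "geneC": 2000.0}
toy_tpm = tpm(toy_counts, toy_lengths)
toy_expressed = expressed_genes(toy_tpm)
```

Because TPM values always sum to one million within a sample, a fixed cutoff such as 10 TPM refers to the same share of the transcript pool in every tissue, although it does not correct for differences in transcriptome composition between tissues.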

    Functional Analysis of Genomic Variation and Impact on Molecular and Higher Order Phenotypes

    Reverse genetics methods, particularly the production of gene knockouts and knockins, have revolutionized the understanding of gene function. High-throughput sequencing now makes it practical to exploit reverse genetics to simultaneously study the functions of thousands of normal sequence variants and spontaneous mutations that segregate in intercross and backcross progeny generated by mating completely sequenced parental lines. To evaluate this new reverse genetic method, we resequenced the genome of one of the oldest inbred strains of mice, DBA/2J, the father of the large family of BXD recombinant inbred strains. We analyzed ~100X whole-genome sequence data for the DBA/2J strain relative to C57BL/6J, the reference strain for all mouse genomics and the mother of the BXD family. We generated the most detailed picture of molecular variation between the two mouse strains to date and identified 5.4 million sequence polymorphisms, including 4.46 million single nucleotide polymorphisms (SNPs), 0.94 million insertions/deletions (indels), and 20,000 structural variants. We systematically scanned massive databases of molecular phenotypes and ~4,000 classical phenotypes to detect linked functional consequences of sequence variants. In the majority of cases we successfully recovered known genotype-to-phenotype associations, and in several cases we linked sequence variants to novel phenotypes (Ahr, Fh1, Entpd2, and Col6a5). However, our most striking and consistent finding is that apparently deleterious homozygous SNPs, indels, and structural variants have undetectable or very modest additive effects on phenotypes.