1,999 research outputs found

    A strand specific high resolution normalization method for chip-sequencing data employing multiple experimental control measurements

    Background: High-throughput sequencing is becoming the standard tool for investigating protein-DNA interactions or epigenetic modifications. However, the data generated will always contain noise due to, e.g., repetitive regions or non-specific antibody interactions. The noise will appear in the form of a background distribution of reads that must be taken into account in the downstream analysis, for example when detecting enriched regions (peak-calling). Several reported peak-callers can take experimental measurements of background tag distribution into account when analysing a data set. Unfortunately, the background is only used to adjust peak calling and not as a preprocessing step that aims at discerning the signal from the background noise. A normalization procedure that extracts the signal of interest would be of universal use when investigating genomic patterns. Results: We formulated such a normalization method based on linear regression and made a proof-of-concept implementation in R and C++. It was tested on simulated as well as on publicly available ChIP-seq data on binding sites for two transcription factors, MAX and FOXA1, and two control samples, Input and IgG. We applied three different peak-callers to (i) raw (un-normalized) data using statistical background models, (ii) raw data with control samples as background, and (iii) normalized data without additional control samples as background. The fraction of called regions containing the expected transcription factor binding motif was largest for the normalized data, and evaluation with qPCR data for FOXA1 suggested higher sensitivity and specificity using normalized data over raw data with experimental background. Conclusions: The proposed method can handle several control samples, allowing for correction of multiple sources of bias simultaneously. Our evaluation on both synthetic and experimental data suggests that the method is successful in removing background noise.
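
The idea of regressing a ChIP signal on several control tracks can be illustrated with a minimal sketch. It is not the authors' R/C++ implementation, which is strand specific and works at high resolution. The sketch assumes read counts have already been summarised per genomic bin, fits an ordinary least-squares model with an intercept, and keeps the non-negative residual as the estimated signal. The function names (`fit_background`, `normalize`) and the clipping at zero are illustrative assumptions.

```python
def solve_linear_system(a, b):
    """Solve a * x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular design matrix (collinear controls?)")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def fit_background(chip, controls):
    """Least-squares coefficients [intercept, b_1, ..., b_k] for
    chip[i] ~ intercept + sum_k b_k * controls[k][i]."""
    rows = [[1.0] + [ctrl[i] for ctrl in controls] for i in range(len(chip))]
    p = len(rows[0])
    # Normal equations: (X^T X) beta = X^T y
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * y for r, y in zip(rows, chip)) for i in range(p)]
    return solve_linear_system(xtx, xty)


def normalize(chip, controls):
    """Subtract the fitted background; clip negatives to zero (an assumption)."""
    coef = fit_background(chip, controls)
    signal = []
    for i, y in enumerate(chip):
        background = coef[0] + sum(b * ctrl[i] for b, ctrl in zip(coef[1:], controls))
        signal.append(max(0.0, y - background))
    return signal


# Hypothetical per-bin counts for two control samples and one ChIP sample.
input_ctrl = [1, 2, 3, 4, 5, 6]
igg_ctrl = [2, 1, 4, 3, 6, 5]
chip = [9, 8, 69, 18, 29, 28]  # bin 2 carries extra reads on top of background
print(normalize(chip, [input_ctrl, igg_ctrl]))
```

Because each control enters as its own regressor, adding a further control sample only adds a column to the design matrix, which matches the abstract's point that several sources of bias can be corrected at once.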

    High-throughput genetic analysis and combinatorial chiral separations based on capillary electrophoresis

    Capillary electrophoresis (CE) offers many advantages over conventional analytical methods, such as speed, simplicity, high resolution, low cost, and small sample consumption, especially for the separation of enantiomers. However, chiral method development can still be time consuming and tedious. We designed a comprehensive enantioseparation protocol employing neutral and sulfated cyclodextrins as chiral selectors for common basic, neutral, and acidic compounds with a 96-capillary array system. By using only four judiciously chosen separation buffers, successful enantioseparations were achieved for 49 out of 54 test compounds spanning a large variety of pKs and structures. Therefore, unknown compounds can be screened in this manner to identify optimal enantioselective conditions in just one run. In addition to superior separation efficiency for small molecules, CE is also the most powerful technique for DNA separations. Using the same multiplexed capillary system with UV absorption detection, DNA sequencing of a short template was done without any dye labels. Two internal standards were utilized to adjust the migration time variations among capillaries, so that the four electropherograms for the A, T, C, G Sanger reactions can be aligned and base calling can be completed with a high level of confidence. The CE separation of DNA can be applied to study differential gene expression as well. Combined with pattern recognition techniques, small variations among electropherograms obtained by the separation of cDNA fragments produced from the total RNA samples of different human tissues can be revealed. These variations reflect the differences in total RNA expression among tissues. Thus, this CE-based approach can serve as an alternative to the DNA array techniques in gene expression analysis.
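
The abstract does not say how the two internal standards are used to adjust migration times. One simple possibility, shown here purely as an assumption, is a two-point linear rescaling that maps each capillary's time axis onto a shared reference so that both standards coincide. The function name `align_migration_times` and all numbers are hypothetical.

```python
def align_migration_times(times, standards, reference):
    """Linearly rescale migration times so that the two internal standards
    observed at `standards` land on the `reference` positions."""
    s1, s2 = standards
    r1, r2 = reference
    if s1 == s2:
        raise ValueError("the two internal standards must have distinct times")
    scale = (r2 - r1) / (s2 - s1)
    return [r1 + (t - s1) * scale for t in times]


# Hypothetical capillary: standards seen at 10 and 30 min, reference at 5 and 15 min.
print(align_migration_times([10, 20, 30], standards=(10, 30), reference=(5, 15)))
# prints [5.0, 10.0, 15.0]
```

After each of the four capillaries has been rescaled this way, peaks from the A, T, C and G reactions share one time axis and can be ordered for base calling.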

    Acute Myeloid Leukemia

    Acute myeloid leukemia (AML) is the most common type of leukemia. The Cancer Genome Atlas Research Network has demonstrated the increasing genomic complexity of AML. In addition, the network has facilitated our understanding of the molecular events leading to this deadly form of malignancy, for which the prognosis has not improved over past decades. AML is a highly heterogeneous disease, and cytogenetic and molecular analysis of the various chromosome aberrations, including deletions, duplications, aneuploidy, balanced reciprocal translocations and fusion of transcription factor genes and tyrosine kinases, has led to better understanding and identification of subgroups of AML with different prognoses. Furthermore, molecular classification based on mRNA expression profiling has facilitated identification of novel subclasses and defined high-, poor-risk AML based on specific molecular signatures. However, despite increased understanding of AML genetics, the outcome for AML patients, whose number is likely to rise as the population ages, has not changed significantly. Until it does, further investigation of the genomic complexity of the disease and advances in drug development are needed. In this review, leading AML clinicians and research investigators provide an up-to-date understanding of the molecular biology of the disease, addressing advances in diagnosis, classification, prognostication and therapeutic strategies that may have significant promise and impact on overall patient survival.

    The Transcriptomes of Two Heritable Cell Types Illuminate the Circuit Governing Their Differentiation

    The differentiation of cells into distinct cell types, each of which is heritable for many generations, underlies many biological phenomena. White and opaque cells of the fungal pathogen Candida albicans are two such heritable cell types, each thought to be adapted to unique niches within their human host. To systematically investigate their differences, we performed strand-specific, massively-parallel sequencing of RNA from C. albicans white and opaque cells. With these data we first annotated the C. albicans transcriptome, finding hundreds of novel differentially-expressed transcripts. Using the new annotation, we compared differences in transcript abundance between the two cell types with the genomic regions bound by a master regulator of the white-opaque switch (Wor1). We found that the revised transcriptional landscape considerably alters our understanding of the circuit governing differentiation. In particular, we can now resolve the poor concordance between binding of a master regulator and the differential expression of adjacent genes, a discrepancy observed in several other studies of cell differentiation. More than one third of the Wor1-bound differentially-expressed transcripts were previously unannotated, which explains the formerly puzzling presence of Wor1 at these positions along the genome. Many of these newly identified Wor1-regulated genes are non-coding and transcribed antisense to coding transcripts. We also find that 5′ and 3′ UTRs of mRNAs in the circuit are unusually long and that 5′ UTRs often differ in length between cell-types, suggesting UTRs encode important regulatory information and that use of alternative promoters is widespread. Further analysis revealed that the revised Wor1 circuit bears several striking similarities to the Oct4 circuit that specifies the pluripotency of mammalian embryonic stem cells. 
    Additional characteristics shared with the Oct4 circuit suggest a set of general hallmarks characteristic of heritable differentiation states in eukaryotes.

    Substitutional landscape of a split fluorescent protein fragment using high-density peptide microarrays

    Split fluorescent proteins have wide applicability as biosensors for protein-protein interactions, genetically encoded tags for protein detection and localization, as well as fusion partners in super-resolution microscopy. We have established and validated a novel platform for functional analysis of leave-one-out split fluorescent proteins (LOO-FPs) in high throughput and with rapid turnover. We screened more than 12,000 variants of the beta-strand split fragment using high-density peptide microarrays for binding and functional complementation in Green Fluorescent Protein. We studied the effect of peptide length and the effect of different linkers to the solid support. We further mapped the effect of all possible amino acid substitutions at each position as well as in the context of some single and double amino acid substitutions. As all peptides were tested in 12 duplicates, the analysis rests on a firm statistical basis, allowing for confirmation of the robustness and precision of the method. Based on experiments in solution, we conclude that under the given conditions, the signal intensity on the peptide microarray faithfully reflects the binding affinity between the split fragments. With this, we are able to identify a peptide with 9-fold higher affinity than the starting peptide.

    Inter-individual variation of the human epigenome & applications

    Genome-wide association studies (GWAS) have led to the discovery of genetic variants influencing human phenotypes in health and disease. However, almost two decades later, most human traits still cannot be accurately predicted from common genetic variants. Moreover, genetic variants discovered via GWAS mostly map to the non-coding genome and have historically resisted interpretation via mechanistic models. Alternatively, the epigenome lies at the crossroads between genetics and the environment. Thus, there is great excitement towards the mapping of epigenetic inter-individual variation, since its study may link environmental factors to human traits that remain unexplained by genetic variants. For instance, the environmental component of the epigenome may serve as a source of biomarkers for accurate, robust and interpretable phenotypic prediction on low-heritability traits that cannot be attained by classical genetic-based models. Additionally, its research may provide mechanisms of action for genetic associations at non-coding regions that mediate their effect via the epigenome. The aim of this thesis was to explore epigenetic inter-individual variation and to mitigate some of the methodological limitations faced towards its future valorisation.
    Chapter 1 is dedicated to the scope and aims of the thesis. It begins by describing historical milestones and basic concepts in human genetics, statistical genetics, the heritability problem and polygenic risk scores. It then moves towards epigenetics, covering the several dimensions it encompasses. It subsequently focuses on DNA methylation with topics like mitotic stability, epigenetic reprogramming, X-inactivation or imprinting. This is followed by concepts from epigenetic epidemiology such as epigenome-wide association studies (EWAS), epigenetic clocks, Mendelian randomization, methylation risk scores and methylation quantitative trait loci (mQTL). The chapter ends by introducing the aims of the thesis.
    Chapter 2 focuses on stochastic epigenetic inter-individual variation resulting from processes occurring post-twinning, during embryonic development and early life. Specifically, it describes the discovery and characterisation of hundreds of variably methylated CpGs in the blood of healthy adolescent monozygotic (MZ) twins showing equivalent variation among co-twins and unrelated individuals (evCpGs) that could not be explained only by measurement error on the DNA methylation microarray. DNA methylation levels at evCpGs were shown to be stable short-term but susceptible to aging and epigenetic drift in the long-term. The identified sites were significantly enriched at the clustered protocadherin loci, known for stochastic methylation in neurons in the context of embryonic neurodevelopment. Critically, evCpGs were capable of clustering technical and longitudinal replicates while differentiating young MZ twins. Thus, discovered evCpGs can be considered as a first prototype towards a universal epigenetic fingerprint, relevant in the discrimination of MZ twins for forensic purposes, currently impossible with standard DNA profiling.
    Besides, DNA methylation microarrays are the preferred technology for EWAS and mQTL mapping studies. However, their probe design inherently assumes that the assayed genomic DNA is identical to the reference genome, leading to genetic artifacts whenever this assumption is not fulfilled. Building upon the previous experience analysing microarray data, Chapter 3 covers the development and benchmarking of UMtools, an R-package for the quantification and qualification of genetic artifacts on DNA methylation microarrays based on the unprocessed fluorescence intensity signals. These tools were used to assemble an atlas on genetic artifacts encountered on DNA methylation microarrays, including interactions between artifacts or with X-inactivation, imprinting and tissue-specific regulation. Additionally, to distinguish artifacts from genuine epigenetic variation, a co-methylation-based approach was proposed. Overall, this study revealed that genetic artifacts continue to filter through into the reported literature since current methodologies to address them have overlooked this challenge.
    Furthermore, EWAS, mQTL and allele-specific methylation (ASM) mapping studies have all been employed to map epigenetic variation but require matching phenotypic/genotypic data and can only map specific components of epigenetic inter-individual variation. Inspired by the previously proposed co-methylation strategy, Chapter 4 describes a novel method to simultaneously map inter-haplotype, inter-cell and inter-individual variation without these requirements. Specifically, binomial likelihood function-based bootstrap hypothesis test for co-methylation within reads (Binokulars) is a randomization test that can identify jointly regulated CpGs (JRCs) from pooled whole genome bisulfite sequencing (WGBS) data by solely relying on joint DNA methylation information available in reads spanning multiple CpGs. Binokulars was tested on pooled WGBS data in whole blood, sperm and combined, and benchmarked against EWAS and ASM. Our comparisons revealed that Binokulars can integrate a wide range of epigenetic phenomena under the same umbrella since it simultaneously discovered regions associated with imprinting, cell type- and tissue-specific regulation, mQTL, ageing or even unknown epigenetic processes. Finally, we verified examples of mQTL and polymorphic imprinting by employing another novel tool, JRC_sorter, to classify regions based on epigenotype models and non-pooled WGBS data in cord blood. In the future, we envision how this cost-effective approach can be applied on larger pools to simultaneously highlight regions of interest in the methylome, a highly relevant task in the light of the post-GWAS era.
    Moving towards future applications of epigenetic inter-individual variation, Chapters 5 and 6 are dedicated to solving some of the methodological issues faced in translational epigenomics.
    Firstly, due to its simplicity and well-known properties, linear regression is the starting point methodology when performing prediction of a continuous outcome given a set of predictors. However, linear regression is incompatible with missing data, a common phenomenon and a huge threat to the integrity of data analysis in empirical sciences, including (epi)genomics. Chapter 5 describes the development of combinatorial linear models (cmb-lm), an imputation-free, CPU/RAM-efficient and privacy-preserving statistical method for linear regression prediction on datasets with missing values. Cmb-lm provide prediction errors that take into account the pattern of missing values in the incomplete data, even at extreme missingness. As a proof-of-concept, we tested cmb-lm in the context of epigenetic ageing clocks, one of the most popular applications of epigenetic inter-individual variation. Overall, cmb-lm offer a simple and flexible methodology with a wide range of applications that can provide a smooth transition towards the valorisation of linear models in the real world, where missing data is almost inevitable.
    Beyond microarrays, due to its high accuracy, reliability and sample multiplexing capabilities, massively parallel sequencing (MPS) is currently the preferred methodology to translate prediction models for traits of interest into practice. At the same time, tobacco smoking is a frequent habit sustained by more than 1.3 billion people in 2020 and a leading (and preventable) health risk factor in the modern world. Predicting smoking habits from a persistent biomarker, such as DNA methylation, is not only relevant to account for self-reporting bias in public health and personalized medicine studies, but may also allow broadening forensic DNA phenotyping. Previously, a model to predict whether someone is a current, former, or never smoker had been published based solely on 13 CpGs from the hundreds of thousands included in the DNA methylation microarray. However, a matching lab tool with lower marker throughput, and higher accuracy and sensitivity, was missing towards translating the model into practice. Chapter 6 describes the development of an MPS assay and data analysis pipeline to quantify DNA methylation on these 13 smoking-associated biomarkers for the prediction of smoking status. Though our systematic evaluation on DNA standards of known methylation levels revealed marker-specific amplification bias, our novel tool was still able to provide highly accurate and reproducible DNA methylation quantification and smoking habit prediction. Overall, our MPS assay allows the technological transfer of DNA methylation microarray findings and models to practical settings, one step closer towards future applications.
    Finally, Chapter 7 provides a general discussion on the results and topics discussed across Chapters 2-6. It begins by summarizing the main findings across the thesis, including proposals for follow-up studies. It then covers technical limitations pertaining to bisulfite conversion and DNA methylation microarrays, but also more general considerations such as restricted data access. This chapter ends by covering the outlook of this PhD thesis, including topics such as bisulfite-free methods, third-generation sequencing, single-cell methylomics, multi-omics and systems biology.

    Evaluation of Algorithm Performance in ChIP-Seq Peak Detection

    Next-generation DNA sequencing coupled with chromatin immunoprecipitation (ChIP-seq) is revolutionizing our ability to interrogate whole genome protein-DNA interactions. Identification of protein binding sites from ChIP-seq data has required novel computational tools, distinct from those used for the analysis of ChIP-chip experiments. The growing popularity of ChIP-seq spurred the development of many different analytical programs (at last count, we noted 31 open source methods), each with some purported advantage. Given that the literature is dense and empirical benchmarking is challenging, selecting an appropriate method for ChIP-seq analysis has become a daunting task. Herein we compare the performance of eleven different peak calling programs on common empirical transcription factor datasets and measure their sensitivity, accuracy and usability. Our analysis provides an unbiased critical assessment of available technologies, and should assist researchers in choosing a suitable tool for handling ChIP-seq data.