42 research outputs found

    Using signal processing, evolutionary computation, and machine learning to identify transposable elements in genomes

    About half of the human genome consists of transposable elements (TE's), sequences that have many copies of themselves distributed throughout the genome. All genomes, from bacterial to human, contain TE's. TE's affect genome function by either creating proteins directly or affecting genome regulation. They serve as molecular fossils, giving clues to the evolutionary history of the organism. TE's are often challenging to identify because they are fragmentary or heavily mutated. In this thesis, novel features for the detection and study of TE's are developed. These features are of two types. The first type are statistical features based on the Fourier transform used to assess reading frame use. These features measure how different the reading frame use is from that of a random sequence, which reading frames the sequence is using, and the proportion of use of the active reading frames. The second type of feature, called side effect machine (SEM) features, are generated by finite state machines augmented with counters that track the number of times the state is visited. These counters then become features of the sequence. The number of possible SEM features is super-exponential in the number of states. New methods for selecting useful feature subsets that incorporate a genetic algorithm and a novel clustering method are introduced. The features produced reveal structural characteristics of the sequences of potential interest to biologists. A detailed analysis of the genetic algorithm, its fitness functions, and its fitness landscapes is performed. The features are used, together with features used in existing exon finding algorithms, to build classifiers that distinguish TE's from other genomic sequences in humans, fruit flies, and ciliates. The classifiers achieve high accuracy (> 85%) on a variety of TE classification problems. The classifiers are used to scan large genomes for TE's. 
In addition, the features are used to describe the TE's in the newly sequenced ciliate Tetrahymena thermophila, giving biologists information useful for forming experimentally testable hypotheses about the role of these TE's and the mechanisms that govern them.
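
The side effect machine (SEM) features described in this abstract can be illustrated with a minimal sketch. It assumes the simplest reading of the description: a deterministic finite state machine reads the sequence one base at a time, and a counter per state records how often that state is entered. The three-state transition table, the function name `sem_features`, and the choice to normalise counts by sequence length are all hypothetical, chosen for illustration; none of them is taken from the thesis.

```python
from typing import Dict, List, Tuple

BASES = "ACGT"

# Hypothetical 3-state machine: row = current state, column = base (A, C, G, T).
# This table is illustrative only and does not come from the thesis.
TABLE = [
    [1, 0, 0, 2],
    [1, 2, 0, 1],
    [0, 2, 1, 0],
]
TRANSITIONS: Dict[Tuple[int, str], int] = {
    (state, base): TABLE[state][i]
    for state in range(len(TABLE))
    for i, base in enumerate(BASES)
}


def sem_features(
    sequence: str,
    transitions: Dict[Tuple[int, str], int],
    num_states: int,
    start_state: int = 0,
    normalize: bool = True,
) -> List[float]:
    """Run the machine over `sequence` and return one visit counter per state."""
    counts = [0] * num_states
    state = start_state
    processed = 0
    for base in sequence.upper():
        if base not in BASES:
            continue  # skip ambiguous symbols such as N (an assumption)
        state = transitions[(state, base)]
        counts[state] += 1  # the "side effect": count each state entry
        processed += 1
    if normalize:
        # Dividing by length makes sequences of different sizes comparable.
        return [c / processed if processed else 0.0 for c in counts]
    return [float(c) for c in counts]


print(sem_features("ACGT", TRANSITIONS, num_states=3, normalize=False))
```

Each distinct transition table yields a different feature vector, which is why the abstract pairs SEM features with a search procedure (a genetic algorithm) for picking useful machines.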

    Remodelling regulates the heterochromatin of retrotransposons in mouse embryonic stem cells

    Chromatin remodellers slide, assemble, eject or edit nucleosomes, influencing chromatin structure, DNA accessibility and transcriptional programmes. The SNF2-like remodeller SMARCAD1 is conserved from yeast to human cells and is highly expressed in mouse embryonic stem (ES) cells. Upon its loss, cells lose their pluripotent phenotype, but its function in ES cells is not known. To understand the role of SMARCAD1 in mouse ES cells, a robust ChIP-seq protocol was developed for the tagged and endogenous protein in wild-type and knockdown cell lines. SMARCAD1 binding sites were found predominantly at intergenic sites genome-wide, and overlap with repressive histone modifications. Among the chromatin-bound proteins found to be enriched at SMARCAD1 binding sites are KAP1 (KRAB-associated protein 1; Krüppel-associated box), a critical factor for the silencing of endogenous retroviral elements (ERVs) in mouse ES cells, the histone methyltransferase SETDB1 (SET Domain Bifurcated Histone Lysine Methyltransferase I), and the histone variant H3.3. Taken together, the discovered binding sites provide new understanding of SMARCAD1 function in ES cells and illustrate that in the open chromatin environment characteristic of the pluripotent state, SMARCAD1 is associated with transcriptional repression. An unresolved issue is how SMARCAD1 associates with its binding sites, given that it has neither a DNA-binding domain nor any known domain for recruitment by histone modifications. Candidate SMARCAD1 targets were investigated and it was discovered that recruitment is dependent on the interaction with KAP1 via the CUE1 (Coupling of Ubiquitin conjugation to ER degradation) domain of SMARCAD1. Sequential ChIP experiments revealed that KAP1 and SMARCAD1 are co-enriched on their shared targets. Among the discovered binding sites of the remodeller SMARCAD1 are endogenous retroviral elements (ERVs), an abundant type of transposable element derived from viral integrations in the germline. 
ERV expression is tightly controlled by repressive factors as they pose a threat for genome stability. A series of knockdowns and ChIP-qPCR experiments was performed to understand the cooperation and underlying mechanisms of how these factors shape and control ERV heterochromatin. SMARCAD1 was identified as a crucial component; it is required for the association of the KAP1-SETDB1 silencing machinery over class I and II ERVs, and consequently the maintenance of the histone modifications H3K9me3 and H4K20me3. The histone variant H3.3 has a controversial role in ERV control and is similarly reduced upon the loss of SMARCAD1, suggesting SMARCAD1 may be involved in the turn-over of this variant. In summary, heterochromatin organisation is perturbed when SMARCAD1 is not present at ERVs. The presence of H3K9me3 on the other hand is not necessary for SMARCAD1 binding as shown in a SETDB1 knock-down. The assembly of KAP1 and H3K9me3 is rescued by the ectopic expression of wild-type SMARCAD1 but not by an ATPase mutant. Hence, it is the catalytic activity of SMARCAD1 and chromatin remodelling that is required for the silencing of ERVs. The KAP1 interaction mutant had no effect on the association of KAP1 itself but could similarly not restore H3K9me3 emphasizing that SMARCAD1 is required for successful heterochromatin formation on ERVs and identifying chromatin remodelling as a key mechanism of ERV control in mouse ES cells

    On the molecular basis of mammalian totipotency

    The transient capacity to autonomously form and organize all of the embryonic and extra-embryonic tissues involved in the development of a complete organism is termed totipotency. In mammals, totipotency is a feature restricted to the earliest cells of the pre-implantation embryo, which harbor this unique capacity during the first 1-3 cell cycles, depending on the species. However, our understanding of the regulatory mechanisms responsible for the establishment, maintenance and termination of such a highly plastic regulatory state remains limited. Mammalian totipotency occurs concomitantly with a set of closely intertwined biological processes, such as global chromatin remodeling, an unusual set of metabolic characteristics and the de-repression of the vast majority of transposable elements, and it is unclear whether these processes act to sustain it. Following a general overview of these processes, in this dissertation I present my contributions to a body of work on an in vitro model system for mammalian totipotency, which exhibits certain molecular features of the in vivo totipotent state. In the second part of this thesis, I present the transcriptional analyses that I have conducted with the aim of understanding the role of transposable element transcription during pre-implantation development. Overall, this work describes a set of phenomena that arise in totipotent cells in vivo and in totipotent-like cells in vitro, and explores how recapitulating certain molecular features of totipotent cells in pluripotent cells induces a totipotent-like state in vitro.

    Abundance and diversity of endogenous retroviruses in the chicken genome

    Long terminal repeat (LTR) retrotransposons are autonomous eukaryotic repetitive elements which may elicit prolonged genomic and immunological stress on their host organism. LTR retrotransposons comprise approximately 10 % of the mammalian genome, but previous work identified only 1.35 % of the chicken genome as LTR retrotransposon sequence. This deficit appears inconsistent across birds, as studied Neoaves have contents comparable with mammals, although all birds contain only one LTR retrotransposon class: endogenous retroviruses (ERVs). One group of chicken-specific ERVs (Avian Leukosis Virus subgroup E; ALVEs) remain active and have been linked to commercially detrimental phenotypes, such as reduced lifetime egg count, but their full diversity and range of phenotypic effects are poorly understood. A novel identification pipeline, LocaTR, was developed to identify LTR retrotransposon sequences in the chicken genome. This enabled the annotation of 3.01 % of the genome, including 1,073 structurally intact elements with replicative potential. Elements were depleted within coding regions, and over 40 % of intact elements were found in clusters in gene sparse, poorly recombining regions. RNAseq analysis showed that elements were generally not expressed, but intact transcripts were identified in four cases, supporting the potential for viral recombination and retrotransposition of non-autonomous repeats. LocaTR analysis of seventy-two additional sauropsid genomes revealed highly lineage-specific repeat content, and did not support the proposed deficit in Galliformes. A second, novel bioinformatic pipeline was constructed to identify ALVE insertions in whole genome resequencing data and was applied to eight elite layer lines from Hy-Line International. Twenty ALVEs were identified and diagnostic assays were developed to validate the bioinformatic approach. Each ALVE was sequenced and characterised, with many exhibiting high structural intactness. 
In addition, a K locus revertant line was identified due to the unexpected presence of ALVE21, confirmed using BioNano optic maps. The ALVE identification pipeline was then applied to ninety chicken lines and 322 different ALVEs were identified, 81 % of which were novel. Overall, broilers and non-commercial chickens had a greater number of ALVEs than were found in layers. Taken together, these two analyses have enabled a thorough characterisation of both the abundance and diversity of chicken ERVs

    The analysis of genetic aberrations in South African oesophageal squamous cell carcinoma patients

    Estimates for 2017 indicate that 20% of cancers globally are gastrointestinal tract (GIT) cancers, with oesophageal cancer being the 8th most common cancer. Oesophageal squamous cell carcinoma (OSCC) occurs in the upper to mid oesophagus and is present at high incidence in developing countries including South Africa. There are no early symptoms, resulting in late diagnosis and poor prognosis. In this study, tumour and blood DNA was obtained from 35 OSCC patients and subjected to whole genome sequencing (WGS). Bioinformatics analysis pipelines were designed to search for novel viral insertions and to investigate Human Endogenous Retrovirus (HERV) insertions alongside the presence of somatic mutations in patient samples. The aims were to identify integration of any foreign DNA, to investigate whether there is any linkage between HERV insertion and somatic mutations, and to identify any somatic mutations of potential interest in the OSCC cohort. The novel virus investigations, however, proved inconclusive, and there appeared to be no link between HERV insertions and somatic mutations present in the patients. Very significantly, it was determined that numerous somatic mutations were present in the MUC3A gene of the patient cohort, an interesting observation as no such previous association with OSCC has been recorded. MUC3A is a membrane-bound glycoprotein component of mucous gels, and its aberrant expression has been correlated with invasion and metastasis in a variety of other cancers. However, due to the complexity of the particular gene sequence and the known inconsistencies of variant calling performed on complex data sets, these mutations should be viewed with extreme caution as they are likely to be false positives. Analysis of RNA-seq data showed a 4.6 log2 fold increase in MUC3A expression in the tumour samples of these OSCC patients, with a P-adjusted value of 7.05e-06, suggesting highly significant differential gene expression. 
Functional enrichment analysis further showed that MUC3A was significantly associated with one of the top 5 gene ontologies (extracellular matrix structural constituent) for molecular function ontology class together with a number of collagen (COL) and MMP genes known to play a role in oncogenic progression and membrane stiffness. GSEA and KEGG analysis indicated predominantly chemokine/cytokine pro-inflammatory enriched pathways. Immunohistochemistry staining showed 10 out of 13 of the samples had no detectable levels of MUC3A protein, suggesting that the production of a non-functional truncated protein may lead to the upregulation of MUC3A expression that could possibly play a role in downstream pro-oncogenic signalling

    Mammalian comparative genomics and epigenomics

    Thesis (Ph. D.)--Harvard-MIT Division of Health Sciences and Technology, 2009. This electronic version was submitted by the student author. The certified thesis is available in the Institute Archives and Special Collections. Cataloged from student submitted PDF version of thesis. Includes bibliographical references. The human genome sequence can be thought of as an instruction manual for our species, written and rewritten over more than a billion years of evolution. Taking a complete inventory of our genome, dissecting its genes and their functional components, and elucidating how these genes are selectively used to establish and maintain cell types with markedly different behaviors, are key challenges of modern biology. In this thesis we present contributions to our understanding of the structure, function and evolution of the human genome. We rely on two complementary approaches. First, we study signatures of evolutionary processes that have acted on the genome using comparative sequence analysis. We generate high quality draft genome sequences of the chimpanzee, the dog and the opossum. These species share a last common ancestor with humans approximately 6 million, 80 million and 140 million years ago, respectively, and therefore provide distinct perspectives on our evolutionary history. We apply computational methods to explore the functional organization of the genome and to identify genes that contribute to shared and species-specific traits. Second, we study how the genome is bound by proteins and packaged into chromatin in distinct cell types. We develop new methods to map protein-DNA interactions and DNA methylation using single-molecule based sequencing technology. We apply these methods to identify new functional sequence elements based on characteristic chromatin signatures, and to explore the relationship between DNA sequence, chromatin and cellular state. By Tarjei Sigurd Mikkelsen. Ph.D.

    Reticulate Evolution: Symbiogenesis, Lateral Gene Transfer, Hybridization and Infectious heredity
