    Longer First Introns Are a General Property of Eukaryotic Gene Structure

    While many properties of eukaryotic gene structure are well characterized, differences in the form and function of introns that occur at different positions within a transcript are less well understood. In particular, the dynamics of intron length variation with respect to intron position have received relatively little attention. This study analyzes all available data on intron lengths in GenBank and finds a significant trend of increased length in first introns throughout a wide range of species. This trend was found to be even stronger when using high-confidence gene annotation data for three model organisms (Arabidopsis thaliana, Caenorhabditis elegans, and Drosophila melanogaster), which show that the first intron in the 5′ UTR is, on average, significantly longer than all downstream introns within a gene. A partial explanation for increased first intron length in A. thaliana is suggested by the increased frequency of certain motifs that are present in first introns. The phenomenon of longer first introns can potentially be used to improve gene prediction software and also to detect errors in existing gene annotations.

    Patterns of exon-intron architecture variation of genes in eukaryotic genomes

    Background: The origin and importance of exon-intron architecture comprise one of the remaining mysteries of gene evolution. Several studies have investigated the variations of intron length, GC content, ordinal position in a gene and divergence. However, few studies have examined the structural variation of exons and introns. Results: We investigated the length, GC content, ordinal position and divergence in both exons and introns of 13 eukaryotic genomes, representing plants and animals. Our analyses revealed that three basic patterns of exon-intron variation were present in nearly all analyzed genomes (P < 0.001 in most cases): an ordinal reduction of length and divergence in both exons and introns, a co-variation between an exon and its flanking introns in their length, GC content and divergence, and a decrease of average exon (or intron) length, GC content and divergence as the total exon number of a gene increased. In addition, we observed that the shorter introns had either low or high GC content, and the GC content of long introns was intermediate. Conclusion: Although the factors contributing to these patterns have not been identified, our results provide three important clues: common factor(s) exist and may shape both exons and introns; the ordinal reduction patterns may reflect a time-ordered evolution; and the larger first and last exons may be required for splicing. These clues provide a framework for elucidating mechanisms involved in the organization of eukaryotic genomes and particularly in building exon-intron structures.

    Diagnostic applications of next generation sequencing: working towards quality standards

    Over the past 6 years, next generation sequencing (NGS) has been established as a valuable high-throughput method for research in molecular genetics and has successfully been employed in the identification of rare and common genetic variations. All major NGS technology companies providing commercially available instruments (Roche 454, Illumina, Life Technologies) have recently marketed bench top sequencing instruments with lower throughput and shorter run times, thereby broadening the applications of NGS and opening the technology to the potential use for clinical diagnostics. Although the high expectations regarding the discovery of new diagnostic targets and an overall reduction of cost have been achieved, technological challenges in instrument handling, robustness of the chemistry and data analysis need to be overcome. To facilitate the implementation of NGS as a routine method in molecular diagnostics, consistent quality standards need to be developed. Here the authors give an overview of the current standards in protocols and workflows and discuss possible approaches to define quality criteria for NGS in molecular genetic diagnostics.

    A Genome-Wide Analysis of FRT-Like Sequences in the Human Genome

    Efficient and precise genome manipulations can be achieved by the Flp/FRT system of site-specific DNA recombination. Applications of this system are limited, however, to cases when target sites for Flp recombinase, FRT sites, are pre-introduced into a genome locale of interest. To expand use of the Flp/FRT system in genome engineering, variants of Flp recombinase can be evolved to recognize pre-existing genomic sequences that resemble FRT and thus can serve as recombination sites. To understand the distribution and sequence properties of genomic FRT-like sites, we performed a genome-wide analysis of FRT-like sites in the human genome using the experimentally derived parameters. Out of 642,151 identified FRT-like sequences, 581,157 sequences were unique and 12,452 sequences had at least one exact duplicate. Duplicated FRT-like sequences are located mostly within LINE1, but also within LTRs of endogenous retroviruses, Alu repeats and other repetitive DNA sequences. The unique FRT-like sequences were classified based on the number of matches to FRT within the first four proximal base pairs of the Flp binding elements of FRT and the nature of mismatched base pairs in the same region. The data obtained will be useful for the emerging field of genome engineering.

    GC content around splice sites affects splicing through pre-mRNA secondary structures

    Background: Alternative splicing increases protein diversity by generating multiple transcript isoforms from a single gene through different combinations of exons or through different selections of splice sites. It has been reported that RNA secondary structures are involved in alternative splicing. Here we perform a genomic study of RNA secondary structures around splice sites in humans (Homo sapiens), mice (Mus musculus), fruit flies (Drosophila melanogaster), and nematodes (Caenorhabditis elegans) to further investigate this phenomenon. Results: We observe that GC content around splice sites is closely associated with the splice site usage in multiple species. RNA secondary structure is the possible explanation, because the structural stability difference among alternative splice sites, constitutive splice sites, and skipped splice sites can be explained by the GC content difference. Alternative splice sites tend to be GC-enriched and exhibit more stable RNA secondary structures in all of the considered species. In humans and mice, splice sites of first exons and long exons tend to be GC-enriched and hence form more stable structures, indicating the special role of RNA secondary structures in promoter proximal splicing events and the splicing of long exons. In addition, GC-enriched exon-intron junctions tend to be overrepresented in tissue-specific alternative splice sites, indicating the functional consequence of the GC effect. Compared with regions far from splice sites and decoy splice sites, real splice sites are GC-enriched. We also found that the GC-content effect is much stronger than the nucleotide-order effect to form stable secondary structures. Conclusion: All of these results indicate that GC content is related to splice site usage and it may mediate the splicing process through RNA secondary structures.

    Trinucleotide repeats in human genome and exome

    Trinucleotide repeats (TNRs) are of interest in genetics because they are used as markers for tracing genotype–phenotype relations and because they are directly involved in numerous human genetic diseases. In this study, we searched the human genome reference sequence and annotated exons (exome) for the presence of uninterrupted triplet repeat tracts composed of six or more repeated units. A list of 32 448 TNRs and 878 TNR-containing genes was generated and is provided herein. We found that some triplet repeats, specifically CNG, are overrepresented, while CTT, ATC, AAC and AAT are underrepresented in exons. This observation suggests that the occurrence of TNRs in exons is not random, but undergoes positive or negative selective pressure. Additionally, TNR types strongly determine their localization in mRNA sections (ORF, UTRs). Most genes containing exon-overrepresented TNRs are associated with gene ontology-defined functions. Surprisingly, many groups of genes that contain TNR types coding for different homo-amino acid tracts associate with the same transcription-related GO categories. We propose that TNRs have potential to be functional genetic elements and that their variation may be involved in the regulation of many common phenotypes; as such, TNR polymorphisms should be considered a priority in association studies
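
    The search criterion described above (uninterrupted tracts of six or more copies of one triplet) can be illustrated with a minimal sketch. This is not the authors' pipeline: the function name `find_tnr_tracts`, the single-strand scan, and the decision to skip homopolymer runs such as AAAAAA... are all assumptions made for illustration.

```python
import re


def find_tnr_tracts(sequence, min_units=6):
    """Return (start, unit, n_units) for each uninterrupted triplet repeat tract.

    Hypothetical helper: scans one strand only and ignores homopolymer runs,
    which is an assumption, not something the abstract specifies.
    """
    # One 3-base unit followed by at least (min_units - 1) exact copies of it.
    pattern = re.compile(r"([ACGT]{3})\1{%d,}" % (min_units - 1))
    tracts = []
    for match in pattern.finditer(sequence.upper()):
        unit = match.group(1)
        if len(set(unit)) == 1:
            continue  # e.g. AAA repeated: a mononucleotide run, not a true triplet
        tracts.append((match.start(), unit, len(match.group(0)) // 3))
    return tracts


if __name__ == "__main__":
    example = "TT" + "CAG" * 7 + "TT" + "A" * 18 + "GCC" * 5
    print(find_tnr_tracts(example))
```

    In the example string only the CAG tract is reported: the run of A is skipped as a homopolymer and the GCC tract has only five units. A genome-scale search would also need to handle the reverse strand and ambiguous bases, which this sketch leaves out.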

    Lessons learned from additional research analyses of unsolved clinical exome cases

    BACKGROUND: Given the rarity of most single-gene Mendelian disorders, concerted efforts of data exchange between clinical and scientific communities are critical to optimize molecular diagnosis and novel disease gene discovery. METHODS: We designed and implemented protocols for the study of cases for which a plausible molecular diagnosis was not achieved in a clinical genomics diagnostic laboratory (i.e. unsolved clinical exomes). Such cases were recruited to a research laboratory for further analyses, in order to potentially: (1) accelerate novel disease gene discovery; (2) increase the molecular diagnostic yield of whole exome sequencing (WES); and (3) gain insight into the genetic mechanisms of disease. Pilot project data included 74 families, consisting mostly of parent-offspring trios. Analyses performed on a research basis employed both WES from additional family members and complementary bioinformatics approaches and protocols. RESULTS: Analysis of all possible modes of Mendelian inheritance, focusing on both single nucleotide variants (SNV) and copy number variant (CNV) alleles, yielded a likely contributory variant in 36% (27/74) of cases. If one includes candidate genes with variants identified within a single family, a potential contributory variant was identified in a total of ~51% (38/74) of cases enrolled in this pilot study. The molecular diagnosis was achieved in 30/63 trios (47.6%). Besides this, the analysis workflow yielded evidence for pathogenic variants in disease-associated genes in 4/6 singleton cases (66.6%), 1/1 multiplex family involving three affected siblings, and 3/4 (75%) quartet families. Both the analytical pipeline and the collaborative efforts between the diagnostic and research laboratories provided insights that allowed recent disease gene discoveries (PURA, TANGO2, EMC1, GNB5, ATAD3A, and MIPEP) and increased the number of novel genes, defined in this study as genes identified in more than one family (DHX30 and EBF3). 
CONCLUSION: An efficient genomics pipeline in which clinical sequencing in a diagnostic laboratory is followed by the detailed reanalysis of unsolved cases in a research environment, supplemented with WES data from additional family members, and subject to adjuvant bioinformatics analyses including relaxed variant filtering parameters in informatics pipelines, can enhance the molecular diagnostic yield and provide mechanistic insights into Mendelian disorders. Implementing these approaches requires collaborative clinical molecular diagnostic and research efforts.

    An integrated ChIP-seq analysis platform with customizable workflows

    Background: Chromatin immunoprecipitation followed by next generation sequencing (ChIP-seq) enables unbiased and genome-wide mapping of protein-DNA interactions and epigenetic marks. The first step in ChIP-seq data analysis involves the identification of peaks (i.e., genomic locations with high density of mapped sequence reads). The next step consists of interpreting the biological meaning of the peaks through their association with known genes, pathways, regulatory elements, and integration with other experiments. Although several programs have been published for the analysis of ChIP-seq data, they often focus on the peak detection step and are usually not well suited for thorough, integrative analysis of the detected peaks. Results: To address the peak interpretation challenge, we have developed ChIPseeqer, an integrative, comprehensive, fast and user-friendly computational framework for in-depth analysis of ChIP-seq datasets. The novelty of our approach is the capability to combine several computational tools in order to create easily customized workflows that can be adapted to the user's needs and objectives. In this paper, we describe the main components of the ChIPseeqer framework, and also demonstrate the utility and diversity of the analyses offered, by analyzing a published ChIP-seq dataset. Conclusions: ChIPseeqer facilitates ChIP-seq data analysis by offering a flexible and powerful set of computational tools that can be used in combination with one another. The framework is freely available as a user-friendly GUI application, but all programs are also executable from the command line, thus providing flexibility and automatability for advanced users.

    An Investigation Into The Molecular Basis Underlying Enhancement Of Transcription By The Intron In Budding Yeast

    It is now quite evident that introns, which are removed from the primary transcript by the process of splicing, are involved in a variety of important functions in eukaryotic cells. One of the evolutionarily conserved functions of introns is their role in regulating transcription of the genes that harbor them. This effect of a splicing-competent intron on transcription is known as ‘Intron-Mediated Enhancement of transcription’ (IME). It has been observed that intron-containing genes are often transcribed more efficiently than non-intronic genes. However, the molecular mechanism underlying IME in budding yeast and higher eukaryotes is not entirely clear, and that forms the basis of my thesis. To address this issue, I have organized my research project into three specific aims. The primary objective of the first aim was to investigate the mechanism of enhancement of transcription by an intron. I found that intron-mediated enhancement in budding yeast is dependent on the gene assuming a unique architecture called a gene loop. In the second aim, I explored the molecular basis underlying enhancement of transcription by the intron-facilitated gene loop. In the third aim, I determined the effect of the position of an intron within a gene on its transcription regulatory potential. In the first aim, I randomly selected six genes and compared their transcription in the presence and absence of an intron by a strand-specific TRO approach. I observed a sharp decline in transcription in the absence of the intron. Furthermore, I found that the gene assumed a looped conformation in the presence of an intron. The intron-dependent gene loop was stabilized by three types of interactions: the promoter-terminator, the promoter-5′ splice site and the terminator-3′ splice site interactions. More importantly, I found that the intron-dependent enhancement was completely dependent on gene looping, as no enhancement of transcription by an intron was observed in the looping-defective mutant.
In the second aim, I investigated how intron-mediated gene looping regulates transcription. My hypothesis was that intron-mediated gene looping confers directionality, and thereby enhances transcription. During initiation of transcription, the promoter-bound RNAP II has a tendency to transcribe both the downstream coding region in the sense direction, producing mRNA, and the upstream non-coding region in the anti-sense direction, producing uaRNA (upstream-antisense RNA). However, there are certain checkpoints in the cell that allow selective transcription in the sense direction over the anti-sense direction, hence maintaining promoter directionality. My results reveal that intron-dependent gene looping facilitates the recruitment of termination factors in the promoter-proximal region. These termination factors then selectively terminate uaRNA synthesis, and hence confer directionality. My last aim was to see the effect of the position of an intron within a gene on transcription of the gene. The generally accepted view is that the intron should be present close to the 5′ end of the gene to bring about enhancement of transcription. Whether the presence of an intron near the 3′ end of the gene results in enhancement of transcription in yeast was unclear. To address the issue, I inserted the intron in the intron-less version of the IMD4 gene at three positions, and showed that even the terminator-proximal intron can enhance transcription. So far, my results have shown that the terminator-proximal intron also enhances transcription in a way similar to the promoter-proximal intron, that is, by conferring promoter directionality.

    Selection and Population Structure in Drosophila melanogaster

    In this thesis I scrutinized a specific region of the X chromosome of Drosophila melanogaster for evidence of positive directional selection. In addition, I analyzed the structure of six Southeast (SE) Asian populations of this species. In the first chapter, I analyzed a region that showed no polymorphism in a previous scan of the X chromosome in a European D. melanogaster population. This region, which I named the wapl region, is located on the distal part of the X chromosome, in cytological division 2C10 - 2E1. I observed a 60.5 - kb stretch of DNA encompassing the genes ph-d, ph-p, CG3835, bcn92, Pgd, wapl and Cyp4d1 that almost completely lacks variation in the European sample. Loci flanking this region show a skewed frequency spectrum at segregating sites, strong haplotype structure, and high levels of linkage disequilibrium. Neutrality tests revealed that these patterns of variation are unlikely under the neutral equilibrium model or simple bottleneck scenarios. In contrast, newly developed likelihood ratio tests suggest that strong positive selection has acted recently on the region under investigation, resulting in a selective sweep. Evidence is presented that this sweep may have originated in an ancestral population in Africa. In the second chapter, I revisited the center of the wapl region analyzed in chapter 1. I concentrated on the African D. melanogaster sample, as the valley of reduced variation found in the previous study was much narrower in the African sample than in the European one, which should help to pinpoint the target of selection. About 80% of the valley of reduced nucleotide variation was sequenced. This valley is located between the genes ph-d and Pgd. I therefore termed this part the ph-d - Pgd region. The new results confirm previous conclusions about selection having shaped nucleotide variability in this part of the D. melanogaster genome. 
Moreover, by sequencing the center of the selective sweep I was able to establish the haplotype structure in that region and to infer the historical context of the sweep. Most likely a positively selected substitution occurred at ph-p and was fixed before the out-of-Africa expansion of D. melanogaster, possibly >30,000 years ago. This substitution might be associated with the specialization of ph-p in gene regulation. In addition, the results obtained from the European sample indicate that sequence variation was not affected by demography alone. In fact, it was found that selection affected nucleotide diversity in the ph-d - Pgd region of the European sample as well. Since heterozygosity across the whole wapl region is substantially reduced, I propose that an additional selective sweep occurred at a different site in the European population. This is supported by an analysis regarding the time since the fixation of the (first) beneficial mutation at ph-p, which points toward a substitution in D. melanogaster before the colonization of Europe. In chapter 3, I obtained sequence data from six SE Asian samples for ten putatively neutrally evolving X-linked loci. Population genetic parameters were estimated and compared to those previously obtained from the European and the African sample. I observed substantially lower levels of nucleotide diversity in SE Asia than in either Africa or Europe. In particular, samples taken from more peripheral populations (e.g. Manila and Cebu, located in the Philippines) show a paucity of haplotypes. Common summary statistics indicate that genetic drift had a significant impact on these populations, which also led to considerable population substructure. One sample, i.e. Kuala Lumpur, however, shows rather high levels of heterozygosity among all SE Asian samples and is on average least differentiated from these.
This indicates that the Kuala Lumpur population is ancestral to the other SE Asian populations, which is supported by a high number of shared polymorphic sites. Finally, I revisited the wapl region, as analyzed in the first chapter, and found evidence that the selective sweep is older in Kuala Lumpur than in Europe.