96 research outputs found

    A large-scale analysis of mRNA polyadenylation of human and mouse genes

    mRNA polyadenylation is a critical cellular process in eukaryotes. It involves 3′ end cleavage of nascent mRNAs and addition of the poly(A) tail, which plays important roles in many aspects of the cellular metabolism of mRNA. The process is controlled by various cis-acting elements surrounding the cleavage site, and their binding factors. In this study, we surveyed genome regions containing cleavage sites [herein called poly(A) sites] for 13 942 human and 11 155 mouse genes. We found that a large proportion of human and mouse genes have alternative polyadenylation (∼54 and 32%, respectively). The conservation of alternative polyadenylation type or polyadenylation configuration between human and mouse orthologs is statistically significant, indicating that alternative polyadenylation is widely employed by these two species to produce alternative gene transcripts. Genes belonging to several functional groups, indicated by their Gene Ontology annotations, are biased with respect to polyadenylation configuration. Many poly(A) sites harbor multiple cleavage sites (51.25% of human and 46.97% of mouse sites), leading to heterogeneous 3′ end formation for transcripts. This implies that the cleavage process of polyadenylation is largely imprecise. Different types of poly(A) sites, with regard to their relative locations in a gene, are found to have distinct nucleotide composition in surrounding genomic regions. This large-scale study provides important insights into the mechanism of polyadenylation in mammalian species and represents a genomic view of the regulation of gene expression by alternative polyadenylation.

    Analysis and Annotation of Nucleic Acid Sequence

    A cause for consilience: Utilizing multiple genomic data types to resolve problematic nodes within Arthropoda and Ecdysozoa

    A major turning point in the study of metazoan evolution was the recognition of the existence of the Ecdysozoa in 1997. This is a group of eight animal phyla (Nematoda, Nematomorpha, Loricifera, Kinorhyncha, Priapulida, Tardigrada, Onychophora and Arthropoda). Ecdysozoa is the most speciose clade of animals ever to exist, and the relationships among its eight phyla are still heatedly debated. Similarly, the relationships among the three sub-phyla (Chelicerata, Pancrustacea and Myriapoda) within the most important ecdysozoan phylum (the Arthropoda) are still debated. Indeed, the two major problems in ecdysozoan phylogeny concern the relationships of Myriapoda within Arthropoda, and of Tardigrada within Ecdysozoa. The difficulty in resolving ecdysozoan relationships resides in lineages characterized by rapid, deep divergences followed by long periods of divergent evolution. The phylogenetic signal needed to resolve the relationships of these lineages is diluted, increasing the likelihood of recovering phylogenetic artifacts. In an attempt to resolve the relationships within Ecdysozoa, the consilience of three independent phylogenetic data sets was investigated. EST, rRNA and microRNA (miRNA) data were sampled across all major ecdysozoan phyla. In particular, a major contribution of this thesis is the first-time sequencing of miRNAs for all the panarthropod phyla. MicroRNAs are genome regulatory elements that recently emerged as a source of useful phylogenetic data (Sempere et al. 2006) because of their low homoplasy levels. The data sets were analysed using phylogenetic methods and models implemented to minimize the occurrence of phylogenetic reconstruction artifacts, in order to understand the evolution of Ecdysozoa. Analyses of the independent data types recovered well-supported, corroborating evidence for the monophyly of Panarthropoda (Arthropoda, Onychophora and Tardigrada), a sister-group relationship between Myriapoda and Pancrustacea within Arthropoda, and the paraphyly of Cycloneuralia (Nematoda, Nematomorpha, Loricifera, Kinorhyncha and Priapulida).

    The Presence and Distribution of Crotoxin in the Rock Rattlesnake (Crotalus lepidus)

    Crotoxin and its homologs (hereafter all referred to as CTx) are highly lethal heterodimeric beta-neurotoxins found in pitvipers (Crotalinae) and are the main drivers of neurotoxic venom phenotypes (Type II). In contrast, hemorrhagic venom phenotypes (Type I) are characterized by high snake venom metalloproteinase expression and low toxicity. Although many rattlesnake species have been classified as either Type I or Type II, population-level variation in venom phenotype has also been documented in several species. The presence or absence of CTx is the main component of this variation in venom phenotype and has been most widely studied in large-bodied lowland rattlesnakes (Crotalus scutulatus, C. helleri, and C. horridus). While CTx has been suspected to occur in C. lepidus, a small-bodied montane rattlesnake, there has been no genetic confirmation. We used genomics and transcriptomics to test for the presence, distribution, and evolution of CTx in C. lepidus. We genomically and transcriptomically confirmed the presence and expression of CTx in C. lepidus and found it in 17 out of 104 samples across the species' range. CTx presence was not significantly associated with longitude, latitude, subspecies, or elevation. However, we did identify several climatic variables associated with CTx presence, including ones that have been identified in previous studies of CTx expression. These results provide insights into the phylogenetic distribution of CTx across rattlesnakes and the variation in crotoxin expression, and highlight environments to which CTx may be locally adapted. Our results likely support previous hypotheses of an ancestral origin for crotoxin followed by independent sorting in lineages; therefore, future studies should focus on testing for the presence of CTx in other species of montane rattlesnakes.

    Modélisation et comparaison de la structure de gènes (Modeling and comparison of gene structure)

    Bioinformatics is a multidisciplinary research field at the intersection of several domains: biology, medicine, mathematics, statistics, chemistry, physics and computer science. Its aim is to design and apply statistical and computational models and tools that advance knowledge in biology and related sciences. In this context, understanding how genes function and evolve is the subject of many bioinformatics studies. These studies are mostly based on gene comparison, and in particular on the alignment of genomic sequences. However, when computing genomic sequence alignments, existing methods rely solely on sequence similarity and do not take gene structure into account. Alignment that takes sequence structure into account offers the opportunity to improve alignment accuracy, as well as the results of methods built on these alignments. The objective of this doctoral thesis follows from this hypothesis: to propose models that account for gene structure when aligning the sequences of gene families. Through this thesis, we have contributed to scientific knowledge by developing alignment models for biological sequences that integrate information on the coding and splicing structure of the sequences. We proposed an algorithm and a new scoring function for aligning coding DNA sequences (CDS) that take into account the length of translation frameshifts. We also proposed an algorithm for aligning pairs of sequences from a gene family while considering their splicing structures. We further developed an algorithm for assembling pairwise spliced alignments into multiple sequence alignments. Finally, we developed a tool for visualizing multiple spliced alignments of gene families. In this thesis, we have highlighted the importance and demonstrated the usefulness of taking the structure of the input sequences into account when computing their alignment.

    Assembly and Compositional Analysis of Human Genomic DNA - Doctoral Dissertation, August 2002

    In 1990, the United States Human Genome Project was initiated as a fifteen-year endeavor to sequence the approximately three billion bases making up the human genome (Vaughan, 1996). As of December 31, 2001, the public sequencing efforts have sequenced a total of 2.01 billion finished bases representing 63.0% of the human genome (http://www.ncbi.nlm.nih.gov/genome/seq/page.cgi?F=HsProgress.shtml&&ORG=Hs) to a Bermuda quality error rate of 1/10000 (Smith and Carrano, 1996). In addition, 1.11 billion bases representing 34.8% of the human genome have been sequenced to a rough-draft level. Efforts such as UCSC's GoldenPath (Kent and Haussler, 2001) and NCBI's contig assembly (Jang et al., 1999) attempt to assemble the human genome by incorporating both finished and rough-draft sequence. The availability of the human genome data allows us to ask questions concerning the maintenance of specific regions of the human genome. We consider two hypotheses for the maintenance of high G+C regions: the presence of specific repetitive elements and compositional mutation biases. Our results rule out the possibility that the G+C content of repetitive elements determines the regions of high and low G+C content in the human genome. We determine that there is a compositional bias for mutation rates. However, these biases are not responsible for the maintenance of high G+C regions. In addition, we show that regions of the human genome under less selective pressure will mutate towards a higher A+T composition, regardless of the surrounding G+C composition. We also analyze sequence organization and show that previous studies of isochore regions (Bernardi, 1993) cannot be generalized within the human genome. In addition, we propose a method to assemble only those parts of the human genome that are finished into larger contigs. Analysis of the contigs can lead to the mining of meaningful biological data that can give insights into genetic variation and evolution. I suggest a method to aid in single nucleotide polymorphism (SNP) detection, which can help to determine differences within a population. I also discuss a dynamic-programming-based approach to sequence assembly validation and detection of large-scale polymorphisms within a population that is made possible through the availability of large human sequence contigs.

    The development and application of informatics-based systems for the analysis of the human transcriptome

    Philosophiae Doctor - PhD. Despite the fact that the sequence of the human genome is now complete, it has become clear that the elucidation of the transcriptome is more complicated than previously expected. There is mounting evidence for unexpected and previously underestimated phenomena such as alternative splicing in the transcriptome. As a result, the identification of novel transcripts arising from the genome continues. Furthermore, as the volume of transcript data grows, it is becoming increasingly difficult to integrate expression information that comes from different sources, is stored in disparate locations, and is described using differing terminologies. Determining the function of translated transcripts also remains a complex task. Information about the expression profile (the location and timing of transcript expression) provides evidence that can be used in understanding the role of the expressed transcript in the organ or tissue under study, or in the developmental pathways or disease phenotype observed. In this dissertation I present novel computational approaches with direct biological applications to two distinct but increasingly important areas of gene expression research. The first addresses the detection and characterisation of alternatively spliced transcripts. The second is the construction of a hierarchical controlled vocabulary for gene expression data and the annotation of expression libraries with controlled terms from the hierarchies. In the final chapter, the biological questions that can be approached and the discoveries that can be made using these systems are illustrated, with a view to demonstrating how the application of informatics can both enable and accelerate biological insight into the human transcriptome. South Afric