A large-scale analysis of mRNA polyadenylation of human and mouse genes
mRNA polyadenylation is a critical cellular process in eukaryotes. It involves 3′ end cleavage of nascent mRNAs and addition of the poly(A) tail, which plays important roles in many aspects of the cellular metabolism of mRNA. The process is controlled by various cis-acting elements surrounding the cleavage site, and their binding factors. In this study, we surveyed genome regions containing cleavage sites [herein called poly(A) sites], for 13 942 human and 11 155 mouse genes. We found that a great proportion of human and mouse genes have alternative polyadenylation (∼54 and 32%, respectively). The conservation of alternative polyadenylation type or polyadenylation configuration between human and mouse orthologs is statistically significant, indicating that alternative polyadenylation is widely employed by these two species to produce alternative gene transcripts. Genes belonging to several functional groups, indicated by their Gene Ontology annotations, are biased with respect to polyadenylation configuration. Many poly(A) sites harbor multiple cleavage sites (51.25% human and 46.97% mouse sites), leading to heterogeneous 3′ end formation for transcripts. This implies that the cleavage process of polyadenylation is largely imprecise. Different types of poly(A) sites, with regard to their relative locations in a gene, are found to have distinct nucleotide composition in surrounding genomic regions. This large-scale study provides important insights into the mechanism of polyadenylation in mammalian species and represents a genomic view of the regulation of gene expression by alternative polyadenylation
A cause for consilience: Utilizing multiple genomic data types to resolve problematic nodes within Arthropoda and Ecdysozoa
A major turning point in the study of metazoan evolution was the recognition of the
existence of the Ecdysozoa in 1997. This is a group of eight animal phyla (Nematoda,
Nematomorpha, Loricifera, Kinorhyncha, Priapulida, Tardigrada, Onychophora and
Arthropoda). Ecdysozoa is the most speciose clade of animals to ever exist and the
relationships among its eight phyla are still heatedly debated. Similarly, the
relationships among the three sub-phyla (Chelicerata, Pancrustacea and Myriapoda)
within the most important ecdysozoan phylum (the Arthropoda) are still debated.
Indeed, the two major problems in ecdysozoan phylogeny concern the relationships of
Myriapoda within Arthropoda, and of Tardigrada within Ecdysozoa. Difficulties in
resolving ecdysozoan relationships reside in lineages characterized by rapid, deep
divergences and subsequent long periods of divergent evolution. Phylogenetic signal to resolve
the relationships of these lineages is diluted, increasing the likelihood of recovery of
phylogenetic artifacts.
In an attempt to resolve the relationships within Ecdysozoa, consilience of three
independent phylogenetic data sets was investigated. EST, rRNA and microRNA
(miRNA) data were sampled across all major ecdysozoan phyla. In particular, a
major contribution of this thesis is the first-time sequencing of miRNAs for all the
panarthropod phyla. MicroRNAs are genome regulatory elements that recently
emerged as a source of useful phylogenetic data (Sempere et al. 2006) because of
their low homoplasy levels.
The data sets were analysed using phylogenetic methods and models chosen to
minimize the occurrence of phylogenetic reconstruction artifacts, in order to
understand the evolution of Ecdysozoa. Analyses of independent data types recovered
well supported and corroborating evidence for the monophyly of Panarthropoda
(Arthropoda, Onychophora and Tardigrada), a sister-group relationship between
Myriapoda and Pancrustacea within Arthropoda, and the paraphyly of Cycloneuralia
(Nematoda, Nematomorpha, Loricifera, Kinorhyncha and Priapulida).
The Presence and Distribution of Crotoxin in the Rock Rattlesnake (Crotalus lepidus)
Crotoxin and its homologs (hereafter all referred to as CTx) are highly lethal heterodimeric beta-neurotoxins found in pitvipers (Crotalinae) and are the main drivers of neurotoxic venom phenotypes (Type II). In contrast, hemorrhagic venom phenotypes (Type I) are characterized by high snake venom metalloproteinase expression and low toxicity. Although many rattlesnake species have been classified as either Type I or Type II, population-level variation in venom phenotype has also been documented in several species. The presence or absence of CTx is the main component of this variation in venom phenotype and has been most widely studied in large-bodied lowland rattlesnakes (Crotalus scutulatus, C. helleri, and C. horridus). While CTx has been suspected to occur in C. lepidus, a small-bodied montane rattlesnake, there has been no genetic confirmation. We used genomics and transcriptomics to test for the presence, distribution, and evolution of CTx in C. lepidus. We confirmed the presence and expression of CTx in C. lepidus with both genomic and transcriptomic data and found it in 17 out of 104 samples across the species' range. CTx presence was not significantly associated with longitude, latitude, subspecies, or elevation. However, we did identify several climatic variables associated with CTx presence, including ones identified in previous studies of CTx expression. These findings provide insights into the phylogenetic distribution of CTx across rattlesnakes and the variation in crotoxin expression, and highlight environments to which CTx may be locally adapted. Our results likely support previous hypotheses of an ancestral origin for crotoxin followed by independent sorting in lineages; therefore, future studies should focus on testing for the presence of CTx in other species of montane rattlesnakes.
Modeling and comparing gene structure (Modélisation et comparaison de la structure de gènes)
Bioinformatics is a multidisciplinary research field at the crossroads of several disciplines: biology, medicine, mathematics, statistics, chemistry, physics and computer science. It aims to design and apply statistical and computational models and tools to advance knowledge in biology and related sciences.
In this context, understanding how genes function and evolve is the subject of many studies in bioinformatics. These studies are mostly based on gene comparison, and in particular on the alignment of genomic sequences. However, when computing genomic sequence alignments, existing methods rely solely on sequence similarity and do not take gene structure into account. Alignment that takes sequence structure into account offers the opportunity to improve alignment accuracy as well as the results of methods built on these alignments.
This hypothesis frames the objective of this doctoral thesis: to propose models that take gene structure into account when aligning the sequences of gene families. Through this thesis, we have contributed to scientific knowledge by developing alignment models for biological sequences that integrate information on the coding and splicing structure of the sequences. We proposed an algorithm and a new scoring function for aligning coding DNA sequences (CDS) that account for the length of translation frameshifts. We also proposed an algorithm for aligning pairs of sequences from a gene family while considering their splicing structures. We further developed an algorithm for assembling pairwise spliced alignments into multiple sequence alignments. Finally, we developed a tool for visualizing multiple spliced alignments of gene families. In this thesis, we have highlighted the importance and demonstrated the usefulness of taking the structure of the input sequences into account when computing their alignment.
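As a rough illustration of why frameshift length matters when aligning coding sequences, the sketch below flags gap runs in one row of a pairwise CDS alignment whose length is not a multiple of three, since such gaps shift the translation reading frame. It is not the algorithm or scoring function of the thesis; the function name `frameshift_gaps` and the use of `-` as the gap character are assumptions made for this example.

```python
import re
from typing import List, Tuple


def frameshift_gaps(aligned: str) -> List[Tuple[int, int]]:
    """Return (start, length) for each gap run that would shift the reading frame.

    `aligned` is one row of a pairwise alignment, with '-' marking gaps
    (an assumption of this sketch). A gap run whose length is a multiple
    of three removes whole codons and keeps the frame; any other length
    shifts it.
    """
    return [
        (match.start(), len(match.group()))
        for match in re.finditer(r"-+", aligned)
        if len(match.group()) % 3 != 0
    ]


if __name__ == "__main__":
    # A 3-base gap keeps the frame; 1- and 2-base gaps do not.
    print(frameshift_gaps("ATG---CCC"))
    print(frameshift_gaps("ATG-CCC--A"))
```

A frameshift-aware scoring function could, for example, penalize the gaps returned here differently from codon-sized gaps; how the thesis weights them is not stated in the abstract.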
Advances in faba bean genetics and genomics
Vicia faba L. is a globally important grain legume whose main centers of diversity are the Fertile Crescent and Mediterranean basin. Because of its small number (six) of exceptionally large and easily observed chromosomes, it became a model species for plant cytogenetics in the 1970s and 1980s. It is somewhat ironic, therefore, that the emergence of more genomically tractable model plant species such as Arabidopsis and Medicago coincided with a marked decline in genome research on the formerly favored plant cytogenetic model. Thus, as ever higher density molecular marker coverage and dense genetic and even complete genome sequence maps of key crop and model species emerged through the 1990s and early 2000s, genetic and genome knowledge of Vicia faba lagged far behind other grain legumes such as soybean, common bean and pea. However, cheap sequencing technologies have stimulated the production of deep transcriptome coverage from several tissue types and numerous distinct cultivars in recent years. This has permitted the reconstruction of the faba bean meta-transcriptome and has fueled development of extensive sets of Simple Sequence Repeat and Single Nucleotide Polymorphism (SNP) markers. Genetics of faba bean stretches back to the 1930s, but it was not until 1993 that DNA markers were used to construct genetic maps. A series of Random Amplified Polymorphic DNA-based genetic studies, mainly targeted at quantitative loci underlying resistance to a series of biotic and abiotic stresses, were conducted during the 1990s and early 2000s. More recently, SNP-based genetic maps have permitted chromosome intervals of interest to be aligned to collinear segments of sequenced legume genomes such as the model legume Medicago truncatula, which in turn opens up the possibility for hypotheses on gene content, order and function to be translated from model to crop. Some examples of where knowledge of gene content and function have already been productively exploited are discussed.
The bottleneck in associating genes and their functions has therefore moved from locating gene candidates to validating their function, and the last part of this review covers mutagenesis and genetic transformation, two complementary routes to validating gene function and unlocking novel trait variation for the improvement of this important grain legume.
Assembly and Compositional Analysis of Human Genomic DNA - Doctoral Dissertation, August 2002
In 1990, the United States Human Genome Project was initiated as a fifteen-year endeavor to sequence the approximately three billion bases making up the human genome (Vaughan, 1996). As of December 31, 2001, the public sequencing efforts have sequenced a total of 2.01 billion finished bases representing 63.0% of the human genome (http://www.ncbi.nlm.nih.gov/genome/seq/page.cgi?F=HsProgress.shtml&&ORG=Hs) to a Bermuda quality error rate of 1/10000 (Smith and Carrano, 1996). In addition, 1.11 billion bases representing 34.8% of the human genome have been sequenced to a rough-draft level. Efforts such as UCSC's GoldenPath (Kent and Haussler, 2001) and NCBI's contig assembly (Jang et al., 1999) attempt to assemble the human genome by incorporating both finished and rough-draft sequence. The availability of the human genome data allows us to ask questions concerning the maintenance of specific regions of the human genome. We consider two hypotheses for maintenance of high G+C regions: the presence of specific repetitive elements and compositional mutation biases. Our results rule out the possibility that the G+C content of repetitive elements determines regions of high and low G+C content in the human genome. We determine that there is a compositional bias for mutation rates. However, these biases are not responsible for the maintenance of high G+C regions. In addition, we show that regions of the human genome under less selective pressure will mutate towards a higher A+T composition, regardless of the surrounding G+C composition. We also analyze sequence organization and show that previous studies of isochore regions (Bernardi, 1993) cannot be generalized within the human genome. In addition, we propose a method to assemble only those parts of the human genome that are finished into larger contigs. Analysis of the contigs can lead to the mining of meaningful biological data that can give insights into genetic variation and evolution.
I suggest a method to aid in single nucleotide polymorphism (SNP) detection, which can help to determine differences within a population. I also discuss a dynamic-programming-based approach to sequence assembly validation and detection of large-scale polymorphisms within a population that is made possible through the availability of large human sequence contigs.
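The compositional analysis described above rests on measuring G+C content along a sequence. A minimal sketch of that measurement follows, assuming fixed-size windows and ignoring ambiguous bases such as N. The function names, window size and step are illustrative choices, not taken from the dissertation.

```python
from typing import List


def gc_fraction(seq: str) -> float:
    """Fraction of G and C among unambiguous bases (A, C, G, T) in `seq`."""
    s = seq.upper()
    gc = s.count("G") + s.count("C")
    at = s.count("A") + s.count("T")
    total = gc + at
    # Sequences with no unambiguous bases are reported as 0.0 by convention here.
    return gc / total if total else 0.0


def windowed_gc(seq: str, window: int, step: int) -> List[float]:
    """G+C fraction of each full window of length `window`, advancing by `step`."""
    if window <= 0 or step <= 0:
        raise ValueError("window and step must be positive")
    return [
        gc_fraction(seq[i:i + window])
        for i in range(0, len(seq) - window + 1, step)
    ]


if __name__ == "__main__":
    # Two non-overlapping 4-base windows: one all G, one all A.
    print(windowed_gc("GGGGAAAA", window=4, step=4))
```

Comparing window values against a genome-wide average is one simple way to mark candidate high and low G+C regions; real analyses would use much larger windows than this toy input.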
The development and application of informatics-based systems for the analysis of the human transcriptome
Philosophiae Doctor - PhD
Despite the fact that the sequence of the human genome is now complete, it has become clear that the elucidation of the transcriptome is more complicated than previously expected. There is mounting evidence for unexpected and previously underestimated phenomena such as alternative splicing in the transcriptome. As a result, the identification of novel transcripts arising from the genome continues. Furthermore, as the volume of transcript data grows, it is becoming increasingly difficult to integrate expression information that comes from different sources, is stored in disparate locations, and is described using differing terminologies. Determining the function of translated transcripts also remains a complex task. Information about the expression profile, that is, the location and timing of transcript expression, provides evidence that can be used in understanding the role of the expressed transcript in the organ or tissue under study, or in the developmental pathways or disease phenotype observed. In this dissertation I present novel computational approaches with direct biological applications to two distinct but increasingly important areas of gene expression research. The first addresses detection and characterisation of alternatively spliced transcripts. The second is the construction of a hierarchical controlled vocabulary for gene expression data and the annotation of expression libraries with controlled terms from the hierarchies. In the final chapter, the biological questions that can be approached and the discoveries that can be made using these systems are illustrated, with a view to demonstrating how the application of informatics can both enable and accelerate biological insight into the human transcriptome.
South Afric