98 research outputs found

    Detection of frameshifts and improving genome annotation

    We developed a new program called GeneTack for ab initio frameshift detection in intronless protein-coding nucleotide sequences. The GeneTack program uses a hidden Markov model (HMM) of a genomic sequence with possibly frameshifted protein-coding regions. The Viterbi algorithm finds the maximum likelihood path that discriminates between true adjacent genes and a single gene with a frameshift. We tested GeneTack, as well as two earlier programs, FrameD and FSFind, on 17 prokaryotic genomes with frameshifts introduced randomly into known genes. We observed that the average frameshift prediction accuracy of GeneTack, in terms of (Sn+Sp)/2 values, was higher by a significant margin than the accuracy of the other two programs. GeneTack was used to screen 1,106 complete prokaryotic genomes, and 206,991 genes with frameshifts (fs-genes) were identified. Our goal was to determine if a frameshift transition was due to (i) a sequencing error, (ii) an indel mutation or (iii) a recoding event. We grouped 102,731 fs-genes into 19,430 clusters based on sequence similarity between their protein products (fs-proteins), conservation of predicted frameshift position, and its direction. While fs-genes in 2,810 clusters were classified as conserved pseudogenes and fs-genes in 1,200 clusters were classified as hypothetical pseudogenes, 5,632 fs-genes from 239 clusters possessing conserved motifs near frameshifts were predicted to be recoding candidates. Experiments were performed for sequences derived from 20 out of the 239 clusters; programmed ribosomal frameshifting with efficiency higher than 10% was observed for four clusters. GeneTack was also applied to 1,165,799 mRNAs from 100 eukaryotic species, and 45,295 frameshifts were identified. A clustering approach similar to the one used for prokaryotic fs-genes allowed us to group 12,103 fs-genes into 4,087 clusters. Known programmed frameshift genes were among the obtained clusters. 
Several clusters may correspond to new examples of dual coding genes. We developed a web interface to browse a database containing all the fs-genes predicted by GeneTack in prokaryotic genomes and eukaryotic mRNA sequences. The fs-genes can be retrieved by similarity search to a given query sequence, by fs-gene cluster browsing, etc. Clusters of fs-genes are characterized with respect to their likely origin, such as pseudogenization, phase variation, programmed frameshifts, etc. All the tools and the database of fs-genes are available at the GeneTack web site: http://topaz.gatech.edu/GeneTack/
PhD. Committee Chair: Borodovsky, Mark; Committee Member: Baranov, Pavel; Committee Member: Hammer, Brian; Committee Member: Jordan, King; Committee Member: Konstantinidis, Kostas; Committee Member: Song, L
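The GeneTack abstract relies on Viterbi decoding of an HMM. As a rough illustration of that decoding step only, here is a minimal log-space Viterbi sketch in Python. The two-state model (`C` for coding, `N` for noncoding), the two-symbol alphabet (`S` for G/C, `W` for A/T) and all probabilities are hypothetical toy values chosen for this example; they are not GeneTack's states, topology or parameters.

```python
import math

def viterbi(obs, states, start_p, trans_p, emit_p):
    """Return the most likely state path for obs under a discrete HMM."""
    # Scores are kept as log-probabilities to avoid numerical underflow.
    scores = {s: math.log(start_p[s]) + math.log(emit_p[s][obs[0]]) for s in states}
    backpointers = []
    for symbol in obs[1:]:
        new_scores, pointers = {}, {}
        for s in states:
            # Best predecessor state for s at this position.
            prev = max(states, key=lambda p: scores[p] + math.log(trans_p[p][s]))
            new_scores[s] = (scores[prev] + math.log(trans_p[prev][s])
                             + math.log(emit_p[s][symbol]))
            pointers[s] = prev
        scores = new_scores
        backpointers.append(pointers)
    # Trace back from the best final state.
    path = [max(states, key=lambda s: scores[s])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    return path[::-1]

# Hypothetical toy model: "C" prefers G/C-rich symbols, "N" prefers A/T-rich ones.
STATES = ["C", "N"]
START = {"C": 0.5, "N": 0.5}
TRANS = {"C": {"C": 0.9, "N": 0.1}, "N": {"C": 0.1, "N": 0.9}}
EMIT = {"C": {"S": 0.8, "W": 0.2}, "N": {"S": 0.2, "W": 0.8}}

print("".join(viterbi("SSSWWW", STATES, START, TRANS, EMIT)))  # CCCNNN
```

In a frameshift detector, the hidden states would instead encode reading frames, so a switch between frame states along the best path marks a candidate frameshift; that design is only implied by the abstract and is not reproduced here.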

    The Structure and Evolution of Non-canonical Coiled coils

    Coiled coils are ubiquitous protein structural elements which support a wide range of biological functions. They can serve as molecular spacers, oligomerization motifs, mechanical levers in membrane fusion, and components of the cytoskeleton, and can facilitate ion transport and signal transduction. Canonical coiled coils are regular, left-handed supercoiled bundles of two or more α-helices, with a characteristic heptad repeat pattern. However, other periodicities engendering different supercoils are possible. Insertion of two (nonads) or six (hexads) residues in a heptad repeat locally breaks the helices into short β-strands which assemble as a triangular structural element we call the β-layer. In the first project, we structurally characterized two hexad repeat families. Repetitive nonads and hexads yield a new structure, the α/β coiled coil, with regularly alternating α- and β-segments. Conversion of hexads to heptads by insertion of one residue per repeat gives a canonical coiled coil. Our results support previous data that novel backbone structures are possible within the allowed regions of Ramachandran space with minor mutations to a known fold. Secondly, we characterized the human paralogs MCUR1 and CCDC90B of a novel membrane protein family conserved in prokaryotes and mitochondria. The proteins were found to exhibit a conserved head-neck-stalk-anchor architecture, where a membrane-anchored trimeric coiled-coil stalk projects the N-terminal helical head domain via a β-layer neck. Cellular localization studies showed that prokaryotic and eukaryotic proteins localize to the cytoplasmic and inner mitochondrial membranes, respectively, with an N-in C-out topology. Using MCUR1, an essential regulator of Ca(2+) uptake through the mitochondrial calcium uniporter (MCU), we studied the role of individual domains and found that the conserved head interacts directly with MCU. 
Ca(2+) binding destabilizes the MCUR1 head domain, which then accelerates its conversion to β-amyloid fibrils. Finally, we studied the effect of frameshift resistant (FSR) repeat amplification on the structure and function of existing and novel proteins. This type of repetition, comprising units whose length n in base pairs is not divisible by 3 (n∤3) and which lack stop codons, encodes the same protein repeat of n residues in all three frames. We focused primarily on heptad FSR repeats, which conform to coiled-coil periodicity and are significantly enriched in bacteria. Using the cyanobacterium Microcystis aeruginosa, we investigated the in vivo expression of FSR repeat ORFs with proteome and transcriptome analysis and found that a number of them are highly transcribed, but undetectable at the protein level. Through biophysical and biochemical methods, we showed that FSR repeat insertion products are initially unstructured and mostly non-functional; however, they can obtain beneficial mutations over evolutionary time-scales to become more structured, giving rise to novel cellular functions.
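The frameshift-resistance property described above can be checked directly: when a unit of n base pairs (n not divisible by 3) is repeated in tandem, the codon pattern repeats every 3n nucleotides, so each frame yields the same n-residue repeat up to a cyclic rotation. The sketch below illustrates this with the standard genetic code. The 7-bp unit `GCAGCTC` is a made-up example chosen because it contains no stop codon in any frame; it is not taken from the thesis.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}

def translate(seq, frame=0):
    """Translate seq starting at offset `frame`, ignoring a trailing partial codon."""
    codons = (seq[i:i + 3] for i in range(frame, len(seq) - 2, 3))
    return "".join(CODON_TABLE[c] for c in codons)

def frame_repeats(unit, copies=6):
    """First len(unit) residues encoded by a tandem repeat in each of the three frames."""
    seq = unit * copies
    return [translate(seq, frame)[:len(unit)] for frame in range(3)]

def is_rotation(a, b):
    """True if b is a cyclic rotation of a."""
    return len(a) == len(b) and b in a + a

unit = "GCAGCTC"  # hypothetical 7-bp unit, 7 is not divisible by 3
repeats = frame_repeats(unit)
print(repeats)  # ['AARSSQL', 'QLAARSS', 'SSQLAAR']
print(all(is_rotation(repeats[0], r) for r in repeats))  # True
```

Because 3 and 7 are coprime, a shift of one nucleotide is equivalent to a shift of five codons within the 21-nucleotide period, which is why frame +1 reads as the frame 0 heptad rotated by five residues.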

    Development and Application of Next-Generation Sequencing Methods to Profile Cellular Translational Dynamics

    The transmission of genetic information from the transcription of DNA to RNA and the subsequent translation of RNA into protein is often abstracted into a linear process. However, as methods and technologies to measure the genomic, transcriptomic, and proteomic content of cells have advanced, so too has our understanding that the transmission of genetic information does not always flow in a lossless manner. For instance, changes observed in messenger RNA (mRNA) abundance are not always retained at the proteomic level. Indeed, a diverse array of mechanisms have been identified that exert regulatory control over this transmission of information. Next-generation short read sequencing has driven many of these insights and provided increasingly nuanced understanding of these regulatory mechanisms. However, the continued development and application of sequencing methodologies and analytics are required to properly contextualize many of these insights on a more global scale. Ribosome profiling is one such recent advancement which enriches for ribosome-protected fragments of mRNA; sequencing and analysis of these ribosome-protected mRNA fragments enables profiling of the translational content of a sample. The aim of this dissertation is to address the need for the development and application of statistical and analytical algorithms to profile the regulatory factors that contribute to the translational dynamics in cells. In the first chapter, I survey the development and application of next-generation sequencing methods for the profiling and computational analysis of translation and translational dynamics. In the second chapter of this thesis, I present SPECtre, a software package that identifies regions of active translation through measurement of the translational engagement of ribosomes over a transcript. 
SPECtre achieves high sensitivity and specificity in its classification of regions undergoing translation by leveraging the codon-dependent elongation of peptides; this tri-nucleotide periodicity is evident in the alignment of ribosome profiling sequence reads to a reference transcriptome. SPECtre classifies actively translated transcripts according to their coherence in read coverage over a region to an optimal tri-nucleotide signal. In the third chapter, I describe the application of SPECtre to identify the translation of upstream-initiated open-reading frames that may regulate differentiation in a neuron-like cell model. uORFs are transcripts that result from the initiation of translation from AUG, and under certain biological constraints, from non-AUG sequences localized in the 5’ untranslated regions of annotated protein-coding genes. Subsets of these uORFs have been implicated in the regulation of their downstream protein-coding genes in yeast, mice and humans. In this chapter, I provide further evidence for this regulation as well as the spatial context for the functional consequences of uORF translation on downstream protein-coding genes in a neuron-like cell line model of differentiation. Finally, in the fourth chapter, I outline a strategy using our coherence-based translational scoring algorithm to profile ribosomal engagement over chimeric gene fusion breakpoints in prostate cancer. Here, known breakpoints from current annotation databases are integrated with novel junctions nominated by existing whole genome and transcriptomic gene fusion detection algorithms, and the translational profile over these chimeric junctions using SPECtre is measured. This provides an additional layer of translational evidence to known and novel gene fusion breakpoints in prostate cancer. 
Ongoing development of a database and visualization platform based on these results will enable integrative insights into the transcriptional and translational topology of these breakpoints.
PhD, Bioinformatics, University of Michigan, Horace H. Rackham School of Graduate Studies
https://deepblue.lib.umich.edu/bitstream/2027.42/144106/1/stonyc_1.pd
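The SPECtre abstract scores a region by how closely its read coverage follows an ideal tri-nucleotide signal. The sketch below shows that idea in its simplest form. It substitutes a plain Pearson correlation against an idealized 1-0-0 template for the coherence measure the abstract describes, so it should be read as an illustration of tri-nucleotide periodicity scoring, not as SPECtre's algorithm. The function names and the example coverage vectors are hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length sequences; 0.0 if either is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)

def periodicity_score(coverage):
    """Best correlation between per-nucleotide coverage and a 1-0-0 template.

    All three frame offsets are tried, so the score does not depend on
    knowing the reading frame in advance.
    """
    best = 0.0
    for frame in range(3):
        template = [1.0 if (i - frame) % 3 == 0 else 0.0
                    for i in range(len(coverage))]
        best = max(best, pearson(coverage, template))
    return best

# Hypothetical coverage vectors (read counts per nucleotide position).
translated = [5, 0, 0] * 10          # perfect 3-nt periodicity
flat = [3] * 30                      # uniform coverage, no periodicity
print(round(periodicity_score(translated), 6))  # 1.0
print(periodicity_score(flat))                  # 0.0
```

A spectral measure would typically be more robust to noisy, sparse coverage than this correlation, at the cost of requiring windowing choices that are not specified in the abstract.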

    Studies on cell division and shape in Escherichia coli

    Structure determination of membrane proteins by electron crystallography

    A fundamental principle of life is the separation of environments into different compartments. Prokaryotes shield their interior from the environment by a plasma membrane and in some cases also by a cell wall. Eukaryotes refine this compartmentalization by building different organelles for different parts of the cell metabolism. Nevertheless, these different compartments are dependent on each other and are interconnected by membrane proteins that transport specific nutrients, hormones, ions, water and waste products across the membrane and facilitate signal transmission between different compartments. Understanding the structure and function of membrane proteins can therefore allow an enormous insight into the regulation of different metabolic pathways. The electron microscope (EM) proved itself a great tool for studying membrane proteins, offering the unique opportunity to image membrane proteins within a lipid bilayer as close to the natural conditions as possible. Processing of images acquired by an electron microscope poses a challenging task for both scientist and processing hardware. Newly developed and optimized algorithms are needed to improve the image processing to a level that allows atomic resolution to be achieved regularly. Membrane proteins pose a difficult challenge for a structural biologist. To crystallize membrane proteins into well ordered two dimensional (2D) or three dimensional (3D) crystals is one of the most important prerequisites for structural analysis at the atomic level, yet membrane proteins are notoriously difficult to crystallize. One exception may be bacteriorhodopsin, which forms near-perfect crystals already in its native membrane. This may explain the fact that the first 2D electron crystallographic structure determined at 7 Å resolution by Henderson and Unwin[20][43] in 1975 was the structure of bacteriorhodopsin. 
In 1990 the structure of bacteriorhodopsin (Br) was determined to atomic resolution by Henderson et al.[19], being the first atomic structure of a membrane protein. The structure determination of Br was also the starting point for the MRC program suite, which is widely used at the moment in the, albeit small, 2D electron crystallography community. Using the MRC software, Kühlbrandt et al.[26] solved the structure of the light-harvesting chlorophyll a/b-protein complex in 1994. For recording the images they used the spot scan technique developed by Downing in 1991[9]. The first aquaporin water channel determined was aquaporin 1, resolved by Walz et al. in 1997[45] at 6 Å resolution, and subsequently solved to atomic resolution by Murata et al. in 2000[29]. Recently, several more aquaporin structures were determined by 2D electron crystallographic methods: aquaporin-0 (AQP0) by Gonen et al. in 2004[14] at 3 Å and in 2005[13] at 1.9 Å, and aquaporin-4 (AQP4) by Hiroaki et al. in 2006[22]. Interestingly, AQP4 shows exactly the same monomer arrangement as SoPIP2;1. The recent publications show that the trend goes from recording solely images to recording diffraction data in combination with images, or even to recording diffraction data exclusively and then using methods developed for X-ray crystallography to obtain the phase information. Given that the software available for processing 2D electron diffraction patterns is less mature than that for processing images, and given the increasing use of diffraction patterns, it makes sense to focus on implementing new and improved programs for 2D electron diffraction processing. In this work I present the advances I achieved in the structural determination of aquaporin 2, as well as my contribution to other projects, in particular the structural investigations of SoPIP2;1 and KdgM. 
I will also explain the modified sample preparation methods which made data recording at high tilt angles more reliable and achieved an improvement in resolution of the measured data. A second, equally important and detailed part of my thesis is the work invested in improving and extending the image processing to a point where a user, not adept in programming in several languages, can use it and produce good results. For this I improved the functionality and performance at several points, including a strong emphasis on user friendliness and ease of maintenance

    Development of Chimeric Cas9 Nucleases for Accurate and Flexible Genome Editing

    A tremendous amount of effort has been focused on the development and improvement of genome editing applications over the decades. In particular, the development of programmable nucleases has revolutionized genome editing through improvements in mutagenesis efficacy and targeting feasibility. Programmable nucleases are suited to a variety of genome editing applications. There is growing interest in employing programmable nucleases in therapeutic genome editing applications, such as correcting mutations in genetic disorders. Type II CRISPR-Cas9 bacterial adaptive immunity systems have recently been engineered as RNA-guided programmable nucleases. Native CRISPR-Cas9 nucleases have two stages of sequence-specific target DNA recognition prior to cleavage: the intrinsic binding of the Cas9 nuclease to a short DNA element (the PAM), followed by testing target site complementarity with the programmable guide RNA. The ease of reprogramming CRISPR-Cas9 nucleases for new target sequences makes them a favorable genome editing platform for many applications, including gene therapy. However, wild-type Cas9 nucleases have limitations: (i) the PAM element requirement restricts the targeting range of Cas9; (ii) despite the presence of two stages of target recognition, wild-type Cas9 can cleave DNA at unintended sites, which is not desired for therapeutic purposes; and (iii) there is a lack of control over the mutagenic editing product that is produced. In this study, we developed and characterized chimeric Cas9 platforms to provide solutions to these limitations. In these platforms, the DNA-binding affinity of the Cas9 protein from S. pyogenes is attenuated such that target site binding depends on a fused programmable DNA-targeting unit that recognizes a neighboring DNA sequence. This modification extends the range of usable PAM elements and substantially improves the targeting specificity of wild-type Cas9. 
Furthermore, one of the featured chimeric Cas9 variants developed in this study has both robust nuclease activity and ability to generate predictable uniform editing products. These superior properties of the chimeric Cas9 platforms make them favorable for various genome editing applications and bring programmable nucleases one step closer to therapeutic applications

    Evolutionary genomics: statistical and computational methods

    This open access book addresses the challenge of analyzing and understanding the evolutionary dynamics of complex biological systems at the genomic level, and elaborates on some promising strategies that would bring us closer to uncovering the vital relationships between genotype and phenotype. After a few educational primers, the book continues with sections on sequence homology and alignment, phylogenetic methods to study genome evolution, methodologies for evaluating selective pressures on genomic sequences as well as genomic evolution in light of protein domain architecture and transposable elements, population genomics and other omics, and discussions of current bottlenecks in handling and analyzing genomic data. Written for the highly successful Methods in Molecular Biology series, chapters include the kind of detail and expert implementation advice that lead to the best results. Authoritative and comprehensive, Evolutionary Genomics: Statistical and Computational Methods, Second Edition aims to serve both novices in biology with strong statistics and computational skills, and molecular biologists with a good grasp of standard mathematical concepts, in moving this important field of study forward.

    Towards the understanding of transcriptional and translational regulatory complexity

    Because every cell carries the same genome, the observed phenotypic diversity can only arise from highly regulated mechanisms beyond the encoded DNA sequence. We investigated several mechanisms of protein biosynthesis and analyzed DNA methylation patterns, alternative translation sites, and genomic mutations. As chromatin states are determined by epigenetic modifications and nucleosome occupancy, we conducted a structural superimposition approach between DNA methyltransferase 1 (DNMT1) and the nucleosome, which suggests that DNA methylation is dependent on accessibility of DNMT1 to nucleosome-bound DNA. Considering translation, alternative non-AUG translation initiation was observed. We developed reliable prediction models to detect these alternative start sites in a given mRNA sequence. Our tool PreTIS provides initiation confidences for all frame-independent non-cognate and AUG starts. Beyond these innate factors, specific sequence variations can additionally affect a phenotype. We conducted a genome-wide analysis with millions of mutations and found an accumulation of SNPs next to transcription starts that could relate to a gene-specific regulatory signal. We also report similar conservation of canonical and alternative translation sites, highlighting the relevance of alternative mechanisms. Finally, our tool MutaNET automates variation analysis by scoring the impact of individual mutations on cell function while also integrating a gene regulatory network.

    Directionality of DNA mismatch repair in Escherichia coli

    Non-canonical base pairs that escape the proof-reading activity of the DNA polymerase emerge from DNA replication as DNA mismatches. To promote genomic integrity, these DNA mismatches are corrected by a secondary protection system, called DNA mismatch repair (MMR). Understanding the details of MMR is important for human health as defects in mismatch repair can result in cancer (e.g. hereditary nonpolyposis colorectal cancer, also known as Lynch syndrome). Being normally stochastic in nature, mismatches can emerge at random locations in a chromosome. Therefore, using a molecular tool to generate substrates for the MMR system at a defined locus has been particularly useful in my study of DNA mismatch repair in vivo. In this study, I have used a CTG•CAG repeat array, also called the “TNR array”, to generate frequent substrates for the MMR system in Escherichia coli. In E. coli, the MMR system searches for hemimethylated GATC motifs around a mismatch to initiate removal of the faulty nascent (un-methylated) strand. Analysing the usage of GATC motifs around the TNR array, I have found that the MMR system preferentially utilizes the GATC motifs on the origin distal side of the TNR array demonstrating that the bidirectionality of MMR in vitro is constrained in live cells. My results suggest that in vivo MMR operates by searching for the nearest hemimethylated GATC site located between the mismatch and the replication fork and excision of the nascent strand occurs directionally away from the fork towards the mismatch. Previous in vitro studies have established that the excision reaction during MMR terminates at a discrete point about 100 bp beyond a mismatch. However, in vivo recombination at a 275 bp tandem repeat, which has been proposed to be mediated by single stranded DNA generated during the excision reaction, has suggested that the end point of the excision reaction in live cells may extend much further from the mismatch than this. 
I have used this assay for extended excision to determine the influence of GATC sites on excision tracts. In this study, modification of the GATC motifs on the origin proximal side of the TNR has shown that the excision reaction does not stop at a GATC motif on the origin proximal side of the mismatch. In addition, sequential modifications of GATC motifs on the origin distal side of the TNR array, thereby shifting the start point of the excision reaction to a greater distance, have suggested that the length of an excision tract is a function of the distance it covers from the start point rather than from a mismatch. My observation of directionality with respect to DNA replication in the recognition of GATC sites suggested that MMR and DNA replication might be coupled in some way and that perhaps active (or blocked) MMR might impede the progress of the replication fork. However, no replication intermediates were detected using two-dimensional agarose gel electrophoresis of a genomic DNA fragment containing the TNR array upon restriction digestion. I was therefore unable to support the hypothesis that active or blocked MMR led to a slowing down of DNA replication. Given my observation of a decrease in MMR by separating the mismatch from the closest origin distal GATC site, I set out to test whether MMR caused any selection pressure for the genomic distribution of GATC motifs. To do this, I generated artificial model genomes using a Markovian algorithm based on the nucleotide composition and codon usage in E. coli. Strikingly, the comparison of the distribution of GATC motifs in the E. coli genome with those from artificial sequences has shown that GATC motifs are distributed randomly in the E. coli genome, except for a small clustering effect which has been detected for closely spaced (0-40 base pairs) GATC motifs. The observed distribution of slightly over-represented GATC motifs in the E. coli genome appears to be a function of the total number of GATC motifs, and it seems that the DNA mismatch repair system has evolved to utilize the natural distribution of GATC motifs to maintain genomic integrity.
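The comparison of GATC spacing between a real genome and Markov-generated artificial genomes can be sketched as follows. This is a simplified illustration: it uses a first-order nucleotide Markov chain only, whereas the thesis states that its model was based on nucleotide composition and codon usage. The training string, the function names and the random seed are hypothetical placeholders.

```python
import random
from collections import Counter, defaultdict

def train_first_order(seq):
    """Count nucleotide-to-nucleotide transitions in a training sequence."""
    counts = defaultdict(Counter)
    for a, b in zip(seq, seq[1:]):
        counts[a][b] += 1
    return counts

def generate(counts, length, rng):
    """Sample an artificial sequence from the first-order transition counts."""
    current = rng.choice(sorted(counts))
    out = [current]
    while len(out) < length:
        successors = counts.get(current)
        if not successors:
            # Dead end in the training data: restart from a random known base.
            current = rng.choice(sorted(counts))
        else:
            bases = sorted(successors)
            weights = [successors[b] for b in bases]
            current = rng.choices(bases, weights=weights)[0]
        out.append(current)
    return "".join(out[:length])

def motif_positions(seq, motif="GATC"):
    """Start positions of all (possibly overlapping) motif occurrences."""
    positions, start = [], seq.find(motif)
    while start != -1:
        positions.append(start)
        start = seq.find(motif, start + 1)
    return positions

def spacings(positions):
    """Distances between consecutive motif start positions."""
    return [b - a for a, b in zip(positions, positions[1:])]

# Hypothetical short training sequence standing in for a real genome.
training = "GATCGGATTCAGCTAGATCCGATCAAGGCTTAGATCGA"
model = train_first_order(training)
artificial = generate(model, 10_000, random.Random(0))

observed = spacings(motif_positions(training))
simulated = spacings(motif_positions(artificial))
print(len(observed), len(simulated))
```

Comparing the two spacing distributions, for example the fraction of spacings in the 0-40 bp range, against many artificial genomes would indicate whether short-range clustering exceeds what the background model predicts.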