98 research outputs found
Detection of frameshifts and improving genome annotation
We developed a new program called GeneTack for ab initio frameshift detection in intronless protein-coding nucleotide sequences. The GeneTack program uses
a hidden Markov model (HMM) of a genomic sequence with possibly frameshifted
protein-coding regions. The Viterbi algorithm finds the maximum likelihood path
that discriminates between true adjacent genes and a single gene with a frameshift.
We tested GeneTack, as well as two earlier programs, FrameD and
FSFind, on 17 prokaryotic genomes with frameshifts introduced randomly into known
genes. We observed that the average frameshift prediction accuracy of GeneTack, in
terms of (Sn+Sp)/2 values, was higher by a significant margin than the accuracy of
the other two programs.
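The (Sn+Sp)/2 measure used in this comparison can be illustrated with a minimal sketch. The abstract does not define Sn and Sp; the code assumes the convention common in gene prediction, Sn = TP/(TP+FN) and Sp = TP/(TP+FP). The function name and the example counts are hypothetical.

```python
def frameshift_accuracy(tp: int, fp: int, fn: int) -> float:
    """Return (Sn + Sp) / 2 for a set of frameshift predictions.

    Assumed definitions (not stated in the abstract):
      Sn = TP / (TP + FN)   fraction of true frameshifts that were predicted
      Sp = TP / (TP + FP)   fraction of predictions that are true frameshifts
    """
    sn = tp / (tp + fn) if (tp + fn) else 0.0
    sp = tp / (tp + fp) if (tp + fp) else 0.0
    return (sn + sp) / 2


if __name__ == "__main__":
    # Hypothetical counts: 80 correct predictions, 20 false, 20 missed.
    print(frameshift_accuracy(tp=80, fp=20, fn=20))
```

Averaging the two values penalizes a program that gains sensitivity only by over-predicting frameshifts, or the reverse.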
GeneTack was used to screen 1,106 complete prokaryotic genomes, and 206,991
genes with frameshifts (fs-genes) were identified. Our goal was to determine if a
frameshift transition was due to (i) a sequencing error, (ii) an indel mutation or (iii)
a recoding event. We grouped 102,731 genes with frameshifts (fs-genes) into 19,430
clusters based on sequence similarity between their protein products (fs-proteins),
conservation of predicted frameshift position, and its direction. While fs-genes in
2,810 clusters were classified as conserved pseudogenes and fs-genes in 1,200 clusters
were classified as hypothetical pseudogenes, 5,632 fs-genes from 239 clusters possessing
conserved motifs near frameshifts were predicted to be recoding candidates.
Experiments were performed for sequences derived from 20 out of the 239 clusters;
programmed ribosomal frameshifting with efficiency higher than 10% was observed
for four clusters.
GeneTack was also applied to 1,165,799 mRNAs from 100 eukaryotic species, and 45,295 frameshifts were identified. A clustering approach similar to the one used for
prokaryotic fs-genes allowed us to group 12,103 fs-genes into 4,087 clusters. Known
programmed frameshift genes were among the obtained clusters. Several clusters may
correspond to new examples of dual coding genes.
We developed a web interface to browse a database containing all the fs-genes
predicted by GeneTack in prokaryotic genomes and eukaryotic mRNA sequences.
The fs-genes can be retrieved by similarity search to a given query sequence, by
fs-gene cluster browsing, etc. Clusters of fs-genes are characterized with respect to their
likely origin, such as pseudogenization, phase variation, programmed frameshifts, etc.
All the tools and the database of fs-genes are available at the GeneTack web site
http://topaz.gatech.edu/GeneTack/
PhD. Committee Chair: Borodovsky, Mark; Committee Member: Baranov, Pavel; Committee Member: Hammer, Brian; Committee Member: Jordan, King; Committee Member: Konstantinidis, Kostas; Committee Member: Song, L
The Structure and Evolution of Non-canonical Coiled coils
Coiled coils are ubiquitous protein structural elements which support a wide range of biological functions. They can serve as molecular spacers, oligomerization motifs, mechanical levers in membrane fusion, and components of the cytoskeleton, and they can facilitate ion transport and signal transduction. Canonical coiled coils are regular, left-handed supercoiled bundles of two or more α-helices, with a characteristic heptad repeat pattern. However, other periodicities engendering different supercoils are possible. Insertion of two (nonads) or six (hexads) residues in a heptad repeat locally breaks the helices into short β-strands which assemble as a triangular structural element we call the β-layer. In the first project, we structurally characterized two hexad repeat families. Repetitive nonads and hexads yield a new structure, the α/β coiled coil, with regularly alternating α- and β-segments. Conversion of hexads to heptads by insertion of one residue per repeat gives a canonical coiled coil. Our results support previous data that novel backbone structures are possible within the allowed regions of Ramachandran space with minor mutations to a known fold. Secondly, we characterized the human paralogs MCUR1 and CCDC90B of a novel membrane protein family conserved in prokaryotes and mitochondria. The proteins were found to exhibit a conserved head-neck-stalk-anchor architecture, where a membrane-anchored trimeric coiled-coil stalk projects the N-terminal helical head domain via a β-layer neck. Cellular localization studies showed that prokaryotic and eukaryotic proteins localize to the cytoplasmic and inner mitochondrial membranes, respectively, with an N-in C-out topology. Using MCUR1, an essential regulator of Ca(2+) uptake through the mitochondrial calcium uniporter (MCU), we studied the role of individual domains and found that the conserved head interacts directly with MCU.
Ca(2+) binding destabilizes the MCUR1 head domain, which accelerates its conversion to β-amyloid fibrils. Finally, we studied the effect of frameshift-resistant (FSR) repeat amplification on the structure and function of existing and novel proteins. This type of repetition, comprising units of n base pairs (with n not divisible by 3) and lacking stop codons, encodes the same protein repeat of n residues in all three frames. We focused primarily on heptad FSR repeats, which conform to coiled-coil periodicity and are significantly enriched in bacteria. Using the cyanobacterium Microcystis aeruginosa, we investigated the in vivo expression of FSR repeat ORFs with proteome and transcriptome analysis and found that a number of them are highly transcribed, but undetectable at the protein level. Through biophysical and biochemical methods, we showed that FSR repeat insertion products are initially unstructured and mostly non-functional; however, they can acquire beneficial mutations over evolutionary time-scales to become more structured, giving rise to novel cellular functions.
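The frameshift-resistance property described above can be checked with a short sketch: a tandem repeat whose unit length n is not divisible by 3 is translated in all three frames, and each frame yields the same n-residue repeat up to a rotation. The 7-nucleotide unit `GCAGAAC` is a made-up example chosen to contain no stop codons, and the helper names are hypothetical; the standard genetic code is assumed.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def translate(dna: str, frame: int) -> str:
    """Translate dna starting at offset frame (0, 1 or 2)."""
    codons = (dna[i:i + 3] for i in range(frame, len(dna) - 2, 3))
    return "".join(CODON_TABLE[c] for c in codons)


def frame_repeats(unit: str, copies: int = 9) -> list[str]:
    """Return the first n residues read in each frame of a tandem repeat."""
    n = len(unit)
    if n % 3 == 0:
        raise ValueError("unit length must not be divisible by 3")
    dna = unit * copies
    return [translate(dna, frame)[:n] for frame in range(3)]


def is_rotation(a: str, b: str) -> bool:
    """True if b is a cyclic rotation of a."""
    return len(a) == len(b) and b in a + a


if __name__ == "__main__":
    for frame, repeat in enumerate(frame_repeats("GCAGAAC")):
        print(frame, repeat)
```

The rotation arises because 3 and n share no common factor, so shifting the reading frame by one nucleotide is equivalent to starting a whole number of codons further along the same periodic sequence.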
Development and Application of Next-Generation Sequencing Methods to Profile Cellular Translational Dynamics
The transmission of genetic information from the transcription of DNA to RNA and the subsequent translation of RNA into protein is often abstracted into a linear process. However, as methods and technologies to measure the genomic, transcriptomic, and proteomic content of cells have advanced, so too has our understanding that the transmission of genetic information does not always flow in a lossless manner. For instance, changes observed in messenger RNA (mRNA) abundance are not always retained at the proteomic level. Indeed, a diverse array of mechanisms have been identified that exert regulatory control over this transmission of information. Next-generation short read sequencing has driven many of these insights and provided increasingly nuanced understanding of these regulatory mechanisms. However, the continued development and application of sequencing methodologies and analytics are required to properly contextualize many of these insights on a more global scale. Ribosome profiling is one such recent advancement which enriches for ribosome-protected fragments of mRNA; sequencing and analysis of these ribosome-protected mRNA fragments enables profiling of the translational content of a sample. The aim of this dissertation is to address the need for the development and application of statistical and analytical algorithms to profile the regulatory factors that contribute to the translational dynamics in cells.
In the first chapter, I survey the development and application of next-generation sequencing methods for the profiling and computational analysis of translation and translational dynamics. In the second chapter of this thesis, I present SPECtre, a software package that identifies regions of active translation through measurement of the translational engagement of ribosomes over a transcript. SPECtre achieves high sensitivity and specificity in its classification of regions undergoing translation by leveraging the codon-dependent elongation of peptides; this tri-nucleotide periodicity is evident in the alignment of ribosome profiling sequence reads to a reference transcriptome. SPECtre classifies actively translated transcripts according to their coherence in read coverage over a region to an optimal tri-nucleotide signal.
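The tri-nucleotide periodicity mentioned above can be illustrated with a simplified stand-in. This is not SPECtre's coherence calculation; it only measures what fraction of the variance in a per-nucleotide coverage vector sits at a period of three nucleotides, using a single discrete Fourier coefficient. The function name and the toy coverage vectors are hypothetical.

```python
import cmath


def period3_power_fraction(coverage: list[float]) -> float:
    """Fraction of (mean-removed) signal power at a 3-nucleotide period.

    Simplified illustration only; not the SPECtre algorithm.
    """
    n = len(coverage) - len(coverage) % 3  # trim to whole codons
    if n == 0:
        return 0.0
    xs = coverage[:n]
    mean = sum(xs) / n
    centered = [x - mean for x in xs]
    total = sum(c * c for c in centered)
    if total == 0:
        return 0.0  # flat coverage carries no periodic signal
    # DFT coefficient at a frequency of 1/3 cycles per nucleotide.
    coeff = sum(c * cmath.exp(-2j * cmath.pi * i / 3)
                for i, c in enumerate(centered))
    # Parseval: the DFT bins sum to n * total; a real signal has a
    # conjugate bin at frequency 2/3, hence the factor of 2.
    return 2 * abs(coeff) ** 2 / (n * total)


if __name__ == "__main__":
    in_frame = [10, 0, 0] * 20   # reads concentrated on one codon position
    print(period3_power_fraction(in_frame))
```

A value near 1 indicates coverage dominated by a codon-length rhythm, while a value near 0 indicates coverage with no such rhythm.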
In the third chapter, I describe the application of SPECtre to identify the translation of upstream-initiated open-reading frames that may regulate differentiation in a neuron-like cell model. uORFs are transcripts that result from the initiation of translation from AUG, and under certain biological constraints, from non-AUG sequences localized in the 5’ untranslated regions of annotated protein-coding genes. Subsets of these uORFs have been implicated in the regulation of their downstream protein-coding genes in yeast, mice and humans. In this chapter, I provide further evidence for this regulation as well as the spatial context for the functional consequences of uORF translation on downstream protein-coding genes in a neuron-like cell line model of differentiation.
Finally, in the fourth chapter, I outline a strategy using our coherence-based translational scoring algorithm to profile ribosomal engagement over chimeric gene fusion breakpoints in prostate cancer. Here, known breakpoints from current annotation databases are integrated with novel junctions nominated by existing whole genome and transcriptomic gene fusion detection algorithms, and the translational profile over these chimeric junctions using SPECtre is measured. This provides an additional layer of translational evidence to known and novel gene fusion breakpoints in prostate cancer. Ongoing development of a database and visualization platform based on these results will enable integrative insights into the transcriptional and translational topology of these breakpoints.
PhD, Bioinformatics, University of Michigan, Horace H. Rackham School of Graduate Studies
https://deepblue.lib.umich.edu/bitstream/2027.42/144106/1/stonyc_1.pd
Structure determination of membrane proteins by electron crystallography
A fundamental principle of life is the separation of environments into different compartments.
Prokaryotes shield their interior from the environment by a plasma membrane
and in some cases also by a cell wall. Eukaryotes refine this compartmentalization
by building different organelles for different parts of the cell metabolism. Nevertheless,
these different compartments are dependent on each other and are interconnected
by membrane proteins that transport specific nutrients, hormones, ions, water and
waste products across the membrane and facilitate signal transmission between different
compartments. Understanding the structure and function of membrane proteins
can therefore allow an enormous insight into the regulation of different metabolic pathways.
The electron microscope (EM) proved itself a great tool for studying membrane proteins,
offering the unique opportunity to image membrane proteins within a lipid bilayer
as close to the natural conditions as possible. Processing of images acquired by an electron
microscope poses a challenging task for both scientist and processing hardware.
Newly developed and optimized algorithms are needed to improve the image processing
to a level that allows atomic resolution to be achieved regularly.
Membrane proteins pose a difficult challenge for a structural biologist. To crystallize
membrane proteins into well ordered two dimensional (2D) or three dimensional (3D)
crystals is one of the most important prerequisites for structural analysis at the atomic
level, yet membrane proteins are notoriously difficult to crystallize.
One exception may be bacteriorhodopsin, which forms near-perfect crystals already
in its native membrane. This may explain the fact that the first 2D electron crystallographic
structure determined at 7 Å resolution by Henderson and Unwin[20][43] in
1975 was the structure of bacteriorhodopsin. In 1990 the structure of bacteriorhodopsin (Br) was determined
to atomic resolution by Henderson et al.[19], being the first atomic structure of
a membrane protein. The structure determination of Br was also the starting point
for the mrc program suite, which is widely used at the moment in the, albeit small,
2D electron crystallography community. Using the mrc software Kühlbrandt et al.[26]
solved the structure of the light-harvesting chlorophyll a/b-protein complex in 1994.
For recording the images they used the spot scan technique developed by Downing in
1991[9].
The first aquaporin water channel determined was aquaporin 1, resolved by Walz et
al. in 1997[45] at 6 Å resolution, and subsequently solved to atomic resolution by
Murata et al. in 2000[29]. Recently, several more aquaporin structures were determined
by 2D electron crystallographic methods, aquaporin-0 (AQP0) by Gonen et al. in
2004[14] at 3 Å and in 2005[13] at 1.9 Å and aquaporin-4 (AQP4) by Hiroaki et al.
in 2006[22]. Interestingly, AQP4 shows exactly the same monomer arrangement as
SoPIP2;1. The recent publications show that the trend goes from recording solely
images to the recording of diffraction data in combination with images or even to
recording diffraction data exclusively, and then using methods developed for x-ray
crystallography to obtain the phase information.
Given the fact that the software available for processing of 2D electron diffraction patterns
is less evolved than the one for processing images, and given this new development
of increased usage of diffraction patterns, it only makes sense to focus on implementing
new and improved programs for 2D electron diffraction processing.
In this work I would like to present the advances I achieved in the structural determination
of aquaporin 2, as well as my contribution to other projects, in particular the
structural investigations of SoPIP2;1 and KdgM. I will also explain the modified sample
preparation methods which made data recording at high tilt angles more reliable
and achieved an improvement in resolution of the measured data.
A second, equally important and detailed part of my thesis is the work invested in
improving and extending the image processing to a point where a user, not adept
in programming in several languages, can use it and produce good results. For this
I improved the functionality and performance at several points, including a strong
emphasis on user friendliness and ease of maintenance.
Development of Chimeric Cas9 Nucleases for Accurate and Flexible Genome Editing
There has been a tremendous amount of effort focused on the development and improvement of genome editing applications over the past decades. In particular, the development of programmable nucleases has revolutionized genome editing through improvements in mutagenesis efficacy and targeting feasibility. Programmable nucleases are suitable for a variety of genome editing applications. There is growing interest in employing programmable nucleases in therapeutic genome editing applications, such as correcting mutations in genetic disorders.
Type II CRISPR-Cas9 bacterial adaptive immunity systems have recently been engineered as RNA-guided programmable nucleases. Native CRISPR-Cas9 nucleases have two stages of sequence-specific target DNA recognition prior to cleavage: the intrinsic binding of the Cas9 nuclease to a short DNA element (the PAM), followed by testing of target site complementarity with the programmable guide RNA. The ease of reprogramming CRISPR-Cas9 nucleases for new target sequences makes them a favorable genome editing platform for many applications, including gene therapy. However, wild-type Cas9 nucleases have limitations: (i) the PAM element requirement restricts the targeting range of Cas9; (ii) despite the presence of two stages of target recognition, wild-type Cas9 can cleave DNA at unintended sites, which is not desired for therapeutic purposes; and (iii) there is a lack of control over the mutagenic editing product that is produced.
In this study, we developed and characterized chimeric Cas9 platforms to provide solutions to these limitations. In these platforms, the DNA-binding affinity of the Cas9 protein from S. pyogenes is attenuated such that target site binding is dependent on a fused programmable DNA-targeting unit that recognizes a neighboring DNA sequence. This modification extends the range of usable PAM elements and substantially improves the targeting specificity of wild-type Cas9. Furthermore, one of the featured chimeric Cas9 variants developed in this study has both robust nuclease activity and the ability to generate predictable, uniform editing products. These superior properties of the chimeric Cas9 platforms make them favorable for various genome editing applications and bring programmable nucleases one step closer to therapeutic applications.
Evolutionary genomics : statistical and computational methods
This open access book addresses the challenge of analyzing and understanding the evolutionary dynamics of complex biological systems at the genomic level, and elaborates on some promising strategies that would bring us closer to uncovering of the vital relationships between genotype and phenotype. After a few educational primers, the book continues with sections on sequence homology and alignment, phylogenetic methods to study genome evolution, methodologies for evaluating selective pressures on genomic sequences as well as genomic evolution in light of protein domain architecture and transposable elements, population genomics and other omics, and discussions of current bottlenecks in handling and analyzing genomic data. Written for the highly successful Methods in Molecular Biology series, chapters include the kind of detail and expert implementation advice that lead to the best results. Authoritative and comprehensive, Evolutionary Genomics: Statistical and Computational Methods, Second Edition aims to serve both novices in biology with strong statistics and computational skills, and molecular biologists with a good grasp of standard mathematical concepts, in moving this important field of study forward
Towards the understanding of transcriptional and translational regulatory complexity
Given that every cell carries the same genome, the observed phenotypic diversity can only arise from highly regulated mechanisms beyond the encoded DNA sequence. We investigated several mechanisms of protein biosynthesis and analyzed DNA methylation patterns, alternative translation sites, and genomic mutations. As chromatin states are determined by epigenetic modifications and nucleosome occupancy, we conducted a structural superimposition approach between DNA methyltransferase 1 (DNMT1) and the nucleosome, which suggests that DNA methylation is dependent on accessibility of DNMT1 to nucleosome-bound DNA. Considering translation, alternative non-AUG translation initiation was observed. We developed reliable prediction models to detect these alternative start sites in a given mRNA sequence. Our tool PreTIS provides initiation confidences for all frame-independent non-cognate and AUG starts. Despite these innate factors, specific sequence variations can additionally affect a phenotype. We conducted a genome-wide analysis with millions of mutations and found an accumulation of SNPs next to transcription starts that could relate to a gene-specific regulatory signal. We also report similar conservation of canonical and alternative translation sites, highlighting the relevance of alternative mechanisms. Finally, our tool MutaNET automates variation analysis by scoring the impact of individual mutations on cell function while also integrating a gene regulatory network.
Directionality of DNA mismatch repair in Escherichia coli
Non-canonical base pairs that escape the proof-reading activity of the DNA
polymerase emerge from DNA replication as DNA mismatches. To promote
genomic integrity, these DNA mismatches are corrected by a secondary
protection system, called DNA mismatch repair (MMR). Understanding the
details of MMR is important for human health as defects in mismatch repair can
result in cancer (e.g. hereditary nonpolyposis colorectal cancer, also known as
Lynch syndrome).
Being normally stochastic in nature, mismatches can emerge at random
locations in a chromosome. Therefore, using a molecular tool to generate
substrates for the MMR system at a defined locus has been particularly useful in
my study of DNA mismatch repair in vivo. In this study, I have used a CTG•CAG
repeat array, also called the “TNR array”, to generate frequent substrates for the
MMR system in Escherichia coli. In E. coli, the MMR system searches for hemimethylated
GATC motifs around a mismatch to initiate removal of the faulty
nascent (un-methylated) strand. Analysing the usage of GATC motifs around the
TNR array, I have found that the MMR system preferentially utilizes the GATC
motifs on the origin distal side of the TNR array demonstrating that the
bidirectionality of MMR in vitro is constrained in live cells. My results suggest
that in vivo MMR operates by searching for the nearest hemimethylated GATC
site located between the mismatch and the replication fork and excision of the
nascent strand occurs directionally away from the fork towards the mismatch.
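The search rule suggested by these results, taking the nearest GATC site between the mismatch and the replication fork, can be sketched as a simple string search. This is an illustration only: it ignores methylation state, and the function name, the `fork_direction` parameter and the toy sequence are hypothetical.

```python
from typing import Optional


def nearest_distal_gatc(seq: str, mismatch: int,
                        fork_direction: int = 1) -> Optional[int]:
    """Return the start index of the nearest GATC on the origin-distal side.

    fork_direction is +1 if the replication fork moves toward higher
    coordinates (origin-distal = downstream of the mismatch), else -1.
    Returns None if no GATC motif lies on that side.
    """
    seq = seq.upper()
    if fork_direction > 0:
        pos = seq.find("GATC", mismatch + 1)
    else:
        # Restrict the search to positions entirely before the mismatch.
        pos = seq.rfind("GATC", 0, mismatch)
    return pos if pos != -1 else None


if __name__ == "__main__":
    toy = "GATCAAAAATAAAGATCAAGATC"
    print(nearest_distal_gatc(toy, mismatch=9, fork_direction=1))
    print(nearest_distal_gatc(toy, mismatch=9, fork_direction=-1))
```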
Previous in vitro studies have established that the excision reaction during MMR
terminates at a discrete point about 100 bp beyond a mismatch. However, in
vivo recombination at a 275 bp tandem repeat, which has been proposed to be
mediated by single stranded DNA generated during the excision reaction, has
suggested that the end point of the excision reaction in live cells may extend
much further from the mismatch than this. I have used this assay for extended
excision to determine the influence of GATC sites on excision tracts. In this
study, modification of the GATC motifs on the origin proximal side of the TNR
has shown that the excision reaction does not stop at a GATC motif on the origin
proximal side of the mismatch. In addition, sequential modifications of GATC
motifs on the origin distal side of the TNR array, thereby shifting the start point
of the excision reaction to a greater distance, have suggested that the length of
an excision tract is a function of the distance it covers from the start point rather
than from a mismatch.
My observation of directionality with respect to DNA replication in the
recognition of GATC sites suggested that MMR and DNA replication might be
coupled in some way and that perhaps active (or blocked) MMR might impede
the progress of the replication fork. However, no replication intermediates were
detected using two-dimensional agarose gel electrophoresis of a genomic DNA
fragment containing the TNR array upon restriction digestion. I was therefore
unable to support the hypothesis that active or blocked MMR led to a slowing
down of DNA replication.
Given my observation of a decrease in MMR by separating the mismatch from
the closest origin distal GATC site, I set out to test whether MMR caused any
selection pressure for the genomic distribution of GATC motifs. To do this, I
generated artificial model genomes using a Markovian algorithm based on the
nucleotide composition and codon usage in E. coli. Strikingly, the comparison of
the distribution of GATC motifs in the E. coli genome with those from artificial
sequences has shown that GATC motifs are distributed randomly in the E. coli
genome, except for a small clustering effect which has been detected for closely
spaced (0-40 base pairs) GATC motifs. The observed distribution of slightly
over-represented GATC motifs in the E. coli genome appears to be a function of
the total number of GATC motifs and it seems that the DNA mismatch repair
system has evolved to utilize the natural distribution of GATC motifs to maintain
genomic integrity.
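The comparison of GATC spacing between a real genome and model sequences can be sketched as follows. This is a simplified stand-in for the approach described above: it uses a first-order Markov chain over nucleotides only and does not model codon usage, and all names and the toy training sequence are hypothetical.

```python
import random
from collections import Counter, defaultdict


def train_markov(seq: str) -> dict:
    """Count first-order nucleotide transitions in seq."""
    counts = defaultdict(Counter)
    for a, b in zip(seq, seq[1:]):
        counts[a][b] += 1
    return counts


def generate(counts: dict, length: int, rng: random.Random,
             start: str = "A") -> str:
    """Sample an artificial sequence from the transition counts."""
    out = [start]
    while len(out) < length:
        nxt = counts.get(out[-1])
        if not nxt:
            nxt = Counter("ACGT")  # unseen base: fall back to uniform
        bases, weights = zip(*nxt.items())
        out.append(rng.choices(bases, weights=weights)[0])
    return "".join(out)


def gatc_spacings(seq: str) -> list[int]:
    """Distances between the starts of consecutive GATC motifs."""
    positions = []
    i = seq.find("GATC")
    while i != -1:
        positions.append(i)
        i = seq.find("GATC", i + 1)
    return [b - a for a, b in zip(positions, positions[1:])]


if __name__ == "__main__":
    real = "GATCAAGATCGATCTTAGGATCCA"          # toy stand-in for a genome
    model = generate(train_markov(real), 10_000, random.Random(0), real[0])
    print(Counter(gatc_spacings(real)))
    print(len(gatc_spacings(model)), "spacings in the model sequence")
```

Comparing the two spacing histograms, ideally over many model sequences, indicates whether short spacings occur more often in the real sequence than the composition alone would predict.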