    The EM Algorithm and the Rise of Computational Biology

    In the past decade computational biology has grown from a cottage industry with a handful of researchers to an attractive interdisciplinary field, catching the attention and imagination of many quantitatively minded scientists. Of interest to us is the key role played by the EM algorithm during this transformation. We survey the use of the EM algorithm in a few important computational biology problems surrounding the "central dogma" of molecular biology: from DNA to RNA and then to proteins. Topics of this article include sequence motif discovery, protein sequence alignment, population genetics, evolutionary models and mRNA expression microarray data analysis. Comment: Published at http://dx.doi.org/10.1214/09-STS312 in Statistical Science (http://www.imstat.org/sts/) by the Institute of Mathematical Statistics (http://www.imstat.org

    Annotation of Non-Model Species’ Genomes

    Innovations in high-throughput sequencing technologies in recent decades have allowed unprecedented examination and characterization of the genetic make-up of both model and non-model species, which has led to a surge in the use of genomics in fields where it was previously considered unfeasible. These advances have greatly expanded the realm of possibilities in the fields of ecology and conservation. It is now possible to identify large cohorts of genetic markers, including single nucleotide polymorphisms (SNPs) and larger structural variants, as well as signatures of selection and local adaptation. Markers can be used to identify species, define population structure, and assess genetic health. In addition, researchers can examine unique features of genes related to the health of a threatened species, such as genes involved in immune function, reproduction, and environmental response, as well as evolutionary trends and niche adaptations. Recent developments in sequencing and software also allow researchers to examine the noncoding “ome”, providing a glimpse into gene regulation, developmental pathways, and response to viral sequences and mobile elements. Characterization of the biogenesis pathways of noncoding RNAs has facilitated the development of RNA interference strategies, which are increasingly being used in therapeutics, agriculture, and pest control. The focus of this project is to use available sequencing technology and computational methods to annotate the genome and small RNA pathways of non-model organisms.

    Using Experimental and Computational Strategies to Understand the Biogenesis of microRNAs and piRNAs: A Dissertation

    Small RNAs are single-stranded, 18–36 nucleotide RNAs that can be categorized as miRNA, siRNA, and piRNA. miRNAs are expressed ubiquitously in tissues and at particular developmental stages. They fine-tune gene expression by regulating the stability and translation of mRNAs. piRNAs are mainly expressed in the animal gonads and their major function is repressing transposable elements to ensure the faithful transfer of genetic information from generation to generation. My thesis research focused on the biogenesis of miRNAs and piRNAs using both experimental and computational strategies. The biogenesis of miRNAs involves sequential processing of their precursors by the RNase III enzymes Drosha and Dicer to generate miRNA/miRNA* duplexes, which are subsequently loaded into Argonaute proteins to form the RNA-induced silencing complex (RISC). We discovered that, after being assembled into Ago1, more than a quarter of Drosophila miRNAs undergo 3′ end trimming by the 3′-to-5′ exoribonuclease Nibbler. Such trimming occurs after removal of the miRNA* strand from pre-RISC and may be the final step in RISC assembly, ultimately enhancing target messenger RNA repression. Moreover, by developing a specialized Burrows-Wheeler Transform-based short-read aligner, we discovered that in the absence of Nibbler a subgroup of miRNAs undergoes increased tailing (non-templated nucleotide addition to their 3′ ends), which is usually associated with miRNA degradation. Therefore, the 3′ trimming by Nibbler might increase miRNA stability by protecting them from degradation. In the Drosophila germ line, piRNAs associate with three PIWI-clade Argonaute proteins, Piwi, Aub, and Ago3. piRNAs bound by Aub and Ago3 are generated by reciprocal cleavages of sense and antisense transposon transcripts (a.k.a., the “Ping-Pong” cycle), which amplifies piRNA abundance and degrades transposon transcripts in the cytoplasm.
On the other hand, Piwi and its associated piRNAs repress the transcription of transposons in the nucleus. We discovered that Aub- and Ago3-mediated transposon RNA cleavage not only generates piRNAs bound to each other, but also produces substrates for the endonuclease Zucchini, which processively cleaves those substrates with a periodicity of ~26 nt and generates piRNAs that predominantly load into Piwi. Without Aub or Ago3, the abundance of Piwi-bound piRNAs drops and transcriptional silencing is compromised. Our discovery revises the current model of piRNA biogenesis.
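
The abstract mentions a Burrows-Wheeler Transform-based short-read aligner without describing it. As a rough illustration of the underlying idea only, the following minimal Python sketch builds a BWT by sorting rotations and counts exact matches of a read with backward search. The function names `bwt` and `count_occurrences` are hypothetical, the construction is deliberately naive, and nothing here should be read as the dissertation's actual aligner, which would also need to handle features such as non-templated 3′ tails.

```python
def bwt(text):
    """Return the Burrows-Wheeler Transform of text, with '$' as end marker."""
    s = text + "$"  # '$' sorts before A, C, G, T in ASCII
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rotation[-1] for rotation in rotations)


def count_occurrences(bwt_str, pattern):
    """Count exact occurrences of pattern in the original text by backward search."""
    # Character counts in the transformed string (same as in the text plus '$').
    counts = {}
    for ch in bwt_str:
        counts[ch] = counts.get(ch, 0) + 1

    # first[c]: number of characters in the text that sort strictly before c.
    first = {}
    total = 0
    for ch in sorted(counts):
        first[ch] = total
        total += counts[ch]

    # occ[c][i]: number of occurrences of c in bwt_str[:i].
    occ = {ch: [0] * (len(bwt_str) + 1) for ch in counts}
    for i, ch in enumerate(bwt_str):
        for c in occ:
            occ[c][i + 1] = occ[c][i] + (1 if c == ch else 0)

    # Narrow the interval of sorted rotations, reading the pattern right to left.
    lo, hi = 0, len(bwt_str)
    for ch in reversed(pattern):
        if ch not in first:
            return 0
        lo = first[ch] + occ[ch][lo]
        hi = first[ch] + occ[ch][hi]
        if lo >= hi:
            return 0
    return hi - lo
```

For example, `count_occurrences(bwt("ACGTACGTAC"), "ACG")` counts how often the read `ACG` occurs in the reference string. A production aligner would use a suffix-array-based construction and sampled occurrence tables rather than the quadratic-memory tables used here for clarity.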

    Holocentric plants of the genus Rhynchospora as a new model to study meiotic adaptations to chromosomal structural rearrangements

    Climate change, world hunger and overpopulation are some of the biggest challenges the world is currently facing. Moreover, they are part of a single multidimensional scenario: as climate change continues to modify our planet, we might see a decrease in arable land and an increase in extreme weather patterns, posing a threat to food security. This has a direct impact on regions with high population growth, where food security is already precarious. Considering additionally the unsustainability of intensive global food production and its contribution to greenhouse gas emissions and biodiversity loss, it is clear that all these factors are interconnected (Cardinale et al., 2012; Prosekov & Ivanova, 2018; Wiebe et al., 2019). Plants are the main source of staple food in the world and are also the main actors in carbon fixation; they are therefore key protagonists in controlling climate change. Plants are also an essential habitat-defining element balancing our ecosystem. Thus, how we grow plants and crops will, aside from the obvious implications for food security, also have a profound impact on the climate and biodiversity. The natural variability of species is considered an immense pool of genes and traits, and understanding it is key to generating new useful knowledge. For instance, natural populations can be more tolerant to abiotic and biotic stresses, or carry traits that, combined in hybrids, might achieve a higher seed number or faster growth. Classical breeding has exploited unrelated varieties to achieve traits of interest like dwarfism and higher grain production. However, only a limited number of crop species have been the focus of recent scientific and technological approaches, and they do not represent the extremely vast natural diversity of species that could generate useful knowledge for future applications (Castle et al., 2006; Pingali, 2012).
The key to this natural variability is a process called meiotic recombination, the exchange of genomic material between homologous parental chromosomes. Meiotic recombination takes place during meiosis, a specialized cell division in which sexually reproducing organisms reduce the genomic complement of their gametes by half in preparation for fertilization. It occurs at the beginning of meiosis, in a stage called prophase I. To exchange DNA sequences, the strands of two homologous chromosomes must be fragmented. This specific process of physiologically induced DNA fragmentation is conserved in the vast majority of eukaryotes (Keeney et al., 1997). After the formation of double-strand breaks, the 3’ ends that are left are targeted by recombinases that help the strands search for and invade templates for repair. After invasion, the 3’ end is extended by DNA synthesis, exposing sequences on the opposite strand that can anneal to the other 3’ end of the original double-strand break. DNA synthesis at both ends generates a new structure called a double Holliday Junction (dHJ), forming a physical link between homologous chromosomes, named a chiasma (Wyatt & West, 2014). The resolution of these structures yields crossovers (COs), the molecular events representing the outcome of meiotic recombination. Other outcomes are possible, like noncrossovers (NCOs). In this case, the invading strand is ejected and anneals to the single-stranded 3’ end of the original double-strand break (Allers & Lichten, 2001). Crossovers can be divided into two main groups, called class I and class II. COs of the first group are considered to be sensitive to interference, which means that there are mechanisms that prevent two class I COs from happening in proximity to each other. Class II COs are insensitive to interference.
Class I COs are the result of a pathway called ZMM, which involves a group of specialised proteins that are highly conserved among eukaryotes (Lambing et al., 2017; Mercier et al., 2015). Class I COs are the most common, best-studied and most important type of COs. Centromeres are structures, located on regions of the chromosomes, that allow proper chromosome segregation during mitosis and meiosis. Centromeres have a profound effect on plant breeding and crop improvement, as it is known that meiotic recombination is suppressed at centromeres in most eukaryotes. This represents a great limitation for crop improvement, as many possibly useful traits might be in regions not subject to recombination and thus might not be available for breeding purposes. Additionally, the mechanisms by which recombination is regulated and prevented from happening at centromeres are still unclear. In most model organisms centromeres are single entities localized on specific regions of the chromosomes. This configuration is called monocentric. However, another, less studied type of configuration can be found in nature: some organisms harbour multiple centromeric determinants distributed over their whole chromosomal length. This configuration is called holocentric. The Cyperaceae comprise a vast, diverse family of plants, with a cosmopolitan distribution in all habitats (Spalink et al., 2016). Despite the presence of this family worldwide, knowledge about it is limited. Few genomes are available and molecular insights are scarce. This family is also known to be mainly formed by holocentric species (Melters et al., 2012). Understanding if and how meiotic recombination is achieved in holocentric plants will generate new knowledge that in the future might unlock new traits in elite crops, previously unavailable to breeding, that could help humanity face global climatic, economic and social challenges.
Recent studies have reported new knowledge about important meiotic, chromosome and genome adaptations found in species of the Cyperaceae family and in particular the genus Rhynchospora (Marques et al., 2015, 2016a). With the recent publication of the first reference genomes for several Rhynchospora species, we could already perform a comprehensive analysis of their unique genome features and trace the evolutionary history of their karyotypes and how these have been determined by chromosome fusions (Hofstatter et al., 2021, 2022). This new resource paves the way for future research utilising Rhynchospora as a model genus to study adaptations to holocentricity in plants. With this work, my intention is to shed light on the underexplored topic of holocentricity in plants. Using cutting-edge techniques, I examine the conservation of meiotic recombination together with other species-specific adaptations like achiasmy and polyploidy in holocentrics. My results reveal new insights into how plant meiotic recombination is regulated when small centromere units are found distributed chromosome-wide, challenging the classic dogma of suppression of recombination at centromeres.

    De novo reconstruction of satellite repeat units from sequence data

    Satellite DNAs are long, tandemly repeating sequences in a genome and may be organized as high-order repeats (HORs). They are enriched in centromeres and are challenging to assemble. Existing algorithms for identifying satellite repeats either require the complete assembly of satellites or only work for simple repeat structures without HORs. Here we describe Satellite Repeat Finder (SRF), a new algorithm for reconstructing satellite repeat units and HORs from accurate reads or assemblies without prior knowledge of repeat structures. Applying SRF to real sequence data, we showed that SRF could reconstruct known satellites in human and well-studied model organisms. We also found that satellite repeats are pervasive in various other species, accounting for up to 12% of their genome contents, but are often underrepresented in assemblies. With the rapid progress in genome sequencing, SRF will help the annotation of new genomes and the study of satellite DNA evolution even if such repeats are not fully assembled.

    Transcriptional and post-transcriptional regulation of leaf development in Arabidopsis thaliana

    Plant growth follows a strict developmental program but also needs to incorporate environmental cues to adapt to the encountered conditions. This requires a complex regulatory network to ensure an appropriate response to changing conditions. We used the first leaf pair of Arabidopsis thaliana as a model system to study the regulation of organ development. Leaf growth can be divided into successive phases according to the major process driving it. In a young leaf primordium cells divide continuously and cell size homeostasis is ensured by matching rates of cell expansion. Next, cell division ceases and cell expansion becomes the driving force for growth. When the leaf has attained its final size, maturity is reached. In this thesis, I studied the regulation of leaf development at two regulatory levels. At the gene level, we analyzed the function of the CYCA2 core cell cycle regulatory gene family. We also studied the function of two new proliferation-specific gene families putatively involved in cell cycle regulation. On the other hand, we profiled small RNA sequences during development and linked this with the occurrence of DNA methylation. The core machinery of the cell cycle in plants has been thoroughly studied, but our knowledge of how developmental and environmental signals impinge on cell division is still limited. CYCA2s are known core cell cycle regulators, involved in the G2-to-M transition. Here, we studied the functional requirement of this gene family and showed that transcriptional repression is required for specific differentiation processes. Members of the CYCA2 protein family function in vascular development and differentiation of guard cells. For the latter process, we demonstrated that FOUR LIPS and MYB88, two transcription factors involved in stomatal development, directly repress CYCA2;3 expression, thus ensuring correct guard cell differentiation.
In addition to known ‘core’ cell cycle regulating genes, we also selected proliferation-specific genes with unknown function, assuming them to be involved in the cell division process. We focused on two small gene families: three genes with four transmembrane domains (4TMs) and two genes containing three High Mobility Group (HMG) domains (3xHMG-box). Expression analysis and localization of transcriptional fusions with a fluorescent marker confirmed the highly proliferation-specific expression pattern for both gene families. Moreover, both families are highly induced in the M-phase of the cell cycle in synchronized cell cultures. The 4TMs localize to the cell plate during mitosis and we observed defects in cell plate formation upon overexpression and depletion of these genes. Therefore, we hypothesize that the 4TM genes are involved in formation of the cell plate. Profiling of small RNAs (sRNAs) in plants has thus far mainly been focused on inflorescence tissue or whole seedlings. Here, we studied sRNAs during the different phases of development. Early in development, microRNAs implicated in nutrient stress response are upregulated, suggesting that at this phase nutrient availability is limiting for growth. We showed that specifically 24-nt sRNAs increase in expression during development. This class of sRNAs is known to be involved in RNA-dependent DNA methylation (RdDM) and can thus silence both transposons and genes. In general, the expression of sRNAs matching the coding sequences of protein-coding genes is positively correlated with the mRNA expression of these genes. We specifically selected genes that do not show this correlation, which were highly enriched in two categories: targets of microRNAs and trans-acting siRNAs, which generate phased sRNAs upon cleavage, and genes for which the sRNA profile is enriched for 24-nt sRNAs. This latter category is likely regulated through RdDM as this subset of genes shows increased DNA methylation in the gene body.
This suggests that sRNA regulation could play an important role in regulating the leaf developmental process not only by preserving genome integrity by repressing transposon activity but also through silencing of protein-coding genes.

    Modified penetrance of coding variants by cis-regulatory variation contributes to disease risk

    Coding variants represent many of the strongest associations between genotype and phenotype; however, they exhibit interindividual differences in effect, termed 'variable penetrance'. Here, we study how cis-regulatory variation modifies the penetrance of coding variants. Using functional genomic and genetic data from the Genotype-Tissue Expression Project (GTEx), we observed that in the general population, purifying selection has depleted haplotype combinations predicted to increase pathogenic coding variant penetrance. Conversely, in cancer and autism patients, we observed an enrichment of penetrance-increasing haplotype configurations for pathogenic variants in disease-implicated genes, providing evidence that regulatory haplotype configuration of coding variants affects disease risk. Finally, we experimentally validated this model by editing a Mendelian single-nucleotide polymorphism (SNP) using CRISPR/Cas9 on distinct expression haplotypes with the transcriptome as a phenotypic readout. Our results demonstrate that joint regulatory and coding variant effects are an important part of the genetic architecture of human traits and contribute to modified penetrance of disease-causing variants. Peer reviewed

    Improved Computational Prediction of Function and Structural Representation of Self-Cleaving Ribozymes with Enhanced Parameter Selection and Library Design

    Biomolecules could be engineered to solve many societal challenges, including disease diagnosis and treatment, environmental sustainability, and food security. However, our limited understanding of how mutational variants alter molecular structures and functional performance has constrained the potential of important technological advances, such as high-throughput sequencing and gene editing. Ribonucleic acid (RNA) sequences are thought to play a central role in many of these challenges. Their continual discovery throughout all domains of life is evidence of their significant biological importance (Weinreb et al., 2016). The self-cleaving ribozyme is a class of noncoding ribonucleic acid (ncRNA) that has been useful for relating sequence variants to structural features and their associated catalytic activities. Self-cleaving ribozymes possess tractable sequence spaces, perform easily identifiable catalytic functions, and have well-documented structures. The determination of a self-cleaving ribozyme’s structure and catalytic activity within the laboratory is typically a slow and expensive process. Most current explorations of structure and function come from these empirical processes. Computational approaches to the prediction of catalytic activity and structure are fast and inexpensive, but have failed either to achieve atomic accuracy or to correctly identify all base-pair interactions (Watkins et al., 2018). One prominent impediment to computational approaches is the lack of existing structural and functional data typically required by predictive models (Jumper et al., 2021). Using data from deep-mutational scanning experiments and high-throughput sequencing technology, it is possible to computationally map mutational variants to their observed catalytic activity for a range of self-cleaving ribozymes. The resulting map reveals important base-pairing relationships that, in turn, facilitate accurate predictions of higher-order variants.
Using sequence data from three experimental replicates of five model self-cleaving ribozymes, I will identify and map all single and double mutation variants to their observed cleavage activity. These mappings will be used to identify structural features within each ribozyme. Next, I will show within a training tool how observed cleavage for multiple reaction times can be used to identify the catalytic rates of our model ribozymes. Finally, I will predict the functional activity for model ribozyme variants of various mutational orders using machine learning models trained only on functionally labeled sequence variants. Together, these three dissertation chapters represent the kind of analysis needed to further the implementation of more accurate structural and functional prediction algorithms.
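
The abstract states that cleavage observed at multiple reaction times can be used to identify catalytic rates, but it does not give the kinetic model. As a hedged illustration, the sketch below assumes simple first-order kinetics, where the fraction cleaved at time t is f(t) = 1 - exp(-k t), and estimates k by least squares through the origin on the linearized values -ln(1 - f). Both the model and the function name `estimate_rate` are assumptions made for this example and may differ from the approach taken in the dissertation.

```python
import math


def estimate_rate(times, fractions):
    """Estimate a first-order rate constant k from fraction-cleaved measurements.

    Assumes f(t) = 1 - exp(-k * t), so that -ln(1 - f) = k * t, and fits k by
    least squares through the origin. This single-exponential model is an
    assumption of the sketch, not something stated in the source.
    """
    numerator = 0.0
    denominator = 0.0
    for t, f in zip(times, fractions):
        if not 0.0 <= f < 1.0:
            raise ValueError("fraction cleaved must lie in [0, 1)")
        y = -math.log(1.0 - f)  # linearized observation
        numerator += t * y
        denominator += t * t
    if denominator == 0.0:
        raise ValueError("at least one nonzero time point is required")
    return numerator / denominator
```

With noise-free data generated from a known rate, for example `fractions = [1 - math.exp(-0.3 * t) for t in times]`, the function recovers that rate up to floating-point error. Real sequencing-derived fractions are noisy and may plateau below 1, in which case a nonlinear fit with an extra plateau parameter would be a more appropriate choice.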