109 research outputs found

    Emerging Topics in Genome Sequencing and Analysis

    This dissertation studies emerging topics in genome sequencing and analysis of DNA and RNA. It discusses optimal hybrid sequencing and assembly for accurate genome reconstruction, and efficient approaches for detecting novel ncRNAs in genomes. Next-generation sequencing provides complete genetic information for further biological research. Recent advances in high-throughput genome sequencing technologies have enabled the systematic study of various genomes by making whole genome sequencing affordable. To date, many hybrid genome assembly algorithms have been developed that can take reads from multiple read sources to reconstruct the original genome. An important aspect of hybrid sequencing and assembly is that the feasibility conditions for genome reconstruction can be satisfied by different combinations of the available read sources, opening up the possibility of optimally combining the sources to minimize the sequencing cost while ensuring accurate genome reconstruction. In this study, we derive the conditions for whole genome reconstruction from multiple read sources at a given confidence level and also introduce the optimal strategy for combining reads from different sources to minimize the overall sequencing cost. We show that the optimal read set, which simultaneously satisfies the feasibility conditions for genome reconstruction and minimizes the sequencing cost, can be effectively predicted through constrained discrete optimization. The availability of genome-wide sequences for a variety of species provides a large database for further RNA analysis with computational methods. Recent studies have shown that noncoding RNAs (ncRNAs) play crucial roles in various biological processes, and some ncRNAs are related to genome stability and a variety of inherited diseases. 
The discovery of novel ncRNAs is hence an important topic, and there is a pressing need for accurate computational detection approaches that can be used to efficiently detect novel ncRNAs in genomes. One important issue is RNA structure alignment for comparative genome analysis, as RNA secondary structures are better conserved than the RNA sequences. Simultaneous RNA alignment and folding algorithms aim to accurately align RNAs by predicting the consensus structure and alignment at the same time, but the computational complexity of the optimal dynamic programming algorithm for simultaneous alignment and folding is extremely high. In this work, we proposed an innovative method, TOPAS, for RNA structural alignment that can efficiently align RNAs through topological networks. Although many ncRNAs are known to have a well conserved secondary structure, which provides useful clues for computational prediction, the prediction of ncRNAs is still challenging, since it has been shown that a structure-based approach alone may not be sufficient for detecting ncRNAs in a single sequence. In this study, we first develop a new approach by utilizing the n-gram model to classify the sequences and extract effective features to capture sequence homology. Based on this approach, we propose an advanced method, piRNAdetect, for reliable computational prediction of piRNAs in genome sequences. Utilizing the n-gram model can enhance the detection of ncRNAs that have sparse folding structures with many unpaired bases. By incorporating the n-gram model with the generalized ensemble defect, which assesses structure conservation and conformation to the consensus structure, we further propose RNAdetect, a novel computational method for accurate detection of ncRNAs through comparative genome analysis. 
Extensive performance evaluation based on the Rfam database and bacterial genomes demonstrates that our approaches can accurately and reliably detect novel ncRNAs, outperforming the current advanced methods
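
The n-gram idea mentioned in the abstract above can be illustrated with a minimal sketch. It only shows how a sequence might be turned into a fixed-length vector of n-gram frequencies. The function name `ngram_frequencies`, the default `n=3`, and the `ACGU` alphabet are assumptions made for illustration, and this is not the feature set used by piRNAdetect or RNAdetect, which the abstract does not specify.

```python
from itertools import product


def ngram_frequencies(sequence, n=3, alphabet="ACGU"):
    """Return relative frequencies of all n-grams over `alphabet`.

    Hypothetical helper for illustration: DNA input is mapped to RNA
    (T -> U), and windows containing other symbols (e.g. N) are skipped.
    """
    seq = sequence.upper().replace("T", "U")
    # Fixed feature order: all |alphabet|**n n-grams in lexicographic product order.
    grams = ["".join(p) for p in product(alphabet, repeat=n)]
    counts = dict.fromkeys(grams, 0)
    total = 0
    # Slide a window of width n over the sequence, one position at a time.
    for i in range(len(seq) - n + 1):
        gram = seq[i:i + n]
        if gram in counts:
            counts[gram] += 1
            total += 1
    if total == 0:
        return [0.0] * len(grams)
    return [counts[g] / total for g in grams]


if __name__ == "__main__":
    features = ngram_frequencies("UGACGUACGUUAGC", n=2)
    print(len(features))  # one value per possible 2-gram
```

A vector like this could be passed to any standard classifier. Larger `n` captures longer motifs but grows the feature space as 4^n, which is one reason small values of `n` are a common starting point.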

    Computational discovery of animal small RNA genes and targets

    Though recently discovered, small RNAs appear to play a wealth of regulatory roles, being involved in degradation of target mRNAs, translation silencing of target genes, chromatin remodeling and transposon silencing. Presented here are the computational tools that I developed to annotate and characterize small RNA genes and to identify their targets. One of these tools is oligomap, a novel software for fast and exhaustive identification of nearly-perfect matches of small RNAs in sequence databases. Oligomap is part of an automated annotation pipeline used in our laboratory to annotate small RNA sequences. The application of these tools to samples of small RNAs obtained from mouse and human germ cells, together with subsequent computational analyses, led to the discovery of a new class of small RNAs which are now called piRNAs. The computational analysis revealed that piRNAs have a strong uridine preference at their 5' end, that unlike miRNAs, piRNAs are not excised from fold-back precursors but rather from long primary transcripts, and that the genome organization of their genes is conserved between human and mouse even though piRNAs on the sequence level are poorly conserved. In vertebrates, the most studied class of small regulatory RNAs are the miRNAs which bind to mRNAs and block translation. A computational framework is introduced to identify miRNA targets in mammals, flies, worms and fish. The method uses extensive cross species conservation information to predict miRNA binding sites that are under evolutionary pressure. A downstream analysis of predicted miRNA targets revealed novel properties of miRNA target sites, one of which is a positional bias of miRNA target sites in long mammalian 3' untranslated regions. Intersection of our predictions with biochemical pathway annotation data suggested novel functions for some of the miRNAs. To gain further insights into the mechanism of miRNA targeting, I studied microarray data obtained in siRNA experiments. 
SiRNAs have been shown to produce off-targets that resemble miRNA targets. This analysis suggests the presence of additional determinants of miRNA target site functionality (beyond complementarity between the miRNA 5' end and the target) in the close vicinity (about 150 nucleotides) of the miRNA-complementary site. Finally, as part of a study aiming to reduce siRNA off-target effects by introducing chemical modifications in the siRNA, I performed microarray data analysis of siRNA transfection experiments. Presented are the methods used to quantify off-target activity of siRNAs carrying different types of chemical modifications. The analysis revealed that off-targets caused by the passenger strand of the siRNA can be reduced by 5'-O-methylation

    PredT4SE-Stack: Prediction of Bacterial Type IV Secreted Effectors From Protein Sequences Using a Stacked Ensemble Method

    Gram-negative bacteria use various secretion systems to deliver their secreted effectors. Among them, the type IV secretion system exists widely in a variety of bacterial species and secretes type IV secreted effectors (T4SEs), which play vital roles in host-pathogen interactions. However, experimental approaches to identify T4SEs are time- and resource-consuming. In the present study, we aim to develop an in silico stacked ensemble method to predict whether a protein is an effector of the type IV secretion system based on its sequence information. The protein sequences were encoded as position-specific scoring matrix (PSSM)-composition features, obtained by summing the rows that correspond to the same amino acid residue in the PSSM profile. Based on the PSSM-composition features, we developed a stacked ensemble model, PredT4SE-Stack, to predict T4SEs. It uses an ensemble of base classifiers implemented by various machine learning algorithms, such as support vector machine, gradient boosting machine, and extremely randomized trees, to generate outputs for the meta-classifier in the classification system. Our results demonstrated that the framework of PredT4SE-Stack was a feasible and effective way to accurately identify T4SEs based on protein sequence information. The datasets and source code of PredT4SE-Stack are freely available at http://xbioinfo.sjtu.edu.cn/PredT4SE_Stack/index.php
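
The PSSM-composition encoding described in the abstract above (summing the PSSM rows that belong to the same amino acid residue) can be sketched as follows. This is an illustrative sketch under stated assumptions, not the PredT4SE-Stack source code. The function name `pssm_composition` and the column order in `AMINO_ACIDS` are assumptions, as is the choice to skip non-standard residues. No length normalization is applied because the abstract does not mention one.

```python
# Assumed residue order, following the column order commonly seen in PSI-BLAST ASCII PSSM files.
AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"


def pssm_composition(sequence, pssm):
    """Sum the PSSM rows belonging to each residue type.

    sequence: protein sequence of length L (one-letter codes).
    pssm: list of L rows, each a list of 20 scores.
    Returns a flat list of 400 values (20 residue types x 20 PSSM columns).
    """
    if len(sequence) != len(pssm):
        raise ValueError("sequence and PSSM must have the same number of positions")
    index = {aa: i for i, aa in enumerate(AMINO_ACIDS)}
    totals = [[0.0] * 20 for _ in range(20)]
    for residue, row in zip(sequence, pssm):
        if len(row) != 20:
            raise ValueError("each PSSM row must contain 20 scores")
        i = index.get(residue)
        if i is None:
            # Assumption: non-standard residues (e.g. X) contribute nothing.
            continue
        for j, score in enumerate(row):
            totals[i][j] += score
    # Flatten row by row so every protein maps to the same 400-dimensional layout.
    return [value for row in totals for value in row]


if __name__ == "__main__":
    toy_sequence = "AAR"
    toy_pssm = [[1] * 20, [2] * 20, [5] * 20]  # made-up scores for illustration only
    features = pssm_composition(toy_sequence, toy_pssm)
    print(len(features))
```

Because the output length is fixed at 400 regardless of protein length, vectors produced this way can be stacked into a feature matrix for base classifiers such as those named in the abstract.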

    Identifying Plant Pentatricopeptide Repeat Coding Gene/Protein Using Mixed Feature Extraction Methods

    Motivation: Pentatricopeptide repeat (PPR) is a triangular pentapeptide repeat domain that plays a vital role in plant growth. In this study, we seek to identify PPR coding genes and proteins using a mixture of feature extraction methods. We use four single feature extraction methods focusing on the sequence, physical, and chemical properties as well as the amino acid composition, and mix the features. The Max-Relevant-Max-Distance (MRMD) technique is applied to reduce the feature dimension. Classification uses the random forest, J48, and naïve Bayes with 10-fold cross-validation. Results: Combining two of the feature extraction methods with the random forest classifier produces the highest area under the curve of 0.9848. Using MRMD to reduce the dimension improves this metric for J48 and naïve Bayes, but has little effect on the random forest results. Availability and Implementation: The webserver is available at: http://server.malab.cn/MixedPPR/index.jsp

    Small RNA Sorting in Drosophila Produces Chemically Distinct Functional RNA-Protein Complexes: A Dissertation

    Small interfering RNAs (siRNAs), microRNAs (miRNAs), and piwi-associated RNAs (piRNAs) are conserved classes of small single-stranded ~21-30 nucleotide (nt) RNA guides that repress eukaryotic gene expression using distinct RNA Induced Silencing Complexes (RISCs). At its core, RISC is composed of a single-stranded small RNA guide bound to a member of the Argonaute protein family, which together bind and repress complementary target RNA. miRNAs target protein coding mRNAs, a function essential for normal development and broadly involved in pathways of human disease; siRNAs defend against viruses, but can also be engineered to direct experimental or therapeutic gene silencing; piRNAs protect germline genomes from expansion of parasitic nucleic acids such as transposons. Using the fruit fly, Drosophila melanogaster, as a model organism, we seek to understand how small silencing RNAs are made and how they function. In Drosophila, miRNAs and siRNAs are proposed to have parallel, but separate biogenesis and effector machinery. miRNA duplexes are excised from imperfectly paired hairpin precursors by Dicer1 and loaded into Ago1; siRNA duplexes are hewn from perfectly paired long dsRNA by Dicer2 and loaded into Ago2. Contrary to this model, we found that one miRNA, miR-277, is made by Dicer1, but partitions between Ago1 and Ago2 RISCs. These two RISCs are functionally distinct: Ago2 could silence a perfectly paired target, but not a centrally bulged target; Ago1 could silence a bulged target, but not a perfect target. This was surprising since both Ago1 and Ago2 have endonucleolytic cleavage activity necessary for perfect target cleavage in vitro. Our detailed kinetic studies suggested why: Ago2 is a robust multiple turnover enzyme, but Ago1 is not. 
Along with a complementary in vitro study, our data support a duplex sorting mechanism in which Diced duplexes are released, and rebind to Ago1 or Ago2 loading machinery, regardless of which Dicer produced them. This allows structural information embedded in small RNA duplexes to direct small RNA loading into Ago1 and/or Ago2, resulting in distinct regulatory outputs. Small RNA sorting also has chemical consequences for the small RNA guide. Although siRNAs were presumed to have the signature 2′, 3′ hydroxyl ends left by Dicer, we found that small RNAs loaded into Ago2 or Piwi proteins, but not Ago1, are modified at their 3´ ends by the RNA 2´-O-methyltransferase DmHen1. In plants, Hen1 modifies the 3´ ends of all small RNA duplexes, protecting and stabilizing them. Implying a similar function in flies, piRNAs are smaller, less abundant, and their function is perturbed in hen1 mutants. But unlike in plants, small RNAs are modified as single strands in RISC rather than as duplexes. This nicely explains why the dsRNA binding domain in plant Hen1 was discarded in animals, and why both dsRNA derived siRNAs and ssRNA derived piRNAs are modified. The recent discovery that both piRNAs and siRNAs target transposons links terminal modification and transposon silencing, suggesting that it is specialized for this purpose

    Human genome interaction: models for designing DNA sequences

    Since the turn of the century, the scope and scale of Synthetic Biology projects have grown dramatically. Instead of limiting themselves to simple genetic circuits, researchers aim for genome-scale organism redesigns, revolutionary gene therapies, and high throughput, industrial scale natural product syntheses. However, the engineering principles adopted by the founders of the field have been applied to Biology in a way that does not fit many modern experiments. This has limited the usefulness of common sequence design paradigms. As experiments have become more complex, the sequence design process has taken up more and more intellectual bandwidth, partially because software tools for DNA design have remained largely unchanged. This thesis will explore software engineering, social science, and machine learning projects aiming to improve the ways in which researchers design novel DNA sequences for Synthetic Biology experiments. Popular DNA design tools will be reviewed, alongside an analysis of the key conceptual metaphors that underlie their workflows. Flaws in the ubiquitous parts-based design model will be demonstrated, and several alternatives will be explored. A tool called Part Crafter (partcrafter.com) will be presented, which aggregates sequence and annotation data from a variety of data sources to allow for rational search over genomic features, as well as the automated production of biological parts for Synthetic Biology experiments. However, Part Crafter’s mode of part creation is more flexible than traditional implementations of parts-based design in the field. Parts are abstracted away from specific manufacturing standards, and as much contextual information as possible is presented alongside parts of interest. Additionally, various types of machine learning models will be presented which predict histone modification occupancy in novel sequences. Current Synthetic Biology design paradigms largely ignore the epigenetic context of designed sequences. 
A gradient of increasingly complex models will be analysed in order to characterise the complexity of the combinatorial patterns of sequences of these epigenetic proteins. This work was exploratory, serving as a proof of concept for using a variety of increasingly complex models to represent genomic elements, and demonstrating that the parts-based design model is not the only option available to us. The aims of the field of Synthetic Biology become more ambitious every year. In order for the goals of the field to be accomplished, we must be able to better understand the sequences we are designing. The projects presented in this thesis were all completed with the aim of assisting Synthetic Biologists in designing sequences deliberately. By taking into account as much contextual information as possible, including epigenetic factors, researchers will be able to design sequences more quickly and reliably, increasing their chances of achieving the moon shot goals of the field

    Dissecting Small RNA Loading Pathway in Drosophila melanogaster: A Dissertation

    In the preceding chapters, I have discussed my doctoral research on studying the siRNA loading pathway in Drosophila using both biochemical and genetic approaches. We established a gel shift system to identify the intermediate complexes formed during siRNA loading. We detected at least three complexes, named complex B, RISC loading complex (RLC) and RISC. Using kinetic modeling, we determined that the siRNA enters complex B and RLC early during assembly when it remains double-stranded, and then matures in RISC to generate Argonaute bearing only the single-stranded guide. We further characterized the three complexes. We showed that complex B comprises Dcr-1 and Loqs, while both RLC and RISC contain Dcr-2 and R2D2. Our study suggests that the Dcr-2/R2D2 heterodimer plays a central role in RISC assembly. We observed that Dcr-1/Loqs, which function together to process pre-miRNA into mature miRNA, were also involved in siRNA loading. This was surprising, because it has been proposed that the RNAi pathway and miRNA pathway are separate and parallel, with each using a unique set of proteins to produce small RNAs, to assemble functional RNA-guided enzyme complexes, and to regulate target mRNAs. We further examined the molecular function of Dcr-1/Loqs in the RNAi pathway. Our data suggest that, in vivo and in vitro, the Dcr-1/Loqs complex binds to siRNA. In vitro, the binding of the Dcr-1/Loqs complex to siRNA is the earliest detectable step in siRNA-triggered Ago2-RISC assembly. Furthermore, the binding of Dcr-1/Loqs to siRNA appears to facilitate dsRNA dicing by Dcr-2/R2D2, because the dicing activity is much lower in loqs lysate than in wild type. Long inverted repeat (IR) triggered white silencing in fly eyes is an example of endogenous RNAi. Consistent with our finding that Dcr-1/Loqs function to load siRNA, less white siRNA accumulates in loqs mutant eyes compared to wild type. As a result, loqs mutants are partially defective in IR-triggered white silencing. 
Our data suggest considerable functional and genetic overlap between the miRNA and siRNA pathways, with the two sharing key components previously thought to be confined to just one of the two pathways. Based on our study of the siRNA loading pathway, we also elucidated the molecular function of the Armitage (Armi) protein in RNAi. We showed that armi is required for RNAi. Lysates from armi mutant ovaries are defective for RNAi in vitro. Native gel analysis of protein-siRNA complexes suggests that armi mutants support early steps in the RNAi pathway, i.e., the formation of complex B and RLC, but are defective in the production of the RISC

    Mining Functional Elements in Messenger RNAs: Overview, Challenges, and Perspectives

    Eukaryotic messenger RNA (mRNA) contains not only protein-coding regions but also a plethora of functional cis-elements that influence or coordinate a number of regulatory aspects of gene expression, such as mRNA stability, splicing forms, and translation rates. Understanding the rules that apply to each of these element types (e.g., whether the element is defined by primary or higher-order structure) allows for the discovery of novel mechanisms of gene expression as well as the design of transcripts with controlled expression. Bioinformatics plays a major role in creating databases and finding non-evident patterns governing each type of eukaryotic functional element. Much of what we currently know about mRNA regulatory elements in eukaryotes is derived from microorganism and animal systems, with the particularities of plant systems lagging behind. In this review, we provide a general introduction to the most well-known eukaryotic mRNA regulatory motifs (splicing regulatory elements, internal ribosome entry sites, iron-responsive elements, AU-rich elements, zipcodes, and polyadenylation signals) and describe available bioinformatics resources (databases and analysis tools) to analyze eukaryotic transcripts in search of functional elements, focusing on recent trends in bioinformatics methods and tool development. We also discuss future directions in the development of better computational tools based upon current knowledge of these functional elements. Improved computational tools would advance our understanding of the processes underlying gene regulations. We encourage plant bioinformaticians to turn their attention to this subject to help identify novel mechanisms of gene expression regulation using RNA motifs that have potentially evolved or diverged in plant species

    Detecting and comparing non-coding RNAs in the high-throughput era.

    In recent years there has been a growing interest in the field of non-coding RNA. This surge is a direct consequence of the discovery of a huge number of new non-coding genes and of the finding that many of these transcripts are involved in key cellular functions. In this context, accurately detecting and comparing RNA sequences has become important. Aligning nucleotide sequences is a key requisite when searching for homologous genes. Accurate alignments reveal evolutionary relationships, conserved regions and more generally any biologically relevant pattern. Comparing RNA molecules is, however, a challenging task. The nucleotide alphabet is simpler and therefore less informative than that of amino-acids. Moreover for many non-coding RNAs, evolution is likely to be mostly constrained at the structural level and not at the sequence level. This results in very poor sequence conservation impeding comparison of these molecules. These difficulties define a context where new methods are urgently needed in order to exploit experimental results to their full potential. This review focuses on the comparative genomics of non-coding RNAs in the context of new sequencing technologies and especially dealing with two extremely important and timely research aspects: the development of new methods to align RNAs and the analysis of high-throughput data

    Chromatin and Epigenetics

    Genomics has gathered broad public attention since Lamarck put forward his top-down hypothesis of 'motivated change' in 1809 in his famous book "Philosophie Zoologique" and even more so since Darwin published his famous bottom-up theory of natural selection in "The Origin of Species" in 1859. The public awareness culminated in the much anticipated race to decipher the sequence of the human genome in 2002. Over all those years, it has become apparent that genomic DNA is compacted into chromatin with a dedicated 3D higher-order organization and dynamics, and that on each structural level epigenetic modifications exist. The book "Chromatin and Epigenetics" addresses current issues in the fields of epigenetics and chromatin, ranging from more theoretical overviews in the first four chapters to much more detailed methodologies and insights into diagnostics and treatments in the following chapters. The chapters illustrate in their depth and breadth that genetic information is stored on all structural and dynamical levels within the nucleus with corresponding modifications of functional relevance. Thus, only an integrative systems approach makes it possible to understand, treat, and manipulate the holistic interplay of genotype and phenotype creating functional genomes. The book chapters therefore contribute to this general perspective, not only opening opportunities for a true universal view on genetic information but also being key for a general understanding of genomes, their function, as well as life and evolution in general