193 research outputs found

    PlantRNA_sniffer : a SVM-based workflow to predict long intergenic non-coding RNAs in plants

    Non-coding RNAs (ncRNAs) constitute an important set of transcripts produced in the cells of organisms. Among them is a large and hard-to-predict class of long ncRNAs, the so-called long intergenic ncRNAs (lincRNAs), which might play essential roles in gene regulation and other cellular processes. Despite the importance of lincRNAs, biological knowledge about them is still lacking, and the few existing computational methods are so species-specific that they cannot be applied successfully to species other than those for which they were originally designed. Prediction of lncRNAs has been performed with machine learning techniques; for lincRNA prediction in particular, supervised learning methods have been explored in recent literature. As far as we know, there are no methods or workflows specifically designed to predict lincRNAs in plants. In this context, this work proposes a workflow to predict lincRNAs in plants that combines known bioinformatics tools with a machine learning technique, here a support vector machine (SVM). We discuss two case studies that allowed us to identify novel lincRNAs in sugarcane (Saccharum spp.) and in maize (Zea mays). From the results, we could also identify differentially expressed lincRNAs in sugarcane and maize plants exposed to pathogenic and beneficial microorganisms.

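    Classificação de RNAs não-codificadores longos intergênicos usando máquina de vetores de suporte : um estudo de caso para a cana-de-açúcar [Classification of long intergenic non-coding RNAs using a support vector machine: a case study for sugarcane]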

    Undergraduate monograph, Universidade de Brasília, Instituto de Ciências Exatas, Departamento de Ciência da Computação, 2016. Among RNAs, some are translated into proteins, while others, although they do not take part in protein synthesis, perform important functions in cells; the latter are called non-coding RNAs (ncRNAs). Functions of ncRNAs include catalyzing chemical reactions and regulating other RNAs.
Broadly, ncRNAs can be divided into two classes: small ncRNAs, which are between 20 and 300 nucleotides long and have known features; and long ncRNAs (lncRNAs), which are longer than 200 nucleotides, have little capacity for protein synthesis, and are not yet fully understood. Among the lncRNAs are the so-called long intergenic non-coding RNAs (lincRNAs), which are located in intergenic regions and can play important roles in gene regulation and disease. Although there are many projects related to lincRNAs, in both molecular biology and computational biology, no method for predicting them is widely used. In this context, after validating features taken from the literature, we created a model based on a support vector machine (SVM) to predict lincRNAs. Two case studies were developed: the first to measure the performance of the model using Mus musculus (mouse) and Homo sapiens (human), and the second to predict lincRNAs in Saccharum officinarum (sugarcane). The experiments showed that the model has good accuracy: 90% in mouse, 99% in human, and 91% in both simultaneously, which are better results than those of iSeeRNA. For sugarcane, the method predicted 67 lincRNAs, using a pipeline built specifically to predict lincRNAs that includes the SVM model trained with features extracted from plants.
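
As an illustration of the kind of input such an SVM model might receive, the sketch below applies the 200-nucleotide length threshold mentioned above and computes the longest open reading frame (ORF) as a rough proxy for protein-coding capacity. The choice of features, the names `longest_orf_length` and `candidate_features`, and the forward-strand-only ORF scan are assumptions made for illustration; the abstract does not list the features actually used.

```python
# Hypothetical feature extraction for lncRNA candidates (not the authors' pipeline).
STOP_CODONS = {"TAA", "TAG", "TGA"}
MIN_LNCRNA_LENGTH = 200  # lncRNAs are described as longer than 200 nucleotides


def longest_orf_length(seq: str) -> int:
    """Length in nucleotides of the longest ATG..stop ORF on the forward
    strand, with the stop codon included. Returns 0 if no complete ORF exists."""
    seq = seq.upper().replace("U", "T")  # accept RNA or DNA alphabets
    best = 0
    for frame in range(3):
        start = None  # position of the first ATG of the ORF currently open
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                best = max(best, i + 3 - start)
                start = None
    return best


def candidate_features(seq: str):
    """Return a feature dict for sequences above the length threshold, else None."""
    if len(seq) <= MIN_LNCRNA_LENGTH:
        return None
    orf = longest_orf_length(seq)
    return {"length": len(seq), "longest_orf": orf, "orf_fraction": orf / len(seq)}
```

Feature dictionaries of this kind could then be turned into numeric vectors for any SVM implementation; which additional features are informative for plants is a question the abstract leaves open.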

    Computational analysis of multi-omic data for the elucidation of molecular mechanisms of neuroblastoma

    Doctor Scientiae. Neuroblastoma is the most common extracranial solid tumor in childhood. The survival rates of patients with neuroblastoma, especially those in the high-risk category, are still low despite varied therapies. A detailed understanding of the molecular mechanisms underlying the pathogenesis of neuroblastoma is essential to develop better therapeutics and improve the poor survival rates. This study provides a multi-omic analysis of neuroblastoma datasets from the Therapeutically Applicable Research to Generate Effective Treatments (TARGET) neuroblastoma project and the Gene Expression Omnibus (GEO) data portals to better understand the molecular mechanisms of neuroblastoma.

    Transcriptomic investigation of the Wolbachia symbiosis in larval stages of Brugia malayi

    The Wolbachia genus of bacteria comprises obligate intracellular endosymbionts that are known to infect arthropods and nematodes. Most filarial nematodes that infect humans maintain Wolbachia endosymbionts in a mutualistic association that is essential for nematode development, reproduction and the longevity of the adult parasites. As a result, much research has gone into investigating Wolbachia’s role in adult nematodes, both to understand the basis of the mutualistic relationship and to exploit the endosymbiont as a target for treatment. Less attention has been paid to understanding Wolbachia’s role in the biology of the larval stages of filarial parasites. To better characterise Wolbachia’s roles during these larval stages, RNA-sequencing technologies were employed to investigate the relationship between the parasitic filarial nematode Brugia malayi and its Wolbachia endosymbiont during larval development and microfilarial transmission. This first involved the development of a manually curated, revised annotation of the Wolbachia genome using gene expression data, further corroborated by RT-qPCR and proteomics experiments. Second, the transcriptomes of both nematode and Wolbachia were investigated across two major nematode developmental stages: the two weeks immediately after nematode infection of the mammalian host, spanning the L3 to L4 developmental moult, and the period following Wolbachia depletion from B. malayi microfilariae during transmission to the mosquito vector. The reannotation of the Wolbachia endosymbiont genome resulted in the identification of 21 new protein-coding genes, 5 instances of non-model translational events, and 3 functional RNAs. Several newly identified genes were predicted to be unique to the Wolbachia genus, with a potential role in Wolbachia-nematode interactions.
The transcriptome of the developing L3 to L4 stages demonstrates Wolbachia’s ability to exert coordinated control over its carbon metabolism to enable rapid population growth. The consistent upregulation of pathways such as haem biosynthesis, nucleotide biosynthesis and Type IV secretion systems complements the nematode host transcriptome, which was focused predominantly on its own growth and development, as well as on regulating the Wolbachia population during the L4 stage. B. malayi microfilariae depleted of Wolbachia have a significantly reduced ability to infect the mosquito vector, with transcriptome analysis of treated and untreated nematodes identifying targeted downregulation of chitinase and V-type ATPase transcripts in the treated group. These targeted changes likely play a role in the nematode’s ability to successfully penetrate the vector’s midgut or achieve exsheathment. Taken together, these observations illustrate the complex and dynamic relationship that Wolbachia has with its nematode host, one that extends beyond that of a mutualist important for adult parasite longevity and reproduction.

    Developing a bioinformatics framework for proteogenomics

    In the last 15 years, since the human genome was first sequenced, genome sequencing and annotation have continued to improve. However, genome annotation has not kept up with the accelerating rate of genome sequencing, and as a result there is now a large backlog of genomic data waiting to be interpreted both quickly and accurately. Through advances in proteomics, a new field has emerged to help improve genome annotation. Termed proteogenomics, it uses peptide mass spectrometry data to enable the discovery of novel protein-coding genes, as well as the refinement and validation of known and putative protein-coding genes. The annotation of genomes relies heavily on ab initio gene prediction programs and/or mapping of a range of RNA transcripts. Although this method provides insights into the gene content of genomes, it is unable to distinguish protein-coding genes from putative non-coding RNA genes. This problem is further confounded by the fact that only 5% of the public protein sequence repository at UniProt/SwissProt has been curated and derived from actual protein evidence. This thesis contends that it is critically important to incorporate proteomics data into genome annotation pipelines to provide experimental protein-coding evidence. Although there have been major improvements in proteogenomics over the last decade, there are still numerous challenges to overcome. These key challenges include the loss of sensitivity when using inflated search spaces of putative sequences, how best to interpret novel identifications and how best to control for false discoveries. This thesis addresses the existing gap between the use of genomic and proteomic sources for accurate genome annotation by applying a proteogenomics approach with a customised methodology. This new approach was applied in four case studies: a prokaryote (a bacterium); a monocotyledonous plant (wheat); a dicotyledonous plant (grape); and human.
The key contributions of this thesis are: a new methodology for proteogenomics analysis; 145 suggested gene refinements in Bradyrhizobium diazoefficiens (a nitrogen-fixing bacterium); 55 new gene predictions (57 protein isoforms) in Vitis vinifera (grape); 49 new gene predictions (52 protein isoforms) in Homo sapiens (human); and 67 new gene predictions (70 protein isoforms) in Triticum aestivum (bread wheat). Lastly, a number of possible improvements to the studies conducted in this thesis, and to proteogenomics as a whole, are identified and discussed.

    Novel Algorithm Development for ‘Next Generation’ Sequencing Data Analysis

    In recent years, the decreasing cost of ‘Next generation’ sequencing (NGS) has spawned numerous applications for interrogating whole genomes and transcriptomes in research, diagnostic and forensic settings. While the innovations in sequencing have been explosive, the development of scalable and robust bioinformatics software and algorithms for analysing the new types of data generated by these technologies has struggled to keep up. As a result, large volumes of NGS data available in public repositories are severely underutilised, despite providing a rich resource for data mining applications. Indeed, the bottleneck in genome and transcriptome sequencing experiments has shifted from data generation to bioinformatics analysis and interpretation. This thesis focuses on the development of novel bioinformatics software to bridge the gap between data availability and interpretation. The work is split between two core topics: computational prioritisation and identification of disease gene variants, and identification of RNA N6-adenosine methylation from sequencing data. The first chapter briefly discusses the emergence and establishment of NGS technology as a core tool in biology, along with its current applications and perspectives. Chapter 2 introduces the problem of variant prioritisation in the context of Mendelian disease, where tens of thousands of potential candidates are generated by a typical sequencing experiment. Chapter 3 describes novel software for candidate gene prioritisation that utilises data mining of tissue-specific gene expression profiles. Chapter 4 investigates an alternative approach to candidate variant prioritisation that leverages functional and phenotypic descriptions of genes and diseases from multiple biomedical domain ontologies. Chapter 5 discusses N6-adenosine methylation, a recently rediscovered post-transcriptional modification of RNA.
The core of the chapter describes novel software developed for transcriptome-wide detection of this epitranscriptomic mark from sequencing data. Chapter 6 presents a case study application of the software, reporting the previously uncharacterised RNA methylome of Kaposi’s Sarcoma Herpes Virus. The chapter further discusses a putative novel N6-methyladenosine RNA-binding protein and its possible roles in the progression of viral infection.

    Computational methods for the quantification of RNA transcript abundance and messenger RNA regulation

    Experiments investigating the regulation of RNA transcripts have been revolutionised by technology developed over the last 40 years. The data acquired from these experiments have revealed novel regulatory mechanisms for the localisation, degradation and modification of RNA transcripts. However, the volume and complexity of the data sets have led to an unprecedented reliance on statistical software. Inadequate analysis of data sets is contributing to the ongoing crisis around reproducing conclusions from published research. Rigorous implementation of statistical analysis software can continue to uncover novel regulatory mechanisms, but closed, obscure, and incorrect analyses will propagate the reproducibility crisis to unassailable new heights. The objective of this research project is to develop open-source software and implement reproducible analyses to enable further exploration of regulatory mechanisms acting on RNA transcripts. This thesis focuses on the analysis of transcriptomics data sets, predominantly from the model organism Saccharomyces cerevisiae. The first project discusses the standardisation of the analysis of qPCR data. The chapter compares the R package tidyqpcr, developed by the author, to other currently available software. This case highlights how quality software supported by comprehensive documentation can improve the quality of an entire experimental assay. The next chapter showcases how the implementation of quality analysis can detect subtle interactions between regulatory motifs. The design of several reporter constructs using insights from published data sets shows how even short regulatory motifs can be affected by their overall context. The final results chapter outlines the development of a statistical software package to rigorously analyse noisy transcriptomic data from RNA-Seq assays exploring RNA localisation.
The statistical software package uses a Bayesian hierarchical model of fractionation-based assays to overcome common biases in RNA-Seq data sets. In summary, this thesis presents and implements two examples of research software that improve the reproducibility and quality of conclusions from data acquired from common experimental assays in molecular biology. The thesis also outlines how to implement open-source development practices and create inclusive documentation in an academic setting. Software developed within this framework is then used to elucidate subtle ways that cells regulate their transcriptome.

    Dissecting the spatial structure of overlapping transcription in budding yeast

    Thesis (Ph. D.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2010. Cataloged from PDF version of thesis. Includes bibliographical references (p. 95-102). This thesis presents a computational and algorithmic method for the analysis of high-resolution transcription data in the budding yeast Saccharomyces cerevisiae. We begin by describing a computational system for storing and retrieving spatially-mapped genomic data. This system forms the infrastructure for a novel algorithmic approach to detect and recover instances of same-strand overlapping transcripts in high-resolution expression experiments. We then apply these algorithms to a set of transcription experiments in budding yeast, Saccharomyces cerevisiae, in order to identify potential sites of same-strand overlapping transcripts that may be involved in novel forms of transcriptional regulation. By Timothy Danford. Ph.D.
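
To make the notion of same-strand overlap concrete, here is a minimal sketch that reports pairs of transcripts on the same chromosome and strand whose coordinates overlap. The `Transcript` record, the half-open coordinate convention and the function name `same_strand_overlaps` are assumptions made for illustration. This is a generic sort-and-sweep over already-annotated intervals, not the detection algorithm developed in the thesis, which works from high-resolution expression data.

```python
from collections import defaultdict
from typing import NamedTuple


class Transcript(NamedTuple):
    name: str
    chrom: str
    strand: str  # "+" or "-"
    start: int   # 0-based, inclusive (assumed convention)
    end: int     # exclusive


def same_strand_overlaps(transcripts):
    """Return sorted (name_a, name_b) pairs that overlap on the same
    chromosome and strand."""
    # Only transcripts sharing chromosome and strand can form a pair.
    groups = defaultdict(list)
    for t in transcripts:
        groups[(t.chrom, t.strand)].append(t)

    pairs = []
    for group in groups.values():
        group.sort(key=lambda t: t.start)
        active = []  # earlier transcripts that may still overlap later ones
        for t in group:
            # Discard transcripts that end at or before this start position.
            active = [a for a in active if a.end > t.start]
            pairs.extend(tuple(sorted((a.name, t.name))) for a in active)
            active.append(t)
    return sorted(pairs)
```

Grouping by chromosome and strand before sweeping keeps opposite-strand (antisense) pairs out of the result, which is the distinction the abstract draws by specifying same-strand overlap.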