87 research outputs found

    A new computational framework for the classification and function prediction of long non-coding RNAs

    Long non-coding RNAs (lncRNAs) are known to play a significant role in several biological processes. These RNAs have sequence lengths greater than 200 base pairs (bp), and so are often misclassified as protein-coding genes. Most Coding Potential Computation (CPC) tools fail to accurately identify, classify and predict the biological functions of lncRNAs in plant genomes, because previous research has been limited to mammalian genomes. In this thesis, various sequence and codon-bias features for the identification of lncRNA sequences have been investigated and extracted in order to develop a new CPC framework. To identify essential features, the framework implements regularisation-based selection. A novel classification algorithm is implemented, which removes the dependency on experimental datasets and provides a coordinate-based solution for the sub-classification of lncRNAs. To impute lncRNA functions, lncRNA-protein interactions were first determined through co-expression of genes and then re-analysed with a sequence similarity-based approach to identify novel interactions and predict lncRNA functions in the genome. The framework integrates a D3-based application for visualising lncRNA sequences and their associated functions in the genome. Standard evaluation metrics such as accuracy, sensitivity, and specificity have been used to benchmark the performance of the framework against leading CPC tools. Case study analyses were conducted with plant RNA-seq datasets to evaluate the effectiveness of the framework using a cross-validation approach. The tests show that the framework can provide significant improvements over existing CPC models for plant genomes: 20-40% greater accuracy. Function prediction analysis demonstrates that the results are consistent with published experimental findings.
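The abstract above names accuracy, sensitivity, and specificity as its benchmarking metrics. As a minimal sketch of how these are commonly computed for a binary coding/non-coding classifier, the following may help; the function name `classification_metrics`, the label convention (1 = lncRNA, 0 = protein-coding), and the toy labels are illustrative assumptions, not taken from the thesis.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, sensitivity and specificity for binary labels.

    Assumed convention (hypothetical): 1 = lncRNA, 0 = protein-coding.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

    total = tp + tn + fp + fn
    nan = float("nan")
    return {
        "accuracy": (tp + tn) / total if total else nan,
        "sensitivity": tp / (tp + fn) if (tp + fn) else nan,  # recall on lncRNAs
        "specificity": tn / (tn + fp) if (tn + fp) else nan,  # recall on coding genes
    }


# Toy labels, invented purely for illustration.
example_true = [1, 1, 1, 0, 0, 0, 0, 1]
example_pred = [1, 0, 1, 0, 0, 1, 0, 1]
example_metrics = classification_metrics(example_true, example_pred)
print(example_metrics)
```

In a cross-validation setting such as the one the abstract mentions, these values would typically be computed per fold and then averaged.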

    DEVELOPMENT OF BIOINFORMATICS TOOLS AND ALGORITHMS FOR IDENTIFYING PATHWAY REGULATORS, INFERRING GENE REGULATORY RELATIONSHIPS AND VISUALIZING GENE EXPRESSION DATA

    In the era of genetics and genomics, the advent of big data is transforming the field of biology into a data-intensive discipline. Novel computational algorithms and software tools are in demand to address the data analysis challenges in this growing field. This dissertation comprises the development of a novel algorithm, web-based data analysis tools, and a data visualization platform. The Triple Gene Mutual Interaction (TGMI) algorithm, presented in Chapter 2, is an innovative approach for identifying key regulatory transcription factors (TFs) that govern a particular biological pathway or process through the interaction among three genes in a triple gene block, which consists of a pair of pathway genes and a TF. Identifying the key TFs that control a biological pathway or process allows biologists to understand the complex regulatory mechanisms in living organisms. TF-Miner, presented in Chapter 3, is a high-throughput gene expression data analysis web application that was developed by integrating two highly efficient algorithms: TF-Cluster and TF-Finder. TF-Cluster can be used to obtain collaborative TFs that coordinately control a biological pathway or process using genome-wide expression data. TF-Finder, on the other hand, can identify regulatory TFs involved in or associated with a specific biological pathway or process using Adaptive Sparse Canonical Correlation Analysis (ASCCA). Chapter 4 presents ExactSearch, a suffix-tree-based motif search algorithm implemented in a web-based tool. This tool can identify the locations of a set of motif sequences in a set of target promoter sequences. ExactSearch also provides the functionality to search for a set of motif sequences in flanking regions from 50 plant genomes, which we have incorporated into the web tool. Chapter 5 presents STTM JBrowse, a web-based RNA-Seq data visualization system built using the JBrowse open source platform. STTM JBrowse is a unified repository for sharing and producing visualizations created from large RNA-Seq datasets generated from a variety of model and crop plants in which miRNAs were destroyed using Short Tandem Target Mimic (STTM) technology.
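To make the motif search task described for ExactSearch concrete (locating a set of motif sequences in a set of promoter sequences), here is a small sketch. It deliberately uses a plain repeated substring scan rather than a suffix tree, so it illustrates the input and output of the task, not the dissertation's algorithm. The function name `find_motif_positions`, the gene names, and the sequences are all hypothetical.

```python
def find_motif_positions(motifs, promoters):
    """Return 0-based start positions of each motif in each promoter.

    Naive stand-in for a suffix-tree search: str.find is called repeatedly,
    advancing by one base so that overlapping occurrences are reported.
    """
    hits = {}
    for name, seq in promoters.items():
        seq_upper = seq.upper()
        for motif in motifs:
            query = motif.upper()
            if not query:
                continue  # skip empty motifs
            positions = []
            start = seq_upper.find(query)
            while start != -1:
                positions.append(start)
                start = seq_upper.find(query, start + 1)
            if positions:
                hits[(name, motif)] = positions
    return hits


# Invented promoter sequences and motifs, for illustration only.
example_promoters = {"geneA": "ACGTGACGTGTT", "geneB": "TTTTCACGTG"}
example_motifs = ["ACGTG", "CACGTG"]
example_hits = find_motif_positions(example_motifs, example_promoters)
print(example_hits)
```

A suffix tree built once over the promoter set would answer each motif query in time proportional to the motif length plus the number of hits, which is presumably why the dissertation chose that structure for searches across many genomes.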

    PANOMICS meets germplasm

    Genotyping-by-sequencing has enabled approaches for genomic selection to improve yield, stress resistance and nutritional value. More and more resource studies are emerging that provide 1,000 or more genotypes and millions of SNPs for one species, covering hitherto inaccessible intraspecific genetic variation. The larger these databases grow, the better the statistical approaches for genomic selection that will become available. However, there are clear limitations on both the statistical and the biological side. Intraspecific genetic variation is able to explain a high proportion of the phenotypes, but a large part of phenotypic plasticity also stems from environmentally driven transcriptional, post-transcriptional, translational, post-translational, epigenetic and metabolic regulation. Moreover, regulation of the same gene can have different phenotypic outputs in different environments. Consequently, to explain and understand environment-dependent phenotypic plasticity based on the available genotype variation, we have to integrate the analysis of further molecular levels reflecting the complete information flow from the gene to metabolism to phenotype. Interestingly, metabolomics platforms are already more cost-effective than NGS platforms and are decisive for the prediction of nutritional value or stress resistance. Here, we propose three fundamental pillars for future breeding strategies in the framework of Green Systems Biology: (i) combining genome selection with environment-dependent PANOMICS analysis and deep learning to improve prediction accuracy for marker-dependent trait performance; (ii) PANOMICS resolution at subtissue, cellular and subcellular level, which provides information about fundamental functions of selected markers; (iii) combining PANOMICS with genome editing and speed breeding tools to accelerate and enhance large-scale functional validation of trait-specific precision breeding.

    Regularized modelling of dependencies between gene expression and metabolomics data in the study of metabolic regulation

    Fusing different high-throughput data sources is an effective way to reveal the functions of unknown genes, as well as regulatory relationships between biological components such as genes and metabolites. Dependencies between biological components functioning in different layers of biological regulation can be investigated using canonical correlation analysis (CCA). However, the properties of high-throughput bioinformatics data pose many challenges for data analysis: the sample size is often insufficient compared to the dimensionality of the data, and the data exhibit multicollinearity due to, for example, co-expressed and co-regulated genes. Therefore, a regularized version of classical CCA is often adopted. An alternative way of introducing regularization to statistical models is to perform Bayesian data analysis with suitable priors. In this thesis, the performance of a new variant of Bayesian CCA, group-wise sparse CCA (gsCCA), is compared to that of a classical ridge-regression-regularized CCA (rrCCA) in revealing relevant information shared between two high-throughput data sets. The gsCCA produces a regularization effect partly similar to that of rrCCA but, in addition, introduces a new type of regularization to the data covariance matrices. Both CCA methods are applied to gene expression and metabolite concentration measurements obtained from an oxidative-stress-tolerant Arabidopsis thaliana ecotype, Col-0, and an oxidative-stress-sensitive mutant, rcd1, as time series under ozone exposure and in a control condition. The aim of this work is to reveal new regulatory mechanisms in oxidative stress signalling in plants. For both methods, rrCCA and gsCCA, the thesis illustrates their potential to reveal both already known and possibly new regulatory mechanisms between genes and metabolites in Arabidopsis thaliana oxidative stress signalling.
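The ridge-regularized CCA mentioned in the abstract above can be sketched in a few lines of NumPy. This is a generic textbook formulation (a ridge term added to each within-set covariance matrix, followed by an SVD of the whitened cross-covariance), not the thesis's implementation; the function names, the penalty values `lam_x` and `lam_y`, and the synthetic data are assumptions made for illustration. The Bayesian gsCCA variant is not covered here.

```python
import numpy as np


def _inv_sqrt(mat):
    """Inverse matrix square root of a symmetric positive-definite matrix."""
    vals, vecs = np.linalg.eigh(mat)
    return vecs @ np.diag(1.0 / np.sqrt(vals)) @ vecs.T


def ridge_cca(X, Y, lam_x=0.1, lam_y=0.1, n_components=2):
    """Ridge-regularized CCA (generic sketch; penalties are assumed values).

    X: (n_samples, p) and Y: (n_samples, q). Returns weight matrices A (p, k),
    B (q, k) and the k leading regularized canonical correlations.
    """
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    n = X.shape[0]
    # The ridge term keeps the covariances invertible even when p, q > n.
    Cxx = X.T @ X / (n - 1) + lam_x * np.eye(X.shape[1])
    Cyy = Y.T @ Y / (n - 1) + lam_y * np.eye(Y.shape[1])
    Cxy = X.T @ Y / (n - 1)
    Wx, Wy = _inv_sqrt(Cxx), _inv_sqrt(Cyy)
    U, s, Vt = np.linalg.svd(Wx @ Cxy @ Wy, full_matrices=False)
    k = n_components
    return Wx @ U[:, :k], Wy @ Vt.T[:, :k], s[:k]


# Synthetic data with one shared latent signal and more features than samples.
rng = np.random.default_rng(0)
n_samples = 30
latent = rng.normal(size=(n_samples, 1))
X_demo = latent @ rng.normal(size=(1, 50)) + 0.1 * rng.normal(size=(n_samples, 50))
Y_demo = latent @ rng.normal(size=(1, 40)) + 0.1 * rng.normal(size=(n_samples, 40))
A_demo, B_demo, corr_demo = ridge_cca(X_demo, Y_demo)
print(corr_demo)
```

Without the ridge term, the sample covariance matrices in this example would be singular (50 and 40 features against 30 samples), which is the small-sample problem the abstract describes.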

    Deep learning suggests that gene expression is encoded in all parts of a co-evolving interacting gene regulatory structure

    Understanding the genetic regulatory code governing gene expression is an important challenge in molecular biology. However, how individual coding and non-coding regions of the gene regulatory structure interact and contribute to mRNA expression levels remains unclear. Here we apply deep learning on over 20,000 mRNA datasets to examine the genetic regulatory code controlling mRNA abundance in 7 model organisms ranging from bacteria to human. In all organisms, we can predict mRNA abundance directly from DNA sequence, with up to 82% of the variation of transcript levels encoded in the gene regulatory structure. By searching for DNA regulatory motifs across the gene regulatory structure, we discover that motif interactions could explain the whole dynamic range of mRNA levels. Co-evolution across coding and non-coding regions suggests that it is not single motifs or regions, but the entire gene regulatory structure and specific combination of regulatory elements that define gene expression levels.

    Systems Analytics and Integration of Big Omics Data

    A "genotype" is essentially an organism's full hereditary information, which is obtained from its parents. A "phenotype" is an organism's actual observed physical and behavioral properties. These may include traits such as morphology, size, height, eye color, metabolism, etc. One of the pressing challenges in computational and systems biology is genotype-to-phenotype prediction. This is challenging given the amount of data generated by modern Omics technologies. This "Big Data" is so large and complex that traditional data processing applications are not up to the task. Challenges arise in collection, analysis, mining, sharing, transfer, visualization, archiving, and integration of these data. In this Special Issue, there is a focus on the systems-level analysis of Omics data, recent developments in gene ontology annotation, and advances in biological pathways and network biology. The integration of Omics data with clinical and biomedical data using machine learning is explored. This Special Issue covers new methodologies in the context of gene–environment interactions, tissue-specific gene expression, and how external factors or host genetics impact the microbiome.

    Analysis of the determinants of Pol II pausing

    Pausing of transcribing RNA polymerase II (Pol II) has emerged as a general feature of gene expression in human cells. Many transcription factors, DNA sequences and chromatin characteristics have been implicated in inducing transcriptional pausing. However, the relative contributions of these factors to the observed Pol II pausing remain unclear. Furthermore, research in metazoans has mainly focused on Pol II promoter-proximal pausing, leaving the causes of pausing outside of this region unknown. To reliably detect real transcriptional pausing sites and advance the understanding of the causes of this phenomenon, we developed a pausing detection algorithm for nucleotide-resolution Pol II occupancy data. We scrutinized the characteristics and potential shortcomings of Native Elongating Transcript sequencing (NET-seq), which is one of the high-resolution methods of Pol II profiling, and we used our observations to improve the NET-seq processing pipeline. Leveraging the improved processing pipeline and the developed pausing detection algorithm revealed widespread, genome-wide Pol II pausing at nucleotide resolution in human cells. Next, we set out to identify the determinants of Pol II pausing in an unbiased manner based on the underlying DNA sequence. To predict the predisposition of a genomic site to evoke Pol II pausing, we applied a range of machine learning approaches using previously identified high-confidence pausing sites. For each of the sites, we created a large number of features, including both factors that were previously linked to transcriptional pausing and factors that had not yet been implicated in evoking pausing. Our analysis revealed DNA sequence properties underlying widespread Pol II pausing, including a new pausing motif. Interestingly, key sequence determinants of RNA polymerase pausing are shared by human cells and bacteria. Our study indicates that transcriptional pausing in human cells is sequence-induced and that the determinants of Pol II pausing might be evolutionarily conserved.

    Statistical tools for variable selection and the integration of "omics" data

    Recent advances in biotechnology allow the monitoring of large quantities of biological data of various types (genomic, proteomic, metabolomic, phenotypic) that are often characterized by a small number of samples or observations. The aim of this thesis was to develop, or adapt, appropriate statistical methodologies to analyse high-dimensional data, and to present efficient tools to biologists for selecting the most biologically relevant variables. In the first part, we focus on microarray data in a classification framework, and on the selection of discriminative genes. In the second part, in the context of data integration, we focus on the selection of different types of variables with two-block omics data. Firstly, we propose a wrapper method, which aggregates two classifiers (CART and SVM) to select discriminative genes for binary or multiclass biological conditions. Secondly, we develop a PLS variant called sparse PLS that uses l1 penalization and allows for the selection of subsets of variables that are measured on the same biological samples. Either a regression or a canonical analysis framework is proposed, depending on the biological question. We assess each of the proposed approaches by comparing them to similar methods known in the literature on numerous real data sets. The statistical criteria that we use are often limited by the small number of samples. We therefore always try to combine statistical assessments with a thorough biological interpretation of the results. The approaches that we propose are easy to apply and give relevant results that answer the biologists' needs.

    Dynamical Modeling Techniques for Biological Time Series Data

    The present thesis is organized around two main topics, which have in common the modeling of the dynamical properties of complex biological systems from large-scale time-series data. On the one hand, this thesis analyzes the inverse problem of reconstructing Gene Regulatory Networks (GRNs) from gene expression data. This first topic seeks to reverse-engineer the transcriptional regulatory mechanisms involved in a few biological systems of interest, which is vital to understanding the specificities of their different responses. In the light of recent mathematical developments, a novel, flexible and interpretable modeling strategy is proposed to reconstruct the dynamical dependencies between genes from short time-series data. In addition, experimental trade-offs and optimal modeling strategies are investigated for given data availability. Consistent literature on these topics was previously surprisingly lacking. The proposed methodology is applied to the study of circadian rhythms, which consist of complex GRNs driving most daily biological activity across many species. On the other hand, this manuscript covers the characterization of dynamically differentiable brain states in zebrafish in the context of epilepsy and epileptogenesis. Zebrafish larvae represent a valuable animal model for the study of epilepsy due to both their genetic and dynamical resemblance to humans. The fundamental premise of this research is the early appearance of subtle functional changes preceding the clinical symptoms of seizures. More generally, this idea, based on bifurcation theory, can be described as a progressive loss of resilience of the brain and, ultimately, its transition from a healthy state to another state characterizing the disease. First, the morphological signatures of seizures generated by distinct pathological mechanisms are investigated. For this purpose, a range of mathematical biomarkers that characterize relevant dynamical aspects of the neurophysiological signals is considered. Such mathematical markers are later used to address the subtle manifestations of early epileptogenic activity. Finally, the feasibility of a probabilistic prediction model that indicates the susceptibility of seizure emergence over time is investigated. The existence of alternative stable system states and their sudden and dramatic changes have notably been observed in a wide range of complex systems, such as ecosystems, climate or financial markets.