14 research outputs found

    Towards structural models for the Ebola UTR regions using experimental SHAPE probing data

    National audience. Next-Generation Sequencing (NGS) technologies have opened new perspectives for refining the prediction of the secondary structure(s) of structured non-coding RNAs. Herein, we describe an integrated modeling strategy, based on SHAPE chemistry, to infer structural insight from deep sequencing data. Our approach is based on pseudo-energy minimization, incorporating additional information from evolutionary data (compensatory mutations) and SHAPE experiments (reactivity scores) within an iterative procedure. Preliminary results reveal conserved structures within the UTRs of the Ebola genome that are both thermodynamically stable and strongly supported by SHAPE accessibility analysis.
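As a rough illustration of how SHAPE reactivity scores can enter a pseudo-energy minimization, the sketch below uses the widely cited log-linear conversion of Deigan et al. (2009). The abstract does not say which conversion was used, so the formula, the function name `shape_pseudo_energy` and the default slope and intercept (2.6 and -0.8 kcal/mol) are assumptions made for illustration only.

```python
import math

def shape_pseudo_energy(reactivity, m=2.6, b=-0.8):
    """Pseudo-energy (kcal/mol) added when a nucleotide is placed in a stacked pair.

    Assumed model: dG = m * ln(reactivity + 1) + b.
    Missing or negative reactivities carry no information and contribute 0.
    """
    if reactivity is None or reactivity < 0:
        return 0.0
    return m * math.log(reactivity + 1.0) + b

# Hypothetical reactivity profile (None marks a position without data).
profile = [0.05, 0.9, None, 2.1]
penalties = [shape_pseudo_energy(r) for r in profile]
print(penalties)
```

Under this convention a low reactivity yields a negative term (a bonus for pairing the nucleotide), while a high reactivity yields a positive penalty.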

    Analyse différentielle de données de sondage pour la prédiction des structures d'acides ribonucléiques

    In structural bioinformatics, predicting the secondary structure(s) of ribonucleic acids (RNAs) represents a major direction of research to understand cellular mechanisms. A classic approach for structure prediction postulates that, at thermodynamic equilibrium, RNA adopts its various conformations according to a Boltzmann distribution based on their free energy. Modern approaches, therefore, favor the consideration of the dominant conformations. Such approaches are limited in accuracy due to the imprecision of the energy model and the structure topology restrictions. Experimental data can be used to circumvent the shortcomings of predictive computational methods. RNA probing encompasses a wide array of experimental protocols dedicated to revealing partial structural information through exposure to a chemical or enzymatic reagent, whose effect depends on, and thus reveals, features of the adopted structure(s). Accordingly, single-reagent probing data is used to supplement free-energy models within computational methods, leading to significant gains in prediction accuracy. In practice, however, structural biologists integrate probing data produced in various experimental conditions, using different reagents or over a collection of mutated sequences, to model RNA structure(s). This integrative approach remains manual, time-consuming and arguably subjective in its modeling principles. In this Ph.D. work, we contributed in silico methods for the automated modeling of RNA structure(s) from multiple sources of probing data. We first established automated pipelines for the acquisition of reactivity profiles from primary data produced through a variety of protocols (SHAPE, DMS using Capillary Electrophoresis, SHAPE-Map/Ion Torrent). We then designed and implemented a new, versatile method that simultaneously integrates multiple probing profiles. 
Based on a combination of Boltzmann sampling and structural clustering, it produces alternative stable conformations jointly supported by a set of probing experiments. As it favors recurrent structures, our method allows exploiting the complementarity of several probing assays. The quality of predictions produced using our method compared favorably against state-of-the-art computational predictive methods on single-probing assays. Our method was used to identify models for structured regions in RNA viruses. In collaboration with experimental partners, we suggested a refined structure of the HIV-1 Gag IRES, showing good compatibility with chemical and enzymatic probing data. The predicted structure allowed us to build hypotheses on binding sites that are functionally relevant to translation. We also proposed conserved structures in Ebola untranslated regions, showing high consistency with both SHAPE probing and evolutionary data. Our modeling allows us to detect a conserved and stable stem-loop at the 5′ end of each UTR, a typical structure found in viral genomes to protect the RNA from being degraded by nucleases. Our method was extended to the analysis of sequence variants. We analyzed a collection of DMS-probed mutants, produced by the Mutate-and-Map protocol, leading to better structural models for the GIR1 lariat-capping ribozyme than from the wild-type sequence alone. To avoid systematic production of point-wise mutants, and exploit the recent SHAPEMap protocol, we designed an experimental protocol based on undirected mutagenesis and sequencing, where several mutated RNAs are produced and simultaneously probed. Produced reads must then be re-assigned to mutants to establish their reactivity profiles, which are later used for structure modeling. The assignment problem was modeled as a joint maximum-likelihood inference of mutational profiles and assignments, and solved using an instance of the "Expectation-Maximization" algorithm. 
Preliminary results on a reduced/simulated sample of reads showed a marked decrease in read-assignment errors compared with a classic algorithm.
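The Boltzmann assumption stated in this abstract can be made concrete with a small sketch: given the free energies of a few candidate conformations, each one receives a probability proportional to exp(-E/RT). The function name, the example energies and the choice of 37 °C are illustrative assumptions, not values taken from the thesis.

```python
import math

# Gas constant (kcal/(mol*K)) times temperature (310.15 K, i.e. 37 degrees C).
RT = 0.0019872 * 310.15

def boltzmann_probabilities(energies, rt=RT):
    """Return the Boltzmann probability of each conformation.

    energies: free energies in kcal/mol (lower means more stable).
    Energies are shifted by their minimum before exponentiation for numerical stability.
    """
    e_min = min(energies)
    weights = [math.exp(-(e - e_min) / rt) for e in energies]
    partition = sum(weights)
    return [w / partition for w in weights]

# Hypothetical free energies of three alternative structures of one RNA.
probs = boltzmann_probabilities([-10.0, -9.0, -5.0])
print(probs)
```

The most stable conformation dominates, but near-optimal alternatives keep a non-negligible share of the ensemble, which is why sampling several conformations can be more informative than reporting a single minimum.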

    An integrative approach for predicting the RNA secondary structure for the HIV–1 Gag UTR using probing data

    National audience. Structure modeling is key to understanding the mechanisms of RNA retroviruses such as HIV. Many in silico prediction approaches that produce structural models of moderate to good accuracy are available. However, prediction methods could be further improved by taking advantage of both next-generation sequencing technologies and different experimental techniques, such as enzymatic and SHAPE probing [1]. In a published article [2], we introduce and use a structural modeling method that integrates multiple sources of experimental probing data to direct predictions, with the aim of finding the most accurate structure lying in the intersection of disjoint sources of experiments.

    Clustering strings with mutations using an expectation-maximization algorithm in the context of RNA structure prediction

    International audience. In comparative analysis, an RNA structure (a set of base pairs and unpaired nucleotides) is predicted from a set of RNA variants (similar sequences) under the assumption that the structure is conserved during evolution. Combining RNA variants with experimental data informing about the local (nucleotide-level) structure may lead to more accurate structure prediction. The experimental protocol consists of mutating nucleotides likely to be 'unpaired'. Simultaneously reading the sequences of RNA variants that underwent the experimental mutation protocol leads to the following issue: how can 'mutated' substrings of similar parent strings be clustered such that each substring is correctly assigned to its parent string? We developed an Expectation-Maximization algorithm that uses mutational profiles (mutation distributions) to assign the substrings to their strings of origin.
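The assignment idea can be sketched with a small mixture-model EM. The abstract does not give the model details, so everything below is an assumption made for illustration: reads are full-length and already aligned to parents of equal length, a mutated position is equally likely to show any of the three other bases, and the names `em_assign`, `init_rate` and `pseudo` are hypothetical.

```python
import math

def em_assign(reads, parents, n_iter=20, init_rate=0.05, pseudo=1.0):
    """Assign each read to a parent string while learning per-position mutation rates."""
    K, L = len(parents), len(parents[0])
    mu = [[init_rate] * L for _ in range(K)]  # mutational profile of each parent
    pi = [1.0 / K] * K                        # mixing proportion of each parent
    resp = []
    for _ in range(n_iter):
        # E-step: posterior probability that each read derives from each parent.
        resp = []
        for read in reads:
            logp = []
            for k in range(K):
                lp = math.log(pi[k])
                for i, base in enumerate(read):
                    if base == parents[k][i]:
                        lp += math.log(1.0 - mu[k][i])
                    else:
                        lp += math.log(mu[k][i] / 3.0)
                logp.append(lp)
            top = max(logp)
            weights = [math.exp(lp - top) for lp in logp]
            total = sum(weights)
            resp.append([w / total for w in weights])
        # M-step: re-estimate proportions and profiles from the soft assignments.
        for k in range(K):
            nk = sum(r[k] for r in resp)
            pi[k] = (nk + pseudo) / (len(reads) + K * pseudo)
            for i in range(L):
                mism = sum(r[k] for r, read in zip(resp, reads)
                           if read[i] != parents[k][i])
                # The pseudocount keeps every rate strictly between 0 and 1.
                mu[k][i] = (mism + pseudo * init_rate) / (nk + pseudo)
    assignments = [max(range(K), key=lambda k: r[k]) for r in resp]
    return assignments, mu, pi

parents = ["ACGUACGU", "UGCAUGCA"]
reads = ["ACGUACGU", "ACGAACGU", "UGCAUGCA", "UGCAUGGA"]
assignments, mu, pi = em_assign(reads, parents)
print(assignments)
```

The learned rates in `mu` play the role of the mutational profiles: positions that mutate often in the reads assigned to a parent end up with a higher rate, so later mismatches at those positions are penalized less.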

    Efficient machine learning model to predict dynamic viscosity in phosphoric acid production

    The rheological behavior of the phosphoric acid slurry during the production process strongly depends on its dynamic viscosity. Controlling this property limits P2O5 losses, minimizes energy consumption and ensures optimal flow conditions. Thus, reliable simulation tools predicting viscosity are needed for analysis and process optimization. To this end, three machine learning (ML) methods, namely a single-layer artificial neural network (ANN), gradient boosting (GB) and random forest (RF), were tested using 456 industrial measurements of dynamic viscosity at different solid contents, shear rates and temperatures. The performance of these models was evaluated and compared using several accuracy metrics. GB proved to be the best-performing model, with a coefficient of determination greater than 99% and a root mean squared error lower than 0.750 on both the training and validation datasets. Based on the importance of the explanatory variables, all models agree on the large effect of solid content on dynamic viscosity, followed by shear rate, then temperature. The GB relative partial dependence diagram made it possible to deduce operating intervals for the solid content of the pulp fed to the phosphoric acid reactor, leading to optimal flow of the suspension in the attack and maturation units.
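For readers unfamiliar with the two metrics quoted in this abstract, the sketch below shows how the coefficient of determination and the root mean squared error are computed from observed and predicted values. The numbers are made up for illustration and are not data from the study.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error between observed and predicted values."""
    n = len(y_true)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)

def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - residual sum of squares / total sum of squares."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# Hypothetical viscosity values (arbitrary units), not taken from the paper.
observed = [12.0, 15.5, 20.1, 31.4]
predicted = [12.3, 15.0, 20.6, 30.9]
print(rmse(observed, predicted), r2_score(observed, predicted))
```

Reporting both metrics on the training and validation sets, as the abstract does, helps reveal overfitting: a model that only memorizes the training data would show a clear gap between the two.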

    IPANEMAP: Integrative Probing Analysis of Nucleic Acids Empowered by Multiple Accessibility Profiles

    International audience. The manual production of reliable RNA structure models from chemical probing experiments benefits from the integration of information derived from multiple protocols and reagents. However, the interpretation of multiple probing profiles remains a complex task, hindering the quality and reproducibility of modeling efforts. We introduce IPANEMAP, the first automated method for the modeling of RNA structure from multiple probing reactivity profiles. Input profiles can result from experiments based on diverse protocols, reagents, or collections of variants, and are jointly analyzed to predict the dominant conformations of an RNA. IPANEMAP combines sampling, clustering, and multi-optimization to produce secondary structure models that are both stable and well supported by experimental evidence. The analysis of multiple reactivity profiles, both publicly available and produced in our study, demonstrates the good performance of IPANEMAP, even in a mono-probing setting. It confirms the potential of integrating multiple sources of probing data, informing the design of informative probing assays. Availability: IPANEMAP is freely downloadable at https://github.com/afafbioinfo/IPANEMAP Contact: [email protected] Supplementary information available at NAR online
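The abstract does not detail the multi-optimization step. One common way to select candidates under several criteria at once is a Pareto filter, sketched here under the assumption that each cluster is summarized by a tuple of scores where larger is better. The cluster names and score values are hypothetical, and this is not presented as IPANEMAP's actual implementation.

```python
def dominates(a, b):
    """True if a is at least as good as b on every criterion and strictly better on one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def pareto_front(scores):
    """Return the names of the entries that no other entry dominates."""
    return [name for name, s in scores.items()
            if not any(dominates(t, s)
                       for other, t in scores.items() if other != name)]

# Hypothetical per-cluster scores: (supporting conditions, Boltzmann weight, -mean distance).
clusters = {
    "c1": (3, 0.40, -5.0),
    "c2": (2, 0.55, -4.0),
    "c3": (2, 0.30, -9.0),
}
print(pareto_front(clusters))
```

A criterion that should be minimized, such as a mean intra-cluster distance, is negated so that all criteria point in the same direction.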

    Computational methods for comparing and integrating multiple probing assays to predict RNA secondary structure

    International audience.
    1 - Introduction: RNA structure is key to understanding the mechanisms of retroviruses such as HIV. Many prediction approaches suggesting accurate structures are available, but they could be further improved by taking advantage of both next-generation sequencing technology and new experimental techniques (enzymatic and SHAPE probing).
    2 - Experimental probing data: In this poster, we present an integrative approach that uses multiple sets of experimental data, resulting from sequencing, to direct predictions, with the aim of finding an accurate structure lying in the intersection of different sources of experiments. On the one hand, reactivity profiles at single-nucleotide resolution, resulting from SHAPE technology, were used as "soft constraints", meaning that the reactivity values were translated into pseudo-energies as described (Lorenz et al, 2016). On the other hand, RNase cleavage was used with two enzymes, V and T, targeting paired and unpaired nucleotides respectively. Reactivity scores resulting from those two experiments are used as hard constraints, forcing positions that exceed a specific threshold to be paired (Enzymatic-V) or unpaired (Enzymatic-T).
    3 - Stochastic sampling: At thermodynamic equilibrium, a given RNA can have many alternative structures, each characterized by a probability within the space of all possible conformations (Boltzmann ensemble). This probability is related to the energy of the structure: the higher the energy needed to break the pairs present in the structure, the higher its probability in the ensemble. We assume that the optimal structure(s) should be energetically stable and supported by several sources of experimental data. For this reason, we coupled stochastic sampling from the Boltzmann ensembles associated with the experimentally derived constraints with a clustering across experimental conditions, to generate structural models that are well supported by the available data.
    4 - Workflow description:
    1. Experimental data from different conditions (SHAPE, Enzymatic-T, Enzymatic-V) were analysed to extract reactivity profiles that serve as constraints.
    2. We sampled 2000 structures per condition. We performed Boltzmann sampling (Ding et al, 2005) to generate a predefined number of stable structures compatible with the constraints derived for each condition. We used the stochastic sampling mode of RNAsubopt (-p option) to generate energetically stable structures that are either fully compliant with constraints derived from enzymatic data (hard constraints (Mathews et al, 2004)), or constitute reasonable trade-offs between thermodynamic stability and compatibility with SHAPE data (soft constraints, using the pseudo-potentials of Deigan et al. (Deigan et al, 2009); see (Lorenz et al, 2016) for details).
    3. We merged the structures while keeping labels to retain the origin of each structure.
    4. In order to detect structures with affinity to each other, the merged sets of models were clustered, using the base-pair distance as a measure of dissimilarity: the distance between two structures corresponds to the number of base pairs that must be broken and formed to go from one structure to the other. A clustering algorithm (affinity propagation (Wang et al, 2007), implemented in the scikit-learn Python package (Pedregosa et al, 2011)) is used to agglomerate and identify recurrent structures. One of the advantages of affinity propagation resides in its low computational requirements.
    5. The next step consists in identifying clusters that are homogeneous, stable and well supported by experimental evidence, using the following objective criteria:
    - Represented conditions, which inform us about the diversity of the cluster: Our primary targets are clusters compatible with multiple experimental conditions. However, the larger sampled sets required for reproducibility tend to populate each cluster with structures from all conditions. We thus associated with each cluster the number of represented conditions, defined as the number of conditions for which the accumulated Boltzmann probability in the cluster exceeds a predefined threshold.
    - Boltzmann weight, a measure of stability: Structures that are found in a given cluster may be unstable, and should be treated as outliers. For this reason, we computed the cumulated normalized Boltzmann probabilities within the cluster, to favor stable clusters consisting of stable structures.
    - Average cluster distance, to account for coherence: We observed a general tendency of clustering algorithms to create heterogeneous clusters when faced with noisy data. We thus associated with each cluster the mean distance between pairs of structures, estimated as the average distance to the MEA structure (Lu et al, 2009) for the sake of efficiency, in order to neglect clusters that were too diverse.
    6. The next step consists in choosing cluster(s) with high coherence, diversity and stability. For this purpose, we restricted our analysis to clusters found on the 3D Pareto frontier (Mattson and Messac, 2005) with respect to the three criteria mentioned above.
    7. After detecting the optimal Pareto cluster(s), we need to identify a representative structure for each cluster. We chose the maximum expected accuracy (MEA) structure (Lu et al, 2009) as the representative structure for each cluster, defined as the secondary structure whose structural elements have the highest accumulated Boltzmann probability within the cluster.
    5 - Results: This resulted in 2 structures, which we narrowed down to a single candidate using compatibility with the 1M7 SHAPE data as a final discriminatory criterion.
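The base-pair distance used as the dissimilarity measure in the clustering step can be sketched as follows, assuming structures are given in plain dot-bracket notation without pseudoknots (an assumption, since the poster does not state the input format). The function names are illustrative.

```python
def base_pairs(dot_bracket):
    """Return the set of (i, j) base pairs encoded by a dot-bracket string."""
    stack, pairs = [], set()
    for i, ch in enumerate(dot_bracket):
        if ch == "(":
            stack.append(i)
        elif ch == ")":
            if not stack:
                raise ValueError("unbalanced closing bracket")
            pairs.add((stack.pop(), i))
    if stack:
        raise ValueError("unbalanced opening bracket")
    return pairs

def base_pair_distance(a, b):
    """Number of base pairs to break plus base pairs to form to turn a into b."""
    if len(a) != len(b):
        raise ValueError("structures must have the same length")
    # Pairs found in exactly one of the two structures.
    return len(base_pairs(a) ^ base_pairs(b))

print(base_pair_distance("((..))", "(....)"))
```

A matrix of such distances, negated to act as similarities, is the kind of precomputed input that a clustering method such as affinity propagation can work from.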

    An EM algorithm for mapping short reads in multiple RNA structure probing experiments

    International audience. An accurate mapping of reads against the reference sequence is the first step towards a good NGS data analysis. However, when mapping means assigning reads to a set of RNA variants, in the case of simultaneous sequencing, the task becomes hard to handle. Many algorithms have been developed to overcome the issue of mapping reads against a set of homologous sequences at one time, but the problem is not fully resolved, particularly when dealing with short reads. The issue addressed in our study is much more challenging: in addition to the parallel assignment issue in the presence of short reads, the RNA variant molecules used for the sequencing library preparation step undergo a specific experimental treatment, SHAPE, which causes mutations at structurally unpaired nucleotides. Mutations due to SHAPE might lead to mis-mapping, i.e. a read could be derived from a given RNA variant i and yet, because of SHAPE mutations, appear better assigned to the variant j to which the read has the shortest base distance. In ongoing work, we are trying to resolve this unprecedented mapping question through an Expectation-Maximization (EM) algorithm where each RNA variant from the set of references is characterized by a SHAPE mutational profile instead of being merely characterized by a sequence of nucleotides. The EM algorithm aims to maximize the likelihood that a read is derived from a specific RNA variant and to assess its contribution to building the mutational profile associated with that RNA.
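The mis-mapping problem described here can be reproduced with a toy version of the classic assignment rule, which sends each read to the reference with the shortest base distance. The sketch assumes full-length reads already aligned to equal-length variants; the sequences and function names are hypothetical.

```python
def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))

def nearest_variant(read, variants):
    """Index of the variant closest to the read (ties go to the first one)."""
    return min(range(len(variants)), key=lambda j: hamming(read, variants[j]))

# Two hypothetical variants that differ only at position 3 (0-based).
variants = ["ACGUACGU", "ACGAACGU"]
# A read derived from variant 0 that picked up two mutations (positions 3 and 6).
read = "ACGAACCU"
print(nearest_variant(read, variants))
```

The read is two mismatches away from its true origin but only one away from the other variant, so the distance-based rule assigns it to the wrong reference. This is the situation that a per-variant mutational profile is meant to correct.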
