Genome-Wide Analysis of RNA Secondary Structure in Eukaryotes
The secondary structure of an RNA molecule plays an integral role in its maturation, regulation, and function. Over the past decades, myriad studies have revealed specific examples of structural elements that direct the expression and function of both protein-coding messenger RNAs (mRNAs) and non-coding RNAs (ncRNAs). In this work, we develop and apply a novel high-throughput, sequencing-based, structure mapping approach to study RNA secondary structure in three eukaryotic organisms.
First, we assess global patterns of secondary structure across protein-coding transcripts and identify a conserved mark of strongly reduced base pairing at transcription start and stop sites, which we hypothesize helps with ribosome recruitment and function. We also find empirical evidence for reduced base pairing within microRNA (miRNA) target sites, lending further support to the notion that even mRNAs have additional selective pressures outside of their protein coding sequence.
Next, we integrate our structure mapping approaches with transcriptome-wide sequencing of ribosomal RNA-depleted (RNA-seq), small (smRNA-seq), and ribosome-bound (ribo-seq) RNA populations to investigate the impact of RNA secondary structure on gene expression regulation in the model organism Arabidopsis thaliana. We find that secondary structure and mRNA abundance are strongly anti-correlated, which is likely due to the propensity for highly structured transcripts to be degraded and/or processed into smRNAs.
Finally, we develop a likelihood model and Bayesian Markov chain Monte Carlo (MCMC) algorithm that utilizes the sequencing data from our structure mapping approaches to generate single-nucleotide resolution predictions of RNA secondary structure. We show that this likelihood framework resolves ambiguities that arise from the sequencing protocol and leads to significantly increased prediction accuracy.
In total, our findings provide, on a global scale, both validation of existing hypotheses regarding RNA biology and new insights into the regulatory and functional consequences of RNA secondary structure. Furthermore, the development of a statistical approach to structure prediction from sequencing data offers the promise of true genome-wide determination of RNA secondary structure.
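To make the idea of a likelihood model sampled by MCMC concrete, here is a minimal sketch. It is not the model from the work above: it assumes a single nucleotide whose unknown pairing probability `theta` is inferred from a hypothetical count of reads supporting a paired state out of a total, with a binomial likelihood and a uniform prior. All names and numbers are illustrative.

```python
import math
import random


def log_posterior(theta, paired_reads, total_reads):
    """Binomial log-likelihood with a uniform prior on (0, 1) (toy assumption)."""
    if not 0.0 < theta < 1.0:
        return float("-inf")
    unpaired_reads = total_reads - paired_reads
    return paired_reads * math.log(theta) + unpaired_reads * math.log(1.0 - theta)


def metropolis(paired_reads, total_reads, steps=20000, burn_in=2000,
               step_size=0.1, seed=1):
    """Random-walk Metropolis sampler for the pairing probability theta."""
    rng = random.Random(seed)
    theta = 0.5
    current = log_posterior(theta, paired_reads, total_reads)
    samples = []
    for step in range(steps):
        proposal = theta + rng.gauss(0.0, step_size)
        candidate = log_posterior(proposal, paired_reads, total_reads)
        # Accept with probability min(1, posterior ratio).
        if rng.random() < math.exp(min(0.0, candidate - current)):
            theta, current = proposal, candidate
        if step >= burn_in:
            samples.append(theta)
    return samples


# Hypothetical counts: 30 of 40 reads support a paired state.
samples = metropolis(paired_reads=30, total_reads=40)
posterior_mean = sum(samples) / len(samples)
print(round(posterior_mean, 2))
```

For this toy model the posterior is a Beta(31, 11) distribution, so the sampled mean can be checked against the analytic value 31/42.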
A Factor Graph Approach to Automated GO Annotation
As the volume of genomic data grows, computational methods become essential for providing a first glimpse into gene annotations. Automated Gene Ontology (GO) annotation methods based on hierarchical ensemble classification techniques are particularly interesting when interpretability of annotation results is a main concern. In these methods, raw GO-term predictions computed by base binary classifiers are leveraged by checking the consistency of predefined GO relationships. Both formal leveraging strategies, with a main focus on annotation precision, and heuristic alternatives, with a main focus on scalability issues, have been described in the literature. In this contribution, a factor graph approach to the hierarchical ensemble formulation of the automated GO annotation problem is presented. In this formal framework, a core factor graph is first built based on the GO structure and then enriched to take into account the noisy nature of GO-term predictions. Hence, starting from raw GO-term predictions, an iterative message passing algorithm between nodes of the factor graph is used to compute marginal probabilities of target GO-terms. Evaluations on Saccharomyces cerevisiae, Arabidopsis thaliana and Drosophila melanogaster protein sequences from the GO Molecular Function domain showed significant improvements over competing approaches, even when protein sequences were naively characterized by their physicochemical and secondary structure properties or when loose noisy annotation datasets were considered. Based on these promising results and using Arabidopsis thaliana annotation data, we extend our approach to the identification of the most promising molecular function annotations for a set of proteins of unknown function in Solanum lycopersicum.
Affiliation: Spetale, Flavio Ezequiel. Consejo Nacional de Investigaciones Científicas y Técnicas. Centro Científico Tecnológico Conicet - Rosario. Centro Internacional Franco Argentino de Ciencias de la Información y de Sistemas. Universidad Nacional de Rosario. Centro Internacional Franco Argentino de Ciencias de la Información y de Sistemas; Argentina
Affiliation: Krsticevic, Flavia Jorgelina. Consejo Nacional de Investigaciones Científicas y Técnicas. Centro Científico Tecnológico Conicet - Rosario. Centro Internacional Franco Argentino de Ciencias de la Información y de Sistemas. Universidad Nacional de Rosario. Centro Internacional Franco Argentino de Ciencias de la Información y de Sistemas; Argentina
Affiliation: Roda, Fernando. Consejo Nacional de Investigaciones Científicas y Técnicas. Centro Científico Tecnológico Conicet - Rosario. Centro Internacional Franco Argentino de Ciencias de la Información y de Sistemas. Universidad Nacional de Rosario. Centro Internacional Franco Argentino de Ciencias de la Información y de Sistemas; Argentina
Affiliation: Bulacio, Pilar Estela. Consejo Nacional de Investigaciones Científicas y Técnicas. Centro Científico Tecnológico Conicet - Rosario. Centro Internacional Franco Argentino de Ciencias de la Información y de Sistemas. Universidad Nacional de Rosario. Centro Internacional Franco Argentino de Ciencias de la Información y de Sistemas; Argentina
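The factor graph idea in the abstract above can be illustrated with a small sketch. It is an assumption-laden toy, not the authors' implementation: each GO term gets a unary factor from a hypothetical classifier score, each parent-child relation gets a consistency factor that forbids annotating a child without its parent, and marginals are computed by brute-force enumeration. On a tree-structured graph this gives the same marginals that sum-product message passing would, but it only scales to a handful of terms. Term names and scores are made up.

```python
from itertools import product


def go_marginals(scores, parent_of):
    """Marginal probability of each GO term under hierarchy constraints.

    scores:    {term: raw classifier probability} (hypothetical inputs)
    parent_of: {child term: parent term}
    """
    terms = list(scores)
    totals = {term: 0.0 for term in terms}
    normalizer = 0.0
    for assignment in product((0, 1), repeat=len(terms)):
        state = dict(zip(terms, assignment))
        # Consistency factor: a child term may not be on while its parent is off.
        if any(state[c] == 1 and state[p] == 0 for c, p in parent_of.items()):
            continue
        # Unary factors: agreement with the noisy raw predictions.
        weight = 1.0
        for term in terms:
            weight *= scores[term] if state[term] else 1.0 - scores[term]
        normalizer += weight
        for term in terms:
            if state[term]:
                totals[term] += weight
    return {term: totals[term] / normalizer for term in terms}


raw = {"binding": 0.3, "ion binding": 0.8}
marginals = go_marginals(raw, parent_of={"ion binding": "binding"})
print({term: round(p, 2) for term, p in marginals.items()})
# → {'binding': 0.68, 'ion binding': 0.55}
```

In this toy case the inconsistent raw scores (child more likely than parent) are reconciled: the parent's marginal rises and the child's falls until the hierarchy is respected.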
Machine learning-guided directed evolution for protein engineering
Machine learning (ML)-guided directed evolution is a new paradigm for
biological design that enables optimization of complex functions. ML methods
use data to predict how sequence maps to function without requiring a detailed
model of the underlying physics or biological pathways. To demonstrate
ML-guided directed evolution, we introduce the steps required to build ML
sequence-function models and use them to guide engineering, making
recommendations at each stage. This review covers basic concepts relevant to
using ML for protein engineering as well as the current literature and
applications of this new engineering paradigm. ML methods accelerate directed
evolution by learning from information contained in all measured variants and
using that information to select sequences that are likely to be improved. We
then provide two case studies that demonstrate the ML-guided directed evolution
process. We also look to future opportunities where ML will enable discovery of
new protein functions and uncover the relationship between protein sequence and
function.
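The loop described in this abstract (measure variants, fit a sequence-function model, use it to pick the next variants) can be sketched as follows. Everything here is a toy assumption: the four-letter alphabet, the hidden `TARGET` landscape standing in for a lab assay, and the site-wise additive model standing in for a real ML model.

```python
import random
from collections import defaultdict

ALPHABET = "ACDE"   # toy alphabet (assumption)
TARGET = "ACDEA"    # hidden optimum of the toy landscape (assumption)


def assay(seq):
    """Hypothetical stand-in for a lab measurement: matches to TARGET."""
    return sum(a == b for a, b in zip(seq, TARGET))


def fit_additive_model(measured):
    """Site-wise additive model: mean measured fitness per (position, residue)."""
    sums, counts = defaultdict(float), defaultdict(int)
    for seq, fitness in measured.items():
        for i, aa in enumerate(seq):
            sums[(i, aa)] += fitness
            counts[(i, aa)] += 1
    overall = sum(measured.values()) / len(measured)

    def predict(seq):
        score = 0.0
        for i, aa in enumerate(seq):
            n = counts.get((i, aa), 0)
            # Unseen (position, residue) pairs fall back to the overall mean.
            score += sums[(i, aa)] / n if n else overall
        return score

    return predict


def single_mutants(seq):
    """All sequences one substitution away from seq."""
    for i, old in enumerate(seq):
        for aa in ALPHABET:
            if aa != old:
                yield seq[:i] + aa + seq[i + 1:]


def directed_evolution(rounds=4, library_size=8, batch=5, seed=0):
    rng = random.Random(seed)
    measured = {}
    while len(measured) < library_size:          # random starting library
        seq = "".join(rng.choice(ALPHABET) for _ in TARGET)
        measured[seq] = assay(seq)
    initial_best = max(measured.values())
    for _ in range(rounds):
        predict = fit_additive_model(measured)   # learn from all variants so far
        parents = sorted(measured, key=measured.get, reverse=True)[:3]
        candidates = {m for p in parents for m in single_mutants(p)} - set(measured)
        # Inner sort makes tie-breaking independent of set iteration order.
        ranked = sorted(sorted(candidates), key=predict, reverse=True)
        for seq in ranked[:batch]:               # "measure" the top predictions
            measured[seq] = assay(seq)
    best = max(measured, key=measured.get)
    return initial_best, best, measured[best]


initial_best, best_seq, best_fitness = directed_evolution()
print(initial_best, best_seq, best_fitness)
```

The model is refit on every measured variant in each round, which is the sense in which the approach learns from all of the data and not only from the current best sequence.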
A Knowledge Gradient Policy for Sequencing Experiments to Identify the Structure of RNA Molecules Using a Sparse Additive Belief Model
We present a sparse knowledge gradient (SpKG) algorithm for adaptively
selecting the targeted regions within a large RNA molecule to identify which
regions are most amenable to interactions with other molecules. Experimentally,
such regions can be inferred from fluorescence measurements obtained by binding
a complementary probe with fluorescence markers to the targeted regions. We use
a biophysical model which shows that the fluorescence ratio on the log scale
has a sparse linear relationship with the coefficients describing the
accessibility of each nucleotide, since not all sites are accessible (due to
the folding of the molecule). The SpKG algorithm uniquely combines the Bayesian
ranking and selection problem with the frequentist regularized
regression approach Lasso. We use this algorithm to identify the sparsity
pattern of the linear model as well as sequentially decide the best regions to
test before the experimental budget is exhausted. We also develop two
other new algorithms: a batch SpKG algorithm, which generates multiple suggestions
sequentially so that experiments can be run in parallel; and batch SpKG with a
procedure we call length mutagenesis, which dynamically adds new alternatives, in the
form of types of probes created by inserting, deleting or mutating
nucleotides within existing probes. In simulation, we demonstrate these
algorithms on the Group I intron (a mid-size RNA molecule), showing that they
efficiently learn the correct sparsity pattern, identify the most accessible
region, and outperform several other policies.
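The Lasso component mentioned in this abstract can be illustrated in isolation. The sketch below is plain Lasso regression fitted by coordinate descent, not the SpKG policy itself: it has no Bayesian belief model and no knowledge gradient step. The design matrix (one hypothetical probe per site) and the response values are invented for illustration.

```python
def soft_threshold(value, lam):
    """Shrink value toward zero by lam; the core of the Lasso update."""
    if value > lam:
        return value - lam
    if value < -lam:
        return value + lam
    return 0.0


def lasso_coordinate_descent(X, y, lam, sweeps=200):
    """Minimize (1/2n)*||y - X b||^2 + lam*||b||_1 by cyclic coordinate descent."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    for _ in range(sweeps):
        for j in range(p):
            scale = sum(X[i][j] ** 2 for i in range(n)) / n
            if scale == 0.0:
                continue  # column of zeros carries no information
            rho = 0.0
            for i in range(n):
                # Residual of row i with coordinate j left out of the fit.
                partial = sum(X[i][k] * beta[k] for k in range(p) if k != j)
                rho += X[i][j] * (y[i] - partial)
            beta[j] = soft_threshold(rho / n, lam) / scale
    return beta


def support(beta, tol=1e-12):
    """Indices of the nonzero coefficients, i.e. the sparsity pattern."""
    return [j for j, b in enumerate(beta) if abs(b) > tol]


# Toy design: one hypothetical probe per site, so X is the identity matrix.
X = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]
y = [2.0, 0.0, -1.5, 0.05]   # made-up log fluorescence ratios
beta = lasso_coordinate_descent(X, y, lam=0.1)
print(support(beta))
# → [0, 2]
```

The small response at the last site is shrunk exactly to zero, which is how the L1 penalty yields a sparsity pattern and not merely small coefficients.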
Large-scale analysis of influenza A virus nucleoprotein sequence conservation reveals potential drug-target sites
The nucleoprotein (NP) of the influenza A virus encapsidates the viral RNA and participates in the infectious life cycle of the virus. The aims of this study were to find the degree of conservation of NP among all virus subtypes and hosts and to identify conserved binding sites, which may be utilised as potential drug target sites. The analysis of conservation based on 4430 amino acid sequences identified high conservation in known functional regions as well as novel highly conserved sites. Highly variable clusters identified on the surface of NP may be associated with adaptation to different hosts and avoidance of the host immune defence. Ligand binding potential overlapping with high conservation was found in the tail-loop binding site and near the putative RNA binding region. The results provide the basis for developing antivirals that may be universally effective and have a reduced potential to induce resistance through mutations.
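One common way to quantify per-site conservation in an alignment is the Shannon entropy of each column. The abstract does not state which measure the study used, so the sketch below is an assumption, and the four short aligned sequences are invented for illustration.

```python
import math
from collections import Counter


def column_entropy(column):
    """Shannon entropy (bits) of the residues in one alignment column."""
    counts = Counter(column)
    total = len(column)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def conserved_sites(alignment, max_entropy=0.0):
    """Zero-based positions whose entropy does not exceed max_entropy."""
    columns = zip(*alignment)  # transpose: one tuple of residues per position
    return [i for i, col in enumerate(columns)
            if column_entropy(col) <= max_entropy]


# Hypothetical aligned sequences of equal length.
alignment = ["MKTA", "MKSA", "MRTA", "MKTA"]
print(conserved_sites(alignment))
# → [0, 3]
```

Raising `max_entropy` relaxes the definition of a conserved site, which matters in large sequence sets where a few divergent sequences would otherwise rule out nearly every position.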