RNA Accessibility in cubic time
Background: The accessibility of RNA binding motifs controls the efficacy of many biological processes. Examples are the binding of miRNA, siRNA or bacterial sRNA to their respective targets. Similarly, the accessibility of the Shine-Dalgarno sequence is essential for translation to start in prokaryotes. Furthermore, many classes of RNA binding proteins require the binding site to be single-stranded. Results: We introduce a way to compute the accessibility of all intervals within an RNA sequence in O(n^3) time. This improves on previous implementations where only intervals of one defined length were computed in the same time. While the algorithm is in the same efficiency class as sampling approaches, the results, especially if the probabilities get small, are much more exact. Conclusions: Our algorithm significantly speeds up methods for the prediction of RNA-RNA interactions and other applications that require the accessibility of RNA molecules. The algorithm is already available in the program RNAplfold of the ViennaRNA package.
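To make the notion of interval accessibility concrete, the sketch below estimates it in the sampling style that the abstract compares against: the accessibility of an interval is taken as the fraction of sampled structures in which every position of the interval is unpaired. This is not the exact cubic-time algorithm of the paper. The function name `interval_accessibility` and the assumption that the input is a list of equal-length dot-bracket strings drawn from the Boltzmann ensemble are illustrative only.

```python
from typing import Dict, List, Tuple


def interval_accessibility(structures: List[str]) -> Dict[Tuple[int, int], float]:
    """Estimate accessibility for every interval [i, j] (0-based, inclusive).

    Assumption: `structures` is a sample of dot-bracket strings of equal
    length, where '.' marks an unpaired position.
    """
    if not structures:
        raise ValueError("need at least one sampled structure")
    n = len(structures[0])
    if any(len(s) != n for s in structures):
        raise ValueError("all structures must have the same length")

    counts: Dict[Tuple[int, int], int] = {}
    for s in structures:
        for i in range(n):
            # Extend the interval to the right until a paired position appears.
            for j in range(i, n):
                if s[j] != ".":
                    break
                counts[(i, j)] = counts.get((i, j), 0) + 1

    total = len(structures)
    return {
        (i, j): counts.get((i, j), 0) / total
        for i in range(n)
        for j in range(i, n)
    }


if __name__ == "__main__":
    sample = ["((..))", "......", "(....)"]
    acc = interval_accessibility(sample)
    print(acc[(2, 3)], acc[(1, 4)], acc[(0, 5)])
```

With a finite sample, an interval that is unpaired in only a tiny fraction of the ensemble may never be observed at all, which is the regime where the abstract reports that exact computation is much more precise.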
Structurally constrained protein evolution: results from a lattice simulation
We simulate the evolution of a protein-like sequence subject to point
mutations, imposing conservation of the ground state, thermodynamic stability
and fast folding. Our model is aimed at describing neutral evolution of natural
proteins. We use a cubic lattice model of the protein structure and test the
neutrality conditions by extensive Monte Carlo simulations. We observe that
sequence space is traversed by neutral networks, i.e. sets of sequences with
the same fold connected by point mutations. Typical pairs of sequences on a
neutral network are nearly as different as randomly chosen sequences. The
fraction of neutral neighbors has strong sequence to sequence variations, which
influence the rate of neutral evolution. In this paper we study the
thermodynamic stability of different protein sequences. We relate the high
variability of the fraction of neutral mutations to the complex energy
landscape within a neutral network, arguing that valleys in this landscape are
associated with high values of the neutral mutation rate. We find that when a
point mutation produces a sequence with a new ground state, this is likely to
have a low stability. Thus we tentatively conjecture that neutral networks of
different structures are typically well separated in sequence space. This
result indicates that significantly changing a protein structure through a
biologically acceptable chain of point mutations is a rare, although possible,
event. Comment: added reference, to appear in European Physical Journal
Trajectory-based differential expression analysis for single-cell sequencing data
Trajectory inference has radically enhanced single-cell RNA-seq research by enabling the study of dynamic changes in gene expression. Downstream of trajectory inference, it is vital to discover genes that are (i) associated with the lineages in the trajectory, or (ii) differentially expressed between lineages, to illuminate the underlying biological processes. Current data analysis procedures, however, either fail to exploit the continuous resolution provided by trajectory inference, or fail to pinpoint the exact types of differential expression. We introduce tradeSeq, a powerful generalized additive model framework based on the negative binomial distribution that allows flexible inference of both within-lineage and between-lineage differential expression. By incorporating observation-level weights, the model can additionally account for zero inflation. We evaluate the method on simulated datasets and on real datasets from droplet-based and full-length protocols, and show that it yields biological insights through a clear interpretation of the data. Downstream of trajectory inference for cell lineages based on scRNA-seq data, differential expression analysis yields insight into biological processes. Here, Van den Berge et al. develop tradeSeq, a framework for the inference of within-lineage and between-lineage differential expression, based on negative binomial generalized additive models.
LinearCoFold and LinearCoPartition: Linear-Time Algorithms for Secondary Structure Prediction of Interacting RNA molecules
Many ncRNAs function through RNA-RNA interactions. Fast and reliable RNA
structure prediction with consideration of RNA-RNA interaction is useful. Some
existing tools are less accurate because they omit the competition between
intermolecular and intramolecular base pairs, or focus more on predicting the
binding region rather than predicting the complete secondary structure of two
interacting strands. Vienna RNAcofold, which reduces the problem to the
classical single sequence folding by concatenating two strands, scales cubically
with the combined sequence length, and is slow for long sequences. To
address these issues, we present LinearCoFold, which predicts the complete
minimum free energy structure of two strands in linear runtime, and
LinearCoPartition, which calculates the cofolding partition function and base
pairing probabilities in linear runtime. LinearCoFold and LinearCoPartition
follow the concatenation strategy of RNAcofold, but are orders of magnitude
faster than RNAcofold. For example, on a sequence pair with combined length of
26,190 nt, LinearCoFold is 86.8x faster than RNAcofold MFE mode (0.6 minutes
vs. 52.1 minutes), and LinearCoPartition is 642.3x faster than RNAcofold
partition function mode (1.8 minutes vs. 1156.2 minutes). Unlike
local algorithms, LinearCoFold and LinearCoPartition are global cofolding
algorithms without restriction on base pair length. Surprisingly, LinearCoFold
and LinearCoPartition's predictions have higher PPV and sensitivity of
intermolecular base pairs. Furthermore, we apply LinearCoFold to predict the
RNA-RNA interaction between SARS-CoV-2 gRNA and human U4 snRNA, which has been
experimentally studied, and observe that LinearCoFold's prediction correlates
better with the wet lab results.
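The accuracy measures named in this abstract, PPV and sensitivity of intermolecular base pairs, can be illustrated with a short sketch. It assumes base pairs are given as 0-based index tuples `(i, j)` with `i < j` on the concatenated sequence, and that the first `len_a` positions belong to the first strand; the helper names `intermolecular` and `ppv_sensitivity` are hypothetical and not taken from LinearCoFold or RNAcofold.

```python
from typing import Set, Tuple

Pair = Tuple[int, int]


def intermolecular(pairs: Set[Pair], len_a: int) -> Set[Pair]:
    """Keep only pairs that join strand A (index < len_a) to strand B."""
    return {(i, j) for i, j in pairs if i < len_a <= j}


def ppv_sensitivity(predicted: Set[Pair], reference: Set[Pair]) -> Tuple[float, float]:
    """Return (PPV, sensitivity) of predicted pairs against reference pairs.

    PPV = TP / (TP + FP), sensitivity = TP / (TP + FN).
    Empty sets yield 0.0 to avoid division by zero.
    """
    tp = len(predicted & reference)
    ppv = tp / len(predicted) if predicted else 0.0
    sensitivity = tp / len(reference) if reference else 0.0
    return ppv, sensitivity


if __name__ == "__main__":
    len_a = 5  # strand A occupies positions 0..4, strand B positions 5..9
    predicted = {(0, 9), (1, 8), (2, 4), (3, 7)}
    reference = {(0, 9), (1, 8), (2, 7), (5, 6)}
    print(ppv_sensitivity(intermolecular(predicted, len_a),
                          intermolecular(reference, len_a)))
```

Filtering to intermolecular pairs before scoring keeps intramolecular pairs, such as `(2, 4)` and `(5, 6)` in the toy data, from affecting either measure.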
scAI: an unsupervised approach for the integrative analysis of parallel single-cell transcriptomic and epigenomic profiles.
Simultaneous measurements of transcriptomic and epigenomic profiles in the same individual cells provide an unprecedented opportunity to understand cell fates. However, effective approaches for the integrative analysis of such data are lacking. Here, we present a single-cell aggregation and integration (scAI) method to deconvolute cellular heterogeneity from parallel transcriptomic and epigenomic profiles. Through iterative learning, scAI aggregates sparse epigenomic signals in similar cells learned in an unsupervised manner, allowing coherent fusion with transcriptomic measurements. Simulation studies and applications to three real datasets demonstrate its capability of dissecting cellular heterogeneity within both transcriptomic and epigenomic layers and understanding transcriptional regulatory mechanisms
Flexible RNA design under structure and sequence constraints using formal languages
The problem of RNA secondary structure design (also called inverse folding)
is the following: given a target secondary structure, one aims to create a
sequence that folds into, or is compatible with, a given structure. In several
practical applications in biology, additional constraints must be taken into
account, such as the presence/absence of regulatory motifs, either at a
specific location or anywhere in the sequence. In this study, we investigate
the design of RNA sequences from their targeted secondary structure, given
these additional sequence constraints. To this purpose, we develop a general
framework based on concepts of language theory, namely context-free grammars
and finite automata. We efficiently combine a comprehensive set of constraints
into a unifying context-free grammar of moderate size. From there, we use
generic algorithms to perform a (weighted) random generation, or an
exhaustive enumeration, of candidate sequences. The resulting method, whose
complexity scales linearly with the length of the RNA, was implemented as a
standalone program. The resulting software was embedded into a publicly
available dedicated web server. The applicability of the method was demonstrated on
a concrete case study dedicated to Exon Splicing Enhancers, in which our
approach was successfully used in the design of \emph{in vitro} experiments. Comment: ACM BCB 2013 - ACM Conference on Bioinformatics, Computational
Biology and Biomedical Informatics (2013)
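As a rough illustration of designing a sequence under both a structure constraint and a sequence constraint, the sketch below draws random sequences compatible with a target dot-bracket structure and rejects any that contain a forbidden motif. It deliberately uses plain rejection sampling, not the context-free grammar and finite automaton construction described in the abstract, and every name in it (`pair_table`, `sample_compatible`, `design`) is hypothetical.

```python
import random
from typing import List, Optional

# Canonical and wobble base pairs allowed at paired positions.
PAIRS = ["AU", "UA", "GC", "CG", "GU", "UG"]


def pair_table(structure: str) -> List[int]:
    """Map each position to its pairing partner, or -1 if unpaired."""
    table = [-1] * len(structure)
    stack: List[int] = []
    for i, c in enumerate(structure):
        if c == "(":
            stack.append(i)
        elif c == ")":
            if not stack:
                raise ValueError("unbalanced structure")
            j = stack.pop()
            table[i], table[j] = j, i
        elif c != ".":
            raise ValueError(f"unexpected character {c!r}")
    if stack:
        raise ValueError("unbalanced structure")
    return table


def sample_compatible(structure: str, rng: random.Random) -> str:
    """Draw one sequence whose paired positions can form the target pairs."""
    table = pair_table(structure)
    seq = [""] * len(structure)
    for i, j in enumerate(table):
        if j == -1:
            seq[i] = rng.choice("ACGU")
        elif i < j:
            seq[i], seq[j] = rng.choice(PAIRS)
    return "".join(seq)


def design(structure: str, forbidden: str, rng: random.Random,
           max_tries: int = 1000) -> Optional[str]:
    """Rejection-sample a compatible sequence that avoids `forbidden`."""
    for _ in range(max_tries):
        candidate = sample_compatible(structure, rng)
        if forbidden not in candidate:
            return candidate
    return None


if __name__ == "__main__":
    print(design("((...))", "GGG", random.Random(0)))
```

Each draw costs time linear in the sequence length, but the number of rejected draws can grow quickly as more motifs are forbidden or required. Compiling all constraints into a single generator, as the abstract describes, avoids that rejection step.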
Improving the value of public RNA-seq expression data by phenotype prediction.
Publicly available genomic data are a valuable resource for studying normal human variation and disease, but these data are often not well labeled or annotated. The lack of phenotype information for public genomic data severely limits their utility for addressing targeted biological questions. We develop an in silico phenotyping approach for predicting critical missing annotation directly from genomic measurements using well-annotated genomic and phenotypic data produced by consortia like TCGA and GTEx as training data. We apply in silico phenotyping to a set of 70,000 RNA-seq samples we recently processed on a common pipeline as part of the recount2 project. We use gene expression data to build and evaluate predictors for both biological phenotypes (sex, tissue, sample source) and experimental conditions (sequencing strategy). We demonstrate how these predictions can be used to study cross-sample properties of public genomic data, select genomic projects with specific characteristics, and perform downstream analyses using predicted phenotypes. The methods to perform phenotype prediction are available in the phenopredict R package and the predictions for recount2 are available from the recount R package. With data and phenotype information available for 70,000 human samples, expression data are available for use on a scale that was not previously feasible.