RNA secondary structure prediction from multi-aligned sequences

It is well accepted that the secondary structures of most functional non-coding RNAs (ncRNAs) are closely related to their functions and are conserved during evolution. Hence, predicting conserved secondary structures from evolutionarily related sequences is an important task in RNA bioinformatics; such methods are useful not only for further functional analysis of ncRNAs but also for improving the accuracy of secondary structure prediction and for finding novel functional RNAs in the genome. In this review, I focus on common secondary structure prediction from a given alignment of RNA sequences, in which a single secondary structure whose length equals that of the input alignment is predicted. I systematically review and classify existing tools and algorithms for the problem according to the information each tool employs, adopting a unified viewpoint based on maximum expected gain (MEG) estimators. I believe that this classification will allow a deeper understanding of each tool and provide users with useful information for selecting tools for common secondary structure prediction.
Comment: A preprint of an invited review manuscript that will be published in a chapter of the book `Methods in Molecular Biology'. Note that this version of the manuscript may differ from the published version.
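For orientation, the MEG viewpoint mentioned in the abstract can be written in a general form. The notation below is an assumption introduced here for illustration and does not come from the abstract: $A$ is the input alignment, $\mathcal{Y}$ the set of candidate common secondary structures, $p(\theta \mid A)$ a probability distribution over structures, and $G(\theta, y)$ a gain function that scores a prediction $y$ against a reference $\theta$. The second line shows one commonly used choice of gain, that of the $\gamma$-centroid estimator, where $\theta_{ij}$ and $y_{ij}$ indicate whether positions $i$ and $j$ are paired.

```latex
\hat{y} \;=\; \operatorname*{arg\,max}_{y \in \mathcal{Y}} \;
        \sum_{\theta \in \mathcal{Y}} G(\theta, y)\, p(\theta \mid A)

G_{\gamma}(\theta, y) \;=\; \sum_{i < j}
        \Bigl[\, \gamma \, I(\theta_{ij} = 1)\, I(y_{ij} = 1)
               \;+\; I(\theta_{ij} = 0)\, I(y_{ij} = 0) \Bigr]
```

Under this reading, tools differ mainly in how they define $p(\theta \mid A)$ and in which gain function they maximize.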

arXiv.org e-Print Archive

CiteSeerX

Crossref

Improved Measurements of RNA Structure Conservation with Generalized Centroid Estimators

Author: Okada Yohei
Saito Yutaka
Sakakibara Yasubumi
Sato Kengo
Publication venue: Frontiers Research Foundation
Publication date: 01/01/2011

Identification of non-protein-coding RNAs (ncRNAs) in genomes is a crucial task not only for molecular cell biology but also for bioinformatics. Secondary structures of ncRNAs are a key feature in ncRNA analysis because the biological functions of ncRNAs are closely related to their secondary structures. Although the minimum free energy (MFE) structure of an RNA sequence is regarded as the most stable structure, MFE alone is not an appropriate measure for identifying ncRNAs because the free energy is heavily biased by nucleotide composition. Therefore, instead of MFE itself, several alternative measures for identifying ncRNAs have been proposed, such as the structure conservation index (SCI) and the base pair distance (BPD), both of which employ MFE structures. However, these measurements are not suitable for identifying ncRNAs in some cases, including genome-wide searches, and incur a high false discovery rate. In this study, we propose improved measurements based on SCI and BPD that apply generalized centroid estimators to provide robustness against low-quality multiple alignments. Our experiments show that the proposed methods achieve higher accuracy than the original SCI and BPD, not only for human-curated structural alignments but also for low-quality alignments produced by CLUSTAL W. Furthermore, the centroid-based SCI on CLUSTAL W alignments is more accurate than, or comparable with, the original SCI on structural alignments generated with RAF, a high-quality structural aligner that requires about twice the computation time on average. We conclude that our methods are more suitable than the original SCI and BPD for genome-wide alignments, which are of low quality from the point of view of secondary structure.
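The two baseline measures named in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation. It assumes the commonly used definitions: SCI as the consensus folding energy of the alignment divided by the mean MFE of the individual sequences, and BPD as the number of base pairs present in exactly one of two structures. The function names are hypothetical, the energies are assumed to be computed elsewhere by a folding program, and the centroid-based variants proposed in the paper are not reproduced.

```python
def parse_pairs(dot_bracket):
    """Return the set of base pairs (i, j), i < j, encoded by a dot-bracket string."""
    stack, pairs = [], set()
    for pos, symbol in enumerate(dot_bracket):
        if symbol == "(":
            stack.append(pos)
        elif symbol == ")":
            if not stack:
                raise ValueError("unbalanced ')' at position %d" % pos)
            pairs.add((stack.pop(), pos))
        elif symbol != ".":
            raise ValueError("unexpected symbol %r" % symbol)
    if stack:
        raise ValueError("unbalanced '(' at position %d" % stack[-1])
    return pairs


def base_pair_distance(structure_a, structure_b):
    """Count base pairs that occur in exactly one of the two structures."""
    if len(structure_a) != len(structure_b):
        raise ValueError("structures must have the same length")
    return len(parse_pairs(structure_a) ^ parse_pairs(structure_b))


def structure_conservation_index(consensus_energy, single_energies):
    """Ratio of the consensus energy to the mean single-sequence MFE.

    Energies are assumed to be in the same units (e.g. kcal/mol) and
    negative for stable structures, so a well-conserved structure gives
    a value near 1 and an unconserved one a value near 0.
    """
    mean_single = sum(single_energies) / len(single_energies)
    if mean_single == 0:
        raise ValueError("mean single-sequence energy is zero")
    return consensus_energy / mean_single
```

For example, `base_pair_distance("((..))", "(....)")` compares a two-pair hairpin with a one-pair hairpin that shares its outer pair.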

Crossref

Directory of Open Access Journals

PubMed Central

Frontiers - Publisher Connector

Bayesian Centroid Estimation for Motif Discovery

Author: A Dempster
A Neuwald
B Webb-Robertson
C Lawrence
C Murrea
D GuhaThakurta
E Xing
F Roth
G Pavesi
G Sandve
G Stormo
G Thijs
J Besag
J Gower
J Hu
J Liu
K MacIsaac
L Carvalho
L Newberg
Luis Carvalho
M Barbieri
M Régnier
M Tompa
MA Lones
Matteo G. A. Paris
S Geman
T Bailey
W Thompson
Y Ding
Publication venue: Public Library of Science (PLoS)
Publication date: 06/04/2012

Biological sequences may contain patterns that signal important biomolecular functions; a classical example is the regulation of gene expression by transcription factors that bind to specific patterns in genomic promoter regions. In motif discovery we are given a set of sequences that share a common motif and aim to identify not only the motif composition but also the binding sites in each sequence of the set. We present a Bayesian model that extends the model adopted by the Gibbs motif sampler, and propose a new centroid estimator that arises from a refined and meaningful loss function for binding site inference. We discuss the main advantages of centroid estimation for motif discovery, including computational convenience, and how its principled derivation offers further insights into the posterior distribution of binding site configurations. We also illustrate, using simulated and real datasets, that the centroid estimator can differ from the maximum a posteriori estimator.
Comment: 24 pages, 9 figures
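The closing claim, that a centroid estimator can differ from the maximum a posteriori (MAP) estimator, can be illustrated with a toy example. The sketch below assumes unconstrained binary configurations under plain Hamming loss, in which case the centroid is obtained by thresholding each marginal probability at 0.5. The paper's refined loss function and the constraints on binding-site configurations are not reproduced, and the posterior values are made up for illustration.

```python
def map_estimate(posterior):
    """Return the single most probable configuration."""
    return max(posterior, key=posterior.get)


def centroid_estimate(posterior):
    """Return the configuration minimising expected Hamming distance.

    For unconstrained binary vectors this is the vector that sets each
    position to 1 exactly when its marginal probability exceeds 0.5.
    """
    length = len(next(iter(posterior)))
    marginals = [
        sum(prob for config, prob in posterior.items() if config[k] == 1)
        for k in range(length)
    ]
    return tuple(1 if m > 0.5 else 0 for m in marginals)


def expected_hamming(candidate, posterior):
    """Expected Hamming distance from candidate to a posterior draw."""
    return sum(
        prob * sum(a != b for a, b in zip(candidate, config))
        for config, prob in posterior.items()
    )


# Hypothetical posterior over two binary site indicators.
posterior = {(1, 1): 0.4, (0, 0): 0.3, (0, 1): 0.3}
map_config = map_estimate(posterior)
centroid_config = centroid_estimate(posterior)
```

Here the MAP configuration is the single most probable point, while the centroid sits where most of the posterior mass lies position by position, which is why the two can disagree.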

arXiv.org e-Print Archive

Public Library of Science (PLOS)

Crossref

Directory of Open Access Journals

PubMed Central

FigShare

RNAG: a new Gibbs sampler for predicting RNA secondary structure for unaligned sequences

Author: Bernhart
Bindewald
Carvalho
Cary
Charles E. Lawrence
Chenna
Ding
Do
Donglai Wei
Eddy
Gardner
Geman
Giegerich
Griffiths-Jones
Gutell
Hamada
Hofacker
Ji
Kiryu
Knudsen
Lauren V. Alpert
Lindgreen
Liu
Mathews
Meyer
Nawrocki
Newberg
Sakakibara
Sankoff
Seemann
Siebert
Steffen
Tabaska
Torarinsson
Webb
Webb-Robertson
Will
Xing
Yao
Zuker
Publication venue: Oxford University Press
Publication date: 01/01/2011

Motivation: RNA secondary structure plays an important role in the function of many RNAs, and structural features are often key to their interaction with other cellular components. Thus, there has been considerable interest in the prediction of secondary structures for RNA families. In this article, we present a new global structural alignment algorithm, RNAG, to predict consensus secondary structures for unaligned sequences. It uses a blocked Gibbs sampling algorithm, which has a theoretical advantage in convergence time. This algorithm iteratively samples from the conditional probability distributions P(Structure | Alignment) and P(Alignment | Structure). Not surprisingly, there is considerable uncertainty in the high-dimensional space of this difficult problem, which has so far received limited attention in this field. We show how the samples drawn from this algorithm can be used to characterize the posterior space more fully and to assess the uncertainty of predictions.
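The alternation between P(Structure | Alignment) and P(Alignment | Structure) can be sketched on a toy problem. The code below is not RNAG: it assumes a small, fully enumerated joint distribution in which a single label stands in for each block, whereas the real sampler draws whole structures and whole alignments. The function name, the labels, and the probabilities are all hypothetical.

```python
import random


def gibbs_two_block(joint, n_samples, seed=0):
    """Alternately sample S given A and A given S from a joint table.

    joint maps (s, a) pairs to probabilities; every entry is assumed
    to be strictly positive so that each conditional is well defined.
    """
    rng = random.Random(seed)
    s_values = sorted({s for s, _ in joint})
    a_values = sorted({a for _, a in joint})

    def draw(values, weights):
        # random.choices normalises the weights internally.
        return rng.choices(values, weights=weights, k=1)[0]

    s, a = s_values[0], a_values[0]
    samples = []
    for _ in range(n_samples):
        # Block 1: structure given the current alignment.
        s = draw(s_values, [joint[(v, a)] for v in s_values])
        # Block 2: alignment given the newly drawn structure.
        a = draw(a_values, [joint[(s, v)] for v in a_values])
        samples.append((s, a))
    return samples


# Hypothetical joint distribution over two structures and two alignments.
joint = {
    ("hairpin", "aln1"): 0.4,
    ("hairpin", "aln2"): 0.1,
    ("stem", "aln1"): 0.1,
    ("stem", "aln2"): 0.4,
}
samples = gibbs_two_block(joint, n_samples=20000)
freq_hairpin_aln1 = sum(1 for x in samples if x == ("hairpin", "aln1")) / len(samples)
```

The empirical frequencies of the sampled pairs approximate the joint distribution, and the spread of the samples is what allows uncertainty to be assessed rather than a single point prediction reported.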

CiteSeerX

Crossref

PubMed Central

Prediction of RNA secondary structure by maximizing pseudo-expected accuracy

Author: B Knudsen
C Do
D Mathews
H Kiryu
I Hofacker
I Holmes
IL Hofacker
JS McCaskill
K Sato
Kengo Sato
Kiyoshi Asai
L Carvalho
L Kall
M Andronescu
M Hamada
M Parisien
M Zuker
MC Frith
Michiaki Hamada
N Michal
P Baldi
PP Gardner
R Durbin
RK Bradley
S Bernhart
S Engelen
S Griffiths-Jones
S Gross
S Seemann
SJ Schroeder
Y Ding
ZJ Lu
Publication venue: BioMed Central
Publication date: 01/01/2010

Background: Recent studies have revealed the importance of considering the entire distribution of possible secondary structures in RNA secondary structure prediction; therefore, new types of estimators, including the maximum expected accuracy (MEA) estimator, have been proposed. The MEA-based estimators have been designed to maximize the expected accuracy of the base pairs and have achieved the highest level of accuracy. Those methods, however, do not give a single best prediction of the structure, but employ parameters to control the trade-off between sensitivity and positive predictive value (PPV). It is unclear what parameter value should be used, and even a well-trained default parameter value does not, in general, give the best result for each RNA sequence under popular accuracy measures. Results: Instead of the expected values of the popular accuracy measures for RNA secondary structure prediction, which are difficult to calculate, the pseudo-expected accuracy, which can easily be computed from base-pairing probabilities, is introduced. It is shown that the pseudo-expected accuracy is a good approximation in terms of sensitivity, PPV, MCC, or F-score. The pseudo-expected accuracy can be approximately maximized for each RNA sequence by stochastic sampling. It is also shown that secondary structures that are well balanced between sensitivity and PPV can be predicted with a small computational overhead by combining the pseudo-expected accuracy of MCC or F-score with the γ-centroid estimator. Conclusions: This study gives not only a method for predicting a secondary structure that balances sensitivity and PPV, but also a general method for approximately maximizing the (pseudo-)expected accuracy with respect to various evaluation measures, including MCC and F-score.
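How an accuracy measure might be computed from base-pairing probabilities alone can be sketched as below. This is an assumption about the construction, not the paper's exact definition: the true-positive, false-positive, and false-negative counts are replaced by their expectations under the pairing probabilities and then plugged into the usual F-score formula. Candidate structures are generated with the threshold form of the γ-centroid rule (keep pairs with probability above 1/(γ+1)), which yields a consistent structure for γ ≤ 1 when the probabilities come from a genuine distribution over structures. The paper also describes stochastic sampling, which is not shown. All names and probabilities are hypothetical.

```python
def pseudo_expected_counts(predicted_pairs, pair_probs):
    """Expected TP, FP and FN for a predicted set of base pairs."""
    tp = sum(pair_probs.get(pair, 0.0) for pair in predicted_pairs)
    fp = len(predicted_pairs) - tp
    fn = sum(p for pair, p in pair_probs.items() if pair not in predicted_pairs)
    return tp, fp, fn


def pseudo_expected_f_score(predicted_pairs, pair_probs):
    """F-score with counts replaced by their expectations."""
    tp, fp, fn = pseudo_expected_counts(predicted_pairs, pair_probs)
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator > 0 else 0.0


def threshold_structure(pair_probs, gamma):
    """Pairs whose probability exceeds 1 / (gamma + 1), for 0 < gamma <= 1."""
    if not 0 < gamma <= 1:
        raise ValueError("this threshold rule is only used here for 0 < gamma <= 1")
    cutoff = 1.0 / (gamma + 1.0)
    return {pair for pair, p in pair_probs.items() if p > cutoff}


def best_gamma(pair_probs, gammas):
    """Pick the gamma whose thresholded structure scores highest."""
    return max(
        gammas,
        key=lambda g: pseudo_expected_f_score(
            threshold_structure(pair_probs, g), pair_probs
        ),
    )


# Hypothetical base-pairing probabilities for three nested pairs.
pair_probs = {(0, 9): 0.9, (1, 8): 0.6, (2, 7): 0.55}
chosen = best_gamma(pair_probs, [0.5, 1.0])
```

Choosing the parameter per sequence in this way is one reading of how the trade-off between sensitivity and PPV could be settled without a fixed default value.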

Crossref

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

Optimal feature selection for sparse linear discriminant analysis and its applications in gene expression data

Author: Anderson
Baiqi Miao
Barry
Bickel
Cai
Candes
Cheng Wang
Donoho
Dudoit
Fan
Goeman
Golub
Hess
Lai
Li
Longbing Cao
Mai
Shao
Srivastava
Tibshirani
Tong
Wu
Yeung
Zuber
Publication venue: Elsevier BV
Publication date: 01/01/2013

This work studies the theoretical rules of feature selection in linear discriminant analysis (LDA) and proposes a new feature selection method for sparse LDA. An ℓ1 minimization method is used to select the important features from which the LDA is constructed. The asymptotic results for the proposed two-stage LDA (TLDA) are studied, demonstrating that TLDA is an optimal classification rule whose convergence rate is the best among existing methods. Experiments on simulated and real datasets are consistent with the theoretical results and show that TLDA performs favorably in comparison with current methods. Overall, TLDA uses fewer features or genes than other approaches while achieving a lower misclassification rate.
Comment: 20 pages, 3 figures, 5 tables, accepted by Computational Statistics and Data Analysis

arXiv.org e-Print Archive

Crossref

OPUS - University of Technology Sydney

CentroidFold: a web server for RNA secondary structure prediction

Author: Ding
DING
Do
Dowell
Hofacker
K. Asai
K. Sato
Knudsen
M. Hamada
McCaskill
T. Mituyama
Zuker
Publication venue: Oxford University Press

The CentroidFold web server (http://www.ncrna.org/centroidfold/) is a web application for RNA secondary structure prediction powered by one of the most accurate prediction engines. The server accepts two kinds of sequence data: a single RNA sequence and a multiple alignment of RNA sequences. It responds with a prediction result shown in the popular base-pair notation and as a graph representation. A PDF version of the graph representation is also available. For a multiple alignment, the server predicts a common secondary structure. Usage of the server is simple: paste a single RNA sequence (FASTA or plain sequence text) or a multiple alignment (CLUSTAL-W format) into the text area, then click the ‘execute CentroidFold’ button. The server quickly responds with a prediction result. The major advantage of this server is that it employs our original CentroidFold software as its prediction engine, which achieved the best accuracy in our benchmark results. Our web server is freely available with no login requirement.

Crossref

PubMed Central