Search CORE

16,869 research outputs found

Computational identification and analysis of noncoding RNAs - Unearthing the buried treasures in the genome

Author: Vaidyanathan P. P.
Yoon Byung-Jun
Publication venue
Publication date: 01/01/2007
Field of study

The central dogma of molecular biology states that the genetic information flows from DNA to RNA to protein. This dogma has exerted a substantial influence on our understanding of the genetic activities in the cells. Under this influence, the prevailing assumption until the recent past was that genes are basically repositories for protein coding information, and proteins are responsible for most of the important biological functions in all cells. In the meanwhile, the importance of RNAs has remained rather obscure, and RNA was mainly viewed as a passive intermediary that bridges the gap between DNA and protein. Except for classic examples such as tRNAs (transfer RNAs) and rRNAs (ribosomal RNAs), functional noncoding RNAs were considered to be rare. However, this view has experienced a dramatic change during the last decade, as systematic screening of various genomes identified myriads of noncoding RNAs (ncRNAs), which are RNA molecules that function without being translated into proteins [11], [40]. It has been realized that many ncRNAs play important roles in various biological processes. As RNAs can interact with other RNAs and DNAs in a sequence-specific manner, they are especially useful in tasks that require highly specific nucleotide recognition [11]. Good examples are the miRNAs (microRNAs) that regulate gene expression by targeting mRNAs (messenger RNAs) [4], [20], and the siRNAs (small interfering RNAs) that take part in the RNAi (RNA interference) pathways for gene silencing [29], [30]. Recent developments show that ncRNAs are extensively involved in many gene regulatory mechanisms [14], [17]. The roles of ncRNAs known to this day are truly diverse. These include transcription and translation control, chromosome replication, RNA processing and modification, and protein degradation and translocation [40], just to name a few. These days, it is even claimed that ncRNAs dominate the genomic output of the higher organisms such as mammals, and it is being suggested that the greater portion of their genome (which does not encode proteins) is dedicated to the control and regulation of cell development [27]. As more and more evidence piles up, greater attention is paid to ncRNAs, which have been neglected for a long time. Researchers began to realize that the vast majority of the genome that was regarded as “junk,” mainly because it was not well understood, may indeed hold the key for the best kept secrets in life, such as the mechanism of alternative splicing, the control of epigenetic variations and so forth [27]. The complete range and extent of the role of ncRNAs are not so obvious at this point, but it is certain that a comprehensive understanding of cellular processes is not possible without understanding the functions of ncRNAs [47]

Caltech Authors

Empirical Risk Minimization with Approximations of Probabilistic Grammars

Author: Cohen S. B.
Smith N. A.
Publication venue
Publication date: 01/01/2010
Field of study

Probabilistic grammars are generative statistical models that are useful for compositional and sequential structures. We present a framework, reminiscent of structural risk minimization, for empirical risk minimization of the parameters of a fixed probabilistic grammar using the log-loss. We derive sample complexity bounds in this framework that apply both to the supervised setting and the unsupervised setting.

CiteSeerX

Edinburgh Research Explorer

Data-Oriented Language Processing. An Overview

Author: Bod Rens
Scha Remko
Publication venue
Publication date: 01/01/1996
Field of study

During the last few years, a new approach to language processing has started to emerge, which has become known under various labels such as "data-oriented parsing", "corpus-based interpretation", and "tree-bank grammar" (cf. van den Berg et al. 1994; Bod 1992-96; Bod et al. 1996a/b; Bonnema 1996; Charniak 1996a/b; Goodman 1996; Kaplan 1996; Rajman 1995a/b; Scha 1990-92; Sekine & Grishman 1995; Sima'an et al. 1994; Sima'an 1995-96; Tugwell 1995). This approach, which we will call "data-oriented processing" or "DOP", embodies the assumption that human language perception and production works with representations of concrete past language experiences, rather than with abstract linguistic rules. The models that instantiate this approach therefore maintain large corpora of linguistic representations of previously occurring utterances. When processing a new input utterance, analyses of this utterance are constructed by combining fragments from the corpus; the occurrence-frequencies of the fragments are used to estimate which analysis is the most probable one. In this paper we give an in-depth discussion of a data-oriented processing model which employs a corpus of labelled phrase-structure trees. Then we review some other models that instantiate the DOP approach. Many of these models also employ labelled phrase-structure trees, but use different criteria for extracting fragments from the corpus or employ different disambiguation strategies (Bod 1996b; Charniak 1996a/b; Goodman 1996; Rajman 1995a/b; Sekine & Grishman 1995; Sima'an 1995-96); other models use richer formalisms for their corpus annotations (van den Berg et al. 1994; Bod et al., 1996a/b; Bonnema 1996; Kaplan 1996; Tugwell 1995).Comment: 34 pages, Postscrip

arXiv.org e-Print Archive

CiteSeerX

Using parametric set constraints for locating errors in CLP programs

Author: Drabent W.
Maluszynski J.
Pietrzak P.
Publication venue
Publication date: 01/01/2002
Field of study

This paper introduces a framework of parametric descriptive directional types for constraint logic programming (CLP). It proposes a method for locating type errors in CLP programs and presents a prototype debugging tool. The main technique used is checking correctness of programs w.r.t. type specifications. The approach is based on a generalization of known methods for proving correctness of logic programs to the case of parametric specifications. Set-constraint techniques are used for formulating and checking verification conditions for (parametric) polymorphic type specifications. The specifications are expressed in a parametric extension of the formalism of term grammars. The soundness of the method is proved and the prototype debugging tool supporting the proposed approach is illustrated on examples. The paper is a substantial extension of the previous work by the same authors concerning monomorphic directional types.Comment: 64 pages, To appear in Theory and Practice of Logic Programmin

arXiv.org e-Print Archive

CiteSeerX

Digitala Vetenskapliga Arkivet - Academic Archive On-line

Luleå University of Technology Publications

Inducing Probabilistic Grammars by Bayesian Model Merging

Author: A. P. Dempster
A. Stolcke
C. M. Cook
D. Angluin
D. Ron
F. Jelinek
J. E. Hopcroft
J. G. Wolff
J. J. Horning
J. K. Baker
J. Oncina
J. R. Quinlan
L. E. Baum
L. R. Rabiner
P. F. Brown
S. F. Gull
S. M. Omohundro
T. C. Bell
T. L. Booth
Y. Sakakibara
Publication venue
Publication date: 01/01/1994
Field of study

We describe a framework for inducing probabilistic grammars from corpora of positive samples. First, samples are {\em incorporated} by adding ad-hoc rules to a working grammar; subsequently, elements of the model (such as states or nonterminals) are {\em merged} to achieve generalization and a more compact representation. The choice of what to merge and when to stop is governed by the Bayesian posterior probability of the grammar given the data, which formalizes a trade-off between a close fit to the data and a default preference for simpler models (`Occam's Razor'). The general scheme is illustrated using three types of probabilistic grammars: Hidden Markov models, class-based

n

-grams, and stochastic context-free grammars.Comment: To appear in Grammatical Inference and Applications, Second International Colloquium on Grammatical Inference; Springer Verlag, 1994. 13 page

arXiv.org e-Print Archive

CiteSeerX

Crossref