Better synchronous binarization for machine translation
Binarization of Synchronous Context Free Grammars (SCFG) is essential for achieving polynomial time complexity of decoding for SCFG parsing based machine translation systems. In this paper, we first investigate the excess edge competition issue caused by a left-heavy binary SCFG derived with the method of Zhang et al. (2006). Then we propose a new binarization method to mitigate the problem by exploring alternative equivalent binary SCFGs. We present an algorithm that iteratively improves the resulting binary SCFG, and empirically show that our method can improve a string-to-tree statistical machine translation system based on the synchronous binarization method in Zhang et al. (2006) on the NIST machine translation evaluation tasks.
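To make the idea of a left-heavy binarization concrete, the following minimal sketch checks whether the nonterminal permutation of a single SCFG rule can be binarized, using a greedy shift-reduce pass in the spirit of the method the abstract attributes to Zhang et al. (2006). The function name, the nested-tuple output, and the span representation are assumptions made for illustration; they are not taken from either paper, and the sketch does not implement the iterative improvement the abstract proposes.

```python
def left_heavy_binarize(perm):
    """Greedy shift-reduce over a permutation of 1..n.

    Returns a nested-tuple bracketing if the permutation is binarizable,
    otherwise None. Each stack item is (lo, hi, tree), where lo..hi is the
    contiguous range of target-side positions covered by that item.
    """
    stack = []
    for x in perm:
        stack.append((x, x, x))  # shift the next nonterminal
        # Reduce as long as the top two items cover adjacent ranges.
        while len(stack) >= 2:
            lo2, hi2, t2 = stack[-1]
            lo1, hi1, t1 = stack[-2]
            if hi1 + 1 == lo2 or hi2 + 1 == lo1:  # straight or inverted order
                stack[-2:] = [(min(lo1, lo2), max(hi1, hi2), (t1, t2))]
            else:
                break
    # Binarizable only if everything collapsed into a single item.
    return stack[0][2] if len(stack) == 1 else None
```

Under these assumptions, `left_heavy_binarize([1, 2, 3, 4])` returns `(((1, 2), 3), 4)`, a left-heavy bracketing, while `left_heavy_binarize([2, 4, 1, 3])` returns `None` because no two neighbouring items ever cover adjacent ranges.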
Synchronous Context-Free Grammars and Optimal Linear Parsing Strategies
Synchronous Context-Free Grammars (SCFGs), also known as syntax-directed
translation schemata, are unlike context-free grammars in that they do not have
a binary normal form. In general, parsing with SCFGs takes space and time
polynomial in the length of the input strings, but with the degree of the
polynomial depending on the permutations of the SCFG rules. We consider linear
parsing strategies, which add one nonterminal at a time. We show that for a
given input permutation, the problems of finding the linear parsing strategy
with the minimum space and time complexity are both NP-hard.
Algebraic decoder specification: coupling formal-language theory and statistical machine translation
The specification of a decoder, i.e., a program that translates sentences from one natural language into another, is an intricate process, driven by the application and lacking a canonical methodology. The practical nature of decoder development inhibits the transfer of knowledge between theory and application, which is unfortunate because many contemporary decoders are in fact related to formal-language theory. This thesis proposes an algebraic framework where a decoder is specified by an expression built from a fixed set of operations. As yet, this framework accommodates contemporary syntax-based decoders, it spans two levels of abstraction, and, primarily, it encourages mutual stimulation between the theory of weighted tree automata and the application
An Evaluation of Methods for Inferring Boolean Networks from Time-Series Data
Regulatory networks play a central role in cellular behavior and decision making. Learning these regulatory networks is a
major task in biology, and devising computational methods and mathematical models for this task is a major endeavor in
bioinformatics. Boolean networks have been used extensively for modeling regulatory networks. In this model, the state of
each gene can be either ‘on’ or ‘off’, and the next state of a gene is updated, synchronously or asynchronously, according to
a Boolean rule that is applied to the current state of the entire system. Inferring a Boolean network from a set of
experimental data entails two main steps: first, the experimental time-series data are discretized into Boolean trajectories,
and then, a Boolean network is learned from these Boolean trajectories. In this paper, we consider three methods for data
discretization, including a new one we propose, and three methods for learning Boolean networks, and study the
performance of all nine possible combinations on four regulatory systems of varying dynamical complexity. We find that
employing the right combination of methods for data discretization and network learning results in Boolean networks that
capture the dynamics well and provide predictive power. Our findings are in contrast to a recent survey that placed Boolean
networks on the low end of the ‘‘faithfulness to biological reality’’ and ‘‘ability to model dynamics’’ spectra. Further, contrary
to the common argument in favor of Boolean networks, we find that a relatively large number of time points in the time-series
data is required to learn good Boolean networks for certain data sets. Last but not least, while methods have been
proposed for inferring Boolean networks, as discussed above, missing still are publicly available implementations thereof.
Here, we make our implementation of the methods available publicly in open source at http://bioinfo.cs.rice.edu/
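The two ingredients described in this abstract, discretizing a time series into Boolean values and updating all genes synchronously from the current state, can be sketched as follows. This is a minimal illustration under stated assumptions: the mean-threshold discretization and the three-gene rule set are hypothetical, and neither is claimed to be one of the methods or regulatory systems the paper evaluates.

```python
from statistics import mean

def discretize(series):
    """Binarize one gene's time series: 1 if above its own mean, else 0.

    Mean thresholding is an assumption chosen for simplicity.
    """
    m = mean(series)
    return [1 if v > m else 0 for v in series]

def step(state, rules):
    """Synchronous update: every gene's rule reads the same current state."""
    return tuple(int(rule(state)) for rule in rules)

def trajectory(state, rules, n_steps):
    """Return the initial state followed by n_steps synchronous updates."""
    states = [tuple(state)]
    for _ in range(n_steps):
        states.append(step(states[-1], rules))
    return states

# Hypothetical 3-gene network, one Boolean rule per gene.
rules = [
    lambda s: not s[2],       # gene 0 is repressed by gene 2
    lambda s: s[0],           # gene 1 copies gene 0
    lambda s: s[0] and s[1],  # gene 2 needs both gene 0 and gene 1
]
```

With these hypothetical rules, `trajectory((1, 0, 0), rules, 3)` yields `[(1, 0, 0), (1, 1, 0), (1, 1, 1), (0, 1, 1)]`. Learning a network is the inverse problem: recovering rules like these from discretized trajectories.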
Pushdown automata in statistical machine translation
This article describes the use of pushdown automata (PDA) in the context of statistical machine translation and alignment under a synchronous context-free grammar. We use PDAs to compactly represent the space of candidate translations generated by the grammar when applied to an input sentence. General-purpose PDA algorithms for replacement, composition, shortest path, and expansion are presented. We describe HiPDT, a hierarchical phrase-based decoder using the PDA representation and these algorithms. We contrast the complexity of this decoder with a decoder based on a finite state automata representation, showing that PDAs provide a more suitable framework to achieve exact decoding for larger synchronous context-free grammars and smaller language models. We assess this experimentally on a large-scale Chinese-to-English alignment and translation task. In translation, we propose a two-pass decoding strategy involving a weaker language model in the first-pass to address the results of PDA complexity analysis. We study in depth the experimental conditions and tradeoffs in which HiPDT can achieve state-of-the-art performance for large-scale SMT.