23 research outputs found
Parallel on-line parsing in constant time per word
An on-line parser processes each word as soon as it is typed by the user, without waiting for the end of the sentence. Thus, in an interactive system, a sentence will be parsed almost immediately after the last word has been presented.\ud
\ud
The complexity of an on-line parser is determined by the resources needed for the analysis of a single word, as it is assumed that previous words have been processed already. Sequential parsing algorithms like CYK or Earley need O(n2) time for the nth word. A parallel implementation in O(n) time on O(n) processors is straightforward. In this paper a novel parallel on-line parser is presented that needs O(1) time on O(n2) processors
Precise n-gram Probabilities from Stochastic Context-free Grammars
We present an algorithm for computing n-gram probabilities from stochastic
context-free grammars, a procedure that can alleviate some of the standard
problems associated with n-grams (estimation from sparse data, lack of
linguistic structure, among others). The method operates via the computation of
substring expectations, which in turn is accomplished by solving systems of
linear equations derived from the grammar. We discuss efficient implementation
of the algorithm and report our practical experience with it.Comment: 12 pages, to appear in ACL-9
BitPar
Statistical parse
A simple, practical and complete O-time Algorithm for RNA folding using the Four-Russians Speedup
<p>Abstract</p> <p>Background</p> <p>The problem of computationally predicting the secondary structure (or folding) of RNA molecules was first introduced more than thirty years ago and yet continues to be an area of active research and development. The basic <it>RNA-folding problem </it>of finding a maximum cardinality, non-crossing, matching of complimentary nucleotides in an RNA sequence of length <it>n</it>, has an <it>O</it>(<it>n</it><sup>3</sup>)-time dynamic programming solution that is widely applied. It is known that an <it>o</it>(<it>n</it><sup>3</sup>) worst-case time solution is possible, but the published and suggested methods are complex and have not been established to be practical. Significant practical improvements to the original dynamic programming method have been introduced, but they retain the <it>O</it>(<it>n</it><sup>3</sup>) worst-case time bound when <it>n </it>is the only problem-parameter used in the bound. Surprisingly, the most widely-used, general technique to achieve a worst-case (and often practical) speed up of dynamic programming, the <it>Four-Russians </it>technique, has not been previously applied to the RNA-folding problem. This is perhaps due to technical issues in adapting the technique to RNA-folding.</p> <p>Results</p> <p>In this paper, we give a simple, complete, and practical Four-Russians algorithm for the basic RNA-folding problem, achieving a worst-case time-bound of <it>O</it>(<it>n</it><sup>3</sup>/log(<it>n</it>)).</p> <p>Conclusions</p> <p>We show that this time-bound can also be obtained for richer nucleotide matching scoring-schemes, and that the method achieves consistent speed-ups in practice. The contribution is both theoretical and practical, since the basic RNA-folding problem is often solved multiple times in the inner-loop of more complex algorithms, and for long RNA molecules in the study of RNA virus genomes.</p
If the Current Clique Algorithms are Optimal, so is Valiant's Parser
The CFG recognition problem is: given a context-free grammar
and a string of length , decide if can be obtained from
. This is the most basic parsing question and is a core computer
science problem. Valiant's parser from 1975 solves the problem in
time, where is the matrix multiplication
exponent. Dozens of parsing algorithms have been proposed over the years, yet
Valiant's upper bound remains unbeaten. The best combinatorial algorithms have
mildly subcubic complexity.
Lee (JACM'01) provided evidence that fast matrix multiplication is needed for
CFG parsing, and that very efficient and practical algorithms might be hard or
even impossible to obtain. Lee showed that any algorithm for a more general
parsing problem with running time can
be converted into a surprising subcubic algorithm for Boolean Matrix
Multiplication. Unfortunately, Lee's hardness result required that the grammar
size be . Nothing was known for the more relevant
case of constant size grammars.
In this work, we prove that any improvement on Valiant's algorithm, even for
constant size grammars, either in terms of runtime or by avoiding the
inefficiencies of fast matrix multiplication, would imply a breakthrough
algorithm for the -Clique problem: given a graph on nodes, decide if
there are that form a clique.
Besides classifying the complexity of a fundamental problem, our reduction
has led us to similar lower bounds for more modern and well-studied cubic time
problems for which faster algorithms are highly desirable in practice: RNA
Folding, a central problem in computational biology, and Dyck Language Edit
Distance, answering an open question of Saha (FOCS'14)
A Logic Grammar Foundation for Document Representation and Document Layout
We present a powerful grammar-based paradigm for electronic document markup: coordinated definite clause translation grammars. This markup is of a declarative character, being, in effect, a collection of constraints on the logical and physical structure of documents. To the best of our knowledge, coordinated grammars and their parsers can accommodate all of the descriptive and layout processing functionality enjoyed by extant electronic markup languages. We describe an operational prototype that demonstrates the feasibility of a syntax-directed basis for formalizing and realizing document layout
An Efficient Probabilistic Context-Free Parsing Algorithm that Computes Prefix Probabilities
We describe an extension of Earley's parser for stochastic context-free
grammars that computes the following quantities given a stochastic context-free
grammar and an input string: a) probabilities of successive prefixes being
generated by the grammar; b) probabilities of substrings being generated by the
nonterminals, including the entire string being generated by the grammar; c)
most likely (Viterbi) parse of the string; d) posterior expected number of
applications of each grammar production, as required for reestimating rule
probabilities. (a) and (b) are computed incrementally in a single left-to-right
pass over the input. Our algorithm compares favorably to standard bottom-up
parsing methods for SCFGs in that it works efficiently on sparse grammars by
making use of Earley's top-down control structure. It can process any
context-free rule format without conversion to some normal form, and combines
computations for (a) through (d) in a single algorithm. Finally, the algorithm
has simple extensions for processing partially bracketed inputs, and for
finding partial parses and their likelihoods on ungrammatical inputs.Comment: 45 pages. Slightly shortened version to appear in Computational
Linguistics 2