Clique-Based Lower Bounds for Parsing Tree-Adjoining Grammars
If the Current Clique Algorithms are Optimal, so is Valiant's Parser
The CFG recognition problem is: given a context-free grammar G
and a string s of length n, decide if s can be obtained from
G. This is the most basic parsing question and is a core computer
science problem. Valiant's parser from 1975 solves the problem in
O(n^ω) time, where ω is the matrix multiplication
exponent. Dozens of parsing algorithms have been proposed over the years, yet
Valiant's upper bound remains unbeaten. The best combinatorial algorithms have
mildly subcubic complexity.
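For contrast with Valiant's O(n^ω) bound, the cubic combinatorial baseline is the classic CYK dynamic program. A minimal sketch in Python (the grammar encoding and the toy grammar for {aⁿbⁿ} are illustrative, not from the paper):

```python
def cyk_recognize(word, start, unary, binary):
    """Classic O(|G| * n^3) CYK recognition for a grammar in
    Chomsky normal form.

    unary:  dict terminal -> set of nonterminals A with rule A -> terminal
    binary: list of rules (A, B, C) meaning A -> B C
    """
    n = len(word)
    # table[i][j] = set of nonterminals deriving word[i:j]
    table = [[set() for _ in range(n + 1)] for _ in range(n + 1)]
    for i, ch in enumerate(word):
        table[i][i + 1] = set(unary.get(ch, ()))
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):  # try every split point
                for a, b, c in binary:
                    if b in table[i][k] and c in table[k][j]:
                        table[i][j].add(a)
    return start in table[0][n]

# Toy CNF grammar for {a^n b^n}: S -> A T | A B, T -> S B, A -> a, B -> b
UNARY = {"a": {"A"}, "b": {"B"}}
BINARY = [("S", "A", "T"), ("S", "A", "B"), ("T", "S", "B")]
```

For example, `cyk_recognize("aabb", "S", UNARY, BINARY)` accepts, while `"aab"` is rejected. The three nested span loops are what Valiant's algorithm restructures into matrix products.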
Lee (JACM'01) provided evidence that fast matrix multiplication is needed for
CFG parsing, and that very efficient and practical algorithms might be hard or
even impossible to obtain. Lee showed that any algorithm for a more general
parsing problem with running time O(|G| · n^{3−ε}) can
be converted into a surprising subcubic algorithm for Boolean Matrix
Multiplication. Unfortunately, Lee's hardness result required that the grammar
size be Ω(n^6). Nothing was known for the more relevant
case of constant size grammars.
In this work, we prove that any improvement on Valiant's algorithm, even for
constant size grammars, either in terms of runtime or by avoiding the
inefficiencies of fast matrix multiplication, would imply a breakthrough
algorithm for the k-Clique problem: given a graph on n nodes, decide if
there are k nodes that form a clique.
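For concreteness, k-Clique can be decided naively by trying every k-subset of nodes, in roughly n^k time; a brute-force sketch (illustrative only — the point of the reduction is that substantially beating the best known matrix-multiplication-based bounds would be a breakthrough):

```python
from itertools import combinations

def has_k_clique(n, edges, k):
    """Decide whether the graph on nodes 0..n-1 contains a k-clique.

    edges: iterable of node pairs. Brute force over all k-subsets,
    so the running time is O(n^k * k^2).
    """
    adj = {frozenset(e) for e in edges}
    return any(
        all(frozenset(pair) in adj for pair in combinations(nodes, 2))
        for nodes in combinations(range(n), k)
    )
```

On a 4-node graph with edges {0-1, 1-2, 0-2, 2-3}, there is a triangle (k = 3) but no 4-clique.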
Besides classifying the complexity of a fundamental problem, our reduction
has led us to similar lower bounds for more modern and well-studied cubic time
problems for which faster algorithms are highly desirable in practice: RNA
Folding, a central problem in computational biology, and Dyck Language Edit
Distance, answering an open question of Saha (FOCS'14).
Parsing Linear Context-Free Rewriting Systems with Fast Matrix Multiplication
We describe a matrix multiplication recognition algorithm for a subset of
binary linear context-free rewriting systems (LCFRS) with running time
O(n^{ωd}), where M(m) = O(m^ω) is the running time for m × m matrix multiplication and d is the "contact rank" of the LCFRS --
the maximal number of combination and non-combination points that appear in the
grammar rules. We also show that this algorithm can be used as a subroutine to
get a recognition algorithm for general binary LCFRS with running time
O(n^{ωd+1}). The currently best known ω is smaller than
2.38. Our result provides another proof for the best known result for parsing
mildly context sensitive formalisms such as combinatory categorial grammars,
head grammars, linear indexed grammars, and tree adjoining grammars, which can
be parsed in time O(n^{4.76}). It also shows that inversion transduction
grammars can be parsed in time O(n^{5.76}). In addition, binary LCFRS
subsumes many other formalisms and types of grammars, for some of which we also
improve the asymptotic complexity of parsing.
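The link between parsing and matrix multiplication, common to Valiant's parser and the LCFRS algorithm above, is visible in a single chart update: for a binary rule A -> B C, combining chart entries over all split points is exactly a Boolean matrix product. A pure-Python sketch (the chart encoding is illustrative):

```python
def combine_rule(chart_b, chart_c):
    """One Valiant-style chart update for a binary rule A -> B C.

    chart_b[i][k] is True iff B derives the span w[i:k]; chart_c[k][j]
    likewise for C. The update is a Boolean matrix product over the
    split index k, so implemented with fast Boolean matrix
    multiplication it inherits that complexity.
    """
    n = len(chart_b)
    return [
        [any(chart_b[i][k] and chart_c[k][j] for k in range(n))
         for j in range(n)]
        for i in range(n)
    ]
```

For instance, if B covers w[0:1] and C covers w[1:3], the product marks A as covering w[0:3].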
28th Annual Symposium on Combinatorial Pattern Matching: CPM 2017, July 4-6, 2017, Warsaw, Poland
Peer reviewed
Integrated supertagging and parsing
EuroMatrixPlus project funded by the European Commission, 7th Framework Programme

Parsing is the task of assigning syntactic or semantic structure to a natural language
sentence. This thesis focuses on syntactic parsing with Combinatory Categorial Grammar
(CCG; Steedman 2000). CCG allows incremental processing, which is essential
for speech recognition and some machine translation models, and it can build semantic
structure in tandem with syntactic parsing. Supertagging solves a subset of the parsing
task by assigning lexical types to words in a sentence using a sequence model. It has
emerged as a way to improve the efficiency of full CCG parsing (Clark and Curran,
2007) by reducing the parser’s search space. This has been very successful and it is the
central theme of this thesis.
We begin with an analysis of how efficiency is traded for accuracy in supertagging.
Pruning the search space with a supertagger is inherently approximate; to contrast
this, we include A*, a classic exact search technique, in our analysis. Interestingly,
we find that combining the two methods improves efficiency but we also demonstrate
that excessive pruning by a supertagger significantly lowers the upper bound on accuracy
of a CCG parser.
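The pruning mechanism at issue can be sketched concretely: the supertagger assigns each word a distribution over lexical categories, and the parser keeps only categories whose probability is within a multiplicative beam β of the best one (the β-cutoff scheme used by Clark and Curran; the function and example values here are illustrative):

```python
def prune_supertags(tag_probs, beta=0.075):
    """Per-word beam pruning of CCG lexical categories.

    tag_probs: one dict per word mapping category -> probability.
    Keeps categories with probability >= beta * (best probability
    for that word); a smaller beta prunes less and leaves the parser
    a larger search space.
    """
    pruned = []
    for dist in tag_probs:
        cutoff = beta * max(dist.values())
        pruned.append({cat: p for cat, p in dist.items() if p >= cutoff})
    return pruned
```

With an aggressive beam the tagger may discard the very category the correct parse needs, which is exactly the accuracy upper-bound effect measured in the analysis above.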
Inspired by this analysis, we design a single integrated model with both supertagging
and parsing features, rather than separating them into distinct models chained
together in a pipeline. To overcome the resulting complexity, we experiment with both
loopy belief propagation and dual decomposition approaches to inference, the first empirical
comparison of these algorithms that we are aware of on a structured natural
language processing problem.
Finally, we address training the integrated model. We adopt the idea of optimising
directly for a task-specific metric, as is common in other areas such as statistical
machine translation. We demonstrate how a novel dynamic programming algorithm
enables us to optimise for F-measure, our task-specific evaluation metric, and experiment
with approximations, which prove to be excellent substitutes.
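The task-specific metric in question is dependency F-measure, the harmonic mean of precision and recall over recovered dependencies. As a reference point, the direct set-level computation is (the set representation is assumed for illustration):

```python
def f_measure(gold, predicted):
    """F1 over dependency sets: harmonic mean of precision and recall.

    gold, predicted: sets of dependency tuples, e.g.
    (head, category, slot, dependent).
    """
    correct = len(gold & predicted)
    if correct == 0:
        return 0.0
    precision = correct / len(predicted)
    recall = correct / len(gold)
    return 2 * precision * recall / (precision + recall)
```

Note that F-measure does not decompose over individual dependencies, which is why optimising it directly requires the kind of dynamic programming algorithm the thesis develops.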
Each of the presented methods improves over the state-of-the-art in CCG parsing.
Moreover, the improvements are additive, achieving a labelled/unlabelled dependency
F-measure on CCGbank of 89.3%/94.0% with gold part-of-speech tags, and
87.2%/92.8% with automatic part-of-speech tags, the best reported results for this task
to date. Our techniques are general and we expect them to apply to other parsing problems,
including lexicalised tree adjoining grammar and context-free grammar parsing.
Non-size increasing Graph Rewriting for Natural Language Processing
A very large amount of work in Natural Language Processing uses trees as the first-class mathematical structures for representing linguistic objects such as parsed sentences or feature structures. However, some linguistic phenomena are not naturally captured by trees: for instance, in the sentence "Max decides to leave", "Max" is the subject of both predicates "to decide" and "to leave". Tree-based linguistic formalisms generally resort to some encoding to handle sentences like this example. In earlier papers, we discussed the benefits of using graphs rather than trees for linguistic structures, and we showed how Graph Rewriting could be used to process them, for instance in transforming the syntax of a sentence into its semantics. Our experiments have shown that Graph Rewriting applications to Natural Language Processing do not require the full computational power of the general Graph Rewriting setting. The most important observation is that all graph vertices in the final structures are, in some sense, "predictable" from the input data, so we can work in the framework of non-size-increasing Graph Rewriting. In our previous papers, we formally described the Graph Rewriting calculus we use; our purpose here is to study the theoretical aspects of termination with respect to this calculus. In our framework, we show that uniform termination is undecidable and that non-uniform termination is decidable. We define termination techniques based on weights, we prove the termination of weighted rewriting systems, and we give complexity bounds on derivation lengths for these rewriting systems.
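A toy illustration of the non-size-increasing restriction (the rule, labels, and weights below are invented for illustration, not taken from the authors' calculus): a rewrite step may relabel edges over a fixed vertex set but never creates a vertex, and assigning each label a weight that strictly decreases at every step yields the kind of weight-based termination proof discussed above.

```python
def relabel_rule(edges, old_label, new_label):
    """Non-size-increasing rewrite step: relabel matching edges.

    edges: set of (source, label, target) triples over a fixed vertex
    set. No vertex is ever created, so the graph cannot grow.
    """
    return {(s, new_label if l == old_label else l, t)
            for s, l, t in edges}

# "Max decides to leave": "Max" is the subject of both predicates,
# which a graph (unlike a tree) represents directly with two edges.
SYNTAX = {
    ("decides", "subj", "Max"),
    ("decides", "comp", "leave"),
    ("leave", "subj", "Max"),
}

# Invented weights: rewriting "subj" into the semantic label "ARG0"
# strictly decreases the total weight, so rewriting must terminate.
WEIGHT = {"subj": 2, "comp": 1, "ARG0": 1}

def total_weight(edges):
    return sum(WEIGHT[l] for _, l, _ in edges)
```

Applying `relabel_rule(SYNTAX, "subj", "ARG0")` keeps the three vertices and three edges while lowering the total weight, a minimal instance of the termination argument for weighted rewriting systems.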