Structure Sharing and Parallelization in a GB Parser
By utilizing structure sharing among its parse trees, a GB parser can increase its efficiency dramatically. Using a GB parser which has as its phrase structure recovery component an implementation of Tomita's algorithm (as described in [Tom86]), we investigate how a GB parser can preserve the structure sharing output by Tomita's algorithm. In this report, we discuss the implications of using Tomita's algorithm in GB parsing, and we give some details of the structure-sharing parser currently under construction. We also discuss a method of parallelizing a GB parser, and relate it to the existing literature on parallel GB parsing. Our approach to preserving sharing within a shared-packed forest is applicable not only to GB parsing, but any time we want to preserve structure sharing in a parse forest in the presence of features
One Parser to Rule Them All
Despite the long history of research in parsing, constructing parsers for real programming languages remains a difficult and painful task. In the last decades, different parser generators emerged to allow the construction of parsers from a BNF-like specification. However, still today, many parsers are handwritten, or are only partly generated, and include various hacks to deal with different peculiarities in programming languages. The main problem is that current declarative syntax definition techniques are based on pure context-free grammars, while many constructs found in programming languages require context information.
In this paper we propose a parsing framework that embraces context information in its core. Our framework is based on data-dependent grammars, which extend context-free grammars with arbitrary computation, variable binding and constraints. We present an implementation of our framework on top of the Generalized LL (GLL) parsing algorithm, and show how common idioms in syntax of programming languages such as (1) lexical disambiguation filters, (2) operator precedence, (3) indentation-sensitive rules, and (4) conditional preprocessor directives can be mapped to data-dependent grammars. We demonstrate the initial experience with our framework, by parsing more than 20000 Java, C#, Haskell, and OCaml source files
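To make the idea of a data-dependent rule concrete, here is a minimal sketch in Python. It is not the paper's framework and does not use GLL; the function name `parse_length_prefixed` and the `<digits>:<payload>` format are assumptions chosen for illustration. The rule binds a variable `n` from the input and then constrains how much input the next symbol may consume, which a pure context-free grammar cannot express.

```python
def parse_length_prefixed(text, pos=0):
    """Parse `<digits>:<payload>` starting at `pos`.

    The digits bind a variable n (variable binding); the payload must then
    be exactly n characters long (constraint). Returns (payload, next_pos)
    on success, or None if the input does not match.
    """
    end = pos
    while end < len(text) and text[end].isdigit():
        end += 1
    # Require at least one digit followed by a colon.
    if end == pos or end >= len(text) or text[end] != ":":
        return None
    n = int(text[pos:end])        # value computed from the input itself
    start = end + 1
    if start + n > len(text):     # constraint: n characters must be available
        return None
    return text[start:start + n], start + n


if __name__ == "__main__":
    print(parse_length_prefixed("3:abcde"))
    print(parse_length_prefixed("5:abc"))
```

The same pattern (bind a value, then test it in a later rule) is one way to think about indentation-sensitive rules, where the bound value would be a column number rather than a length.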
Survey on Instruction Selection: An Extensive and Modern Literature Review
Instruction selection is one of three optimisation problems involved in the
code generator backend of a compiler. The instruction selector is responsible
for transforming an input program from its target-independent representation
into a target-specific form by making best use of the available machine
instructions. Hence instruction selection is a crucial part of efficient code
generation.
Despite on-going research since the late 1960s, the last comprehensive
survey on the field was written more than 30 years ago. As new approaches and
techniques have appeared since its publication, this brings forth a need for a
new, up-to-date review of the current body of literature. This report addresses
that need by performing an extensive review and categorisation of existing
research. The report therefore supersedes and extends the previous surveys, and
also attempts to identify where future research should be directed.
Comment: Major changes: - Merged simulation chapter with macro expansion
chapter - Addressed misunderstandings of several approaches - Completely
rewrote many parts of the chapters; strengthened the discussion of many
approaches - Revised the drawing of all trees and graphs to put the root at
the top instead of at the bottom - Added appendix for listing the approaches
in a table See doc for more inf
Deriving modernity signatures of codebases with static analysis
This paper addresses the problem of determining the modernity of software systems by analysing the use of new language features and their adoption over time. We propose the concept of modernity signatures to estimate the age of a codebase, naturally adjusted for maintenance practices, such that the modernity of a regularly updated system would be above that of a more recently created one which neglects current features and best practices. This can provide insights into coding practices, codebase health and the evolution of software languages. We present case studies on PHP and Python code, demonstrating the effectiveness of modernity signatures in determining the age of a codebase without executing the code or performing extensive human inspection. The paper describes the technical implementation details of generating the modernity signature for both of these languages, including the use of existing tools like the PHP parser and Vermin. The findings suggest that modernity signatures can aid developers in many ways from choosing whether to use a system or how to approach its maintenance, to assessing usefulness of a language feature, thus providing a valuable tool for source code analysis and manipulation
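As a rough illustration of the idea for Python code, the sketch below counts syntax features by the language version that introduced them and normalises the counts. It is an assumption-laden simplification, not the paper's implementation: it uses only the standard `ast` module rather than Vermin, and the `FEATURE_VERSIONS` table and `modernity_signature` function are hypothetical names covering just four features.

```python
import ast
from collections import Counter

# Hypothetical, deliberately tiny feature table: AST node name -> Python
# version that introduced the corresponding syntax.
FEATURE_VERSIONS = {
    "AsyncFunctionDef": (3, 5),  # async def
    "JoinedStr": (3, 6),         # f-strings
    "NamedExpr": (3, 8),         # walrus operator :=
    "Match": (3, 10),            # match statement
}


def modernity_signature(source):
    """Return {version: share of detected feature uses} for one source string.

    The source must be parseable by the running interpreter, so newer syntax
    can only be analysed on a sufficiently recent Python.
    """
    counts = Counter()
    for node in ast.walk(ast.parse(source)):
        version = FEATURE_VERSIONS.get(type(node).__name__)
        if version is not None:
            counts[version] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {version: count / total for version, count in sorted(counts.items())}


if __name__ == "__main__":
    sample = 'x = f"{1}"\nif (y := 2):\n    pass\n'
    print(modernity_signature(sample))
```

Because the code is only parsed and never executed, this stays within static analysis; a realistic signature would need a far larger feature table and aggregation over a whole codebase.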
Integrated supertagging and parsing
EuroMatrixPlus project funded by the European Commission, 7th Framework Programme.
Parsing is the task of assigning syntactic or semantic structure to a natural language
sentence. This thesis focuses on syntactic parsing with Combinatory Categorial Grammar
(CCG; Steedman 2000). CCG allows incremental processing, which is essential
for speech recognition and some machine translation models, and it can build semantic
structure in tandem with syntactic parsing. Supertagging solves a subset of the parsing
task by assigning lexical types to words in a sentence using a sequence model. It has
emerged as a way to improve the efficiency of full CCG parsing (Clark and Curran,
2007) by reducing the parser’s search space. This has been very successful and it is the
central theme of this thesis.
We begin with an analysis of how efficiency is being traded for accuracy in supertagging.
Pruning the search space by supertagging is inherently approximate, and to contrast
this we include A*, a classic exact search technique, in our analysis. Interestingly,
we find that combining the two methods improves efficiency but we also demonstrate
that excessive pruning by a supertagger significantly lowers the upper bound on accuracy
of a CCG parser.
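The trade-off can be sketched with a threshold-style pruning rule of the kind commonly used with CCG supertaggers: for each word, keep every category whose probability is within a factor `beta` of the best one. Whether the thesis uses exactly this rule is an assumption; the function name and the toy distribution below are hypothetical.

```python
def prune_supertags(distribution, beta):
    """Keep categories whose probability is at least beta times the best.

    `distribution` maps a lexical category to its probability for one word.
    A larger beta prunes more aggressively: the parser has less to search,
    but the correct category is more likely to be discarded.
    """
    best = max(distribution.values())
    return {cat for cat, prob in distribution.items() if prob >= beta * best}


if __name__ == "__main__":
    # Toy distribution for a single word (made-up numbers).
    word_dist = {"NP": 0.6, "N": 0.3, "S/NP": 0.05}
    print(prune_supertags(word_dist, beta=0.1))
    print(prune_supertags(word_dist, beta=0.01))
```

If the gold category for a word falls below the threshold, no later parsing step can recover it, which is how excessive pruning caps the accuracy a parser can reach.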
Inspired by this analysis, we design a single integrated model with both supertagging
and parsing features, rather than separating them into distinct models chained
together in a pipeline. To overcome the resulting complexity, we experiment with both
loopy belief propagation and dual decomposition approaches to inference, the first empirical
comparison of these algorithms that we are aware of on a structured natural
language processing problem.
Finally, we address training the integrated model. We adopt the idea of optimising
directly for a task-specific metric such as is common in other areas like statistical
machine translation. We demonstrate how a novel dynamic programming algorithm
enables us to optimise for F-measure, our task-specific evaluation metric, and experiment
with approximations, which prove to be excellent substitutions.
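For reference, the metric itself can be computed as below. This is only the standard F-measure over sets of dependencies, not the dynamic programming algorithm the thesis describes; the tuple layout `(head, dependent, label)` and the function name are assumptions made for the example.

```python
def dependency_f_measure(predicted, gold):
    """Balanced F-measure (harmonic mean of precision and recall).

    `predicted` and `gold` are iterables of hashable dependencies, here
    assumed to be (head, dependent, label) tuples.
    """
    predicted, gold = set(predicted), set(gold)
    correct = len(predicted & gold)
    if correct == 0:
        return 0.0
    precision = correct / len(predicted)
    recall = correct / len(gold)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    gold = {(2, 1, "subj"), (0, 2, "root"), (2, 3, "obj")}
    predicted = {(2, 1, "subj"), (0, 2, "root"), (3, 2, "obj")}
    print(dependency_f_measure(predicted, gold))
```

Dropping the label from each tuple before calling the function would give the unlabelled variant of the same score.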
Each of the presented methods improves over the state-of-the-art in CCG parsing.
Moreover, the improvements are additive, achieving a labelled/unlabelled dependency
F-measure on CCGbank of 89.3%/94.0% with gold part-of-speech tags, and
87.2%/92.8% with automatic part-of-speech tags, the best reported results for this task
to date. Our techniques are general and we expect them to apply to other parsing problems,
including lexicalised tree adjoining grammar and context-free grammar parsing
- …