Automatic error recovery for LR parsers in theory and practice
This thesis argues the need for good syntax error handling schemes in language
translation systems such as compilers, and for the automatic incorporation of such schemes
into parser-generators. Syntax errors are studied in a theoretical framework and practical
methods for handling syntax errors are presented.
The theoretical framework consists of a model for syntax errors based on the concept of
a minimum prefix-defined error correction, a sentence obtainable from an erroneous string by
performing edit operations at prefix-defined (parser defined) errors. It is shown that for an
arbitrary context-free language, it is undecidable whether a better than arbitrary choice of edit
operations can be made at a prefix-defined error. For common programming languages, it is
shown that minimum-distance errors and prefix-defined errors do not necessarily coincide,
and that there exists an infinite number of programs that differ in a single symbol only; sets
of equivalent insertions are exhibited.
Two methods for syntax error recovery are presented. The methods are language-independent
and suitable for automatic generation. The first method consists of two stages,
local repair followed if necessary by phrase-level repair. The second method consists of a
single stage in which a locally minimum-distance repair is computed. Both methods are
developed for use in the practical LR parser-generator yacc, requiring no additional
specifications from the user. A scheme for the automatic generation of diagnostic messages
in terms of the source input is presented. Performance of the methods in practice is evaluated
using a formal method based on minimum-distance and prefix-defined error correction. The
methods compare favourably with existing methods for error recovery.
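As a rough sketch of the local-repair stage (the candidate token set and the parse_prefix helper below are assumptions for illustration, not the thesis's yacc integration): single-token edits are tried at the error point, and an edit is accepted only if it carries the parser several tokens past the error; otherwise the method falls back to phrase-level repair.

    # A minimal local-repair sketch; parse_prefix(tokens) is a hypothetical
    # helper returning how many tokens an LR parser accepts before failing.
    CANDIDATES = ["id", ";", "(", ")"]   # assumed insertable tokens
    CHECK_DISTANCE = 3                   # tokens a repair must carry us past

    def local_repair(tokens, error_pos, parse_prefix):
        edits = [tokens[:error_pos] + tokens[error_pos + 1:]]                # delete
        for t in CANDIDATES:
            edits.append(tokens[:error_pos] + [t] + tokens[error_pos:])      # insert
            edits.append(tokens[:error_pos] + [t] + tokens[error_pos + 1:])  # replace
        for repaired in edits:
            if parse_prefix(repaired) >= error_pos + CHECK_DISTANCE:
                return repaired          # local repair succeeded
        return None                      # fall back to phrase-level repair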
BSML: A Binding Schema Markup Language for Data Interchange in Problem Solving Environments (PSEs)
We describe a binding schema markup language (BSML) for describing data
interchange between scientific codes. Such a facility is an important
constituent of scientific problem solving environments (PSEs). BSML is designed
to integrate with a PSE or application composition system that views model
specification and execution as a problem of managing semistructured data. The
data interchange problem is addressed by three techniques for processing
semistructured data: validation, binding, and conversion. We present BSML and
describe its application to a PSE for wireless communications system design.
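As a minimal sketch of the three processing steps named above (the schema representation and field names are invented for illustration; BSML itself is a markup language): validation checks a record against the schema, binding produces a typed in-memory structure, and conversion adapts that structure for the receiving code.

    # Toy illustration of validation, binding, and conversion over a
    # semistructured record; the schema format here is our assumption.
    SCHEMA = {"frequency": float, "num_antennas": int}   # assumed field types

    def validate(record, schema):
        return set(record) == set(schema)

    def bind(record, schema):
        # binding: produce a typed in-memory structure from raw strings
        return {k: schema[k](v) for k, v in record.items()}

    def convert(bound):
        # conversion: adapt units for the receiving code, e.g. an assumed
        # MHz -> Hz convention (illustration only)
        out = dict(bound)
        out["frequency"] *= 1e6
        return out

    raw = {"frequency": "2400", "num_antennas": "4"}
    assert validate(raw, SCHEMA)
    print(convert(bind(raw, SCHEMA)))   # {'frequency': 2400000000.0, ...}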
LR(k) sparse-parsers and their optimisation
PhD Thesis. A method of syntactic analysis is developed which is believed to surpass all known competitors in all major respects.

The method is based upon that associated with the LR(k) grammars but is faster because it bypasses all reduction steps concerned with 'chain' productions. These are freely selected productions which are considered semantically irrelevant and whose right parts consist of just a single symbol. The parses produced by the method are 'sparse' in that they contain no references to chain productions; they are termed 'chain-free' parses.

The CFLR(k) grammars are introduced as the largest class which can be Chain-Free parsed from Left to Right while looking k symbols ahead of the current point of the parse. The properties of these grammars are examined in detail and their relationship to the conventional LR(k) grammars is explored. Techniques are presented for testing grammars for the CFLR(k) property and for constructing chain-free parsers for those grammars possessing the property. Methods are also presented for converting ordinary LR(k) parsers into chain-free parsers.

CFLR(k) parsers are more widely applicable than their LR(k) counterparts, are faster and provide the same excellent detection of syntactic errors. Unfortunately they also tend to be rather larger. A simple optimization is presented which completely overcomes this single disadvantage without sacrificing any of the advantages of the method.

These theoretical techniques are adapted to provide truly practical chain-free parsers based on the conventional SLR and LALR parsing methods. Detailed consideration is given to the use of 'default reductions' and related techniques for achieving compact representations of these parsers. The resulting chain-free parsers are not only faster than their ordinary counterparts, but probably smaller too. We believe their advantages are such that they should substantially replace other parsing methods currently used in programming language compilers.
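As a rough illustration of the bypass idea (not the CFLR(k) construction itself), the sketch below applies the classic unit-reduction elimination to an assumed dictionary representation of LR tables: wherever a goto lands in a state whose only possible move is a reduction by a chain production B -> C, the goto is retargeted as if that reduction had already happened, so the chain reduction never occurs at parse time.

    def eliminate_chain_reductions(goto, action, chain_prods):
        # goto: {state: {symbol: state}}; action: {state: {lookahead: act}}
        # chain_prods maps production ids to their left-hand sides, e.g.
        # {"B->C": "B"}; the grammar is assumed free of chain cycles.
        changed = True
        while changed:                       # fixpoint: chains may be nested
            changed = False
            for row in goto.values():
                for sym, t in list(row.items()):
                    acts = set(action.get(t, {}).values())
                    if len(acts) != 1:
                        continue             # state does more than one thing
                    (act,) = acts
                    if act[0] == "reduce" and act[1] in chain_prods:
                        lhs = chain_prods[act[1]]
                        if lhs in row and row[sym] != row[lhs]:
                            row[sym] = row[lhs]   # bypass the chain reduction
                            changed = True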
Table driven prediction for recursive descent LL(k) parsers
Programming languages are typically described in BNF or some extension of BNF, and the process of converting these descriptions into parsers is performed by parser generators. Some of the parser generators that convert LL grammars into parsers construct them as recursive-descent parsers, which gives them context during execution. The context is provided by the execution stack as the parser descends into the grammar, and this is what allows the full expressiveness of LL grammars. Table-driven parsers can be generated instead, but restrictions are placed on the LL grammars that can be accepted. The benefit of tables is that they facilitate a separation between syntax analysis and the semantic code written by a language designer. They are also faster, and they simplify language implementation and modification. This paper proposes a hybrid system that makes decisions using tables, but once decisions are made, recursive descent is employed to maintain context. The benefits of each system are maintained, and the drawbacks are mitigated. Also discussed are the modifications made to an existing parser generator, oops (version 2), so that it accepts LL(k) grammars and builds parsers using this system as a proof of concept.
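As a minimal sketch of the proposed hybrid (the toy grammar E -> T E', E' -> '+' T E' | eps, T -> 'num' and all names are ours, not those used by oops): production choice is table driven, but each production is then parsed by a recursive-descent procedure, so the call stack still supplies context.

    TABLE = {                     # (nonterminal, lookahead) -> production
        ("E",  "num"): "E->TE'",
        ("E'", "+"):   "E'->+TE'",
        ("E'", None):  "E'->eps",
        ("T",  "num"): "T->num",
    }

    def parse_E(toks):
        parse_T(toks)
        parse_Ep(toks)

    def parse_Ep(toks):
        look = toks[0] if toks else None
        rule = TABLE.get(("E'", look))    # table-driven prediction ...
        if rule == "E'->+TE'":
            toks.pop(0)                   # ... recursive descent keeps context
            parse_T(toks)
            parse_Ep(toks)
        elif rule != "E'->eps":
            raise SyntaxError(f"unexpected token: {look!r}")

    def parse_T(toks):
        look = toks[0] if toks else None
        if TABLE.get(("T", look)) != "T->num":
            raise SyntaxError(f"expected num, got {look!r}")
        toks.pop(0)

    parse_E(["num", "+", "num"])          # accepts; bad input raises SyntaxError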
Protecting Systems From Exploits Using Language-Theoretic Security
Any computer program processing input from the user or network must validate the input. Input-handling vulnerabilities occur in programs when the software component responsible for filtering malicious input, the parser, does not perform validation adequately. Consequently, parsers are among the most targeted components, since they defend the rest of the program from malicious input. This thesis adopts the Language-Theoretic Security (LangSec) principle to understand what tools and research are needed to prevent exploits that target parsers. LangSec proposes specifying the syntactic structure of the input format as a formal grammar. We then build a recognizer for this formal grammar to validate any input before the rest of the program acts on it. To ensure that these recognizers represent the data format, programmers often rely on parser generators or parser combinator tools to build the parsers. This thesis propels several sub-fields in LangSec by proposing new techniques to find bugs in implementations, novel categorizations of vulnerabilities, and new parsing algorithms and tools to handle practical data formats. To this end, this thesis comprises five parts that tackle various tenets of LangSec. First, I categorize various input-handling vulnerabilities and exploits using two frameworks: the mismorphisms framework, which helps us reason about the root causes leading to various vulnerabilities, and a categorization framework built from various LangSec anti-patterns, such as parser differentials and insufficient input validation. We then built a catalog of more than 30 popular vulnerabilities to demonstrate the categorization frameworks. Second, I built parsers for various Internet of Things and power grid network protocols and the iccMAX file format using parser combinator libraries. The parsers I built for power grid protocols were deployed and tested on power grid substation networks as an intrusion detection tool. The parser I built for the iccMAX file format led to several corrections and modifications to the iccMAX specifications and reference implementations. Third, I present SPARTA, a novel tool I built that generates Rust code that type checks Portable Document Format (PDF) files. The type checker I helped build strictly enforces the constraints in the PDF specification to find deviations. Our checker has contributed to at least four significant clarifications and corrections to the PDF 2.0 specification and various open-source PDF tools. In addition to our checker, we also built a practical tool, PDFFixer, to dynamically patch type errors in PDF files. Fourth, I present ParseSmith, a tool to build verified parsers for real-world data formats. Most parsing tools available for data formats are insufficient to handle practical formats or have not been verified for their correctness. I built a verified parsing tool in Dafny that builds on ideas from attribute grammars, data-dependent grammars, and parsing expression grammars to tackle various constructs commonly seen in network formats. I prove that our parsers run in linear time and always terminate for well-formed grammars. Finally, I provide the earliest systematic comparison of various data description languages (DDLs) and their parser generation tools. DDLs are used to describe and parse commonly used data formats, such as image formats. I conducted an expert-elicitation qualitative study to derive various metrics that I use to compare the DDLs. I also systematically compare these DDLs based on sample data descriptions available with the DDLs, checking for correctness and resilience.
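As a minimal sketch of the recognizer-first discipline the thesis builds on (the toy format, a comma-separated list of unsigned integers, and the regular-expression recognizer are our illustration, not the thesis's parser-combinator or Dafny tooling):

    import re

    INT_LIST = re.compile(r"\d+(?:,\d+)*")     # the whole input language

    def recognize(raw: str) -> bool:
        return INT_LIST.fullmatch(raw) is not None

    def process(raw: str) -> int:
        if not recognize(raw):                 # reject before acting on input
            raise ValueError("input not in the specified language")
        return sum(int(x) for x in raw.split(","))

    print(process("1,2,3"))                    # prints 6
    # process("1,2,;drop") raises ValueError before any logic runs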
Contributions to the Construction of Extensible Semantic Editors
This dissertation addresses the need for easier construction and extension of language tools. Specifically, the construction and extension of so-called semantic editors is considered, that is, editors providing semantic services for code comprehension and manipulation. Editors like these are typically found in state-of-the-art development environments, where they have been developed by hand. The list of programming languages available today is extensive and, with the lively creation of new programming languages and the evolution of old languages, it keeps growing. Many of these languages would benefit from proper tool support. Unfortunately, the development of a semantic editor can be a time-consuming and error-prone endeavor, and too large an effort for most language communities. Given the complex nature of programming, and the huge benefits of good tool support, this lack of tools is problematic. In this dissertation, an attempt is made at narrowing the gap between generative solutions and how state-of-the-art editors are constructed today. A generative alternative for construction of textual semantic editors is explored with focus on how to specify extensible semantic editor services. Specifically, this dissertation shows how semantic services can be specified using a semantic formalism called reference attribute grammars (RAGs), and how these services can be made responsive enough for editing, and be provided also when the text in an editor is erroneous. Results presented in this dissertation have been found useful, both in industry and in academia, suggesting that the explored approach may help to reduce the effort of editor construction.
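As a toy illustration of the RAG idea (the node classes, the demand-evaluated decl attribute, and the lookup rule are our invention, not JastAdd's API): an attribute is defined by an equation on an AST class, and a reference attribute evaluates to another AST node, here the declaration a use refers to.

    from functools import cached_property

    class Block:
        def __init__(self, decls, uses):
            self.decls, self.uses = decls, uses
            for u in uses:
                u.block = self          # parent link, set once

    class Decl:
        def __init__(self, name): self.name = name

    class Use:
        def __init__(self, name): self.name = name
        @cached_property                # attribute: evaluated on demand, cached
        def decl(self):                 # reference attribute -> a Decl node
            for d in self.block.decls:
                if d.name == self.name:
                    return d
            return None                 # unresolved name: a semantic error

    b = Block([Decl("x")], [Use("x"), Use("y")])
    print([u.decl is not None for u in b.uses])   # [True, False]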
Null Element Restoration
Understanding the syntactic structure of a sentence is a necessary preliminary to understanding its semantics, and therefore to many practical applications. The field of natural language processing has achieved a high degree of accuracy in parsing, at least in English. However, the syntactic structures produced by the most commonly used parsers are less detailed than the structures found in the treebanks the parsers were trained on. In particular, these parsers typically lack the null elements used to indicate wh-movement, control, and other phenomena.
This thesis presents a system for inserting these null elements into parse trees in English. It then examines the problem in Arabic, which motivates a second, joint-inference system with improved performance on English as well. Finally, it examines the application of information derived from the Google Web 1T corpus as a way of reducing certain data sparsity issues related to wh-movement.
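As a toy illustration of the task (a hand-written rule over simplified Penn-Treebank-style trees, not the thesis's statistical system): restore a wh-trace when an SBAR introduced by a WHNP dominates a verb phrase with no object.

    # Trees are (label, children...) tuples; labels and the rule are assumptions.
    def restore_trace(tree):
        if isinstance(tree, str):
            return tree
        label, *kids = tree
        kids = [restore_trace(k) for k in kids]
        if label == "SBAR" and kids and isinstance(kids[0], tuple) and kids[0][0] == "WHNP":
            kids = [insert_trace(k) for k in kids]
        return (label, *kids)

    def insert_trace(tree):
        if isinstance(tree, str):
            return tree
        label, *kids = tree
        if label == "VP" and not any(k[0] == "NP" for k in kids if isinstance(k, tuple)):
            return (label, *kids, ("NP", ("-NONE-", "*T*")))   # restored null element
        return (label, *(insert_trace(k) for k in kids))

    q = ("SBAR", ("WHNP", "what"), ("S", ("NP", "she"), ("VP", ("V", "saw"))))
    print(restore_trace(q))
    # ... ('VP', ('V', 'saw'), ('NP', ('-NONE-', '*T*')))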
On-the-Fly Syntax Highlighting using Neural Networks
With the presence of online collaborative tools for software developers,
source code is shared and consulted frequently, from code viewers to merge
requests and code snippets. Typically, code highlighting quality in such
scenarios is sacrificed in favor of system responsiveness. In these on-the-fly
settings, performing a formal grammatical analysis of the source code is not
only expensive, but also intractable for the many times the input is an invalid
derivation of the language. Indeed, current popular highlighters heavily rely
on a system of regular expressions, typically far from the specification of the
language's lexer. Due to their complexity, regular expressions need to be
updated periodically as feedback is collected from users, and their design
does not lend itself to detecting more complex language constructs. This paper
delivers a deep learning-based approach suitable for on-the-fly grammatical
code highlighting of correct and incorrect language derivations, such as code
viewers and snippets. It focuses on alleviating the burden on the developers,
who can reuse the language's parsing strategy to produce the desired
highlighting specification. Moreover, this approach is compared to today's
online syntax highlighting tools and formal methods in terms of accuracy and
execution time, across different levels of grammatical coverage, for three
mainstream programming languages. The results obtained show how the proposed
approach can consistently achieve near-perfect accuracy in its predictions,
thereby outperforming regular expression-based strategies.
Comment: Accepted for publication in the ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE 2022).
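As a minimal sketch of the regular-expression strategy the paper compares against (the patterns cover a toy slice of one language and are our illustration, not any real highlighter's rules): ordered patterns scan the source and tag each match with a highlighting class, with no grammatical analysis at all.

    import re

    RULES = [                      # order matters: first match wins
        ("comment", r"#[^\n]*"),
        ("string",  r"'[^']*'"),
        ("keyword", r"\b(?:if|else|while|def|return)\b"),
        ("number",  r"\b\d+\b"),
        ("name",    r"\b\w+\b"),
    ]
    SCANNER = re.compile("|".join(f"(?P<{n}>{p})" for n, p in RULES))

    def highlight(src):
        return [(m.lastgroup, m.group()) for m in SCANNER.finditer(src)]

    print(highlight("if x1 > 9: return 'ok'  # done"))
    # [('keyword', 'if'), ('name', 'x1'), ('number', '9'), ...]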
A linguistic rule-based approach to extract drug-drug interactions from pharmacological documents
Background: A drug-drug interaction (DDI) occurs when one drug influences the level or activity of another drug. The increasing volume of the scientific literature overwhelms health care professionals trying to keep up to date with all published studies on DDI.
Methods: This paper describes a hybrid linguistic approach to DDI extraction that combines shallow parsing and syntactic simplification with pattern matching. Appositions and coordinate structures are interpreted based on shallow syntactic parsing provided by the UMLS MetaMap tool (MMTx). Subsequently, complex and compound sentences are broken down into clauses, from which simple sentences are generated by a set of simplification rules. A pharmacist defined a set of domain-specific lexical patterns to capture the most common expressions of DDI in texts. These lexical patterns are matched against the generated sentences in order to extract DDIs.
Results: We have performed different experiments to analyze the performance of the different processes. The lexical patterns achieve reasonable precision (67.30%) but very low recall (14.07%). The inclusion of appositions and coordinate structures helps to improve recall (25.70%), although precision is lower (48.69%). The detection of clauses does not improve performance.
Conclusions: Information Extraction (IE) techniques can provide an interesting way of reducing the time health care professionals spend reviewing the literature. Nevertheless, no previous approach has been carried out to extract DDIs from texts. To the best of our knowledge, this work proposes the first integral solution for the automatic extraction of DDIs from biomedical texts.
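As a toy sketch of the lexical-pattern stage (the paper's patterns were defined by a pharmacist; the templates below are invented): simple interaction templates are matched against simplified sentences in which drug names have already been tagged.

    import re

    PATTERNS = [   # hypothetical DDI templates over drug-tagged text
        r"DRUG (?:increases|decreases|alters) the (?:level|activity) of DRUG",
        r"DRUG should not be (?:used|administered) with DRUG",
    ]
    DDI = re.compile("|".join(f"(?:{p})" for p in PATTERNS))

    def has_ddi(tagged_sentence: str) -> bool:
        return DDI.search(tagged_sentence) is not None

    print(has_ddi("DRUG increases the level of DRUG"))        # True
    print(has_ddi("DRUG is indicated for mild headache"))     # False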