608 research outputs found

    Automatic error recovery for LR parsers in theory and practice

    This thesis argues the need for good syntax error handling schemes in language translation systems such as compilers, and for the automatic incorporation of such schemes into parser-generators. Syntax errors are studied in a theoretical framework and practical methods for handling syntax errors are presented. The theoretical framework consists of a model for syntax errors based on the concept of a minimum prefix-defined error correction, a sentence obtainable from an erroneous string by performing edit operations at prefix-defined (parser-defined) errors. It is shown that for an arbitrary context-free language, it is undecidable whether a better than arbitrary choice of edit operations can be made at a prefix-defined error. For common programming languages, it is shown that minimum-distance errors and prefix-defined errors do not necessarily coincide, and that there exists an infinite number of programs that differ in a single symbol only; sets of equivalent insertions are exhibited. Two methods for syntax error recovery are presented. The methods are language independent and suitable for automatic generation. The first method consists of two stages, local repair followed if necessary by phrase-level repair. The second method consists of a single stage in which a locally minimum-distance repair is computed. Both methods are developed for use in the practical LR parser-generator yacc, requiring no additional specifications from the user. A scheme for the automatic generation of diagnostic messages in terms of the source input is presented. Performance of the methods in practice is evaluated using a formal method based on minimum-distance and prefix-defined error correction. The methods compare favourably with existing methods for error recovery.

    BSML: A Binding Schema Markup Language for Data Interchange in Problem Solving Environments (PSEs)

    We describe a binding schema markup language (BSML) for describing data interchange between scientific codes. Such a facility is an important constituent of scientific problem solving environments (PSEs). BSML is designed to integrate with a PSE or application composition system that views model specification and execution as a problem of managing semistructured data. The data interchange problem is addressed by three techniques for processing semistructured data: validation, binding, and conversion. We present BSML and describe its application to a PSE for wireless communications system design

    LR(k) sparse-parsers and their optimisation

    PhD Thesis. A method of syntactic analysis is developed which is believed to surpass all known competitors in all major respects. The method is based upon that associated with the LR(k) grammars but is faster because it bypasses all reduction steps concerned with 'chain' productions. These are freely selected productions which are considered semantically irrelevant and whose right parts consist of just a single symbol. The parses produced by the method are 'sparse' in that they contain no references to chain productions; they are termed 'chain-free' parses. The CFLR(k) grammars are introduced as the largest class which can be Chain-Free parsed from Left to Right while looking k symbols ahead of the current point of the parse. The properties of these grammars are examined in detail and their relationship to the conventional LR(k) grammars is explored. Techniques are presented for testing grammars for the CFLR(k) property and for constructing chain-free parsers for those grammars possessing the property. Methods are also presented for converting ordinary LR(k) parsers into chain-free parsers. CFLR(k) parsers are more widely applicable than their LR(k) counterparts, are faster, and provide the same excellent detection of syntactic errors. Unfortunately they also tend to be rather larger. A simple optimization is presented which completely overcomes this single disadvantage without sacrificing any of the advantages of the method. These theoretical techniques are adapted to provide truly practical chain-free parsers based on the conventional SLR and LALR parsing methods. Detailed consideration is given to use of 'default reductions' and related techniques for achieving compact representations of these parsers. The resulting chain-free parsers are not only faster than their ordinary counterparts, but probably smaller too. We believe their advantages are such that they should substantially replace other parsing methods currently used in programming language compilers.
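To make the notion of a 'chain-free' parse concrete, the following is a minimal sketch in Python. It is not the thesis's method: the thesis bypasses chain reductions inside the parser itself, whereas this sketch only post-processes a finished parse tree. The toy grammar, the `CHAIN` set, and the names `label` and `collapse` are assumptions made for illustration.

```python
# Hypothetical toy grammar (not from the thesis):
#   E -> E + T | T      T -> T * F | F      F -> id
# Productions selected as 'chain' productions for this sketch.
CHAIN = {("E", ("T",)), ("T", ("F",))}


def label(node):
    """Grammar symbol of a node: tuples are nonterminal nodes, strings are terminals."""
    return node[0] if isinstance(node, tuple) else node


def collapse(node):
    """Return a copy of the tree with every selected chain-production node spliced out."""
    if isinstance(node, str):
        return node
    head = node[0]
    # The right part is read from the original children, before any collapsing.
    rhs = tuple(label(child) for child in node[1:])
    children = [collapse(child) for child in node[1:]]
    if (head, rhs) in CHAIN:
        return children[0]  # skip the chain step entirely
    return (head, *children)


# Full parse tree for "id + id" under the toy grammar.
full_tree = ("E", ("E", ("T", ("F", "id"))), "+", ("T", ("F", "id")))
sparse_tree = collapse(full_tree)
```

In the sparse tree the steps E -> T and T -> F no longer appear, while F -> id is kept because it was not selected as a chain production.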

    Table driven prediction for recursive descent LL(k) parsers

    Programming languages are typically described in BNF or some extension of BNF, and the process of converting these descriptions into parsers is performed by parser generators. Some of the parser generators that convert LL grammars into parsers construct them to use recursive descent, which gives them context during execution. The context is provided by the execution stack as the parser descends into the grammar, and this is what allows the full expressiveness of LL grammars. Table-driven parsers can be generated instead, but restrictions are placed on the LL grammars that can be accepted. The benefit of tables is that they facilitate a separation of syntax analysis and semantic code written by a language designer. They are also faster and they simplify language implementation and modification. This paper proposes the possibility of a hybrid system that makes decisions using tables but, once decisions are made, employs recursive descent to maintain context. The benefits of each system are maintained, and the drawbacks are mitigated. Also discussed are the modifications made to an existing parser generator, oops (version 2), so that it accepts LL(k) grammars and builds parsers using this system as proof-of-concept.
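The hybrid idea described in this abstract can be sketched roughly as follows, in Python. This is a hand-written toy, not output of the oops generator, and it uses a single token of lookahead rather than general LL(k). The grammar, the `PREDICT` table, and the `Parser` class are assumptions made for illustration. Each decision is a table lookup, while the call stack of the recursive descent methods carries the semantic context (here, a running sum).

```python
# Hypothetical toy grammar:
#   expr -> term rest
#   rest -> '+' term rest | (empty)
#   term -> NUM | '(' expr ')'
# Prediction table: (nonterminal, lookahead kind) -> chosen alternative.
PREDICT = {
    ("rest", "+"): "plus", ("rest", ")"): "empty", ("rest", "$"): "empty",
    ("term", "NUM"): "num", ("term", "("): "paren",
}


class Parser:
    def __init__(self, tokens):
        self.tokens = list(tokens) + ["$"]  # "$" marks end of input
        self.pos = 0

    def peek(self):
        tok = self.tokens[self.pos]
        return "NUM" if isinstance(tok, int) else tok

    def take(self, kind):
        if self.peek() != kind:
            raise SyntaxError(f"expected {kind}, found {self.peek()}")
        tok = self.tokens[self.pos]
        self.pos += 1
        return tok

    def choose(self, nonterminal):
        """Table-driven decision: pick an alternative from the lookahead."""
        try:
            return PREDICT[(nonterminal, self.peek())]
        except KeyError:
            raise SyntaxError(f"unexpected {self.peek()} in {nonterminal}") from None

    def expr(self):
        return self.rest(self.term())

    def rest(self, left):
        if self.choose("rest") == "plus":
            self.take("+")
            return self.rest(left + self.term())
        return left

    def term(self):
        if self.choose("term") == "num":
            return self.take("NUM")
        self.take("(")
        value = self.expr()
        self.take(")")
        return value


def parse(tokens):
    """Parse a token list (ints and punctuation strings) and return the sum."""
    parser = Parser(tokens)
    value = parser.expr()
    parser.take("$")
    return value
```

For example, `parse([1, "+", "(", 2, "+", 3, ")"])` evaluates the nested sum, and an input such as `[1, "+"]` raises `SyntaxError` because the table has no entry for `term` at end of input.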

    Protecting Systems From Exploits Using Language-Theoretic Security

    Any computer program processing input from the user or network must validate the input. Input-handling vulnerabilities occur in programs when the software component responsible for filtering malicious input (the parser) does not perform validation adequately. Consequently, parsers are among the most targeted components since they defend the rest of the program from malicious input. This thesis adopts the Language-Theoretic Security (LangSec) principle to understand what tools and research are needed to prevent exploits that target parsers. LangSec proposes specifying the syntactic structure of the input format as a formal grammar. We then build a recognizer for this formal grammar to validate any input before the rest of the program acts on it. To ensure that these recognizers represent the data format, programmers often rely on parser generators or parser combinator tools to build the parsers. This thesis propels several sub-fields in LangSec by proposing new techniques to find bugs in implementations, novel categorizations of vulnerabilities, and new parsing algorithms and tools to handle practical data formats. To this end, this thesis comprises five parts that tackle various tenets of LangSec. First, I categorize various input-handling vulnerabilities and exploits using two frameworks. The first is the mismorphisms framework, which helps us reason about the root causes leading to various vulnerabilities. Next, we built a categorization framework using various LangSec anti-patterns, such as parser differentials and insufficient input validation. Finally, we built a catalog of more than 30 popular vulnerabilities to demonstrate the categorization frameworks. Second, I built parsers for various Internet of Things and power grid network protocols and the iccMAX file format using parser combinator libraries. The parsers I built for power grid protocols were deployed and tested on power grid substation networks as an intrusion detection tool. The parser I built for the iccMAX file format led to several corrections and modifications to the iccMAX specifications and reference implementations. Third, I present SPARTA, a novel tool I built that generates Rust code that type checks Portable Document Format (PDF) files. The type checker I helped build strictly enforces the constraints in the PDF specification to find deviations. Our checker has contributed to at least four significant clarifications and corrections to the PDF 2.0 specification and various open-source PDF tools. In addition to our checker, we also built a practical tool, PDFFixer, to dynamically patch type errors in PDF files. Fourth, I present ParseSmith, a tool to build verified parsers for real-world data formats. Most parsing tools available for data formats are insufficient to handle practical formats or have not been verified for their correctness. I built a verified parsing tool in Dafny that builds on ideas from attribute grammars, data-dependent grammars, and parsing expression grammars to tackle various constructs commonly seen in network formats. I prove that our parsers run in linear time and always terminate for well-formed grammars. Finally, I provide the earliest systematic comparison of various data description languages (DDLs) and their parser generation tools. DDLs are used to describe and parse commonly used data formats, such as image formats. I conducted an expert elicitation qualitative study to derive various metrics that I use to compare the DDLs. I also systematically compare these DDLs based on sample data descriptions available with the DDLs, checking for correctness and resilience.

    Contributions to the Construction of Extensible Semantic Editors

    This dissertation addresses the need for easier construction and extension of language tools. Specifically, the construction and extension of so-called semantic editors is considered, that is, editors providing semantic services for code comprehension and manipulation. Editors like these are typically found in state-of-the-art development environments, where they have been developed by hand. The list of programming languages available today is extensive and, with the lively creation of new programming languages and the evolution of old languages, it keeps growing. Many of these languages would benefit from proper tool support. Unfortunately, the development of a semantic editor can be a time-consuming and error-prone endeavor, and too large an effort for most language communities. Given the complex nature of programming, and the huge benefits of good tool support, this lack of tools is problematic. In this dissertation, an attempt is made at narrowing the gap between generative solutions and how state-of-the-art editors are constructed today. A generative alternative for construction of textual semantic editors is explored with focus on how to specify extensible semantic editor services. Specifically, this dissertation shows how semantic services can be specified using a semantic formalism called reference attribute grammars (RAGs), and how these services can be made responsive enough for editing, and be provided also when the text in an editor is erroneous. Results presented in this dissertation have been found useful, both in industry and in academia, suggesting that the explored approach may help to reduce the effort of editor construction.

    Null Element Restoration

    Understanding the syntactic structure of a sentence is a necessary preliminary to understanding its semantics and therefore for many practical applications. The field of natural language processing has achieved a high degree of accuracy in parsing, at least in English. However, the syntactic structures produced by the most commonly used parsers are less detailed than those structures found in the treebanks the parsers were trained on. In particular, these parsers typically lack the null elements used to indicate wh-movement, control, and other phenomena. This thesis presents a system for inserting these null elements into parse trees in English. It then examines the problem in Arabic, which motivates a second, joint-inference system which has improved performance on English as well. Finally, it examines the application of information derived from the Google Web 1T corpus as a way of reducing certain data sparsity issues related to wh-movement.

    On-the-Fly Syntax Highlighting using Neural Networks

    With the presence of online collaborative tools for software developers, source code is shared and consulted frequently, from code viewers to merge requests and code snippets. Typically, code highlighting quality in such scenarios is sacrificed in favor of system responsiveness. In these on-the-fly settings, performing a formal grammatical analysis of the source code is not only expensive, but also intractable in the many cases where the input is an invalid derivation of the language. Indeed, current popular highlighters rely heavily on a system of regular expressions, typically far from the specification of the language's lexer. Due to their complexity, regular expressions need to be periodically updated as more feedback is collected from the users, and their design hinders the detection of more complex language formations. This paper delivers a deep learning-based approach suitable for on-the-fly grammatical code highlighting of correct and incorrect language derivations, in settings such as code viewers and snippets. It focuses on alleviating the burden on the developers, who can reuse the language's parsing strategy to produce the desired highlighting specification. Moreover, this approach is compared to current online syntax highlighting tools and formal methods in terms of accuracy and execution time, across different levels of grammatical coverage, for three mainstream programming languages. The results obtained show how the proposed approach can consistently achieve near-perfect accuracy in its predictions, thereby outperforming regular expression-based strategies. Comment: Accepted for publication in the ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE 2022).

    A linguistic rule-based approach to extract drug-drug interactions from pharmacological documents

    Background: A drug-drug interaction (DDI) occurs when one drug influences the level or activity of another drug. The increasing volume of the scientific literature overwhelms health care professionals trying to keep up to date with all published studies on DDI. Methods: This paper describes a hybrid linguistic approach to DDI extraction that combines shallow parsing and syntactic simplification with pattern matching. Appositions and coordinate structures are interpreted based on shallow syntactic parsing provided by the UMLS MetaMap tool (MMTx). Subsequently, complex and compound sentences are broken down into clauses from which simple sentences are generated by a set of simplification rules. A pharmacist defined a set of domain-specific lexical patterns to capture the most common expressions of DDI in texts. These lexical patterns are matched with the generated sentences in order to extract DDIs. Results: We have performed different experiments to analyze the performance of the different processes. The lexical patterns achieve a reasonable precision (67.30%), but very low recall (14.07%). The inclusion of appositions and coordinate structures helps to improve the recall (25.70%); however, precision is lower (48.69%). The detection of clauses does not improve the performance. Conclusions: Information Extraction (IE) techniques can provide an interesting way of reducing the time spent by health care professionals on reviewing the literature. Nevertheless, no approach has been carried out to extract DDI from texts. To the best of our knowledge, this work proposes the first integral solution for the automatic extraction of DDI from biomedical texts.