    Abstract syntax as interlingua: Scaling up the grammatical framework from controlled languages to robust pipelines

    Abstract syntax is an interlingual representation used in compilers. Grammatical Framework (GF) applies the abstract syntax idea to natural languages. The development of GF started in 1998, first as a tool for controlled language implementations, where it has gained an established position in both academic and commercial projects. GF provides grammar resources for over 40 languages, enabling accurate generation and translation, as well as grammar engineering tools and components for mobile and Web applications. On the research side, the focus in the last ten years has been on scaling up GF to wide-coverage language processing. The concept of abstract syntax offers a unified view on many other approaches: Universal Dependencies, WordNets, FrameNets, Construction Grammars, and Abstract Meaning Representations. This makes it possible for GF to utilize data from the other approaches and to build robust pipelines. In return, GF can contribute to data-driven approaches by methods to transfer resources from one language to others, to augment data by rule-based generation, to check the consistency of hand-annotated corpora, and to pipe analyses into high-precision semantic back ends. This article gives an overview of the use of abstract syntax as interlingua through both established and emerging NLP applications involving GF.
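The idea of abstract syntax as an interlingua can be illustrated with a toy sketch in Python. This is not GF code and does not use any GF API: the tree types `Item` and `Pred`, the two-language `LEXICON`, and the fixed sentence pattern are all assumptions made up for illustration. The point is only that translation goes through a shared, language-independent tree.

```python
from dataclasses import dataclass


# Abstract syntax: language-independent tree constructors (toy example).
@dataclass(frozen=True)
class Item:
    kind: str          # e.g. "wine"


@dataclass(frozen=True)
class Pred:
    item: Item
    quality: str       # e.g. "good"


# Concrete syntaxes: one lexicon per language (hypothetical, two entries only).
LEXICON = {
    "eng": {"wine": "wine", "good": "good", "this": "this", "is": "is"},
    "ita": {"wine": "vino", "good": "buono", "this": "questo", "is": "è"},
}


def linearise(tree: Pred, lang: str) -> str:
    """Map an abstract tree to a string in the given language."""
    lex = LEXICON[lang]
    return " ".join([lex["this"], lex[tree.item.kind], lex["is"], lex[tree.quality]])


def parse(sentence: str, lang: str) -> Pred:
    """Invert linearise for the fixed pattern 'this <kind> is <quality>'."""
    reverse = {word: key for key, word in LEXICON[lang].items()}
    _, kind, _, quality = sentence.split()
    return Pred(Item(reverse[kind]), reverse[quality])


def translate(sentence: str, source: str, target: str) -> str:
    """Translate by parsing to the shared tree and linearising it again."""
    return linearise(parse(sentence, source), target)
```

For instance, `translate("this wine is good", "eng", "ita")` parses the English sentence to `Pred(Item("wine"), "good")` and linearises that tree with the Italian lexicon. Adding a language in this sketch means adding one lexicon entry, not one translator per language pair.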

    An IDE for the Grammatical Framework

    John J. Camilleri

    Abstract

    The GF Eclipse Plugin provides an integrated development environment (IDE) for developing grammars in the Grammatical Framework (GF). Built on top of the Eclipse Platform, it aids grammar writing by providing instant syntax checking, semantic warnings and cross-reference resolution. Inline documentation and a library browser facilitate the use of existing resource libraries, and compilation and testing of grammars is greatly improved through single-click launch configurations and an in-built test case manager for running treebank regression tests. This IDE promotes grammar-based systems by making the tasks of writing grammars and using resource libraries more efficient, and provides powerful tools to reduce the barrier to entry to GF and encourage new users of the framework.

    The research leading to these results has received funding from the European Union's Seventh Framework Programme (FP7/2007-2013) under grant agreement no. FP7-ICT-247914.

    Introduction

    Grammatical Framework (GF)

    GF is a special-purpose framework for writing multilingual grammars targeting multiple parallel languages simultaneously. It provides a functional programming language for declarative grammar writing, where each grammar is split between an abstract syntax common to all languages, and multiple language-dependent concrete syntaxes, which define how abstract syntax trees should be linearised into the target languages. From these grammar components, the GF compiler derives both a parser and a lineariser for each concrete language, enabling bi-directional translation between all language pairs.

    GF grammar development

    As a grammar formalism, GF facilitates the writing of grammars which can form the basis of various kinds of rule-based machine translation applications.
While it is common to focus on the theoretical capabilities and characteristics of such formalisms, it is also relevant to assess what software engineering tools exist to aid the grammar writers themselves. The process of writing a GF grammar may be constrained by the framework's formal limits, but its effectiveness and endurance as a language for grammar development are equally determined by the real-world tools which exist to support it. Whether out of developer choice or merely lack of anything better, GF grammar development typically takes place in traditional text editors, which have no special support for GF apart from a few syntax highlighting schemes made available for certain popular editors. Looking up library functions, grammar compilation and running of regression tests must all take place in separate windows, where the developer frequently enters console commands for searching within source files, loading the GF interpreter, and running some test set against a compiled grammar. GF developers in fact often end up writing their own script files for performing such tasks as a batch. Any syntax errors or compiler warnings generated in the process must be manually interpreted. While some developers may actively choose this low-level approach, the number of integrated development environments (IDEs) available today indicates that there is also a big demand for advanced development setups which provide combined tools for code validation, navigation, refactoring, test suite management and more. Major IDEs such as Eclipse, Microsoft Visual Studio and Xcode have become staples for many developers who want more integrated experiences than the traditional text editor and console combination.

Motivation

The goal of this work is to provide powerful development tools to the GF developer community, making the work of current grammar writers more efficient as well as promoting the Grammatical Framework itself and encouraging new developers to use the framework.
By building a GF development environment as a plugin to an existing IDE platform, we are able to obtain many useful code-editing features "for free". Thus, rather than building generic development tools, we only need to focus on writing IDE customisations which are specific to GF, which of course reduces the total effort required. The rest of this paper is laid out as follows: section 1.2 describes the design choices which guided the plugin's development, section 1.3.1 then covers each of the major features provided by the plugin, and in section 1.4 we discuss our plans for evaluation along with some future directions for the work.

Design choices

1.2.1 Eclipse

Eclipse is a multi-language software development environment which consists of both a standalone IDE and an underlying platform with an extensible plugin system. Eclipse can also be used for the development of self-contained general-purpose applications via its Rich Client Platform (RCP). The Eclipse Platform was chosen as the basis for a GF IDE for various reasons:

1. It is written in Java, meaning that the same compiled byte code can run on any platform for which there is a compatible virtual machine. This allows for maximum platform support while avoiding the effort required to maintain multiple versions of the product.
2. The platform is fully open-source under the Eclipse Public License (EPL), is designed to be extensible and is very well documented.
3. Eclipse is a widely popular IDE and is already well-known to a number of developers within the GF community.
4. It has excellent facilities for building language development tools via the Xtext Framework (see below).

Xtext

Xtext is an Eclipse-based framework for development of programming languages and domain specific languages (DSLs). Given a language description in the form of an EBNF grammar, it can provide all aspects of a complete language infrastructure, including a parser, linker and compiler or interpreter.
These tools are completely integrated within the Eclipse IDE yet allow full customisation according to the developer's needs. Xtext can be used both for [...]

By taking the grammar for the GF syntax as specified in Ranta (2011, appendix C.6.2), and converting it into a non-left-recursive (LL(*)) equivalent, we used Xtext's ANTLR-based code generator to obtain a basic infrastructure for the GF programming language, including a parser and serialiser. With this infrastructure as a starting point, a number of GF-specific customisations were written in order to provide support for linking across GF's module hierarchy system. Details of this implementation as well as other custom-built IDE features are described in section 1.3.1.

Design principles

Preserving existing projects

As users may wish to switch back and forth between a new IDE and their own traditional development setups, it was considered an important design principle to have the GF IDE not alter the developer's existing project structure. To this end, the GF Eclipse Plugin does not have any folder layout requirements, and never moves or alters a developer's files for its own purposes. For storing any IDE-specific preferences and intermediary files, meta-data directories are used which do not interfere with the original source files. Preventing application tie-in in this way reduces the investment required for users who want to switch to using the new IDE, and ensures that developers retain full control over their GF projects. This is especially important for developers using version control systems, who would want to use the plugin without risking any changes to their repository's directory tree.

Interaction with GF compiler

It is clear that an IDE which provides syntax checking and cross-reference resolution is in some sense replicating the parsing and linking features of that language's compiler.
With this comes the decision of what should be re-implemented within the GF IDE itself, and what should be delegated to the existing GF compiler. In terms of minimising effort required, the obvious option would be to rely on the compiler as much as possible. This would conveniently mean that any future changes to the language, as implemented in updates to the compiler, would require no change to the IDE itself. However, building an IDE which depends entirely on an external program to handle all parsing and linking jobs on-the-fly is not a practical solution. Thanks to the Xtext Framework's parser generator as described above, keeping all syntax checking within the IDE platform becomes a feasible option, in terms of effort required versus performance benefit. When it comes to reference resolution and linking, however, it was decided that the IDE should delegate these tasks to the GF compiler in a background process (see section 1.3.4). This avoids the work of having to re-implement GF's module hierarchy system within the IDE implementation. Communication of scope information from GF back to the IDE is facilitated through a new "tags" feature in the GF compiler, as described in section 1.3.3. This delegation occurs in an on-demand fashion, where the GF compiler is called asynchronously and as needed, when changes are made to a module's header.
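The on-demand delegation described above can be sketched roughly as follows. This is a minimal Python illustration under stated assumptions, not the plugin's actual (Java/Xtext) implementation: the class `ScopeRefresher`, the helper `module_header`, and the rule that a header is "everything before the first `{`" are all hypothetical. The call to the external compiler is passed in as a function, because the excerpt does not give the command line used to obtain the "tags" output.

```python
import hashlib
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Callable, Dict, Optional


def module_header(source: str) -> str:
    """Return the text before the first '{' (assumed here to be the module header)."""
    return source.split("{", 1)[0].strip()


class ScopeRefresher:
    """Re-run an external scope lookup only when a module's header changes."""

    def __init__(self, run_compiler: Callable[[str], object]) -> None:
        self._run_compiler = run_compiler          # hypothetical compiler wrapper
        self._pool = ThreadPoolExecutor(max_workers=1)
        self._seen: Dict[str, str] = {}            # path -> digest of last header

    def on_edit(self, path: str, source: str) -> Optional[Future]:
        """Called after each edit; returns a Future if a background run was started."""
        digest = hashlib.sha256(module_header(source).encode("utf-8")).hexdigest()
        if self._seen.get(path) == digest:
            return None                            # header unchanged: nothing to do
        self._seen[path] = digest
        return self._pool.submit(self._run_compiler, path)  # runs in the background

    def close(self) -> None:
        """Wait for any pending background run and release the worker thread."""
        self._pool.shutdown(wait=True)
```

In a real setup, `run_compiler` would presumably wrap a subprocess call to the GF compiler and parse its output. Keeping it as a parameter keeps the sketch independent of any particular command-line interface and lets the editor thread stay responsive while the lookup runs.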

    Handling non-compositionality in multilingual CNLs

    In this paper, we describe methods for handling multilingual non-compositional constructions in the framework of GF. We specifically look at methods to detect and extract non-compositional phrases from parallel texts and propose methods to handle such constructions in GF grammars. We expect that the methods to handle non-compositional constructions will enrich CNLs by providing more flexibility in the design of controlled languages. We look at two specific use cases of non-compositional constructions: a general-purpose method to detect and extract multilingual multiword expressions and a procedure to identify nominal compounds in German. We evaluate our procedure for multiword expressions by performing a qualitative analysis of the results. For the experiments on nominal compounds, we incorporate the detected compounds in a full SMT pipeline and evaluate the impact of our method on the machine translation process. Comment: CNL workshop in COLING 201

    Controlled Natural Language Generation from a Multilingual FrameNet-based Grammar

    This paper presents a currently bilingual but potentially multilingual FrameNet-based grammar library implemented in Grammatical Framework. The contribution of this paper is two-fold. First, it offers a methodological approach to automatically generate the grammar based on semantico-syntactic valence patterns extracted from FrameNet-annotated corpora. Second, it provides a proof of concept for two use cases illustrating how the acquired multilingual grammar can be exploited in different CNL applications in the domains of arts and tourism.

    Automatic acquisition of LFG resources for German - as good as it gets

    We present data-driven methods for the acquisition of LFG resources from two German treebanks. We discuss problems specific to semi-free word order languages as well as problems arising from the data structures determined by the design of the different treebanks. We compare two ways of encoding semi-free word order, as done in the two German treebanks, and argue that the design of the TiGer treebank is more adequate for the acquisition of LFG resources. Furthermore, we describe an architecture for LFG grammar acquisition for German, based on the two German treebanks, and compare our results with a hand-crafted German LFG grammar.

    A CNL for Contract-Oriented Diagrams

    We present a first step towards a framework for defining and manipulating normative documents or contracts described as Contract-Oriented (C-O) Diagrams. These diagrams provide a visual representation for such texts, giving the possibility to express a signatory's obligations, permissions and prohibitions, with or without timing constraints, as well as the penalties resulting from the non-fulfilment of a contract. This work presents a CNL for verbalising C-O Diagrams, a web-based tool allowing editing in this CNL, and another for visualising and manipulating the diagrams interactively. We then show how these proof-of-concept tools can be used by applying them to a small example.

    Treebank-based acquisition of wide-coverage, probabilistic LFG resources: project overview, results and evaluation

    This paper presents an overview of a project to acquire wide-coverage, probabilistic Lexical-Functional Grammar (LFG) resources from treebanks. Our approach is based on an automatic annotation algorithm that annotates “raw” treebank trees with LFG f-structure information approximating to basic predicate-argument/dependency structure. From the f-structure-annotated treebank we extract probabilistic unification grammar resources. We present the annotation algorithm, the extraction of lexical information and the acquisition of wide-coverage and robust PCFG-based LFG approximations including long-distance dependency resolution. We show how the methodology can be applied to multilingual, treebank-based unification grammar acquisition. Finally, we show how simple (quasi-)logical forms can be derived automatically from the f-structures generated for the treebank trees.
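As background for the PCFG component mentioned in this abstract, the following Python sketch shows the textbook relative-frequency estimate of rule probabilities from a treebank. It is not the authors' pipeline: it omits the f-structure annotation step entirely, and the tuple encoding of trees and the function names `rules` and `estimate_pcfg` are assumptions made for illustration.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple, Union

Tree = Union[str, tuple]   # a leaf word, or (label, child, child, ...)
Rule = Tuple[str, Tuple[str, ...]]


def rules(tree: Tree) -> Iterable[Rule]:
    """Yield one (lhs, rhs) rule per internal node of a tuple-encoded tree."""
    if isinstance(tree, str):
        return
    label, *children = tree
    # The right-hand side lists child labels, or the word itself for a leaf.
    yield label, tuple(c if isinstance(c, str) else c[0] for c in children)
    for child in children:
        yield from rules(child)


def estimate_pcfg(treebank: Iterable[Tree]) -> Dict[Rule, float]:
    """Relative-frequency estimate: count(lhs -> rhs) / count(lhs)."""
    rule_counts = Counter(r for tree in treebank for r in rules(tree))
    lhs_counts: Counter = Counter()
    for (lhs, _), n in rule_counts.items():
        lhs_counts[lhs] += n
    return {rule: n / lhs_counts[rule[0]] for rule, n in rule_counts.items()}
```

On a two-tree toy treebank in which `NP` expands to "John" twice and to "Mary" once, the estimate assigns 2/3 and 1/3 to the two `NP` rules. In a treebank whose node labels also carry functional annotations, the same counting would apply to the enriched labels.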

    Common aetiology for diverse language skills in 4½-year-old twins

    Multivariate genetic analysis was used to examine the genetic and environmental aetiology of the interrelationships of diverse linguistic skills. This study used data from a large sample of 4½-year-old twins who were tested on measures assessing articulation, phonology, grammar, vocabulary, and verbal memory. Phenotypic analysis suggested two latent factors: articulation (2 measures) and general language (the remaining 7), and a genetic model incorporating these factors provided a good fit to the data. Almost all genetic and shared environmental influences on the 9 measures acted through the two latent factors. There was also substantial aetiological overlap between the two latent factors, with a genetic correlation of 0·64 and shared environment correlation of 1·00. We conclude that to a large extent, the same genetic and environmental factors underlie the development of individual differences in a wide range of linguistic skills.