16 research outputs found

    Handling non-compositionality in multilingual CNLs

    In this paper, we describe methods for handling multilingual non-compositional constructions in the framework of GF. We specifically look at methods to detect and extract non-compositional phrases from parallel texts and propose methods to handle such constructions in GF grammars. We expect that the methods to handle non-compositional constructions will enrich CNLs by providing more flexibility in the design of controlled languages. We look at two specific use cases of non-compositional constructions: a general-purpose method to detect and extract multilingual multiword expressions and a procedure to identify nominal compounds in German. We evaluate our procedure for multiword expressions by performing a qualitative analysis of the results. For the experiments on nominal compounds, we incorporate the detected compounds in a full SMT pipeline and evaluate the impact of our method on the machine translation process. Comment: CNL workshop in COLING 201
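
    The abstract does not detail the extraction step; the sketch below is only a generic illustration of association-based multiword-expression candidate detection (a PMI score over adjacent word pairs), not the authors' actual GF- or alignment-based procedure. The function name and threshold are invented for the example.

```python
from collections import Counter
from math import log2

def mwe_candidates(sentences, min_count=3):
    """Score adjacent word pairs by pointwise mutual information (PMI).
    A toy stand-in for the MWE detection step described in the abstract."""
    unigrams, bigrams = Counter(), Counter()
    total = 0
    for sent in sentences:
        tokens = sent.lower().split()
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
        total += len(tokens)
    scored = []
    for (w1, w2), count in bigrams.items():
        if count < min_count:
            continue
        pmi = log2((count / total) /
                   ((unigrams[w1] / total) * (unigrams[w2] / total)))
        scored.append(((w1, w2), pmi))
    # highest-scoring pairs are the strongest MWE candidates
    return sorted(scored, key=lambda item: -item[1])
```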

    Abstract syntax as interlingua: Scaling up the grammatical framework from controlled languages to robust pipelines

    Abstract syntax is an interlingual representation used in compilers. Grammatical Framework (GF) applies the abstract syntax idea to natural languages. The development of GF started in 1998, first as a tool for controlled language implementations, where it has gained an established position in both academic and commercial projects. GF provides grammar resources for over 40 languages, enabling accurate generation and translation, as well as grammar engineering tools and components for mobile and Web applications. On the research side, the focus in the last ten years has been on scaling up GF to wide-coverage language processing. The concept of abstract syntax offers a unified view on many other approaches: Universal Dependencies, WordNets, FrameNets, Construction Grammars, and Abstract Meaning Representations. This makes it possible for GF to utilize data from the other approaches and to build robust pipelines. In return, GF can contribute to data-driven approaches by methods to transfer resources from one language to others, to augment data by rule-based generation, to check the consistency of hand-annotated corpora, and to pipe analyses into high-precision semantic back ends. This article gives an overview of the use of abstract syntax as interlingua through both established and emerging NLP applications involving GF.
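
    As a rough illustration of the interlingua idea described above (GF grammars themselves are written in GF's own grammar language, which is not shown here), the toy Python sketch below builds one language-neutral tree and linearizes it into two languages. The `Pred` type and the tiny lexicon are invented for the example.

```python
# Toy illustration (not GF itself): one abstract syntax tree, several linearizations.
from dataclasses import dataclass

@dataclass
class Pred:            # abstract syntax: predication of a verb sense over a subject
    subj: str
    verb: str

LEXICON = {
    "Eng": {"cat": "the cat", "sleep": "sleeps"},
    "Fin": {"cat": "kissa",   "sleep": "nukkuu"},
}

def linearize(tree: Pred, lang: str) -> str:
    """Map the language-neutral tree to a concrete string in one language."""
    lex = LEXICON[lang]
    return f"{lex[tree.subj]} {lex[tree.verb]}"

tree = Pred("cat", "sleep")      # shared, language-neutral representation
print(linearize(tree, "Eng"))    # "the cat sleeps"
print(linearize(tree, "Fin"))    # "kissa nukkuu"
```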

    The many ways of returning to the refrain in Telugu song

    A refrain (abbreviated R) is a line in a song repeated after each verse, and often used as the song's name. Returns to R are usually high points both melodically and lyrically. E.g., a verse ending "I feel upon my lips again" makes a smooth lead-in (abbreviated L) to the refrain R = "A taste of honey" (Scott/Marlow, 1962). We notate this "(I feel upon my lips again) A taste of honey", and call such patterns (lead-in)refrains or (L)R's. The song goes R … LR … LR, where L could change verse to verse. In our examples, R and (L)R are often both full sentences. More interesting L's are often phrases or clauses, rather than interjections. A word-prefix L can transform R.

    Our main contribution is to point out that (L)R patterns are a striking feature of Telugu (TEL) song, remarkably various and profuse in both old and new songs, yet little remarked in the literature as far as we are aware. We give examples from the 15th c. to the 21st. In transcription, a colon marks long vowels, and M, nasalized ones. Retroflexion is shown by capitalization, and aspiration by h, also a consonant by itself. Glosses are given, some also /morpheme-wise/.

    Kannada (KAN) and Tamil (TAM) share features with TEL that help make (L)R's: fairly free word order, agglutinative particles, and adjectives and relative clauses preceding the noun. We give only lone KAN and TAM examples, but expect to find more when we search. Hindi (HIN) shares fewer features with TEL; perhaps therefore, we have so far looked but found few (L)R's in HIN.

    Prediction of Learning Curves in Machine Translation

    Parallel data in the domain of interest is the key resource when training a statistical machine translation (SMT) system for a specific purpose. Since ad-hoc manual translation can represent a significant investment in time and money, a prior assessment of the amount of training data required to achieve a satisfactory accuracy level can be very useful. In this work, we show how to predict what the learning curve would look like if we were to manually translate increasing amounts of data. We consider two scenarios: (1) monolingual samples in the source and target languages are available, and (2) a small amount of parallel corpus is additionally available. We propose methods for predicting learning curves in both scenarios.
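
    The abstract does not state the exact curve family used; as a generic sketch of the idea, one could fit an inverse-power-law learning curve to a few measured points and extrapolate, for example with SciPy. The data points and the functional form below are illustrative assumptions, not the paper's results.

```python
# Generic learning-curve extrapolation sketch (not the paper's exact model):
# fit BLEU(n) = a - b * n**(-c) to a few measured points, then extrapolate.
import numpy as np
from scipy.optimize import curve_fit

def curve(n, a, b, c):
    return a - b * np.power(n, -c)

# hypothetical measurements: (sentences of training data, BLEU score)
sizes = np.array([10_000, 20_000, 40_000, 80_000], dtype=float)
bleu  = np.array([14.2, 16.8, 18.9, 20.4])

params, _ = curve_fit(curve, sizes, bleu, p0=[30.0, 100.0, 0.3], maxfev=10_000)
print("predicted BLEU at 1M sentences:", curve(1_000_000.0, *params))
```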

    Multilingual Abstractions: Abstract Syntax Trees and Universal Dependencies

    This thesis studies the connections between parsing-friendly representations and interlingua grammars developed for multilingual language generation. Parsing-friendly representations refer to dependency tree representations that can be used for robust, accurate and scalable analysis of natural language text. Shared multilingual abstractions are central to both these representations. Universal Dependencies (UD) is a framework for developing cross-lingual representations, using dependency trees for multilingual representations. Similarly, Grammatical Framework (GF) is a framework for interlingual grammars, used to derive abstract syntax trees (ASTs) corresponding to sentences. The first half of this thesis explores the connections between the representations behind these two multilingual abstractions. The first study presents a conversion method from ASTs to dependency trees and presents the mapping between the two abstractions – GF and UD – by applying the conversion from ASTs to UD. Experiments show that there is a lot of similarity behind these two abstractions, and our method is used to bootstrap parallel UD treebanks for 31 languages. In the second study, we look at the inverse problem, i.e. converting UD trees to ASTs. This is motivated by the goal of helping GF-based interlingual translation by using dependency parsers as a robust front end instead of the parser used in GF. The second half of this thesis focuses on data augmentation for parsing – specifically, using grammar-based backends to aid dependency parsing. We propose a generic method to generate synthetic UD treebanks using interlingua grammars and the methods developed in the first half. Results show that these synthetic treebanks are an alternative for developing parsing models, especially for under-resourced languages. This study is followed by another on out-of-vocabulary words (OOVs) – a more focused problem in parsing. OOVs pose an interesting problem in parser development, and the method we present is a generic simplification that can act as a drop-in replacement for any symbolic parser. Our idea of replacing unknown words with known, similar words results in small but significant improvements in experiments using two parsers and a range of 7 languages.
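
    A highly simplified sketch of the AST-to-dependency-tree direction is given below; it is not the thesis' conversion algorithm, only an illustration of the underlying idea that each abstract function can designate one argument as the head and attach the remaining arguments to it with labels. The `HEAD_CONFIG` table and the example tree are invented.

```python
# Toy AST-to-dependency conversion: each abstract function declares which
# argument is the head; the other arguments attach to it with a relation label.

HEAD_CONFIG = {                   # function name -> (head arg index, labels for others)
    "PredVP": (1, ["nsubj"]),     # PredVP subj vp : vp is head, subj -> nsubj
    "DetCN":  (1, ["det"]),       # DetCN det cn   : cn is head, det  -> det
}

def ast_to_deps(node, deps=None):
    """node = (fun, [children]) for internal nodes, or a word string for leaves.
    Returns (head word, list of (dependent, label, head) arcs)."""
    if deps is None:
        deps = []
    if isinstance(node, str):
        return node, deps
    fun, children = node
    head_ix, labels = HEAD_CONFIG[fun]
    child_heads = [ast_to_deps(child, deps)[0] for child in children]
    head = child_heads[head_ix]
    others = [h for i, h in enumerate(child_heads) if i != head_ix]
    for dep_word, label in zip(others, labels):
        deps.append((dep_word, label, head))
    return head, deps

tree = ("PredVP", [("DetCN", ["the", "cat"]), "sleeps"])
print(ast_to_deps(tree))
# ('sleeps', [('the', 'det', 'cat'), ('cat', 'nsubj', 'sleeps')])
```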

    Multilingual Grammars and Universal Dependencies

    Abstract syntax trees are an alternative representation to the syntactic structures commonly found in NLP systems. This representation allows for sharing of structures across languages, making it well suited to serve as a translation interlingua. Grammatical Framework (GF) is a grammar formalism that captures cross-linguistic generalizations through the use of abstract syntax. The Resource Grammar Library (GF-RGL) implements multilingual grammars for over 30 languages.

    Universal Dependencies (UD) is a parallel effort to use shared structures to analyse sentences in different languages. The set of part-of-speech tags and functions is shared across languages. The linguistic data available from this project is annotated data, i.e. sentences annotated with UD structures, in over 40 languages.

    The main contribution of this thesis is to bridge these two representations: despite the similar motivation behind the two efforts, the representations used vary significantly. Hence, we propose a method to convert the abstract syntax trees in GF to the structures used in UD. We find that the correspondence between GF-RGL and UD is significant, and the differences between the two raise interesting questions about the level of abstraction. We also present practical applications of our method: (1) using the GF parser as a dependency parser and (2) bootstrapping UD treebanks from GF treebanks.

    Another topic addressed in this thesis is the problem of out-of-vocabulary words that comes up in symbolic systems. We address this problem in the context of part-of-speech tagging and statistical dependency parsing. We propose a simple method that uses a distributional thesaurus to replace unknown words, and show through empirical evaluation that our method improves both overall accuracies and accuracies for unknown words. Our method is generic and can be adapted to fit other NLP systems.
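
    The following is a minimal sketch of the unknown-word substitution idea; the thesis uses a distributional thesaurus induced from corpora, whereas the tiny lookup table here is just a placeholder showing where such a resource would plug in.

```python
# Minimal sketch of unknown-word substitution before symbolic parsing.

THESAURUS = {                      # OOV word -> known, distributionally similar word
    "crowdfunded": "funded",
    "retweeted":   "posted",
}

def replace_oovs(tokens, vocab):
    """Replace tokens outside the parser's vocabulary with similar known words,
    so the symbolic parser never sees an unknown form."""
    return [t if t in vocab else THESAURUS.get(t, t) for t in tokens]

vocab = {"the", "campaign", "was", "funded", "posted", "quickly"}
print(replace_oovs("the campaign was crowdfunded quickly".split(), vocab))
# ['the', 'campaign', 'was', 'funded', 'quickly']
```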

    GF Wide-coverage English-Finnish MT system for WMT 2015

    This paper describes the GF Wide-coverage MT system submitted to WMT 2015 for translation from English to Finnish. Our system uses an interlingua-based approach, in which the interlingua is a shared formal representation that abstracts syntactic structures over multiple languages. Our final submission is a reranked system in which we combine this baseline MT system with a factored language model.
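
    As a rough sketch of what n-best reranking with an additional language model score can look like (the submission's actual features and weights are not given in the abstract; the names below are illustrative), consider a weighted linear combination over candidate translations:

```python
# Toy n-best reranking: pick the candidate with the best weighted feature sum,
# e.g. baseline model score plus a factored language model score.

def rerank(nbest, weights):
    """nbest: list of (translation, {feature: score}) pairs."""
    def total(features):
        return sum(weights[name] * value for name, value in features.items())
    return max(nbest, key=lambda cand: total(cand[1]))

nbest = [
    ("kissa nukkuu matolla", {"baseline": -4.1, "factored_lm": -12.3}),
    ("kissa nukkui matolla", {"baseline": -4.4, "factored_lm": -10.9}),
]
print(rerank(nbest, {"baseline": 1.0, "factored_lm": 0.4}))
```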