31 research outputs found

    Platform for Full-Syntax Grammar Development Using Meta-grammar Constructs

PACLIC 20 / Wuhan, China / 1-3 November, 2006

    Syntactic Analyzer for Czech Language

This Master's thesis describes the theoretical background, design, and implementation of a constituency (phrase-structure) parser for Czech, based on grouping parts of speech into larger units, or phrases. The program works with a manually built, annotated Czech sample corpus, from which it induces a probabilistic context-free grammar at runtime (machine learning). The parser, whose core is an extended CKY algorithm, then decides for an input Czech sentence whether it belongs to the language generated by the induced grammar and, if so, returns the most probable derivation tree for that sentence. This result is compared against the expected parse to evaluate the success rate of the syntactic analysis.
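The probabilistic-CKY core described above can be illustrated with a minimal sketch. The thesis induces its grammar from the annotated Czech corpus; here a tiny hand-written English PCFG in Chomsky normal form, with invented probabilities, stands in for it:

```python
from collections import defaultdict

# Toy PCFG in Chomsky normal form; all symbols, words, and
# probabilities are invented for illustration.
LEXICAL = {
    "she":  [("NP", 1.0)],
    "eats": [("V", 1.0)],
    "fish": [("NP", 0.8), ("V", 0.2)],
}
BINARY = [
    ("S", "NP", "VP", 1.0),   # S  -> NP VP
    ("VP", "V", "NP", 1.0),   # VP -> V  NP
]

def cky(words, start="S"):
    """Return the probability of the best parse rooted in `start`,
    or None if the sentence is not in the grammar's language."""
    n = len(words)
    chart = defaultdict(dict)  # (i, j) -> {symbol: best probability}
    for i, w in enumerate(words):
        for sym, p in LEXICAL.get(w, []):
            chart[(i, i + 1)][sym] = p
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):          # split point
                for lhs, b, c, p in BINARY:
                    pb = chart[(i, k)].get(b)
                    pc = chart[(k, j)].get(c)
                    if pb is not None and pc is not None:
                        cand = p * pb * pc
                        if cand > chart[(i, j)].get(lhs, 0.0):
                            chart[(i, j)][lhs] = cand
    return chart[(0, n)].get(start)

print(cky("she eats fish".split()))  # 0.8
print(cky("she fish".split()))       # None
```

Keeping only the highest-probability entry per symbol and span is the Viterbi variant of CKY; recovering the actual derivation tree additionally requires storing back-pointers at each chart cell.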

Doctoral Thesis (Promocijas darbs)

The electronic version does not include the appendices. This doctoral thesis describes the creation of a hybrid grammar model for Latvian, as well as its subsequent conversion to a Universal Dependencies (UD) grammar model. The thesis also lays the groundwork for Latvian language research based on syntactically annotated texts.
In this work, a fundamental Latvian language resource was developed and evaluated for the first time: a machine-readable treebank of 17 thousand syntactically annotated sentences. The sentences are annotated according to two syntactic annotation models: the hybrid phrase-structure/dependency model developed in the thesis, and the internationally recognised UD model. Both annotated versions of the treebank are publicly available for download and online querying. Over the course of the study, a set of tools and the infrastructure necessary for treebank creation and maintenance were developed, including the extensions to the IMCS UL experimental hybrid grammar model required for broad language coverage. The possibilities of converting data annotated according to the hybrid grammar model to the dependency grammar model were also analysed, and on this basis a derived UD treebank was created. The resulting treebank has served as the basis for developing high-accuracy (91%) parsers for Latvian. Furthermore, participation in the UD initiative has promoted the international recognition of Latvian and other inflective-language resources, and the development of better-fitted tools for inflective languages in computational linguistics, a research field historically oriented towards analytic languages. Keywords: treebank, Universal Dependencies, language technologies
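UD treebanks such as the one described are distributed in the CoNLL-U format. The sketch below parses an invented two-token Latvian sentence into (id, form, upos, head, deprel) tuples:

```python
# A two-token invented Latvian sentence ("Kaķis guļ", 'the cat
# sleeps') in the ten-column CoNLL-U format used by UD treebanks.
SAMPLE = (
    "1\tKaķis\tkaķis\tNOUN\t_\t_\t2\tnsubj\t_\t_\n"
    "2\tguļ\tgulēt\tVERB\t_\t_\t0\troot\t_\t_\n"
)

def parse_conllu(block):
    """Extract (id, form, upos, head, deprel) from each token line,
    skipping comments and blank lines."""
    rows = []
    for line in block.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.split("\t")
        rows.append((int(cols[0]), cols[1], cols[3],
                     int(cols[6]), cols[7]))
    return rows

for row in parse_conllu(SAMPLE):
    print(row)
# (1, 'Kaķis', 'NOUN', 2, 'nsubj')
# (2, 'guļ', 'VERB', 0, 'root')
```

Real CoNLL-U files also contain comment lines and multiword-token ranges (e.g. an ID of the form 1-2), which a production reader must handle before the `int()` conversion above.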

    Discourse-level features for statistical machine translation

Machine Translation (MT) has progressed tremendously in the past two decades. The rule-based and interlingua approaches have been superseded by statistical models, which learn the most likely translations from large parallel corpora. System design no longer amounts to crafting syntactic transfer rules, nor does it rely on a semantic representation of the text. Instead, a statistical MT system learns the most likely correspondences and re-orderings of chunks of source and target words from parallel corpora that have been word-aligned. With this procedure and millions of parallel source and target language sentences, systems can generate translations that are intelligible and require minimal post-editing effort from the human user. Nevertheless, it has been recognized that the statistical MT paradigm may fall short of modeling a number of linguistic phenomena that are established beyond the phrase level. Research in statistical MT has addressed discourse phenomena explicitly only in the past four years. When it comes to textual coherence structure, cohesive ties relate sentences and entire paragraphs argumentatively to each other. This text structure has to be rendered appropriately in the target text so that it conveys the same meaning as the source text. The lexical and syntactic means through which these cohesive markers are expressed may diverge considerably between languages. Frequently, these markers include discourse connectives: function words such as however, instead, since, and while, which relate spans of text to each other, e.g. for temporal ordering, contrast, or causality. Moreover, to establish the same temporal ordering of events described in a text, the conjugation of verbs has to be coherently translated. The present thesis proposes methods for integrating discourse features into statistical MT.
We pre-process the source text prior to automatic translation, focusing on two specific discourse phenomena: discourse connectives and verb tenses. Hand-crafted rules are not required in our proposal; instead, we implement machine-learning classifiers that learn to recognize discourse relations and predict translations of verb tenses. First, we designed new sets of semantically oriented features and classifiers to advance the state of the art in automatic disambiguation of discourse connectives. Here we profited from our multilingual setting and incorporated features based on MT and on insights gained from contrastive linguistic analysis of parallel corpora. In their best configurations, our classifiers reach high performance (0.7 to 1.0 F1 score) and can therefore be used reliably to automatically annotate the large corpora needed to train SMT systems. Issues of manual annotation and evaluation are discussed as well, and solutions are provided within new annotation and evaluation procedures. As a second contribution, we implemented entire SMT systems that can make use of the (automatically) annotated discourse information. Overall, the thesis confirms that these techniques are a practical solution that leads to global improvements in translation in the range of 0.2 to 0.5 BLEU points. Further evaluation reveals that, in terms of connectives and verb tenses, our statistical MT systems improve the translation of these phenomena by up to 25%, depending on the performance of the automatic classifiers and on the data sets used.
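The pre-processing idea can be sketched as follows: each discourse connective in the source sentence is labeled with a predicted sense tag before the text is handed to the SMT system. The connective inventory and the cue-based "classifier" below are invented stand-ins for the feature-rich classifiers trained in the thesis:

```python
# Hypothetical connective inventory and sense rules, for illustration.
CONNECTIVES = {"however", "instead", "since", "while"}

def classify_sense(connective, right_context):
    # Invented rule: "since" followed by a year-like token is
    # temporal, otherwise causal; the rest default to contrast.
    if connective == "since":
        nxt = right_context[0] if right_context else ""
        return "TEMPORAL" if nxt.isdigit() else "CAUSAL"
    return "CONTRAST"

def tag_connectives(sentence):
    """Label each connective token as word|SENSE for the MT system."""
    tokens = sentence.lower().split()
    out = []
    for i, tok in enumerate(tokens):
        word = tok.strip(",.;")
        if word in CONNECTIVES:
            out.append(f"{tok}|{classify_sense(word, tokens[i + 1:])}")
        else:
            out.append(tok)
    return " ".join(out)

print(tag_connectives("He has lived there since 2010"))
# he has lived there since|TEMPORAL 2010
```

An SMT system trained on text tagged this way can learn different translations for each connective-sense pair, which is the effect the thesis measures in its connective-level evaluation.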

    A Computational Lexicon and Representational Model for Arabic Multiword Expressions

The phenomenon of multiword expressions (MWEs) is increasingly recognised as a serious and challenging issue that has attracted the attention of researchers in various language-related disciplines. Research in these areas has emphasised the primary role of MWEs in analysing and understanding language, particularly in the computational treatment of natural languages. Ignoring MWE knowledge in any NLP system reduces the possibility of achieving high-precision output. However, despite the enormous wealth of MWE research and language resources available for English and some other languages, research on Arabic MWEs (AMWEs) still faces multiple challenges, particularly in key computational tasks such as extraction, identification, evaluation, language-resource building, and lexical representation. This research aims to remedy this deficiency by extending knowledge of AMWEs and making noteworthy contributions to the existing literature in three related research areas on the way towards building a computational lexicon of AMWEs. First, this study develops a general understanding of AMWEs by establishing a detailed conceptual framework that includes a description of the adopted AMWE concept and its distinctive properties at multiple linguistic levels. Second, for the AMWE extraction and discovery tasks, the study employs a hybrid approach that combines knowledge-based and data-driven computational methods for discovering multiple types of AMWEs. Third, this thesis presents a representational model for AMWEs which consists of a multilayer encoding of extensive linguistic descriptions. This project also paves the way for further in-depth AMWE-aware studies in NLP and linguistics to gain new insights into this complicated phenomenon in standard Arabic.
The implications of this research concern the vital role of the AMWE lexicon, as a new lexical resource, in improving various ANLP tasks, and the opportunities this lexicon provides for linguists to analyse and explore AMWE phenomena.
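On the data-driven side, a common starting point for MWE candidate discovery is to rank word pairs by an association measure such as pointwise mutual information (PMI). The following is a minimal sketch on an invented toy corpus; the thesis combines such statistics with knowledge-based filters for Arabic:

```python
import math
from collections import Counter

def pmi_bigrams(tokens, min_count=2):
    """Score adjacent word pairs by pointwise mutual information,
    keeping only pairs seen at least `min_count` times."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n = len(tokens)
    scores = {}
    for (w1, w2), c in bigrams.items():
        if c < min_count:
            continue
        p_xy = c / (n - 1)
        p_x, p_y = unigrams[w1] / n, unigrams[w2] / n
        scores[(w1, w2)] = math.log2(p_xy / (p_x * p_y))
    return scores

# Tiny invented "corpus": the recurring pair "new york" survives the
# frequency filter and receives a high PMI score.
tokens = "new york is big but new york is far".split()
scores = pmi_bigrams(tokens)
print(sorted(scores, key=scores.get, reverse=True))
```

The `min_count` threshold matters in practice: raw PMI overrates word pairs seen only once, so frequency filtering (or a measure like the log-likelihood ratio) is usually applied alongside it.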

    Formal Linguistic Models and Knowledge Processing. A Structuralist Approach to Rule-Based Ontology Learning and Population

2013 - 2014. The main aim of this research is to propose a structuralist approach to knowledge processing by means of ontology learning and population, starting from unstructured and structured texts. The suggested method combines distributional semantic approaches and NL formalization theories in order to develop a framework that relies upon deep linguistic analysis... [edited by author]

    A computational approach to Latin verbs: new resources and methods

This thesis presents the application of computational methods to the study of Latin verbs. In particular, we describe the creation of a subcategorization lexicon automatically extracted from annotated corpora; we also present a probabilistic model for the acquisition of selectional preferences from annotated corpora and an ontology (Latin WordNet). Finally, we report the results of a diachronic, quantitative study of Latin spatial preverbs.
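The core of subcategorization-lexicon extraction can be sketched as counting, per verb, which combinations of dependents it occurs with. The clause observations below are invented stand-ins for data drawn from an annotated Latin corpus:

```python
from collections import Counter, defaultdict

# Invented (verb lemma, dependent relations) observations.
CLAUSES = [
    ("do", ("obj", "iobj")),   # dare with direct and indirect object
    ("do", ("obj", "iobj")),
    ("do", ("obj",)),
    ("venio", ("obl",)),       # venire with an oblique argument
]

def subcat_lexicon(clauses):
    """Count, per verb, how often each argument frame occurs."""
    lex = defaultdict(Counter)
    for verb, deps in clauses:
        lex[verb][tuple(sorted(deps))] += 1
    return lex

lex = subcat_lexicon(CLAUSES)
print(lex["do"].most_common(1))  # [(('iobj', 'obj'), 2)]
```

Normalising these counts into relative frequencies per verb gives the probabilistic frame preferences that a lexicon of this kind records.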

    Head-Driven Phrase Structure Grammar

Head-Driven Phrase Structure Grammar (HPSG) is a constraint-based or declarative approach to linguistic knowledge, which analyses all descriptive levels (phonology, morphology, syntax, semantics, pragmatics) with feature-value pairs, structure sharing, and relational constraints. In syntax it assumes that expressions have a single, relatively simple constituent structure. This volume provides a state-of-the-art introduction to the framework. Various chapters discuss basic assumptions and formal foundations, describe the evolution of the framework, and go into the details of the main syntactic phenomena. Further chapters are devoted to non-syntactic levels of description. The book also considers related fields and research areas (gesture, sign languages, computational linguistics) and includes chapters comparing HPSG with other frameworks (Lexical Functional Grammar, Categorial Grammar, Construction Grammar, Dependency Grammar, and Minimalism).
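The central operation on the feature-value pairs mentioned above is unification. A minimal sketch, representing untyped, acyclic feature structures as nested dicts (real HPSG adds types, structure sharing via reentrancies, and relational constraints, none of which this toy captures):

```python
def unify(fs1, fs2):
    """Unify two feature structures (nested dicts with atomic string
    values); return the merged structure or None on a clash."""
    if not isinstance(fs1, dict) or not isinstance(fs2, dict):
        return fs1 if fs1 == fs2 else None   # atomic values must match
    result = dict(fs1)
    for attr, val in fs2.items():
        if attr in result:
            sub = unify(result[attr], val)
            if sub is None:                  # clash somewhere below
                return None
            result[attr] = sub
        else:
            result[attr] = val
    return result

# A verb's (invented) subject requirement unified with a candidate
# subject: compatible information is merged, conflicts fail.
verb_subj = {"AGR": {"NUM": "sg"}}
subject = {"AGR": {"NUM": "sg", "PER": "3"}}
print(unify(verb_subj, subject))   # {'AGR': {'NUM': 'sg', 'PER': '3'}}
print(unify({"NUM": "sg"}, {"NUM": "pl"}))  # None
```

Unification is commutative and monotonic: each step only adds compatible information, which is what makes the order of constraint application irrelevant in a declarative grammar.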

    Computer Aided Verification

This open access two-volume set LNCS 10980 and 10981 constitutes the refereed proceedings of the 30th International Conference on Computer Aided Verification, CAV 2018, held in Oxford, UK, in July 2018. The 52 full and 13 tool papers presented together with 3 invited papers and 2 tutorials were carefully reviewed and selected from 215 submissions. The papers cover a wide range of topics and techniques, from algorithmic and logical foundations of verification to practical applications in distributed, networked, cyber-physical, and autonomous systems. They are organized in topical sections on model checking, program analysis using polyhedra, synthesis, learning, runtime verification, hybrid and timed systems, tools, probabilistic systems, static analysis, theory and security, SAT, SMT and decision procedures, concurrency, and CPS, hardware, and industrial applications.

    Tools and Algorithms for the Construction and Analysis of Systems

This open access two-volume set constitutes the proceedings of the 27th International Conference on Tools and Algorithms for the Construction and Analysis of Systems, TACAS 2021, which was held during March 27 – April 1, 2021, as part of the European Joint Conferences on Theory and Practice of Software, ETAPS 2021. The conference was planned to take place in Luxembourg but changed to an online format due to the COVID-19 pandemic. A total of 41 full papers presented in the proceedings were carefully reviewed and selected from 141 submissions. The volume also contains 7 tool papers, 6 tool demo papers, and 9 SV-Comp competition papers. The papers are organized in topical sections as follows: Part I: Game Theory; SMT Verification; Probabilities; Timed Systems; Neural Networks; Analysis of Network Communication. Part II: Verification Techniques (not SMT); Case Studies; Proof Generation/Validation; Tool Papers; Tool Demo Papers; SV-Comp Tool Competition Papers.