
    Detecting grammatical errors with treebank-induced, probabilistic parsers

    Today's grammar checkers often use hand-crafted rule systems that define acceptable language. The development of such rule systems is labour-intensive and has to be repeated for each language. At the same time, grammars automatically induced from syntactically annotated corpora (treebanks) are successfully employed in other applications, for example text understanding and machine translation. At first glance, treebank-induced grammars seem to be unsuitable for grammar checking as they massively over-generate and fail to reject ungrammatical input due to their high robustness. We present three new methods for judging the grammaticality of a sentence with probabilistic, treebank-induced grammars, demonstrating that such grammars can be successfully applied to automatically judge the grammaticality of an input string. Our best-performing method exploits the differences between parse results for grammars trained on grammatical and ungrammatical treebanks. The second approach builds an estimator of the probability of the most likely parse using grammatical training data that has previously been parsed and annotated with parse probabilities. If the estimated probability of an input sentence (whose grammaticality is to be judged by the system) is higher by a certain amount than the actual parse probability, the sentence is flagged as ungrammatical. The third approach extracts discriminative parse tree fragments in the form of CFG rules from parsed grammatical and ungrammatical corpora and trains a binary classifier to distinguish grammatical from ungrammatical sentences. The three approaches are evaluated on a large test set of grammatical and ungrammatical sentences. The ungrammatical test set is generated automatically by inserting common grammatical errors into the British National Corpus. 
    The results are compared to two traditional approaches, one that uses a hand-crafted, discriminative grammar, the XLE ParGram English LFG, and one based on part-of-speech n-grams. In addition, the baseline methods and the new methods are combined in a machine learning-based framework, yielding further improvements.
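
The second approach described above reduces to a simple decision rule: compare the parse probability an estimator predicts for a grammatical sentence against the probability the parser actually assigned. A minimal sketch, with purely illustrative feature weights and margin (the real estimator is trained on parsed grammatical data):

```python
# Sketch of the parse-probability-estimator approach. The weights and the
# margin below are hypothetical placeholders, not trained values: a model
# fit on parsed grammatical text predicts the log probability the best
# parse *should* have; if the actual parse probability falls short of that
# prediction by more than the margin, the sentence is flagged.

def predict_logprob(num_tokens, weights=(-2.5, -0.4)):
    """Estimate the expected log parse probability from sentence length.
    Intercept and per-token slope are illustrative, not trained."""
    intercept, per_token = weights
    return intercept + per_token * num_tokens

def flag_ungrammatical(actual_logprob, num_tokens, margin=3.0):
    """Flag the sentence if the estimated probability exceeds the actual
    parse probability by more than `margin` (in log space)."""
    return predict_logprob(num_tokens) - actual_logprob > margin

# Example: a 10-token sentence whose best parse scored log p = -12.0;
# the estimator expects about -6.5, a shortfall of 5.5, so it is flagged.
print(flag_ungrammatical(-12.0, 10))  # True
print(flag_ungrammatical(-7.0, 10))   # False (shortfall only 0.5)
```

In practice the estimator would use richer features than length alone; the point is only the comparison between expected and actual parse probability.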

    Data-Oriented Parsing with discontinuous constituents and function tags

    Statistical parsers are effective but are typically limited to producing projective dependencies or constituents. On the other hand, linguistically rich parsers recognize non-local relations and analyze both form and function phenomena but rely on extensive manual grammar development. We combine advantages of the two by building a statistical parser that produces richer analyses. We investigate new techniques to implement treebank-based parsers that allow for discontinuous constituents. We present two systems. One system is based on a string-rewriting Linear Context-Free Rewriting System (LCFRS), while using a Probabilistic Discontinuous Tree Substitution Grammar (PDTSG) to improve disambiguation performance. Another system encodes the discontinuities in the labels of phrase structure trees, allowing for efficient context-free grammar parsing. The two systems demonstrate that tree fragments as used in tree-substitution grammar improve disambiguation performance while capturing non-local relations on an as-needed basis. Additionally, we present results of models that produce function tags, resulting in a more linguistically adequate model of the data. We report substantial accuracy improvements in discontinuous parsing for German, English, and Dutch, including results on spoken Dutch.
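
The second system mentioned above encodes discontinuities in node labels so that an ordinary CFG parser can handle them. A minimal sketch of the idea; the label scheme here (PARENT*i/n for the i-th of n pieces) is an illustrative assumption, not the exact encoding used in the work:

```python
# Illustrative encoding of a discontinuous constituent in node labels, so
# that a context-free parser sees only contiguous pieces. The *i/n label
# scheme is hypothetical; the actual transformation is richer.

def split_discontinuous(label, spans):
    """Turn one discontinuous constituent into n contiguous pieces whose
    labels record how to reassemble it after parsing."""
    n = len(spans)
    return [(f"{label}*{i+1}/{n}", span) for i, span in enumerate(spans)]

def merge_pieces(pieces):
    """Inverse transformation: collect same-labelled pieces back into one
    constituent covering all their word spans."""
    label = pieces[0][0].split("*")[0]
    spans = [span for _, span in pieces]
    return (label, spans)

# A VP realised as two separated word spans, e.g. in a German V2 clause:
pieces = split_discontinuous("VP", [(0, 1), (4, 6)])
print(pieces)                # [('VP*1/2', (0, 1)), ('VP*2/2', (4, 6))]
print(merge_pieces(pieces))  # ('VP', [(0, 1), (4, 6)])
```

The round trip (split before training, merge after parsing) is what lets standard context-free machinery produce discontinuous analyses.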

    A tree-to-tree model for statistical machine translation

    Thesis (Ph.D.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2008. Includes bibliographical references (p. 227-234). In this thesis, we take a statistical tree-to-tree approach to solving the problem of machine translation (MT). In a statistical tree-to-tree approach, first the source-language input is parsed into a syntactic tree structure; then the source-language tree is mapped to a target-language tree. This kind of approach has several advantages. For one, parsing the input generates valuable information about its meaning. In addition, the mapping from a source-language tree to a target-language tree offers a mechanism for preserving the meaning of the input. Finally, producing a target-language tree helps to ensure the grammaticality of the output. A main focus of this thesis is to develop a statistical tree-to-tree mapping algorithm. Our solution involves a novel representation called an aligned extended projection, or AEP. The AEP, inspired by ideas in linguistic theory related to tree-adjoining grammars, is a parse-tree like structure that models clause-level phenomena such as verbal argument structure and lexical word-order. The AEP also contains alignment information that links the source-language input to the target-language output. Instead of learning a mapping from a source-language tree to a target-language tree, the AEP-based approach learns a mapping from a source-language tree to a target-language AEP. The AEP is a complex structure, and learning a mapping from parse trees to AEPs presents a challenging machine learning problem. In this thesis, we use a linear structured prediction model to solve this learning problem. A human evaluation of the AEP-based translation approach in a German-to-English task shows significant improvements in the grammaticality of translations.
    This thesis also presents a statistical parser for Spanish that could be used as part of a Spanish/English translation system. By Brooke Alissa Cowan, Ph.D.
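
As a rough intuition for the kind of clause-level record an AEP represents, here is a much-simplified, purely illustrative stand-in: a structure holding the target verb, the target-language argument order, and alignment links back to source-tree nodes. All field names are hypothetical; the real representation is considerably richer.

```python
# A toy AEP-like record, for intuition only: clause-level word-order
# template plus alignment links from target slots to source-tree nodes.
# Field names and the linearization rule are illustrative assumptions.

from dataclasses import dataclass, field

@dataclass
class ClauseAEP:
    verb: str                # target-language main verb
    arg_order: list          # e.g. ["SUBJ", "OBJ"]: target word order
    alignment: dict = field(default_factory=dict)  # target slot -> source node id

def linearize(aep, args):
    """Produce a flat target clause from the template, assuming an
    SVO-style rule where the verb follows the subject."""
    out = []
    for slot in aep.arg_order:
        out.append(args[slot])
        if slot == "SUBJ":
            out.append(aep.verb)
    return " ".join(out)

aep = ClauseAEP(verb="reads", arg_order=["SUBJ", "OBJ"],
                alignment={"SUBJ": "n1", "OBJ": "n3"})
print(linearize(aep, {"SUBJ": "she", "OBJ": "the book"}))  # she reads the book
```

The learning problem described in the abstract is then predicting such a structure from a source-language parse tree, which this sketch does not attempt.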

    On the format for parameters


    An investigation into near-nativeness at the syntax-lexicon interface: evidence from Dutch learners of English

    This thesis investigates whether there are differences in language comprehension and language production between highly advanced/near-native adult learners of a second language (late L2ers) and native speakers (L1ers), and if so, how they should be characterised. In previous literature (Sorace & Filiaci 2006, Sorace 2011 inter alia), nonconvergence of the near-native grammar with the native grammar has been identified as most likely to occur at the interface between syntax and another cognitive domain. This thesis focuses on grammatical and ungrammatical representations at the syntax-lexicon interface between very advanced/near-native Dutch learners of English and native speakers of English. We tested differences in syntactic knowledge representations and real-time processing through eight experiments. By syntactic knowledge representations we mean the explicit knowledge of grammar (specifically word order dependence on lexical-semantics) that a language user exhibits in their language comprehension and production, and by real-time processing we mean the language user’s ability to access implicit and explicit knowledge of grammar under time and/or memory constraints in their language comprehension and production. To test for systematic differences at the syntax-lexicon interface we examined linguistic structures in English that differ minimally in word order from Dutch depending on the presence or absence of certain lexical items and their characteristics; these were possessive structures with animate and inanimate possessors and possessums in either a prenominal or postnominal construction, preposed adverbials of location (locative inversions) followed by either unergative or unaccusative verbs, and preposed adverbials of manner containing a negative polarity item (negative inversions) or positive polarity item followed by either V2 or V3 word order.
    We used Magnitude Estimation Tasks and Speeded Grammaticality Judgement Tasks to test comprehension, and Syntactic Priming (with/without extra memory load) and Speeded Sentence Completion Tasks to test production. We found evidence for differences in comprehension and production between very advanced, near-native Dutch L2ers and native speakers of English, and that these differences appear to be associated with processing rather than with competence. Dutch L2ers differed from English L1ers with respect to preferences in word order of possessive structures and after preposed adverbials of manner. However, these groups did not differ in production and comprehension with respect to transitivity in locative inversions. We conclude that even among highly advanced to near-native late learners of a second language there may be non-convergence of the L2 grammar. Such non-convergence need not coincide with the L1 grammar but may rather be a result of over-applying linguistic L2 knowledge. Thus, very advanced to near-native L2ers still have access to limited (meta)linguistic resources that under time and memory constraints may result in ungrammatical language comprehension and/or production at the syntax-lexicon interface. In sum, in explaining interface phenomena, the results of this study provide evidence for a processing account over a representational account, i.e. Dutch L2ers showed they possess grammatical knowledge of the specific L2 linguistic structures in comprehension and production, but over-applied this knowledge in exceptional cases under time and/or memory pressure. We suggest that current bilingual production models should attend more closely to working memory, by including a separate memory component in such models and by conducting empirical research to test its influence on L2 production and comprehension.

    On the metatheory of linguistics

    Wurm C. On the metatheory of linguistics. Bielefeld: UB Bielefeld; 2013.

    Syntax with oscillators and energy levels

    This book presents a new approach to studying the syntax of human language, one which emphasizes how we think about time. Tilsen argues that many current theories are unsatisfactory because those theories conceptualize syntactic patterns with spatially arranged structures of objects. These object-structures are atemporal and do not lend themselves well to reasoning about time. The book develops an alternative conceptual model in which oscillatory systems of various types interact with each other through coupling forces, and in which the relative energies of those systems are organized in particular ways. Tilsen emphasizes that the two primary mechanisms of the approach – oscillators and energy levels – require alternative ways of thinking about time. Furthermore, his theory leads to a new way of thinking about grammaticality and the recursive nature of language. The theory is applied to a variety of syntactic phenomena: word order, phrase structure, morphosyntax, constituency, case systems, ellipsis, anaphora, and islands. The book also presents a general program for the study of language in which the construction of linguistic theories is itself an object of theoretical analysis. Reviewed by John Goldsmith, Mark Gibson and an anonymous reviewer. Signed reports are openly available in the downloads section.
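
For readers unfamiliar with coupled oscillators, the generic mechanism the blurb names can be sketched with standard Kuramoto-style phase coupling. This is not the book's specific formalism, only the textbook dynamics it builds on: each oscillator's phase is pulled toward the other's, so the two synchronise over time.

```python
# Generic two-oscillator phase coupling (standard Kuramoto dynamics, not
# the book's formalism): with sufficient coupling strength k, the phase
# difference settles at a small locked value despite differing natural
# frequencies.

import math

def step(phases, freqs, k, dt=0.01):
    """One Euler step for two phase oscillators with coupling strength k."""
    p1, p2 = phases
    d1 = freqs[0] + k * math.sin(p2 - p1)
    d2 = freqs[1] + k * math.sin(p1 - p2)
    return (p1 + d1 * dt, p2 + d2 * dt)

phases = (0.0, 2.0)                       # start two radians apart
for _ in range(5000):                     # simulate 50 time units
    phases = step(phases, (1.0, 1.2), k=2.0)

# The phase difference shrinks toward a small locked value
# (analytically sin(d) = 0.2 / (2k), so d is about 0.05 here):
print(abs(phases[1] - phases[0]))
```

The interesting move in the book is layering linguistic structure on top of such dynamics; this sketch only shows the raw synchronisation behaviour.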