17 research outputs found

    Grammar Induction from Text Using Small Syntactic Prototypes

    We present an efficient technique for incorporating a small number of cross-linguistic parameter settings, which define default word orders, into otherwise unsupervised grammar induction. A syntactic prototype, represented by an integrated model of Categorial Grammar and dependency structure and generated from the language parameters, is used to prune the search space. We also propose heuristics that prefer less complex syntactic categories to more complex ones in parse decoding. The system reduces the errors of state-of-the-art baselines for WSJ10 (1% error reduction in F1 for the model trained on Sections 2–22 and tested on Section 23), Chinese10 (26% error reduction in F1), German10 (9% error reduction in F1), and Japanese10 (8% error reduction in F1), and is not significantly different from the baseline for Czech10.
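The core idea of pruning an induction search space with word-order parameters can be sketched as follows. This is an illustrative toy, not the paper's implementation; the parameter names, tags, and candidate format are assumptions made for the example.

```python
# Sketch (not the paper's implementation): a handful of declared word-order
# parameters filter dependency-attachment candidates before induction runs.
# Parameter names and values below are illustrative assumptions.

WORD_ORDER_PARAMS = {
    "SVO": True,        # verbs precede their objects
    "Det-Noun": True,   # determiners precede their nouns
}

def allowed_attachment(head_tag, dep_tag, head_pos, dep_pos):
    """Return False for attachments that contradict a declared word order."""
    if head_tag == "Verb" and dep_tag == "Noun" and WORD_ORDER_PARAMS["SVO"]:
        return head_pos < dep_pos          # object must follow the verb
    if head_tag == "Noun" and dep_tag == "Det" and WORD_ORDER_PARAMS["Det-Noun"]:
        return dep_pos < head_pos          # determiner must precede its noun
    return True                            # no parameter applies: keep candidate

# Prune the search space: only parameter-consistent attachments survive.
candidates = [("Verb", "Noun", 0, 2), ("Noun", "Det", 2, 1), ("Noun", "Det", 1, 2)]
pruned = [c for c in candidates if allowed_attachment(*c)]
```

Each rejected candidate removes a branch of the parse search, which is how a small set of prototypes can shrink the space substantially.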

    Plaesarn: Machine-Aided Translation Tool for English-to-Thai

    Current English-Thai MT systems are restricted by incomplete vocabularies and translation knowledge, so users must accept a single translation result that is sometimes semantically divergent or ungrammatical. For this reason, we propose novel Internet-based translation assistant software to facilitate document translation from English to Thai. The project uses the structural transfer model as its mechanism. It differs from current English-Thai MT systems in that it empowers users to manually select the most appropriate translation from every possibility and to manually train new translation rules into the system when necessary. With this model, we overcome four translation problems: lexicon rearrangement, structural ambiguity, phrase translation, and classifier generation. Finally, we evaluated the system on 322 randomly selected sentences from the Future Magazine bilingual corpus; it yielded 59.87% and 83.08% translation accuracy for the best and worst cases, respectively, based on 90.1% average precision of the parser.

    Scalable semi-supervised grammar induction using cross-linguistically parameterized syntactic prototypes

    This thesis is about the task of unsupervised parser induction: automatically learning grammars and parsing models from raw text. We endeavor to induce such parsers by observing sequences of terminal symbols. We focus on overcoming the problem of frequent collocation, a major source of error in grammar induction. For example, since a verb and a determiner tend to co-occur in a verb phrase, the probability of attaching the determiner to the verb is sometimes higher than that of attaching the core noun to the verb, resulting in the erroneous attachment *((Verb Det) Noun) instead of (Verb (Det Noun)). Although frequent collocation is at the heart of grammar induction, it can seriously distort the grammar distribution. Natural language grammars follow a Zipfian (power-law) distribution, in which the frequency of any grammar rule is inversely proportional to its rank in the frequency table. We believe that covering the most frequent grammar rules in grammar induction will have a strong impact on accuracy. We propose an efficient approach to grammar induction guided by cross-linguistic language parameters. Our language parameters consist of 33 parameters of frequent basic word orders, which are easy to elicit from grammar compendiums or short interviews with naïve language informants. These parameters are designed to capture the frequent word orders in the Zipfian distribution of natural language grammars, while the rest of the grammar, including exceptions, can be induced automatically from unlabeled data. The language parameters shrink the search space of the grammar induction problem by exploiting both word order information and predefined attachment directions. The contribution of this thesis is three-fold. (1) We show that the language parameters generalize adequately across languages, as our grammar induction experiments are carried out on 14 languages on top of a simple unsupervised grammar induction system. (2) Our specification of language parameters improves the accuracy of unsupervised parsing even when the parser is exposed to much less frequent linguistic phenomena in longer sentences, with accuracy decreasing by less than 10%. (3) We investigate the prevalent sources of error in grammar induction, which leaves room for accuracy improvement. The proposed language parameters efficiently cover the most frequent grammar rules in natural languages. With only 10 man-hours for preparing syntactic prototypes, our approach improves the accuracy of directed dependency recovery over Gillenwater et al.'s (2010) state-of-the-art completely unsupervised parser in: (1) Chinese by 30.32%, (2) Swedish by 28.96%, (3) Portuguese by 37.64%, (4) Dutch by 15.17%, (5) German by 14.21%, (6) Spanish by 13.53%, (7) Japanese by 13.13%, (8) English by 12.41%, (9) Czech by 9.16%, (10) Slovene by 7.24%, (11) Turkish by 6.72%, and (12) Bulgarian by 5.96%. Although the directed dependency accuracies of some languages are below 60%, their TEDEVAL scores are still satisfactory (approximately 80%). This suggests that our parsed trees are in fact closely related to the gold-standard trees despite the discrepancy in annotation schemes. We perform an error analysis of over- and under-generation and find three prevalent sources of error: (1) PP attachment, (2) discrepancies between dependency annotation schemes, and (3) rich morphology. The methods presented in this thesis were originally presented in Boonkwan and Steedman (2011). The thesis presents a great deal more detail in the design of cross-linguistic language parameters, the algorithm for lexicon inventory construction, experimental results, and error analysis.
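The Zipfian argument above — that a small number of top-ranked rules covers a large share of all rule occurrences — can be checked numerically. The rule counts below are synthetic, not thesis data; the computation just illustrates the 1/rank law.

```python
# Illustration (synthetic, not thesis data): under a Zipfian distribution,
# where the frequency of the rule at rank r is proportional to 1/r, a small
# number of top-ranked rules covers a large share of all rule occurrences.

def zipf_coverage(n_rules, top_k):
    """Fraction of total rule mass covered by the top_k most frequent rules."""
    weights = [1.0 / r for r in range(1, n_rules + 1)]
    return sum(weights[:top_k]) / sum(weights)

# With 1000 rule types, the 33 most frequent ones (mirroring the thesis's
# 33 word-order parameters) already cover more than half of the rule mass.
coverage = zipf_coverage(1000, 33)
```

This is why capturing only the most frequent word orders with hand-specified parameters, and leaving the long tail to unsupervised induction, can plausibly move accuracy so much.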

    A Linguistics-Driven Approach to Statistical Parsing for Low-Resourced Languages


    An Integrated Tool for Translation-Memory Maintenance

    This paper presents an integrated tool for constructing and maintaining translation memory for memory-based machine translation. The tool aims to automate the construction and validation of translation memory at both the word and phrase levels from English-Thai parallel texts. Aligning English-Thai words and phrases requires resolving several crucial problems: multiple-word-expression boundary ambiguity, word-sense ambiguity, and excessive alignment generation. The system consists of three components: (1) a word-alignment tool, (2) a phrase-alignment tool, and (3) a validation tool. The distribution and collocation of words and phrases are analyzed to generate alignment candidates for manual selection.
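Generating alignment candidates from collocation statistics, as the last sentence describes, can be sketched with a simple association measure. This uses the Dice coefficient as a stand-in score; the toy corpus, romanized Thai words, and threshold are illustrative assumptions, not from the paper.

```python
# Sketch under stated assumptions: ranking word-alignment candidates by a
# collocation score (Dice coefficient) over sentence-aligned parallel text.
# The toy corpus and the 0.8 threshold are illustrative, not from the paper.
from collections import Counter

def dice_candidates(parallel, threshold=0.8):
    """Return (src, tgt) pairs whose Dice co-occurrence score passes threshold."""
    src_count, tgt_count, pair_count = Counter(), Counter(), Counter()
    for src_sent, tgt_sent in parallel:
        for s in set(src_sent):
            src_count[s] += 1
        for t in set(tgt_sent):
            tgt_count[t] += 1
        for s in set(src_sent):
            for t in set(tgt_sent):
                pair_count[(s, t)] += 1
    # Dice = 2 * joint / (marginal_s + marginal_t); high scorers become candidates
    return {(s, t): 2 * c / (src_count[s] + tgt_count[t])
            for (s, t), c in pair_count.items()
            if 2 * c / (src_count[s] + tgt_count[t]) >= threshold}

parallel = [(["red", "book"], ["nangsue", "sidaeng"]),
            (["red", "car"], ["rot", "sidaeng"])]
candidates = dice_candidates(parallel)
```

The surviving pairs would then be presented to a human validator, matching the manual-selection step the abstract describes.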

    Memory-inductive categorial grammar: an approach to gap resolution in analytic-language translation

    This paper presents a generalized framework for syntax-based gap resolution in analytic-language translation using an extended version of categorial grammar. Translating analytic languages into Indo-European languages suffers from gapping issues, because “deletion under coordination” and “verb serialization” must be resolved beforehand. Rudimentary operations, i.e. antecedent memorization, gap induction, and gap resolution, are introduced into the categorial grammar to resolve gapping issues syntactically. Pronominal references can thereby be generated for deletion under coordination, while sentence structures can be properly selected for verb serialization.
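The memorization/resolution cycle can be illustrated with a deliberately simplified toy: an antecedent object is memorized in the first conjunct and fills the induced gap in the second, mirroring "deletion under coordination". The clause representation and field names are invented for this sketch and are far simpler than the paper's categorial-grammar machinery.

```python
# Toy sketch (illustrative only, not the paper's formalism): antecedent
# memorization in the first conjunct, then gap resolution in later conjuncts,
# as in "John bought a book and read [gap]" -> "... and read it/the book".

def resolve_gaps(clauses):
    """Fill object gaps in coordinated clauses from a memorized antecedent."""
    memory = None
    resolved = []
    for clause in clauses:
        if clause.get("OBJ") is not None:
            memory = clause["OBJ"]              # antecedent memorization
        elif memory is not None:
            clause = {**clause, "OBJ": memory}  # gap resolution
        resolved.append(clause)
    return resolved

clauses = [{"VERB": "bought", "OBJ": "book"},
           {"VERB": "read", "OBJ": None}]       # object deleted under coordination
out = resolve_gaps(clauses)
```

With the gap filled, a generator can emit an explicit pronoun or noun phrase in the target language, which is what makes the Indo-European output well-formed.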

    Parsing Thai for Machine Translation using Augmented State Transducer and Lexical Functional Grammar (NCSEC 2004)

    This paper presents an efficient yet fine-grained approach to parsing Thai texts, intended to resolve omission problems and sentential-NP grouping for Thai-English machine translation. The omission problems are zero anaphora, the absence of explicit tense and number, and the absence of explicit topic markers. To resolve them, an augmented state transducer is exploited for noun grouping, and lexical functional grammar is applied to identify omissions. The experiment showed that the augmented state transducer could properly resolve sentential-noun grouping, while most omissions could be identified by the lexical functional grammar. On average, the parser yields 80.72% accuracy, and the number of produced trees is reduced by 30.36% compared with that of the original LFG.
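The omission-identification step can be pictured with a minimal sketch: a shallow clause lacking a subject or tense marking gets annotated so a later transfer step can generate those elements explicitly in English. The clause representation and marker names here are hypothetical, not the paper's transducer/LFG implementation.

```python
# Illustrative sketch (not the paper's transducer/LFG implementation): marking
# Thai-style omissions -- zero anaphora and implicit tense -- so a transfer
# step can later generate them explicitly in English.

def mark_omissions(clause):
    """Annotate a shallow clause dict with elements commonly omitted in Thai."""
    marked = dict(clause)
    if "SUBJ" not in clause:
        marked["SUBJ"] = "PRO-drop"     # zero anaphora: subject omitted
    if "TENSE" not in clause:
        marked["TENSE"] = "UNDERSPEC"   # no explicit tense marking
    return marked

clause = {"VERB": "kin", "OBJ": "khaaw"}   # "eat rice" with a dropped subject
marked = mark_omissions(clause)
```

Downstream, the "PRO-drop" placeholder would be resolved to an English pronoun and "UNDERSPEC" to a tense chosen from context.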