117 research outputs found

    Robust Parsing for Ungrammatical Sentences

    Natural Language Processing (NLP) is a research area that specializes in computational approaches to human language. However, not all natural language sentences are grammatically correct. Ungrammatical, awkward, or overly casual/colloquial sentences appear in a variety of NLP applications, from product reviews and social media analysis to intelligent language tutors and multilingual processing. In this thesis, we focus on parsing, because it is an essential component of many NLP applications. We investigate the ways in which the performance of statistical parsers degrades when dealing with ungrammatical sentences. We also hypothesize that breaking parse trees apart at problematic points prevents NLP applications from degrading due to incorrect syntactic analysis. A parser is robust if it can overlook problems such as grammar mistakes and produce a parse tree that closely resembles the correct analysis for the intended sentence. We develop a robustness evaluation metric and conduct a series of experiments comparing the performance of state-of-the-art parsers on ungrammatical sentences. The evaluation results show that ungrammatical sentences present challenges for statistical parsers, because the well-formed syntactic trees they produce may not be appropriate for ungrammatical sentences. We also define a new framework for reviewing the parses of ungrammatical sentences and extracting the coherent parts whose syntactic analyses make sense; we call this task parse tree fragmentation. The experimental results suggest that the proposed fragmentation framework is a promising way to handle syntactically unusual sentences.
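The robustness evaluation the abstract describes can be illustrated with a minimal sketch. The representation below (each parse reduced to a set of `(head, label, dependent)` triples, scored by F1 overlap against the gold parse of the intended sentence) is an assumption for illustration; the thesis's actual metric may differ.

```python
# Hedged sketch: score a parser's robustness as the F1 overlap between the
# dependency triples it produces for an ungrammatical sentence and the gold
# triples of the intended (corrected) sentence. The (head, label, dependent)
# triple representation is an assumption, not the thesis's exact metric.

def f1_overlap(predicted, gold):
    """F1 between two sets of dependency triples."""
    if not predicted or not gold:
        return 0.0
    matched = len(predicted & gold)
    precision = matched / len(predicted)
    recall = matched / len(gold)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Parse of ungrammatical "She go to school" scored against the gold parse of
# the intended "She goes to school": only the triple on unchanged words matches.
predicted = {("go", "nsubj", "She"), ("go", "prep", "to"), ("to", "pobj", "school")}
gold = {("goes", "nsubj", "She"), ("goes", "prep", "to"), ("to", "pobj", "school")}
print(round(f1_overlap(predicted, gold), 2))  # 0.33
```

A real metric would also need to align differing word forms ("go" vs. "goes"), which is exactly the kind of mismatch that makes evaluating parses of ungrammatical input hard.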

    Seeing the wood for the trees: data-oriented translation

    Data-Oriented Translation (DOT), which is based on Data-Oriented Parsing (DOP), is an experience-based approach to translation in which new translations are derived with reference to grammatical analyses of previous translations. Previous DOT experiments [Poutsma, 1998, Poutsma, 2000a, Poutsma, 2000b] were small in scale because important advances in DOP technology were not incorporated into the translation model. Despite this, related work [Way, 1999, Way, 2003a, Way, 2003b] reports that DOT models are viable in that solutions to ‘hard’ translation cases are readily available. However, it has not been shown to date that DOT models scale to larger datasets. In this work, we describe a novel DOT system inspired by recent advances in DOP parsing technology. We test our system on larger, more complex corpora than have been used heretofore, and present both automatic and human evaluations showing that high-quality translations can be achieved at reasonable speeds.

    Using Natural Language as Knowledge Representation in an Intelligent Tutoring System

    Knowledge used by an intelligent tutoring system to teach students is usually acquired from authors who are experts in the domain. A problem is that they cannot directly add and update knowledge unless they learn the formal language used in the system. Using natural language to represent knowledge allows authors to update knowledge easily. This thesis presents a new approach that uses unconstrained natural language as the knowledge representation for a physics tutoring system, so that non-programmers can add knowledge without learning a new knowledge representation. This approach allows domain experts to add not only problem statements but also background knowledge, such as commonsense and domain knowledge, including principles stated in natural language. Rather than being translated into a formal language, the natural language representation is used directly in inference, so that domain experts can understand the internal process, detect knowledge bugs, and revise the knowledge base easily. In authoring studies with the new system, the amount of knowledge that had to be added was small enough for a domain expert to manage, and it converged to near zero as more problems were added in one mental-model test. After the system reached the no-new-knowledge state in that test, 5 out of 13 problems (38 percent) were solved automatically without adding new knowledge.

    CCG-augmented hierarchical phrase-based statistical machine translation

    Augmenting Statistical Machine Translation (SMT) systems with syntactic information aims at improving translation quality. Hierarchical Phrase-Based (HPB) SMT takes a step toward incorporating syntax in Phrase-Based (PB) SMT by modelling one aspect of language syntax, namely the hierarchical structure of phrases. Syntax Augmented Machine Translation (SAMT) further incorporates syntactic information extracted using a context-free phrase structure grammar (CF-PSG) in the HPB SMT model. One of the main challenges facing CF-PSG-based augmentation approaches for SMT systems emerges from the difference between the definition of the constituent in CF-PSG and the ‘phrase’ in SMT systems, which hinders the ability of CF-PSG to express the syntactic function of many SMT phrases. Although the SAMT approach to solving this problem, using ‘CCG-like’ operators to combine constituent labels, improves syntactic constraint coverage, it significantly increases their sparsity, which restricts translation and negatively affects its quality. In this thesis, we address the problems of sparsity and limited coverage of syntactic constraints facing the CF-PSG-based syntax augmentation approaches for HPB SMT using Combinatory Categorial Grammar (CCG). We demonstrate that CCG’s flexible structures and rich syntactic descriptors help to extract richer, more expressive, and less sparse syntactic constraints with better coverage than CF-PSG, which enables our CCG-augmented HPB system to outperform the SAMT system. We also try to soften the syntactic constraints imposed by CCG category nonterminal labels by extracting less fine-grained CCG-based labels. We demonstrate that CCG label simplification helps to significantly improve the performance of our CCG category HPB system. Finally, we identify the factors which limit the coverage of the syntactic constraints in our CCG-augmented HPB model. We then try to tackle these factors by extending the definition of the nonterminal label to be composed of a sequence of CCG categories and augmenting the glue grammar with CCG combinatory rules. We demonstrate that our extension approaches help to significantly increase the scope of the syntactic constraints applied in our CCG-augmented HPB model and achieve significant improvements over the HPB SMT baseline.
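One concrete form CCG label simplification can take is stripping morphosyntactic features such as `[dcl]` or `[ng]` from category labels, which collapses many fine-grained nonterminals onto one coarser label. This particular scheme is an illustrative assumption, not necessarily the one used in the thesis:

```python
import re

def simplify_ccg(category):
    """Strip bracketed features (e.g. [dcl], [ng]) from a CCG category.

    Collapses, say, (S[dcl]\\NP)/NP and (S[ng]\\NP)/NP onto the single
    coarser label (S\\NP)/NP, reducing nonterminal-label sparsity.
    """
    return re.sub(r"\[[^\]]*\]", "", category)

print(simplify_ccg(r"(S[dcl]\NP)/NP"))  # (S\NP)/NP
print(simplify_ccg(r"(S[ng]\NP)/NP"))   # (S\NP)/NP
```

Fewer distinct labels mean each label is observed more often in training, which is the sparsity/coverage trade-off the abstract describes.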

    A novel dependency-based evaluation metric for machine translation

    Automatic evaluation measures such as BLEU (Papineni et al. (2002)) and NIST (Doddington (2002)) are indispensable in the development of Machine Translation (MT) systems, because they allow MT developers to conduct frequent, fast, and cost-effective evaluations of their evolving translation models. However, most automatic evaluation metrics rely on a comparison of word strings, measuring only the surface similarity of the candidate and reference translations, and will penalize any divergence. In effect, a candidate translation expressing the source meaning accurately and fluently will be given a low score if the lexical and syntactic choices it contains, even though perfectly legitimate, are not present in at least one of the references. Necessarily, this score would differ from the much more favourable human judgment that such a translation would receive. This thesis presents a method that automatically evaluates the quality of a translation based on the labelled dependency structure of the sentence, rather than on its surface form. Dependencies abstract away from some of the particulars of the surface string realization and provide a more "normalized" representation of (some) syntactic variants of a given sentence. The translation and reference files are analyzed by a treebank-based, probabilistic Lexical-Functional Grammar (LFG) parser (Cahill et al. (2004)) for English, which produces a set of dependency triples for each input. The translation set is compared to the reference set, and the number of matches is calculated, giving the precision, recall, and f-score for that particular translation. The use of WordNet synonyms and partial matching during the evaluation process allows for adequate treatment of lexical variation, while employing a number of best parses helps neutralize the noise introduced during the parsing stage. The dependency-based method is compared against a number of other popular MT evaluation metrics, including BLEU, NIST, GTM (Turian et al. (2003)), TER (Snover et al. (2006)), and METEOR (Banerjee and Lavie (2005)), in terms of segment- and system-level correlations with human judgments of fluency and adequacy. We also examine whether it shows bias towards statistical MT models. The comparison of the dependency-based method with other evaluation metrics is then extended to languages other than English: French, German, Spanish, and Japanese, where we apply our method to dependencies generated by Microsoft's NLPWin analyzer (Corston-Oliver and Dolan (1999); Heidorn (2000)) as well as, in the case of the Spanish data, those produced by the treebank-based, probabilistic LFG parser of Chrupała and van Genabith (2006a,b).
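The triple-comparison step described above can be sketched as precision, recall, and f-score over `(label, head, dependent)` triples. The toy synonym table below stands in for the WordNet lookup, and the exact matching rules are assumptions for illustration, not the thesis's implementation:

```python
# Hedged sketch of the dependency-triple comparison: precision, recall, and
# f-score over (label, head, dependent) triples, with a toy synonym table
# standing in for WordNet lookup. Matching rules are illustrative only.

SYNONYMS = {"buy": {"purchase"}, "purchase": {"buy"}}  # toy stand-in for WordNet

def words_match(a, b):
    """Exact match, or match via the synonym table."""
    return a == b or b in SYNONYMS.get(a, set())

def triples_match(t1, t2):
    """Labels must match exactly; head and dependent may match via synonymy."""
    return t1[0] == t2[0] and all(words_match(x, y) for x, y in zip(t1[1:], t2[1:]))

def prf(candidate, reference):
    """Precision, recall, and f-score of candidate triples against reference."""
    matched = sum(any(triples_match(c, r) for r in reference) for c in candidate)
    p = matched / len(candidate)
    r = matched / len(reference)
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

# Candidate uses "purchase" where the reference uses "buy"; a string-based
# metric would miss both triples, but synonym matching recovers them.
cand = [("subj", "purchase", "she"), ("obj", "purchase", "car")]
ref = [("subj", "buy", "she"), ("obj", "buy", "car"), ("det", "car", "a")]
p, r, f = prf(cand, ref)
print(round(p, 2), round(r, 2), round(f, 2))  # 1.0 0.67 0.8
```

This is the core reason the metric tolerates legitimate lexical variation that surface-string metrics such as BLEU penalize.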

    Dependency reordering features for Japanese-English phrase-based translation

    Thesis (M.Eng.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2008. By Jason Edward Katz-Brown. Includes bibliographical references (p. 101-106).

    Translating Japanese into English is very challenging because of the vast difference in word order between the two languages. For example, the main verb always comes at the very end of a Japanese sentence, whereas it comes near the beginning of an English sentence. In this thesis, we develop a Japanese-to-English translation system capable of performing the long-distance reordering necessary to fluently translate Japanese into English. Our system uses novel feature functions, based on a dependency parse of the input Japanese sentence, which identify candidate translations that put dependency relationships into correct English order. For example, one feature identifies translations that put verbs before their objects. The weights for these feature functions are discriminatively trained, and so can be used for any language pair. In our Japanese-to-English system, they improve the BLEU score from 27.96 to 28.54, and we show clear improvements in subjective quality. We also experiment with a well-known technique of training the translation system on a Japanese training corpus that has been reordered into an English-like word order. Impressive results can be achieved by naively reordering each Japanese sentence into reverse order, and translating these reversed sentences with the dependency-parse-based feature functions gives further improvement. Finally, we evaluate our translation systems with human judgment, BLEU score, and METEOR score. We compare these metrics at the corpus and sentence level and examine how well they capture improvements in translation word order.
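Two ideas from the abstract above can be sketched in miniature: the naive reversal baseline, and a feature function that rewards candidate translations placing verbs before their objects. The toy tokens, alignment representation, and function names are assumptions for illustration, not the actual system:

```python
# Hedged sketch of two ideas from the abstract (toy data, not the real system):
# (1) naive pre-reordering that reverses an SOV Japanese sentence into a
#     roughly English-like order, and
# (2) a feature that counts dependency pairs realized with the verb before
#     its object in the candidate translation, via word alignments.

def reverse_reorder(tokens):
    """Naive baseline: reversing an SOV sentence moves the verb to the front."""
    return tokens[::-1]

# "watashi wa hon o yomu" (I TOP book OBJ read): the verb "yomu" ends the
# Japanese sentence; reversal moves it to the front, as in English.
print(reverse_reorder(["watashi", "wa", "hon", "o", "yomu"]))

def verb_before_object(candidate_positions, verb_object_pairs):
    """Count dependency pairs whose verb precedes its object in the candidate.

    candidate_positions: source token -> aligned position in the candidate
    verb_object_pairs: (verb, object) pairs from the source dependency parse
    """
    return sum(candidate_positions[verb] < candidate_positions[obj]
               for verb, obj in verb_object_pairs)

# Candidate "I read a book": "yomu" aligns to position 1, "hon" to position 3,
# so the verb-before-object feature fires once.
print(verb_before_object({"yomu": 1, "hon": 3}, [("yomu", "hon")]))  # 1
```

In a real system such feature values would be weighted by discriminatively trained parameters and combined with the usual translation and language model scores.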

    GLR parsing with multiple grammars for natural language queries.

    Luk Po Chui. Thesis (M.Phil.)--Chinese University of Hong Kong, 2000. Includes bibliographical references (leaves 97-100). Abstracts in English and Chinese.

    Contents:
    Chapter 1 Introduction
        1.1 Efficiency and Memory
        1.2 Ambiguity
        1.3 Robustness
        1.4 Thesis Organization
    Chapter 2 Background
        2.1 Introduction
        2.2 Context-Free Grammars
        2.3 The LR Parsing Algorithm
        2.4 The Generalized LR Parsing Algorithm
            2.4.1 Graph-Structured Stack
            2.4.2 Packed Shared Parse Forest
        2.5 Time and Space Complexity
        2.6 Related Work on Parsing
            2.6.1 GLR*
            2.6.2 TINA
            2.6.3 PHOENIX
        2.7 Chapter Summary
    Chapter 3 Grammar Partitioning
        3.1 Introduction
        3.2 Motivation
        3.3 Previous Work on Grammar Partitioning
        3.4 Our Grammar Partitioning Approach
            3.4.1 Definitions and Concepts
            3.4.2 Guidelines for Grammar Partitioning
        3.5 An Example
        3.6 Chapter Summary
    Chapter 4 Parser Composition
        4.1 Introduction
        4.2 GLR Lattice Parsing
            4.2.1 Lattice with Multiple Granularity
            4.2.2 Modifications to the GLR Parsing Algorithm
        4.3 Parser Composition Algorithms
            4.3.1 Parser Composition by Cascading
            4.3.2 Parser Composition with Predictive Pruning
            4.3.3 Comparison of Parser Composition by Cascading and Parser Composition with Predictive Pruning
        4.4 Chapter Summary
    Chapter 5 Experimental Results and Analysis
        5.1 Introduction
        5.2 Experimental Corpus
        5.3 ATIS Grammar Development
        5.4 Grammar Partitioning and Parser Composition on ATIS Domain
            5.4.1 ATIS Grammar Partitioning
            5.4.2 Parser Composition on ATIS
        5.5 Ambiguity Handling
        5.6 Semantic Interpretation
            5.6.1 Best Path Selection
            5.6.2 Semantic Frame Generation
            5.6.3 Post-Processing
        5.7 Experiments
            5.7.1 Grammar Coverage
            5.7.2 Size of Parsing Table
            5.7.3 Computational Costs
            5.7.4 Accuracy Measures in Natural Language Understanding
            5.7.5 Summary of Results
        5.8 Chapter Summary
    Chapter 6 Conclusions
        6.1 Thesis Summary
        6.2 Thesis Contributions
        6.3 Future Work
            6.3.1 Statistical Approach on Grammar Partitioning
            6.3.2 Probabilistic Modeling for Best Parse Selection
            6.3.3 Robust Parsing Strategies
    Bibliography
    Appendix A ATIS-3 Grammar
        A.1 English ATIS-3 Grammar Rules
        A.2 Chinese ATIS-3 Grammar Rules