204 research outputs found

    The Circle of Meaning: From Translation to Paraphrasing and Back

    The preservation of meaning between inputs and outputs is perhaps the most ambitious and, often, the most elusive goal of systems that attempt to process natural language. Nowhere is this goal of more obvious importance than for the tasks of machine translation and paraphrase generation. Preserving meaning between the input and the output is paramount for both, the monolingual vs. bilingual distinction notwithstanding. In this thesis, I present a novel, symbiotic relationship between these two tasks that I term the "circle of meaning". Today's statistical machine translation (SMT) systems require high-quality human translations for parameter tuning, in addition to large bi-texts for learning the translation units. This parameter tuning usually involves generating translations at different points in the parameter space and obtaining feedback, against human-authored reference translations, as to how good the translations are. This feedback then dictates what point in the parameter space should be explored next. To measure this feedback, it is generally considered wise to have multiple (usually 4) reference translations to avoid unfair penalization of translation hypotheses, which could easily happen given the large number of ways in which a sentence can be translated from one language to another. However, this reliance on multiple reference translations creates a problem, since they are labor-intensive and expensive to obtain. Therefore, most current MT datasets contain only a single reference. This leads to the problem of reference sparsity, the primary open problem that I address in this dissertation and one that has a serious effect on the SMT parameter tuning process. Bannard and Callison-Burch (2005) were the first to provide a practical connection between phrase-based statistical machine translation and paraphrase generation. However, their technique is restricted to generating phrasal paraphrases. 
I build upon their approach and augment a phrasal paraphrase extractor into a sentential paraphraser with extremely broad coverage. The novelty in this augmentation lies in the further strengthening of the connection between statistical machine translation and paraphrase generation: whereas Bannard and Callison-Burch only relied on SMT machinery to extract phrasal paraphrase rules and stopped there, I take it a few steps further and build a full English-to-English SMT system. This system can, as expected, "translate" any English input sentence into a new English sentence with the same degree of meaning preservation that exists in a bilingual SMT system. In fact, being a state-of-the-art SMT system, it is able to generate n-best "translations" for any given input sentence. This sentential paraphraser, built almost entirely from existing SMT machinery, represents the first 180 degrees of the circle of meaning. To complete the circle, I describe a novel connection in the other direction. I claim that the sentential paraphraser, once built in this fashion, can provide a solution to the reference sparsity problem and, hence, be used to improve the performance of a bilingual SMT system. I discuss two different instantiations of the sentential paraphraser and show several results that provide empirical validation for this connection.
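
The phrasal technique of Bannard and Callison-Burch (2005) that this abstract builds on scores a paraphrase by pivoting through a foreign phrase: p(e2 | e1) is the sum over pivot phrases f of p(e2 | f) · p(f | e1). The following is a minimal sketch of that scoring step only, not of the thesis's sentential paraphraser. The phrases, the probabilities, and the function name `paraphrase_probs` are all invented for illustration.

```python
from collections import defaultdict

# Toy phrase-table entries; every phrase and number below is hypothetical.
# p_f_given_e[e][f] = p(f | e): English phrase -> foreign pivot phrase.
p_f_given_e = {
    "under control": {"unter kontrolle": 0.8, "im griff": 0.2},
}
# p_e_given_f[f][e] = p(e | f): foreign pivot phrase -> English phrase.
p_e_given_f = {
    "unter kontrolle": {"under control": 0.7, "in check": 0.3},
    "im griff": {"under control": 0.5, "in hand": 0.5},
}


def paraphrase_probs(phrase):
    """Return p(candidate | phrase) summed over all shared pivot phrases."""
    scores = defaultdict(float)
    for pivot, p_pivot in p_f_given_e.get(phrase, {}).items():
        for candidate, p_candidate in p_e_given_f.get(pivot, {}).items():
            if candidate != phrase:  # skip the identity "paraphrase"
                scores[candidate] += p_candidate * p_pivot
    return dict(scores)


if __name__ == "__main__":
    for candidate, score in sorted(paraphrase_probs("under control").items()):
        print(f"{candidate}: {score:.2f}")
```

In a real system the two tables would come from a phrase table learned on a bi-text; the toy dictionaries here only make the marginalization over pivots visible.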

    Analyzing the Linguistic Features of Standardized Math Items: A Text Mining Approach

    This five-chapter dissertation uses text mining techniques to better understand the language of mathematics items from standardized tests, with the aim of improving the linguistic equity of these items and supporting the assessment of English Language Learners (ELLs). Introduction: The dissertation begins with an overview of the problem that English Language Learners are likely not able to demonstrate their full mathematical ability due to the construct-irrelevant variance caused by these items being written in English. The introduction also presents text mining as a methodology for exploring this test design issue. Article 1: This article presents an exploratory study of the vocabulary used in released math test items for grades 3-8. The author collected and cleaned the data to arrive at a final corpus of 5674 math problems. Next, a series of text mining techniques was applied, including the "bag of words" approach, sentiment analysis, and Latent Dirichlet Allocation (LDA). The bag of words approach generated word lists for the entire corpus, by grade level, and by mathematical domain. For each of these lists, the majority of the words found were polysemous, meaning they had multiple meanings, which is inappropriate for ELLs. The sentiment analysis results showed no obvious negative sentiment in these items. Finally, the LDA results showed 9 latent topics within the language of these items. Article 2: This article is an exploratory study of the parts of speech used in released math standardized test items for grades 3-8. The author collected and cleaned the data to arrive at a corpus of 5674 math problems. Next, a series of part-of-speech analyses was performed to better understand the grammatical structures used within current mathematics items, along with a bigram and trigram analysis of the most commonly used phrases found within these items. 
The variation in parts of speech and readability of these items was tracked across grade levels, and the items were found to become more complicated as the grade level increased. The grammatical parts of speech were also used to predict item difficulty for a subset of items (N = 1627), with some of these parts of speech found to be negatively correlated with item difficulty estimates. Article 3: This article describes the development of an open-source text parser for multiple-choice mathematics items intended for students in grades 3-8. To train this parser, seven machine learning classification algorithms were initially used to predict item difficulty as measured by p-value. The most accurate of these models was a special kind of Support Vector Machine called a Support Vector Classifier, which had almost 50% accuracy. The parser was trained to estimate approximate item difficulty level as well as to identify problematic vocabulary words, estimate the readability of the question, and help the user see which problematic parts of speech are being used in the item. Math Item Parse is operational but is still in a prototype stage because a larger training set is needed to improve the model accuracy. Final Discussion: The dissertation concludes with a short discussion of how these findings affect educators, test developers, methodologists, and policy makers, of the biggest limitations of this dissertation, and of some next steps.
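
The bag-of-words and bigram/trigram counts mentioned in this abstract can be illustrated with a few lines of standard-library Python. This is a rough sketch under stated assumptions, not the dissertation's pipeline: the two sample items, the tokenizer regex, and the names `tokenize` and `ngram_counts` are invented here.

```python
import re
from collections import Counter

# Two made-up items standing in for the corpus of released test items.
items = [
    "What is the sum of 3 and 4?",
    "What is the product of 3 and 5?",
]


def tokenize(text):
    """Lowercase the text and keep alphabetic words and digit runs."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?|\d+", text.lower())


def ngram_counts(texts, n):
    """Count n-grams (as tuples of tokens) over all texts; n=1 is bag of words."""
    counts = Counter()
    for text in texts:
        tokens = tokenize(text)
        # zip over n shifted copies of the token list yields consecutive n-grams.
        counts.update(zip(*(tokens[i:] for i in range(n))))
    return counts


if __name__ == "__main__":
    print(ngram_counts(items, 1).most_common(5))
    print(ngram_counts(items, 2).most_common(5))
```

Counting per grade level or per mathematical domain, as the abstract describes, would amount to calling `ngram_counts` on each subset of items separately.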

    Robust handling of out-of-vocabulary words in deep language processing

    Doctoral thesis, Informatics (Computer Science), Universidade de Lisboa, Faculdade de Ciências, 2014. Deep grammars handle complex grammatical phenomena with precision and are able to provide a semantic representation of their input sentences in some logic form amenable to computational processing, making such grammars desirable for advanced Natural Language Processing tasks. The robustness of these grammars still has room for improvement. If any of the words in a sentence is not present in the lexicon of the grammar, i.e. if it is an out-of-vocabulary (OOV) word, a full parse of that sentence may not be produced. Given that the occurrence of such words is inevitable, e.g. due to the property of lexical novelty that is intrinsic to natural languages, deep grammars need some mechanism to handle OOV words if they are to be used in applications to analyze unrestricted text. The aim of this work is thus to investigate ways of improving the handling of OOV words in deep grammars. The lexicon of a deep grammar is highly thorough, with words being assigned extremely detailed linguistic information. Accurately assigning similarly detailed information to OOV words calls for the development of novel approaches, since current techniques mostly rely on shallow features and on a limited window of context, while there are many cases where the relevant information is to be found in wider linguistic structure and in long-distance relations. The solution proposed here consists of a classifier, SVM-TK, that is placed between the input to the grammar and the grammar itself. This classifier can take a variety of features and assign to words deep lexical types, which can then be used by the grammar when faced with OOV words. The classifier is based on support-vector machines which, through the use of kernels, allow the seamless use of features encoding linguistic structure in the classifier. 
This dissertation focuses on the HPSG framework, but the method can be used in any framework where the lexical information can be encoded as a word tag. As a case study, we take LX-Gram, a computational grammar for Portuguese, and improve its robustness with respect to OOV verbs. Given that the subcategorization frame of a word is a substantial part of what is encoded in an HPSG deep lexical type, the classifier takes graphs encoding grammatical dependencies as features. At runtime, these dependencies are produced by a probabilistic dependency parser. The SVM-TK classifier is compared against the state-of-the-art approaches for OOV handling, which consist of using a standard POS tagger to assign lexical types, in essence doing POS tagging with a highly granular tagset. Results show that SVM-TK is able to improve on the state of the art, although, because of the usual data-sparseness bottleneck, this improvement appears only when the amount of training data is large enough. Funding: Fundação para a Ciência e a Tecnologia (FCT, SFRH/BD/41465/2007).

    Energy Efficiency Models for Scientific Applications on Supercomputers


    Machine Learning for Holistic Evaluation of Scientific Essays

    In the US in particular, there is an increasing emphasis on the importance of science in education. To better understand a scientific topic, students need to compile information from multiple sources and determine the principal causal factors involved. We describe an approach for automatically inferring the quality and completeness of causal reasoning in essays on two separate scientific topics using a novel, two-phase machine learning approach for detecting causal relations. For each core essay concept, we initially trained a window-based tagging model to predict which individual words belonged to that concept. Using the predictions from this first set of models, we then trained a second stacked model on all the predicted word tags present in a sentence to predict inferences between essay concepts. The results indicate we could use such a system to provide explicit feedback to students to improve reasoning and essay writing skills.

    Natural Language Processing Resources for Finnish. Corpus Development in the General and Clinical Domains

    Transferred from Doria

    Contributions to the Theory of Finite-State Based Grammars

    This dissertation is a theoretical study of finite-state based grammars used in natural language processing. The study is concerned with certain varieties of finite-state intersection grammars (FSIG) whose parsers define regular relations between surface strings and annotated surface strings. The study focuses on the following three aspects of FSIGs. (i) Computational complexity of grammars under limiting parameters: In the study, the computational complexity in practical natural language processing is approached through performance-motivated parameters on structural complexity. Each parameter splits some grammars in the Chomsky hierarchy into an infinite set of subset approximations. When the approximations are regular, they seem to fall into the logarithmic-time hierarchy and the dot-depth hierarchy of star-free regular languages. This theoretical result is important and possibly relevant to grammar induction. (ii) Linguistically applicable structural representations: Related to the linguistically applicable representations of syntactic entities, the study contains new bracketing schemes that cope with dependency links, left- and right-branching, crossing dependencies and spurious ambiguity. New grammar representations that resemble the Chomsky-Schützenberger representation of context-free languages are presented in the study, and they include, in particular, representations for mildly context-sensitive non-projective dependency grammars whose performance-motivated approximations are linear-time parseable. (iii) Compilation and simplification of linguistic constraints: Efficient compilation methods for certain regular operations such as generalized restriction are presented. These include an elegant algorithm that has already been adopted as the approach in a proprietary finite-state tool. In addition to the compilation methods, an approach to on-the-fly simplifications of finite-state representations for parse forests is sketched. 
These findings are tightly coupled with each other under the theme of locality. I argue that the findings help us to develop better, linguistically oriented formalisms for finite-state parsing and to develop more efficient parsers for natural language processing. Keywords: syntactic parsing, finite-state automata, dependency grammar, first-order logic, linguistic performance, star-free regular approximations, mildly context-sensitive grammar
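
As general background to the term "finite-state intersection", and not as a reconstruction of anything in this dissertation, regular constraints can be combined with the standard product construction on deterministic automata. The sketch below assumes a hypothetical tuple encoding `(start, accepting, transitions)` and two toy constraints over the alphabet {a, b}; the names `intersect` and `accepts` are invented here.

```python
def intersect(dfa_a, dfa_b):
    """Product construction: accept exactly the strings both DFAs accept.

    Each DFA is (start, accepting_set, transitions), where transitions maps
    (state, symbol) -> state. Missing transitions mean rejection.
    """
    start_a, accepting_a, delta_a = dfa_a
    start_b, accepting_b, delta_b = dfa_b
    start = (start_a, start_b)
    delta, accepting = {}, set()
    seen, stack = {start}, [start]
    while stack:  # explore only the reachable pairs of states
        state_a, state_b = stack.pop()
        if state_a in accepting_a and state_b in accepting_b:
            accepting.add((state_a, state_b))
        for (source, symbol), target_a in delta_a.items():
            if source != state_a:
                continue
            target_b = delta_b.get((state_b, symbol))
            if target_b is None:
                continue
            target = (target_a, target_b)
            delta[((state_a, state_b), symbol)] = target
            if target not in seen:
                seen.add(target)
                stack.append(target)
    return start, accepting, delta


def accepts(dfa, symbols):
    """Run the DFA over a sequence of symbols."""
    state, accepting, delta = dfa
    for symbol in symbols:
        state = delta.get((state, symbol))
        if state is None:
            return False
    return state in accepting


# Toy constraint 1: an even number of 'a' symbols.
even_a = (0, {0}, {(0, "a"): 1, (1, "a"): 0, (0, "b"): 0, (1, "b"): 1})
# Toy constraint 2: the string ends in 'b'.
ends_b = (0, {1}, {(0, "a"): 0, (0, "b"): 1, (1, "a"): 0, (1, "b"): 1})

both = intersect(even_a, ends_b)

if __name__ == "__main__":
    for word in ["aab", "ab", "aa", "b"]:
        print(word, accepts(both, word))
```

Only reachable state pairs are built, which keeps the product small when the constraints rule each other out early.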