The Circle of Meaning: From Translation to Paraphrasing and Back
The preservation of meaning between inputs and outputs is perhaps
the most ambitious and, often, the most elusive goal of systems
that attempt to process natural language. Nowhere is this goal of
more obvious importance than for the tasks of machine translation
and paraphrase generation. Preserving meaning between the input and
the output is paramount for both, the monolingual vs bilingual distinction
notwithstanding. In this thesis, I present a novel, symbiotic relationship
between these two tasks that I term the "circle of meaning".
Today's statistical machine translation (SMT) systems require high
quality human translations for parameter tuning, in addition to
large bi-texts for learning the translation units. This parameter
tuning usually involves generating translations at different points
in the parameter space and obtaining feedback against human-authored
reference translations as to how good the translations are. This feedback
then dictates which point in the parameter space should be explored
next. To measure this feedback, it is generally considered wise to have
multiple (usually 4) reference translations, to avoid unfairly penalizing translation
hypotheses, which can easily happen given the large number of ways in which
a sentence can be translated from one language to another. However, this reliance on multiple reference translations
creates a problem: they are labor-intensive and expensive to obtain.
Therefore, most current MT datasets only contain a single reference.
This leads to the problem of reference sparsity---the primary open problem
that I address in this dissertation---one that has a serious effect on the
SMT parameter tuning process.
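The clipping logic behind this multi-reference feedback can be sketched in a few lines. The following is a minimal illustration in the spirit of BLEU's modified n-gram precision, not the scoring code of any particular SMT toolkit; the function names and example sentences are invented.

```python
from collections import Counter

def ngrams(tokens, n):
    # Multiset of the n-grams in a token sequence.
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def modified_precision(hyp, refs, n):
    # Clip each hypothesis n-gram count by its maximum count in any reference.
    # With more references, a legitimate rewording of the reference is less
    # likely to be unfairly penalized.
    hyp_counts = ngrams(hyp, n)
    max_ref = Counter()
    for ref in refs:
        for ng, c in ngrams(ref, n).items():
            max_ref[ng] = max(max_ref[ng], c)
    clipped = sum(min(c, max_ref[ng]) for ng, c in hyp_counts.items())
    total = sum(hyp_counts.values())
    return clipped / total if total else 0.0

hyp = "the cat sat on the mat".split()
refs = [
    "the cat is on the mat".split(),
    "there is a cat on the mat".split(),
]
p1 = modified_precision(hyp, refs, 1)  # 5 of the 6 hypothesis unigrams are supported
```

A hypothesis unigram like "sat" that appears in no reference is clipped to zero; adding a reference that contains it would raise the score, which is exactly why a single reference under-rewards valid translations.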
Bannard and Callison-Burch (2005) were the first to provide a practical
connection between phrase-based statistical machine translation and paraphrase
generation. However, their technique is restricted to generating phrasal
paraphrases. I build upon their approach and augment a phrasal paraphrase
extractor into a sentential paraphraser with extremely broad coverage.
The novelty in this augmentation lies in the further strengthening of
the connection between statistical machine translation and paraphrase
generation; whereas Bannard and Callison-Burch only relied on SMT machinery
to extract phrasal paraphrase rules and stopped there, I take it a few
steps further and build a full English-to-English SMT system. This system
can, as expected, "translate" any English input sentence into a new English
sentence with the same degree of meaning preservation that exists in a bilingual
SMT system. In fact, being a state-of-the-art SMT system, it is able to generate
n-best "translations" for any given input sentence. This sentential
paraphraser, built almost entirely from existing SMT machinery, represents
the first 180 degrees of the circle of meaning.
To complete the circle, I describe a novel connection in the other direction.
I claim that the sentential paraphraser, once built in this fashion, can
provide a solution to the reference sparsity problem and, hence, be used
to improve the performance of a bilingual SMT system. I discuss two different
instantiations of the sentential paraphraser and show several results that
provide empirical validation for this connection.
Analyzing the Linguistic Features of Standardized Math Items: A Text Mining Approach
This five-chapter dissertation uses text mining techniques to better understand the language of mathematics items from standardized tests, with the goal of improving the linguistic equity of these items and supporting the assessment of English Language Learners (ELLs).
Introduction: The dissertation begins with an overview of the problem that English Language Learners are likely unable to demonstrate their full mathematical ability due to the construct-irrelevant variance caused by these items being written in English. It also introduces text mining as a methodology for exploring this test design issue.
Article 1: This article presents an exploratory study of the vocabulary used in released math test items for grades 3-8. The author collected and cleaned the data to arrive at a final corpus of 5674 math problems. Next, a series of text mining techniques were applied, including the "bag of words" approach, sentiment analysis, and Latent Dirichlet Allocation (LDA). The bag of words approach generated word lists for the entire corpus, by grade level, and by mathematical domain. In each of these lists, the majority of the words were polysemous, i.e. they had multiple meanings, which poses difficulties for ELLs. The sentiment analysis results showed no obvious negative sentiment in these items. Finally, the LDA results revealed 9 latent topics within the language of these items.
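The "bag of words" step described above can be sketched with the standard library alone. This is a hypothetical illustration: the tiny corpus, the field names, and the tokenization rule are invented, and the same grouping works for mathematical domain as for grade level.

```python
from collections import Counter

# Invented miniature corpus standing in for the 5674-item dataset.
items = [
    {"grade": 3, "text": "What is the product of 4 and 6?"},
    {"grade": 3, "text": "Find the sum of 12 and 9."},
    {"grade": 8, "text": "Solve for x in the equation 2x + 3 = 11."},
]

def bag_of_words(texts):
    # Lowercase, strip trailing punctuation, and count word frequencies.
    words = []
    for t in texts:
        words.extend(w.strip(".?,").lower() for w in t.split())
    return Counter(words)

overall = bag_of_words(item["text"] for item in items)
by_grade = {
    g: bag_of_words(i["text"] for i in items if i["grade"] == g)
    for g in {i["grade"] for i in items}
}
```

From lists like `overall.most_common()`, one can then inspect the highest-frequency words for polysemy (e.g. "product", which has both an everyday and a mathematical meaning).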
Article 2: This article is an exploratory study of the parts of speech used in released standardized math test items for grades 3-8. The author collected and cleaned the data to arrive at a corpus of 5674 math problems. Next, a series of parts-of-speech analyses were performed to better understand the grammatical structures used within current mathematics items, along with a bigram and trigram analysis of the most commonly used phrases found within these items. The variation in parts of speech and readability of these items was tracked across grade levels and was found to grow more complex as the grade level increased. The grammatical parts of speech were also used to predict item difficulty for those items with difficulty estimates available (N = 1627), with some parts of speech found to correlate negatively with item difficulty.
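The bigram and trigram analysis mentioned above amounts to counting adjacent word pairs and triples. A minimal, dependency-free sketch (the sample sentence is invented; the parts-of-speech tagging itself would need a tagger such as NLTK's and is not shown):

```python
from collections import Counter

def top_ngrams(tokens, n, k=3):
    # Count all adjacent n-word sequences and return the k most common.
    grams = Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return grams.most_common(k)

tokens = "how many more apples does she have than how many pears".lower().split()
bigrams = top_ngrams(tokens, 2)   # ("how", "many") occurs twice
trigrams = top_ngrams(tokens, 3)
```

Phrases like "how many" surfacing at the top of such lists is what lets this kind of analysis identify the stock question frames that recur across grade levels.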
Article 3: This article describes the development of an open-source text parser for multiple-choice mathematics items intended for students in grades 3-8. To train this parser, seven machine learning classification algorithms were initially used to predict item difficulty as measured by p-value. The most accurate of these models was a Support Vector Classifier, a variant of the Support Vector Machine, which achieved almost 50% accuracy. The parser was trained to estimate approximate item difficulty level, identify problematic vocabulary words, estimate the readability of the question, and help the user identify which problematic parts of speech appear in the item. Math Item Parse is operational but still at a prototype stage, because a larger training set is needed to improve model accuracy.
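Framing difficulty prediction as classification can be sketched as follows. The dissertation's best model was a Support Vector Classifier; to keep this sketch dependency-free, a simple perceptron stands in for the SVC, and the two features (word count, count of flagged vocabulary words) and the toy data are invented.

```python
def train_perceptron(data, epochs=20):
    # data: list of (feature_vector, label) with label in {-1, +1}
    # (-1 = easy item, +1 = hard item).
    dim = len(data[0][0])
    w = [0.0] * (dim + 1)  # last entry is the bias term
    for _ in range(epochs):
        for x, y in data:
            score = sum(wi * xi for wi, xi in zip(w, x)) + w[-1]
            if y * score <= 0:  # misclassified: nudge weights toward y
                for i, xi in enumerate(x):
                    w[i] += y * xi
                w[-1] += y
    return w

def predict(w, x):
    score = sum(wi * xi for wi, xi in zip(w, x)) + w[-1]
    return 1 if score > 0 else -1

# Invented items: (word count / 10, flagged-vocabulary count), difficulty label.
data = [((1, 0), -1), ((2, 1), -1), ((8, 7), 1), ((9, 8), 1)]
w = train_perceptron(data)
```

An SVC differs by maximizing the margin (and, with kernels, handling non-linear boundaries), but the input/output contract sketched here, feature vectors in and a difficulty class out, is the same.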
Final Discussion: The dissertation concludes with a short discussion of how these findings affect educators, test developers, methodologists, and policy makers, notes the main limitations of the work, and offers some next steps.
Robust handling of out-of-vocabulary words in deep language processing
Doctoral thesis, Informatics (Computer Science), Universidade de Lisboa, Faculdade de Ciências, 2014.
Deep grammars handle complex grammatical phenomena with precision and are able to provide a semantic representation of their input sentences in some logical form amenable to computational processing, making such grammars desirable for advanced Natural Language Processing tasks. The robustness of these grammars still has room for improvement. If any of the words in a sentence is not present in the lexicon of the grammar, i.e. if it is an out-of-vocabulary (OOV) word, a full parse of that sentence may not be produced. Given that the occurrence of such words is inevitable, e.g. due to the lexical novelty that is intrinsic to natural languages, deep grammars need some mechanism for handling OOV words if they are to be used to analyze unrestricted text. The aim of this work is thus to investigate ways of improving the handling of OOV words in deep grammars. The lexicon of a deep grammar is highly thorough, with words being assigned extremely detailed linguistic information. Accurately assigning similarly detailed information to OOV words calls for the development of novel approaches, since current techniques mostly rely on shallow features and on a limited window of context, while there are many cases where the relevant information is to be found in wider linguistic structure and in long-distance relations. The solution proposed here consists of a classifier, SVM-TK, that is placed between the input to the grammar and the grammar itself. This classifier can take a variety of features and assigns deep lexical types to words, which the grammar can then use when faced with OOV words. The classifier is based on support-vector machines which, through the use of kernels, allow the seamless use of features encoding linguistic structure.
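The pipeline described above can be sketched as a lexicon lookup with a classifier fallback. Everything here is an invented stand-in: the tiny lexicon, the type names, and the trivial heuristic playing the role of the SVM-TK classifier, which in the thesis predicts deep lexical types from dependency-structure features.

```python
# Hand-coded deep lexical types for in-vocabulary words (invented examples).
LEXICON = {
    "dormir": "v-intransitive",
    "dar": "v-ditransitive",
}

def predict_type(word, context):
    # Stand-in for SVM-TK: a toy surface-form heuristic. The real classifier
    # uses tree kernels over grammatical-dependency graphs.
    return "v-transitive" if word.endswith("ar") else "v-intransitive"

def lexical_type(word, context=()):
    if word in LEXICON:
        return LEXICON[word]            # in-vocabulary: trust the lexicon
    return predict_type(word, context)  # OOV: fall back to the classifier
```

The key design point is that the grammar itself is untouched: the classifier sits in front of it and supplies a deep lexical type only when the lexicon has none, so parsing can proceed instead of failing.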
This dissertation focuses on the HPSG framework, but the method can be used in any framework where the lexical information can be encoded as a word tag. As a case study, we take LX-Gram, a computational grammar for Portuguese, and improve its robustness with respect to OOV verbs. Given that the subcategorization frame of a word is a substantial part of what is encoded in an HPSG deep lexical type, the classifier takes graphs encoding grammatical dependencies as features. At runtime, these dependencies are produced by a probabilistic dependency parser. The SVM-TK classifier is compared against the state-of-the-art approaches to OOV handling, which consist of using a standard POS-tagger to assign lexical types, in essence doing POS-tagging with a highly granular tagset. Results show that SVM-TK is able to improve on the state of the art, though the usual data-sparseness bottleneck means that this improvement emerges only when the amount of training data is large enough.
Fundação para a Ciência e a Tecnologia (FCT, SFRH/BD/41465/2007).
Essays in Financial Economics: Announcement Effects in Fixed Income Markets
ABSTRACT
Ph.D. in Finance, May 2018
James J. Forest
B.A., Framingham State University
M.S., Northeastern University
Ph.D., University of Massachusetts Amherst
Directed by: Professor Hossein B. Kazemi
This dissertation demonstrates the use of empirical techniques for dealing with modeling issues that arise when analyzing announcement effects in fixed income markets. It describes empirical challenges in achieving unbiased and efficient parameter estimates and shows the importance of modeling a wide range of macroeconomic announcement effects to avoid omitted variable bias. By employing techniques common in macroeconomics, financial market researchers are better able to provide meaningful results.
In “The Effect of Macroeconomic Announcements on Credit Markets: An Autometric General-to-Specific Analysis of the Greenspan Era,” I show that a congruent, parsimonious, encompassing model discovered using David Hendry’s econometric modelling approach overcomes the many inadequacies of the typical static models of US Treasury returns. The typical specification tends to fail most specification tests. Results suggest a place for general-to-specific modelling in financial economics, a place where it has only recently been employed.
In “A High-Frequency Analysis of Trading Activity in the Corporate Bond Market: Macro Announcements or Seasonality?”, we explore the factors that drive trading activity in the US corporate bond market. Our main findings are that the thinly-traded market for corporate bonds is less affected by surprises in individual economic reports and that the market is dominated by day-of-week and time-of-day effects. We find that, unlike daily returns on the S&P 500, corporate bonds are sensitive to surprises in both labor market and inflation data. Trading activity is affected by absolute surprises in core CPI and nonfarm payrolls, but neither core PPI nor jobless claims affect order flow. Perhaps most interesting, however, is the presence of “behavioral seasonal” effects associated with the onset and incidence of seasonal affective disorder. This “winter blues” effect has previously been shown to affect activity in equity markets by Kamstra, Kramer, and Levi (American Economic Review, 2000, 2003).
In “The Effect of Treasury Auction Results on Interest Rates: The 1990s Experience,” I examine the response of U.S. Treasury returns to auction announcements. Rate changes differ significantly on auction days for one-year bills. Surprises in the release of bid-to-cover ratios and noncompetitive bidding significantly affect 30-year Treasury returns. Other maturities, however, are relatively unaffected. The results complement the study by Lou, Yan and Zhang (2013) and show the benefits of controlling for macroeconomic announcements when analyzing market responses to auctions.
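The basic announcement-effect regression underlying studies of this kind can be sketched as an ordinary least squares fit of announcement-day returns on the standardized surprise (actual release minus consensus forecast, scaled by its standard deviation). This is an invented toy example, not data or results from the dissertation.

```python
def ols(x, y):
    # Simple one-regressor OLS: return (intercept, slope).
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    beta = sxy / sxx
    alpha = my - beta * mx
    return alpha, beta

# Invented standardized nonfarm-payrolls surprises and same-day returns (%).
surprise = [1.2, -0.5, 0.3, -1.0, 0.8]
returns = [-0.30, 0.10, -0.05, 0.25, -0.20]
alpha, beta = ols(surprise, returns)  # beta < 0: stronger-than-expected data, prices fall
```

The omitted-variable-bias point made above corresponds to adding regressors for the other concurrent announcements (and, in the second essay, day-of-week and time-of-day dummies) rather than fitting each surprise in isolation.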
Machine Learning for Holistic Evaluation of Scientific Essays
Abstract. In the US in particular, there is an increasing emphasis on the importance of science in education. To better understand a scientific topic, students need to compile information from multiple sources and determine the principal causal factors involved. We describe an approach for automatically inferring the quality and completeness of causal reasoning in essays on two separate scientific topics using a novel, two-phase machine learning approach for detecting causal relations. For each core essay concept, we initially trained a window-based tagging model to predict which individual words belonged to that concept. Using the predictions from this first set of models, we then trained a second stacked model on all the predicted word tags present in a sentence to predict inferences between essay concepts. The results indicate we could use such a system to provide explicit feedback to students to improve reasoning and essay writing skills.
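The two-phase setup described above can be sketched as follows. Phase 1 tags each word with an essay concept; phase 2 consumes the phase-1 tag sequence for a sentence and decides whether it expresses a causal link between concepts. The concept names, keyword lists, and rules are invented stand-ins for the trained tagging and stacked models.

```python
# Invented concept vocabularies (stand-in for the window-based tagging models).
CONCEPT_WORDS = {
    "CO2": {"carbon", "co2", "emissions"},
    "WARMING": {"warming", "temperature", "heat"},
}
CAUSAL_CUES = {"causes", "leads", "increases", "raises"}

def tag_words(tokens):
    # Phase 1: assign a concept tag to each word ("O" = no concept).
    tags = []
    for tok in tokens:
        tag = "O"
        for concept, words in CONCEPT_WORDS.items():
            if tok.lower() in words:
                tag = concept
        tags.append(tag)
    return tags

def infer_relation(tokens):
    # Phase 2: a stacked decision over all phase-1 tags in the sentence.
    tags = tag_words(tokens)
    concepts = [t for t in tags if t != "O"]
    has_cue = any(t.lower() in CAUSAL_CUES for t in tokens)
    if len(set(concepts)) >= 2 and has_cue:
        return (concepts[0], "causes", concepts[-1])
    return None
```

The stacking is the essential idea: the second model never sees raw words, only the concept tags the first phase predicted, which is what lets it generalize over different wordings of the same causal claim.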
Natural Language Processing Resources for Finnish. Corpus Development in the General and Clinical Domains
Contributions to the Theory of Finite-State Based Grammars
This dissertation is a theoretical study of finite-state based grammars used in natural language processing. The study is concerned with certain varieties of finite-state intersection grammars (FSIG) whose parsers define regular relations between surface strings and annotated surface strings. The study focuses on the following three aspects of FSIGs:
(i) Computational complexity of grammars under limiting parameters. In the study, the computational complexity of practical natural language processing is approached through performance-motivated parameters on structural complexity. Each parameter splits some grammars in the Chomsky hierarchy into an infinite set of subset approximations. When the approximations are regular, they seem to fall into the logarithmic-time hierarchy and the dot-depth hierarchy of star-free regular languages. This theoretical result is important and possibly relevant to grammar induction.
(ii) Linguistically applicable structural representations. Concerning linguistically applicable representations of syntactic entities, the study contains new bracketing schemes that cope with dependency links, left- and right-branching, crossing dependencies and spurious ambiguity. New grammar representations that resemble the Chomsky-Schützenberger representation of context-free languages are presented in the study, and they include, in particular, representations for mildly context-sensitive non-projective dependency grammars whose performance-motivated approximations are parseable in linear time.
(iii) Compilation and simplification of linguistic constraints. Efficient compilation methods for certain regular operations, such as generalized restriction, are presented. These include an elegant algorithm that has already been adopted in a proprietary finite-state tool. In addition to the compilation methods, an approach to on-the-fly simplification of finite-state representations of parse forests is sketched.
These findings are tightly coupled with each other under the theme of locality. I argue that the findings help us to develop better, linguistically oriented formalisms for finite-state parsing and to develop more efficient parsers for natural language processing.
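The core operation behind finite-state intersection grammars, accepting a string only when every constraint automaton accepts it, is the classical product construction. A minimal sketch with DFAs encoded as plain dictionaries (the encoding and the two toy constraints are invented for illustration):

```python
def accepts(dfa, s):
    # Run the DFA over the string; reject on a missing transition.
    state = dfa["start"]
    for ch in s:
        state = dfa["trans"].get((state, ch))
        if state is None:
            return False
    return state in dfa["accept"]

def intersect(a, b):
    # Product construction: run both automata in lockstep; a product state
    # accepts only when both components accept.
    trans = {}
    for (sa, ch), ta in a["trans"].items():
        for (sb, ch2), tb in b["trans"].items():
            if ch == ch2:
                trans[((sa, sb), ch)] = (ta, tb)
    return {
        "start": (a["start"], b["start"]),
        "accept": {(x, y) for x in a["accept"] for y in b["accept"]},
        "trans": trans,
    }

# Constraint 1: an even number of a's.  Constraint 2: the string ends in b.
even_a = {"start": 0, "accept": {0},
          "trans": {(0, "a"): 1, (1, "a"): 0, (0, "b"): 0, (1, "b"): 1}}
ends_b = {"start": 0, "accept": {1},
          "trans": {(0, "a"): 0, (1, "a"): 0, (0, "b"): 1, (1, "b"): 1}}
both = intersect(even_a, ends_b)
```

In an FSIG setting the constraints are linguistic rules over annotated surface strings rather than toy patterns, and the state blow-up of repeated products is precisely why the compilation and on-the-fly simplification methods discussed above matter.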
Keywords: syntactic parsing, finite-state automata, dependency grammar, first-order logic, linguistic performance, star-free regular approximations, mildly context-sensitive grammar