28 research outputs found
Substring-based Machine Translation
Abstract Machine translation is traditionally formulated as the transduction of strings of words from the source to the target language. As a result, additional lexical processing steps such as morphological analysis, transliteration, and tokenization are required to process the internal structure of words to help cope with data-sparsity issues that occur when simply dividing words according to white spaces. In this paper, we take a different approach: not dividing lexical processing and translation into two steps, but simply viewing translation as a single transduction between character strings in the source and target languages. In particular, we demonstrate that the key to achieving accuracies on a par with word-based translation in the character-based framework is the use of a many-to-many alignment strategy that can accurately capture correspondences between arbitrary substrings. We build on the alignment method proposed in Neubig et al (2011), improving its efficiency and accuracy with a focus on character-based translation. Using a many-to-many aligner imbued with these improvements, we demonstrate that the traditional framework of phrase-based machine translation sees large gains in accuracy over character-based translation with more naive alignment methods, and achieves comparable results to word-based translation for two distant language pairs
Unsupervised multilingual learning
Thesis (Ph. D.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2010.Cataloged from PDF version of thesis.Includes bibliographical references (p. 241-254).For centuries, scholars have explored the deep links among human languages. In this thesis, we present a class of probabilistic models that exploit these links as a form of naturally occurring supervision. These models allow us to substantially improve performance for core text processing tasks, such as morphological segmentation, part-of-speech tagging, and syntactic parsing. Besides these traditional NLP tasks, we also present a multilingual model for lost language deciphersment. We test this model on the ancient Ugaritic language. Our results show that we can automatically uncover much of the historical relationship between Ugaritic and Biblical Hebrew, a known related language.by Benjamin Snyder.Ph.D
Unsupervised Structure Induction for Natural Language Processing
Ph.DDOCTOR OF PHILOSOPH
Bimorphisms and synchronous grammars
We tend to think of the study of language as proceeding by characterizing the strings and structures of a language, and we think of natural language processing as using those structures to build systems of utility in manipulating the language. But many language-related problems are more fruitfully viewed as requiring the specification of a relation between two languages, rather than the specification of a single language. We provide a synthesis and extension of work that unifies two approaches to such language relations: the automata-theoretic approach based on tree transducers that transform trees to their counterparts in the relation, and the grammatical approach based on synchronous grammars that derive pairs of trees in the relation. In particular, we characterize synchronous tree-substitution grammars and synchronous tree-adjoining grammars in terms of bimorphisms, which have previously been used to characterize tree transducers. In the process, we provide new approaches to formalizing the various concepts: a metanotation for describing varieties of tree automata and transducers in equational terms; a rigorous formalization of tree-adjoining and tree-substitution grammars and their synchronous counterparts, using trees over ranked alphabets; and generalizations of tree-adjoining grammar allowing multiple adjunction.Engineering and Applied Science
Multilingual Part-of-Speech Tagging: Two Unsupervised Approaches
We demonstrate the effectiveness of multilingual learning for unsupervised
part-of-speech tagging. The central assumption of our work is that by combining
cues from multiple languages, the structure of each becomes more apparent. We
consider two ways of applying this intuition to the problem of unsupervised
part-of-speech tagging: a model that directly merges tag structures for a pair
of languages into a single sequence and a second model which instead
incorporates multilingual context using latent variables. Both approaches are
formulated as hierarchical Bayesian models, using Markov Chain Monte Carlo
sampling techniques for inference. Our results demonstrate that by
incorporating multilingual evidence we can achieve impressive performance gains
across a range of scenarios. We also found that performance improves steadily
as the number of available languages increases
In Language and Information Technologies
With the rising amount of available multilingual text data, computational linguistics faces an opportunity and a challenge. This text can enrich the domains of NLP applications and improve their performance. Traditional supervised learning for this kind of data would require annotation of part of this text for induction of natural language structure. For these large amounts of rich text, such an annotation task can be daunting and expensive. Unsupervised learning of natural language structure can compensate for the need for such annotation. Natural language structure can be modeled through the use of probabilistic grammars, generative statistical models which are useful for compositional and sequential structures. Probabilistic grammars are widely used in natural language processing, but they are also used in other fields as well, such as computer vision, computational biology and cognitive science. This dissertation focuses on presenting a theoretical and an empirical analysis for the learning of these widely used grammars in the unsupervised setting. We analyze computational properties involved in estimation of probabilisti