A Formal Model of Ambiguity and its Applications in Machine Translation
Systems that process natural language must cope with and resolve ambiguity. In this dissertation, a model of language processing is advocated in which multiple inputs and multiple analyses of inputs are considered concurrently and a single analysis is only a last resort. Compared to conventional models, this approach can be understood as replacing single-element inputs and outputs with weighted sets of inputs and outputs. Although processing components must deal with sets (rather than individual elements), constraints are imposed on the elements of these sets, and the representations from existing models may be reused. However, to deal efficiently with large (or infinite) sets, compact representations of sets that share structure between elements, such as weighted finite-state transducers and synchronous context-free grammars, are necessary. These representations and algorithms for manipulating them are discussed in depth.
To establish the effectiveness and tractability of the proposed processing model, it is applied to several problems in machine translation. Starting with spoken language translation, it is shown that translating a set of transcription hypotheses yields better translations than a baseline in which a single (1-best) transcription hypothesis is selected and then translated, independent of the translation model formalism used. More subtle forms of ambiguity that arise even in text-only translation (such as decisions conventionally made during system development about how to preprocess text) are then discussed, and it is shown that the ambiguity-preserving paradigm can be employed in these cases as well, again leading to improved translation quality. Finally, a model for supervised learning is introduced that learns from training data in which a set (rather than a single element) of correct labels is provided for each training instance; it is used to learn a model of compound word segmentation, which serves as a preprocessing step in machine translation.
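The set-based processing model can be pictured in miniature: rather than committing to a 1-best transcription and translating it, carry a weighted set of hypotheses through the pipeline. The following toy Python sketch is our illustration only, with invented hypothesis strings, weights, and lexicon; the dissertation itself works with compact shared-structure representations such as weighted finite-state transducers, not explicit dictionaries.

```python
def combine(inputs, transform):
    """Apply a weighted one-to-many transform to a weighted set of inputs.

    inputs: dict mapping candidate -> weight
    transform: function from candidate -> dict of output -> weight
    Returns the weighted set of outputs, summing weight over all paths.
    """
    outputs = {}
    for candidate, w in inputs.items():
        for out, w2 in transform(candidate).items():
            outputs[out] = outputs.get(out, 0.0) + w * w2
    return outputs

# Toy example: two transcription hypotheses, each with candidate translations.
hypotheses = {"their": 0.6, "there": 0.4}
lexicon = {
    "their": {"leur": 0.7, "leurs": 0.3},
    "there": {"la": 0.9, "y": 0.1},
}
translations = combine(hypotheses, lambda h: lexicon[h])
# Weights over translations now reflect the upstream transcription ambiguity,
# e.g. "leur" receives 0.6 * 0.7 = 0.42.
```

For large or infinite sets this explicit enumeration is intractable, which is exactly why the dissertation turns to representations that share structure between set elements.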
Rapid Resource Transfer for Multilingual Natural Language Processing
Until recently the focus of the Natural Language Processing (NLP)
community has been on a handful of mostly European languages. However, the
rapid changes taking place in the economic and political climate of the
world precipitate a similar change to the relative importance given to
various languages. The importance of rapidly acquiring NLP resources and
computational capabilities in new languages is widely accepted.
Statistical NLP models have a distinct advantage over rule-based methods
in achieving this goal since they require far less manual labor. However,
statistical methods require two fundamental resources for training: (1)
online corpora and (2) manual annotations. Creating these two resources can be
as difficult as porting rule-based methods.
This thesis demonstrates the feasibility of acquiring both corpora and
annotations by exploiting existing resources for well-studied languages.
Basic resources for new languages can be acquired in a rapid and
cost-effective manner by utilizing existing resources cross-lingually.
Currently, the most viable method of obtaining online corpora is
converting existing printed text into electronic form using Optical
Character Recognition (OCR). Unfortunately, a language that lacks online
corpora most likely lacks OCR as well. We tackle this problem by taking an
existing OCR system that was designed for a specific language and using
that OCR system for a language with a similar script. We present a
generative OCR model that allows us to post-process output from a
non-native OCR system to achieve accuracy close to, or better than, a
native one. Furthermore, we show that the performance of a native or
trained OCR system can be improved by the same method.
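One way to picture the post-processing step is as noisy-channel correction. The sketch below is a simplification we supply for illustration, not the thesis's generative OCR model: it rescores output from a non-native OCR system against a target-language lexicon, with plain edit distance standing in for the channel model (the lexicon and the misread token are invented).

```python
def edit_distance(a, b):
    # Classic Levenshtein distance, computed with a single rolling row.
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            # min of deletion, insertion, and (mis)match costs
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1,
                                     prev + (ca != cb))
    return dp[len(b)]

def correct(ocr_token, lexicon):
    """Pick the lexicon word closest to the raw OCR output."""
    return min(lexicon, key=lambda w: edit_distance(ocr_token, w))

# Hypothetical misread from an OCR system applied to a similar script.
lexicon = ["recognition", "generation", "projection"]
corrected = correct("recogn1tion", lexicon)
```

A real post-processor would weight character confusions by how the two scripts differ, rather than treating all edits uniformly.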
Next, we demonstrate cross-utilization of annotations on treebanks. We
present an algorithm that projects dependency trees across parallel
corpora. We also show that a reasonable quality treebank can be generated
by combining projection with a small amount of language-specific
post-processing. The projected treebank allows us to train a parser that
performs comparably to a parser trained on manually generated data.
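The core projection step can be illustrated with a deliberately simplified sketch (ours, not the thesis's algorithm, which also handles non-1-to-1 links and applies language-specific post-processing): a head-dependent edge is carried across the alignment whenever both words have 1-to-1 alignment links.

```python
def project_dependencies(edges, alignment):
    """Project dependency edges across a word alignment.

    edges: (head, dependent) index pairs in the source sentence.
    alignment: dict mapping source index -> target index (1-to-1 links only).
    Returns the projected (head, dependent) pairs on the target side.
    """
    projected = []
    for head, dep in edges:
        # Only project an edge when both endpoints are aligned.
        if head in alignment and dep in alignment:
            projected.append((alignment[head], alignment[dep]))
    return projected

# Toy example: "the dog barks" -> "le chien aboie" (invented sentence pair)
source_edges = [(2, 1), (1, 0)]   # barks -> dog, dog -> the
alignment = {0: 0, 1: 1, 2: 2}
projected = project_dependencies(source_edges, alignment)
# → [(2, 1), (1, 0)] on the target side
```

Unaligned or many-to-one words are exactly where the projected treebank needs the small amount of post-processing the abstract mentions.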
On looking into words (and beyond): Structures, Relations, Analyses
On Looking into Words is a wide-ranging volume spanning current research into word structure and morphology, with a focus on historical linguistics and linguistic theory. The papers are offered as a tribute to Stephen R. Anderson, the Dorothy R. Diebold Professor of Linguistics at Yale, who is retiring at the end of the 2016-2017 academic year. The contributors are friends, colleagues, and former students of Professor Anderson, all important contributors to linguistics in their own right. As is typical for such volumes, the contributions span a variety of topics relating to the interests of the honorand. In this case, the central contributions that Anderson has made to so many areas of linguistics and cognitive science, drawing on synchronic and diachronic phenomena in diverse linguistic systems, are represented through the papers in the volume.
The 26 papers that constitute this volume are unified by their discussion of the interplay between synchrony and diachrony, theory and empirical results, and the role of diachronic evidence in understanding the nature of language. Central concerns of the volume include morphological gaps, learnability, increases and declines in productivity, and the interaction of different components of the grammar. The papers deal with a range of linked synchronic and diachronic topics in phonology, morphology, and syntax (in particular, cliticization), and their implications for linguistic theory.
A Formal Characterization of Parsing Word Alignments by Synchronous Grammars with Empirical Evidence to the ITG Hypothesis
Deciding whether a synchronous grammar formalism generates a given word alignment (the alignment coverage problem) depends on finding an adequate instance grammar and then using it to parse the word alignment. But what does it mean to parse a word alignment by a synchronous grammar? This is formally undefined until we define an unambiguous mapping between grammatical derivations and word-level alignments. This paper proposes an initial, formal characterization of alignment coverage as intersecting two partially ordered sets (graphs) of translation equivalence units, one derived from a grammar instance and the other defined by the word alignment. As a first sanity check, we report extensive coverage results for ITG on automatic and manual alignments. Even for the ITG formalism, our formal characterization makes explicit many algorithmic choices often left underspecified in earlier work.
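To make the notion of translation equivalence units concrete, here is a small illustrative sketch (ours, not the paper's formal construction): enumerate contiguous source spans whose aligned target words form a span that no outside alignment link reaches, the usual consistency condition on phrase pairs.

```python
def equivalence_units(links, src_len):
    """Enumerate contiguous translation equivalence units of a word alignment.

    links: iterable of (source_index, target_index) alignment links.
    src_len: number of source words.
    Returns ((src_lo, src_hi), (tgt_lo, tgt_hi)) span pairs such that every
    link touching either span stays inside both spans.
    """
    units = []
    for i in range(src_len):
        for j in range(i, src_len):
            tgt = [t for s, t in links if i <= s <= j]
            if not tgt:
                continue
            lo, hi = min(tgt), max(tgt)
            # Consistency: no link enters the target span from outside [i, j].
            if all(i <= s <= j for s, t in links if lo <= t <= hi):
                units.append(((i, j), (lo, hi)))
    return units

# Toy alignment for a 2-word/2-word sentence pair with a crossing link.
links = [(0, 1), (1, 0)]
units = equivalence_units(links, 2)
# Yields the two single-word units and the full-sentence unit.
```

The alignment coverage question is then whether a grammar instance can derive a compatible set of such units, which is where the partial-order intersection of the paper comes in.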
Speech Recognition
Chapters in the first part of the book cover all the essential speech processing techniques for building robust, automatic speech recognition systems: the representation of speech signals and the methods for speech-feature extraction, acoustic and language modeling, efficient algorithms for searching the hypothesis space, and multimodal approaches to speech recognition. The last part of the book is devoted to other speech processing applications that can use the information from automatic speech recognition for speaker identification and tracking, for prosody modeling in emotion-detection systems, and in applications that operate in real-world environments, like mobile communication services and smart homes.
An evaluation of the challenges of Multilingualism in Data Warehouse development
In this paper we discuss Business Intelligence and define what is meant by support for multilingualism in a Business Intelligence reporting context. We identify support for multilingualism as a challenging issue with implications for data warehouse design and reporting performance. Data warehouses are a core component of most Business Intelligence systems, and the star schema is the approach most widely used to develop data warehouses and dimensional Data Marts. We discuss the ways in which multilingualism can be supported in the star schema and identify that current approaches have serious limitations, including data redundancy as well as data-manipulation, performance, and maintenance issues. We propose a new approach to enable the optimal application of multilingualism in Business Intelligence. The proposed approach produced satisfactory results when used in a proof-of-concept environment. Future work will include testing the approach in an enterprise environment.