ANNIS: a linguistic database for exploring information structure
In this paper, we discuss the design and implementation of the first version of our database "ANNIS" (ANNotation of Information Structure). For research based on empirical data, ANNIS provides a uniform environment for storing this data together with its linguistic annotations. A central database promotes standardized annotation, which facilitates interpretation and comparison of the data. ANNIS is used through a standard web browser and offers tier-based visualization of data and annotations, as well as search facilities that allow for cross-level and cross-sentential queries. The paper motivates the design of the system, characterizes its user interface, and provides an initial technical evaluation of ANNIS with respect to data size and query processing.
Latent Tree Language Model
In this paper we introduce the Latent Tree Language Model (LTLM), a novel
approach to language modeling that encodes syntax and semantics of a given
sentence as a tree of word roles.
The learning phase iteratively updates the trees by moving nodes according to
Gibbs sampling. We introduce two algorithms to infer a tree for a given
sentence. The first one is based on Gibbs sampling. It is fast, but is not
guaranteed to find the most probable tree. The second one is based on dynamic
programming. It is slower, but is guaranteed to find the most probable tree. We
provide a comparison of both algorithms.
We combine LTLM with a 4-gram Modified Kneser-Ney language model via linear
interpolation. Our experiments with English and Czech corpora show significant
perplexity reductions (up to 46% for English and 49% for Czech) compared with
a standalone 4-gram Modified Kneser-Ney language model.
Comment: Accepted to EMNLP 201
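The linear interpolation and perplexity comparison mentioned in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the weight `lam`, and the per-word probabilities are hypothetical values chosen only to show the arithmetic.

```python
import math


def interpolate(p_a, p_b, lam):
    """Linearly interpolate two probabilities: lam * p_a + (1 - lam) * p_b."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    return lam * p_a + (1.0 - lam) * p_b


def perplexity(word_probs):
    """Perplexity = exp of the average negative log-probability per word."""
    log_sum = sum(math.log(p) for p in word_probs)
    return math.exp(-log_sum / len(word_probs))


# Hypothetical per-word probabilities for one three-word sentence
# (assumed values, not taken from the paper).
p_tree = [0.20, 0.05, 0.10]   # stand-in for a tree-based model
p_ngram = [0.10, 0.10, 0.02]  # stand-in for an n-gram model
lam = 0.5                     # assumed interpolation weight

mixed = [interpolate(a, b, lam) for a, b in zip(p_tree, p_ngram)]

print("n-gram perplexity:      ", perplexity(p_ngram))
print("interpolated perplexity:", perplexity(mixed))
```

In practice the interpolation weight would be tuned on held-out data; the fixed value here only keeps the example short.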
Natural language processing
Beginning with the basic issues of NLP, this chapter aims to chart the major research activities in this area since the last ARIST Chapter in 1996 (Haas, 1996), including: (i) natural language text processing systems - text summarization, information extraction, information retrieval, etc., including domain-specific applications; (ii) natural language interfaces; (iii) NLP in the context of WWW and digital libraries; and (iv) evaluation of NLP systems.
A Survey of Word Reordering in Statistical Machine Translation: Computational Models and Language Phenomena
Word reordering is one of the most difficult aspects of statistical machine
translation (SMT), and an important factor of its quality and efficiency.
Despite the vast amount of research published to date, the interest of the
community in this problem has not decreased, and no single method appears to be
strongly dominant across language pairs. Instead, the choice of the optimal
approach for a new translation task still seems to be mostly driven by
empirical trials. To orientate the reader in this vast and complex research
area, we present a comprehensive survey of word reordering viewed as a
statistical modeling challenge and as a natural language phenomenon. The survey
describes in detail how word reordering is modeled within different
string-based and tree-based SMT frameworks and as a stand-alone task, including
systematic overviews of the literature in advanced reordering modeling. We then
question why some approaches are more successful than others in different
language pairs. We argue that, besides measuring the amount of reordering, it
is important to understand which kinds of reordering occur in a given language
pair. To this end, we conduct a qualitative analysis of word reordering
phenomena in a diverse sample of language pairs, based on a large collection of
linguistic knowledge. Empirical results in the SMT literature are shown to
support the hypothesis that a few linguistic facts can be very useful to
anticipate the reordering characteristics of a language pair and to select the
SMT framework that best suits them.
Comment: 44 pages, to appear in Computational Linguistic
Multilingual Unsupervised Sentence Simplification
Progress in Sentence Simplification has been hindered by the lack of
supervised data, particularly in languages other than English. Previous work
has aligned sentences from original and simplified corpora such as English
Wikipedia and Simple English Wikipedia, but this limits corpus size, domain,
and language. In this work, we propose using unsupervised mining techniques to
automatically create training corpora for simplification in multiple languages
from raw Common Crawl web data. When coupled with a controllable generation
mechanism that can flexibly adjust attributes such as length and lexical
complexity, these mined paraphrase corpora can be used to train simplification
systems in any language. We further incorporate multilingual unsupervised
pretraining methods to create even stronger models and show that by training on
mined data rather than supervised corpora, we outperform the previous best
results. We evaluate our approach on English, French, and Spanish
simplification benchmarks and reach state-of-the-art performance with a totally
unsupervised approach. We will release our models and code to mine the data in
any language included in Common Crawl.
LAF-Fabric: a data analysis tool for Linguistic Annotation Framework with an application to the Hebrew Bible
The Linguistic Annotation Framework (LAF) provides a general, extensible
stand-off markup system for corpora. This paper discusses LAF-Fabric, a new
tool to analyse LAF resources in general with an extension to process the
Hebrew Bible in particular. We first walk through the history of the Hebrew
Bible as a text database in decade-long steps. Then we describe how LAF-Fabric
may serve as an analysis tool for this corpus. Finally, we describe three
analytic projects/workflows that benefit from the new LAF representation:
1) the study of linguistic variation: extract co-occurrence data of common
nouns between the books of the Bible (Martijn Naaijer); 2) the study of the
grammar of Hebrew poetry in the Psalms: extract clause typology (Gino Kalkman);
3) construction of a parser of classical Hebrew by Data Oriented Parsing:
generate tree structures from the database (Andreas van Cranenburgh).