8,772 research outputs found
Corpora and evaluation tools for multilingual named entity grammar development
We present an effort for the development of multilingual named entity grammars in a unification-based finite-state formalism (SProUT). Following an extended version of the MUC7 standard, we have developed Named Entity Recognition grammars for German, Chinese, Japanese, French, Spanish, English, and Czech. The grammars recognize person names, organizations, geographical locations, currency, time and date expressions. Subgrammars and gazetteers are shared as much as possible for the grammars of the different languages. Multilingual corpora from the business domain are used for grammar development and evaluation. The annotation format (named entity and other linguistic information) is described. We present an evaluation tool which provides detailed statistics and diagnostics, allows for partial matching of annotations, and supports user-defined mappings between different annotation and grammar output formats
An example-based approach to translating sign language
Users of sign languages are often forced to use a language in which they have reduced competence simply because documentation in their preferred format is not available. While some research exists on translating between natural and sign languages, we present here what we believe to be the first attempt to tackle this problem using an example-based (EBMT) approach.
Having obtained a set of English–Dutch Sign Language examples, we employ an approach to EBMT using the ‘Marker Hypothesis’ (Green, 1979), analogous to the successful system of (Way & Gough, 2003), (Gough & Way, 2004a) and (Gough & Way, 2004b). In a set of experiments, we show that
encouragingly good translation quality may be obtained using such an approach
Lost in translation: the problems of using mainstream MT evaluation metrics for sign language translation
In this paper we consider the problems of applying corpus-based techniques to minority languages that are neither politically recognised nor have a formally accepted writing system, namely sign languages. We discuss the adoption of an annotated form of sign language data as a suitable corpus for the development of a data-driven machine translation (MT) system, and deal with issues that arise from its use. Useful software tools that facilitate easy annotation of video data are also discussed. Furthermore, we address the problems of using traditional MT evaluation metrics for sign language translation. Based on the candidate translations produced from our example-based machine translation system, we discuss why standard metrics fall short of providing an accurate evaluation and suggest more suitable evaluation methods
Treebank-based acquisition of wide-coverage, probabilistic LFG resources: project overview, results and evaluation
This paper presents an overview of a project to acquire wide-coverage, probabilistic Lexical-Functional Grammar
(LFG) resources from treebanks. Our approach is based on an automatic annotation algorithm that annotates “raw” treebank trees with LFG f-structure information approximating to basic predicate-argument/dependency structure. From the f-structure-annotated treebank
we extract probabilistic unification grammar resources. We present the annotation algorithm, the extraction of
lexical information and the acquisition of wide-coverage and robust PCFG-based LFG approximations including
long-distance dependency resolution.
We show how the methodology can be applied to multilingual, treebank-based unification grammar acquisition. Finally
we show how simple (quasi-)logical forms can be derived automatically from the f-structures generated for the treebank trees
Evaluating two methods for Treebank grammar compaction
Treebanks, such as the Penn Treebank, provide a basis for the automatic creation of broad coverage grammars. In the simplest case, rules can simply be ‘read off’ the parse-annotations of the corpus, producing either a simple or probabilistic context-free grammar. Such grammars, however, can be very large, presenting problems for the subsequent computational costs of parsing under the grammar.
In this paper, we explore ways by which a treebank grammar can be reduced in size or ‘compacted’, which involve the use of two kinds of technique: (i) thresholding of rules by their number of occurrences; and (ii) a method of rule-parsing, which has both probabilistic and non-probabilistic variants. Our results show that by a combined use of these two techniques, a probabilistic context-free grammar can be reduced in size by 62% without any loss in parsing performance, and by 71% to give a gain in recall, but some loss in precision
GenERRate: generating errors for use in grammatical error detection
This paper explores the issue of automatically generated ungrammatical data and its use in error detection, with a focus on the task of classifying a sentence as grammatical or ungrammatical. We present an error generation tool called GenERRate and show how GenERRate can be used to improve the performance of a classifier on learner data. We describe
initial attempts to replicate Cambridge Learner Corpus errors using GenERRate
Semantic Types, Lexical Sorts and Classifiers
We propose a cognitively and linguistically motivated set of sorts for
lexical semantics in a compositional setting: the classifiers in languages that
do have such pronouns. These sorts are needed to include lexical considerations
in a semantical analyser such as Boxer or Grail. Indeed, all proposed lexical
extensions of usual Montague semantics to model restriction of selection,
felicitous and infelicitous copredication require a rich and refined type
system whose base types are the lexical sorts, the basis of the many-sorted
logic in which semantical representations of sentences are stated. However,
none of those approaches define precisely the actual base types or sorts to be
used in the lexicon. In this article, we shall discuss some of the options
commonly adopted by researchers in formal lexical semantics, and defend the
view that classifiers in the languages which have such pronouns are an
appealing solution, both linguistically and cognitively motivated
- …