500 research outputs found
Antecedent selection techniques for high-recall coreference resolution
We investigate methods to improve the recall of coreference resolution by also trying to resolve those definite descriptions where no earlier mention of the referent shares the same lexical head (coreferent bridging). The problem, which is notably harder than identifying coreference relations among mentions that share the same lexical head, has been tackled with several rather different approaches, and we attempt to provide a meaningful classification along with a quantitative comparison. Based on the different merits of the methods, we discuss possibilities to improve them and show how they can be effectively combined.
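The head-match baseline that the abstract contrasts against can be sketched as follows. This is a minimal illustration, not the paper's system: the `head` heuristic (last token as lexical head) and the example mentions are assumptions; real systems extract heads from parse trees. The cases where this baseline returns no antecedent are exactly the coreferent-bridging cases the paper targets.

```python
def head(mention):
    # Crude lexical-head heuristic: take the last token (an assumption;
    # real systems read the head off a syntactic parse).
    return mention.split()[-1].lower()

def resolve_head_match(prior_mentions, anaphor):
    """Return the earliest prior mention sharing the anaphor's lexical head.

    Returns None when no head match exists -- the 'coreferent bridging'
    cases that require other resolution strategies."""
    for m in prior_mentions:
        if head(m) == head(anaphor):
            return m
    return None
```

For example, "the car" resolves to "a red car" by head match, while "the vehicle" (same referent, different head) is left unresolved by this baseline.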
PP Attachment Ambiguity Resolution with Corpus-Based Pattern Distributions and Lexical Signatures
Invited Paper. In this paper, we propose a method combining unsupervised learning of lexical frequencies with semantic information, aiming at improving PP attachment ambiguity resolution. Using the output of a robust parser, i.e. the set of all possible attachments for a given sentence, we query the Web and obtain statistical information about the frequencies of the attachment distributions as well as lexical signatures of the terms in the patterns. All this information is used to weight the dependencies yielded by the parser.
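The core comparison behind frequency-based PP attachment can be sketched as below. This is a toy illustration under stated assumptions: the counts table stands in for Web hit counts (the numbers are hypothetical), and the `attach` function and its signature are inventions for this sketch, not the paper's API.

```python
# Hypothetical pattern counts, standing in for Web query frequencies.
# (verb, prep, noun2) approximates the verb-attachment pattern "V P N2";
# (noun1, prep, noun2) approximates the noun-attachment pattern "N1 P N2".
counts = {
    ("eat", "with", "fork"): 120,
    ("pizza", "with", "fork"): 3,
    ("eat", "with", "anchovies"): 8,
    ("pizza", "with", "anchovies"): 95,
}

def attach(verb, noun1, prep, noun2):
    """Decide verb vs. noun attachment by comparing pattern frequencies."""
    verb_count = counts.get((verb, prep, noun2), 0)
    noun_count = counts.get((noun1, prep, noun2), 0)
    return "verb" if verb_count >= noun_count else "noun"
```

On the classic example, "eat pizza with a fork" attaches the PP to the verb, while "eat pizza with anchovies" attaches it to the noun, because the respective patterns are more frequent.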
Apport d'un corpus comparable déséquilibré à l'extraction de lexiques bilingues (Contribution of an Unbalanced Comparable Corpus to Bilingual Lexicon Extraction)
The main work in bilingual lexicon extraction from comparable corpora is based on the implicit hypothesis that corpora are balanced. However, the related approaches are relatively insensitive to the sizes of each part of the comparable corpus. Within this context, we study the influence of unbalanced comparable corpora on the quality of bilingual terminology extraction through different experiments. Our results show the conditions under which the use of an unbalanced comparable corpus can induce a significant gain in the quality of extracted lexicons.
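The standard context-vector method underlying this line of work can be sketched as follows. This is a minimal sketch, not the paper's implementation: the toy vectors, the seed dictionary, and all function names are assumptions for illustration. A source word's context vector is mapped into the target language through a seed bilingual dictionary, then compared to target-word context vectors by cosine similarity.

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse vectors (dicts of counts)."""
    dot = sum(c * v.get(k, 0) for k, c in u.items())
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def translate_vector(src_vec, seed_dict):
    """Map a source-language context vector into target space via a seed dictionary."""
    out = {}
    for word, count in src_vec.items():
        target = seed_dict.get(word)
        if target:
            out[target] = out.get(target, 0) + count
    return out

def rank_candidates(src_vec, seed_dict, target_vecs):
    """Rank target words by similarity of their context vectors to the mapped source vector."""
    mapped = translate_vector(src_vec, seed_dict)
    return sorted(target_vecs, key=lambda t: cosine(mapped, target_vecs[t]), reverse=True)
```

With a toy French vector for "voiture" and seed entries route→road, conduire→drive, the English candidate "car" (whose contexts include road and drive) ranks first.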
Corpus Wide Argument Mining -- a Working Solution
One of the main tasks in argument mining is the retrieval of argumentative
content pertaining to a given topic. Most previous work addressed this task by
retrieving a relatively small number of relevant documents as the initial
source for such content. This line of research yielded moderate success, which
is of limited use in a real-world system. Furthermore, for such a system to
yield a comprehensive set of relevant arguments, over a wide range of topics,
it requires leveraging a large and diverse corpus in an appropriate manner.
Here we present a first end-to-end high-precision, corpus-wide argument mining
system. This is made possible by combining sentence-level queries over an
appropriate indexing of a very large corpus of newspaper articles, with an
iterative annotation scheme. This scheme addresses the inherent label bias in
the data and pinpoints the regions of the sample space whose manual labeling is
required to obtain high precision among top-ranked candidates.
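The sentence-level querying over an index that the abstract describes can be illustrated with a minimal inverted-index sketch. This is an assumption-laden toy, not the authors' system: the index structure, the conjunctive query, and the example sentences are all inventions for illustration.

```python
from collections import defaultdict

def build_index(sentences):
    """Inverted index mapping each token to the set of sentence ids containing it."""
    index = defaultdict(set)
    for i, sentence in enumerate(sentences):
        for token in sentence.lower().split():
            index[token].add(i)
    return index

def query(index, terms):
    """Sentence-level conjunctive query: ids of sentences containing all terms."""
    postings = [index.get(t.lower(), set()) for t in terms]
    return sorted(set.intersection(*postings)) if postings else []
```

A real system would retrieve argument candidates this way from millions of newspaper sentences, then rank them; here the query simply intersects posting sets.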
Bootstrapping Lexical Choice via Multiple-Sequence Alignment
An important component of any generation system is the mapping dictionary, a
lexicon of elementary semantic expressions and corresponding natural language
realizations. Typically, labor-intensive knowledge-based methods are used to
construct the dictionary. We instead propose to acquire it automatically via a
novel multiple-pass algorithm employing multiple-sequence alignment, a
technique commonly used in bioinformatics. Crucially, our method leverages
latent information contained in multi-parallel corpora -- datasets that supply
several verbalizations of the corresponding semantics rather than just one.
We used our techniques to generate natural language versions of
computer-generated mathematical proofs, with good results on both a
per-component and overall-output basis. For example, in evaluations involving a
dozen human judges, our system produced output whose readability and
faithfulness to the semantic input rivaled that of a traditional generation
system.
Comment: 8 pages; to appear in the proceedings of EMNLP-200
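The sequence-alignment machinery the abstract borrows from bioinformatics can be sketched in its pairwise form. This is a minimal Needleman–Wunsch-style alignment over token sequences, under assumed scoring parameters; the paper's multiple-sequence, multiple-pass algorithm builds on alignments of this kind but is not reproduced here.

```python
def align(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment of two token lists; returns aligned (token_a, token_b)
    pairs, with None marking a gap. Toy scoring parameters are assumptions."""
    n, m = len(a), len(b)
    # Dynamic-programming score table.
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + s,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Traceback from the bottom-right corner.
    i, j, pairs = n, m, []
    while i > 0 or j > 0:
        s = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + s:
            pairs.append((a[i - 1], b[j - 1])); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            pairs.append((a[i - 1], None)); i -= 1
        else:
            pairs.append((None, b[j - 1])); j -= 1
    return pairs[::-1]
```

Aligning two verbalizations of the same content (e.g. "the sum equals two" and "the total is two") pairs the shared tokens and exposes the slots where wording varies, which is the latent information the method mines from multi-parallel corpora.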
Introduction to the CoNLL-2000 Shared Task: Chunking
We describe the CoNLL-2000 shared task: dividing text into syntactically
related non-overlapping groups of words, so-called text chunking. We give
background information on the data sets, present a general overview of the
systems that have taken part in the shared task and briefly discuss their
performance.
Comment: 6 page
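Chunking output in the CoNLL-2000 task is conventionally represented with BIO tags, and decoding them into spans can be sketched as below. The function name and the example tag sequence are assumptions for illustration; the BIO convention itself (B- begins a chunk, I- continues it, O is outside any chunk) is the task's standard encoding.

```python
def bio_to_chunks(tags):
    """Decode a BIO tag sequence into (label, start, end) spans, end exclusive."""
    chunks, start, label = [], None, None
    for i, tag in enumerate(tags):
        if tag.startswith("B-") or (tag.startswith("I-") and tag[2:] != label):
            # A B- tag, or an I- tag with a mismatched label, starts a new chunk.
            if label is not None:
                chunks.append((label, start, i))
            start, label = i, tag[2:]
        elif tag == "O":
            if label is not None:
                chunks.append((label, start, i))
            start, label = None, None
        # An I- tag matching the open chunk's label simply extends it.
    if label is not None:
        chunks.append((label, start, len(tags)))
    return chunks
```

For the tagged sentence fragment [B-NP, I-NP, O, B-VP, B-NP, I-NP] this yields an NP over tokens 0-2, a VP over token 3, and an NP over tokens 4-6, which is the non-overlapping grouping the shared task evaluates.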