96 research outputs found
AnCora-Nom: A Spanish lexicon of deverbal nominalizations
This paper describes a new lexical resource: AnCora-Nom, a Spanish lexicon of deverbal nominalizations. At present, it contains 1,655 lexical entries and 3,094 senses. Each sense is associated with a denotation type, and the mapping of nominal complements to arguments and their corresponding theta roles is also annotated. A particularly interesting feature of this lexicon is that it was automatically extracted from the annotated AnCora-Es corpus. AnCora-Nom was derived taking into account not only the information directly related to nominalizations but also the morphological and syntactic-semantic information annotated in the corpus, such as WordNet synsets, the specifier type of the nominalization, and its morphological number (singular or plural).
AnCora-Nom: un léxico de nominalizaciones deverbales del español (AnCora-Nom: a lexicon of Spanish deverbal nominalizations)
This paper describes a new lexical resource: AnCora-Nom, a Spanish lexicon of deverbal nominalizations. At present, it contains 1,655 lexical entries and 3,094 senses. Each sense is associated with a denotation type, and the mapping of nominal complements to arguments and their corresponding theta roles is also annotated. A particularly interesting feature of this lexicon is that it was automatically extracted from the annotated AnCora-Es corpus. AnCora-Nom was derived taking into account not only the information directly related to deverbal nominalizations but also the morphological and syntactic-semantic information previously annotated in the corpus. This research has received support from the projects Text-Knowledge 2.0 (TIN2009-13391-C04-04) and AnCora-Net (FFI2009-06497-E/FILO) from the Spanish Ministry of Science and Innovation, and an FPU grant (AP2007-01028) from the Spanish Ministry of Education.
Text as scene: discourse deixis and bridging relations
This paper presents a new framework, “text as scene”, which lays the foundations for the annotation of two coreferential links: discourse deixis and bridging relations. The incorporation of what we call textual and contextual scenes provides more flexible annotation guidelines in which broad type categories are clearly differentiated. By dealing with discourse deixis and bridging relations from a common perspective, the framework aims to improve the low inter-annotator agreement obtained by previous annotation schemes, which fail to capture the vague references inherent in both these links. The guidelines presented here complete the annotation scheme designed to enrich the Spanish CESS-ECE corpus with coreference information, thus building the CESS-Ancora corpus. This paper has been supported by the FPU grant (AP2006-00994) from the Spanish Ministry of Education and Science. It is based on work supported by the CESS-ECE (HUM2004-21127), Lang2World (TIN2006-15265-C06-06), and Praxem (HUM2006-27378-E) projects.
The use of the past tense aspect in Spanish by study At-Home and Study-Abroad Chinese learners in semi-guided written tasks
This work focuses on the influence of L2 acquisition environments (At-Home and Study-Abroad) on the language proficiency of L1 Mandarin Chinese learners of Spanish. We chose the use of Spanish past tense aspect (pretérito indefinido and pretérito imperfecto) as the entry point to analyze Chinese learners' proficiency in three semi-guided writing tasks. Our results reveal that the different teaching objectives in these acquisition environments promote a different development of Chinese learners' language capacities in Spanish: the At-Home learners have a more native-like performance when factors at the discourse level are taken into account, whereas the Study-Abroad learners have a more native-like performance when factors at the lexical level are taken into account. However, the usage patterns of the Spanish past tense aspect by learners in both environments share prototypical associations at the lexical and discourse levels. Keywords: past tense aspect, acquisition environment, L2 Spanish, L1 Mandarin Chinese
Iarg-AnCora: Spanish corpus annotated with implicit arguments
This article presents the Spanish Iarg-AnCora corpus (400k words, 13,883 sentences) annotated with the implicit arguments of deverbal nominalizations (18,397 occurrences). We describe the methodology used to create it, focusing on the annotation scheme and criteria adopted. The corpus was manually annotated and an inter-annotator agreement test was conducted (81% observed agreement) in order to ensure the reliability of the final resource. The annotation of implicit arguments results in an important gain in argument and thematic role coverage (128% on average). It is the first freely available, wide-coverage corpus annotated with implicit arguments for the Spanish language. This corpus can subsequently be used by machine learning-based semantic role labeling systems, and for the linguistic analysis of implicit arguments grounded on real data. Semantic analyzers are essential components of current language technology applications, which need a deeper understanding of the text in order to make higher-level inferences and obtain qualitative improvements in their results.
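As a rough illustration of the agreement figure reported above, observed agreement is simply the share of items on which two annotators assign the same label. The sketch below is a minimal example under that assumption; the function name `observed_agreement` and the label values are hypothetical and are not taken from the Iarg-AnCora annotation scheme or data.

```python
from typing import Sequence


def observed_agreement(labels_a: Sequence[str], labels_b: Sequence[str]) -> float:
    """Return the fraction of items on which two annotators chose the same label."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both annotators must label the same items")
    if not labels_a:
        raise ValueError("at least one item is required")
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return matches / len(labels_a)


# Hypothetical labels for five implicit-argument candidates (not corpus data).
annotator_1 = ["arg0", "arg1", "none", "arg1", "arg2"]
annotator_2 = ["arg0", "arg1", "arg1", "arg1", "arg2"]

agreement = observed_agreement(annotator_1, annotator_2)
print(f"{agreement:.0%}")  # 80%
```

Observed agreement does not correct for chance; whether the authors also report a chance-corrected coefficient is not stated in the abstract.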
Empirical methods for the study of denotation in nominalizations in Spanish
This article deals with deverbal nominalizations in Spanish; specifically, we focus on the denotative distinction between event and result nominalizations. The goal of this work is twofold: first, to detect the most relevant features for this denotative distinction; and, second, to build an automatic classification system of deverbal nominalizations according to their denotation. We have based our study on theoretical hypotheses dealing with this semantic distinction and we have analyzed them empirically by means of Machine Learning techniques, which are the basis of the ADN-Classifier. This is the first tool that aims to automatically classify deverbal nominalizations into event, result, or underspecified denotation types in Spanish. The ADN-Classifier has helped us to quantitatively evaluate the validity of our claims regarding deverbal nominalizations. We set up a series of experiments in order to test the ADN-Classifier with different models and in different realistic scenarios depending on the knowledge resources and natural language processors available. The ADN-Classifier achieved good results (87.20% accuracy).
Text as Scene: Discourse Deixis and Bridging Relations
This paper presents a new framework, "text as scene", which lays the foundations for the annotation of two coreferential links: discourse deixis and bridging relations. The incorporation of what we call textual and contextual scenes provides more flexible annotation guidelines in which broad type categories are clearly differentiated. By dealing with discourse deixis and bridging relations from a common perspective, the framework aims to improve the poor reliability scores obtained by previous annotation schemes, which fail to capture the vague references inherent in both these links. The guidelines presented here complete the annotation scheme designed to enrich the Spanish CESS-ECE corpus with coreference information, thus building the CESS-Ancora corpus.
Tecnologies de la llengua i les seves aplicacions (Language technologies and their applications)
[Abstract] In recent years, research on Computational Linguistics and Natural Language Processing has led to Language Technologies, whose main goal is to develop computer systems capable of recognizing, understanding and generating human language in all its forms. For this purpose, several applications have been developed, such as Machine Translation, Information Retrieval and Information Extraction, or Document Classification. These applications process information in order to ease the access to, organization of and transmission of the knowledge generated by the Information Society in which we live.
As in other scientific disciplines, Computational Linguistics and Natural Language Processing have gone from a first period of basic, experimental research to a second one that interacts more closely with society and is therefore more interested in creating products and applications that solve real problems. This means that we need to develop systems and resources capable of dealing with unrestricted language, that is, broad-coverage systems and resources.
This paper presents an introduction to the linguistic resources and the main applications currently being developed in the Language Technologies framework. More specifically, among the resources it emphasizes morphological analyzers, taggers, syntactic parsers, computational lexicons and linguistically annotated corpora. As for applications, it focuses on Information Retrieval, Information Extraction and Machine Translation.
DISCOver: DIStributional approach based on syntactic dependencies for discovering COnstructions
One of the goals in Cognitive Linguistics is the automatic identification and analysis of constructions, since they are fundamental linguistic units for understanding language. This article presents DISCOver, an unsupervised methodology for the automatic discovery of lexico-syntactic patterns that can be considered as candidates for constructions. This methodology follows a distributional semantic approach. Specifically, it is based on our proposed pattern-construction hypothesis: those contexts that are relevant to the definition of a cluster of semantically related words tend to be (part of) lexico-syntactic constructions. Our proposal uses Distributional Semantic Models to model the context, taking into account syntactic dependencies. After a clustering process, we linked all those clusters with strong relationships and used them as a source of information for deriving lexico-syntactic patterns, obtaining a total of 220,732 candidates from a 100-million-token corpus of Spanish. We evaluated the resulting patterns intrinsically, applying statistical association measures, and they were also evaluated qualitatively by experts. Our results were superior to the baseline in both quality and quantity in all cases. While our experiments have been carried out using a Spanish corpus, this methodology is language independent and only requires a large corpus annotated with parts of speech and dependencies.
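The general idea of a dependency-based distributional model can be sketched as follows: each word is represented by counts of the syntactic contexts it occurs in, words with similar vectors are grouped, and the contexts shared by a group become pattern candidates. This is only a toy sketch under those assumptions, not the DISCOver implementation; the triples, the relation label `dobj`, and the helper names `build_vectors` and `cosine` are all hypothetical.

```python
import math
from collections import Counter, defaultdict

# Hypothetical (word, dependency relation, head) triples; not taken from the paper's corpus.
triples = [
    ("vino", "dobj", "beber"),
    ("vino", "dobj", "servir"),
    ("cerveza", "dobj", "beber"),
    ("cerveza", "dobj", "servir"),
    ("libro", "dobj", "leer"),
    ("libro", "dobj", "escribir"),
]


def build_vectors(triples):
    """Map each word to a count vector over its (relation, head) contexts."""
    vectors = defaultdict(Counter)
    for word, relation, head in triples:
        vectors[word][(relation, head)] += 1
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * v[context] for context, count in u.items())
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


vectors = build_vectors(triples)
sim_drinks = cosine(vectors["vino"], vectors["cerveza"])  # identical contexts
sim_mixed = cosine(vectors["vino"], vectors["libro"])     # no shared contexts

# Contexts shared by two similar words: toy stand-ins for pattern candidates.
shared = set(vectors["vino"]) & set(vectors["cerveza"])
```

In this toy data, "vino" and "cerveza" occur as objects of the same verbs, so their shared contexts (object of "beber", object of "servir") are the kind of material the pattern-construction hypothesis treats as candidate constructions. The clustering, cluster linking and association measures described in the abstract are not reproduced here.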
AnCora-Net: integración multilingüe de recursos lingüísticos (AnCora-Net: multilingual integration of linguistic resources)
AnCora-Net is a multilingual verbal lexicon created by integrating the AnCora-Verb verbal lexicons for Catalan and Spanish into the English Unified Verb Index. The Unified Verb Index brings together different broad-coverage knowledge sources for English that are undoubtedly a benchmark in semantic representation. Integrating our resources with those for English allows us to enrich the content of the AnCora-Verb lexicons with semantic information encoded for English. Likewise, the Unified Verb Index is enriched by the incorporation of the Catalan and Spanish AnCora-Verb lexicons, resulting in a multilingual resource that can be useful for comparative studies.
- …