Automatic propbank generation for Turkish
Semantic role labeling (SRL) is an important task for understanding natural languages, where the objective is to analyse propositions expressed by the verb and to identify each word that bears a semantic role. It provides an extensive dataset to enhance NLP applications such as information retrieval, machine translation, information extraction, and question answering. However, creating SRL models is difficult. In some languages, it is even infeasible to create SRL models that have predicate-argument structure, due to a lack of linguistic resources. In this paper, we present our method to create an automatic Turkish PropBank by exploiting parallel data from the translated sentences of English PropBank. Experiments show that our method gives promising results. © 2019 Association for Computational Linguistics (ACL). Publisher's Versio
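As a rough illustration of how annotations can be carried across parallel data, the sketch below copies role labels from English tokens to the Turkish tokens they are aligned with. It is not the authors' method: the function name `project_roles`, the alignment format (pairs of source and target token indices) and the example sentence pair are all assumptions made for this example.

```python
from typing import Dict, List, Tuple


def project_roles(
    source_roles: Dict[int, str],
    alignment: List[Tuple[int, int]],
    target_length: int,
) -> List[str]:
    """Copy role labels from source tokens to their aligned target tokens.

    source_roles maps a source token index to its label (hypothetical format).
    alignment is a list of (source_index, target_index) pairs.
    Target tokens with no labelled aligned source token keep the label "_".
    """
    target_roles = ["_"] * target_length
    for src_idx, tgt_idx in alignment:
        label = source_roles.get(src_idx)
        # Keep the first label that reaches a target token; ignore later ones.
        if label is not None and target_roles[tgt_idx] == "_":
            target_roles[tgt_idx] = label
    return target_roles


# Hypothetical sentence pair: "The boy read the book" / "Çocuk kitabı okudu".
english_roles = {1: "ARG0", 2: "PRED", 4: "ARG1"}  # boy, read, book
word_alignment = [(1, 0), (4, 1), (2, 2)]          # boy-Çocuk, book-kitabı, read-okudu
turkish_tokens = ["Çocuk", "kitabı", "okudu"]

projected = project_roles(english_roles, word_alignment, len(turkish_tokens))
print(list(zip(turkish_tokens, projected)))
```

A real pipeline would also need automatic word alignment and a policy for unaligned or many-to-one tokens, which this sketch leaves out.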
Learning narrative structure from annotated folktales
Thesis (Ph. D.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2012. This electronic version was submitted by the student author. The certified thesis is available in the Institute Archives and Special Collections. Cataloged from student submitted PDF version of thesis. Includes bibliographical references (p. 97-100). Narrative structure is a ubiquitous and intriguing phenomenon. By virtue of structure we recognize the presence of Villainy or Revenge in a story, even if that word is not actually present in the text. Narrative structure is an anvil for forging new artificial intelligence and machine learning techniques, and is a window into abstraction and conceptual learning as well as into culture and its influence on cognition. I advance our understanding of narrative structure by describing Analogical Story Merging (ASM), a new machine learning algorithm that can extract culturally-relevant plot patterns from sets of folktales. I demonstrate that ASM can learn a substantive portion of Vladimir Propp's influential theory of the structure of folktale plots. The challenge was to take descriptions at one semantic level, namely, an event timeline as described in folktales, and abstract to the next higher level: structures such as Villainy, Struggle-Victory, and Reward. ASM is based on Bayesian Model Merging, a technique for learning regular grammars. I demonstrate that, despite ASM's large search space, a carefully-tuned prior allows the algorithm to converge, and furthermore it reproduces Propp's categories with a chance-adjusted Rand index of 0.511 to 0.714. Three important categories are identified with F-measures above 0.8. The data are 15 Russian folktales, comprising 18,862 words, a subset of Propp's original tales. This subset was annotated for 18 aspects of meaning by 12 annotators using the Story Workbench, a general text-annotation tool I developed for this work.
Each aspect was doubly-annotated and adjudicated at inter-annotator F-measures that cluster around 0.7 to 0.8. It is the largest, most deeply-annotated narrative corpus assembled to date. The work has significance far beyond folktales. First, it points the way toward important applications in many domains, including information retrieval, persuasion and negotiation, natural language understanding and generation, and computational creativity. Second, abstraction from natural language semantics is a skill that underlies many cognitive tasks, and so this work provides insight into those processes. Finally, the work opens the door to a computational understanding of cultural influences on cognition and understanding cultural differences as captured in stories. by Mark Alan Finlayson. Ph.D
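For readers unfamiliar with the chance-adjusted Rand index quoted in this abstract, the sketch below computes it for two labelings of the same items. It assumes the common Hubert-Arabie formulation; the thesis may define or apply the measure differently, and the function name `adjusted_rand_index` and the toy labelings are illustrative only.

```python
from collections import Counter
from math import comb
from typing import Hashable, Sequence


def adjusted_rand_index(
    labels_a: Sequence[Hashable], labels_b: Sequence[Hashable]
) -> float:
    """Chance-adjusted Rand index (Hubert-Arabie form) of two clusterings."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both labelings must cover the same items")
    total_pairs = comb(len(labels_a), 2)
    if total_pairs == 0:
        return 1.0  # fewer than two items: the clusterings trivially agree

    # Pairs of items placed together in both clusterings, and in each alone.
    together_both = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    together_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    together_b = sum(comb(c, 2) for c in Counter(labels_b).values())

    expected = together_a * together_b / total_pairs  # agreement expected by chance
    maximum = (together_a + together_b) / 2
    if maximum == expected:
        return 1.0  # degenerate case, e.g. every item in its own cluster
    return (together_both - expected) / (maximum - expected)


# Toy labelings (not data from the thesis).
same_partition = adjusted_rand_index([0, 0, 1, 1], ["x", "x", "y", "y"])
crossed_partition = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])
print(same_partition, crossed_partition)
```

A value of 1 means the two groupings are identical up to renaming of the clusters, values near 0 mean agreement no better than chance, and negative values mean agreement worse than chance.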
Proceedings
Proceedings of the Workshop on Annotation and
Exploitation of Parallel Corpora AEPC 2010.
Editors: Lars Ahrenberg, Jörg Tiedemann and Martin Volk.
NEALT Proceedings Series, Vol. 10 (2010), 98 pages.
© 2010 The editors and contributors.
Published by
Northern European Association for Language
Technology (NEALT)
http://omilia.uio.no/nealt .
Electronically published at
Tartu University Library (Estonia)
http://hdl.handle.net/10062/15893
Distributed Representations for Compositional Semantics
The mathematical representation of semantics is a key issue for Natural
Language Processing (NLP). A lot of research has been devoted to finding ways
of representing the semantics of individual words in vector spaces.
Distributional approaches --- meaning distributed representations that exploit
co-occurrence statistics of large corpora --- have proved popular and
successful across a number of tasks. However, natural language usually comes in
structures beyond the word level, with meaning arising not only from the
individual words but also the structure they are contained in at the phrasal or
sentential level. Modelling the compositional process by which the meaning of
an utterance arises from the meaning of its parts is an equally fundamental
task of NLP.
This dissertation explores methods for learning distributed semantic
representations and models for composing these into representations for larger
linguistic units. Our underlying hypothesis is that neural models are a
suitable vehicle for learning semantically rich representations and that such
representations in turn are suitable vehicles for solving important tasks in
natural language processing. The contribution of this thesis is a thorough
evaluation of our hypothesis, as part of which we introduce several new
approaches to representation learning and compositional semantics, as well as
multiple state-of-the-art models which apply distributed semantic
representations to various tasks in NLP. Comment: DPhil Thesis, University of Oxford, Submitted and accepted in 201
GLEN: General-Purpose Event Detection for Thousands of Types
The progress of event extraction research has been hindered by the absence of
wide-coverage, large-scale datasets. To make event extraction systems more
accessible, we build a general-purpose event detection dataset GLEN, which
covers 205K event mentions with 3,465 different types, making it more than 20x
larger in ontology than today's largest event dataset. GLEN is created by
utilizing the DWD Overlay, which provides a mapping between Wikidata Qnodes and
PropBank rolesets. This enables us to use the abundant existing annotation for
PropBank as distant supervision. In addition, we also propose a new multi-stage
event detection model CEDAR specifically designed to handle the large ontology
size in GLEN. We show that our model exhibits superior performance compared to
a range of baselines including InstructGPT. Finally, we perform error analysis
and show that label noise is still the largest challenge for improving
performance for this new dataset. Our dataset, code, and models are released at
https://github.com/ZQS1943/GLEN. Comment: Accepted to EMNLP 2023. The first two authors contributed equally.
(16 pages
SRL4ORL: Improving Opinion Role Labeling using Multi-task Learning with Semantic Role Labeling
For over a decade, machine learning has been used to extract
opinion-holder-target structures from text to answer the question "Who
expressed what kind of sentiment towards what?". Recent neural approaches do
not outperform the state-of-the-art feature-based models for Opinion Role
Labeling (ORL). We suspect this is due to the scarcity of labeled training data
and address this issue using different multi-task learning (MTL) techniques
with a related task which has substantially more data, i.e. Semantic Role
Labeling (SRL). We show that two MTL models improve significantly over the
single-task model for labeling of both holders and targets, on the development
and the test sets. We found that the vanilla MTL model, which makes predictions
using only shared ORL and SRL features, performs the best. With deeper analysis
we determine what works and what might be done to make further improvements for
ORL. Comment: Published in NAACL 201
Predicate Matrix: an interoperable lexical knowledge base for predicates
183 p. The Predicate Matrix is a new lexical-semantic resource resulting from the integration of multiple knowledge sources, including FrameNet, VerbNet, PropBank and WordNet. The Predicate Matrix provides an extensive and robust lexicon that improves interoperability among the semantic resources mentioned above. Its creation is based on the integration of Semlink and new mappings obtained with automatic methods that link semantic knowledge at the lexical and role levels. We have also extended the Predicate Matrix to cover nominal predicates (English, Spanish) and predicates in other languages (Spanish, Catalan and Basque). As a result, the Predicate Matrix provides a multilingual lexicon that enables interoperable semantic analysis across multiple languages.
Proceedings
Proceedings of the Ninth International Workshop
on Treebanks and Linguistic Theories.
Editors: Markus Dickinson, Kaili Müürisep and Marco Passarotti.
NEALT Proceedings Series, Vol. 9 (2010), 268 pages.
© 2010 The editors and contributors.
Published by
Northern European Association for Language
Technology (NEALT)
http://omilia.uio.no/nealt .
Electronically published at
Tartu University Library (Estonia)
http://hdl.handle.net/10062/15891