13,990 research outputs found
Semantic role labeling for protein transport predicates
<p>Abstract</p> <p>Background</p> <p>Automatic semantic role labeling (SRL) is a natural language processing (NLP) technique that maps sentences to semantic representations. This technique has been widely studied in the recent years, but mostly with data in newswire domains. Here, we report on a SRL model for identifying the semantic roles of biomedical predicates describing protein transport in GeneRIFs – manually curated sentences focusing on gene functions. To avoid the computational cost of syntactic parsing, and because the boundaries of our protein transport roles often did not match up with syntactic phrase boundaries, we approached this problem with a word-chunking paradigm and trained support vector machine classifiers to classify words as being at the beginning, inside or outside of a protein transport role.</p> <p>Results</p> <p>We collected a set of 837 GeneRIFs describing movements of proteins between cellular components, whose predicates were annotated for the semantic roles AGENT, PATIENT, ORIGIN and DESTINATION. We trained these models with the features of previous word-chunking models, features adapted from phrase-chunking models, and features derived from an analysis of our data. Our models were able to label protein transport semantic roles with 87.6% precision and 79.0% recall when using manually annotated protein boundaries, and 87.0% precision and 74.5% recall when using automatically identified ones.</p> <p>Conclusion</p> <p>We successfully adapted the word-chunking classification paradigm to semantic role labeling, applying it to a new domain with predicates completely absent from any previous studies. By combining the traditional word and phrasal role labeling features with biomedical features like protein boundaries and MEDPOST part of speech tags, we were able to address the challenges posed by the new domain data and subsequently build robust models that achieved F-measures as high as 83.1. This system for extracting protein transport information from GeneRIFs performs well even with proteins identified automatically, and is therefore more robust than the rule-based methods previously used to extract protein transport roles.</p
Evaluating automatic cross-domain Dutch semantic role annotation
In this paper we present the first corpus where one million Dutch words from a variety of text genres have been annotated with semantic roles. 500K have been completely manually verified and used as training material to automatically label another 500K. All data has been annotated following an adapted version of the PropBank guidelines. The corpus’s rich text type diversity and the availability of manually verified syntactic dependency structures allowed us to experiment with an existing semantic role labeler for Dutch. In order to test the system’s portability across various domains, we experimented with training on individual domains and compared this with training on multiple domains by adding more data. Our results show that training on large data sets is necessary but that including genre-specific training material is also crucial to optimize classification. We observed that a small amount of in-domain training data is already sufficient to improve our semantic role labeler
Cross-Lingual Semantic Role Labeling with High-Quality Translated Training Corpus
Many efforts of research are devoted to semantic role labeling (SRL) which is
crucial for natural language understanding. Supervised approaches have achieved
impressing performances when large-scale corpora are available for
resource-rich languages such as English. While for the low-resource languages
with no annotated SRL dataset, it is still challenging to obtain competitive
performances. Cross-lingual SRL is one promising way to address the problem,
which has achieved great advances with the help of model transferring and
annotation projection. In this paper, we propose a novel alternative based on
corpus translation, constructing high-quality training datasets for the target
languages from the source gold-standard SRL annotations. Experimental results
on Universal Proposition Bank show that the translation-based method is highly
effective, and the automatic pseudo datasets can improve the target-language
SRL performances significantly.Comment: Accepted at ACL 202
Event-based Access to Historical Italian War Memoirs
The progressive digitization of historical archives provides new, often
domain specific, textual resources that report on facts and events which have
happened in the past; among these, memoirs are a very common type of primary
source. In this paper, we present an approach for extracting information from
Italian historical war memoirs and turning it into structured knowledge. This
is based on the semantic notions of events, participants and roles. We evaluate
quantitatively each of the key-steps of our approach and provide a graph-based
representation of the extracted knowledge, which allows to move between a Close
and a Distant Reading of the collection.Comment: 23 pages, 6 figure
- …