15,138 research outputs found
Multi-Task Learning of Keyphrase Boundary Classification
Keyphrase boundary classification (KBC) is the task of detecting keyphrases
in scientific articles and labelling them with respect to predefined types.
Although important in practice, this task is so far underexplored, partly due
to the lack of labelled data. To overcome this, we explore several auxiliary
tasks, including semantic super-sense tagging and identification of multi-word
expressions, and cast the task as a multi-task learning problem with deep
recurrent neural networks. Our multi-task models perform significantly better
than previous state-of-the-art approaches on two scientific KBC datasets,
particularly for long keyphrases.
Comment: ACL 201
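As a rough illustration of the multi-task setup described in this abstract, the sketch below shares a recurrent encoder across the main KBC task and the auxiliary tagging tasks (super-sense tagging, multi-word expression identification), each with its own output head. The layer sizes, tag-set sizes and training loop are illustrative assumptions, not the architecture reported in the paper.

```python
# A minimal sketch of multi-task sequence labelling with a shared recurrent
# encoder and task-specific output heads. Hyperparameters and tag inventories
# are illustrative assumptions, not the paper's exact setup.
import torch
import torch.nn as nn

class MultiTaskTagger(nn.Module):
    def __init__(self, vocab_size, emb_dim=100, hidden_dim=128,
                 task_tagset_sizes=None):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
        # Shared bidirectional recurrent encoder used by all tasks.
        self.encoder = nn.LSTM(emb_dim, hidden_dim, batch_first=True,
                               bidirectional=True)
        # One softmax head per task: the main KBC task plus auxiliary tasks
        # (e.g. super-sense tagging, multi-word expression identification).
        self.heads = nn.ModuleDict({
            task: nn.Linear(2 * hidden_dim, n_tags)
            for task, n_tags in (task_tagset_sizes or {}).items()
        })

    def forward(self, token_ids, task):
        states, _ = self.encoder(self.embed(token_ids))
        return self.heads[task](states)  # (batch, seq_len, n_tags)

# Training alternates between tasks, updating the shared encoder each time.
model = MultiTaskTagger(vocab_size=20000,
                        task_tagset_sizes={"kbc": 7, "supersense": 83, "mwe": 3})
loss_fn = nn.CrossEntropyLoss(ignore_index=-100)  # -100 marks padded positions
optim = torch.optim.Adam(model.parameters(), lr=1e-3)

def train_step(token_ids, gold_tags, task):
    logits = model(token_ids, task)
    loss = loss_fn(logits.view(-1, logits.size(-1)), gold_tags.view(-1))
    optim.zero_grad()
    loss.backward()
    optim.step()
    return loss.item()
```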
Searching for Ground Truth: a stepping stone in automating genre classification
This paper examines genre classification of documents and its role in enabling digital libraries and other repositories to manage digital documents effectively through automation. We have previously presented genre classification as a valuable step toward automated extraction of descriptive metadata for digital material. Here, we present results from experiments with human labellers, conducted to assist in characterising the genres, to anticipate the obstacles an automated system must overcome, and to contribute to a solid testbed corpus for extending automated genre classification and testing metadata extraction tools across genres. We also describe the performance of two classifiers, based on image and stylistic modelling features, in labelling the documents on which three human labellers agreed, across fifteen genre classes.
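As a hedged illustration of the stylistic side of such a classifier, the sketch below derives a handful of simple style features from raw text and fits an off-the-shelf classifier over the agreed labels. The feature set, the classifier choice and the variable names (documents, genres) are assumptions for illustration and do not reproduce the paper's image- or style-based models.

```python
# A minimal sketch of a stylistic-feature genre classifier over human-agreed
# labels. Features and classifier are illustrative assumptions only.
import re
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def stylistic_features(text):
    tokens = re.findall(r"\w+|[^\w\s]", text)
    words = [t for t in tokens if t.isalpha()]
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    return {
        "avg_word_len": sum(map(len, words)) / max(len(words), 1),
        "avg_sent_len": len(words) / max(len(sentences), 1),
        "punct_ratio": sum(1 for t in tokens if not t.isalnum()) / max(len(tokens), 1),
        "digit_ratio": sum(1 for t in tokens if t.isdigit()) / max(len(tokens), 1),
        "capitalised_ratio": sum(1 for w in words if w[0].isupper()) / max(len(words), 1),
    }

def train_genre_classifier(documents, genres):
    # documents: raw texts; genres: the agreed label for each document
    # (one of the fifteen genre classes). Both are assumed to exist.
    model = make_pipeline(DictVectorizer(), LogisticRegression(max_iter=1000))
    model.fit([stylistic_features(d) for d in documents], genres)
    return model
```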
Model and Data Transfer for Cross-Lingual Sequence Labelling in Zero-Resource Settings
Zero-resource cross-lingual transfer approaches aim to apply supervised
models from a source language to unlabelled target languages. In this paper we
perform an in-depth study of the two main techniques employed so far for
cross-lingual zero-resource sequence labelling, based either on data or model
transfer. Although previous research has proposed translation and annotation
projection (data-based cross-lingual transfer) as an effective technique for
cross-lingual sequence labelling, in this paper we experimentally demonstrate
that high capacity multilingual language models applied in a zero-shot
(model-based cross-lingual transfer) setting consistently outperform data-based
cross-lingual transfer approaches. A detailed analysis of our results suggests
that this might be due to important differences in language use. More
specifically, machine translation often generates a textual signal which is
different to what the models are exposed to when using gold standard data,
which affects both the fine-tuning and evaluation processes. Our results also
indicate that data-based cross-lingual transfer approaches remain a competitive
option when high-capacity multilingual language models are not available.Comment: Findings of the Association for Computational Linguistics: EMNLP 202