1,617 research outputs found
Dictionary writing system (DWS) plus corpus query package (CQP): the case of TshwaneLex
In this article the integrated corpus query functionality of the dictionary compilation software TshwanelLex is analysed. Attention is given to the handling of both raw corpus data and annotated corpus data. With regard to the latter it is shown how, with a minimum of human effort, machine learning techniques can be employed to obtain part-of-speech tagged corpora that can be used for lexicographic purposes. All points are illustrated with data drawn from English and Northern Sotho. The tools and techniques themselves, however, are language-independent, and as Such the encouraging outcomes of this study are far-reaching
Towards Universal Semantic Tagging
The paper proposes the task of universal semantic tagging---tagging word
tokens with language-neutral, semantically informative tags. We argue that the
task, with its independent nature, contributes to better semantic analysis for
wide-coverage multilingual text. We present the initial version of the semantic
tagset and show that (a) the tags provide semantically fine-grained
information, and (b) they are suitable for cross-lingual semantic parsing. An
application of the semantic tagging in the Parallel Meaning Bank supports both
of these points as the tags contribute to formal lexical semantics and their
cross-lingual projection. As a part of the application, we annotate a small
corpus with the semantic tags and present new baseline result for universal
semantic tagging.Comment: 9 pages, International Conference on Computational Semantics (IWCS
Cardinal Virtues: Extracting Relation Cardinalities from Text
Information extraction (IE) from text has largely focused on relations
between individual entities, such as who has won which award. However, some
facts are never fully mentioned, and no IE method has perfect recall. Thus, it
is beneficial to also tap contents about the cardinalities of these relations,
for example, how many awards someone has won. We introduce this novel problem
of extracting cardinalities and discusses the specific challenges that set it
apart from standard IE. We present a distant supervision method using
conditional random fields. A preliminary evaluation results in precision
between 3% and 55%, depending on the difficulty of relations.Comment: 5 pages, ACL 2017 (short paper
Target-Side Context for Discriminative Models in Statistical Machine Translation
Discriminative translation models utilizing source context have been shown to
help statistical machine translation performance. We propose a novel extension
of this work using target context information. Surprisingly, we show that this
model can be efficiently integrated directly in the decoding process. Our
approach scales to large training data sizes and results in consistent
improvements in translation quality on four language pairs. We also provide an
analysis comparing the strengths of the baseline source-context model with our
extended source-context and target-context model and we show that our extension
allows us to better capture morphological coherence. Our work is freely
available as part of Moses.Comment: Accepted as a long paper for ACL 201
#Bieber + #Blast = #BieberBlast: Early Prediction of Popular Hashtag Compounds
Compounding of natural language units is a very common phenomena. In this
paper, we show, for the first time, that Twitter hashtags which, could be
considered as correlates of such linguistic units, undergo compounding. We
identify reasons for this compounding and propose a prediction model that can
identify with 77.07% accuracy if a pair of hashtags compounding in the near
future (i.e., 2 months after compounding) shall become popular. At longer times
T = 6, 10 months the accuracies are 77.52% and 79.13% respectively. This
technique has strong implications to trending hashtag recommendation since
newly formed hashtag compounds can be recommended early, even before the
compounding has taken place. Further, humans can predict compounds with an
overall accuracy of only 48.7% (treated as baseline). Notably, while humans can
discriminate the relatively easier cases, the automatic framework is successful
in classifying the relatively harder cases.Comment: 14 pages, 4 figures, 9 tables, published in CSCW (Computer-Supported
Cooperative Work and Social Computing) 2016. in Proceedings of 19th ACM
conference on Computer-Supported Cooperative Work and Social Computing (CSCW
2016
- …