The GW/LT3 VarDial 2016 shared task system for dialects and similar languages detection
This paper describes the GW/LT3 contribution to the 2016 VarDial shared task on the identification of similar languages (task 1) and Arabic dialects (task 2). For both tasks, we experimented with Logistic Regression and Neural Network classifiers in isolation. Additionally, we implemented a cascaded classifier that consists of coarse and fine-grained classifiers for task 1, and a classifier ensemble with majority voting for task 2. The submitted systems obtained state-of-the-art performance and ranked first for the evaluation on social media data (test sets B1 and B2 for task 1), with a maximum weighted F1 score of 91.94%.
LDC Arabic Treebanks and Associated Corpora: Data Divisions Manual
The Linguistic Data Consortium (LDC) has developed hundreds of data corpora
for natural language processing (NLP) research. Among these are a number of
annotated treebank corpora for Arabic. Typically, these corpora consist of a
single collection of annotated documents. NLP research, however, usually
requires multiple data sets for the purposes of training models, developing
techniques, and final evaluation. Therefore it becomes necessary to divide the
corpora used into the required data sets (divisions). This document details a
set of rules that have been defined to enable consistent divisions for old and
new Arabic treebanks (ATB) and related corpora.
Learning the Latent Semantics of a Concept from its Definition
In this paper we study unsupervised word sense disambiguation (WSD) based on sense definitions. We learn low-dimensional latent semantic vectors of concept definitions to construct a more robust sense similarity measure, wmfvec. Experiments on four all-words WSD data sets show significant improvement over the baseline WSD systems and LDA-based similarity measures, achieving results comparable to state-of-the-art WSD systems.