The GW/LT3 VarDial 2016 shared task system for dialects and similar languages detection

Authors: Desmet Bart, Diab Mona, Zirikly Ayah
Publication venue: The COLING 2016 Organizing Committee
Publication date: 01/01/2016

This paper describes the GW/LT3 contribution to the 2016 VarDial shared task on the identification of similar languages (task 1) and Arabic dialects (task 2). For both tasks, we experimented with Logistic Regression and Neural Network classifiers in isolation. Additionally, we implemented a cascaded classifier that consists of coarse and fine-grained classifiers (task 1) and a classifier ensemble with majority voting for task 2. The submitted systems obtained state-of-the-art performance and ranked first for the evaluation on social media data (test sets B1 and B2 for task 1), with a maximum weighted F1 score of 91.94%.

Ghent University Academic Bibliography

LDC Arabic Treebanks and Associated Corpora: Data Divisions Manual

Authors: Diab Mona, Habash Nizar, Rambow Owen, Roth Ryan
Publication date: 01/01/2013

The Linguistic Data Consortium (LDC) has developed hundreds of data corpora for natural language processing (NLP) research. Among these are a number of annotated treebank corpora for Arabic. Typically, these corpora consist of a single collection of annotated documents. NLP research, however, usually requires multiple data sets for the purposes of training models, developing techniques, and final evaluation. Therefore, it becomes necessary to divide the corpora used into the required data sets (divisions). This document details a set of rules that have been defined to enable consistent divisions for old and new Arabic treebanks (ATB) and related corpora. Comment: 14 pages; one cove

arXiv.org e-Print Archive

Columbia University Academic Commons

Learning the Latent Semantics of a Concept from its Definition

Authors: Diab Mona, Guo Weiwei
Publication venue: 'Columbia University Libraries/Information Services'
Publication date: 01/01/2012

In this paper we study unsupervised word sense disambiguation (WSD) based on sense definition. We learn low-dimensional latent semantic vectors of concept definitions to construct a more robust sense similarity measure wmfvec. Experiments on four all-words WSD data sets show significant improvement over the baseline WSD systems and LDA based similarity measures, achieving results comparable to state of the art WSD systems.

Columbia University Academic Commons