The GW/LT3 VarDial 2016 shared task system for dialects and similar languages detection
This paper describes the GW/LT3 contribution to the 2016 VarDial shared task on the identification of similar languages (task 1) and Arabic dialects (task 2). For both tasks, we experimented with Logistic Regression and Neural Network classifiers in isolation. Additionally, we implemented a cascaded classifier that consists of coarse and fine-grained classifiers (task 1) and a classifier ensemble with majority voting for task 2. The submitted systems obtained state-of-the-art performance and ranked first for the evaluation on social media data (test sets B1 and B2 for task 1), with a maximum weighted F1 score of 91.94%.
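As a rough illustration of the majority-voting ensemble mentioned in the abstract above, the following minimal sketch combines per-sample labels from several classifiers. The function names, the dialect labels, and the tie-breaking rule (the earliest classifier's label wins) are assumptions made for this example and are not taken from the paper.

```python
from collections import Counter


def majority_vote(votes):
    """Return the most frequent label among the votes.

    Assumption: ties are broken in favour of the label that appears
    first in the list, i.e. the earliest classifier wins.
    """
    counts = Counter(votes)
    best = max(counts.values())
    for label in votes:
        if counts[label] == best:
            return label


def ensemble_predict(per_classifier_labels):
    """Combine predictions from several classifiers.

    per_classifier_labels holds one list of labels per classifier,
    all aligned by sample index.
    """
    return [majority_vote(list(votes)) for votes in zip(*per_classifier_labels)]


# Hypothetical outputs of three classifiers on three samples.
predictions = [
    ["EGY", "GLF", "MSA"],
    ["EGY", "LAV", "MSA"],
    ["NOR", "LAV", "GLF"],
]
combined = ensemble_predict(predictions)
```

Here `combined` holds one label per sample, chosen by whichever label most of the three classifiers agreed on.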
Byte-based Language Identification with Deep Convolutional Networks
We report on our system for the shared task on discriminating between similar
languages (DSL 2016). The system uses only byte representations in a deep
residual network (ResNet). The system, named ResIdent, is trained only on the
data released with the task (closed training). We obtain 84.88% accuracy on
subtask A, 68.80% accuracy on subtask B1, and 69.80% accuracy on subtask B2. A
large difference in accuracy on development data can be observed with
relatively minor changes in our network's architecture and hyperparameters. We
therefore expect fine-tuning of these parameters to yield higher accuracies.
Comment: 7 pages. Adapted reviewer comments. arXiv admin note: text overlap
with arXiv:1609.0705
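To make the idea of a byte-only input representation concrete, here is a minimal sketch that maps a string to a fixed-length sequence of integer byte values. The function name `encode_bytes`, the `max_len` default, and the use of 256 as a padding id are assumptions for this example; the abstract does not state how ResIdent prepares its input.

```python
def encode_bytes(text, max_len=16, pad_id=256):
    """Encode text as a fixed-length list of UTF-8 byte values.

    Assumptions: sequences longer than max_len are truncated, and
    shorter ones are right-padded with pad_id, which lies outside
    the 0-255 byte range so it cannot collide with real input.
    """
    raw = list(text.encode("utf-8"))[:max_len]
    return raw + [pad_id] * (max_len - len(raw))


# Non-ASCII characters expand to several bytes, so no
# language-specific tokenisation or alphabet is required.
ids = encode_bytes("hé", max_len=4)
```

A sequence like `ids` could then be fed to an embedding layer with 257 entries (256 byte values plus the padding id).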
Hierarchical Character-Word Models for Language Identification
Social media messages' brevity and unconventional spelling pose a challenge
to language identification. We introduce a hierarchical model that learns
character and contextualized word-level representations for language
identification. Our method performs well against strong baselines, and can
also reveal code-switching.
- …