242 research outputs found
A Robust Transformation-Based Learning Approach Using Ripple Down Rules for Part-of-Speech Tagging
In this paper, we propose a new approach to construct a system of
transformation rules for the Part-of-Speech (POS) tagging task. Our approach is
based on an incremental knowledge acquisition method where rules are stored in
an exception structure and new rules are only added to correct the errors of
existing rules; thus allowing systematic control of the interaction between the
rules. Experimental results on 13 languages show that our approach is fast in
terms of training time and tagging speed. Furthermore, our approach obtains
very competitive accuracy in comparison to state-of-the-art POS and
morphological taggers.Comment: Version 1: 13 pages. Version 2: Submitted to AI Communications - the
European Journal on Artificial Intelligence. Version 3: Resubmitted after
major revisions. Version 4: Resubmitted after minor revisions. Version 5: to
appear in AI Communications (accepted for publication on 3/12/2015
A Feature-Based Model for Nested Named-Entity Recognition at VLSP-2018 NER Evaluation Campaign
In this report, we describe our participant named-entity recognition system at VLSP 2018 evaluation campaign. We formalized the task as a sequence labeling problem using BIO encoding scheme. We applied a feature-based model which combines word, word-shape features, Brown-cluster-based features, and word-embedding-based features. We compare several methods to deal with nested entities in the dataset. We showed that combining tags of entities at all levels for training a sequence labeling model (joint-tag model) improved the accuracy of nested named-entity recognition
Can Multilingual Language Models Transfer to an Unseen Dialect? A Case Study on North African Arabizi
Building natural language processing systems for non standardized and low
resource languages is a difficult challenge. The recent success of large-scale
multilingual pretrained language models provides new modeling tools to tackle
this. In this work, we study the ability of multilingual language models to
process an unseen dialect. We take user generated North-African Arabic as our
case study, a resource-poor dialectal variety of Arabic with frequent
code-mixing with French and written in Arabizi, a non-standardized
transliteration of Arabic to Latin script. Focusing on two tasks,
part-of-speech tagging and dependency parsing, we show in zero-shot and
unsupervised adaptation scenarios that multilingual language models are able to
transfer to such an unseen dialect, specifically in two extreme cases: (i)
across scripts, using Modern Standard Arabic as a source language, and (ii)
from a distantly related language, unseen during pretraining, namely Maltese.
Our results constitute the first successful transfer experiments on this
dialect, paving thus the way for the development of an NLP ecosystem for
resource-scarce, non-standardized and highly variable vernacular languages
On the Use of Parsing for Named Entity Recognition
[Abstract] Parsing is a core natural language processing technique that can be used to obtain the structure underlying sentences in human languages. Named entity recognition (NER) is the task of identifying the entities that appear in a text. NER is a challenging natural language processing task that is essential to extract knowledge from texts in multiple domains, ranging from financial to medical. It is intuitive that the structure of a text can be helpful to determine whether or not a certain portion of it is an entity and if so, to establish its concrete limits. However, parsing has been a relatively little-used technique in NER systems, since most of them have chosen to consider shallow approaches to deal with text. In this work, we study the characteristics of NER, a task that is far from being solved despite its long history; we analyze the latest advances in parsing that make its use advisable in NER settings; we review the different approaches to NER that make use of syntactic information; and we propose a new way of using parsing in NER based on casting parsing itself as a sequence labeling task.Xunta de Galicia; ED431C 2020/11Xunta de Galicia; ED431G 2019/01This work has been funded by MINECO, AEI and FEDER of UE through the ANSWER-ASAP project (TIN2017-85160-C2-1-R); and by Xunta de Galicia through a Competitive Reference Group grant (ED431C 2020/11). CITIC, as Research Center of the Galician University System, is funded by the ConsellerÃa de Educación, Universidade e Formación Profesional of the Xunta de Galicia through the European Regional Development Fund (ERDF/FEDER) with 80%, the Galicia ERDF 2014-20 Operational Programme, and the remaining 20% from the SecretarÃa Xeral de Universidades (Ref. ED431G 2019/01). Carlos Gómez-RodrÃguez has also received funding from the European Research Council (ERC), under the European Union’s Horizon 2020 research and innovation programme (FASTPARSE, Grant No. 714150)
Supervised Identification of Writer\u27s Native Language Based on Their English Word Usage
In this paper, we investigate the possibility of constructing an automated tool for the writer\u27s first language detection based on a~document written in their second language. Since English is the contemporary lingua franca, commonly used by non-native speakers, we have chosen it to be the second language to study. In this paper, we examine English texts from computer science, a field related to mathematics. More generally, we wanted to study texts from a domain that operates with formal rules. We were able to achieve a high classification rate, about~90\%, using a relatively simple model (n-grams with logistic regression). We trained the model to distinguish twelve nationality groups/first languages based on our dataset. The classification mechanism was implemented using logistic regression with L1~regularisation, which performed well with sparse document-term data table. The experiment proved that we can use vocabulary alone to detect the first language with high accuracy
Large-scale Vietnamese point-of-interest classification using weak labeling
Point-of-Interests (POIs) represent geographic location by different categories (e.g., touristic places, amenities, or shops) and play a prominent role in several location-based applications. However, the majority of POIs category labels are crowd-sourced by the community, thus often of low quality. In this paper, we introduce the first annotated dataset for the POIs categorical classification task in Vietnamese. A total of 750,000 POIs are collected from WeMap, a Vietnamese digital map. Large-scale hand-labeling is inherently time-consuming and labor-intensive, thus we have proposed a new approach using weak labeling. As a result, our dataset covers 15 categories with 275,000 weak-labeled POIs for training, and 30,000 gold-standard POIs for testing, making it the largest compared to the existing Vietnamese POIs dataset. We empirically conduct POI categorical classification experiments using a strong baseline (BERT-based fine-tuning) on our dataset and find that our approach shows high efficiency and is applicable on a large scale. The proposed baseline gives an F1 score of 90% on the test dataset, and significantly improves the accuracy of WeMap POI data by a margin of 37% (from 56 to 93%)
A Novel Neural Network Model for Joint POS Tagging and Graph-based Dependency Parsing
We present a novel neural network model that learns POS tagging and
graph-based dependency parsing jointly. Our model uses bidirectional LSTMs to
learn feature representations shared for both POS tagging and dependency
parsing tasks, thus handling the feature-engineering problem. Our extensive
experiments, on 19 languages from the Universal Dependencies project, show that
our model outperforms the state-of-the-art neural network-based
Stack-propagation model for joint POS tagging and transition-based dependency
parsing, resulting in a new state of the art. Our code is open-source and
available together with pre-trained models at:
https://github.com/datquocnguyen/jPTDPComment: v2: also include universal POS tagging, UAS and LAS accuracies w.r.t
gold-standard segmentation on Universal Dependencies 2.0 - CoNLL 2017 shared
task test data; in CoNLL 201
- …