Search CORE

2,991 research outputs found

Exploration of Approaches to Arabic Named Entity Recognition

Author: Balla Husamelddin
Delaney Sarah Jane
Publication venue: Technological University Dublin
Publication date: 01/01/2020
Field of study

Abstract. The Named Entity Recognition (NER) task has attracted significant attention in Natural Language Processing (NLP) as it can enhance the performance of many NLP applications. In this paper, we compare English NER with Arabic NER in an experimental way to investigate the impact of using different classifiers and sets of features including language-independent and language-specific features. We explore the features and classifiers on five different datasets. We compare deep neural network architectures for NER with more traditional machine learning approaches to NER. We discover that most of the techniques and features used for English NER perform well on Arabic NER. Our results highlight the improvements achieved by using language-specific features in Arabic NER

Arrow@TUDublin

Character-Aware Neural Language Models

Author: Jernite Yacine
Kim Yoon
Rush Alexander M.
Sontag David
Publication venue
Publication date: 01/12/2015
Field of study

We describe a simple neural language model that relies only on character-level inputs. Predictions are still made at the word-level. Our model employs a convolutional neural network (CNN) and a highway network over characters, whose output is given to a long short-term memory (LSTM) recurrent neural network language model (RNN-LM). On the English Penn Treebank the model is on par with the existing state-of-the-art despite having 60% fewer parameters. On languages with rich morphology (Arabic, Czech, French, German, Spanish, Russian), the model outperforms word-level/morpheme-level LSTM baselines, again with fewer parameters. The results suggest that on many languages, character inputs are sufficient for language modeling. Analysis of word representations obtained from the character composition part of the model reveals that the model is able to encode, from characters only, both semantic and orthographic information.Comment: AAAI 201

arXiv.org e-Print Archive

Association for the Advancement of Artificial Intelligence: AAAI Publications

A Survey on Arabic Named Entity Recognition: Past, Recent Advances, and Future Trends

Author: Gu Yingjie
Huai Baoxing
Li Zechang
Qu Xiaoye
Wang Zhefeng
Xia Qingrong
Publication venue
Publication date: 08/08/2023
Field of study

As more and more Arabic texts emerged on the Internet, extracting important information from these Arabic texts is especially useful. As a fundamental technology, Named entity recognition (NER) serves as the core component in information extraction technology, while also playing a critical role in many other Natural Language Processing (NLP) systems, such as question answering and knowledge graph building. In this paper, we provide a comprehensive review of the development of Arabic NER, especially the recent advances in deep learning and pre-trained language model. Specifically, we first introduce the background of Arabic NER, including the characteristics of Arabic and existing resources for Arabic NER. Then, we systematically review the development of Arabic NER methods. Traditional Arabic NER systems focus on feature engineering and designing domain-specific rules. In recent years, deep learning methods achieve significant progress by representing texts via continuous vector representations. With the growth of pre-trained language model, Arabic NER yields better performance. Finally, we conclude the method gap between Arabic NER and NER methods from other languages, which helps outline future directions for Arabic NER.Comment: Accepted by IEEE TKD

arXiv.org e-Print Archive

PersoNER: Persian named-entity recognition

Author: Abdous M
Borzeshi EZ
Piccardi M
Poostchi H
Publication venue
Publication date: 01/01/2016
Field of study

© 1963-2018 ACL. Named-Entity Recognition (NER) is still a challenging task for languages with low digital resources. The main difficulties arise from the scarcity of annotated corpora and the consequent problematic training of an effective NER pipeline. To abridge this gap, in this paper we target the Persian language that is spoken by a population of over a hundred million people world-wide. We first present and provide ArmanPerosNERCorpus, the first manually-annotated Persian NER corpus. Then, we introduce PersoNER, an NER pipeline for Persian that leverages a word embedding and a sequential max-margin classifier. The experimental results show that the proposed approach is capable of achieving interesting MUC7 and CoNNL scores while outperforming two alternatives based on a CRF and a recurrent neural network

OPUS - University of Technology Sydney