A semi-automatic approach to identifying and unifying ambiguously encoded Arabic-based characters.

Author: Jaf Sardar
Publication venue: Institute of Electrical and Electronics Engineers (IEEE)
Publication date: 13/03/2017

In this study, we outline a potential problem in normalising texts that are based on a modified version of the Arabic alphabet. One of the main resources available for processing resource-scarce languages is raw text collected from the Internet. Many less-resourced languages, such as Kurdish, Farsi, Urdu, Pashtu, etc., use a modified version of the Arabic writing system. Many characters in data harvested from the Internet may have exactly the same form but be encoded with different Unicode values (ambiguous characters). The existence of ambiguous characters in words leads to word duplication, so it is important to identify and unify ambiguous characters during the normalisation stage. Here, we demonstrate cases related to ambiguous Kurdish and Farsi characters and propose a semi-automatic approach to identifying and unifying them.
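The word-duplication effect described in this abstract can be illustrated with a minimal Python sketch. It is not the authors' method. The mapping below is an assumption chosen for illustration: it folds ARABIC LETTER KAF (U+0643) into ARABIC LETTER KEHEH (U+06A9) and ARABIC LETTER YEH (U+064A) into ARABIC LETTER FARSI YEH (U+06CC), two pairs that can render identically in some positions. The names CANONICAL, unify and vocabulary are hypothetical, and the choice of which code point counts as canonical depends on the target language.

```python
from collections import Counter

# Assumed, illustrative mapping (not taken from the paper):
# look-alike code point -> code point chosen here as canonical.
CANONICAL = {
    "\u0643": "\u06A9",  # ARABIC LETTER KAF -> ARABIC LETTER KEHEH
    "\u064A": "\u06CC",  # ARABIC LETTER YEH -> ARABIC LETTER FARSI YEH
}
TABLE = str.maketrans(CANONICAL)


def unify(text):
    """Replace every mapped look-alike character with its canonical form."""
    return text.translate(TABLE)


def vocabulary(tokens, normalise=False):
    """Count word types, optionally unifying ambiguous characters first."""
    if normalise:
        tokens = [unify(t) for t in tokens]
    return Counter(tokens)


# The same word typed once with KAF and once with KEHEH as its first letter.
tokens = ["\u0643\u062A\u0627\u0628", "\u06A9\u062A\u0627\u0628"]
raw = vocabulary(tokens)                      # two distinct word types
unified = vocabulary(tokens, normalise=True)  # one word type, counted twice
```

Without unification the two spellings are counted as separate words; after unification they collapse into a single entry.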

Durham Research Online

Crossref

A Simple Approach to Unify Ambiguously Encoded Kurdish Characters

Author: Jaf Sardar
Publication date: 09/09/2016

In this study we outline a potential problem in the normalisation stage of processing texts that are based on a modified version of the Arabic alphabet. The main source of resources available for processing resource-scarce languages is raw text. We have identified an interesting challenge that must be addressed when normalising certain natural language texts. Many less-resourced languages, such as Kurdish, Farsi, Urdu, Pashtu, etc., use a modified version of the Arabic writing system. Many characters in data harvested from the Internet may have exactly the same form but be encoded with different Unicode values (ambiguous characters). It is important to identify ambiguous characters during the normalisation stage of most text processing tasks. We will demonstrate cases related to ambiguous Kurdish and Farsi characters and propose a semi-automatic approach to identifying and unifying ambiguously encoded characters.
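One way to picture the identification step is as a character inventory that a person then reviews. The Python sketch below is an assumption about how such a review aid could look, not a description of the approach in the paper. The function names char_inventory and unexpected_chars and the sample alphabet are hypothetical; only the standard-library unicodedata module is used.

```python
import unicodedata
from collections import Counter


def char_inventory(text):
    """List (code point, Unicode name, count) for each non-space character."""
    counts = Counter(ch for ch in text if not ch.isspace())
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"), n)
        for ch, n in counts.most_common()
    ]


def unexpected_chars(text, expected):
    """Return the characters that fall outside the expected alphabet, sorted."""
    return sorted({ch for ch in text if not ch.isspace() and ch not in expected})


# Hypothetical sample: the same word typed with KAF and with KEHEH.
sample = "\u0643\u062A\u0627\u0628 \u06A9\u062A\u0627\u0628"

# Hypothetical expected alphabet for the target language (illustration only).
expected = set("\u06A9\u062A\u0627\u0628")

inventory = char_inventory(sample)
candidates = unexpected_chars(sample, expected)  # characters to review by hand
```

The inventory exposes code points that look the same on screen but carry different Unicode names, and the candidate list narrows the manual review to characters outside the alphabet the reviewer expects.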

Durham Research Online

A Semi-automatic Approach to Identifying and Unifying Ambiguously Encoded Arabic-Based Characters

Authors: Dong Minghui; Jaf Sardar; Lee Lung-Hao; Li Haizhou; Lu Yanfeng; Tseng Yuen-Hsien; Wu Chung-Hsien; Yu Liang-Chih
Publication venue: Institute of Electrical and Electronics Engineers
Publication date: 13/03/2017

Durham Research Online

Transfer Learning for Low-Resource Sentiment Analysis

Authors: Ahmadi Sina; Daneshfar Fatemeh; Hameed Razhan
Publication date: 10/04/2023

Sentiment analysis is the process of identifying and extracting subjective information from text. Despite the advances to employ cross-lingual approaches in an automatic way, the implementation and evaluation of sentiment analysis systems require language-specific data to consider various sociocultural and linguistic peculiarities. In this paper, the collection and annotation of a dataset are described for sentiment analysis of Central Kurdish. We explore a few classical machine learning and neural network-based techniques for this task. Additionally, we employ an approach in transfer learning to leverage pretrained models for data augmentation. We demonstrate that data augmentation achieves a high F1 score and accuracy despite the difficulty of the task.

Comment: 14 pages - under review at ACM TALLI

arXiv.org e-Print Archive