A semi-automatic approach to identifying and unifying ambiguously encoded Arabic-based characters.
In this study, we outline a potential problem in normalising texts that are based on a modified version of the Arabic alphabet. One of the main resources available for processing resource-scarce languages is raw text collected from the Internet. Many less-resourced languages, such as Kurdish, Farsi, Urdu, Pashtu, etc., use a modified version of the Arabic writing system. Many characters in data harvested from the Internet may have exactly the same form but be encoded with different Unicode values (ambiguous characters). The existence of ambiguous characters in words leads to word duplication, so it is important to identify and unify ambiguous characters during the normalisation stage. Here, we demonstrate cases related to ambiguous Kurdish and Farsi characters and propose a semi-automatic approach to identifying and unifying them.
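The word-duplication problem described in this abstract can be illustrated with a minimal Python sketch. It is not the authors' method: the mapping table `UNIFY_MAP` and the helper names `report_candidates` and `unify` are hypothetical, and the two character pairs are only well-known examples of look-alike Arabic-script code points. The "semi-automatic" part is assumed here to mean that a script reports candidate characters and a person decides which mappings to apply.

```python
import unicodedata
from collections import Counter

# Hypothetical mapping for illustration only (not the table from the paper).
# Each key is a code point that can look the same as its value in some
# fonts or positions, yet compares as a different character.
UNIFY_MAP = {
    "\u0643": "\u06A9",  # ARABIC LETTER KAF  -> ARABIC LETTER KEHEH
    "\u064A": "\u06CC",  # ARABIC LETTER YEH  -> ARABIC LETTER FARSI YEH
}
_TABLE = str.maketrans(UNIFY_MAP)


def report_candidates(text):
    """Count occurrences of mapped source characters for manual review."""
    counts = Counter(ch for ch in text if ch in UNIFY_MAP)
    return {
        f"U+{ord(ch):04X} {unicodedata.name(ch)}": n
        for ch, n in counts.items()
    }


def unify(text):
    """Replace every mapped source character with its chosen target."""
    return text.translate(_TABLE)


# The same word typed with two different code points for the first letter.
with_arabic_kaf = "\u0643\u062A\u0627\u0628"
with_keheh = "\u06A9\u062A\u0627\u0628"

print(with_arabic_kaf == with_keheh)                 # the raw strings differ
print(report_candidates(with_arabic_kaf))            # what a reviewer would see
print(unify(with_arabic_kaf) == unify(with_keheh))   # equal after unification
```

Counting word types before and after `unify` on a harvested corpus is one simple way to see how many duplicates the ambiguous encodings were producing.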
A Simple Approach to Unify Ambiguously Encoded Kurdish Characters
In this study we outline a potential problem in the normalisation stage of processing texts that are based on a modified version of the Arabic alphabet. The main source of resources available for processing resource-scarce languages is raw text. We have identified an interesting challenge that must be addressed when normalising certain natural language texts. Many less-resourced languages, such as Kurdish, Farsi, Urdu, Pashtu, etc., use a modified version of the Arabic writing system. Many characters in data harvested from the Internet may have exactly the same form but be encoded with different Unicode values (ambiguous characters). It is important to identify ambiguous characters during the normalisation stage of most text processing tasks. We will demonstrate cases related to ambiguous Kurdish and Farsi characters and propose a semi-automatic approach to identifying and unifying ambiguously encoded characters.
Transfer Learning for Low-Resource Sentiment Analysis
Sentiment analysis is the process of identifying and extracting subjective
information from text. Despite advances in applying cross-lingual approaches
automatically, the implementation and evaluation of sentiment analysis
systems require language-specific data to consider various sociocultural and
linguistic peculiarities. In this paper, the collection and annotation of a
dataset are described for sentiment analysis of Central Kurdish. We explore a
few classical machine learning and neural network-based techniques for this
task. Additionally, we employ an approach in transfer learning to leverage
pretrained models for data augmentation. We demonstrate that data augmentation
achieves a high F score and accuracy despite the difficulty of the task.
Comment: 14 pages - under review at ACM TALLI