27 research outputs found

    Exploring Twitter as a Source of an Arabic Dialect Corpus

    Given the lack of Arabic dialect text corpora in comparison with what is available for dialects of English and other languages, there is a need to create dialect text corpora for use in Arabic natural language processing. Moreover, Arabic dialects are increasingly used in social media, so this text is now considered an appropriate source for a corpus. We collected 210,915K tweets from five groups of Arabic dialects: Gulf, Iraqi, Egyptian, Levantine, and North African. This paper explores Twitter as a source and describes the methods that we used to extract tweets and classify them according to the geographic location of the sender. We classified Arabic dialects using the Waikato Environment for Knowledge Analysis (WEKA) data analytics tool, which contains many alternative filters and classifiers for machine learning. Our approach to classifying tweets achieved an accuracy of 79%.


    This article studies sound variation and sound change in the Arabic dialect of Pasar Kliwon. Data were collected using the observation (simak) and conversation (cakap) methods, with recording (rekam) and note-taking (catat) techniques, based on a question list of 120 Swadesh vocabulary items. Data analysis used the padan method, relying on the informants' speech organs. The analysis applies sound change theory according to Crowley (1992) and Muslich (2012). The vowel sounds in the Arabic dialect of Pasar Kliwon are divided into two kinds: short vowels and long vowels. There are twenty-seven consonant sounds, divided into seven kinds: plosive, fricative, affricate, liquid, voiced, voiceless, and velarized sounds. The semi-vowel variants are wawu and ya>’. Vowel sound change is of four kinds: lenition, anaptyxis, apocope, and metathesis. Consonant sound change is of four kinds: lenition, anaptyxis, apocope, and syncope. The diphthong sound change is monophthongization.

    Phonetic inventory for an Arabic speech corpus

    Corpus design for speech synthesis is well researched for languages such as English, but less so for Modern Standard Arabic, and there is a tendency to focus on methods that automatically generate the orthographic transcript to be recorded (usually greedy methods). In this work, a study of Modern Standard Arabic (MSA) phonetics and phonology is conducted in order to create criteria for a greedy method that produces a speech corpus transcript for recording. The size of the dataset is reduced a number of times using these optimisation methods with different parameters, yielding a much smaller dataset with the same phonetic coverage as before the reduction, and this output transcript is chosen for recording. This is part of a larger work to create a completely annotated and segmented speech corpus for MSA.
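    The greedy reduction idea mentioned in this abstract can be illustrated with a minimal sketch. It is not the authors' method: the abstract does not give their selection criteria, so this version simply assumes that each candidate sentence is described by a set of phonetic unit labels (the diphone-like strings below are hypothetical) and repeatedly picks the sentence that covers the most units not yet covered.

```python
from typing import Dict, Iterable, List, Set


def greedy_select(candidates: Dict[str, Iterable[str]]) -> List[str]:
    """Pick sentence ids greedily until every phonetic unit is covered.

    `candidates` maps a sentence id to the phonetic units it contains.
    Both the ids and the unit labels are illustrative assumptions.
    """
    units: Dict[str, Set[str]] = {sid: set(u) for sid, u in candidates.items()}

    # Everything that appears anywhere in the candidate pool must be covered.
    uncovered: Set[str] = set()
    for covered in units.values():
        uncovered |= covered

    chosen: List[str] = []
    while uncovered:
        # Choose the sentence adding the most still-uncovered units;
        # ties go to the sentence listed first.
        best = max(units, key=lambda sid: len(units[sid] & uncovered))
        chosen.append(best)
        uncovered -= units[best]
    return chosen


# Hypothetical candidate pool with made-up unit labels.
candidates = {
    "s1": ["b-a", "a-t"],
    "s2": ["b-a", "a-t", "t-u"],
    "s3": ["k-i"],
    "s4": ["t-u"],
}
selected = greedy_select(candidates)
```

    In this toy pool the selection keeps two of the four sentences while preserving the full unit inventory, which is the "smaller dataset, same coverage" property the abstract describes. A real criterion would likely also weight units by rarity or sentence length, which this sketch does not attempt.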


    This paper is concerned with four main aspects of forensic linguistics: forensic linguistics in speech and in writing, the special status of Arabic, linguistic problems and possibilities of translation for forensics, and Language Analysis for Determination of Origin (LADO). After presenting these issues in the introduction, we describe the language situation of Arabic, mainly in Israel, in the context of these four issues. The discussion is based on the literature concerning problems of translation and LADO in courts of justice in various countries, including Israel. We consider LADO a developing field of forensic linguistics, and demonstrate by examples some problems that may arise from speech recordings of Arabic-speaking asylum seekers. Based on this survey, we point out in the conclusion some research needs of general forensic linguistics and Arabic-related forensic linguistics.



    Supervector pre-processing for PRSVM-based Chinese and Arabic dialect identification


    Unsupervised Phoneme Segmentation Based on Main Energy Change for Arabic Speech, Journal of Telecommunications and Information Technology, 2017, no. 1

    In this paper, a new method for segmenting speech at the phoneme level is presented. For this purpose, the author uses the short-time Fourier transform of the speech signal. The goal is to identify the locations of the main energy changes in frequency over time, which can be described as phoneme boundaries. A frequency range analysis and a search for energy changes in individual areas are applied to identify more precisely the speech segments that carry vowel and consonant segments confined to a small number of narrow spectral areas. The method uses only the power spectrum of the signal for segmentation. There is no need to adapt the parameters or to train for different speakers in advance. In addition, no transcript information, prior linguistic knowledge about the phonemes, or voiced/unvoiced decision making is required. Segmentation results obtained with the proposed method have been compared with a manual segmentation and with three segmentation methods of the same kind. These results show that 81% of the boundaries are successfully identified. This research aims to improve the acoustic parameters for all Arabic speech processing systems.
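    The general idea of locating boundaries at large changes in the short-time power spectrum can be sketched as follows. This is an illustrative simplification and not the paper's algorithm: the frame length, hop size, threshold, normalised change measure, and peak-picking rule are all assumptions, and the per-band analysis described in the abstract is omitted. Only the Python standard library is used, with a naive DFT standing in for an optimised STFT.

```python
import cmath
import math
from typing import List, Sequence


def power_spectrum(frame: Sequence[float]) -> List[float]:
    """One-sided power spectrum of a Hann-windowed frame (naive DFT)."""
    n = len(frame)
    windowed = [x * (0.5 - 0.5 * math.cos(2 * math.pi * i / n))
                for i, x in enumerate(frame)]
    spectrum = []
    for k in range(n // 2 + 1):
        acc = sum(windowed[t] * cmath.exp(-2j * math.pi * k * t / n)
                  for t in range(n))
        spectrum.append(abs(acc) ** 2)
    return spectrum


def energy_change_boundaries(signal: Sequence[float], frame_len: int = 64,
                             hop: int = 64, threshold: float = 0.5) -> List[int]:
    """Return sample positions where the spectrum changes sharply.

    All parameter defaults are assumed values chosen for illustration.
    """
    starts = range(0, len(signal) - frame_len + 1, hop)
    spectra = [power_spectrum(signal[s:s + frame_len]) for s in starts]

    # Normalised spectral change between consecutive frames, in [0, 1].
    change = []
    for prev, cur in zip(spectra, spectra[1:]):
        total = sum(prev) + sum(cur)
        diff = sum(abs(a - b) for a, b in zip(prev, cur))
        change.append(diff / total if total else 0.0)

    # Keep local maxima of the change curve that exceed the threshold.
    boundaries = []
    for i, c in enumerate(change):
        left = change[i - 1] if i > 0 else 0.0
        right = change[i + 1] if i + 1 < len(change) else 0.0
        if c >= threshold and c >= left and c > right:
            # Midpoint between the centres of frame i and frame i + 1.
            boundaries.append(i * hop + (hop + frame_len) // 2)
    return boundaries


# Synthetic test signal: one tone for 256 samples, then a higher tone.
signal = ([math.sin(2 * math.pi * 4 * t / 64) for t in range(256)]
          + [math.sin(2 * math.pi * 12 * t / 64) for t in range(256, 512)])
boundaries = energy_change_boundaries(signal, frame_len=64, hop=64)
```

    On this synthetic signal the only sharp spectral change is the tone switch at sample 256. Real speech would need overlapping frames, smoothing of the change curve, and a tuned threshold, none of which are specified in the abstract.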