381 research outputs found

    A Guide to Copy Cataloging Arabic Materials

    For most catalogers, non-Roman script languages are more difficult to catalog than those in Roman scripts, and Arabic is particularly problematic. The cataloger must have a firm grasp of the language in order to correctly supply unwritten vowels and to use the standard Arabic-English dictionary, which lists words by root rather than alphabetically. This manual presents the cataloger who does not have that language knowledge with strategies for effective copy-cataloging searches. Topics include the development of Arabic cataloging automation, problems of name authority, distinguishing between Arabic and other languages written in the Arabic script, and using a non-alphabetic Arabic-to-English dictionary.

    ArabTeX : a system for typesetting Arabic; user manual version 3.00

    ArabTeX is a package that extends TeX/LaTeX to generate Arabic writing from an ASCII transliteration for texts in several languages that use the Arabic script. It consists of a TeX macro package and an Arabic font in several sizes, presently available only in the Naskhi style. ArabTeX runs with Plain TeX and also with LaTeX. It is compatible with NFSS, NFSS2 and the EDMAC package; other additions to TeX have not been tried. ArabTeX is primarily intended for generating Arabic writing, but the standard scientific transliteration can also be produced easily. Some limited support is available for languages other than Arabic that are customarily written in the Arabic script. ArabTeX defines its own input notation, which is both machine- and human-readable and is suited to electronic transmission and email communication. Texts in some of the standard Arabic encodings can also be processed. ArabTeX is copyrighted, but free use for scientific, experimental and other strictly private, noncommercial purposes is granted. Offprints of publications using ArabTeX are welcome. Using ArabTeX otherwise requires a license agreement. There is no warranty of any kind, either expressed or implied. The entire risk as to the quality and performance rests with the user.

    Urdu Handwritten Characters Data Visualization and Recognition Using Distributed Stochastic Neighborhood Embedding and Deep Network

    This study was supported by the China University of Petroleum-Beijing and Fundamental Research Funds for Central Universities under Grant no. 2462020YJRC001. Peer reviewed.

    A Simple Approach to Unify Ambiguously Encoded Kurdish Characters

    In this study, we outline a potential problem in the normalisation stage of processing texts that are based on a modified version of the Arabic alphabet. The main resource available for processing resource-scarce languages is raw text. We have identified an interesting challenge that must be addressed when normalising certain natural language texts. Many less-resourced languages, such as Kurdish, Farsi, Urdu, Pashtu, etc., use a modified version of the Arabic writing system. Many characters in data harvested from the Internet may have exactly the same form but be encoded with different Unicode values (ambiguous characters). It is important to identify ambiguous characters during the normalisation stage of most text processing tasks. We demonstrate cases related to ambiguous Kurdish and Farsi characters and propose a semi-automatic approach to identifying and unifying ambiguously encoded characters.

    A semi-automatic approach to identifying and unifying ambiguously encoded Arabic-based characters.

    In this study, we outline a potential problem in normalising texts that are based on a modified version of the Arabic alphabet. One of the main resources available for processing resource-scarce languages is raw text collected from the Internet. Many less-resourced languages, such as Kurdish, Farsi, Urdu, Pashtu, etc., use a modified version of the Arabic writing system. Many characters in data harvested from the Internet may have exactly the same form but be encoded with different Unicode values (ambiguous characters). The existence of ambiguous characters in words leads to word duplication; it is therefore important to identify and unify ambiguous characters during the normalisation stage. Here, we demonstrate cases related to ambiguous Kurdish and Farsi characters and propose a semi-automatic approach to identifying and unifying them.
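
    The word duplication described in this abstract can be illustrated with a minimal Python sketch. It is not the authors' method: the mapping below is a hypothetical two-entry example (Arabic Kaf folded into Keheh, Arabic Yeh folded into Farsi Yeh, pairs that can render alike in some positions), and the names `CANDIDATE_MAP`, `unify` and `find_duplicate_groups` are invented for illustration. The automatic step groups word forms that collapse to the same string; deciding whether a group should really be unified is left to a human reviewer, which is the assumed "semi-automatic" part.

```python
from collections import defaultdict

# Hypothetical mapping, for illustration only: the code point on the left is
# folded into the look-alike code point on the right.
CANDIDATE_MAP = {
    "\u0643": "\u06A9",  # ARABIC LETTER KAF -> ARABIC LETTER KEHEH
    "\u064A": "\u06CC",  # ARABIC LETTER YEH -> ARABIC LETTER FARSI YEH
}
_TABLE = str.maketrans(CANDIDATE_MAP)


def unify(text):
    """Replace every mapped code point in text with its chosen target."""
    return text.translate(_TABLE)


def find_duplicate_groups(words):
    """Group distinct word forms that become identical after unification.

    Returns a dict from the unified form to the sorted list of original
    forms, keeping only groups with more than one original form.
    """
    groups = defaultdict(set)
    for word in words:
        groups[unify(word)].add(word)
    return {key: sorted(forms) for key, forms in groups.items() if len(forms) > 1}


if __name__ == "__main__":
    # The first two words differ only in Kaf vs. Keheh.
    sample = ["\u0643\u0627\u0631", "\u06A9\u0627\u0631", "\u0646\u0627\u0646"]
    for unified, forms in find_duplicate_groups(sample).items():
        print([f.encode("unicode_escape").decode() for f in forms], "->",
              unified.encode("unicode_escape").decode())
```

    Each reported group is a candidate for review; a real mapping would have to be chosen per language, since a code point that is an error in one orthography may be correct in another.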

    Urdu Through Its Others: Ghazal, Canonization, and Translation.

    My dissertation, "Urdu Through Its Others: Ghazal, Canonization, and Translation," analyzes the codification of the Urdu literary tradition as it is both celebrated and reviled in a wide variety of popular and scholarly media. I focus specifically on the genre of the ghazal, which, as the most canonical of Urdu literary forms, holds a unique cultural cachet throughout all of South Asia and the diaspora. The canonization of the ghazal reifies Urdu's linguistic boundaries through the project of literary histories and comparison with other proximate literary traditions like Hindi, Persian, and English. This reified notion of Urdu not only underwrites Anglicist colonial intervention in India by rhetorically painting Urdu as the backward foil to English's modern progressivism, but also continues to shape the national Urdu imaginary in which the language is both vilified as dangerously communalist and idealized as redemptively secular. Although canonizing literary histories point to Rekhtah as the historical antecedent of the Urdu language, I show, via readings of the ghazals of Urdu's "founder" Valī Dakkanī (1667-1707), that Rekhtah in fact represents a unique poetic mode--an idiom of translation that forces us to reconsider boundaries between languages against the standardizing forces of canonization. The uneven ways in which the translative quality of Rekhtah gets passed on to the Urdu tradition as it unfolds during the period of colonialism have shaped the ways in which Urdu is seen in the national imaginary as derivative, backward, and foreign. At the same time, popular narratives about ghazal work to naturalize the Urdu tradition in India, particularly through the nationalization of canonical poets Mirzā Ghālib (1797-1869) and Faiz Ahmed Faiz (1911-1984).
    This dissertation diverges from existing attempts to establish canonical literary histories, or reconstruct a moment prior to translation, which ultimately reinforce colonial notions of both history and translation; instead, I focus on the traces of past texts and events as they continue to operate within the present--what I am calling historicity--ultimately arguing that moments of translation themselves constitute the Urdu language and literary tradition. PhD, Comparative Literature, University of Michigan, Horace H. Rackham School of Graduate Studies. http://deepblue.lib.umich.edu/bitstream/2027.42/135821/1/shakeem_1.pd

    Generating an Arabic Calligraphy Text Blocks for Global Texture Analysis

    The objective of this paper is to improve the current method for generating Arabic calligraphy text blocks. We test on seven types of Arabic calligraphy text. We apply projection profiles and a proposed filter to discriminate each line of the Arabic calligraphy scripts. After performing text detection, skew correction, and text and line normalization in sequence, we generate Arabic calligraphy text blocks for global texture analysis. We compare our proposed filter with the current method and with a median filter. The results show that the proposed filter outperforms both. The proposed method can be further improved to boost overall performance.
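
    As a rough illustration of the projection-profile step mentioned in this abstract, the Python sketch below separates text lines in a binarized image by summing ink pixels per row. It is a generic textbook version under stated assumptions, not the paper's method: the proposed filter is not described in the abstract and is not reproduced here, the image is assumed to be already binarized and deskewed, and `horizontal_profile`, `segment_lines` and `min_ink` are hypothetical names.

```python
def horizontal_profile(image):
    """Return the number of ink pixels in each row.

    image is a list of rows; each row is a list of 0 (background) or 1 (ink).
    """
    return [sum(row) for row in image]


def segment_lines(image, min_ink=1):
    """Split the image into text lines using the horizontal projection profile.

    A line is a maximal run of consecutive rows whose ink count is at least
    min_ink. Returns a list of (start_row, end_row) pairs, end exclusive.
    """
    lines = []
    start = None
    for y, ink in enumerate(horizontal_profile(image)):
        if ink >= min_ink and start is None:
            start = y                      # first row of a new line
        elif ink < min_ink and start is not None:
            lines.append((start, y))       # gap row closes the current line
            start = None
    if start is not None:                  # line runs to the bottom edge
        lines.append((start, len(image)))
    return lines


if __name__ == "__main__":
    page = [
        [0, 0, 0],
        [1, 1, 0],
        [0, 1, 0],
        [0, 0, 0],
        [1, 0, 1],
        [0, 0, 0],
    ]
    print(segment_lines(page))
```

    Raising `min_ink` treats sparsely inked rows as gaps, which is one simple way to cope with stray marks between lines; whether that suits calligraphic scripts with long ascenders and descenders would need to be checked on real data.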

    Improving Search via Named Entity Recognition in Morphologically Rich Languages – A Case Study in Urdu

    University of Minnesota Ph.D. dissertation. February 2018. Major: Computer Science. Advisors: Vipin Kumar, Blake Howald. 1 computer file (PDF); xi, 236 pages. Search is not a solved problem, even in the world of Google's and Bing's state-of-the-art engines. Google and similar search engines are keyword based. Keyword-based searching suffers from the vocabulary mismatch problem: the terms in a document and in the user's information request do not overlap, as with "cars" and "automobiles." This phenomenon is called synonymy. Similarly, the user's term may be polysemous: a user is inquiring about a river's bank, but documents about financial institutions are matched. Vocabulary mismatch is exacerbated when the search occurs in a Morphologically Rich Language (MRL). Concept search techniques like dimensionality reduction do not improve search in Morphologically Rich Languages. Names occur frequently in news text and determine the "what," "where," "when," and "who" of the news text. Named Entity Recognition (NER) attempts to recognize names automatically in text, but these techniques are far from mature in MRLs, especially in Arabic-script languages. Urdu is one of the focus MRLs of this dissertation, along with Arabic, Farsi, Hindi, and Russian, but it does not have the enabling technologies for NER and search. A corpus, a stop word generation algorithm, a light stemmer, a baseline, and an NER algorithm are created so that NER-aware search can be accomplished for Urdu. This dissertation demonstrates that NER-aware search on Arabic, Russian, Urdu, and English shows significant improvement over the baseline. Furthermore, this dissertation highlights the challenges of research in low-resource MRLs.