
    Lexical Borrowing (Taʿrib) in Arabic Computing Terminology: Issues and Strategies

    Computing technology is evolving rapidly, which calls for immediate terminology creation in Arabic to keep pace with that evolution. Technical loanwords form a large part of modern Arabic terminology and are spreading rapidly within the language. This research investigates the extent to which the Arabic neologization mechanism of taʿrīb (lexical borrowing) is used in computing terminology creation in comparison with the mechanisms of ishtiqāq (derivation), majāz (semantic extension) and tarkīb (compounding). In addition, it assesses the impact and importance of taʿrīb as a computing terminology creation mechanism in Arabic. The research is based on a corpus of specialised dictionaries and specialised literature. The aforementioned mechanisms are used to varying degrees in the creation of Arabic computing terminology, and are used interchangeably to produce equivalents of single foreign terms, which has caused confusion in the use of the language. The extent of the use of taʿrīb in computing terminology creation, and its impact on and importance to Arabic as a terminology creation mechanism, are determined on the basis of two criteria. First, a comparison of the extent of use of the aforementioned mechanisms is presented, based on three selected corpora of dictionaries and magazines of Arabic technical computing terminology. Second, an assessment of the lexicographical treatment of the computing terms coined by these mechanisms is offered, with special consideration of the terms coined by taʿrīb as the main mechanism under discussion. The findings show that taʿrīb is by far the most used Arabic word-formation mechanism in computing terminology creation, followed by tarkīb, ishtiqāq and majāz. It is also concluded that taʿrīb clearly has a major impact on, and is of great importance to, Arabic computing terminology creation.

    Graphonological Levenshtein Edit Distance: Application for Automated Cognate Identification

    This paper presents a methodology for calculating a modified Levenshtein edit distance between character strings, and applies it to the task of automated cognate identification from non-parallel (comparable) corpora. This task is an important stage in developing MT systems and bilingual dictionaries beyond the coverage of traditionally used aligned parallel corpora, and it can be used to find translation equivalents for the ‘long tail’ of the Zipfian distribution: low-frequency and usually unambiguous lexical items in closely related languages (many of them often under-resourced). Graphonological Levenshtein edit distance relies on editing hierarchical representations of phonological features for graphemes (graphonological representations) and improves on the phonological edit distance proposed for measuring dialectological variation. Graphonological edit distance works directly with character strings and does not require an intermediate stage of phonological transcription, exploiting the advantages of the historical and morphological principles of orthography, which are obscured if only the phonetic principle is applied. Difficulties associated with plain feature representations (unstructured feature sets or vectors) are addressed by using a linguistically motivated feature hierarchy that restricts the matching of lower-level graphonological features when higher-level features are not matched. The paper presents an evaluation of the graphonological edit distance in comparison with the traditional Levenshtein edit distance from the perspective of its usefulness for the task of automated cognate identification. It discusses the advantages of the proposed method, which can also be used for morphology induction, for robust transliteration across different alphabets (Latin, Cyrillic, Arabic, etc.) and for robust identification of words with non-standard or distorted spelling, e.g., in user-generated content on the web such as posts on social media, blogs and comments. Software for calculating the modified feature-based Levenshtein distance, and the corresponding graphonological feature representations (vectors and the hierarchies of graphemes’ features), are released on the author’s webpage: http://corpus.leeds.ac.uk/bogdan/phonologylevenshtein/. Features are currently available for the Latin and Cyrillic alphabets and will be extended to other alphabets and languages.
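
    The idea lends itself to a compact illustration. The following is a minimal sketch of a feature-based Levenshtein distance, not the released implementation linked above: the substitution cost between two graphemes shrinks as they share more levels of a feature hierarchy, and comparison stops at the first unmatched level. The toy hierarchy (class > place > voicing), the feature table and the example graphemes are assumptions made for illustration only.

    # Illustrative feature-based Levenshtein distance (Python sketch).
    # FEATURES maps a grapheme to (class, place, voicing); real graphonological
    # representations are richer and language-specific.
    FEATURES = {
        "b": ("consonant", "labial", "voiced"),
        "p": ("consonant", "labial", "voiceless"),
        "d": ("consonant", "dental", "voiced"),
        "a": ("vowel", "open", "voiced"),
        "o": ("vowel", "back", "voiced"),
    }

    def substitution_cost(c1, c2):
        """Graded cost in [0, 1]; comparison stops at the first unmatched level."""
        if c1 == c2:
            return 0.0
        f1, f2 = FEATURES.get(c1), FEATURES.get(c2)
        if f1 is None or f2 is None:
            return 1.0                          # unknown grapheme: full cost
        shared = 0
        for a, b in zip(f1, f2):                # hierarchical: top level first
            if a != b:
                break                           # lower levels no longer comparable
            shared += 1
        return 1.0 - shared / len(f1)

    def graphonological_distance(s, t):
        """Standard Levenshtein dynamic programme with graded substitution costs."""
        m, n = len(s), len(t)
        d = [[0.0] * (n + 1) for _ in range(m + 1)]
        for i in range(1, m + 1):
            d[i][0] = float(i)
        for j in range(1, n + 1):
            d[0][j] = float(j)
        for i in range(1, m + 1):
            for j in range(1, n + 1):
                d[i][j] = min(
                    d[i - 1][j] + 1.0,          # deletion
                    d[i][j - 1] + 1.0,          # insertion
                    d[i - 1][j - 1] + substitution_cost(s[i - 1], t[j - 1]),
                )
        return d[m][n]

    # "bad" vs "pad" differ only in voicing, the lowest level in this toy
    # hierarchy, so the distance (about 0.33) is much smaller than the 1.0
    # that plain Levenshtein would report.
    print(graphonological_distance("bad", "pad"))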

    In search of knowledge: text mining dedicated to technical translation

    Article published on CD and sold directly by ASLIB (http://shop.emeraldinsight.com/product_info.htm/cPath/56_59/products_id/431). Conference programme at http://aslib.co.uk/conferences/tc_2011/programme.htm

    Data Cleaning for XML Electronic Dictionaries via Statistical Anomaly Detection

    Many important forms of data are stored digitally in XML format. Errors can occur in the textual content of the fields of the XML. Fixing these errors manually is time-consuming and expensive, especially for large amounts of data, so there is increasing interest in the research, development, and use of automated techniques for assisting with data cleaning. Electronic dictionaries are an important form of data frequently stored in XML format, and they often contain errors introduced through a mixture of manual typographical entry errors and optical character recognition errors. In this paper we describe methods for flagging statistical anomalies as likely errors in electronic dictionaries stored in XML format. We describe six systems based on different sources of information. The systems detect errors using various signals in the data, including uncommon characters, text length, character-based language models, word-based language models, tied-field length ratios, and tied-field transliteration models. Four of the systems detect errors based on expectations automatically inferred from content within elements of a single field type; we call these single-field systems. Two of the systems detect errors based on correspondence expectations automatically inferred from content within elements of multiple related field types; we call these tied-field systems. For each system, we provide an intuitive analysis of the type of error that it is successful at detecting. Finally, we describe two larger-scale evaluations, one using crowdsourcing with Amazon's Mechanical Turk platform and one using the annotations of a domain expert. The evaluations consistently show that the systems are useful for improving the efficiency with which errors in XML electronic dictionaries can be detected. Comment: 8 pages, 4 figures, 5 tables; published in Proceedings of the 2016 IEEE Tenth International Conference on Semantic Computing (ICSC), Laguna Hills, CA, USA, pages 79-86, February 2016
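
    As a rough illustration of one of the single-field signals named above, the sketch below trains an add-one-smoothed character bigram model over the entries of one field and ranks the entries by average per-character surprisal, so the most improbable entries surface first for review. It is a simplification, not the paper's system; the sample headwords and the ranking-for-review setup are assumptions.

    # Illustrative character-bigram anomaly scorer for one dictionary field.
    import math
    from collections import Counter

    def train_bigram_model(entries):
        """Count character bigrams over all entries, with start/end markers."""
        bigrams, unigrams = Counter(), Counter()
        for text in entries:
            chars = ["^"] + list(text) + ["$"]
            for a, b in zip(chars, chars[1:]):
                bigrams[(a, b)] += 1
                unigrams[a] += 1
        vocab = {c for text in entries for c in text} | {"^", "$"}
        return bigrams, unigrams, len(vocab)

    def avg_surprisal(text, bigrams, unigrams, vsize):
        """Average negative log probability per character, with add-one smoothing."""
        chars = ["^"] + list(text) + ["$"]
        total = 0.0
        for a, b in zip(chars, chars[1:]):
            p = (bigrams[(a, b)] + 1) / (unigrams[a] + vsize)
            total += -math.log(p)
        return total / (len(chars) - 1)

    # Hypothetical headword field; "c@t" carries an OCR-like error.
    headwords = ["cat", "car", "care", "cart", "carton", "cast", "c@t"]
    bigrams, unigrams, vsize = train_bigram_model(headwords)
    ranked = sorted(headwords,
                    key=lambda e: avg_surprisal(e, bigrams, unigrams, vsize),
                    reverse=True)
    print(ranked)   # highest-surprisal entries first; a reviewer inspects the top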

    Building basic vocabulary across 40 languages

    The paper explores the options for building bilingual dictionaries by automated methods. We define the notion ‘basic vocabulary’ and investigate how well the conceptual units that make up this language-independent vocabulary are covered by language-specific bindings in 40 languages.
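
    A schematic of the coverage question being asked, under assumed data structures (a set of concept identifiers and, per language, the subset that has a binding); the names, languages and numbers below are illustrative and do not come from the paper.

    # Schematic coverage computation: what share of a language-independent
    # basic vocabulary has a language-specific binding in each language?
    basic_vocabulary = {"WATER", "FIRE", "HAND", "EAT", "BIG"}
    bindings = {
        "en": {"WATER", "FIRE", "HAND", "EAT", "BIG"},
        "hu": {"WATER", "FIRE", "HAND", "EAT"},
        "sw": {"WATER", "HAND", "BIG"},
    }

    for lang, covered in sorted(bindings.items()):
        coverage = len(covered & basic_vocabulary) / len(basic_vocabulary)
        print(f"{lang}: {coverage:.0%} of the basic vocabulary has a binding")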

    Translation technologies. Scope, tools and resources

    Translation technologies constitute an important new field of interdisciplinary study lying midway between computer science and translation. Its development in the professional world will largely depend on its academic progress and on the effective introduction of translation technologies into the translator training curriculum. In this paper different approaches to the subject are examined in order to provide a basis on which to conduct an internal analysis of the field of Translation technologies and to structure its content. Following criteria based on professional practice and on the idiosyncrasy of the computer tools and resources that play a part in translation activity, we present our definition of Translation technologies and a classification of the field into five blocks.