
    Lexical Borrowing (Taʿrib) in Arabic Computing Terminology: Issues and Strategies

    Computing technology is evolving rapidly, which calls for immediate terminology creation in Arabic to keep pace with that evolution. Technical loanwords form a large part of modern Arabic terminology and are spreading rapidly within the language. This research investigates the extent to which the Arabic neologization mechanism of taʿrīb (lexical borrowing) is used in computing terminology creation in comparison with the mechanisms of ishtiqāq (derivation), majāz (semantic extension) and tarkīb (compounding). In addition, it assesses the impact and importance of taʿrīb as a computing terminology creation mechanism in Arabic. The research is based on a corpus of specialised dictionaries and specialised literature. The aforementioned mechanisms are used to varying degrees in the creation of Arabic computing terminology, and are used interchangeably to produce equivalents of single foreign terms, which has caused confusion in the use of the language. The extent of the use of taʿrīb in computing terminology creation, and its impact on and importance to Arabic as a terminology creation mechanism, are determined on the basis of two criteria. First, a comparison of the extent of use of the aforementioned mechanisms is presented, based on three selected corpora of dictionaries and magazines of Arabic technical computing terminology. Second, an assessment of the lexicographical treatment of the computing terms coined by these mechanisms is offered, with special consideration of the terms coined by taʿrīb as the main mechanism under discussion. The findings show that taʿrīb is by far the most used Arabic word-formation mechanism in computing terminology creation, followed by tarkīb, ishtiqāq and majāz. It is also concluded that taʿrīb clearly has a major impact on, and is of great importance to, Arabic computing terminology creation.

    Graphonological Levenshtein Edit Distance: Application for Automated Cognate Identification

    This paper presents a methodology for calculating a modified Levenshtein edit distance between character strings, and applies it to the task of automated cognate identification from non-parallel (comparable) corpora. This task is an important stage in developing MT systems and bilingual dictionaries beyond the coverage of traditionally used aligned parallel corpora, and it can be used to find translation equivalents for the ‘long tail’ of the Zipfian distribution: low-frequency and usually unambiguous lexical items in closely related languages (many of them often under-resourced). Graphonological Levenshtein edit distance relies on editing hierarchical representations of phonological features for graphemes (graphonological representations) and improves on the phonological edit distance proposed for measuring dialectological variation. Graphonological edit distance works directly with character strings and does not require an intermediate stage of phonological transcription, exploiting the advantages of the historical and morphological principles of orthography, which are obscured if only the phonetic principle is applied. Difficulties associated with plain feature representations (unstructured feature sets or vectors) are addressed by using a linguistically motivated feature hierarchy that restricts the matching of lower-level graphonological features when higher-level features are not matched. The paper presents an evaluation of the graphonological edit distance in comparison with the traditional Levenshtein edit distance from the perspective of its usefulness for the task of automated cognate identification. It discusses the advantages of the proposed method, which can also be used for morphology induction, for robust transliteration across different alphabets (Latin, Cyrillic, Arabic, etc.) and for robust identification of words with non-standard or distorted spelling, e.g., in user-generated content on the web such as posts on social media, blogs and comments. Software for calculating the modified feature-based Levenshtein distance, and the corresponding graphonological feature representations (vectors and the hierarchies of graphemes’ features), are released on the author’s webpage: http://corpus.leeds.ac.uk/bogdan/phonologylevenshtein/. Features are currently available for the Latin and Cyrillic alphabets and will be extended to other alphabets and languages.
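
    The idea lends itself to a compact illustration. The following is a minimal sketch of a feature-based Levenshtein distance, not the released implementation linked above: the substitution cost between two graphemes shrinks as they share more levels of a feature hierarchy, and comparison stops at the first unmatched level. The toy hierarchy (class > place > voicing), the feature table and the example graphemes are assumptions made for illustration only.

    # Illustrative feature-based Levenshtein distance (Python sketch).
    # FEATURES maps a grapheme to (class, place, voicing); real graphonological
    # representations are richer and language-specific.
    FEATURES = {
        "b": ("consonant", "labial", "voiced"),
        "p": ("consonant", "labial", "voiceless"),
        "d": ("consonant", "dental", "voiced"),
        "a": ("vowel", "open", "voiced"),
        "o": ("vowel", "back", "voiced"),
    }

    def substitution_cost(c1, c2):
        """Graded cost in [0, 1]; comparison stops at the first unmatched level."""
        if c1 == c2:
            return 0.0
        f1, f2 = FEATURES.get(c1), FEATURES.get(c2)
        if f1 is None or f2 is None:
            return 1.0                          # unknown grapheme: full cost
        shared = 0
        for a, b in zip(f1, f2):                # hierarchical: top level first
            if a != b:
                break                           # lower levels no longer comparable
            shared += 1
        return 1.0 - shared / len(f1)

    def graphonological_distance(s, t):
        """Standard Levenshtein dynamic programme with graded substitution costs."""
        m, n = len(s), len(t)
        d = [[0.0] * (n + 1) for _ in range(m + 1)]
        for i in range(1, m + 1):
            d[i][0] = float(i)
        for j in range(1, n + 1):
            d[0][j] = float(j)
        for i in range(1, m + 1):
            for j in range(1, n + 1):
                d[i][j] = min(
                    d[i - 1][j] + 1.0,          # deletion
                    d[i][j - 1] + 1.0,          # insertion
                    d[i - 1][j - 1] + substitution_cost(s[i - 1], t[j - 1]),
                )
        return d[m][n]

    # "bad" vs "pad" differ only in voicing, the lowest level in this toy
    # hierarchy, so the distance (about 0.33) is much smaller than the 1.0
    # that plain Levenshtein would report.
    print(graphonological_distance("bad", "pad"))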

    In search of knowledge: text mining dedicated to technical translation

    Article published on CD and sold directly by ASLIB (http://shop.emeraldinsight.com/product_info.htm/cPath/56_59/products_id/431). Conference programme at http://aslib.co.uk/conferences/tc_2011/programme.htm

    Data Cleaning for XML Electronic Dictionaries via Statistical Anomaly Detection

    Many important forms of data are stored digitally in XML format. Errors can occur in the textual content of the fields of the XML. Fixing these errors manually is time-consuming and expensive, especially for large amounts of data, so there is increasing interest in the research, development, and use of automated techniques for assisting with data cleaning. Electronic dictionaries are an important form of data frequently stored in XML format, and they often contain errors introduced through a mixture of manual typographical entry errors and optical character recognition errors. In this paper we describe methods for flagging statistical anomalies as likely errors in electronic dictionaries stored in XML format. We describe six systems based on different sources of information. The systems detect errors using various signals in the data, including uncommon characters, text length, character-based language models, word-based language models, tied-field length ratios, and tied-field transliteration models. Four of the systems detect errors based on expectations automatically inferred from content within elements of a single field type; we call these single-field systems. Two of the systems detect errors based on correspondence expectations automatically inferred from content within elements of multiple related field types; we call these tied-field systems. For each system, we provide an intuitive analysis of the type of error that it is successful at detecting. Finally, we describe two larger-scale evaluations, one using crowdsourcing with Amazon's Mechanical Turk platform and one using the annotations of a domain expert. The evaluations consistently show that the systems are useful for improving the efficiency with which errors in XML electronic dictionaries can be detected. Comment: 8 pages, 4 figures, 5 tables; published in Proceedings of the 2016 IEEE Tenth International Conference on Semantic Computing (ICSC), Laguna Hills, CA, USA, pages 79-86, February 2016
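
    As a rough illustration of one of the single-field signals named above, the sketch below trains an add-one-smoothed character bigram model over the entries of one field and ranks the entries by average per-character surprisal, so the most improbable entries surface first for review. It is a simplification, not the paper's system; the sample headwords and the ranking-for-review setup are assumptions.

    # Illustrative character-bigram anomaly scorer for one dictionary field.
    import math
    from collections import Counter

    def train_bigram_model(entries):
        """Count character bigrams over all entries, with start/end markers."""
        bigrams, unigrams = Counter(), Counter()
        for text in entries:
            chars = ["^"] + list(text) + ["$"]
            for a, b in zip(chars, chars[1:]):
                bigrams[(a, b)] += 1
                unigrams[a] += 1
        vocab = {c for text in entries for c in text} | {"^", "$"}
        return bigrams, unigrams, len(vocab)

    def avg_surprisal(text, bigrams, unigrams, vsize):
        """Average negative log probability per character, with add-one smoothing."""
        chars = ["^"] + list(text) + ["$"]
        total = 0.0
        for a, b in zip(chars, chars[1:]):
            p = (bigrams[(a, b)] + 1) / (unigrams[a] + vsize)
            total += -math.log(p)
        return total / (len(chars) - 1)

    # Hypothetical headword field; "c@t" carries an OCR-like error.
    headwords = ["cat", "car", "care", "cart", "carton", "cast", "c@t"]
    bigrams, unigrams, vsize = train_bigram_model(headwords)
    ranked = sorted(headwords,
                    key=lambda e: avg_surprisal(e, bigrams, unigrams, vsize),
                    reverse=True)
    print(ranked)   # highest-surprisal entries first; a reviewer inspects the top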

    Building basic vocabulary across 40 languages

    The paper explores the options for building bilingual dictionaries by automated methods. We define the notion ‘basic vocabulary’ and investigate how well the conceptual units that make up this language-independent vocabulary are covered by language-specific bindings in 40 languages.
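
    A schematic of the coverage question being asked, under assumed data structures (a set of concept identifiers and, per language, the subset that has a binding); the names, languages and numbers below are illustrative and do not come from the paper.

    # Schematic coverage computation: what share of a language-independent
    # basic vocabulary has a language-specific binding in each language?
    basic_vocabulary = {"WATER", "FIRE", "HAND", "EAT", "BIG"}
    bindings = {
        "en": {"WATER", "FIRE", "HAND", "EAT", "BIG"},
        "hu": {"WATER", "FIRE", "HAND", "EAT"},
        "sw": {"WATER", "HAND", "BIG"},
    }

    for lang, covered in sorted(bindings.items()):
        coverage = len(covered & basic_vocabulary) / len(basic_vocabulary)
        print(f"{lang}: {coverage:.0%} of the basic vocabulary has a binding")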

    Translation technologies. Scope, tools and resources

    Translation technologies constitute an important new field of interdisciplinary study lying midway between computer science and translation. Its development in the professional world will largely depend on its academic progress and on the effective introduction of translation technologies into the translator training curriculum. In this paper different approaches to the subject are examined in order to provide a basis on which to conduct an internal analysis of the field of Translation technologies and to structure its content. Following criteria based on professional practice and on the idiosyncrasy of the computer tools and resources that play a part in translation activity, we present our definition of Translation technologies and a classification of the field into five blocks.