20 research outputs found

    Constructing a Large-Scale English-Persian Parallel Corpus

    In recent years, the exploitation of large text corpora in solving various kinds of linguistic problems, including those of translation, has become commonplace. A large-scale English-Persian corpus is nevertheless still unavailable, because of certain difficulties and the amount of work required to overcome them. The project reported here is an attempt to build an English-Persian parallel corpus composed of digital texts and Web documents containing little or no noise. The Internet is useful because translations of existing texts are often published on the Web. The task is to find parallel pages in English and Persian, to judge their translation quality, and to download and align them. The corpus so created is open; that is, more material can be added as the need arises. One of the main activities associated with building such a corpus is developing software for parallel concordancing, in which a user can enter a search string in one language and see all the citations for that string together with the corresponding sentences in the target language. Our intention is to build general translation memory software using the present English-Persian parallel corpus. Such an aligned bilingual corpus also proves useful in many other areas, including machine translation, word-sense disambiguation, cross-language information retrieval, lexicography, and language learning.
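
    The abstract describes parallel concordancing only in outline; the sketch below is a minimal illustration under the assumption that the corpus is already sentence-aligned as (English, Persian) pairs. The class name ParallelConcordancer and the substring-matching search are hypothetical choices, not the authors' implementation.

    # Minimal parallel-concordancer sketch (illustrative only, not the authors' tool).
    # Assumes the corpus is already sentence-aligned as (English, Persian) pairs.
    from typing import List, Tuple

    AlignedPair = Tuple[str, str]  # (english_sentence, persian_sentence)

    class ParallelConcordancer:
        def __init__(self, aligned_pairs: List[AlignedPair]):
            self.pairs = aligned_pairs

        def search(self, query: str) -> List[AlignedPair]:
            """Return every aligned pair whose English side contains the query string."""
            q = query.lower()
            return [(en, fa) for en, fa in self.pairs if q in en.lower()]

    # Example usage with a toy two-pair corpus.
    corpus = [
        ("The agreement was signed yesterday.", "قرارداد دیروز امضا شد."),
        ("The weather is warm today.", "هوا امروز گرم است."),
    ]
    for en, fa in ParallelConcordancer(corpus).search("agreement"):
        print(en, "|", fa)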

    UNSUPERVISED PART OF SPEECH TAGGING FOR PERSIAN

    Abstract: In this paper we present a novel unsupervised method for part-of-speech (POS) disambiguation which has been applied to Persian. The method, known as the Iterative Improved Feedback (IIF) model, is a heuristic one that uses only a raw corpus of Persian, together with all possible tags for every word in that corpus, as its input. During tagging, the algorithm passes through several iterations corresponding to n-gram levels of analysis and disambiguates each word against a previously defined threshold. The overall accuracy of the program on Persian texts has been calculated as 93 percent, which is very encouraging for POS tagging in this language.
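
    The IIF procedure itself is only outlined in the abstract, so the sketch below is a loose, hypothetical reading of it: fix unambiguous words first, gather n-gram tag statistics from what is already disambiguated, assign a candidate tag only when its share of the evidence exceeds a threshold, and widen the n-gram context on each pass. The scoring rule, the threshold value, and all names are assumptions, not details from the paper.

    # Hypothetical sketch of an iterative, threshold-based tag disambiguator.
    # This is a reading of the IIF idea, not the published algorithm; the scoring
    # rule and the 0.7 threshold are assumed for illustration.
    from collections import Counter

    def iterative_disambiguate(sentences, lexicon, max_n=3, threshold=0.7):
        """sentences: list of word lists; lexicon: word -> set of candidate tags."""
        # Words with a single candidate tag are fixed; the rest start undecided.
        tagged = [[(w, next(iter(lexicon[w])) if len(lexicon[w]) == 1 else None)
                   for w in sent] for sent in sentences]

        for n in range(2, max_n + 1):              # widen the context each iteration
            # Count tag n-grams over stretches that are already fully disambiguated.
            ngrams = Counter()
            for sent in tagged:
                for i in range(len(sent) - n + 1):
                    window = [t for _, t in sent[i:i + n]]
                    if None not in window:
                        ngrams[tuple(window)] += 1

            # Resolve remaining words whose left context of n-1 tags is known.
            for sent in tagged:
                for i, (w, t) in enumerate(sent):
                    if t is not None or i < n - 1:
                        continue
                    left = tuple(tag for _, tag in sent[i - n + 1:i])
                    if None in left:
                        continue
                    scores = {c: ngrams[left + (c,)] for c in lexicon[w]}
                    total = sum(scores.values())
                    if total and max(scores.values()) / total >= threshold:
                        sent[i] = (w, max(scores, key=scores.get))
        return tagged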

    FarsiTag: A Part-of-Speech Tagging System for Persian


    Using Bilingual Parallel Corpora in Translation Memory Systems

    Automatic word-alignment techniques commonly used in translation memory systems tend to work at the single-word level, where there is a one-to-one correspondence between words in subsequences of the two languages. As a result, they cannot fully exploit sub-sentential repetitions such as clauses, phrases and expressions. In this paper a search method named "space-based reduction search", which exploits the spaces between words, is introduced. The main goal is to maximize the use of parallel corpus resources. We want to show that this search method can significantly improve the chance of finding matches for subsequences of input sentences, making it applicable in a Sub-Sentential Translation Memory (SSTM) system without running automatic alignment tools.
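
    The abstract does not spell out how the reduction proceeds. One plausible reading, sketched below, is to shrink the query at word (space) boundaries, longest spans first, and look each span up on the English side of a sentence-aligned corpus; the function name and the longest-match-first ordering are assumptions, not the paper's definition.

    # Illustrative reading of a "space-based reduction search": reduce the query at
    # space boundaries, longest word spans first, and return corpus sentences whose
    # English side contains the span, together with their aligned Persian sides.

    def space_based_reduction_search(sentence, aligned_pairs):
        """aligned_pairs: list of (english_sentence, persian_sentence) tuples."""
        words = sentence.split()                    # the reduction happens at spaces
        matches = []
        for length in range(len(words), 0, -1):     # longest subsequences first
            for start in range(len(words) - length + 1):
                span = " ".join(words[start:start + length])
                hits = [(en, fa) for en, fa in aligned_pairs if span in en]
                if hits:
                    matches.append((span, hits))
            if matches:                             # keep only the longest matching level
                break
        return matches

    # Example with a toy corpus: the longest matching span is "agreement was signed".
    corpus = [("The agreement was signed yesterday.", "قرارداد دیروز امضا شد.")]
    print(space_based_reduction_search("agreement was signed in Tehran", corpus))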

    Designing and Implementing a Cross-Language Information Retrieval System Using Linguistic Corpora

    Information retrieval (IR) is a crucial area of natural language processing (NLP) and can be defined as finding documents whose content is relevant to a user's query. Cross-language information retrieval (CLIR) refers to a kind of information retrieval in which the language of the query and that of the searched documents are different; that is, the user presents queries in one language to retrieve documents in another language. This paper set out to construct a bilingual lexicon of parallel English and Persian chunks from two very large monolingual corpora and an English-Persian parallel corpus, a lexicon that can be applied directly to cross-language information retrieval tasks. For this purpose, a statistical measure known as the Association Score (AS) was used to compute the association value between every pair of corresponding chunks in the corpus. Once the CLIR system had been developed with this bilingual lexicon, an experiment was performed on a set of one hundred English and Persian phrases and collocations to see to what extent the system is effective in helping users find the most relevant and suitable equivalents of their queries in either language.
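
    The abstract names the Association Score but does not give its formula. Purely as an illustration, the sketch below scores an English/Persian chunk pair with a Dice-style coefficient over how often the two chunks co-occur in aligned sentence pairs; the Dice choice and all names here are assumptions rather than the measure used in the paper.

    # Hypothetical association scorer for chunk pairs over a sentence-aligned corpus.
    # The actual Association Score formula is not given in the abstract; a Dice-style
    # coefficient over co-occurrence counts is used here purely for illustration.

    def association_score(en_chunk, fa_chunk, aligned_pairs):
        """aligned_pairs: list of (english_sentence, persian_sentence) tuples."""
        en_count = sum(1 for en, _ in aligned_pairs if en_chunk in en)
        fa_count = sum(1 for _, fa in aligned_pairs if fa_chunk in fa)
        both = sum(1 for en, fa in aligned_pairs if en_chunk in en and fa_chunk in fa)
        if en_count + fa_count == 0:
            return 0.0
        return 2.0 * both / (en_count + fa_count)   # Dice coefficient in [0, 1]

    # Chunk pairs that frequently co-occur in aligned sentences score close to 1,
    # suggesting the Persian chunk is a good translation candidate for the English one.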

    Lexical Disambiguation of Polysemous Adjectives in MT: A Corpus-Based Study

    Ambiguous nouns and adjectives are words with more than one meaning. Having several meanings creates problems in MT, particularly in fully automatic programs where humans play no role in choosing the equivalent target word and no context is available to the machine. One way to solve the problem is to provide the machine with proper and sufficient context; here by context we mean the collocations of a word. To show this, we chose English adjectives for our study. We extracted polysemous adjectives from a bilingual English-Persian dictionary (Hezareh) and checked them all in a bilingual parallel English-Persian corpus to obtain the context and collocations of each adjective. We then recorded all the adjectives with their associated words, contexts and the meanings provided by the corpus. A concordance was created from these adjectives and their equivalents along with their collocations, and a disambiguation program was written for it so that it could choose the best equivalent for a target adjective according to the highest frequency of the meaning of that adjective or of its collocation. The program's reliability was tested with five translators: their choices were compared using a statistical method (Cronbach's alpha) and their covariance was calculated with SPSS. The results show that in more than 50% of the cases the translators' choices were the same as the program's. The results of this study could be helpful in MT, bilingual information retrieval, wordnets and even in teaching Persian to foreigners.
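
    The selection rule is described only in outline; the sketch below illustrates one frequency-based reading of it, looking the adjective up together with a neighbouring collocate and falling back to the adjective's most frequent equivalent overall. The table layout, the names and the toy counts are invented for illustration, not data from the study.

    # Illustrative frequency-based equivalent chooser: pick the Persian equivalent
    # with the highest corpus frequency for an English adjective, preferring counts
    # conditioned on its collocate when such counts exist. Toy numbers, assumed layout.

    # counts[adjective][collocate][persian_equivalent] = corpus frequency
    counts = {
        "hard": {
            "work":     {"سخت": 12, "دشوار": 3},
            "evidence": {"محکم": 7},
            None:       {"سخت": 20, "دشوار": 9, "محکم": 7},  # totals ignoring collocates
        },
    }

    def choose_equivalent(adjective, collocate=None):
        table = counts.get(adjective, {})
        by_collocate = table.get(collocate) or table.get(None, {})
        if not by_collocate:
            return None
        return max(by_collocate, key=by_collocate.get)      # highest-frequency equivalent

    print(choose_equivalent("hard", "evidence"))  # -> محکم
    print(choose_equivalent("hard"))              # -> سخت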