Search CORE

4 research outputs found

Constructing a Large-Scale English-Persian Parallel Corpus

Author: Mosavi Miangah Tayebeh
Publication venue: Érudit
Publication date: 01/01/2009
Field of study

In recent years the exploitation of large text corpora in solving various kinds of linguistic problems, including those of translation, is commonplace. Yet a large-scale English-Persian corpus is still unavailable, because of certain difficulties and the amount of work required to overcome them.The project reported here is an attempt to constitute an English-Persian parallel corpus composed of digital texts and Web documents containing little or no noise. The Internet is useful because translations of existing texts are often published on the Web. The task is to find parallel pages in English and Persian, to judge their translation quality, and to download and align them. The corpus so created is of course open; that is, more material can be added as the need arises.One of the main activities associated with building such a corpus is to develop software for parallel concordancing, in which a user can enter a search string in one language and see all the citations for that string in it and corresponding sentences in the target language. Our intention is to construct general translation memory software using the present English-Persian parallel corpus.Au cours des dernières années, l’exploitation de grands corpus de textes pour résoudre des problèmes linguistiques, notamment des problèmes de traduction, est devenue une pratique courante. Jusqu’à récemment, aucun corpus bilingue anglais-persan à grande échelle n’avait été constitué, en raison des difficultés qu’implique une telle entreprise.Cet article présente un projet réalisé en vue de colliger des corpus de textes numériques variés, tels que des documents du réseau Internet, avec le moins de bruit possible. L’utilisation d’Internet peut être considérée comme une aide précieuse car, souvent, il existe des traductions antérieures qui sont déjà publiées sur le Web. La tâche consiste à trouver les pages parallèles en anglais et en persan, à évaluer la qualité de leur traduction, à les télécharger et à les aligner. Le corpus ainsi obtenu est un corpus ouvert, soit un corpus auquel de nouvelles données peuvent être ajoutées, selon les besoins.Une des principales conséquences de l’élaboration d’un tel corpus est la mise au point d’un logiciel de concordance parallèle, dans lequel l’utilisateur pourrait introduire une chaîne de caractères dans une langue et afficher toutes les citations concernant cette chaîne dans la langue recherchée ainsi que des phrases correspondantes dans la langue cible. L’étape suivante serait d’utiliser ce corpus parallèle pour construire un logiciel de traduction générale.Le corpus bilingue aligné se trouve être utile dans beaucoup d’autres cas, entre autres pour la traduction par ordinateur, pour lever les ambiguïtés de sens, pour le rétablissement des données interlangues, en lexicographie ainsi que pour l’apprentissage des langues

Crossref

Érudit

Japanese/English Cross-Language Information Retrieval: Exploration of Query Translation and Transliteration

Author: Fujii Atsushi
Ishikawa Tetsuya
Publication venue
Publication date: 01/01/2001
Field of study

Cross-language information retrieval (CLIR), where queries and documents are in different languages, has of late become one of the major topics within the information retrieval community. This paper proposes a Japanese/English CLIR system, where we combine a query translation and retrieval modules. We currently target the retrieval of technical documents, and therefore the performance of our system is highly dependent on the quality of the translation of technical terms. However, the technical term translation is still problematic in that technical terms are often compound words, and thus new terms are progressively created by combining existing base words. In addition, Japanese often represents loanwords based on its special phonogram. Consequently, existing dictionaries find it difficult to achieve sufficient coverage. To counter the first problem, we produce a Japanese/English dictionary for base words, and translate compound words on a word-by-word basis. We also use a probabilistic method to resolve translation ambiguity. For the second problem, we use a transliteration method, which corresponds words unlisted in the base word dictionary to their phonetic equivalents in the target language. We evaluate our system using a test collection for CLIR, and show that both the compound word translation and transliteration methods improve the system performance

arXiv.org e-Print Archive

CiteSeerX

Knowledge-directed intelligent information retrieval for research funding.

Author: Hansraj Sanjith.
Publication venue
Publication date: 01/01/2001
Field of study

Thesis (M.Sc.)- University of Natal, Pietermaritzburg, 2001.Researchers have always found difficulty in attaining funding from the National Research Foundation (NRF) for new research interests. The field of Artificial Intelligence (AI) holds the promise of improving the matching of research proposals to funding sources in the area of Intelligent Information Retrieval (IIR). IIR is a fairly new AI technique that has evolved from the traditional IR systems to solve real-world problems. Typically, an IIR system contains three main components, namely, a knowledge base, an inference engine and a user-interface. Due to its inferential capabilities. IIR has been found to be applicable to domains for which traditional techniques, such as the use of databases, have not been well suited. This applicability has led it to become a viable AI technique from both, a research and an application perspective. This dissertation concentrates on researching and implementing an IIR system in LPA Prolog, that we call FUND, to assist in the matching of research proposals of prospective researchers to funding sources within the National Research Foundation (NRF). FUND'S reasoning strategy for its inference engine is backward chaining that carries out a depth-first search over its knowledge representation structure, namely, a semantic network. The distance constraint of the Constrained Spreading Activation (CSA) technique is incorporated within the search strategy to help prune non-relevant returns by FUND. The evolution of IIR from IR was covered in detail. Various reasoning strategies and knowledge representation schemes were reviewed to find the combination that best suited the problem domain and programming language chosen. FUND accommodated a depth 4, depth 5 and an exhaustive search algorithm. FUND'S effectiveness was tested, in relation to the different searches with respect to their precision and recall ability and in comparison to other similar systems. FUND'S performance in providing researchers with better funding advice in the South African situation proved to be favourably comparable to other similar systems elsewhere

ResearchSpace@UKZN