33 research outputs found
An open-source shallow-transfer machine translation toolbox: consequences of its release and availability
By the time Machine Translation Summit X is held in September 2005, our group will have released an open-source machine translation toolbox as part of a large government-funded project involving four universities and three linguistic technology companies from Spain. The machine translation toolbox, which will most likely be released under a GPL-like license, includes (a) the open-source engine itself, a modular shallow-transfer machine translation engine suitable for related languages and largely based upon that of systems we have already developed, such as interNOSTRUM for Spanish–Catalan and Traductor Universia for Spanish–Portuguese; (b) extensive documentation (including document type declarations) specifying the XML format of all linguistic (dictionaries, rules) and document format management files; (c) compilers converting these data into the high-speed (tens of thousands of words a second) format used by the engine; and (d) pilot linguistic data for Spanish–Catalan and Spanish–Galician and format management specifications for the HTML, RTF and plain text formats. After briefly describing this toolbox, this paper explores possible consequences of the availability of this architecture, including the community-driven development of machine translation systems for languages lacking this kind of linguistic technology. The development of the toolbox is funded by project FIT-340101-2004-3 (Spanish Ministry of Industry, Commerce and Tourism).
An open-source shallow-transfer machine translation engine for the Romance languages of Spain
We present the current status of development of an open-source shallow-transfer machine translation engine for the Romance languages of Spain (the main ones being Spanish, Catalan and Galician) as part of a larger government-funded project which includes non-Romance languages such as Basque and involves both universities and linguistic technology companies. The machine translation architecture uses finite-state transducers for lexical processing, hidden Markov models for part-of-speech tagging, and finite-state-based chunking for structural transfer, and is largely based upon that of systems already developed by the Transducens group at the Universitat d'Alacant, such as interNOSTRUM (Spanish–Catalan) and Traductor Universia (Spanish–Portuguese). The possible scope of the project, however, is wider, since it will be possible to use the resulting machine translation system with new pairs of languages; to that end, the project also aims at proposing standard formats to encode the linguistic data needed. This paper briefly describes the machine translation engine, the formats it uses for linguistic data, and the compilers that convert these data into an efficient format used by the engine. Work funded by projects FIT-340101-2004-3 (Spanish Ministry of Industry, Commerce and Tourism) and TIC2003-08681-C02-01 (Spanish Ministry of Science and Technology). Felipe Sánchez-Martínez is supported by the Spanish Ministry of Science and Education and the European Social Fund through grant BES-2004-4711.
Apertium, an open-source platform for the development of machine translation systems
One of the main challenges for computer science in the coming decades is the development of systems capable of processing natural (or human) language effectively. Within this field, machine translation systems, which translate a text written in one language into an equivalent version in another language, receive special attention given, for example, the multilingual character of societies such as Europe's. Automating this process is particularly complex because programs must deal with features of natural language, such as ambiguity, whose algorithmic treatment is not feasible, so that a mere approximation or partial automation of the process is already considered a success. Machine translation programs have traditionally been closed systems, but in recent times the trend set by free software has also reached this field. In this article we describe Apertium, apertium.org, an advanced open-source platform, licensed under the GNU GPL, which, thanks to the decoupling it offers between data and programs, makes it easy to develop new machine translators. The Apertium platform has been developed by the Transducens research group at the Universitat d’Alacant within the framework of several collaborative projects with universities and companies in Spain in which, in addition to the programs that make up the translation engine, open linguistic data have been built for Catalan–Spanish, Galician–Spanish, Portuguese–Spanish, French–Catalan, English–Catalan and Occitan–Catalan machine translation.
Both the platform into which the translation engine is integrated and the data for these language pairs are available for download at sf.net/projects/apertium/ and for online evaluation at xixona.dlsi.ua.es/prototype/. This work has been partially funded by the Ministry of Industry, Commerce and Tourism through projects FIT-340101-2004-3, FIT-340001-2005-2 and FIT-350401-2006-5, by the Ministry of Education and Science through projects TIC2003-08681-C02-01 and TIN2006-15071-C03-01, and by the Generalitat de Catalunya through project DURSI1-05I. Felipe Sánchez-Martínez holds the research personnel training grant BES-2004-4711, funded by the European Social Fund and the Ministry of Education and Science.
Self-sampling for human papillomavirus DNA detection: A preliminary study of compliance and feasibility in Bolivia
Background: Cervical cancer incidence and mortality rates in Bolivia are among the highest in Latin America. This investigation aims to evaluate the possibility of using simple devices, e.g. a cotton swab and a glass slide, for self-sampling in order to detect human papillomavirus (HPV) DNA by PCR in cervico-vaginal cells. Methods: In the first phase of our study we evaluated the use of a glass slide as a transport medium for cervical cells. A physician took paired cervical samples from 235 women. One sample was transported in Easyfix® solution and the other sample was smeared over a glass slide. Both were further analyzed and compared for human DNA recovery and HPV detection. A kappa value was determined to evaluate the agreement between the HPV DNA detection rates. In the second phase of the study, 222 women from the urban, peri-urban and rural regions of Cochabamba were asked to perform self-sampling using the following devices: a cotton swab combined with a glass slide, and a vaginal tampon. The women gave their opinion about the self-sampling technique. Finally, the agreement for high-risk HPV detection between self- and physician-collected samples was assessed in 201 samples in order to evaluate the self-sampling technique. Results: Firstly, the comparison between Easyfix® solution and the glass slide to transport clinical samples gave a good agreement for HPV DNA detection (κ = 0.71, 95% CI 0.60-0.81). Secondly, self-sampling, especially with a cotton swab combined with a glass slide, would generally be preferred over clinician sampling for a screening program based on HPV detection. Finally, we showed a good agreement between self- and physician-collected samples for high-risk HPV detection (κ = 0.71, 95% CI 0.55-0.88). Conclusions: Simple devices such as a cotton swab and a glass slide can be used to perform self-sampling and HPV DNA detection.
Furthermore, most Bolivian women preferred self-sampling over clinician-sampling for cervical cancer screening.
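The agreement statistic reported in this abstract is a kappa value. As a rough illustration of how such a value is obtained for two binary test results (for example, HPV positive or negative by two sampling methods), the sketch below computes Cohen's kappa from a 2×2 agreement table. The function name and the counts in the usage note are hypothetical and are not taken from the study; the abstract does not state which kappa variant or software the authors used.

```python
def cohen_kappa(both_pos, only_first, only_second, both_neg):
    """Cohen's kappa for two binary ratings of the same samples.

    Arguments are the four cells of a 2x2 agreement table
    (hypothetical counts, not data from the study).
    """
    n = both_pos + only_first + only_second + both_neg
    if n == 0:
        raise ValueError("the agreement table is empty")

    # Observed agreement: share of samples on which both methods agree.
    p_observed = (both_pos + both_neg) / n

    # Agreement expected by chance, from each method's positive rate.
    p_first_pos = (both_pos + only_first) / n
    p_second_pos = (both_pos + only_second) / n
    p_expected = (p_first_pos * p_second_pos
                  + (1 - p_first_pos) * (1 - p_second_pos))

    if p_expected == 1:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (p_observed - p_expected) / (1 - p_expected)
```

With the made-up table `cohen_kappa(40, 10, 10, 40)`, observed agreement is 0.8 and chance agreement is 0.5, so kappa is 0.6. Confidence intervals such as those quoted in the abstract need an additional standard-error estimate, which this sketch leaves out.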
Open-source Portuguese-Spanish machine translation
This paper describes the current status of development of an open-source shallow-transfer machine translation (MT) system for the [European] Portuguese ↔ Spanish language pair, developed using the OpenTrad Apertium MT toolbox (www.apertium.org). Apertium uses finite-state transducers for lexical processing, hidden Markov models for part-of-speech tagging, and finite-state-based chunking for structural transfer, and is based on a simple rationale: to produce fast, reasonably intelligible and easily correctable translations between related languages, it suffices to use an MT strategy which uses shallow parsing techniques to refine word-for-word MT. This paper briefly describes the MT engine, the formats it uses for linguistic data, and the compilers that convert these data into an efficient format used by the engine, and then goes on to describe in more detail the pilot Portuguese ↔ Spanish linguistic data.
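The rationale of refining word-for-word translation with shallow processing can be pictured with a toy Spanish to Portuguese example. This is only a sketch under strong assumptions: plain Python dictionaries stand in for the finite-state transducers, no part-of-speech tagger is included because the tiny lexicon has no ambiguous forms, and every table, name and rule below is hypothetical. It does not reproduce Apertium's data formats or API.

```python
# Toy lexicon (hypothetical): surface form -> (lemma, part of speech, gender)
ANALYSES = {
    "la": ("el", "det", "f"),
    "el": ("el", "det", "m"),
    "leche": ("leche", "n", "f"),
    "puente": ("puente", "n", "m"),
    "fría": ("frío", "adj", "f"),
    "viejo": ("viejo", "adj", "m"),
}

# Bilingual dictionary: source lemma -> (target lemma, target noun gender or None)
BILINGUAL = {
    "el": ("o", None),
    "leche": ("leite", "m"),   # feminine in Spanish, masculine in Portuguese
    "puente": ("ponte", "f"),  # masculine in Spanish, feminine in Portuguese
    "frío": ("frio", None),
    "viejo": ("velho", None),
}

# Generation table: (target lemma, part of speech, gender) -> surface form
GENERATION = {
    ("o", "det", "m"): "o",
    ("o", "det", "f"): "a",
    ("leite", "n", "m"): "leite",
    ("ponte", "n", "f"): "ponte",
    ("frio", "adj", "m"): "frio",
    ("frio", "adj", "f"): "fria",
    ("velho", "adj", "m"): "velho",
    ("velho", "adj", "f"): "velha",
}


def translate(sentence):
    # 1. Lexical analysis: look up each surface form.
    units = [ANALYSES[word] for word in sentence.lower().split()]

    # 2. Lexical transfer: word-for-word lemma substitution; nouns take
    #    the gender they have in the target language.
    transferred = []
    for lemma, pos, gender in units:
        target_lemma, target_gender = BILINGUAL[lemma]
        transferred.append([target_lemma, pos, target_gender or gender])

    # 3. Shallow structural transfer: inside a determiner-noun-adjective
    #    chunk, make the determiner and adjective agree with the noun.
    for i in range(len(transferred) - 2):
        det, noun, adj = transferred[i:i + 3]
        if (det[1], noun[1], adj[1]) == ("det", "n", "adj"):
            det[2] = adj[2] = noun[2]

    # 4. Generation: inflect each target lexical unit.
    return " ".join(GENERATION[tuple(unit)] for unit in transferred)
```

Here `translate("la leche fría")` returns `"o leite frio"` and `translate("el puente viejo")` returns `"a ponte velha"`. Without step 3, plain word-for-word substitution would keep the Spanish genders on the determiner and adjective. Words missing from the toy tables raise `KeyError`.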