
14 research outputs found

The EAGLES/ISLE initiative for setting standards: the Computational Lexicon Working Group for Multilingual Lexicons

Author: Calzolari Nicoletta
Zampolli Antonio
Publication venue: Research Institute for Humanities and Social Sciences (RIHSS), the University of Sydney.
Publication date: 01/01/2001

ISLE (International Standards for Language Engineering), a transatlantic standards oriented initiative under the Human Language Technology (HLT) programme, is a continuation of the long standing EAGLES (Expert Advisory Group for Language Engineering Standards) initiative, carried out by European and American groups within the EU-US International Research Co-operation, supported by NSF and EC. The objective is to support HLT R&D international and national projects, and HLT industry, by developing and promoting widely agreed and urgently demanded HLT standards and guidelines for infrastructural language resources, tools, and HLT products. ISLE targets the areas of multilingual computational lexicons (MCL), natural interaction and multimodality (NIMM), and evaluation. For MCL, ISLE is working to: extend EAGLES work on lexical semantics, necessary to establish inter-language links; design standards for multilingual lexicons; develop a prototype tool to implement lexicon guidelines; create EAGLES-conformant sample lexicons and tag corpora for validation purposes; develop standardised evaluation procedures for lexicons. For NIMM, a rapidly innovating domain urgently requiring early standardisation, ISLE work is targeted to develop guidelines for: creation of NIMM data resources; interpretative annotation of NIMM data, including spoken dialogue; annotation of discourse phenomena. For evaluation, ISLE is working on: quality models for machine translation systems; maintenance of previous guidelines - in an ISO based framework. We concentrate in the paper on the Computational Lexicon Working Group, describing in detail the proposals of guidelines for the "Multilingual ISLE Lexical Entry" (MILE). We highlight some methodological principles applied in previous EAGLES, and followed in defining MILE. We also provide a description of the EU SIMPLE semantic lexicons built on the basis of previous EAGLES recommendations. 
Their importance lies in the fact that these lexicons are now enlarged to real-size lexicons within National Projects in 8 EU countries, thus building a large infrastructural platform of harmonised lexicons in Europe. We also stress the relevance of standardised language resources for humanities applications. Numerous theories, approaches, and systems are taken into account in ISLE, as any recommendation for harmonisation must build on the major contemporary approaches. Results will be widely disseminated, after validation in collaboration with EU and US HLT R&D projects and industry. EAGLES work towards de facto standards has already allowed the field of Language Resources to establish broad consensus on key issues for some well-established areas, and will allow similar consensus to be achieved for other important areas through the ISLE project, thus providing a key opportunity for further consolidation and a basis for technological advance. Previous EAGLES results in many areas have in fact already become widely adopted de facto standards, and EAGLES itself is a well-known trademark and a point of reference for HLT projects.

Hosted by the Scholarly Text and Imaging Service (SETIS), the University of Sydney Library, and the Research Institute for Humanities and Social Sciences (RIHSS), the University of Sydney

Sydney eScholarship

Bilingual newsgroups in Catalonia: a challenge for machine translation

Author: Climent Roca Salvador
Moré López Joaquim
Oliver González Antoni
Salvatierra Mallarach Míriam
Sánchez Sáiz Imma
Taulé Delor Mariona
Vallmanya Cucurull Lluïsa
Publication venue: 'Wiley'
Publication date: 01/01/2003

This paper presents a linguistic analysis of a corpus of messages written in Catalan and Spanish, taken from several informal newsgroups on the Universitat Oberta de Catalunya (Open University of Catalonia; henceforth, UOC) Virtual Campus. The surrounding environment is one of extensive bilingualism and contact between Spanish and Catalan. The study was carried out as part of the INTERLINGUA project conducted by the UOC's Internet Interdisciplinary Institute (IN3). Its main goal is to ascertain the linguistic characteristics of the e-mail register in the newsgroups in order to assess their implications for the creation of an online machine translation environment. The results shed empirical light on the relevance of characteristics of the e-mail register, the impact of language contact and interference, and their implications for the use of machine translation for CMC data in order to facilitate cross-linguistic communication on the Internet.

The Oberta in open access

Evaluating the output quality of machine translation systems: SYSTRAN 4.0

Author: Talaván Zanón Noa
Publication venue: 'Universidad de Sevilla - Secretariado de Recursos Audiovisuales y Nuevas Tecnologias'
Publication date: 01/01/2005

This paper presents a user-like black-box evaluation of SYSTRAN Premium 4.0 and, at the same time, introduces an overall approach to Machine Translation evaluation. Evaluating the output quality of a system such as SYSTRAN can give users an idea of the needs such a tool can cover and of the types of use that are most appropriate for, and can benefit most from, such a system.

idUS. Depósito de Investigación Universidad de Sevilla

Modelling OLIF frame with EAGLES/ISLE specifications: an interlingual approach

Author: Arcas Túnez Francisco
Periñán Pascual José Carlos
Publication venue: Sociedad Española para el Procesamiento del Lenguaje Natural (SEPLN)
Publication date: 01/03/2008

FunGramKB is a lexico-conceptual knowledge base for NLP systems. The FunGramKB lexical model is basically derived from OLIF and enhanced with EAGLES/ISLE recommendations with the purpose of designing more robust computational lexica. However, the FunGramKB interlingual approach gives a more cognitive view of lexical frames than the OLIF and EAGLES/ISLE proposals. The aim of this paper is to describe how this approach influences the way of conceiving lexical frames.

Periñán Pascual, JC.; Arcas Túnez, F. (2008). Modelling OLIF frame with EAGLES/ISLE specifications: an interlingual approach. Procesamiento del Lenguaje Natural. (40):9-16. http://hdl.handle.net/10251/52126

RiuNet

An interdisciplinary view of semantic annotation

Author: Aguado de Cea G.
Pareja-Lora A.
Álvarez de Mon Rego I.
Publication venue: Facultad de Informática (UPM)
Publication date: 01/01/2009

Nowadays the Internet is the main source of information. The number of documents accessible on what is known as the World Wide Web (WWW), or simply the web or the net, is immense. ...

LAReferencia - Red Federada de Repositorios Institucionales de Publicaciones Científicas Latinoamericanas

Archivo Digital UPM

Virtual Ukrainian-Russian-English terminographic laboratory on physics: modern linguistic technologies in professional language

Author: Вакуленко Максим Олегович
Publication venue: Београд : Институт за српски језик САНУ
Publication date: 01/01/2017

This paper presents the virtual Ukrainian-Russian-English terminographic laboratory on physics, an example of modern knowledge-intensive linguistic technology developed at the Ukrainian Lingua-Information Fund of the NAS of Ukraine. It is a unique resource based on cloud and GRID technologies which makes it possible to create and store a very large body of terminological information (i.e. millions of terms in different languages along with their synonyms, as well as their explanations and illustrations) and to work with the term system online anywhere in the world. It is not only a complex multilingual translational and explanatory dictionary, but also a medium whose content can be amended constantly. This medium facilitates a variety of terminological research activities: extracting terms, retrieving and extracting information, building semantic fields, studying lexical semantic relations, revealing lexicographic effects, creating ontologies, etc. The trilingual “Explanatory Dictionary of Physics” [Вакуленко–Вакуленко 2008], a dictionary of encyclopedic type containing 6644 articles, forms its terminology database. The terminological problems inherent to modern term systems are also considered, namely the problem of synonymy and the use of verbal nouns.

Scientific Meetings / Serbian Academy of Sciences and Arts; vol. 157. Department of Language and Literature; vol. 2

Serbian Academy of Science and Arts Digital Archive (DAIS)

Recycling texts: human evaluation of example-based machine translation subtitles for DVD

Author: Flanagan Marian
Publication venue: Dublin City University. School of Applied Language and Intercultural Studies
Publication date: 01/11/2009

This project focuses on translation reusability in audiovisual contexts. Specifically, the project seeks to establish (1) whether target language subtitles produced by an EBMT system are considered intelligible and acceptable by viewers of movies on DVD, and (2) whether a relationship exists between the ‘profiles’ of corpora used to train an EBMT system, on the one hand, and viewers’ judgements of the intelligibility and acceptability of the subtitles produced by the system, on the other. The impact of other factors is also investigated, namely whether movie-viewing subjects have knowledge of the soundtrack language, subjects’ linguistic background, and subjects’ prior knowledge of the (Harry Potter) movie clips viewed. Corpus profiling is based on measurements (partly using corpus-analysis tools) of three characteristics of the corpora used to train the EBMT system: the number of source language repetitions they contain; the size of the corpus; and the homogeneity of the corpus (independent variables). As a quality control measure in this prospective profiling phase, we also elicit human judgements (through a combined questionnaire and interview) on the quality of the corpus data and on the reusability in new contexts of the TL subtitles. The intelligibility and acceptability of EBMT-produced subtitles (dependent variables) are, in turn, established through end-user evaluation sessions. In these sessions 44 native German-speaking subjects view short movie clips containing EBMT-generated German subtitles, and following each clip answer questions (again, through a combined questionnaire and interview) relating to the quality characteristics mentioned above. The findings of the study suggest that an increase in corpus size, along with a concomitant increase in the number of source language repetitions and a decrease in corpus homogeneity, improves the readability of the EBMT-generated subtitles. It does not, however, have a significant effect on the comprehensibility, style or well-formedness of the EBMT-generated subtitles. Increasing corpus size and SL repetitions also results in a higher number of alternative TL translations in the corpus that are deemed acceptable by evaluators in the corpus profiling phase. The research also finds that subjects are more critical of subtitles when they do not understand the soundtrack language, while subjects’ linguistic background does not have a significant effect on their judgements of the quality of EBMT-generated subtitles. Prior knowledge of the Harry Potter genre, on the other hand, appears to have an effect on how viewing subjects rate the severity of observed errors in the subtitles, and on how they rate the style of subtitles, although this effect is training corpus-dependent. The introduction of repeated subtitles did not reduce the intelligibility or acceptability of the subtitles. Overall, the findings indicate that the subtitles deemed the most acceptable when evaluated in a non-AVT environment (albeit one in which rich contextual information was available) were the same as the subtitles deemed the most acceptable in an AVT environment, although richer data were gathered from the AVT environment.

Irish Universities

DCU Online Research Access Service

An Investigation into Automatic Translation of Prepositions in IT Technical Documentation from English to Chinese

Author: Sun Yanli
Publication venue: Dublin City University. Centre for Translation and Textual Studies (CTTS)
Publication date: 01/11/2010

Machine Translation (MT) technology has been widely used in the localisation industry to boost the productivity of professional translators. However, given the high quality of translation expected, the performance of an MT system in isolation is less than satisfactory because of the various errors it generates. This study focuses on the translation of prepositions from English into Chinese within technical documents in an industrial localisation context. The aim of the study is to reveal the salient errors in the translation of prepositions and to explore possible methods to remedy these errors. This study proposes three new approaches to improve the translation of prepositions. All approaches attempt to make use of the strengths of the two most popular MT architectures at the moment: Rule-Based MT (RBMT) and Statistical MT (SMT). The approaches are: firstly, automatically building a preposition dictionary for the RBMT system; secondly, exploring and modifying the process of Statistical Post-Editing (SPE); and thirdly, pre-processing the source texts to better suit the RBMT system. Overall evaluation results (both human evaluation and automatic evaluation) show the potential of our new approaches in improving the translation of prepositions. In addition, the current study also reveals a new function of automatic metrics in assisting researchers to obtain more valid or purpose-specific human evaluation results.

Irish Universities

DCU Online Research Access Service

Design of a methodological model for collecting information on the main lexical classes in various languages.

Author: Orjuela Salinas Nelsy Lorena
Publication date: 21/09/2010

The purpose of this work is to offer a methodological model that allows field researchers interested in language description to collect information on lexical classes, and especially on words that express property concepts (i.e., adjectives). This work forms part of, and is motivated by, the research project “Tipología de las clases léxicas de las lenguas amazónicas y andinas de Colombia” (Typology of the lexical classes of the Amazonian and Andean languages of Colombia). That project proposes a contrastive study of phonological and morphosyntactic aspects of some Amazonian and Andean languages of Colombia in order to establish a typology of lexical classes. From this perspective, the topic of this degree thesis was chosen not only because the author participated as a student in the project, but also because a methodological problem emerged in the course of studying the problem of categorisation. That is, although the research group has been able to characterise the noun and verb classes in much greater detail with the material available to it, it has not been able to do so for adjectives. The scarcity of data for identifying and characterising this lexical class has led to multiple interpretations regarding its existence in the indigenous languages spoken in Colombia.

Universidad Nacional De Colombia - Repositorio Institucional UN