14 research outputs found

    The EAGLES/ISLE initiative for setting standards: the Computational Lexicon Working Group for Multilingual Lexicons

    ISLE (International Standards for Language Engineering), a transatlantic standards-oriented initiative under the Human Language Technology (HLT) programme, is a continuation of the long-standing EAGLES (Expert Advisory Group for Language Engineering Standards) initiative, carried out by European and American groups within the EU-US International Research Co-operation, supported by NSF and EC. The objective is to support international and national HLT R&D projects, and HLT industry, by developing and promoting widely agreed and urgently demanded HLT standards and guidelines for infrastructural language resources, tools, and HLT products. ISLE targets the areas of multilingual computational lexicons (MCL), natural interaction and multimodality (NIMM), and evaluation. For MCL, ISLE is working to: extend EAGLES work on lexical semantics, necessary to establish inter-language links; design standards for multilingual lexicons; develop a prototype tool to implement lexicon guidelines; create EAGLES-conformant sample lexicons and tag corpora for validation purposes; develop standardised evaluation procedures for lexicons. For NIMM, a rapidly innovating domain urgently requiring early standardisation, ISLE work is targeted at developing guidelines for: creation of NIMM data resources; interpretative annotation of NIMM data, including spoken dialogue; annotation of discourse phenomena. For evaluation, ISLE is working on: quality models for machine translation systems; maintenance of previous guidelines - in an ISO-based framework. In this paper we concentrate on the Computational Lexicon Working Group, describing in detail the proposed guidelines for the "Multilingual ISLE Lexical Entry" (MILE). We highlight some methodological principles applied in previous EAGLES work and followed in defining MILE. We also describe the EU SIMPLE semantic lexicons, built on the basis of previous EAGLES recommendations. Their importance lies in the fact that these lexicons are now being enlarged to real-size lexicons within National Projects in 8 EU countries, thus building a large infrastructural platform of harmonised lexicons in Europe. We also stress the relevance of standardised language resources for humanities applications. Numerous theories, approaches, and systems are taken into account in ISLE, as any recommendation for harmonisation must build on the major contemporary approaches. Results will be widely disseminated, after validation in collaboration with EU and US HLT R&D projects, and industry. EAGLES work towards de facto standards has already allowed the field of Language Resources to establish broad consensus on key issues for some well-established areas - and will allow similar consensus to be achieved for other important areas through the ISLE project - thus providing a key opportunity for further consolidation and a basis for technological advance. Previous EAGLES results in many areas have in fact already become widely adopted de facto standards, and EAGLES itself is a well-known trademark and a point of reference for HLT projects.
    Hosted by the Scholarly Text and Imaging Service (SETIS), the University of Sydney Library, and the Research Institute for Humanities and Social Sciences (RIHSS), the University of Sydney.

    Bilingual newsgroups in Catalonia: a challenge for machine translation

    This paper presents a linguistic analysis of a corpus of messages written in Catalan and Spanish, taken from several informal newsgroups on the Universitat Oberta de Catalunya (Open University of Catalonia; henceforth, UOC) Virtual Campus. The surrounding environment is one of extensive bilingualism and contact between Spanish and Catalan. The study was carried out as part of the INTERLINGUA project conducted by the UOC's Internet Interdisciplinary Institute (IN3). Its main goal is to ascertain the linguistic characteristics of the e-mail register in the newsgroups in order to assess their implications for the creation of an online machine translation environment. The results shed empirical light on the relevance of characteristics of the e-mail register, the impact of language contact and interference, and their implications for the use of machine translation for CMC data in order to facilitate cross-linguistic communication on the Internet.

    Evaluating the output quality of machine translation systems: SYSTRAN 4.0

    This paper presents a user-like black-box evaluation of SYSTRAN Premium 4.0 and, at the same time, introduces an overall approach to the technique of Machine Translation evaluation. Evaluating the output quality of a system such as SYSTRAN can give users an idea of the needs that such a tool can cover and of the types of use that are most appropriate and can profit most from it.

    Modelling OLIF frame with EAGLES/ISLE specifications: an interlingual approach

    FunGramKB is a lexico-conceptual knowledge base for implementation in NLP systems. The FunGramKB lexical model is basically derived from OLIF and enhanced with some EAGLES/ISLE recommendations for the purpose of designing more robust computational lexica. However, the FunGramKB interlingual approach gives a more cognitive view of lexical frames than the OLIF and EAGLES/ISLE proposals. The aim of this paper is to describe how this approach influences the way lexical frames are conceived.
    Periñán Pascual, JC.; Arcas Túnez, F. (2008). Modelling OLIF frame with EAGLES/ISLE specifications: an interlingual approach. Procesamiento del Lenguaje Natural. (40):9-16. http://hdl.handle.net/10251/52126

    An interdisciplinary view of semantic annotation

    Nowadays the Internet is the main source of information. The number of documents accessible on what is known as the World Wide Web (WWW), or simply the web or the net, is immense. ..

    Virtual Ukrainian-Russian-English terminographic laboratory on physics: modern linguistic technologies in professional language

    This paper presents the virtual Ukrainian-Russian-English terminographic laboratory on physics, developed at the Ukrainian Lingua-Information Fund of the National Academy of Sciences of Ukraine, as an example of modern knowledge-intensive linguistic technology. It is a unique resource based on cloud and GRID technologies that makes it possible to create and store a very large body of terminological information (i.e. millions of terms in different languages along with all their synonyms, as well as their explanations and illustrations) and to work with the term system online from anywhere in the world. It is not only a complex multilingual translational and explanatory dictionary, but also an environment whose content can be improved continuously. This environment supports a variety of terminological research activities: extracting terms, retrieving and extracting information, building semantic fields, studying lexical semantic relations, revealing lexicographic effects, creating ontologies, etc. The trilingual “Explanatory Dictionary of Physics” [Вакуленко–Вакуленко 2008], an encyclopedic-type dictionary containing 6644 articles, forms its terminology database. The terminological problems inherent in modern term systems are also considered, namely the problem of synonymy and the use of verbal nouns.
    Scientific Conferences / Serbian Academy of Sciences and Arts; vol. 157. Department of Language and Literature; vol. 2

    Recycling texts: human evaluation of example-based machine translation subtitles for DVD

    This project focuses on translation reusability in audiovisual contexts. Specifically, the project seeks to establish (1) whether target language subtitles produced by an EBMT system are considered intelligible and acceptable by viewers of movies on DVD, and (2) whether a relationship exists between the ‘profiles’ of corpora used to train an EBMT system, on the one hand, and viewers’ judgements of the intelligibility and acceptability of the subtitles produced by the system, on the other. The impact of other factors is also investigated, namely whether movie-viewing subjects have knowledge of the soundtrack language, subjects’ linguistic background, and subjects’ prior knowledge of the (Harry Potter) movie clips viewed. Corpus profiling is based on measurements (partly using corpus-analysis tools) of three characteristics of the corpora used to train the EBMT system: the number of source language repetitions they contain; the size of the corpus; and the homogeneity of the corpus (independent variables). As a quality control measure in this prospective profiling phase, we also elicit human judgements (through a combined questionnaire and interview) on the quality of the corpus data and on the reusability in new contexts of the TL subtitles. The intelligibility and acceptability of EBMT-produced subtitles (dependent variables) are, in turn, established through end-user evaluation sessions. In these sessions 44 native German-speaking subjects view short movie clips containing EBMT-generated German subtitles, and following each clip answer questions (again, through a combined questionnaire and interview) relating to the quality characteristics mentioned above. The findings of the study suggest that an increase in corpus size, along with a concomitant increase in the number of source language repetitions and a decrease in corpus homogeneity, improves the readability of the EBMT-generated subtitles. It does not, however, have a significant effect on the comprehensibility, style or well-formedness of the EBMT-generated subtitles. Increasing corpus size and SL repetitions also results in a higher number of alternative TL translations in the corpus that are deemed acceptable by evaluators in the corpus profiling phase. The research also finds that subjects are more critical of subtitles when they do not understand the soundtrack language, while subjects’ linguistic background does not have a significant effect on their judgements of the quality of EBMT-generated subtitles. Prior knowledge of the Harry Potter genre, on the other hand, appears to have an effect on how viewing subjects rate the severity of observed errors in the subtitles, and on how they rate the style of subtitles, although this effect is training corpus-dependent. The introduction of repeated subtitles did not reduce the intelligibility or acceptability of the subtitles. Overall, the findings indicate that the subtitles deemed the most acceptable when evaluated in a non-AVT environment (albeit one in which rich contextual information was available) were the same as the subtitles deemed the most acceptable in an AVT environment, although richer data were gathered from the AVT environment.

    An Investigation into Automatic Translation of Prepositions in IT Technical Documentation from English to Chinese

    Machine Translation (MT) technology has been widely used in the localisation industry to boost the productivity of professional translators. However, given the high quality of translation expected, the performance of an MT system in isolation is less than satisfactory because of the various errors it generates. This study focuses on the translation of prepositions from English into Chinese within technical documents in an industrial localisation context. The aim of the study is to reveal the salient errors in the translation of prepositions and to explore possible methods to remedy these errors. This study proposes three new approaches to improve the translation of prepositions. All approaches attempt to make use of the strengths of the two most popular MT architectures at the moment: Rule-Based MT (RBMT) and Statistical MT (SMT). The approaches are: firstly, building an automatic preposition dictionary for the RBMT system; secondly, exploring and modifying the process of Statistical Post-Editing (SPE); and thirdly, pre-processing the source texts to better suit the RBMT system. Overall evaluation results (both human evaluation and automatic evaluation) show the potential of our new approaches in improving the translation of prepositions. In addition, the current study also reveals a new function of automatic metrics in assisting researchers to obtain more valid or purpose-specific human evaluation results.

    Design of a methodological model for collecting information on the main lexical classes in various languages.

    The purpose of this work is to offer a methodological model that allows field researchers interested in language description to collect information on lexical classes, and especially on words that express property concepts (i.e., adjectives). This work is part of, and motivated by, the research project "Tipología de las clases léxicas de las lenguas amazónicas y andinas de Colombia". That project proposes a contrastive study of phonological and morphosyntactic aspects of some Amazonian and Andean languages of Colombia in order to establish a typology of lexical classes. From this perspective, the topic of this degree thesis was chosen not only because the author took part in the project as a student, but also because a methodological problem arose during the study of the categorization problem. That is, although the research group has been able to characterize the noun and verb classes in much greater detail with the material available to it, it has not been able to do so for adjectives. The scarcity of data for identifying and characterizing this lexical class has given rise to multiple interpretations about its existence in the indigenous languages spoken in Colombia. Pregrad