14 research outputs found

    The ASK Corpus – a Language Learner Corpus of Norwegian as a Second Language

    In our paper we present the design and interface of ASK, a language learner corpus of Norwegian as a second language. The corpus contains essays collected from language tests at two different proficiency levels, as well as personal data from the test takers. In addition, it contains texts and relevant personal data from native Norwegians as control data. The texts and the personal data are marked up in XML according to the TEI Guidelines. In order to be able to classify errors in the texts, we have introduced new attributes to the TEI corr and sic tags. For each error tag, a corrected form is also included in the text annotation. Finally, we employ an automatic tagger developed for standard Norwegian, the Oslo-Bergen Tagger, together with a facility for manual tag correction. As the corpus query system, we use the Corpus Workbench developed at the University of Stuttgart, together with a web search interface developed at Aksis, University of Bergen. The system allows searching for combinations of words, error types, grammatical annotation and personal data.
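
The error annotation described in this abstract can be illustrated with a minimal Python sketch. It assumes, purely for illustration, that each error is wrapped in a `sic` element carrying a `type` attribute (the error class) and a `corr` attribute (the corrected form). The attribute names, the error labels `INFL` and `ORT`, and the sample sentence are hypothetical and are not taken from the ASK annotation scheme.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample sentence. The attribute names ("type", "corr") and the
# error labels are assumptions for illustration, not the actual ASK scheme.
SAMPLE = (
    '<s>Jeg <sic type="INFL" corr="bor">bo</sic> i '
    '<sic type="ORT" corr="Bergen">bergen</sic>.</s>'
)


def extract_errors(xml_text):
    """Return one record per <sic> element: error type, learner form, correction."""
    root = ET.fromstring(xml_text)
    return [
        {
            "error_type": el.get("type"),
            "learner_form": "".join(el.itertext()),
            "correct_form": el.get("corr"),
        }
        for el in root.iter("sic")
    ]


def corrected_text(xml_text):
    """Rebuild the sentence with each top-level <sic> replaced by its correction.

    Only direct children of the root are handled; nested error tags would
    need a recursive walk.
    """
    root = ET.fromstring(xml_text)
    parts = [root.text or ""]
    for el in root:
        if el.tag == "sic":
            parts.append(el.get("corr", ""))
        else:
            parts.append("".join(el.itertext()))
        parts.append(el.tail or "")
    return "".join(parts)
```

Keeping the learner form as element content and the correction as an attribute means the original text stays readable in the markup, while a query tool can search on either form or on the error type.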

    Forty years of working with corpora: from Ibsen to Twitter, and beyond

    We provide an overview of forty years of work with language corpora by the research group that started in 1972 as the Norwegian Computing Centre for the Humanities. A brief history highlights major corpora and tools that have been developed in numerous collaborations, including corpora of literature, dialect recordings, learner language, parallel texts, newspaper articles, blog posts and tweets. Current activities are also described, with a focus on corpus analysis tools, treebanks and social media analysis. Keywords: corpus building; corpus analysis tools; treebanks; social media analysis

    The Multilingual Corpus of Survey Questionnaires: A tool for refining survey translation

    This article describes the design and compilation of the Multilingual Corpus of Survey Questionnaires (MCSQ), the first publicly available corpus of international survey questionnaires. Version 3.0 (Rosalind Franklin) is compiled from questionnaires from the European Social Survey, the European Values Study, the Survey of Health, Ageing and Retirement in Europe, and the WageIndicator Survey in the (British) English source language and their translations into eight languages (Catalan, Czech, French, German, Norwegian, Portuguese, Spanish and Russian). Documents in the corpus were translated with the objective of maximising data comparability across cultures. After contextualising aims and procedures in survey translation, this article presents examples of two types of problematic translation outcomes in survey questionnaires. The first type relates to the choice of idiomatic terms or fixed expressions in the source text. The second type relates to cases where the semantic variation of translation choices exceeds the scope allowed to maintain the psychometric properties across languages. With these examples, we aim to demonstrate how corpus linguistics can be used to analyse past translation outcomes and to improve the methodology for translating questionnaires.

    Coding and Aligning the English-Norwegian Parallel Corpus

    This paper will focus on the encoding of texts and on the development of our method of automatically aligning original and translated texts. These are both critical steps in building a corpus of this kind. The purpose of text encoding is to prepare the texts in such a way that they adequately reflect the source texts and are maximally useful for computational studies. The aim of the alignment program is to specify equivalent points in the original and the translation, so that corresponding sections of the texts can be easily retrieved. Alignment and encoding are closely connected, in that decisions at the encoding stage have important consequences for alignment.

    2 Text encoding

    Before alignment can be attempted, a text must be available in electronic form and contain some basic markup. The pre-alignment text encoding consists of three phases:

    I Scanning and insertion of initial mark-up
    II Insertion of additional mark-up
    III Proofreading

    The markup is inserted into the text as tags; see Appendix. Angle brackets are used to separate tags from the text: <...> for 'start-tags' and </...> for 'end-tags'. Tags may have attributes, e.g. <div1 type=part id=ST1.1> indicating the start of a part of a book with the identifier ST1.1. The use of tags and attributes is defined formally by a so-called 'document type definition' (DTD). The coding of the texts is in broad agreement with the recommendations of the Text Encoding Initiative (TEI). See further Sperberg-McQueen & Burnard (1994).

    2.1 Scanning and insertion of initial mark-up

    All the texts are scanned by the use of an OCR reader. The scanner outputs word processing files. This means that paragraphs are uniquely identifiable, and can be tagged as such when we convert a text processing file to ASCII format, which is the form..
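
The tag syntax quoted in this abstract, a start-tag with unquoted attribute values such as <div1 type=part id=ST1.1>, can be read with a short Python sketch. This is an illustration only and not the project's own tooling: the function name `parse_start_tag` and both regular expressions are assumptions, and the sketch handles just the simple `name=value` form shown in the text, not full SGML.

```python
import re

# One start-tag: "<", a tag name, zero or more name=value attributes, ">".
# Values may be unquoted (as in the quoted example) or double-quoted.
START_TAG = re.compile(
    r'<([A-Za-z][\w.-]*)((?:\s+[\w.-]+=(?:"[^"]*"|[^\s>]+))*)\s*>'
)
ATTRIBUTE = re.compile(r'([\w.-]+)=(?:"([^"]*)"|([^\s>]+))')


def parse_start_tag(tag):
    """Split a start-tag into (name, attributes); raise ValueError otherwise."""
    match = START_TAG.fullmatch(tag.strip())
    if match is None:
        raise ValueError(f"not a start-tag: {tag!r}")
    name, attr_text = match.group(1), match.group(2)
    attrs = {}
    for attr in ATTRIBUTE.finditer(attr_text):
        quoted, unquoted = attr.group(2), attr.group(3)
        attrs[attr.group(1)] = quoted if quoted is not None else unquoted
    return name, attrs


if __name__ == "__main__":
    print(parse_start_tag("<div1 type=part id=ST1.1>"))
```

End-tags such as </div1> are rejected on purpose, so a caller can tell the two kinds of tag apart before matching them up.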