35 research outputs found

    CoRoLa Starts Blooming – An update on the Reference Corpus of Contemporary Romanian Language

    This article reports on the ongoing CoRoLa project, which aims to create a reference corpus of contemporary Romanian (from 1945 onwards), open for free online use by researchers in linguistics and language processing, teachers of Romanian, and students. We are investing considerable effort in persuading large publishing houses and other holders of IPR over relevant language data to join us and contribute selections of their text and speech repositories to the project. The CoRoLa project is coordinated by two Computer Science institutes of the Romanian Academy, and benefits from the cooperation and advice of professional linguists from other institutes of the Romanian Academy. We foresee a written component of more than 500 million word forms and a speech component of about 300 hours of recordings. The entire collection of texts (covering all functional styles of the language) will be pre-processed, annotated at several levels, and documented with standardized metadata. Pre-processing includes cleaning the data, harmonising the diacritics, sentence splitting and tokenization. Annotation will include morpho-lexical tagging and lemmatization in the first stage, followed by syntactic, semantic and discourse annotation in a later stage.
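    The pre-processing steps named in this abstract (diacritic harmonisation, sentence splitting, tokenization) can be illustrated with a minimal Python sketch. It is not the CoRoLa pipeline: the function names, the regular expressions and the sample sentence are all assumptions made for illustration. The only Romanian-specific fact it relies on is that legacy texts often use the cedilla forms of s and t, while standard Romanian uses the comma-below forms.

```python
import re

# Legacy cedilla characters mapped to the standard Romanian comma-below characters.
CEDILLA_TO_COMMA = str.maketrans({
    "\u015F": "\u0219",  # s with cedilla (lowercase) -> s with comma below
    "\u015E": "\u0218",  # S with cedilla (uppercase) -> S with comma below
    "\u0163": "\u021B",  # t with cedilla (lowercase) -> t with comma below
    "\u0162": "\u021A",  # T with cedilla (uppercase) -> T with comma below
})


def harmonise_diacritics(text: str) -> str:
    """Replace cedilla variants with comma-below variants (hypothetical helper)."""
    return text.translate(CEDILLA_TO_COMMA)


def split_sentences(text: str) -> list[str]:
    """Naive splitter: break after . ! ? when an uppercase letter follows.

    An assumption for illustration; it will mis-split abbreviations.
    """
    pattern = r"(?<=[.!?])\s+(?=[A-Z\u0102\u00C2\u00CE\u0218\u021A])"
    return [part for part in re.split(pattern, text.strip()) if part]


def tokenize(sentence: str) -> list[str]:
    """Naive tokenizer: keep hyphenated clitic forms such as "s-a" together,
    and emit each punctuation mark as its own token."""
    return re.findall(r"\w+(?:-\w+)*|[^\w\s]", sentence)


if __name__ == "__main__":
    raw = "Aceasta este o \u015Ftire. \u0162ara s-a schimbat!"
    for sentence in split_sentences(harmonise_diacritics(raw)):
        print(tokenize(sentence))
```

    Harmonising before splitting matters in this sketch because the splitter's uppercase class lists only the comma-below capitals; a production pipeline would need abbreviation handling and a language-aware tokenizer.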

    Romanian Language Technology — a view from an academic perspective

    The article reports on research and development pursued by the Research Institute for Artificial Intelligence "Mihai Draganescu" of the Romanian Academy in order to narrow the gaps identified by the in-depth analysis of European languages presented in the Meta-Net white papers, published by Springer in 2012. Except for English, all European languages needed significant research and development in order to reach an adequate technological level, in line with the expectations and requirements of the knowledge society.

    Experiments in Language Variety Geolocation and Dialect Identification

    Peer reviewed

    Proceedings of the Workshop on Challenges in the Management of Large Corpora (CMLC-10)


    Sur le renforcement des adverbes indéfinis et relatifs-interrogatifs locatifs par l’adverbe d’altérité sémantiquement apparenté en roumain contemporain

    In this contribution, drawing on the CoRoLa corpus, we discuss the adverb of alterity (altundeva), which in modern Romanian combines with indefinite or relative-interrogative locative adverbs to form complex adverbial phrases (oriunde altundeva, altundeva unde) that serve both to reinforce and to differentiate. Such an analysis can confirm, in some cases, that Romanian belongs to the Romance family and, in others, its individuality among the Romance languages, owing to its isolation from the Romance continuum and, consequently, its particular path of evolution.

    International Comparable Corpus : Challenges in building multilingual spoken and written comparable corpora

    This paper reports on the efforts of twelve national teams in building the International Comparable Corpus (ICC; https://korpus.cz/icc), which will contain highly comparable datasets of spoken, written and electronic registers. The languages currently covered are Czech, Finnish, French, German, Irish, Italian, Norwegian, Polish, Slovak, Swedish and, more recently, Chinese, as well as English, which is considered to be the pivot language. The goal of the project is to provide much-needed data for contrastive corpus-based linguistics. The ICC corpus is committed to the idea of re-using existing multilingual resources as much as possible, and the design is modelled, with various adjustments, on the International Corpus of English (ICE). As such, ICC will contain approximately the same balance of 40 percent written language and 60 percent spoken language, distributed across 27 different text types and contexts. A number of issues encountered by the project teams are discussed, ranging from copyright and data sustainability to technical advances in data distribution. Peer reviewed

    Abordări actuale în cercetarea limbii române. Instrumente de lucru în domeniul morfosintaxei și al pragmaticii

    The main collective volumes published after 1989 in the fields of morphosyntax and pragmatics are presented, with the aim of bringing several modern working tools to the attention of the international public and of directing the attention of Romanists to the Romanian language, whose integration into Romance and non-Romance comparisons could lead to new, nuanced observations.

    Romanian de fapt - from adjectival adjunct to attention marker

    This research traces the development of the Romanian phrase de fapt ('in fact, actually, indeed'), based on written and oral corpora. De fapt has been attested in Romanian since the late 19th century; chronologically, it is the last of the three Romanian adverbial expressions (alongside în faptă and în fapt) that went through all the stages of the grammaticalization cline proposed by Elizabeth Traugott for this type of adverb. However, we consider that this phrase actually goes even further by becoming, in press headlines, an attention marker (Fraser 2009: 297), thus joining the category of să vezi ce s-a întâmplat ('you won't believe what has happened'). In press titles such as Cu ce femeie a petrecut aseară Pepe, de fapt ('Who is the woman Pepe actually spent the evening with'), de fapt loses its rhetorical function as a contrastive discourse marker, that of contrasting with a previous element, and acquires a new function: inviting the reader to read a story that (s)he would otherwise have overlooked. In occurrences of this type, de fapt acquires, for the first time, an intersubjective value.