240 research outputs found

Morphosyntactic Disambiguation in an Endangered Language Setting

Author: Ens Jeff
Hämäläinen Mika
Pasquier Philippe
Rueter Jack
Publication venue: Linkoping University Electronic Press
Publication date: 01/01/2019

Peer reviewed

Helsingin yliopiston digitaalinen arkisto

Language Documentation meets Language Technology

Author: Blokland Rogier
Fedina Marina
Gerstenberger Ciprian
Partanen Niko
Rießler Michael
Wilbur Joshua
Publication venue: UiT The Arctic University of Norway
Publication date: 01/01/2015

Blokland R, Fedina M, Gerstenberger C, Partanen N, Rießler M, Wilbur J. Language Documentation meets Language Technology. Septentrio Conference Series. 2015;(2): 8

Publikationer från Uppsala Universitet

Septentrio Academic Publishing

Publications at Bielefeld University

Digitala Vetenskapliga Arkivet - Academic Archive On-line

Sentiment Analysis Using Aligned Word Embeddings for Uralic Languages

Author: Alnajjar Khalid
Hämäläinen Mika
Rueter Jack
Publication date: 24/05/2023

In this paper, we present an approach for translating word embeddings from a majority language into 4 minority languages: Erzya, Moksha, Udmurt and Komi-Zyrian. Furthermore, we align these word embeddings and present a novel neural network model that is trained on English data to conduct sentiment analysis and then applied to endangered language data through the aligned word embeddings. To test our model, we annotated a small sentiment analysis corpus for the 4 endangered languages and Finnish. Our method reached at least 56% accuracy for each endangered language. The models and the sentiment corpus will be released together with this paper. Our research shows that state-of-the-art neural models can be used with endangered languages, with the only requirement being a dictionary between the endangered language and a majority language. Comment: Proceedings of the Second Workshop on Resources and Representations for Under-Resourced Languages and Domains (RESOURCEFUL-2023)

arXiv.org e-Print Archive

Prerequisites For Shallow-Transfer Machine Translation Of Mordvin Languages : Language Documentation With A Purpose

Author: Hämäläinen Mika
Rueter Jack
Publication venue: Ижевск: Институт компьютерных исследований
Publication date: 01/01/2020

This paper presents the current lexical, morphological, syntactic and rule-based machine translation work for Erzya and Moksha that can and should be used in the development of a roadmap for Mordvin linguistic research. We seek to illustrate and outline initial problem types to be encountered in the construction of an Apertium-based shallow-transfer machine translation system for the Mordvin language forms. We indicate reference points within Mordvin Studies and other parts of Uralic studies, as a point of departure for outlining linguistic studies with a means for measuring their own progress and developing a roadmap for further studies. Keywords: Erzya, Moksha, Uralic, Shallow-transfer machine translation, Measurable language research, Measurable language distance, Finite-State Morphology, Universal Dependencies. Peer reviewed

Helsingin yliopiston digitaalinen arkisto

Autenttisiin teksteihin perustuva tietokoneavusteinen kielen oppiminen: sovelluksia italian kielelle [Computer-assisted language learning based on authentic texts: applications for Italian]

Author: China-Kolehmainen Elena
Publication venue: Helsingfors universitet
Publication date: 01/01/2021

Computer-Assisted Language Learning (CALL) is one of the sub-disciplines within the area of Second Language Acquisition. Clozes, also called fill-in-the-blank exercises, are widely used in language learning applications. A cloze is an exercise where the learner is asked to provide a fragment that has been removed from the text. For language learning purposes, in addition to open-end clozes where one or more words are removed and the student must fill the gap, another type of cloze is commonly used, namely the multiple-choice cloze. In a multiple-choice cloze, a fragment is removed from the text and the student must choose the correct answer from multiple options. Multiple-choice exercises are a common way of practicing and testing grammatical knowledge. The aim of this work is to identify relevant learning constructs for Italian to be applied to automatic exercise creation based on authentic texts in the Revita Framework. Learning constructs are units that represent language knowledge. Revita is a free-to-use online platform that was designed to provide language learning tools with the aim of revitalizing endangered languages, including several Finno-Ugric languages such as North Saami. Later, non-endangered languages were added. Italian is the first majority language to be added in a principled way. This work paves the way towards adding new languages in the future. Its purpose is threefold: it contributes to the raising of Italian from its beta status towards a full development stage; it formulates best practices for defining support for a new language; and it serves as documentation of what has been done, how, and what remains to be done. Grammars and linguistic resources were consulted to compile an inventory of learning constructs for Italian.
Analytic and pronominal verbs, verb government with prepositions, and noun phrase agreement were implemented by designing pattern rules that match sequences of tokens with specific parts-of-speech, surfaces and morphological tags. The rules were tested with test sentences that allowed further refining and correction of the rules. Current precision of the 47 rules for analytic and pronominal verbs on 177 test sentences is 100%. Recall is 96.4%. Both precision and recall for the 5 noun phrase agreement rules are 96.0% on the 34 test sentences. Analytic and pronominal verb patterns, as well as noun phrase agreement patterns, were used to generate open-end clozes. Verb government pattern rules were implemented into multiple-choice exercises where one of the four presented options is the correct preposition and the other three are prepositions that do not fit in context. The patterns were designed based on colligations, combinations of tokens (collocations) that are also explained by grammatical constraints. Verb government exercises were generated on a specifically collected corpus of 29074 words. The corpus included three types of text: biography sections from Wikipedia, Italian news articles and Italian language matriculation exams. The last text type generated the most exercises, with a rate of 19 exercises per 10000 words, suggesting that the semi-authentic text best matched the level of verb government exercises because of appropriate vocabulary frequency and sentence structure complexity. Four native language experts, either teachers of Italian as L2 or linguists, evaluated the usability of the generated multiple-choice clozes, which resulted in 93.55%. This result suggests that minor adjustments, i.e., the exclusion of target verbs that cause multiple-admissibility, are sufficient to consider verb government patterns usable until the possibility of dealing with multiple-admissible answers is addressed.
The implementation of some of the most important learning constructs for Italian proved feasible with current NLP tools, although quantitative evaluation of precision and recall of the designed rules is needed to evaluate the generation of exercises on authentic text. This work paves the way towards a full development stage of Italian in Revita and enables further pilot studies with actual learners, which will make it possible to measure learning outcomes in quantitative terms

Helsingin yliopiston digitaalinen arkisto

Open-Source Morphology for Endangered Mordvinic Languages

Author: Hämäläinen Mika
Partanen Niko
Rueter Jack
Publication venue: The Association for Computational Linguistics
Publication date: 01/01/2020

Peer reviewed

Crossref

Helsingin yliopiston digitaalinen arkisto

Neural Morphology Dataset and Models for Multiple Languages, from the Large to the Endangered

Author: Alnajjar Khalid
Hämäläinen Mika
Partanen Niko
Rueter Jack
Publication venue: Linkoping University Electronic Press
Publication date: 01/05/2021

Peer reviewed

Helsingin yliopiston digitaalinen arkisto

BaTelÒc: A text base for the Occitan language

Author: Bras Myriam
Vergez-Couret Marianne
Publication venue: University of Hawai'i Press
Publication date: 01/02/2016

Language Documentation, as defined by Himmelmann (2006), aims at compiling and preserving linguistic data for studies in linguistics, literature, history, ethnology, sociology. This initiative is vital for endangered languages such as Occitan, a Romance language spoken in southern France and in several valleys of Spain and Italy. The documentation of a language concerns all its modalities, covering spoken and written language, various registers and so on. Nowadays, Occitan documentation mostly consists of data from linguistic atlases, virtual libraries from the modern to the contemporary period, and text bases for the Middle Ages. BaTelÒc is a text base for modern and contemporary periods. With the aim of creating a wide coverage of text collections, BaTelÒc gathers not only written literary texts (prose, drama and poetry) but also other genres such as technical texts and newspapers. Enough material is already available to foresee a text base of hundreds of millions of words. BaTelÒc not only aims at documenting Occitan, it is also designed to provide tools to explore texts (different criteria for corpus selection, concordance tools and more complex enquiries with regular expressions). As for linguistic analysis, the second step is to enrich the corpora with annotations. Natural Language Processing of endangered languages such as Occitan is very challenging. It is not possible to transpose existing models for resource-rich languages directly, partly because of the spelling, dialectal variations, and lack of standardization. With BaTelÒc we aim at providing corpora and lexicons for the development of basic natural language processing tools, namely OCR and a Part-of-Speech tagger based on tools initially designed for machine translation and which take variation into account. National Foreign Language Resource Center

ScholarSpace at University of Hawai'i at Manoa