18 research outputs found
Fine-tuning Transformer-based Encoder for Turkish Language Understanding Tasks
Deep learning-based and, more recently, Transformer-based language models have
dominated natural language processing research in recent years. Thanks to their
accuracy and fast fine-tuning, they have outperformed traditional machine
learning approaches and achieved state-of-the-art results on many challenging
natural language understanding (NLU) problems. Recent studies have shown that
Transformer-based models such as BERT (Bidirectional Encoder Representations
from Transformers) reach impressive results on many tasks. Moreover, thanks to
their transfer learning capacity, these architectures allow us to take
pre-trained models and fine-tune them for specific NLU tasks such as question
answering. In this study, we provide a Transformer-based model and a baseline
benchmark for Turkish. We successfully fine-tuned a Turkish BERT model, namely
BERTurk trained with base settings, on several downstream tasks and evaluated
it on a Turkish benchmark dataset. Our models significantly outperform existing
baseline approaches for Named-Entity Recognition, Sentiment Analysis, Question
Answering and Text Classification in Turkish. We publicly release the four
fine-tuned models and resources for reproducibility and to support other
Turkish researchers and applications.
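Results on NER benchmarks of this kind are usually reported as entity-level precision, recall and F1 over predicted spans. The following is a generic, minimal sketch of that scoring scheme, not the benchmark's own evaluation script; the BIO tag sequences used below are hypothetical examples.

```python
# Minimal entity-level NER scoring: extract labelled spans from BIO tag
# sequences and compute micro precision/recall/F1 over exact span matches.

def bio_spans(tags):
    """Extract (label, start, end) entity spans from a BIO tag sequence."""
    spans, start, label = [], None, None
    for i, tag in enumerate(tags + ["O"]):          # sentinel flushes last span
        if tag.startswith("B-") or tag == "O":
            if start is not None:
                spans.append((label, start, i))
            start, label = (i, tag[2:]) if tag.startswith("B-") else (None, None)
        elif tag.startswith("I-") and label == tag[2:]:
            continue                                # extend the current span
        else:                                       # stray I- starts a new span
            if start is not None:
                spans.append((label, start, i))
            start, label = i, tag[2:]
    return set(spans)

def entity_f1(gold, pred):
    """Micro-averaged entity-level precision, recall and F1."""
    g, p = bio_spans(gold), bio_spans(pred)
    tp = len(g & p)
    prec = tp / len(p) if p else 0.0
    rec = tp / len(g) if g else 0.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return prec, rec, f1
```

Exact span matching is strict: a prediction that clips one token off an entity counts as both a false positive and a false negative.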
Notes on narrative, cognition, and cultural evolution
Drawing on non-Darwinian cultural-evolutionary approaches, the paper develops a broad, non-representational perspective on narrative, necessary to account for the narrative “ubiquity” hypothesis. It considers narrativity as a feature of intelligent behaviour and as a formative principle of symbolic representation (“narrative proclivity”). The narrative representation retains a relationship with the “primary” pre-symbolic narrativity of the basic orientational-interpretive (semiotic) behaviour affected by perceptually salient objects and “fits” in natural environments. The paper distinguishes between implicit narrativity (as the basic form of perceptual-cognitive mapping) of intelligent behaviour or non-narrative media, and the “narrative” as a symbolic representation. Human perceptual-attentional routines are enhanced by symbolic representations: due to its attention-monitoring and information-gathering function, narrative serves as a cognitive-exploratory tool facilitating cultural dynamics. The rise of new media and mass communication on the Web has thrown the ability of narrative to shape the public sphere through the ongoing process of negotiated sensemaking and interpretation in a particularly sharp relief
You can’t suggest that?! Comparisons and improvements of speller error models
In this article, we study the correction of spelling errors: specifically, how spelling errors are made and how we can model them computationally in order to fix them. The article describes two approaches to generating spelling correction suggestions for three Uralic languages: Estonian, North Sámi and South Sámi. The first approach to modelling spelling errors is rule-based: experts write rules describing the kinds of errors that are made, and these are compiled into a finite-state automaton that models the errors. The second is data-based: we show a machine learning algorithm a corpus of errors that humans have made, and it creates a neural network that can model the errors. Both approaches require collecting error corpora and understanding their contents; we therefore also describe the actual errors we have seen in detail. We find that while both approaches yield error correction systems, with current resources the expert-built systems are still more reliable.
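Both error models ultimately produce ranked correction candidates for a misspelled word. As a point of reference, the candidate-generation step can be sketched with a data-free edit-distance-1 baseline (this stands in for neither the article's finite-state nor its neural model; the alphabet, lexicon and frequency counts below are invented for illustration):

```python
# Baseline spelling-correction candidate generation: produce every string
# within one edit (delete, swap, replace, insert) of the input, keep those
# found in a lexicon, and rank them by corpus frequency.

ALPHABET = "abcdefghijklmnopqrstuvwxyzáâäõšž"   # extend for the target language

def edits1(word):
    """All strings at Damerau-Levenshtein distance 1 from `word`."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = {l + r[1:] for l, r in splits if r}
    swaps = {l + r[1] + r[0] + r[2:] for l, r in splits if len(r) > 1}
    replaces = {l + c + r[1:] for l, r in splits if r for c in ALPHABET}
    inserts = {l + c + r for l, r in splits for c in ALPHABET}
    return deletes | swaps | replaces | inserts

def suggest(word, lexicon_freq):
    """Rank in-lexicon candidates by corpus frequency (descending)."""
    if word in lexicon_freq:
        return [word]
    candidates = edits1(word) & lexicon_freq.keys()
    return sorted(candidates, key=lambda w: -lexicon_freq[w])
```

The article's models replace the uniform edit set with weighted errors, either hand-written by experts or learned from error corpora, which is what makes their rankings language- and error-aware.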
Knowledge discovery with CRF-based clustering of named entities without a priori classes
Knowledge discovery aims at bringing out coherent groups of entities. It is usually based on clustering, which requires defining a notion of similarity between the relevant entities. In this paper, we propose to divert a supervised machine learning technique (namely Conditional Random Fields, widely used for supervised labeling tasks) in order to calculate, indirectly and without supervision, similarities among text sequences. Our approach consists in generating artificial labeling problems on the data to reveal regularities between entities through their labeling. We describe how this framework can be implemented and test it on two information extraction/discovery tasks. The results demonstrate the usefulness of this unsupervised approach and open many avenues for defining similarities for complex representations of textual data.
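The core idea, entities that artificial classifiers keep assigning to the same label are deemed similar, can be illustrated without a CRF at all. The toy sketch below deliberately simplifies: a nearest-seed classifier over character n-grams stands in for the CRF, and the random seed pairs play the role of the artificial labeling problems. All names and entities are invented.

```python
import random

# Toy co-labeling similarity: repeatedly create an artificial two-class
# labeling problem from two random seed entities, label every entity by its
# closer seed (Jaccard on character trigrams), and use the fraction of
# problems in which two entities receive the same label as their similarity.

def ngrams(s, n=3):
    s = f"#{s}#"
    return {s[i:i + n] for i in range(len(s) - n + 1)}

def overlap(a, b):
    return len(a & b) / max(1, len(a | b))        # Jaccard similarity

def colabel_similarity(entities, probes=200, seed=0):
    rng = random.Random(seed)
    feats = {e: ngrams(e) for e in entities}
    agree = {(a, b): 0 for a in entities for b in entities}
    for _ in range(probes):
        # artificial labeling problem: two random seeds define the classes
        s0, s1 = rng.sample(entities, 2)
        labels = {e: int(overlap(feats[e], feats[s1]) >
                         overlap(feats[e], feats[s0])) for e in entities}
        for a in entities:
            for b in entities:
                agree[a, b] += labels[a] == labels[b]
    return {k: v / probes for k, v in agree.items()}
```

The paper's framework is richer, the CRF labels sequences using contextual features rather than nearest-seed lookups, but the unsupervised similarity-from-labeling mechanism is the same.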
The value of the Janes corpus for Slovenian normative linguistics
The aim of this paper is to assess the value of the Janes corpus for normative-linguistics research. Unlike the material in reference corpora, the texts in the Janes corpus are mostly not proofread, and the corpus therefore gives a more realistic picture of usage tendencies, i.e. of how (un)intuitive the existing language rules are for the wider language community. As a case study we selected combinations of a noun with a non-agreeing left attribute (solo petje, RTV prispevek). The analysis reveals that the reference corpus Kres and the Janes corpus differ significantly in how these combinations are written; that such combinations are more frequent and more varied in Janes than in Kres; that both corpora contain a high proportion of combinations whose spelling varies in usage, including at the level of individual attributes; and, at least at first sight surprisingly, that usage in Janes is more consistent, which suggests that language regulation of this problem increases variation in usage. The paper extends a conference contribution with additional data and content, including a discussion of possible further treatment of the selected language problem and, more broadly, of the significance and manner of integrating the Janes corpus into the methodology of Slovenian normative linguistics.
Argumentation Mining in User-Generated Web Discourse
The goal of argumentation mining, an evolving research field in computational
linguistics, is to design methods capable of analyzing people's argumentation.
In this article, we go beyond the state of the art in several ways. (i) We deal
with actual Web data and take up the challenges given by the variety of
registers, multiple domains, and unrestricted noisy user-generated Web
discourse. (ii) We bridge the gap between normative argumentation theories and
argumentation phenomena encountered in actual data by adapting an argumentation
model tested in an extensive annotation study. (iii) We create a new gold
standard corpus (90k tokens in 340 documents) and experiment with several
machine learning methods to identify argument components. We offer the data,
source codes, and annotation guidelines to the community under free licenses.
Our findings show that argumentation mining in user-generated Web discourse is
a feasible but challenging task.
Comment: Cite as: Habernal, I. & Gurevych, I. (2017). Argumentation Mining in User-Generated Web Discourse. Computational Linguistics 43(1), pp. 125-17
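The component-identification step that the machine learning methods address can be contrasted with a deliberately naive cue-word baseline, of the kind such systems are measured against. This is only an illustrative stand-in, not the article's method; the cue lists and sentences below are invented.

```python
# Naive cue-word baseline for labelling sentences as argument components
# (claim / premise / none): a sentence is labelled by the first cue list it
# matches. Real systems learn such signals (and many more) from annotated data.

PREMISE_CUES = ("because", "since", "for example", "studies show")
CLAIM_CUES = ("should", "must", "i believe", "in my opinion")

def label_component(sentence):
    s = sentence.lower()
    if any(cue in s for cue in PREMISE_CUES):
        return "premise"
    if any(cue in s for cue in CLAIM_CUES):
        return "claim"
    return "none"
```

On noisy user-generated discourse such surface cues fail often, which is precisely why the article argues for an annotated gold standard and learned models.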
Digitising Swiss German: how to process and study a polycentric spoken language
Swiss dialects of German are, unlike many dialects of other standardised languages, widely used in everyday communication. Despite this fact, automatic processing of Swiss German remains a considerable challenge, because it is mostly a spoken variety and is subject to considerable regional variation. This paper presents the ArchiMob corpus, a freely available general-purpose corpus of spoken Swiss German based on oral history interviews. The corpus is the result of a long design process, intensive manual work and specially adapted computational processing. We first present the modalities of access to the corpus for linguistic, historical and computational research. We then describe how the documents were transcribed, segmented and aligned with the sound source. This work involved a series of experiments that have led to automatically annotated normalisation and part-of-speech tagging layers. Finally, we present several case studies to motivate the use of the corpus for digital humanities in general and for dialectology in particular.
Peer reviewed
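A normalisation layer of the kind described maps regionally varying dialect spellings to a standard form. A common reference point for evaluating such layers is a memorisation baseline: map each dialect token to its most frequent normalised form in the training pairs, and pass unseen tokens through unchanged. The sketch below is that generic baseline, not the ArchiMob pipeline, and its Swiss German/standard German pairs are invented examples.

```python
from collections import Counter, defaultdict

# Memorisation baseline for dialect-to-standard normalisation: remember the
# most frequent normalised form seen for each dialect token in training.

def train_normaliser(pairs):
    """pairs: iterable of (dialect_token, standard_token)."""
    counts = defaultdict(Counter)
    for dialect, standard in pairs:
        counts[dialect][standard] += 1
    return {d: c.most_common(1)[0][0] for d, c in counts.items()}

def normalise(tokens, model):
    """Normalise known tokens; leave out-of-vocabulary tokens unchanged."""
    return [model.get(t, t) for t in tokens]
```

The baseline's weakness, no generalisation to unseen spellings, is what character-level models in corpus pipelines are meant to fix.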
Health data mining
In the healthcare domain, data-analysis techniques are increasingly popular and have even become indispensable for managing the large volumes of data produced for a patient and by the patient. Two themes are addressed in this HDR (habilitation) presentation. The first concerns the definition, formalisation, implementation and validation of analysis methods for describing the content of medical databases. I have been particularly interested in sequential data, extending the classical notion of sequential pattern to integrate contextual and spatial components and the partial order of the elements composing the patterns; this new information enriches the patterns' initial semantics. The second theme focuses on analysing patients' productions and interactions on social media. I have mainly worked on methods for analysing patients' narrative productions according to their temporality, their topics, the associated sentiments, and the role and reputation of the speaker expressing themselves in the messages.
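The classical sequential patterns that the extensions above build on can be illustrated in miniature: count ordered event pairs across patient event sequences and keep those meeting a support threshold. This sketch covers only plain sequential patterns, not the contextual, spatial or partial-order extensions discussed; the event codes are invented.

```python
from itertools import combinations

# Minimal sequential-pattern counting: an ordered pair (a, b) is supported by
# a sequence if a occurs before b in it; keep pairs whose support (number of
# supporting sequences) reaches min_support.

def frequent_pairs(sequences, min_support):
    """Ordered event pairs occurring in at least `min_support` sequences."""
    support = {}
    for seq in sequences:
        seen = set()
        for a, b in combinations(seq, 2):         # respects within-sequence order
            if a != b:
                seen.add((a, b))                  # count once per sequence
        for pair in seen:
            support[pair] = support.get(pair, 0) + 1
    return {p: c for p, c in support.items() if c >= min_support}
```

Full sequential-pattern miners extend this idea to longer patterns and, in the work described above, attach contextual and spatial information to the pattern elements.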