18 research outputs found
Fine-tuning Transformer-based Encoder for Turkish Language Understanding Tasks
Deep learning-based and, more recently, Transformer-based language models have
dominated natural language processing research in recent years. Thanks to their
accuracy and fast fine-tuning, they have outperformed traditional machine
learning approaches and achieved state-of-the-art results on many challenging
natural language understanding (NLU) problems. Recent studies have shown that
Transformer-based models such as BERT (Bidirectional Encoder Representations
from Transformers) reach impressive results on many tasks. Moreover, thanks to
their transfer learning capacity, these architectures allow us to take
pre-trained models and fine-tune them for specific NLU tasks such as question
answering. In this study, we provide a Transformer-based model and a baseline
benchmark for Turkish. We successfully fine-tuned a Turkish BERT model, namely
BERTurk trained with base settings, on several downstream tasks and evaluated
it on a Turkish benchmark dataset. Our models significantly outperform existing
baseline approaches for Named-Entity Recognition, Sentiment Analysis, Question
Answering and Text Classification in Turkish. We publicly release the four
fine-tuned models and resources for reproducibility and to support other
Turkish researchers and applications.
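Results on NER benchmarks of this kind are usually reported as entity-level precision, recall and F1 over predicted spans. The following is a generic, minimal sketch of that scoring scheme, not the benchmark's own evaluation script; the BIO tag sequences used below are hypothetical examples.

```python
# Minimal entity-level NER scoring: extract labelled spans from BIO tag
# sequences and compute micro precision/recall/F1 over exact span matches.

def bio_spans(tags):
    """Extract (label, start, end) entity spans from a BIO tag sequence."""
    spans, start, label = [], None, None
    for i, tag in enumerate(tags + ["O"]):          # sentinel flushes last span
        if tag.startswith("B-") or tag == "O":
            if start is not None:
                spans.append((label, start, i))
            start, label = (i, tag[2:]) if tag.startswith("B-") else (None, None)
        elif tag.startswith("I-") and label == tag[2:]:
            continue                                # extend the current span
        else:                                       # stray I- starts a new span
            if start is not None:
                spans.append((label, start, i))
            start, label = i, tag[2:]
    return set(spans)

def entity_f1(gold, pred):
    """Micro-averaged entity-level precision, recall and F1."""
    g, p = bio_spans(gold), bio_spans(pred)
    tp = len(g & p)
    prec = tp / len(p) if p else 0.0
    rec = tp / len(g) if g else 0.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return prec, rec, f1
```

Exact span matching is strict: a prediction that clips one token off an entity counts as both a false positive and a false negative.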
Notes on narrative, cognition, and cultural evolution
Drawing on non-Darwinian cultural-evolutionary approaches, the paper develops a broad, non-representational perspective on narrative, necessary to account for the narrative “ubiquity” hypothesis. It considers narrativity as a feature of intelligent behaviour and as a formative principle of symbolic representation (“narrative proclivity”). The narrative representation retains a relationship with the “primary” pre-symbolic narrativity of the basic orientational-interpretive (semiotic) behaviour affected by perceptually salient objects and “fits” in natural environments. The paper distinguishes between implicit narrativity (as the basic form of perceptual-cognitive mapping) of intelligent behaviour or non-narrative media, and the “narrative” as a symbolic representation. Human perceptual-attentional routines are enhanced by symbolic representations: due to its attention-monitoring and information-gathering function, narrative serves as a cognitive-exploratory tool facilitating cultural dynamics. The rise of new media and mass communication on the Web has thrown the ability of narrative to shape the public sphere through the ongoing process of negotiated sensemaking and interpretation in a particularly sharp relief
You can’t suggest that?! Comparisons and improvements of speller error models
In this article, we study the correction of spelling errors: specifically, how spelling errors are made and how we can model them computationally in order to fix them. The article describes two approaches to generating spelling correction suggestions for three Uralic languages: Estonian, North Sámi and South Sámi. The first approach to modelling spelling errors is rule-based: experts write rules describing the kinds of errors that are made, and these are compiled into a finite-state automaton that models the errors. The second is data-based: we show a machine learning algorithm a corpus of errors that humans have made, and it creates a neural network that can model the errors. Both approaches require collecting error corpora and understanding their contents; we therefore also describe the actual errors we have seen in detail. We find that while both approaches yield error correction systems, with current resources the expert-built systems are still more reliable.
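Both error models ultimately produce ranked correction candidates for a misspelled word. As a point of reference, the candidate-generation step can be sketched with a data-free edit-distance-1 baseline (this stands in for neither the article's finite-state nor its neural model; the alphabet, lexicon and frequency counts below are invented for illustration):

```python
# Baseline spelling-correction candidate generation: produce every string
# within one edit (delete, swap, replace, insert) of the input, keep those
# found in a lexicon, and rank them by corpus frequency.

ALPHABET = "abcdefghijklmnopqrstuvwxyzáâäõšž"   # extend for the target language

def edits1(word):
    """All strings at Damerau-Levenshtein distance 1 from `word`."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = {l + r[1:] for l, r in splits if r}
    swaps = {l + r[1] + r[0] + r[2:] for l, r in splits if len(r) > 1}
    replaces = {l + c + r[1:] for l, r in splits if r for c in ALPHABET}
    inserts = {l + c + r for l, r in splits for c in ALPHABET}
    return deletes | swaps | replaces | inserts

def suggest(word, lexicon_freq):
    """Rank in-lexicon candidates by corpus frequency (descending)."""
    if word in lexicon_freq:
        return [word]
    candidates = edits1(word) & lexicon_freq.keys()
    return sorted(candidates, key=lambda w: -lexicon_freq[w])
```

The article's models replace the uniform edit set with weighted errors, either hand-written by experts or learned from error corpora, which is what makes their rankings language- and error-aware.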
Knowledge discovery with CRF-based clustering of named entities without a priori classes
Knowledge discovery aims at bringing out coherent groups of entities. It is usually based on clustering, which requires defining a notion of similarity between the relevant entities. In this paper, we propose to divert a supervised machine learning technique (namely Conditional Random Fields, widely used for supervised labeling tasks) in order to calculate, indirectly and without supervision, similarities among text sequences. Our approach consists in generating artificial labeling problems on the data to reveal regularities between entities through their labeling. We describe how this framework can be implemented and test it on two information extraction/discovery tasks. The results demonstrate the usefulness of this unsupervised approach and open many avenues for defining similarities for complex representations of textual data.
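The core idea, entities that artificial classifiers keep assigning to the same label are deemed similar, can be illustrated without a CRF at all. The toy sketch below deliberately simplifies: a nearest-seed classifier over character n-grams stands in for the CRF, and the random seed pairs play the role of the artificial labeling problems. All names and entities are invented.

```python
import random

# Toy co-labeling similarity: repeatedly create an artificial two-class
# labeling problem from two random seed entities, label every entity by its
# closer seed (Jaccard on character trigrams), and use the fraction of
# problems in which two entities receive the same label as their similarity.

def ngrams(s, n=3):
    s = f"#{s}#"
    return {s[i:i + n] for i in range(len(s) - n + 1)}

def overlap(a, b):
    return len(a & b) / max(1, len(a | b))        # Jaccard similarity

def colabel_similarity(entities, probes=200, seed=0):
    rng = random.Random(seed)
    feats = {e: ngrams(e) for e in entities}
    agree = {(a, b): 0 for a in entities for b in entities}
    for _ in range(probes):
        # artificial labeling problem: two random seeds define the classes
        s0, s1 = rng.sample(entities, 2)
        labels = {e: int(overlap(feats[e], feats[s1]) >
                         overlap(feats[e], feats[s0])) for e in entities}
        for a in entities:
            for b in entities:
                agree[a, b] += labels[a] == labels[b]
    return {k: v / probes for k, v in agree.items()}
```

The paper's framework is richer, the CRF labels sequences using contextual features rather than nearest-seed lookups, but the unsupervised similarity-from-labeling mechanism is the same.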
The value of the Janes corpus for Slovenian normative linguistics
The aim of this paper is to assess the value of the Janes corpus for normative-linguistics research. Unlike the material in reference corpora, the texts in the Janes corpus are mostly not proofread, and the corpus therefore gives a more realistic picture of usage tendencies, i.e. of how (un)intuitive the existing language rules are for the wider language community. As a case study we selected combinations of a noun with a non-agreeing left attribute (solo petje, RTV prispevek). The analysis reveals that the reference corpus Kres and the Janes corpus differ significantly in how these combinations are written; that such combinations are more frequent and more varied in Janes than in Kres; that both corpora contain a high proportion of combinations whose spelling varies in usage, including at the level of individual attributes; and, at least at first sight surprisingly, that usage in Janes is more consistent, which suggests that language regulation of this problem increases variation in usage. The paper extends a conference contribution with additional data and content, including a discussion of possible further treatment of the selected language problem and, more broadly, of the significance and manner of integrating the Janes corpus into the methodology of Slovenian normative linguistics.
Argumentation Mining in User-Generated Web Discourse
The goal of argumentation mining, an evolving research field in computational
linguistics, is to design methods capable of analyzing people's argumentation.
In this article, we go beyond the state of the art in several ways. (i) We deal
with actual Web data and take up the challenges given by the variety of
registers, multiple domains, and unrestricted noisy user-generated Web
discourse. (ii) We bridge the gap between normative argumentation theories and
argumentation phenomena encountered in actual data by adapting an argumentation
model tested in an extensive annotation study. (iii) We create a new gold
standard corpus (90k tokens in 340 documents) and experiment with several
machine learning methods to identify argument components. We offer the data,
source codes, and annotation guidelines to the community under free licenses.
Our findings show that argumentation mining in user-generated Web discourse is
a feasible but challenging task.
Comment: Cite as: Habernal, I. & Gurevych, I. (2017). Argumentation Mining in User-Generated Web Discourse. Computational Linguistics 43(1), pp. 125-17
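The component-identification step that the machine learning methods address can be contrasted with a deliberately naive cue-word baseline, of the kind such systems are measured against. This is only an illustrative stand-in, not the article's method; the cue lists and sentences below are invented.

```python
# Naive cue-word baseline for labelling sentences as argument components
# (claim / premise / none): a sentence is labelled by the first cue list it
# matches. Real systems learn such signals (and many more) from annotated data.

PREMISE_CUES = ("because", "since", "for example", "studies show")
CLAIM_CUES = ("should", "must", "i believe", "in my opinion")

def label_component(sentence):
    s = sentence.lower()
    if any(cue in s for cue in PREMISE_CUES):
        return "premise"
    if any(cue in s for cue in CLAIM_CUES):
        return "claim"
    return "none"
```

On noisy user-generated discourse such surface cues fail often, which is precisely why the article argues for an annotated gold standard and learned models.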
Digitising Swiss German: how to process and study a polycentric spoken language
Swiss dialects of German are, unlike many dialects of other standardised languages, widely used in everyday communication. Despite this fact, automatic processing of Swiss German remains a considerable challenge, because it is mostly a spoken variety and is subject to considerable regional variation. This paper presents the ArchiMob corpus, a freely available general-purpose corpus of spoken Swiss German based on oral history interviews. The corpus is the result of a long design process, intensive manual work and specially adapted computational processing. We first present the modalities of access to the corpus for linguistic, historical and computational research. We then describe how the documents were transcribed, segmented and aligned with the sound source. This work involved a series of experiments that have led to automatically annotated normalisation and part-of-speech tagging layers. Finally, we present several case studies to motivate the use of the corpus for digital humanities in general and for dialectology in particular.
Peer reviewed
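A normalisation layer of the kind described maps regionally varying dialect spellings to a standard form. A common reference point for evaluating such layers is a memorisation baseline: map each dialect token to its most frequent normalised form in the training pairs, and pass unseen tokens through unchanged. The sketch below is that generic baseline, not the ArchiMob pipeline, and its Swiss German/standard German pairs are invented examples.

```python
from collections import Counter, defaultdict

# Memorisation baseline for dialect-to-standard normalisation: remember the
# most frequent normalised form seen for each dialect token in training.

def train_normaliser(pairs):
    """pairs: iterable of (dialect_token, standard_token)."""
    counts = defaultdict(Counter)
    for dialect, standard in pairs:
        counts[dialect][standard] += 1
    return {d: c.most_common(1)[0][0] for d, c in counts.items()}

def normalise(tokens, model):
    """Normalise known tokens; leave out-of-vocabulary tokens unchanged."""
    return [model.get(t, t) for t in tokens]
```

The baseline's weakness, no generalisation to unseen spellings, is what character-level models in corpus pipelines are meant to fix.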
Health data mining
In the healthcare domain, data-analysis techniques are increasingly popular and have even become indispensable for managing the large volumes of data produced for a patient and by the patient. Two themes are addressed in this HDR (habilitation) presentation. The first concerns the definition, formalisation, implementation and validation of analysis methods for describing the content of medical databases. I have been particularly interested in sequential data, extending the classical notion of sequential pattern to integrate contextual and spatial components and the partial order of the elements composing the patterns; this new information enriches the patterns' initial semantics. The second theme focuses on analysing patients' productions and interactions on social media. I have mainly worked on methods for analysing patients' narrative productions according to their temporality, their topics, the associated sentiments, and the role and reputation of the speaker expressing themselves in the messages.
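The classical sequential patterns that the extensions above build on can be illustrated in miniature: count ordered event pairs across patient event sequences and keep those meeting a support threshold. This sketch covers only plain sequential patterns, not the contextual, spatial or partial-order extensions discussed; the event codes are invented.

```python
from itertools import combinations

# Minimal sequential-pattern counting: an ordered pair (a, b) is supported by
# a sequence if a occurs before b in it; keep pairs whose support (number of
# supporting sequences) reaches min_support.

def frequent_pairs(sequences, min_support):
    """Ordered event pairs occurring in at least `min_support` sequences."""
    support = {}
    for seq in sequences:
        seen = set()
        for a, b in combinations(seq, 2):         # respects within-sequence order
            if a != b:
                seen.add((a, b))                  # count once per sequence
        for pair in seen:
            support[pair] = support.get(pair, 0) + 1
    return {p: c for p, c in support.items() if c >= min_support}
```

Full sequential-pattern miners extend this idea to longer patterns and, in the work described above, attach contextual and spatial information to the pattern elements.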