1,007 research outputs found

    Croatian Corpus of Non‐Professional Written Language – Typical speakers and speakers with language disorders

    Corpora, as annotated archives of human communication, are objective, reliable resources for language analysis. Here we present the corpus of non-professional written Croatian, based on 1-year sampling of writings by typical speakers and speakers with language disorders. This corpus provides a unique resource because it samples language used by non-professionals, in contrast to corpora based on texts by professional writers (such as journalists, scholars or novelists) sampled over more than a century. In addition, our corpus contains written language from typical and impaired speakers sampled under identical conditions, allowing detailed analyses of language use. This paper describes the language tasks (essay, story generation, non-formal and formal letter, and dictation) used to elicit text production, and the sampling and annotation procedures used to generate the corpus. Its usefulness is illustrated through language productivity analyses of transcripts of different genres produced by writers of different ages and language status. This corpus may prove useful for the analysis of writing skills in typical and language-impaired speakers of Croatian.

    HRVATSKI KORPUS GOVORNOG JEZIKA (HrAL)

    Interest in spoken-language corpora has increased over the past two decades, leading to the development of new corpora and the discovery of new facets of spoken language. These types of corpora represent the most comprehensive data source about the language of ordinary speakers. Such corpora are based on spontaneous, unscripted speech defined by a variety of styles, registers and dialects. The aim of this paper is to present the Croatian Adult Spoken Language Corpus (HrAL), its structure and its possible applications in different linguistic subfields. HrAL was built by sampling spontaneous conversations among 617 speakers from all Croatian counties, and it comprises more than 250,000 tokens and more than 100,000 types. Data were collected during three time slots: from 2010 to 2012, from 2014 to 2015 and during 2016. HrAL is now available within TalkBank, a large database of spoken-language corpora covering different languages (https://talkbank.org), in the Conversational Analyses corpora within the subsection titled Conversational Banks. Data were transcribed, coded and segmented using the transcription format Codes for Human Analysis of Transcripts (CHAT) and the Computerised Language Analysis (CLAN) suite of programmes within the TalkBank toolkit. Speech streams were segmented into communication units (C-units) based on syntactic criteria. Most transcripts were linked to their source audio. TalkBank is publicly and freely accessible, i.e. all data stored in it can be shared by the wider community in accordance with the basic rules of TalkBank. HrAL provides information about spoken grammar and lexicon, discourse skills, error production and productivity in general. It may be useful for sociolinguistic research and studies of synchronic language changes in Croatian.

    Croatian Speech Recognition

    Cross-lingual Argumentation Mining: Machine Translation (and a bit of Projection) is All You Need!

    Argumentation mining (AM) requires the identification of complex discourse structures and has lately been applied with success monolingually. In this work, we show that the existing resources are, however, not adequate for assessing cross-lingual AM, due to their heterogeneity or lack of complexity. We therefore create suitable parallel corpora by (human and machine) translating a popular AM dataset consisting of persuasive student essays into German, French, Spanish, and Chinese. We then compare (i) annotation projection and (ii) bilingual word embeddings based direct transfer strategies for cross-lingual AM, finding that the former performs considerably better and almost eliminates the loss from cross-lingual transfer. Moreover, we find that annotation projection works equally well when using either costly human or cheap machine translations. Our code and data are available at http://github.com/UKPLab/coling2018-xling_argument_mining. Comment: Accepted at Coling 201

    Mimicking Word Embeddings using Subword RNNs

    Word embeddings improve generalization over lexical features by placing each word in a lower-dimensional space, using distributional information obtained from unlabeled data. However, the effectiveness of word embeddings for downstream NLP tasks is limited by out-of-vocabulary (OOV) words, for which embeddings do not exist. In this paper, we present MIMICK, an approach to generating OOV word embeddings compositionally, by learning a function from spellings to distributional embeddings. Unlike prior work, MIMICK does not require re-training on the original word embedding corpus; instead, learning is performed at the type level. Intrinsic and extrinsic evaluations demonstrate the power of this simple approach. On 23 languages, MIMICK improves performance over a word-based baseline for tagging part-of-speech and morphosyntactic attributes. It is competitive with (and complementary to) a supervised character-based model in low-resource settings. Comment: EMNLP 201

    Literary machine translation under the magnifying glass : assessing the quality of an NMT-translated detective novel on document level

    Several studies (covering many language pairs and translation tasks) have demonstrated that translation quality has improved enormously since the emergence of neural machine translation systems. This raises the question of whether such systems are able to produce high-quality translations for more creative text types such as literature and whether they are able to generate coherent translations at the document level. Our study aimed to investigate these two questions by carrying out a document-level evaluation of the raw NMT output of an entire novel. We translated Agatha Christie's novel The Mysterious Affair at Styles with Google's NMT system from English into Dutch and annotated it in two steps: first all fluency errors, then all accuracy errors. We report on the overall quality, determine the remaining issues, compare the most frequent error types to those in general-domain MT, and investigate whether any accuracy and fluency errors co-occur regularly. Additionally, we assess the inter-annotator agreement on the first chapter of the novel.

    Uvid u automatsko izlučivanje metaforičkih kolokacija

    Collocations have been the subject of much scientific research over the years. The focus of this research is on a subset of collocations, namely metaphorical collocations. In metaphorical collocations, a semantic shift has taken place in one of the components, i.e., one of the components takes on a transferred meaning. The main goal of this paper is to review the existing literature and provide a systematic overview of the existing research on collocation extraction, as well as an overview of existing methods, measures, and resources. The existing research is classified according to the approach (statistical, hybrid, and distributional semantics) and presented in three separate sections. Different association measures and existing ways of evaluating the results of automatic collocation extraction are also described. The insights gained from existing research serve as a first step in exploring the possibility of developing a method for automatic extraction of metaphorical collocations. The methods, tools, and resources that may prove useful for future work are highlighted.

    Procjena kvalitete strojnog prijevoda govora: studija slučaja aplikacije ILA

    Machine translation (MT) is becoming qualitatively more successful and quantitatively more productive at an unprecedented pace. It is becoming a widespread solution to the challenges of a constantly rising demand for quick and affordable translations of both text and speech, causing disruption and adjustments of the translation practice and profession, but at the same time making multilingual communication easier than ever before. This paper focuses on the speech-to-speech (S2S) translation app Instant Language Assistant (ILA), which brings together state-of-the-art translation technologies (automatic speech recognition, machine translation and text-to-speech synthesis) and allows for MT-mediated multilingual communication. The aim of the paper is to assess the quality of translations of conversational language produced by the S2S translation app ILA for the en-de and en-hr language pairs. The research includes several levels of translation quality analysis: human translation quality assessment by translation experts using the Fluency/Adequacy Metrics, light post-editing, and automated MT evaluation (BLEU). Moreover, the translation output is assessed with respect to language pairs to gain insight into whether and how they affect the MT output quality. The results show a relatively high quality of translations produced by the S2S translation app ILA across all assessment models and a correlation between human and automated assessment results.

    Negative vaccine voices in Swedish social media

    Vaccinations are one of the most significant interventions in public health, but vaccine hesitancy creates concerns for a portion of the population in many countries, including Sweden. Since discussions on vaccine hesitancy often take place on social networking sites, data from Swedish social media are used to study and quantify the sentiment among the discussants on the vaccination-or-not topic during phases of the COVID-19 pandemic. Of all the posts analyzed, a majority showed a stronger negative sentiment, which prevailed throughout the examined period, with some spikes due to certain vaccine-related events distinguishable in the results. Sentiment analysis can be a valuable tool to track public opinions regarding the use, efficacy, safety, and importance of vaccination.