
    Comparing the nonstandard language of Slovene, Croatian and Serbian tweets

    In this paper we carry out a cross-lingual comparison of nonstandard features in the language of social media for Slovene, Croatian and Serbian. The goal of the analysis is twofold: (1) we try to establish the extent to which the observed phenomena are universal rather than language-specific, and (2) we propose an approach for automatic scoring of (non)standardness levels of user-generated content, which can be used as a separate annotation layer in corpora. Quantitative and qualitative analyses of the results show that the majority of the language used on Twitter is fairly standard, especially in Slovene and Croatian. The prevalent characteristic of nonstandard Slovene tweets is nonstandard orthography, while nonstandard lexis is more typical of Serbian tweets, possibly due to a younger user profile.
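The abstract does not spell out how the (non)standardness score is computed; as a minimal illustrative sketch, one token-level proxy is the share of tokens missing from a standard-language lexicon (the lexicon and example tweets below are made up and stand in for, rather than reproduce, the paper's method):

```python
def nonstandardness(tokens, lexicon):
    """Share of tokens not found in a standard-language lexicon.

    0.0 = fully standard vocabulary, 1.0 = fully nonstandard.
    Purely illustrative -- not the scoring model proposed in the paper.
    """
    if not tokens:
        return 0.0
    unknown = sum(1 for t in tokens if t.lower() not in lexicon)
    return unknown / len(tokens)

# Tiny illustrative lexicon and tweets (not real corpus data).
standard = {"to", "je", "bilo", "jako", "dobro"}
print(nonstandardness("to je bilo jako dobro".split(), standard))    # 0.0
print(nonstandardness("to je bilo jakoo dobroo".split(), standard))  # 0.4
```

A per-tweet score like this could then be attached to each document as a separate annotation layer, as the paper proposes.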

    Developing a Method for Detailed Morphosyntactic Tagging of Serbian

    This paper presents an experiment in detailed morphosyntactic tagging of the Serbian subcorpus of the parallel Serbian-French-English ParCoLab corpus. We enriched an existing POS annotation with finer-grained morphosyntactic properties in order to prepare the corpus for subsequent parsing stages. We compared three approaches: 1) manual annotation; 2) pre-annotation with a tagger trained on Croatian, followed by manual correction; 3) retraining the model on a small validated sample of the corpus (20K tokens), followed by automatic annotation and manual correction. The Croatian model remains globally stable when applied to Serbian texts, but due to the differences between the two tagsets, substantial manual intervention was still required. A new model was trained on a validated sample of the corpus: it has the same accuracy as the existing model, but the observed speed-up of the manual correction confirms that it is better suited to the task. Keywords: morphosyntactic annotation, training corpus, Serbian.

    Babel Treebank of Public Messages in Croatian

    The paper presents the process of constructing a publicly available treebank of public messages written in Croatian. The messages were collected from various electronic sources – e-mail, blogs, Facebook and SMS – and published on the LED facade of the Zagreb Museum of Contemporary Art within the Babel art project. The project aimed to use the facade as an open-space blog or social interface enabling citizens to publicly express their views. The construction and current state of the treebank are presented along with plans for future work. A comparison of the Babel Treebank with the Croatian Dependency Treebank and the SETimes.HR treebank regarding their differing domains and annotation schemes is briefly sketched. The treebank is used as a test platform for introducing a new standard for the syntactic annotation of Croatian texts. An experiment with morphosyntactic tagging and dependency parsing of the treebank is conducted, providing a first insight into the computational processing of non-standard text in Croatian.

    Mimicking Word Embeddings using Subword RNNs

    Word embeddings improve generalization over lexical features by placing each word in a lower-dimensional space, using distributional information obtained from unlabeled data. However, the effectiveness of word embeddings for downstream NLP tasks is limited by out-of-vocabulary (OOV) words, for which embeddings do not exist. In this paper, we present MIMICK, an approach to generating OOV word embeddings compositionally, by learning a function from spellings to distributional embeddings. Unlike prior work, MIMICK does not require re-training on the original word embedding corpus; instead, learning is performed at the type level. Intrinsic and extrinsic evaluations demonstrate the power of this simple approach. On 23 languages, MIMICK improves performance over a word-based baseline for tagging part-of-speech and morphosyntactic attributes. It is competitive with (and complementary to) a supervised character-based model in low-resource settings. Comment: EMNLP 201
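MIMICK itself learns the spelling-to-embedding function with a character BiLSTM; the underlying type-level idea, fitting a compositional model only on the pretrained vocabulary with no corpus access, can be sketched with a much cruder stand-in, a character-trigram model fit by plain SGD (toy embeddings, illustrative only, not the paper's architecture):

```python
import random

def trigrams(word):
    """Character trigrams of a word padded with boundary markers."""
    w = f"#{word}#"
    return [w[i:i + 3] for i in range(len(w) - 2)]

def train_mimic(embeddings, dim, epochs=200, lr=0.1, seed=0):
    """Fit trigram vectors so that summing a word's trigram vectors
    approximates its pretrained embedding. Training is type-level:
    one example per vocabulary word, no corpus needed."""
    rng = random.Random(seed)
    vecs = {}
    for word in embeddings:
        for g in trigrams(word):
            vecs.setdefault(g, [rng.uniform(-0.1, 0.1) for _ in range(dim)])
    words = list(embeddings)
    for _ in range(epochs):
        rng.shuffle(words)
        for word in words:
            gs = trigrams(word)
            pred = [sum(vecs[g][j] for g in gs) for j in range(dim)]
            err = [p - t for p, t in zip(pred, embeddings[word])]
            for g in gs:  # gradient step on squared error
                for j in range(dim):
                    vecs[g][j] -= lr * err[j]
    return vecs

def embed_oov(word, vecs, dim):
    """Compose an embedding for an unseen word from its known trigrams."""
    gs = [g for g in trigrams(word) if g in vecs]
    if not gs:
        return [0.0] * dim
    return [sum(vecs[g][j] for g in gs) for j in range(dim)]

# Toy pretrained embeddings (made up for illustration).
pretrained = {"walk": [1.0, 0.0], "walked": [1.0, 1.0], "talk": [0.5, 0.0]}
vecs = train_mimic(pretrained, dim=2)
print(embed_oov("walks", vecs, dim=2))  # composed from trigrams shared with known words
```

The trigram-sum model captures the compositional intuition but none of the ordering sensitivity of the paper's RNN; it is only meant to make the type-level training setup concrete.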

    hr500k – A Reference Training Corpus of Croatian.

    In this paper we present hr500k, a Croatian reference training corpus of 500 thousand tokens, segmented at the document, sentence and word level, and annotated for morphosyntax, lemmas, dependency syntax, named entities, and semantic roles. We present each annotation layer via basic label statistics and describe the final encoding of the resource in the CoNLL and TEI formats. We also describe the rather turbulent history of the resource and give insights into the topic and genre distribution in the corpus. Finally, we discuss further enrichments of the corpus with additional layers, which are already underway.
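The CoNLL encoding mentioned above uses the familiar tab-separated token-per-line layout; a minimal reader for such files, assuming the ten-column CoNLL-U order, might look like this (the example sentence is a toy, not actual hr500k content):

```python
def read_conllu(lines):
    """Parse CoNLL-U formatted lines into sentences of token dicts.

    Keeps only a few of the ten standard columns; comment lines,
    multiword-token ranges (IDs like '3-4') and empty nodes (IDs
    like '8.1') are skipped.
    """
    sentences, current = [], []
    for line in lines:
        line = line.rstrip("\n")
        if not line:                       # blank line ends a sentence
            if current:
                sentences.append(current)
                current = []
            continue
        if line.startswith("#"):
            continue
        cols = line.split("\t")
        if "-" in cols[0] or "." in cols[0]:
            continue
        current.append({"id": int(cols[0]), "form": cols[1],
                        "lemma": cols[2], "upos": cols[3],
                        "head": int(cols[6])})
    if current:
        sentences.append(current)
    return sentences

# Toy two-token sentence in CoNLL-U column order.
demo = [
    "# text = Pas laje",
    "1\tPas\tpas\tNOUN\t_\t_\t2\tnsubj\t_\t_",
    "2\tlaje\tlajati\tVERB\t_\t_\t0\troot\t_\t_",
]
for sent in read_conllu(demo):
    print([(t["form"], t["upos"], t["head"]) for t in sent])
    # [('Pas', 'NOUN', 2), ('laje', 'VERB', 0)]
```

A real consumer of hr500k would also want the remaining columns (XPOS, FEATS, DEPREL, MISC), which the same split yields at indices 4, 5, 7 and 9.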

    A Methodological Contribution to the Description of Syntagmatic Units in Specialised Discourse

    This paper presents the results of a descriptive terminology study conducted on a monolingual specialised corpus of the subject field of karst science. A semasiologically oriented approach is used for the distributional analysis of terminological units and their co-occurrences in discourse. Drawing on the lexicon-grammar model of classes of objects, semantic classes are created on the basis of the duality of syntagmatic and paradigmatic relations. In the description of syntagmatic units, particular attention is paid to the role of adjectives both as qualifiers and as carriers of specialised meaning. The active role of adjectives is shown through a method of paraphrasing their meaning with respect to the classes of concepts they determine. The analysis of adjectives in terminological phrases points to a multidimensional, dynamic structuring of the conceptual system of a specialised field of knowledge such as karstology.

    Extracting English Words from a Corpus of Croatian

    As the lingua franca of the modern age, English has become the dominant donor language for many languages, including Croatian. The influence of English on Croatian is evident across different registers and at almost all linguistic levels, but it is most pronounced at the lexical one. Recently, more and more English words have started to appear in their unadapted form (e.g., freelancer, chat, e-mail) in Croatian, especially in the news and on social media, yet there are still no concrete data on such words in Croatian. English words can be extracted from corpora manually, by using existing corpus-linguistics tools, or by developing new tools. The aim of this paper is to examine whether the existing tools for Croatian can yield a list of unadapted English words. For that purpose, the Croatian web corpus hrWaC was analysed using the Sketch Engine platform. A list of 1217 English words was compiled using this method. The results showed that it is possible to compile a list of English words and their frequencies with the available tools and resources for Croatian, but also that there are many problems due to which the results cannot be considered completely reliable. Moreover, the procedure still has to be combined with manual methods and classifications, and new tools for the automatic extraction of English words from a corpus of Croatian need to be developed.
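As a rough sketch of the extraction idea described above, one could flag tokens that appear in an English wordlist but not in a Croatian lexicon. This heuristic is a stand-in for, not a reproduction of, the Sketch Engine workflow used in the paper, and, as the paper itself notes, its output would still need manual review:

```python
from collections import Counter

def extract_english(tokens, english_words, croatian_words):
    """Count tokens that look like unadapted English words: present in an
    English wordlist but absent from a Croatian lexicon. False positives
    (proper names, shared internationalisms) make manual review necessary.
    """
    counts = Counter(t.lower() for t in tokens)
    return {w: c for w, c in counts.items()
            if w in english_words and w not in croatian_words}

# Toy corpus and wordlists (illustrative, not hrWaC data).
tokens = "Poslao sam ti e-mail i chat log".split()
english = {"e-mail", "chat", "log", "i"}
croatian = {"poslao", "sam", "ti", "i", "log"}   # "log" also exists in Croatian
print(extract_english(tokens, english, croatian))  # {'e-mail': 1, 'chat': 1}
```

Note how words present in both lexica ("i", "log" in this toy example) are silently dropped, which is exactly the kind of ambiguity that makes purely automatic extraction unreliable.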

    Big Data Analytics and the Social Web: a Tutorial for the Social Scientist

    The social web or Web 2.0 has become the biggest and most accessible repository of data about human (social) behavior in history. Due to a knowledge gap between big data analytics and established social science methodology, this enormous source of information has yet to be exploited for new and interesting studies in various social and humanities-related fields. To make one step towards closing this gap, we provide a detailed step-by-step tutorial on some of the most important web mining and analytics methods, applied in a real-world study of Croatia's biggest political blogging site. The tutorial covers methods for data retrieval; data conversion, cleansing and organization; data analysis (natural language processing, social and conceptual network analysis); and data visualization and interpretation. All tools implemented for the sake of this study, the data sets produced at the various steps, and the resulting visualizations have been published online and are free to use. The tutorial is not meant to be a comprehensive overview and detailed description of all possible ways of analyzing data from the social web, but using the steps outlined herein one can certainly reproduce the results of the study or apply the same or similar methodology to other datasets. The results of the study show that one particular conceptual network generated by natural language processing of articles on the blogging site, namely a conceptual network constructed by the rule that two concepts (keywords) are connected if they were extracted from the same article, is the best predictor of the current political discourse in Croatia when compared to the other constructed conceptual networks. These results indicate that a comprehensive study is needed to investigate this conceptual structure further, with an emphasis on the dynamic processes that led to the construction of the network.
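The construction rule for the best-performing conceptual network, connecting two concepts whenever they were extracted from the same article, translates directly into code (toy keyword lists, not the study's data):

```python
from itertools import combinations
from collections import Counter

def conceptual_network(articles):
    """Build the co-occurrence network described above: concepts
    (keywords) are nodes, and two concepts are linked if they were
    extracted from the same article. Edge weights count how many
    articles link each pair; pairs are stored in sorted order so
    (a, b) and (b, a) map to the same edge."""
    edges = Counter()
    for keywords in articles:
        for a, b in combinations(sorted(set(keywords)), 2):
            edges[(a, b)] += 1
    return edges

# Toy keyword sets per article (illustrative, not the study's data).
articles = [["election", "coalition", "budget"],
            ["coalition", "budget"],
            ["election", "media"]]
net = conceptual_network(articles)
print(net[("budget", "coalition")])  # 2
```

The resulting weighted edge list can be handed directly to any network-analysis library for the centrality and visualization steps the tutorial walks through.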