11 research outputs found

    hr500k – A Reference Training Corpus of Croatian.

    In this paper we present hr500k, a Croatian reference training corpus of 500 thousand tokens, segmented at document, sentence and word level, and annotated for morphosyntax, lemmas, dependency syntax, named entities, and semantic roles. We present each annotation layer via basic label statistics and describe the final encoding of the resource in CoNLL and TEI formats. We also describe the rather turbulent history of the resource and give insights into the topic and genre distribution in the corpus. Finally, we discuss further enrichments of the corpus with additional layers, which are already underway.

    Babel Treebank of Public Messages in Croatian

    The paper presents the process of constructing a publicly available treebank of public messages written in Croatian. The messages were collected from various electronic sources (e-mail, blog, Facebook and SMS) and published on the Zagreb Museum of Contemporary Art LED facade within the Babel art project. The project aimed to use the facade as an open-space blog or social interface enabling citizens to publicly express their views. The construction and current state of the treebank are presented along with future work plans. A comparison of the Babel Treebank with the Croatian Dependency Treebank and the SETimes.HR treebank regarding differing domains and annotation schemes is briefly sketched. The treebank is used as a test platform for introducing a new standard for syntactic annotation of Croatian texts. An experiment with morphosyntactic tagging and dependency parsing of the treebank is conducted, providing a first insight into the computational processing of non-standard text in Croatian.

    Otvoreni resursi i tehnologije za obradu srpskog jezika [Open Resources and Technologies for Processing the Serbian Language]

    Open language resources and tools are very important for increasing the quality and speeding up the development of technologies for natural language processing. This paper presents a set of open resources available for processing the Serbian language. We describe several manually annotated corpora, as well as a range of computational models, including a web service designed to facilitate their use.

    Cross-lingual Dependency Parsing of Related Languages with Rich Morphosyntactic Tagsets

    This paper addresses cross-lingual dependency parsing using rich morphosyntactic tagsets. In our case study, we experiment with three related Slavic languages: Croatian, Serbian and Slovene. Four different dependency treebanks are used for monolingual parsing, direct cross-lingual parsing, and a recently introduced cross-lingual parsing approach that utilizes statistical machine translation and annotation projection. We argue for the benefits of using rich morphosyntactic tagsets in cross-lingual parsing and empirically support the claim by showing large improvements over an impoverished common feature representation in the form of a reduced part-of-speech tagset. In the process, we improve over the previous state-of-the-art scores in dependency parsing for all three languages.

    German Loanwords in Digital Environment

    The framework of this research is a list of German loanwords compiled from dictionaries of the Croatian language, dictionaries of foreign words, and PhD theses studying German loanwords in the Croatian language. The German loanwords found in contemporary Croatian texts were analysed using language technologies supported by subsequent manual data processing. The compiled list of loanwords allowed for their computational analysis in the hrWaC web corpus, which comprises texts from the whole .hr domain collected over a period of four years. This research has established the presence of German loanwords in contemporary Croatian texts. Although the total number of German loanwords that appear in contemporary dictionaries of the Croatian language and the other sources consulted during the research is 17,988 lemmas, we were able to confirm usage for only 8,400 lemmas in contemporary texts. Applying the automatic detection method, we found the German loanwords used in contemporary texts in all of their forms, simultaneously creating a frequency dictionary of German loanwords. The confirmed 8,400 lemmas serve as proof that the lexical treasure of the Croatian language recorded in dictionaries has not been lost in the contemporary Croatian language. On the contrary, it has systematically entered the web corpus of the Croatian language and is found in texts that do not necessarily belong to the standard language.

    CLARIN

    The book provides a comprehensive overview of the Common Language Resources and Technology Infrastructure – CLARIN – for the humanities. It covers a broad range of CLARIN language resources and services, its underlying technological infrastructure, the achievements of national consortia, and challenges that CLARIN will tackle in the future. The book is published 10 years after the establishment of CLARIN as a European Research Infrastructure Consortium.

    CLARIN. The infrastructure for language resources

    CLARIN, the "Common Language Resources and Technology Infrastructure", has established itself as a major player in the field of research infrastructures for the humanities. This volume provides a comprehensive overview of the organization, its members, its goals and its functioning, as well as of the tools and resources hosted by the infrastructure. The many contributors, representing various fields from computer science to law to psychology, analyse a wide range of topics, such as the technology behind the CLARIN infrastructure, the use of CLARIN resources in diverse research projects, the achievements of selected national CLARIN consortia, and the challenges that CLARIN has faced and will face in the future. The book will be published in 2022, 10 years after the establishment of CLARIN as a European Research Infrastructure Consortium by the European Commission (Decision 2012/136/EU).

    Robust part-of-speech tagging of social media text

    Part-of-speech (PoS) taggers are an important processing component in many Natural Language Processing (NLP) applications, which has led to a variety of taggers for this task. Recent work in this field has shown that tagging accuracy on informal text domains is poor in comparison to formal text domains. In particular, social media text, which is inherently different from formal standard text, leads to a drastically increased error rate. These challenges originate in a lack of robustness of taggers towards domain transfers. This increased error rate has an impact on NLP applications that depend on PoS information. The main contribution of this thesis is the exploration of the concept of robustness under the following three aspects: (i) domain robustness, (ii) language robustness and (iii) long-tail robustness. Regarding (i), we start with an analysis of the phenomena found in informal text that make tagging this kind of text challenging. Furthermore, we conduct a comprehensive robustness comparison of many commonly used taggers for English and German by evaluating them on text from several domains. We find that the tagging of informal text is poorly supported by available taggers. A review and analysis of currently used methods to adapt taggers to informal text showed that these methods improve tagging accuracy but offer no satisfactory solution. We propose an alternative tagging approach that achieves increased multi-domain tagging robustness. This approach is based on tagging in two steps: the first step tags on a coarse-grained level, and the second step refines the coarse tags into fine-grained tags. Regarding (ii), we investigate whether each language requires a language-tailored PoS tagger or whether the construction of a competitive language-independent tagger is feasible. We explore the technical details that contribute to a tagger's language robustness by comparing taggers based on different algorithms when learning models of 21 languages. We find that language robustness is a less severe issue and that the impact of the tagger choice depends more on the granularity of the tagset to be learned than on the language. Regarding (iii), we investigate methods to improve the tagging of infrequent phenomena for which no sufficient amount of annotated training data is available, which is a common challenge in the social media domain. We propose a new method to overcome this lack of data that offers an inexpensive way of producing more training data. In a field study, we show that the quality of the produced data suffices to train tagger models that can recognize these under-represented phenomena. Furthermore, we present two software tools, FlexTag and DeepTC, which we developed in the course of this thesis. These tools provide the necessary flexibility for conducting all the experiments in this thesis and ensure their reproducibility.

    German loanwords in the Croatian digital newspaper corpora

    The object of the dissertation titled German loanwords in the Croatian digital newspaper corpora was to explore and analyze German loanwords in contemporary Croatian texts using language technologies supported by subsequent manual data processing. German-Croatian contacts that left their traces in Croatian can be traced back to prehistoric times (Zepic, 2002). As a consequence of political constellations in European history, the Croatian language was under the influence of the German language until the 20th century. Due to intense political and economic contacts, German loanwords entered the Croatian language on a large scale, with the largest acquisitions between 1527 and 1868, when Croatia was part of the Habsburg Monarchy. On the other hand, from the early standardization phase of the Croatian language to the present day, a process of language purification has been under way, which has reduced the number of German loanwords in the standard Croatian language. For many years, however, German had an important role in the Croatian language, and its presence was especially noticeable in the daily newspapers that were published in German in various Croatian cities. The publishing of newspapers in German throughout history shows the importance and role of the German language in Croatia, just as the contemporary daily press tends to reflect the actual state of the language. Using language technologies, the research conducted as part of this PhD dissertation revealed to what extent German loanwords are used in the Croatian language (i.e., the representation of German loanwords in contemporary Croatian texts). The analysis examined the occurrence of German loanwords in Croatian corpora and regional daily newspapers.
Firstly, we determined which German loanwords appear in the contemporary Croatian language. As sources for compiling such a list we used dictionaries of the standard Croatian language, dictionaries of foreign words, regional dictionaries and PhD theses studying German loanwords in the Croatian language (nine sources in total). The analysis of these sources provided a list of 17,988 different German loanwords. A further comparative computer analysis of these sources showed that not all German loanwords were recorded in all sources. Many German loanwords were recorded in a single source (e.g. Petrovic's Eseker dictionary, which records German words from the Eseker dialect). The compiled list of German loanwords allowed for their computational analysis in the digitized daily press (namely four newspapers collected over a period of twelve months), as well as in the hrWaC web corpus, which comprises texts from the whole .hr domain collected over a period of four years. Searching hrWaC, we found 8,400 German loanwords out of the 17,988 compiled from the nine sources. These 8,400 lemmas were found in 786,356 text files in their basic form (nominative case for nouns and adjectives, infinitive form for verbs) or in some other wordform. The manual analysis of these 8,400 German loanwords showed that some of their wordforms were identical to Croatian words (i.e., they were homonyms). In some cases, these "fake" Germanisms had a significantly higher frequency than the "real" German loanwords. The analysis allowed for the classification of German loanwords into two basic groups. The criterion for the classification was whether or not a German loanword overlaps with a Croatian word in one or all inflectional forms.
The first group consists of German loanwords that have a meaning or meanings tied to the original German word and that were (phonetically and/or morphologically) adapted to the Croatian language, but still represent German loanwords. The second group consists of those German loanwords that overlap with some Croatian words (i.e., they are homonyms). This second group is divided into two subgroups: the first subgroup consists of homonyms that belong to the same word class (e.g., "grad", a German loanword meaning "level", vs. "grad", a Croatian word meaning "city"), while the second subgroup consists of homonyms that belong to different word classes (e.g., "lak", a German loanword meaning "varnish", vs. "lak", a Croatian word meaning "easy", i.e. a noun vs. an adjective). Furthermore, the computer analysis of the frequency of German loanwords in daily newspapers showed that most of them appeared in Rijeka's Novi List. This newspaper is followed by Slobodna Dalmacija and Vecernji list, with a roughly equal number of German loanwords. The lowest frequency of German loanwords was found in the daily newspaper Glas Slavonije. Over the period of twelve months we found a total of 191 German loanwords in the four daily newspapers: Novi List, Slobodna Dalmacija, Vecernji list and Glas Slavonije. Comparing the German loanwords found in the daily newspapers (191 loanwords) with those found in the hrWaC web corpus, all German loanwords found in the newspapers were also confirmed in hrWaC, since hrWaC contains texts from the entire .hr domain, including digital editions of daily newspapers. Furthermore, language technologies enabled us to determine how often German loanwords appear in contemporary texts, namely the Croatian hrWaC web corpus and regional daily newspapers. Both sources are considered contemporary, since the newspapers and the corpus mirror the everyday use of the Croatian language.
This thesis gives a comparative analysis of German loanwords collected from various sources (dictionaries, scientific papers), revealing the overlap and the overlap frequency across all sources, as well as the overlap frequency between German loanwords from written sources and those in contemporary digital texts (daily newspapers and corpus). The study found that the German loanwords most frequently found in the web corpus originate from dictionaries of foreign words. The analysis of the frequency of German loanwords from the "New dictionary of foreign words" by Bratoljub Klaić (2012) and the "Dictionary of Foreign Words" by Anic, Klaic & Domovic (2001) showed that 965 German loanwords from these sources had a frequency in hrWaC greater than 1,000. This dissertation succeeded in building a mechanism for the analysis of German loanwords in the Croatian language that can be applied to other languages. It is important to point out that, regardless of the computer preparation and processing, much of the analysis had to be done manually. The analysis of the frequency of German loanwords in the contemporary Croatian language resulted in a frequency dictionary of German loanwords in the Croatian language. Out of four hypotheses, two were proven, one only partially, and one was not proven. The hypothesis that German loanwords are systematically used in the Croatian language was partially proven. The analysis revealed that out of 17,988 German loanwords collected from all written sources, 47% were found in the hrWaC web corpus. However, out of this percentage, only 1,285 German loanwords occurred in hrWaC with a frequency greater than 1,000. Also, over a period of one year only 191 German loanwords were found in four daily newspapers (18,284 tokens). It was concluded that German loanwords appear systematically in the Croatian language, but less often than we would expect given the centuries of intensive influence of the German language in our country.
The hypothesis that German loanwords without an equivalent in the Croatian language appear more often than German loanwords that have a Croatian equivalent was proven. Namely, the 8,400 German loanwords that were found in the web corpus were additionally analyzed manually. It turned out that 52% of the German loanwords that were analyzed manually have a Croatian equivalent. The third hypothesis, stating that German loanwords collected from the standard dictionaries of the Croatian language appear uniformly in the regional editions of daily newspapers regardless of the region, was also proven. The hypothesis was proven by analyzing the digitized daily press over a period of twelve months. Statistics showed that German loanwords occur evenly in the newspapers from all regions. As for frequency, German loanwords occur most often in the body of the text, followed by the main title, the heading and the sub-headings. The hypothesis that German loanwords appear more frequently in some newspaper sections, indicating a specific use of German loanwords in certain fields of human activity, was not proven. Digital processing of the newspapers did not enable the analysis of German loanwords by section, or the determination of specific applications of German loanwords in certain fields of human endeavor. The analysis of the web corpus also did not allow searching by fields of human endeavor, since all texts were saved according to the URLs of the domain. Finally, we can conclude that this research has established the presence of German loanwords in contemporary Croatian texts. Although the total number of German loanwords that appeared in contemporary dictionaries of the Croatian language and other sources consulted during the research was 17,988 lemmas, we were able to confirm usage for only 8,400 lemmas in contemporary texts.
The object of study of the dissertation titled German loan words in the Croatian digital newspaper corpora is the German loan words found in contemporary Croatian texts using language technologies supported by subsequent manual data processing. A list of German loan words compiled from dictionaries of the Croatian language, dictionaries of foreign words, and PhD theses studying German loan words in the Croatian language constitutes the framework of this research. The compiled list of German loan words allowed for their computational analysis in the digitized daily press (namely four newspapers collected over a period of one year), as well as in the hrWaC web corpus, which comprises texts from the whole .hr domain collected over a period of four years. This research has established the presence of German loan words in contemporary Croatian texts. Although the total number of German loan words that appeared in contemporary dictionaries of the Croatian language and other sources consulted during the research was 17,988 lemmas, we were able to confirm usage for only 8,400 lemmas in contemporary texts. Applying the automatic detection method, we found the German loan words used in contemporary texts in all of their forms, simultaneously creating a frequency dictionary of German loan words. A detailed analysis revealed that German loan words occur systematically in the Croatian language, but less frequently than expected, considering the substantial, centuries-old influence of the German language in the Croatian regions. The confirmed 8,400 lemmas serve as proof that the lexical treasure of the Croatian language recorded in dictionaries has not been lost in the contemporary Croatian language. On the contrary, it has systematically entered the web corpus of the Croatian language and is found in texts that do not necessarily belong to the standard language. It has been proven that German loan words without an equivalent in the Croatian language occur more frequently than German loan words with a Croatian equivalent. Finally, it has been proven that German loan words that are part of the standard dictionaries of the Croatian language are equally distributed in regional issues of daily newspapers, regardless of the region, which suggests that their usage is not necessarily region-specific.