15 research outputs found

Babel Treebank of Public Messages in Croatian

Author: Agić Ana
Agić Željko
Merkler Danijela
Publication venue: The Authors. Published by Elsevier Ltd.
Publication date: 25/10/2013

The paper presents the process of constructing a publicly available treebank of public messages written in Croatian. The messages were collected from various electronic sources – e-mail, blog, Facebook and SMS – and published on the Zagreb Museum of Contemporary Art LED facade within the Babel art project. The project aimed to use the facade as an open-space blog or social interface enabling citizens to publicly express their views. The construction and current state of the treebank are presented along with plans for future work. A comparison of the Babel Treebank with the Croatian Dependency Treebank and the SETimes.HR treebank, regarding their differing domains and annotation schemes, is briefly sketched. The treebank is used as a test platform for introducing a new standard for syntactic annotation of Croatian texts. An experiment with morphosyntactic tagging and dependency parsing of the treebank is conducted, providing first insight into the computational processing of non-standard text in Croatian.

Elsevier - Publisher Connector

hrWaC and slWac: Compiling web corpora for Croatian and Slovene.

Author: Nikola Ljubešić
Tomaž Erjavec
Publication venue: Springer.
Publication date: 01/01/2011

Web corpora have become an attractive source of linguistic content, yet for many languages they are still not available. This paper introduces two new annotated web corpora: the Croatian hrWaC and the Slovene slWaC. Both were built using a standard "Web as Corpus" pipeline, modified to take into account the limited amount of available web data. The modifications are described in the paper, focusing on content extraction from HTML pages, which combines high precision of extracted language content with decent recall. The paper also investigates the text types of the acquired corpora using topic modeling, comparing the two corpora with each other and with ukWaC.

CiteSeerX

Korpus šolskih besedil slovenskega jezika: zasnova in gradnja

Author: Erjavec Tomaž
Jemec Tomazin Mateja
Ježovnik Janoš
Ledinek Nina
Perdih Andrej
Romih Miro
Trojar Mitja
Publication venue: ZRC SAZU, Založba ZRC
Publication date: 28/09/2022

This article presents the Corpus of Slovenian School Texts, a specialized corpus of written Slovenian containing around 1.8 million tokens. It was designed within the scope of the project Franček, Language Advising Service for Teachers of Slovenian and the Slovenian School Dictionary, and was intended to provide language material for the compilation of Šolski slovar slovenskega jezika (Slovenian School Dictionary), the first research-based school dictionary of Slovenian. The article discusses the text type composition and size of the corpus, sheds light on the technical procedures of text preprocessing and corpus annotation, and presents the set of corpus metadata. It also explains in which formats and under what licenses the Corpus of Slovenian School Texts has been made available, and draws attention to the legal aspects of obtaining texts.

ZRC SAZU Publishing (Znanstvenoraziskovalni center - Slovenske akademije znanosti in umetnosti)

Savremeni jezički korpusi na zapadnom Balkanu – istorijat, trenutno stanje i budućnost

Author: Nikola Dobrić
Publication venue: Slavistično društvo Slovenije
Publication date: 01/04/2012

Directory of Open Access Journals

Context-dependent factored language models

Author: D Klakow
EM de Novais
Gregor Donaj
H Adel
K Kirchhoff
S Katz
SF Chen
T Hirsimaki
T Rotovnik
Zdravko Kačič
Publication venue: Springer Science and Business Media LLC

Crossref

Main results of MONDILEX project

Author: Dimitrova Ludmila
Erjavec Tomaž
Garabík Radovan
Iomdin Leonid
Koseska-Toszewa Violetta
Shyrokov Volodymyr
Publication venue: Institute of Slavic Studies Polish Academy of Sciences
Publication date: 01/11/2015

The paper presents the results and recommendations of MONDILEX, a 7FP project that covered six Slavic languages: Bulgarian, Polish, Russian, Slovak, Slovene, and Ukrainian. The paper summarizes the research undertaken on the standardisation and integration of Slavic language resources and on the establishment of a virtual organisation supporting a research infrastructure for Slavic lexicography. The results should be useful for implementing a research infrastructure in the coming years.

Directory of Open Access Journals

First International Workshop on Lexical Resources

Author: Sagot Benoît
Publication venue: HAL CCSD
Publication date: 01/08/2011

Lexical resources are one of the main sources of linguistic information for research and applications in Natural Language Processing and related fields. In recent years, advances have been achieved both in symbolic aspects of lexical resource development (lexical formalisms, rule-based tools) and in statistical techniques for the acquisition and enrichment of lexical resources, both monolingual and multilingual. The latter have allowed for faster development of large-scale morphological, syntactic and/or semantic resources, for widely used as well as resource-scarce languages. Moreover, the notion of a dynamic lexicon is increasingly used to take into account the fact that the lexicon undergoes permanent evolution. This workshop aims at sketching a broad picture of the state of the art in the domain of lexical resource modeling and development. It is also dedicated to research on the application of lexical resources for improving corpus-based studies and language processing tools, both in NLP and in other language-related fields, such as linguistics, translation studies, and didactics.

INRIA a CCSD electronic archive server

Hal-Diderot

Multiword expressions at length and in depth

Publication venue: Language Science Press
Publication date: 01/04/2020

The annual workshop on multiword expressions has taken place since 2001 in conjunction with major computational linguistics conferences and attracts the attention of an ever-growing community working on a variety of languages, linguistic phenomena and related computational processing issues. MWE 2017 took place in Valencia, Spain, and represented a vibrant panorama of the current research landscape on the computational treatment of multiword expressions, featuring many high-quality submissions. Furthermore, MWE 2017 included the first shared task on multilingual identification of verbal multiword expressions. The shared task, with extended communal work, has developed important multilingual resources and mobilised several research groups in computational linguistics worldwide. This book contains extended versions of selected papers from the workshop. Authors worked hard to include detailed explanations, broader and deeper analyses, and new exciting results, which were thoroughly reviewed by an internationally renowned committee. We hope that this distinctly joint effort will provide a meaningful and useful snapshot of the multilingual state of the art in multiword expressions modelling and processing, and will be a point of reference for future work.

Directory of Open Access Books (DOAB)

Koncept novega razlagalnega slovarja slovenskega knjižnega jezika

Author: Gliha Komac Nataša
Jakop Nataša
Ježovnik Janoš
Klemenčič Simona
Krvina Domen
Ledinek Nina
Mirtič Tanja
Perdih Andrej
Petric Špela
Snoj Marko
Žele Andreja
Publication venue: The Research Center of the Slovenian Academy of Sciences and Arts / Znanstvenoraziskovalni center Slovenske akademije znanosti in umetnosti (ZRC SAZU)
Publication date: 01/04/2022

The Concept of the New Explanatory Dictionary of Standard Slovenian defines the content and structure of a contemporary monolingual informative-normative dictionary being compiled at the Fran Ramovš Institute of the Slovenian Language ZRC SAZU. The first chapter explains the dictionary's basic features, scope, and purpose; the second chapter analyses the structure of the dictionary entry in detail; and the third chapter outlines the editorial process. The dictionary will contain approximately 100,000 entries describing the grammatical, semantic, and other properties of single-word and multiword lexical units of contemporary standard Slovenian. Each year's additions to the dictionary will be published on the portal Fran: slovarji Inštituta za slovenski jezik Frana Ramovša ZRC SAZU. The content of the concept is the result of several years of lexicological and lexicographic work by the staff of the Fran Ramovš Institute of the Slovenian Language ZRC SAZU, consultations with members of the editorial board, and efforts to reach agreement with the wider public on the form of the new dictionary. The concept was adopted and approved by the Scientific Council of the Fran Ramovš Institute of the Slovenian Language ZRC SAZU, the Scientific Council of ZRC SAZU, the Class for Philological and Literary Sciences of SAZU, and the Executive Board of the Presidency of SAZU.

Directory of Open Access Books (DOAB)

Extended papers from the MWE 2017 workshop

Publication date: 01/01/2018

Institutional Repository of the Freie Universität Berlin