43 research outputs found
TEI and Microsoft: a marriage made in...
In several ongoing projects we faced the dilemma of how to reconcile our goal of delivering standardly encoded historical documents with having the actual editing and annotation performed by researchers and students who had no knowledge of XML and TEI and, for the most part, no interest in learning them. The solution we developed lets the annotators use familiar and flexible editors, such as Microsoft Word (for structural annotation of documents) and Excel (for word-level linguistic annotation), and automatically converts their files into TEI. Given the unconstrained nature of such editors, this sounds like a recipe for disaster. However, the solution crucially depends on a dedicated Web service to which the annotators can upload their files; these are immediately converted to XML/TEI and from it back to a visual format, either HTML or Excel XML, and presented to the annotators. The annotators thus get immediate feedback about the quality of their encoding in the source and can correct errors before they accumulate; the responsibility for correct encoding rests with the annotators rather than with the developers of the conversion procedure. The paper describes the Web service and details its use in three projects. The main conclusions are that the proposed solution is appropriate for shallow encodings, but that it nevertheless requires detailed annotation guidelines.
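The convert-and-report loop described in this abstract can be sketched roughly as follows. This is only an illustration under stated assumptions: the style names, the `STYLE_TO_TEI` mapping and the `paragraphs_to_tei` function are hypothetical and are not taken from the paper, and the input is assumed to have already been read from the Word file as (style, text) pairs.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping from Word paragraph style names to TEI element names.
STYLE_TO_TEI = {
    "Heading 1": "head",
    "Normal": "p",
    "Quote": "quote",
}


def paragraphs_to_tei(paragraphs):
    """Convert (style, text) pairs to a TEI-like <body> and collect feedback.

    Returns the serialized XML and a list of messages about unknown styles,
    which is the kind of feedback an annotator could act on.
    """
    body = ET.Element("body")
    div = ET.SubElement(body, "div")
    problems = []
    for index, (style, text) in enumerate(paragraphs, start=1):
        tag = STYLE_TO_TEI.get(style)
        if tag is None:
            problems.append(f"paragraph {index}: unknown style '{style}'")
            tag = "p"  # fall back so the text is still visible in the output
        element = ET.SubElement(div, tag)
        element.text = text
    return ET.tostring(body, encoding="unicode"), problems


xml_text, problems = paragraphs_to_tei([
    ("Heading 1", "Letter of 1848"),
    ("Normal", "Dear Sir,"),
    ("Fancy", "Yours truly"),
])
print(xml_text)
print(problems)
```

Returning the problem list next to the converted document mirrors the idea that annotators, not the converter, fix the source: an unknown style is reported rather than silently guessed.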
Babel Treebank of Public Messages in Croatian
The paper presents the process of constructing a publicly available treebank of public messages written in Croatian. The messages were collected from various electronic sources (e-mail, blog, Facebook and SMS) and published on the Zagreb Museum of Contemporary Art LED facade within the Babel art project. The project aimed to use the facade as an open-space blog or social interface enabling citizens to publicly express their views. The construction and current state of the treebank are presented along with plans for future work. A comparison of the Babel Treebank with the Croatian Dependency Treebank and the SETimes.HR treebank, regarding their differing domains and annotation schemes, is briefly sketched. The treebank is used as a test platform for introducing a new standard for syntactic annotation of Croatian texts. An experiment with morphosyntactic tagging and dependency parsing of the treebank is conducted, providing a first insight into the computational processing of non-standard text in Croatian.
hrWaC and slWaC: Compiling web corpora for Croatian and Slovene
Web corpora have become an attractive source of linguistic content, yet for many languages they are still not available. This paper introduces two new annotated web corpora: the Croatian hrWaC and the Slovene slWaC. Both were built using a modified standard "Web as Corpus" pipeline, with the limited amount of available web data in mind. The modifications are described in the paper, focusing on content extraction from HTML pages, which combines high precision of the extracted language content with decent recall. The paper also investigates the text types of the acquired corpora using topic modeling, comparing the two corpora with each other and with ukWaC.
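The abstract does not spell out the extraction algorithm, so the following is only a generic, precision-oriented heuristic and not the authors' method. It keeps a paragraph only if it is long enough and contains little link text; the class name `ParagraphExtractor` and both thresholds are assumptions chosen for illustration.

```python
from html.parser import HTMLParser


class ParagraphExtractor(HTMLParser):
    """Keep <p> blocks that are long and mostly free of link text."""

    def __init__(self, min_chars=40, max_link_density=0.3):
        super().__init__()
        self.min_chars = min_chars                # assumed threshold
        self.max_link_density = max_link_density  # assumed threshold
        self.kept = []
        self._in_p = False
        self._in_a = False
        self._text = []
        self._link_chars = 0

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._in_p = True
            self._text = []
            self._link_chars = 0
        elif tag == "a":
            self._in_a = True

    def handle_endtag(self, tag):
        if tag == "a":
            self._in_a = False
        elif tag == "p" and self._in_p:
            self._in_p = False
            # Normalize whitespace before measuring the paragraph.
            text = " ".join("".join(self._text).split())
            if len(text) < self.min_chars:
                return  # too short: likely navigation or a caption
            if self._link_chars / len(text) > self.max_link_density:
                return  # mostly links: likely a menu or link list
            self.kept.append(text)

    def handle_data(self, data):
        if self._in_p:
            self._text.append(data)
            if self._in_a:
                self._link_chars += len(data.strip())


html = (
    '<p><a href="/">Home</a> <a href="/x">News</a></p>'
    "<p>Web corpora have become an attractive source of linguistic "
    "content for many languages.</p>"
)
extractor = ParagraphExtractor()
extractor.feed(html)
print(extractor.kept)
```

Strict thresholds of this kind trade recall for precision: some genuine short paragraphs are lost, but what remains is unlikely to be boilerplate.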
Cross-lingual Dependency Parsing of Related Languages with Rich Morphosyntactic Tagsets
This paper addresses cross-lingual dependency parsing using rich morphosyntactic tagsets. In our case study, we experiment with three related Slavic languages: Croatian, Serbian and Slovene. Four different dependency treebanks are used for monolingual parsing, direct cross-lingual parsing, and a recently introduced cross-lingual parsing approach that utilizes statistical machine translation and annotation projection. We argue for the benefits of using rich morphosyntactic tagsets in cross-lingual parsing and empirically support the claim by showing large improvements over an impoverished common feature representation in the form of a reduced part-of-speech tagset. In the process, we improve over the previous state-of-the-art scores in dependency parsing for all three languages.
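To illustrate the contrast between a rich morphosyntactic tag and a reduced part-of-speech tag, here is a minimal sketch. It assumes positional tags in the MULTEXT-East style, where the first character encodes the word category; the abstract does not name the tagset, and the example tags and the `reduce_to_pos` helper are illustrative only.

```python
def reduce_to_pos(msd_tag):
    """Keep only the first character (the word category) of a positional tag."""
    return msd_tag[:1]


# Illustrative (word, tag) pairs; the tags are examples, not corpus data.
tagged = [("Vlada", "Ncfsn"), ("je", "Var3s"), ("odlučila", "Vmp-sf")]
reduced = [(word, reduce_to_pos(tag)) for word, tag in tagged]
print(reduced)
```

The reduction discards case, number, gender and similar attributes, which is exactly the information the paper argues is beneficial to keep for parsing related languages.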
CLASSLA-Stanza: The Next Step for Linguistic Processing of South Slavic Languages
We present CLASSLA-Stanza, a pipeline for automatic linguistic annotation of
the South Slavic languages, which is based on the Stanza natural language
processing pipeline. We describe the main improvements in CLASSLA-Stanza with
respect to Stanza, and give a detailed description of the model training
process for the latest 2.1 release of the pipeline. We also report performance
scores produced by the pipeline for different languages and varieties.
CLASSLA-Stanza exhibits consistently high performance across all the supported
languages and outperforms or expands its parent pipeline Stanza at all the
supported tasks. We also present the pipeline's new functionality enabling
efficient processing of web data and the reasons that led to its
implementation.
Comment: 17 pages, 14 tables, 1 figure
Named Entity Recognition in Slovene Text
The paper presents the algorithm and implementation of a program for recognizing names in Slovene by means of machine learning. The supervised approach, based on conditional random fields, is trained on the annotated corpus ssj500k. In the corpus, which is freely available under the Creative Commons CC-BY-NC-SA licence, word tokens are annotated not only with morphosyntactic tags and lemmas but also with names of organizations and with personal, geographical and miscellaneous names. The paper presents how recognition accuracy is affected by the use of morphosyntactic tags, lexicons and conjunctions of neighbouring features. One finding of the study is that morphosyntactic tags are useful for entity recognition. In combination with all other features, the system achieves 74% precision and 72% recall on the test set; personal names are recognized best, followed by geographical and organization names, and lastly miscellaneous names. A further new finding is that splitting the class of all miscellaneous names into organizations and the remaining miscellaneous names also yields better recognition results for the other classes. Tests on independently annotated corpora show that the model generalizes well for personal and geographical names. The software produced in the study is freely available under the Apache 2.0 licence at http://ailab.ijs.si/~tadej/slner.zip, and development versions are available at https://github.com/tadejs/slner
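A rough sketch of what feature extraction with morphosyntactic tags and conjunctions of neighbouring features could look like for a conditional random field tagger is given below. The feature names, the example sentence, its tags and the `token_features` function are hypothetical and are not taken from the slner software.

```python
def token_features(tokens, tags, i):
    """Build a feature dict for token i from its word form, its
    morphosyntactic tag, and conjunctions with neighbouring tags."""
    word, tag = tokens[i], tags[i]
    features = {
        "lower": word.lower(),
        "is_capitalized": word[:1].isupper(),
        "msd": tag,
    }
    if i > 0:
        features["prev_msd"] = tags[i - 1]
        # Conjunction: a single feature combining two neighbouring properties.
        features["prev_msd|msd"] = tags[i - 1] + "|" + tag
    else:
        features["BOS"] = True  # beginning of sentence
    if i < len(tokens) - 1:
        features["next_msd"] = tags[i + 1]
        features["msd|next_msd"] = tag + "|" + tags[i + 1]
    else:
        features["EOS"] = True  # end of sentence
    return features


# Illustrative sentence with example tags (assumptions, not corpus data).
tokens = ["Janez", "živi", "v", "Ljubljani"]
tags = ["Npmsn", "Vmpr3s", "Sl", "Npfsl"]
features = [token_features(tokens, tags, i) for i in range(len(tokens))]
print(features[3])
```

A conjunction such as a preposition tag followed by a proper-noun tag lets a linear model pick up on patterns like "v Ljubljani" that neither feature expresses on its own.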