Non-Standard Words as Features for Text Categorization
This paper presents categorization of Croatian texts using Non-Standard Words
(NSW) as features. Non-Standard Words are: numbers, dates, acronyms,
abbreviations, currency, etc. NSWs in the Croatian language are determined
according to the Croatian NSW taxonomy. For the purpose of this research, 390 text
documents were collected and formed the SKIPEZ collection with 6 classes:
official, literary, informative, popular, educational and scientific. Text
categorization experiment was conducted on three different representations of
the SKIPEZ collection: in the first representation, the frequencies of NSWs are
used as features; in the second representation, the statistic measures of NSWs
(variance, coefficient of variation, standard deviation, etc.) are used as
features; while the third representation combines the first two feature sets.
Naive Bayes, CN2, C4.5, kNN, Classification Trees and Random Forest algorithms
were used in text categorization experiments. The best categorization results
were achieved with the first feature set (NSW frequencies), reaching a
categorization accuracy of 87%. This suggests that NSWs should be considered as
features in highly inflectional languages such as Croatian. NSW-based features
reduce the dimensionality of the feature space without standard lemmatization
procedures, so the bag-of-NSWs representation should be considered for further
Croatian text categorization experiments.
Comment: IEEE 37th International Convention on Information and Communication
Technology, Electronics and Microelectronics (MIPRO 2014), pp. 1415-1419, 2014
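The bag-of-NSWs idea can be sketched with a few regular expressions: count matches per NSW class and use the counts as a feature vector. The patterns below are illustrative stand-ins, not the paper's Croatian NSW taxonomy, and the classes may overlap (a currency code such as EUR also matches the acronym pattern):

```python
import re

# Illustrative NSW patterns only; the paper relies on the full Croatian NSW
# taxonomy (dates, acronyms, abbreviations, currency expressions, ...).
NSW_PATTERNS = {
    "number": re.compile(r"\b\d+(?:[.,]\d+)?\b"),
    "date": re.compile(r"\b\d{1,2}\.\s?\d{1,2}\.\s?\d{2,4}\b"),
    "currency": re.compile(r"€|\$|\bEUR\b|\bHRK\b"),
    "acronym": re.compile(r"\b[A-Z]{2,}\b"),
}

def nsw_frequencies(text):
    """First feature set from the paper: frequency of each NSW class."""
    return {label: len(pat.findall(text)) for label, pat in NSW_PATTERNS.items()}

# A constructed Croatian sentence with a date, an acronym and a currency amount.
print(nsw_frequencies("Dana 12.3.2014. tvrtka ABC objavila je prihod od 1.500 EUR."))
```

These per-class counts, optionally extended with the paper's second feature set of per-class statistics (variance, standard deviation, etc.), can then feed any of the listed classifiers.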
Compiling Corpora in the Digital Humanities for Languages with Limited Resources: On the Practice of Compiling Thematic Corpora from Digital Media for Serbian, Croatian and Slovenian
The digital era has unlocked unprecedented possibilities of compiling corpora of social discourse, which
has brought corpus linguistic methods into closer interaction with other methods of discourse analysis and the humanities. Even when not using any specific techniques of corpus linguistics, drawing
on some sort of corpus is increasingly common in empirically grounded social-scientific analysis
(sometimes dubbed "corpus-assisted discourse analysis" or "corpus-based critical discourse analysis",
cf. Hardt-Mautner 1995; Baker 2016). In the post-Yugoslav space, recent corpus developments have
brought table-turning advantages in many areas of discourse research, along with an ongoing proliferation of corpora and tools. Still, for linguists and discourse analysts who embark on collecting specialized corpora for their own research purposes, many questions persist: partly because the background of these issues changes quickly, but also because there is still a gap in the corpus method, and in
guidelines for corpus compilation, when applied beyond anglophone contexts.
In this paper we aim to discuss some possible solutions to these difficulties by presenting a
step-by-step account of a corpus building procedure specifically for Croatian, Serbian and Slovenian, through an example of compiling a thematic corpus from digital media sources (news articles
and reader comments). Following an overview of corpus types, uses and advantages in social sciences and digital humanities, we present the corpus compilation possibilities in the South Slavic
language contexts, including data scraping options, permissions and ethical issues, the factors that
facilitate or complicate automated collection, and corpus annotation and processing possibilities.
The study shows expanding possibilities for work with the given languages, but also some persistently grey areas where researchers need to make decisions based on research expectations. Overall, the paper aims to recapitulate our own corpus compilation experience in the wider context of
South Slavic corpus linguistics and corpus linguistic approaches in the humanities more generally.
The Influence of Text Preprocessing Methods and Tools on Calculating Text Similarity
Text mining to a great extent depends on the various text preprocessing techniques. The preprocessing methods and tools which are used to prepare texts for further mining can be divided into those which are and those which are not language-dependent. The subject matter of this research was the analysis of the influence of these methods and tools on further text mining. We first focused on the analysis of the influence on the reduction of the vector space model for the multidimensional representation of text documents. We then analyzed the influence on calculating text similarity, which is the focus of this research. The conclusion we reached is that the implementation of various text preprocessing methods in the Serbian language, which are used for the reduction of the vector space model for the multidimensional representation of text documents, achieves the required results. However, the implementation of various text preprocessing methods specific to the Serbian language for the purpose of calculating text similarity can lead to great differences in the results.
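The paper's point, that language-specific preprocessing can substantially change similarity scores, can be illustrated with a toy bag-of-words cosine similarity; the crude_stem function below is a deliberately naive stand-in for real Serbian stemming or lemmatization, used only to show the effect on the score:

```python
import math
from collections import Counter

def bow(text, normalize=None):
    """Bag-of-words vector, optionally applying a token normalizer first."""
    tokens = text.lower().split()
    if normalize:
        tokens = [normalize(t) for t in tokens]
    return Counter(tokens)

def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def crude_stem(token):
    # Deliberately crude "stemmer": truncate to 4 characters so inflected
    # forms like "gradovi"/"gradova" collapse to the same pseudo-stem.
    return token[:4]

d1, d2 = "gradovi rastu brzo", "gradova koji rastu"
print(cosine(bow(d1), bow(d2)))                          # only "rastu" shared
print(cosine(bow(d1, crude_stem), bow(d2, crude_stem)))  # "grad" and "rast" shared
```

With raw tokens the two sentences share one term; after the crude stemming they share two, so the similarity roughly doubles, mirroring the paper's observation that preprocessing choices directly shape similarity results.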
CLASSLA-Stanza: The Next Step for Linguistic Processing of South Slavic Languages
We present CLASSLA-Stanza, a pipeline for automatic linguistic annotation of
the South Slavic languages, which is based on the Stanza natural language
processing pipeline. We describe the main improvements in CLASSLA-Stanza with
respect to Stanza, and give a detailed description of the model training
process for the latest 2.1 release of the pipeline. We also report performance
scores produced by the pipeline for different languages and varieties.
CLASSLA-Stanza exhibits consistently high performance across all the supported
languages and outperforms or expands its parent pipeline Stanza at all the
supported tasks. We also present the pipeline's new functionality enabling
efficient processing of web data and the reasons that led to its
implementation.
Comment: 17 pages, 14 tables, 1 figure
MultiLexNorm: A Shared Task on Multilingual Lexical Normalization
Lexical normalization is the task of transforming an utterance into its standardized form. This task is beneficial for downstream analysis, as it provides a way to harmonize (often spontaneous) linguistic variation. Such variation is typical for social media on which information is shared in a multitude of ways, including diverse languages and code-switching. Since the seminal work of Han and Baldwin (2011) a decade ago, lexical normalization has attracted attention in English and multiple other languages. However, there exists a lack of a common benchmark for comparison of systems across languages with a homogeneous data and evaluation setup. The MULTILEXNORM shared task sets out to fill this gap. We provide the largest publicly available multilingual lexical normalization benchmark including 12 language variants. We propose a homogenized evaluation setup with both intrinsic and extrinsic evaluation. As extrinsic evaluation, we use dependency parsing and part-of-speech tagging with adapted evaluation metrics (a-LAS, a-UAS, and a-POS) to account for alignment discrepancies. The shared task hosted at W-NUT 2021 attracted 9 participants and 18 submissions. The results show that neural normalization systems outperform the previous state-of-the-art system by a large margin. Downstream parsing and part-of-speech tagging performance is positively affected but to varying degrees, with improvements of up to 1.72 a-LAS, 0.85 a-UAS, and 1.54 a-POS for the winning system
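Intrinsically, lexical normalization systems are commonly scored by error reduction rate (ERR): the accuracy gain over the leave-as-is baseline, normalized by the baseline's error mass. The abstract does not name the intrinsic metric, so treating it as ERR is our assumption; the sketch below shows the computation on invented tokens:

```python
def err(gold, pred, raw):
    """Error reduction rate (assumed intrinsic metric): system accuracy gain
    over the leave-as-is baseline, scaled by the baseline's remaining errors."""
    n = len(gold)
    correct = sum(g == p for g, p in zip(gold, pred))
    baseline = sum(g == r for g, r in zip(gold, raw))
    if baseline == n:  # nothing needed normalizing
        return 0.0
    return (correct - baseline) / (n - baseline)

raw  = ["u", "r", "gr8", "m8"]          # noisy input tokens
gold = ["you", "are", "great", "mate"]  # reference normalizations
pred = ["you", "are", "great", "m8"]    # hypothetical system output
print(err(gold, pred, raw))             # the system fixes 3 of 4 errors
```

An ERR of 1.0 means every noisy token was normalized correctly, 0.0 matches the do-nothing baseline, and negative values mean the system corrupted more than it fixed.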
Natural language processing for similar languages, varieties, and dialects: A survey
There has been a lot of recent interest in the natural language processing (NLP) community in the computational processing of language varieties and dialects, with the aim of improving the performance of applications such as machine translation, speech recognition, and dialogue systems. Here, we attempt to survey this growing field of research, with a focus on computational methods for processing similar languages, varieties, and dialects. In particular, we discuss the most important challenges when dealing with diatopic language variation, and we present some of the available datasets, the process of data collection, and the most common data collection strategies used to compile datasets for similar languages, varieties, and dialects. We further present a number of studies on computational methods developed and/or adapted for preprocessing, normalization, part-of-speech tagging, and parsing similar languages, language varieties, and dialects. Finally, we discuss relevant applications such as language and dialect identification and machine translation for closely related languages, language varieties, and dialects.
Non peer reviewed