31 research outputs found
System theoretic approach to sustainable development problems
This paper shows that the concepts and methodology of systems theory and operations research are suitable for the planning and control of sustainable development. Sustainable development problems can be represented using state-space concepts, such as the transition of a system from a given initial state to a final state. It is shown that sustainable development represents a specific control problem, whose peculiarity is that the target is to keep the system within a prescribed feasible region of the state space. The analysis of planning and control problems of sustainable development also shows that methods developed in operations research, such as multicriteria optimization, simulation of dynamic processes, and non-conventional treatment of uncertainty, provide an adequate and exact basis for resolving these problems.
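The idea of keeping a system inside a prescribed feasible region of the state space can be illustrated with a minimal sketch. Everything below is an assumption made for illustration only and is not taken from the paper: the scalar logistic resource model, the parameter values, the bounds, and the two hypothetical harvesting policies.

```python
from typing import Callable, List


def simulate(x0: float, steps: int, policy: Callable[[float], float],
             growth: float = 0.3, capacity: float = 100.0) -> List[float]:
    """Iterate a hypothetical resource model
    x[k+1] = x[k] + growth * x[k] * (1 - x[k] / capacity) - u[k],
    where u[k] = policy(x[k]) is the control (harvest) applied at step k."""
    trajectory = [x0]
    for _ in range(steps):
        x = trajectory[-1]
        u = policy(x)
        trajectory.append(x + growth * x * (1.0 - x / capacity) - u)
    return trajectory


def stays_feasible(trajectory: List[float], lower: float, upper: float) -> bool:
    """True if every visited state lies in the feasible interval [lower, upper]."""
    return all(lower <= x <= upper for x in trajectory)


# Policy 1 (assumed): constant harvest that exceeds the maximum natural growth.
fixed = simulate(50.0, 30, lambda x: 9.0)

# Policy 2 (assumed): state feedback that harvests only the surplus above 40.
feedback = simulate(50.0, 30, lambda x: max(0.0, 0.3 * (x - 40.0)))

print(stays_feasible(fixed, 40.0, 90.0))     # the constant policy leaves the region
print(stays_feasible(feedback, 40.0, 90.0))  # the feedback policy stays inside it
```

The control target here is not a single final state: a policy is judged only by whether the whole trajectory remains within the bounds.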
Compiling corpora in the digital humanities for languages with limited resources: on the practice of compiling thematic corpora from digital media for Serbian, Croatian and Slovenian
The digital era has unlocked unprecedented possibilities of compiling corpora of social discourse, which has brought corpus linguistic methods into closer interaction with other methods of discourse analysis and the humanities. Even when no specific corpus linguistic techniques are used, researchers increasingly draw on some sort of corpus for empirically grounded social-scientific analysis (sometimes dubbed "corpus-assisted discourse analysis" or "corpus-based critical discourse analysis", cf. Hardt-Mautner 1995; Baker 2016). In the post-Yugoslav space, recent corpus developments have brought table-turning advantages in many areas of discourse research, along with an ongoing proliferation of corpora and tools. Still, for linguists and discourse analysts who embark on collecting specialized corpora for their own research purposes, many questions persist, partly due to the fast-changing background of these issues, but also because there is still a gap in the corpus method, and in guidelines for corpus compilation, when applied beyond anglophone contexts.
In this paper we aim to discuss some possible solutions to these difficulties by presenting a step-by-step account of a corpus building procedure specifically for Croatian, Serbian and Slovenian, through an example of compiling a thematic corpus from digital media sources (news articles and reader comments). Following an overview of corpus types, uses and advantages in social sciences and digital humanities, we present the corpus compilation possibilities in the South Slavic language contexts, including data scraping options, permissions and ethical issues, the factors that facilitate or complicate automated collection, and corpus annotation and processing possibilities. The study shows expanding possibilities for work with the given languages, but also some persistently grey areas where researchers need to make decisions based on research expectations. Overall, the paper aims to recapitulate our own corpus compilation experience in the wider context of South Slavic corpus linguistics and corpus linguistic approaches in the humanities more generally.
Methodology for solving semantic problems in the processing of short texts written in natural languages with limited resources
Statistical approaches to natural language processing typically require considerable amounts of labeled data, and often various auxiliary language tools as well, limiting their applicability in resource-limited settings. This thesis presents a methodology for developing statistical solutions in the semantic processing of natural languages with limited resources. In these languages, not only are existing language resources limited, but so are the capabilities for developing new datasets and dedicated tools and algorithms. The proposed methodology focuses on short texts due to their prevalence in digital communication and the greater complexity of their semantic processing.
The methodology encompasses all phases in the creation of statistical solutions, from the collection of textual content, to data annotation, to the formulation, training, and evaluation of machine learning models. Its use is illustrated in detail on two semantic tasks: sentiment analysis and semantic textual similarity. The Serbian language is used as an example of a language with limited resources, but the proposed methodology can also be applied to other languages in this category.
In addition to the general methodology, the contributions of this thesis consist of the development of a new, flexible short-text sentiment annotation system, a new annotation cost-effectiveness metric, and several new semantic textual similarity models. The thesis results also include the creation of the first publicly available annotated datasets of short texts in Serbian for the tasks of sentiment analysis and semantic textual similarity, the development and evaluation of numerous models on these tasks, and the first comparative evaluation of multiple morphological normalization tools on short texts in Serbian.
hr500k – A Reference Training Corpus of Croatian
In this paper we present hr500k, a Croatian reference training corpus of 500 thousand tokens, segmented at document, sentence and word level, and annotated for morphosyntax, lemmas, dependency syntax, named entities, and semantic roles. We present each annotation layer via basic label statistics and describe the final encoding of the resource in CoNLL and TEI formats. We also give a description of the rather turbulent history of the resource and give insights into the topic and genre distribution in the corpus. Finally, we discuss further enrichments of the corpus with additional layers, which are already underway.
Open resources and technologies for processing the Serbian language
Open language resources and tools are very important for increasing the quality and speeding up the development of technologies for natural language processing. This paper presents a set of open resources available for processing the Serbian language. We describe several manually annotated corpora, as well as a range of computational models, including a web service designed to facilitate their use.
Approximating Pareto frontier using a hybrid line search approach
This is the post-print version of the final paper published in Information Sciences. Changes resulting from the publishing process, such as peer review, editing, corrections, structural formatting, and other quality control mechanisms, may not be reflected in this document. Changes may have been made to this work since it was submitted for publication. Copyright © 2010 Elsevier B.V.
The aggregation of objectives in multiple criteria programming is one of the simplest and most widely used approaches, but it is well known that this technique sometimes fails in different respects to determine the Pareto frontier. This paper proposes a new approach for multicriteria optimization, which aggregates the objective functions and uses a line search method to locate an approximate efficient point. Once the first Pareto solution is obtained, a simplified version of the former procedure is used in the context of Pareto dominance to obtain a set of efficient points, which will assure a thorough distribution of solutions on the Pareto frontier. In its current form, the proposed technique is well suited to problems with multiple objectives (it is not limited to bi-objective problems) and requires the functions to be twice continuously differentiable. To assess the effectiveness of this approach, experiments were performed and the results compared with two recent, well-known population-based metaheuristics, ParEGO and NSGA II. Compared with ParEGO and NSGA II, the proposed approach not only assures better convergence to the Pareto frontier but also yields a good distribution of solutions. From a computational point of view, both stages of the line search converge within a short time (on average about 150 ms for the first stage and about 20 ms for the second stage). In addition, the proposed technique is simple and easy to implement and use for solving multiobjective problems.
CNCSIS IDEI 2412, Romani
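The two notions this abstract builds on, aggregation of objectives and Pareto dominance, can be sketched in a few lines. This is not the paper's line search method; it is only a minimal illustration for minimization problems, and the candidate points, the equal weights, and the function names are all hypothetical.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """True if a Pareto-dominates b (minimization): a is no worse in every
    objective and strictly better in at least one."""
    no_worse = all(x <= y for x, y in zip(a, b))
    strictly_better = any(x < y for x, y in zip(a, b))
    return no_worse and strictly_better


def pareto_front(points: List[Point]) -> List[Point]:
    """Keep only the points that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


def weighted_sum(point: Sequence[float], weights: Sequence[float]) -> float:
    """Aggregate several objective values into one scalar."""
    return sum(w * x for w, x in zip(weights, point))


# Hypothetical objective vectors (f1, f2) of five candidate solutions.
candidates = [(1.0, 5.0), (2.0, 3.0), (3.0, 4.0), (4.0, 1.0), (2.0, 3.5)]

front = pareto_front(candidates)
best = min(candidates, key=lambda p: weighted_sum(p, (0.5, 0.5)))

print(front)  # [(1.0, 5.0), (2.0, 3.0), (4.0, 1.0)]
print(best)   # (2.0, 3.0)
```

A single weight vector returns one efficient point at a time, which is why a dominance-based step is still needed to collect a well-spread set of solutions.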
Sentiment Classification of Documents in Serbian: The Effects of Morphological Normalization and Word Embeddings
An open issue in the sentiment classification of texts written in Serbian is the effect of different forms of morphological normalization and the usefulness of leveraging large amounts of unlabeled texts. In this paper, we assess the impact of lemmatizers and stemmers for Serbian on classifiers trained and evaluated on the Serbian Movie Review Dataset. We also consider the effectiveness of using word embeddings, generated from a large unlabeled corpus, as classification features
Annotated corpus of Serbian language-related news articles MetaLangNEWS-Sr
A comprehensive corpus of news articles on the topic of language, published in major Serbian daily newspapers and news portals in the five-year period of January 1, 2015 - January 1, 2020. The corpus is designed to facilitate research on metalanguage ("language about language"), linguistic ideologies, language policy and planning, as well as the specific contemporary debates on language defining, naming, and standardisation, ongoing in post-Yugoslav societies.
The corpus has been tagged using the CLASSLA-StanfordNLP models for morphosyntactic annotation and lemmatisation of standard Serbian. The corpus is available as plain text, as XML with full metadata, and in tagged CoNLL-U format.
MetaLangNEWS-Sr is complemented with a separate corpus of citizen metalanguage comments, i.e. online comments to the news articles, available as MetaLangNEWS-COMMENTS-Sr (http://hdl.handle.net/11356/1372). Parallel versions from Slovenia (http://hdl.handle.net/11356/1360) and Croatia (http://hdl.handle.net/11356/1369) are also available
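For readers who want to work with the tagged version, the following is a minimal sketch of reading CoNLL-U text using only the standard library. It assumes the usual CoNLL-U layout (ten tab-separated columns per token, `#` comment lines, blank lines between sentences). The sample sentence is hand-written for illustration and is not taken from the corpus, and the function name is hypothetical.

```python
from typing import Dict, List

# The ten standard CoNLL-U columns, in order.
FIELDS = ["id", "form", "lemma", "upos", "xpos",
          "feats", "head", "deprel", "deps", "misc"]


def parse_conllu(text: str) -> List[List[Dict[str, str]]]:
    """Split CoNLL-U text into sentences, each a list of token dictionaries."""
    sentences: List[List[Dict[str, str]]] = []
    current: List[Dict[str, str]] = []
    for line in text.splitlines():
        if not line.strip():          # a blank line closes the current sentence
            if current:
                sentences.append(current)
                current = []
            continue
        if line.startswith("#"):      # sentence-level comments and metadata
            continue
        columns = line.split("\t")
        if len(columns) != len(FIELDS):
            raise ValueError(f"expected {len(FIELDS)} columns, got {len(columns)}")
        current.append(dict(zip(FIELDS, columns)))
    if current:                       # the file may lack a trailing blank line
        sentences.append(current)
    return sentences


# Hand-written sample, not an excerpt from MetaLangNEWS-Sr.
SAMPLE = (
    "# text = Jezik je bitan.\n"
    "1\tJezik\tjezik\tNOUN\t_\t_\t3\tnsubj\t_\t_\n"
    "2\tje\tbiti\tAUX\t_\t_\t3\tcop\t_\t_\n"
    "3\tbitan\tbitan\tADJ\t_\t_\t0\troot\t_\t_\n"
    "4\t.\t.\tPUNCT\t_\t_\t3\tpunct\t_\t_\n"
    "\n"
)

sentences = parse_conllu(SAMPLE)
lemmas = [token["lemma"] for token in sentences[0]]
print(lemmas)  # ['jezik', 'biti', 'bitan', '.']
```

Token IDs are kept as strings here because CoNLL-U also allows ranges such as `1-2` for multiword tokens.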