11 research outputs found


    The paper provides an overview of areas related to natural language processing and their interrelationships, starting from the broader domain of artificial intelligence and moving through machine learning, computational linguistics and machine translation methods, especially those based on deep learning. The characteristics, applications, phases and main problems of natural language processing are described from the lexical, syntactic, semantic, speech and pragmatic perspectives. The phases of natural language recognition and analysis are described, as is the natural language generation phase. Pre-editing and post-editing procedures using controlled natural languages are given as examples of practices that increase the accuracy and quality of automatic translation and of text processing in general. Special focus is given to machine translation and its methods. The statistical, rule-based, example-based, hybrid and deep learning-based approaches to machine translation are described and discussed with regard to their advantages, disadvantages and appropriate application in practice. Finally, still unresolved challenges are presented as directions for future research in natural language processing, along with the importance of further development of the deep learning-based approach.

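    Development of a model for predicting the indirect tensile strength of specimens made of reclaimed asphalt using machine learning (original title: Razvoj modela za predviđanje čvrstoće pri indirektnom zatezanju uzoraka napravljenih od struganog asfalta primenom mašinskog učenja)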

    This paper develops a model for predicting the indirect tensile strength (ITS) of reclaimed asphalt pavement (RAP) specimens, based on machine learning (ML) methods. Principal component analysis (PCA) was used to reduce the dataset describing the gradation of the RAP specimens. Several multivariate polynomial regression (MPR) models were developed that take into account the characteristics of the RAP (content and penetration of the aged bitumen, gradation curves before and after bitumen extraction), the specimen preparation procedure (heating temperature) and the characteristics of the specimens (air void content). The analysis showed that the PCA transformation can be used reliably to reduce the gradation dataset (74% of the variance in the data is described by the first two principal components). It was also concluded that the simplest (linear) multiple regression model is the most accurate of all the models analyzed, with a coefficient of determination of 0.59, which can be considered high for the given dataset (more than 40 RAP samples from different sources).
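
    The two figures quoted in this abstract, the share of variance captured by the first principal components and the coefficient of determination, can be illustrated with a minimal Python sketch. The function names and all numbers below are hypothetical and chosen only for illustration; they are not the paper's data, code or eigenvalues.

```python
def explained_variance_share(eigenvalues, k):
    """Share of total variance captured by the k largest eigenvalues."""
    ordered = sorted(eigenvalues, reverse=True)
    return sum(ordered[:k]) / sum(ordered)


def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_observed = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_observed) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


# Hypothetical covariance eigenvalues (assumption, not from the paper).
eigenvalues = [5.0, 2.4, 1.6, 1.0]
print(explained_variance_share(eigenvalues, 2))

# Hypothetical measured vs. predicted strengths (assumption, not from the paper).
observed = [0.50, 0.62, 0.71, 0.80]
predicted = [0.55, 0.60, 0.70, 0.78]
print(r_squared(observed, predicted))
```

    A model that always predicts the mean of the observations scores 0 on this measure, and a perfect model scores 1, which is the scale against which the reported 0.59 is read.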

    Construction of a formal grammar of Serbian using a metagrammar

    This thesis presents the process of creating the basis of an FBLTAG grammar of the Serbian language, followed by the process of building its metagrammar, whose application to basic sentence models of Serbian allows for their automatic syntactic analysis. The first chapter gives an introduction to the field of natural language processing by outlining the history of the discipline and its subfields. The chapter focuses on automatic processing of the Serbian language, providing an overview of the results achieved so far, from the analysis of phonetics and phonology to parsing. Tools and resources that have been developed for the Serbian language are listed for each of these fields. The second chapter offers an overview of the concept of formal grammar and then focuses on unification grammars as the framework for the thesis. The structure of the unification grammar that is later applied to Serbian, FBLTAG, is presented in detail. The second part of the chapter introduces the concept of metagrammar, as well as the specific metagrammar, XMG, used in the thesis to describe FBLTAG concisely. XMG is presented in detail, covering its structure, purpose and principles of operation, as well as prospects for its use on the Serbian language.

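    A methodology for solving semantic problems in the processing of short texts written in natural languages with limited resources (original title: Metodologija rešavanja semantičkih problema u obradi kratkih tekstova napisanih na prirodnim jezicima sa ograničenim resursima)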

    Statistical approaches to natural language processing typically require considerable amounts of labeled data, and often various auxiliary language tools as well, which limits their applicability in resource-limited settings. This thesis presents a methodology for developing statistical solutions in the semantic processing of natural languages with limited resources. In these languages, not only are existing language resources limited, but so are the capabilities for developing new datasets and dedicated tools and algorithms. The proposed methodology focuses on short texts because of their prevalence in digital communication and the greater complexity of their semantic processing. The methodology encompasses all phases in the creation of statistical solutions, from the collection of textual content, through data annotation, to the formulation, training and evaluation of machine learning models. Its use is illustrated in detail on two semantic tasks: sentiment analysis and semantic textual similarity. Serbian is used as an example of a language with limited resources, but the proposed methodology can also be applied to other languages in this category. In addition to the general methodology, the contributions of this thesis include a new, flexible short-text sentiment annotation system, a new annotation cost-effectiveness metric, and several new semantic textual similarity models. The thesis results also include the creation of the first publicly available annotated datasets of short texts in Serbian for the tasks of sentiment analysis and semantic textual similarity, the development and evaluation of numerous models on these tasks, and the first comparative evaluation of multiple morphological normalization tools on short texts in Serbian.

    The impact of text classification on applications in natural language processing (original title: Uticaj klasifikacije teksta na primene u obradi prirodnih jezika)

    The main goal of this dissertation is to put different text classification tasks into the same frame by mapping the input data into a common vector space of linguistic attributes. Several classification problems of great importance for natural language processing are then solved by applying appropriate classification algorithms. The dissertation deals with the problem of validating bilingual translation pairs, where the final goal is to construct a classifier that substitutes for human evaluation and decides, using a variety of linguistic information and methods, whether a pair is a proper translation between the two languages. In dictionaries it is useful to have a sentence that demonstrates the use of a particular dictionary entry. This task is called the classification of good dictionary examples, and the thesis develops a method that automatically estimates whether an example is good or bad for a specific dictionary entry. Two cases of short message classification are also discussed. In the first case, the classes are the authors of the messages, and the task is to assign each message to its author from a fixed set. This task is called authorship identification. The other short message classification task is called opinion mining, or sentiment analysis. Starting from the assumption that a short message carries a positive or negative attitude about a thing, or is purely informative, the classes can be positive, negative and neutral. These tasks are of great importance in the field of natural language processing, and the proposed solutions are language-independent and based on machine learning methods: support vector machines, decision trees and gradient boosting. For all of these tasks, the effectiveness of the proposed methods is demonstrated for the Serbian language.

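    Implementation of smart health services and their integration into the smart city concept (original title: Realizacija servisa pametnog zdravstva i njihova integracija u koncept pametnih gradova)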

    The development of information technologies has contributed to the emergence of the concept of a smart city, and one of the components of this concept is smart health. This component is important because smart cities directly or indirectly affect the health of residents. The main contributions of this doctoral dissertation are a proposal of a software platform that integrates existing public health services as well as new e-health services, a proposal for a uniform way of integrating heterogeneous smart health services into a smart city, and a proposal and implementation of artificial intelligence-based methods for labelling terms in medical texts, without which part of the smart health services could not be realized efficiently. The software platform combines the proposed smart health services (services for information and prevention of diseases, epidemic control, search of medical documents, automatic questionnaire processing, air pollution monitoring, labelling of medical texts, organization of screening programs, etc.). These services are built on input data of different types (sensor data on locations and air pollution, text from documents and medical information systems, data collected by crowdsourcing, and data from relational and non-relational databases), so these heterogeneous services need to be integrated in a uniform way. Part of the proposed e-health services is based on processing data in medical information systems as well as medical text documents in the Serbian language. Methods for normalization, labelling and classification of terms in the Serbian language are a prerequisite for the successful implementation of these services, and the methods proposed and implemented in this dissertation achieve an F1-score of 0.9082, which the author considers an excellent result compared to methods for this purpose in other languages. Their implementation requires artificial intelligence methods such as natural language processing, data and text mining, and machine learning. Some of the proposed e-health services are practically implemented and integrated into the smart city concept.
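
    For readers unfamiliar with the F1-score cited in this abstract, it is the harmonic mean of precision and recall. The Python sketch below shows the standard definition; the function name and the counts are hypothetical and are not taken from the dissertation's evaluation.

```python
def f1_score(true_positives, false_positives, false_negatives):
    """Harmonic mean of precision and recall, computed from raw counts."""
    if true_positives == 0:
        # No correct labels at all: precision and recall are both zero.
        return 0.0
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts for a term-labelling run (assumption, for illustration only).
print(f1_score(true_positives=90, false_positives=10, false_negatives=8))
```

    Because the harmonic mean is dominated by the smaller of its two inputs, a high F1-score requires both few spurious labels and few missed terms.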

    Finite state models in information extraction

    This dissertation investigates the scientific field of information extraction, which can be seen as a sub-area of artificial intelligence and which combines techniques and achievements from several areas of computer science. The term "information extraction" is used in two different contexts. In the first, it refers to the scientific field, and the acronym IE is used. In the second, it refers to the process of extracting information from text itself. Besides a survey of the state of the art in IE, the dissertation presents an original approach and method for information extraction based on finite state transducers. As a result of the research, a database of microbial phenotype and genotype characteristics for 2412 species and 873 genera has been created, intended for research in bioinformatics and genetics. The database and the method used to create it are described in detail in several papers published in journals or presented at international conferences (Pajić, 2011; Pajić et al. 2011a; Pajić et al. 2011b). Section 1 gives an introduction to IE, together with the history of the development of methods in this area. The classification of the textual resources used for information extraction and the classification of the information itself are described. At the end of Section 1, IE is compared with other related disciplines of computer science. Section 2 presents the theoretical foundations of the dissertation, drawn from formal language theory and finite state models. The mutual relationship between these two areas and their connection with IE are described. The emphasis is on finite state models and the methods based on them. These methods show higher precision than other information extraction methods and are indispensable in situations where the accuracy of the data extracted from text is of crucial importance. Some specific terms of information extraction (the language of relevant information, the language of extracted information, and extraction rules) are defined from the perspective of formal language theory. The basic property of the transduction relation for a given extraction rule is formulated and proved. The language of information context is defined and its regularity is proved.
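
    To give a rough idea of what finite-state extraction with a left and right context looks like, here is a deliberately simplified Python sketch. The rule (the tokens "temperature of", then a number, then the unit "C"), the state names and the function name are all hypothetical and are not taken from the dissertation; a real system would use compiled transducers over much richer patterns.

```python
import re

# Hypothetical extraction rule (assumption, not from the dissertation):
# left context "temperature of", a numeric value, right context "C".
TRANSITIONS = {
    ("start", "temperature"): "seen_temperature",
    ("seen_temperature", "of"): "expect_value",
}
NUMBER = re.compile(r"[0-9]+(\.[0-9]+)?")


def extract_temperatures(tokens):
    """Run a small deterministic transducer over tokens and emit matched values."""
    state = "start"
    pending = None
    output = []
    for token in tokens:
        word = token.lower()
        if state == "expect_value" and NUMBER.fullmatch(word):
            pending = word            # remember the value until the unit confirms it
            state = "expect_unit"
        elif state == "expect_unit" and word == "c":
            output.append(float(pending))  # right context matched: emit the value
            pending = None
            state = "start"
        else:
            # Follow a defined transition, or restart (possibly on this same token).
            state = TRANSITIONS.get((state, word),
                                    TRANSITIONS.get(("start", word), "start"))
            pending = None
    return output


sentence = "Growth occurs at a temperature of 37 C and rarely at a temperature of 45 C"
print(extract_temperatures(sentence.split()))
```

    A value is emitted only when both contexts match, which is one way to read the abstract's point that finite-state methods favour precision: the number in "a temperature of 37 degrees" would be skipped here because the right context does not match the rule.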

    An ontology-based model for risk management in mining

    Mining production involves complex technological systems, which creates the need to establish and improve risk management systems. The heterogeneity and volume of data necessary for risk management require a system that integrates them in a flexible way and enables their optimal use. The main goal of this dissertation is to develop an ontology for the mining domain and a risk management model based on it. Its realization includes the implementation of information extraction algorithms for populating the ontology, as well as an appropriate software solution. The development of the model also includes a significant expansion of the mining corpus and the creation of a terminological database, carried out using methods of computational linguistics and a corpus of documents from the mining domain (plans, reports, laws, textbooks and monographs). A descriptive method was used for systematizing the data, finite automata and statistical analyses for information extraction, and comparative and analytical research methods for evaluating and interpreting the results. The following information technology tools were used for model development: UML for concept modeling, OWL for ontology development, SWRL rules for the inference mechanism, and the query languages CQL over the corpus and SPARQL over the ontology. The research results show that it is possible to formalize information and knowledge about risks in mining and to develop a model that will improve the efficiency of risk management and assist mine management in making decisions on implementing measures to reduce the impact of risks identified in a mine. Achieving the goals of this dissertation contributes to increased efficiency in the identification and analysis of, and response to, risk by building a specific domain ontology for risks in mining.

    Multimedia databases in managing the intangible cultural heritage

    The motivation for writing this doctoral dissertation is a multimedia collection that is the result of many years of field research conducted by researchers from the Institute for Balkan Studies of the Serbian Academy of Sciences and Arts. The collection consists of materials in the form of recorded interviews, various recorded customs, associated textual descriptions (protocols) and numerous other documents. The subject of this dissertation is the study of possibilities and the development of new methods that could serve as a starting point in solving the problem of managing the intangible cultural heritage of the Balkans. The subtasks that emerge are: the development of an adequate design and implementation of a multimedia database of intangible cultural heritage that would meet the needs of different types of users; automatic semantic annotation of protocols using natural language processing methods, as a basis for semi-automatic annotation of the multimedia collection and for successful search by metadata that comply with the CIDOC CRM standard; the study of additional ways of searching this collection in order to gain new knowledge; and the development of selected methods. The main problem with the available methods is that the infrastructure for natural language processing, organization and management in the field of cultural heritage in the Balkans, and especially for the Serbian language, is still not developed enough to be used effectively to solve the stated problem. There is thus a strong need to develop methods that lead to an appropriate solution. For the semi-automatic annotation of multimedia materials, automatic semantic annotation of the protocols associated with the materials was used. It was carried out by methods of information extraction, named entity recognition and topic extraction, using rule-based techniques with the help of additional resources such as electronic dictionaries, thesauri and vocabularies from a specific domain. To classify textual protocols by topic, research was conducted on methods that can be applied to the problem of classifying texts in the Serbian language, and a method is offered that is adapted to the specific domain being processed (intangible cultural heritage), to the specific problem being solved (classification of protocols by topic) and to the Serbian language, as one of the morphologically rich languages.