NERosetta for the Named Entity Multi-lingual Space

Authors: Krstev Cvetana, Kyriacopoulou Tita, Vitas Duško, Zečević Anđelka
Publication venue: Springer Fachmedien Wiesbaden GmbH
Publication date: 07/10/2013

Named Entity Recognition has been a hot topic in Natural Language Processing for more than fifteen years. A number of systems for various languages have been developed using different approaches and based on different named entity schemes and tagging strategies. We present the NERosetta web application, which can be used to compare these various approaches applied to aligned texts (bitexts). To illustrate its functionalities, we have used one literary text, its 7 bitexts involving 5 languages, and 5 different NER systems. We present some preliminary results and give guidelines for further development.

Hal-Diderot

The development of library and language resources for organizing and finding information on spatial planning

Author: Milinković Milena
Publication venue: University of Belgrade, Faculty of Philology
Publication date: 27/09/2022

Bearing in mind that timely access to relevant information, together with the definition and development of adequate terminology, is a prerequisite for new research and further development in any scientific field, this dissertation presents the possibilities for retrieving information and extracting terms from a sample corpus of spatial planning using a number of modern technologies. The study points out many benefits, but also certain limitations, of searching for information using Library Information Systems, Geographic Information Systems and the RAUmPLAN repository. It then describes the content, the preparation process and the confirmed representativeness of the sample corpus of spatial planning. Text processing, comprising tokenization, lemmatization, part-of-speech tagging and term extraction, was carried out with the Unitex tool. The corpus was then loaded onto the NoSketch platform, where queries confirmed the importance of this preprocessing, which makes it possible to search with significantly higher recall and precision. By separating the texts of spatial plans from the sample corpus, the PPTXM sub-corpus was formed, on which the remaining research was conducted. The SrpNER tool was used to annotate and extract various groups of named entities. A significant contribution of this dissertation is the linking of named entities, in the INCEpTION environment, to items from the Wikidata knowledge base. This knowledge base made it possible to group items according to given criteria by means of SPARQL queries. The output sets were visualized as maps, graphs, tables and frames with photographs.
Hierarchical analysis in the TXM environment revealed the structural features of the corpus: the number of texts, paragraphs, sentences and corpus words. Using morphological tags, the frequency of different parts of speech and punctuation marks in the entire corpus was determined within the TXM system. Since the TXM system also allows specific linguistic phenomena to be displayed, it was possible to track the progression, i.e., the cumulative frequency, of different parts of speech, both across the whole corpus and within its constituent parts.

National Repository of Dissertations in Serbia (NaRDuS)