NERosetta for the Named Entity Multi-lingual Space
Named Entity Recognition has been a hot topic in Natural Language Processing for more than fifteen years. A number of systems for various languages have been developed using different approaches and based on different named entity schemes and tagging strategies. We present the NERosetta web application, which can be used to compare these approaches as applied to aligned texts (bitexts). To illustrate its functionality, we used one literary text, its 7 bitexts involving 5 languages, and 5 different NER systems. We present some preliminary results and give guidelines for further development.
The development of library and language resources for organizing and finding information on spatial planning
Bearing in mind that timely access to relevant information, as well as the definition and development of adequate terminology, is a prerequisite for new research and further development in any scientific field, this dissertation presents the possibilities for retrieving information and extracting terms from a sample corpus of spatial planning using a number of modern technologies.
The study points out many benefits, but also certain limitations, of searching for information using library information systems, geographic information systems and the RAUmPLAN repository.
The dissertation then describes the content, the compilation procedure and the confirmed representativeness of the sample corpus of spatial planning. Text processing, which includes tokenization, lemmatization, part-of-speech tagging and term extraction, was carried out using the Unitex tool. The corpus was then loaded into the NoSketch platform, where a set of queries confirmed the importance of this preprocessing: it enables searches with significantly higher recall and precision.
By separating the texts of spatial plans from the sample corpus, the PPTXM sub-corpus was formed, on which the remaining research was conducted. Using advanced methods and technologies, various groups of named entities were annotated and extracted with the SrpNER tool. A significant contribution of this dissertation is the linking of named entities, within the INCEpTION environment, to items in the Wikidata knowledge base. This knowledge base made it possible to group items according to given criteria by means of SPARQL queries. The output sets were visualized as maps, graphs, tables and frames with photographs.
Hierarchical analysis in the TXM environment revealed the structural features of the corpus: the number of texts, paragraphs, sentences and corpus words. Using morphological tags, the frequency of different parts of speech and punctuation marks in the entire corpus was determined within TXM. Since TXM can also display specific linguistic phenomena, it was possible to track the progression, i.e., the cumulative frequency, of different parts of speech, both across the whole corpus and within its constituent parts.
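
The notion of progression mentioned in this abstract, i.e., the cumulative frequency of a part of speech as one moves through a corpus, can be illustrated with a minimal Python sketch. This is not TXM's implementation; the function name `progression` and the tag labels are hypothetical and chosen only for illustration.

```python
from itertools import accumulate


def progression(tags, target):
    """Return the running count of `target` at each token position.

    `tags` is a sequence of part-of-speech labels, one per token,
    in corpus order. The result has the same length as `tags`.
    """
    return list(accumulate(1 if tag == target else 0 for tag in tags))


# Hypothetical part-of-speech tags for a six-token text.
tags = ["N", "V", "N", "PUNCT", "A", "N"]
noun_progression = progression(tags, "N")
```

Plotting such a list against token position gives a curve whose slope shows where in the corpus the chosen part of speech is dense or sparse; applying the function to a single text's tags gives the same view for one constituent part.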