2 research outputs found

Modelling frequency and attestations for OntoLex-Lemon

Author: Chiarcos Christian; de Does Jesse; Declerck Thierry; Depuydt Katrien; Fahad Khan Anas; Ionov Maxim; McCrae John Philip; Stolk Sander
Publication date: 24/04/2023

The OntoLex vocabulary enjoys increasing popularity as a means of publishing lexical resources with RDF and as Linked Data. The recent publication of a new OntoLex module for lexicography, lexicog, reflects its increasing importance for digital lexicography. However, not all aspects of digital lexicography have been covered to the same extent. In particular, supplementary information drawn from corpora, such as frequency information, links to attestations, and collocation data, was considered to be beyond the scope of lexicog. Therefore, the OntoLex community has put forward a proposal for a novel module for frequency, attestation and corpus information (FrAC) that not only covers the requirements of digital lexicography but also accommodates essential data structures for lexical information in natural language processing. This paper introduces the current state of the OntoLex-FrAC vocabulary and describes its structure, selected use cases, elementary concepts, and fundamental definitions, with a focus on frequency and attestations.

OPUS Augsburg

Dictionaries in digital age - information technology support for Serbian language

Author: Rujević Biljana Đ.
Publication venue: University of Belgrade, Faculty of Philology
Publication date: 31/08/2022

Serbian morphological dictionaries are an electronic language resource with a long history of development and use in natural language processing. They were stored as files, and as the number of files grew, managing the dictionaries became more difficult, so it became necessary to store the dictionary information in a lexicographic database. To let several users work on dictionary development simultaneously, a web application based on the lexicographic database was needed. To assess the functionalities that dictionaries provide in a digital environment and to find the best solution for application development, various examples of digital dictionaries of several languages were analyzed using the descriptive method. To select an adequate model for the lexicographic database, three standardized models for representing dictionary information were considered: TEI, LMF and lemon. The model of the developed lexicographic database is based on a combination of the LMF and lemon models. Descriptive and informatics scientific methods were used in considering and developing the database model. The lexicographic database enabled advanced search as well as the establishment of relations between lexical entries. Relations are established by defining groups of rules that the lexical entries to be linked must satisfy.
The use of the lexicographic database and of the application for browsing and managing dictionaries made it possible to upgrade the Serbian morphological dictionaries as a resource. Lexical entries are enriched with links to external lexical resources such as Wordnet, Termi, BabelNet, Glosbe and Wikidata. It is also possible to link to entries from digitized traditional dictionaries of the Serbian language, which could be made available to groups of users who have the right to use them in digital form. Lexical entries are linked with corpora using regular expressions and finite automata, which makes it possible to search for concordances that contain the lemma of an entry or predefined patterns of word occurrence. The entries are also extended with information on the frequency of occurrence of lemmas and word forms in certain corpora. The developed application and database were tested on dictionaries excerpted from GeoSrpKor, a corpus from the geological domain that was developed for the purposes of this research.

National Repository of Dissertations in Serbia (NaRDuS)
