8 research outputs found

    D3.8 Lexical-semantic analytics for NLP

    UIDB/03213/2020 UIDP/03213/2020. The present document illustrates the work carried out in task 3.3 (work package 3) of the ELEXIS project, focused on lexical-semantic analytics for Natural Language Processing (NLP). This task aims at computing analytics for lexical-semantic information such as words, senses and domains in the available resources, and at investigating their role in NLP applications. Specifically, the task concentrates on three research directions: i) sense clustering, in which grouping senses based on their semantic similarity improves the performance of NLP tasks such as Word Sense Disambiguation (WSD); ii) domain labeling of text, in which the lexicographic resources made available by the ELEXIS project for research purposes allow better performance to be achieved; and iii) analysing the diachronic distribution of senses, for which a software package is made available.

    The influence of nutrition on the nutritional status of preschool children

    In this thesis, entitled »The influence of nutrition on the nutritional status of preschool children«, healthy nutrition and its significance are presented, together with guidelines for the healthy nutrition of children, nutritional values, and energy and nutrient density. The thesis also covers eating regimes, the nutrition of children in kindergarten, and the cooperation of the nurse and the parents with the kindergarten. The empirical part presents the findings of a study based on a questionnaire for parents of children at the Jadviga Golež kindergarten and the Podgorci kindergarten, conducted in May 2010. We examined whether the eating habits and the nutritional status of preschool children differ between the city and the countryside, and whether parents are sufficiently informed about the principles of healthy nutrition. The results show that the industrialization of fast food is more evident among urban children than among children from rural areas. The BMI (Body Mass Index) calculations show that the majority of children in both urban and rural areas have a normal body weight. The survey results also confirm that parents are sufficiently acquainted with the principles of healthy nutrition, as all surveyed parents were of the same opinion. The eating habits of urban and rural children do not differ significantly. Key words: healthy nutrition, nutritional status, preschool child, nurs

    Open Slovene WordNet OSWN 1.0

    Open Slovene WordNet (OSWN) is derived from Open English WordNet (https://en-word.net/), which itself is derived from Princeton WordNet by the Open English WordNet Community. OSWN 1.0 contains 95,262 synsets and a total of 164,904 literals. Not all synsets from Open English WordNet were included in this version; approximately 25,000 highly terminological synsets (such as those related to specific animal and plant species) require additional analysis and specific expertise. These will be included in later versions. OSWN 1.0 was constructed using several relevant existing datasets (sloWNet - http://hdl.handle.net/11356/1026; bilingual dictionaries, thesauri etc.) and methodologies (co-occurrence graphs, machine translation, ChatGPT). Finally, all synsets were manually checked by translators and lexicographers in order to remove inadequate literals. Each synset was checked by one annotator. The English definitions of synsets were translated automatically using ChatGPT. OSWN 1.0 includes both the English originals and their machine translations into Slovene. For additional information on the content and the format of OSWN, please consult the README file.

    Training corpus SUK 1.0

    The SUK training corpus contains about 1 million tokens manually annotated on the levels of tokenisation, sentence segmentation, morphosyntactic tagging, and lemmatisation, with some parts also containing further manually verified annotations. The morphosyntactic tags and (where present) syntactic dependencies are included both in the JOS/MULTEXT-East framework, as well as in the framework of Universal Dependencies. The corpus is composed of several parts:
    * ssj500k-syn (200,320 words): the syntactically annotated part of the updated ssj500k corpus 2.3 (http://hdl.handle.net/11356/1434), contains also named entity, verbal multiword expression and semantic role label annotations;
    * ssj500k-tag.xml (299,927 words): the PoS tagged part of the updated ssj500k corpus 2.3 (http://hdl.handle.net/11356/1434), contains also verbal multiword expressions annotations;
    * Ambiga (13,929 words): this corpus has been constructed to contain many potentially lemma/PoS ambiguous words in order to help in the training of taggers and lemmatizers;
    * ElexisWSD (27,091 words): the Slovenian part of the "Parallel sense-annotated corpus ELEXIS-WSD 1.0" (http://hdl.handle.net/11356/1674) with manually checked lemmatisation, PoS tagging, and syntactic parses; contains also named entity and semantic role label annotations;
    * SentiCoref (340,401 words): the "Slovene corpus for aspect-based sentiment analysis - SentiCoref 1.0" (http://hdl.handle.net/11356/1285) with manually checked lemmatisation and PoS tagging; contains also named entity and coreference chain annotation.
The annotations follow: (1) the MULTEXT-East V6 morphosyntactic specifications for Slovene, https://nl.ijs.si/ME/V6/msd/, (2) the JOS dependency schema, https://nl.ijs.si/jos/bib/jos-skladnja-navodila.pdf, (3) the Universal Dependencies morphosyntactic specifications and syntactic dependencies for Slovene-SSJ, https://universaldependencies.org/, (4) the Janes annotation guidelines for Slovenian named entities, https://nl.ijs.si/janes/wp-content/uploads/2017/09/SlovenianNER-eng-v1.1.pdf, (5) the Guidelines of the PARSEME shared task on verbal multiword expressions, http://parsemefr.lif.univ-mrs.fr/parseme-st-guidelines/1.1/. The vocabulary of (1) is provided in the back element and (3)-(5) as taxonomies in the teiHeader of the TEI encoded corpus. The semantic role labels are also documented in the teiHeader. In contrast to the previous version ssj500k 2.3, this version has significantly more text, corrects various errors in annotation, annotates more text with syntactic parses, adds new types of annotation, updates the TEI encoding, provides CoNLL-U files with text metadata and distinguishes UD-type CoNLL-U files from JOS-type CoNLL-U files

    Parallel sense-annotated corpus ELEXIS-WSD 1.0

    ELEXIS-WSD is a parallel sense-annotated corpus in which content words (nouns, adjectives, verbs, and adverbs) have been assigned senses. Version 1.0 contains sentences for 10 languages: Bulgarian, Danish, English, Spanish, Estonian, Hungarian, Italian, Dutch, Portuguese, and Slovene. The corpus was compiled by automatically extracting a set of sentences from WikiMatrix (Schwenk et al., 2019), a large open-access collection of parallel sentences derived from Wikipedia, using an automatic approach based on multilingual sentence embeddings. The sentences were manually validated according to specific formal, lexical and semantic criteria (e.g. by removing incorrect punctuation, morphological errors, notes in square brackets and etymological information typically provided in Wikipedia pages). To obtain satisfactory semantic coverage, sentences with fewer than 5 words and fewer than 2 polysemous words were filtered out. Subsequently, in order to obtain datasets in the other nine target languages, for each selected sentence in English, the corresponding WikiMatrix translation into each of the other languages was retrieved. If no translation was available, the English sentence was translated manually. The resulting corpus comprises 2,024 sentences for each language. The sentences were tokenized, lemmatized, and tagged with POS tags using UDPipe v2.6 (https://lindat.mff.cuni.cz/services/udpipe/). Senses were annotated using LexTag (https://elexis.babelscape.com/): each content word (noun, verb, adjective, and adverb) was assigned a sense from among the available senses from the sense inventory selected for the language (see below) or BabelNet. Sense inventories were also updated with new senses during annotation.
List of sense inventories:
BG: Dictionary of Bulgarian
DA: DanNet – The Danish WordNet
EN: Open English WordNet
ES: Spanish Wiktionary
ET: The EKI Combined Dictionary of Estonian
HU: The Explanatory Dictionary of the Hungarian Language
IT: PSC + Italian WordNet
NL: Open Dutch WordNet
PT: Portuguese Academy Dictionary (DACL)
SL: Digital Dictionary Database of Slovene

The corpus is available in a CoNLL-like tab-separated format. In order, the columns contain the token ID, its form, its lemma, its UPOS-tag, its whitespace information (whether the token is followed by a whitespace or not), the ID of the sense assigned to the token, and the index of the multiword expression (if the token is part of an annotated multiword expression). Each language has a separate sense inventory containing all the senses (and their definitions) used for annotation in the corpus. Not all the senses from the sense inventory are necessarily included in the corpus annotations: for instance, all occurrences of the English noun "bank" in the corpus might be annotated with the sense of "financial institution", but the sense inventory also contains the sense "edge of a river" as well as all other possible senses to disambiguate between. For more information, please refer to 00README.txt.
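
The seven-column layout described above could be read with a small parser such as the sketch below. It is only an illustration under stated assumptions: the description does not say how sentences are delimited, how empty cells are written, or how whitespace information is encoded, so the blank-line sentence separator, the "#" comment lines, the "_" placeholder, and all sample values (including the sense IDs) are hypothetical. The 00README.txt remains the authoritative reference.

```python
from dataclasses import dataclass
from typing import List, Optional

# Assumption: empty cells are written as "_" or left blank.
EMPTY_CELLS = {"", "_"}


@dataclass
class Token:
    token_id: str
    form: str
    lemma: str
    upos: str
    whitespace: str            # raw whitespace information, kept as-is
    sense_id: Optional[str]    # None when no sense was assigned
    mwe_index: Optional[str]   # None when the token is not part of an MWE


def _optional(cell: str) -> Optional[str]:
    """Map an empty cell to None, otherwise return the cell unchanged."""
    return None if cell in EMPTY_CELLS else cell


def parse_sentences(text: str) -> List[List[Token]]:
    """Split tab-separated rows into sentences of Token objects.

    Assumption: sentences are separated by blank lines and lines starting
    with "#" are comments, as in CoNLL-style files.
    """
    sentences: List[List[Token]] = []
    current: List[Token] = []
    for line in text.splitlines():
        if not line.strip():
            if current:
                sentences.append(current)
                current = []
            continue
        if line.startswith("#"):
            continue
        cols = line.split("\t")
        cols += [""] * (7 - len(cols))  # tolerate rows with missing trailing cells
        current.append(Token(cols[0], cols[1], cols[2], cols[3], cols[4],
                             _optional(cols[5]), _optional(cols[6])))
    if current:
        sentences.append(current)
    return sentences


# Invented sample rows; the sense IDs and whitespace values are placeholders.
SAMPLE = (
    "1\tThe\tthe\tDET\tyes\t_\t_\n"
    "2\tbank\tbank\tNOUN\tyes\tsense-001\t_\n"
    "3\tclosed\tclose\tVERB\tno\tsense-002\t_\n"
    "4\t.\t.\tPUNCT\tno\t_\t_\n"
    "\n"
    "1\tHello\thello\tINTJ\tno\t_\t_\n"
)

sentences = parse_sentences(SAMPLE)
annotated = [t.lemma for t in sentences[0] if t.sense_id is not None]
```

Here `annotated` collects the lemmas of the sense-annotated tokens in the first sample sentence. Keeping the whitespace column as a raw string avoids guessing at an encoding that the description does not specify.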

    Designing the ELEXIS Parallel Sense-Annotated Dataset in 10 European Languages

    Over the course of the last few years, lexicography has witnessed the burgeoning of increasingly reliable automatic approaches supporting the creation of lexicographic resources such as dictionaries, lexical knowledge bases and annotated datasets. In fact, recent achievements in the field of Natural Language Processing and particularly in Word Sense Disambiguation have widely demonstrated their effectiveness not only for the creation of lexicographic resources, but also for enabling a deeper analysis of lexical-semantic data both within and across languages. Nevertheless, we argue that the potential derived from the connections between the two fields is far from exhausted. In this work, we address a serious limitation affecting both lexicography and Word Sense Disambiguation, i.e. the lack of high-quality sense-annotated data and describe our efforts aimed at constructing a novel entirely manually annotated parallel dataset in 10 European languages. For the purposes of the present paper, we concentrate on the annotation of morpho-syntactic features. Finally, unlike many of the currently available sense-annotated datasets, we will annotate semantically by using senses derived from high-quality lexicographic repositories

    Parallel sense-annotated corpus ELEXIS-WSD 1.1

    ELEXIS-WSD is a parallel sense-annotated corpus in which content words (nouns, adjectives, verbs, and adverbs) have been assigned senses. Version 1.1 contains sentences for 10 languages: Bulgarian, Danish, English, Spanish, Estonian, Hungarian, Italian, Dutch, Portuguese, and Slovene. The corpus was compiled by automatically extracting a set of sentences from WikiMatrix (Schwenk et al., 2019), a large open-access collection of parallel sentences derived from Wikipedia, using an automatic approach based on multilingual sentence embeddings. The sentences were manually validated according to specific formal, lexical and semantic criteria (e.g. by removing incorrect punctuation, morphological errors, notes in square brackets and etymological information typically provided in Wikipedia pages). To obtain satisfactory semantic coverage, sentences with fewer than 5 words and fewer than 2 polysemous words were filtered out. Subsequently, in order to obtain datasets in the other nine target languages, for each selected sentence in English, the corresponding WikiMatrix translation into each of the other languages was retrieved. If no translation was available, the English sentence was translated manually. The resulting corpus comprises 2,024 sentences for each language. The sentences were tokenized, lemmatized, and tagged with POS tags using UDPipe v2.6 (https://lindat.mff.cuni.cz/services/udpipe/). Senses were annotated using LexTag (https://elexis.babelscape.com/): each content word (noun, verb, adjective, and adverb) was assigned a sense from among the available senses from the sense inventory selected for the language (see below) or BabelNet. Sense inventories were also updated with new senses during annotation.
List of sense inventories:
BG: Dictionary of Bulgarian
DA: DanNet – The Danish WordNet
EN: Open English WordNet
ES: Spanish Wiktionary
ET: The EKI Combined Dictionary of Estonian
HU: The Explanatory Dictionary of the Hungarian Language
IT: PSC + Italian WordNet
NL: Open Dutch WordNet
PT: Portuguese Academy Dictionary (DACL)
SL: Digital Dictionary Database of Slovene

The corpus is available in the CoNLL-U tab-separated format. In order, the columns contain the token ID, its form, its lemma, its UPOS-tag, five empty columns (reserved for e.g. dependency parsing, which is absent from this version), and the final MISC column containing the following: the token's whitespace information (whether the token is followed by a whitespace or not), the ID of the sense assigned to the token, and the index of the multiword expression (if the token is part of an annotated multiword expression). Each language has a separate sense inventory containing all the senses (and their definitions) used for annotation in the corpus. Not all the senses from the sense inventory are necessarily included in the corpus annotations: for instance, all occurrences of the English noun "bank" in the corpus might be annotated with the sense of "financial institution", but the sense inventory also contains the sense "edge of a river" as well as all other possible senses to disambiguate between. For more information, please refer to 00README.txt.

Differences to version 1.0:
- Several minor errors were fixed (e.g. a typo in one of the Slovene sense IDs).
- The corpus was converted to the true CoNLL-U format (as opposed to the CoNLL-U-like format used in v1.0).
- An error was fixed that resulted in missing UPOS tags in version 1.0.
- The sentences in all corpora now follow the same order (from 1 to 2024).
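
Since version 1.1 stores the whitespace information, sense ID and multiword-expression index together in the MISC column, a reader has to unpack that column. The sketch below relies only on the general CoNLL-U convention that MISC holds "|"-separated Key=Value items, with "_" for an empty cell. The attribute names in the sample (`SpaceAfter`, `SenseID`) and the sense IDs are placeholders, not confirmed by the description above; the actual keys should be taken from 00README.txt.

```python
from typing import Dict, Iterator, Tuple

TokenRow = Tuple[str, str, str, str, Dict[str, str]]


def parse_misc(misc: str) -> Dict[str, str]:
    """Unpack a CoNLL-U MISC cell ("A=1|B=2", or "_" when empty) into a dict."""
    if misc in ("", "_"):
        return {}
    pairs: Dict[str, str] = {}
    for item in misc.split("|"):
        key, sep, value = item.partition("=")
        pairs[key] = value if sep else ""  # a bare key gets an empty value
    return pairs


def read_tokens(text: str) -> Iterator[TokenRow]:
    """Yield (id, form, lemma, upos, misc_dict) for every token line."""
    for line in text.splitlines():
        # Skip sentence separators and "#" comment lines.
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) != 10:
            raise ValueError(f"expected 10 CoNLL-U columns, got {len(cols)}")
        yield cols[0], cols[1], cols[2], cols[3], parse_misc(cols[9])


# Invented sample; the MISC keys and sense IDs are hypothetical.
SAMPLE = (
    "# text = River bank.\n"
    "1\tRiver\triver\tNOUN\t_\t_\t_\t_\t_\tSenseID=example-1\n"
    "2\tbank\tbank\tNOUN\t_\t_\t_\t_\t_\tSpaceAfter=No|SenseID=example-2\n"
    "3\t.\t.\tPUNCT\t_\t_\t_\t_\t_\t_\n"
)

tokens = list(read_tokens(SAMPLE))
```

Parsing MISC generically, rather than hard-coding attribute names, keeps the reader usable whichever keys the corpus actually employs. The strict ten-column check reflects the stated move to true CoNLL-U in this version.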