Using the linguistic knowledge in BulTreeBank for the selection of the correct parses

Author: Osenova Petya
Simov Kiril
Publication date: 01/12/2010

Proceedings of the Ninth International Workshop on Treebanks and Linguistic Theories. Editors: Markus Dickinson, Kaili Müürisep and Marco Passarotti. NEALT Proceedings Series, Vol. 9 (2010), 163-174. © 2010 The editors and contributors. Published by Northern European Association for Language Technology (NEALT) http://omilia.uio.no/nealt . Electronically published at Tartu University Library (Estonia) http://hdl.handle.net/10062/15891

DSpace at Tartu University Library

Applying A Normalized Compression Metric To The Measurement Of Dialect Distance

Author: Osenova Petya
Simov Kiril
Publication venue: Institute of Mathematics and Informatics Bulgarian Academy of Sciences
Publication date: 01/01/2007

The paper discusses the application of a similarity metric based on compression to the measurement of the distance among Bulgarian dialects. The similarity metric is defined on the basis of the notion of Kolmogorov complexity of a file (or binary string). The application of Kolmogorov complexity in practice is not possible because its calculation over a file is an undecidable problem. Thus, the actual similarity metric is based on a real-life compressor which only approximates the Kolmogorov complexity. To use the metric for distance measurement of Bulgarian dialects we first represent the dialectological data in such a way that the metric is applicable. We propose two such representations which are compared to a baseline distance between dialects. Then we conclude the paper with an outline of our future work.

Bulgarian Digital Mathematics Library at IMI-BAS
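
The compression-based metric described in the abstract above can be illustrated with a minimal sketch. It assumes the commonly used normalized compression distance, NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), where C is the compressed length, and it uses zlib as the real-life compressor. The abstract does not name the compressor or the exact formula the authors used, and the word lists below are hypothetical toy data, not the paper's dialectological representations.

```python
import zlib


def compressed_size(data: bytes) -> int:
    """Length in bytes of the zlib-compressed form of data (approximates C)."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance between two byte strings.

    Assumed formula: (C(xy) - min(C(x), C(y))) / max(C(x), C(y)).
    Values near 0 suggest similar inputs; values near 1 suggest unrelated ones.
    """
    cx, cy = compressed_size(x), compressed_size(y)
    cxy = compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


# Hypothetical stand-ins for two dialect sites, encoded as repeated word lists.
site_a = "mlyako hlyab voda ".encode("utf-8") * 50
site_b = "reka planina more ".encode("utf-8") * 50

print("distance(a, a):", ncd(site_a, site_a))
print("distance(a, b):", ncd(site_a, site_b))
```

Because a real compressor only approximates Kolmogorov complexity, the distance of a string to itself is small but generally not exactly zero, which is one reason the choice of data representation matters.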

The data-driven Bulgarian WordNet: BTBWN

Author: Osenova Petya
Simov Kiril
Publication venue: Institute of Slavic Studies Polish Academy of Sciences
Publication date: 01/01/2018

The paper presents our work towards the simultaneous creation of a data-driven WordNet for Bulgarian and a manually annotated treebank with semantic information. Such an approach requires synchronization of the word senses in both the syntactic and the lexical resources, without limiting the WordNet senses to the corpus or vice versa. Our strategy focuses on the identification of senses used in BulTreeBank, but the missing senses of a lemma have also been covered through exploration of bigger corpora. The identified senses have been organized in synsets for the Bulgarian WordNet. Then they have been aligned to the Princeton WordNet synsets. Various types of mappings between the two resources are considered, both from a cross-lingual perspective and with respect to ensuring maximum connectivity and the potential for incorporating language-specific concepts. The mapping between the two WordNets (English and Bulgarian) is a basis for applications such as machine translation and multilingual information retrieval.

Crossref

Biblioteka Nauki - repozytorium artykułów

Directory of Open Access Journals

bgGLUE: A Bulgarian General Language Understanding Evaluation Benchmark

Author: Angelova Galia
Atanasova Pepa
Hardalov Momchil
Koychev Ivan
Mihaylov Todor
Nakov Preslav
Osenova Petya
Radev Dragomir
Simov Kiril
Stoyanov Ves
Publication date: 04/06/2023

We present bgGLUE (Bulgarian General Language Understanding Evaluation), a benchmark for evaluating language models on Natural Language Understanding (NLU) tasks in Bulgarian. Our benchmark includes NLU tasks targeting a variety of NLP problems (e.g., natural language inference, fact-checking, named entity recognition, sentiment analysis, and question answering) and machine learning tasks (sequence labeling, document-level classification, and regression). We run the first systematic evaluation of pre-trained language models for Bulgarian, comparing and contrasting results across the nine tasks in the benchmark. The evaluation results show strong performance on sequence labeling tasks, but there is a lot of room for improvement for tasks that require more complex reasoning. We make bgGLUE publicly available together with the fine-tuning and the evaluation code, as well as a public leaderboard at https://bgglue.github.io/, and we hope that it will enable further advancements in developing NLU models for Bulgarian.

Comment: Accepted to ACL 2023 (Main Conference)

arXiv.org e-Print Archive

Cross Disciplinary Overtures with Interview Data: Integrating Digital Practices and Tools in the Scholarly Workflow

Author: Beeken Jeannine
Calamai Silvia
Corti Louise
Draxler Christoph
Eskevich Maria
Karrouche Norah
Scagliola Stef
Simov Kiril
Truong Khiet P.
van den Heuvel Henk
van Hessen Arjan
Publication venue: Linkoping University Electronic Press
Publication date: 01/01/2019

There is much talk about the need for multidisciplinary approaches to research and the opportunities that have been created by digital technologies. A good example of this is the CLARIN Portal, which promotes and supports such research by offering a large suite of tools for working with textual and audio-visual data. Yet scholars who work with interview material are largely unaware of this resource and are still predominantly oriented towards familiar traditional research methods. To reach out to these scholars and assess the potential for integration of these new technologies, a multidisciplinary international community of experts set out to test CLARIN-type approaches and tools on different scholars by eliciting and documenting their feedback. This was done through a series of workshops held from 2016 to 2019, and funded by CLARIN and affiliated EU funding. This paper presents the goals, the tools that were tested and the evaluation of how they were experienced. It concludes by setting out envisioned pathways for a better use of the CLARIN family of approaches and tools in the area of qualitative and oral history data analysis.

University of Essex Research Repository

Crossref

VU Research Portal

Archivio della Ricerca - Università degli Studi di Siena

Radboud Repository

SMT and Hybrid systems of the QTLeap project in the WMT16 IT-task

Author: Agirre Eneko
Branco António
Gaudio Rosa
Gomes Luís
Labaka Gorka
Neale Steven
Oele Dieke
Osenova Petya
Popel Martin
Querido Andreia
Rendeiro Nuno
Rodrigues João
Silva João
Simov Kiril
van Noord Gertjan
Publication date: 01/01/2016

This paper describes 12 systems submitted to the WMT16 IT-task, covering six different languages, namely Basque, Bulgarian, Dutch, Czech, Portuguese and Spanish. All these systems were developed under the scope of the QTLeap project and follow a common strategy. For each language two different systems were submitted, namely a phrase-based MT system built using Moses, and a system exploiting deep language engineering approaches, which for all languages except Bulgarian was implemented using TectoMT. For 4 of the 6 languages, the TectoMT-based system performs better than the Moses-based one.

Proceedings - University of Groningen

University of Groningen

ARTS repository - University of Groningen

Biblio at Institute of Formal and Applied Linguistics

Dissertations of the University of Groningen

D7.4 Validation 4

Author: Armitt Gillian
Berlanga Adriana
Braidman Isobel
Dessus Philippe
Dupre Damien
Greller Wolfgang
Haley Debra
Hensgens Jan
Hoisl Bernhard
Koblische Robert
Loiseau Mathieu
Mauerhofer Christoph
Monachesi Paola
Osenova Petya
Posea Vlad
Rebedea Traian
Salem Hussein
Simov Kiril
Smithies Alisdair
Stoyanov Slavi
Trausan-Matu Stefan
Van Bruggen Jan
Westerhout Eline
Wild Fridolin
Publication date: 02/03/2011

Armitt, G., Stoyanov, S., Hensgens, J., Smithies, A., Braidman, I., Mauerhofer, C., Osenova, P., Simov, K., Berlanga, A. J., Van Bruggen, J., Greller, W., Rebedea, T., Posea, V., Trausan-Matu, S., Dupre, D., Salem, H., Dessus, P., Loiseau, M., Westerhout, E., Monachesi, P., Koblische, R., Hoisl, B., Haley, D., & Wild, F. (2011). D7.4 Validation 4. LTfLL-project.

This deliverable describes the objectives, approach, planning and results of the third pilot round, in which both individual and threaded services underwent validation. The two goals of this round were to provide input to the LTfLL exploitation plan and roadmap (deliverable 2.5). 531 participants (316 learners) took part in the pilots, which used LTfLL services based on five different languages. The average timespan of the pilots was three weeks and involved learners, tutors, teaching managers, the LTfLL team and Technology Enhanced Learning experts. The validation approach was based on Prototypical Validation Topics derived from the Round 2 validation topics, which refocused the validation topics on exploitation and allowed conclusions to be drawn across all services. Results demonstrated the areas of strength and weakness of each service, informing the selling points and barriers to adoption within the exploitation strategy, as well as suggesting possible further contexts of use. All services were noted to have high relevance in addressing burning issues for organizations, but further improvements to accuracy from a user viewpoint are required. Results on future enhancements to improve likelihood of adoption contribute to the roadmap. Results also provide an indication of each service's current readiness for adoption and provide insights into transferability issues. The overall conclusion is that some LTfLL services are more ready than others for adoption now, with some being currently more suited to sustainability in research settings.

The work on this publication has been sponsored by the LTfLL STREP that is funded by the European Commission's 7th Framework Programme. Contract 212578 [http://www.ltfll-project.org]

Open University of the Netherlands Research Portal

Multiword expressions: Insights from a multi-lingual perspective

Author: Barbu Mititelu Verginica
Bargmann Sascha
Dimitrova Tsvetana
El Marouf Ismail
Fotopoulou Aggeliki
Giouli Voula
Hanks Patrick
Koeva Svetla
Krstev Cvetana
Kuiper Koenraad
Kyriacopoulou Tita
Laporte Éric
Leseva Svetlozara
Markantonatou Stella
Martineau Claude
Nevado Llopis Almudena
Oakes Michael
Osenova Petya
Parra Escartín Carla
Sailer Manfred
Samaridi Niki
Simov Kiril
Sánchez Martínez Eoghan
Vitas Duško
Publication venue: Language Science Press
Publication date: 16/10/2017

Multiword expressions (MWEs) are a challenge for both natural language applications and linguistic theory because they often defy the application of the machinery developed for free combinations, where the default is that the meaning of an utterance can be predicted from its structure. There is a rich body of primarily descriptive work on MWEs for many European languages, but comparative work is scarce. The volume brings together MWE experts to explore the benefits of a multilingual perspective on MWEs. The ten contributions in this volume look at MWEs in Bulgarian, English, French, German, Maori, Modern Greek, Romanian, Serbian, and Spanish. They discuss prominent issues in MWE research such as classification of MWEs, their formal grammatical modeling, and the description of individual MWE types from the point of view of different theoretical frameworks, such as Dependency Grammar, Generative Grammar, Head-driven Phrase Structure Grammar, Lexical Functional Grammar, and Lexicon Grammar.

Language Science Press