63 research outputs found

Asistente de composición de música dodecafónica para OpenMusic [Twelve-tone music composition assistant for OpenMusic]

Author: Pla Sempere Leopoldo
Publication date: 18/07/2014

Serialism is a composition technique based on the manipulation of various musical elements on the basis of series of values. This project proposes building composition-assistance algorithms that automatically generate twelve-tone musical compositions from an initial series, or "seed", and a set of constraints supplied by the composer.

Repositorio Institucional de la Universidad de Alicante

Building Domain-specific Corpora from the Web: the Case of European Digital Service Infrastructures

Author: Esplà-Gomis Miquel
Garcia-Romero Cristian
Pla Sempere Leopoldo
Toral Antonio
van Noord Rik
Publication venue: European Language Resources Association (ELRA)
Publication date: 01/06/2022

ARTS repository - University of Groningen

Building Domain-specific Corpora from the Web: the Case of European Digital Service Infrastructures

Author: Esplà-Gomis Miquel
Garcia-Romero Cristian
Pla Sempere Leopoldo
Toral Antonio
van Noord Rik
Publication venue: European Language Resources Association (ELRA)
Publication date: 01/06/2022

An important goal of the MaCoCu project is to improve EU-specific NLP systems that concern the EU's Digital Service Infrastructures (DSIs). In this paper we aim to boost the creation of such domain-specific NLP systems. To do so, we explore the feasibility of building an automatic classifier that identifies which segments in a generic (potentially parallel) corpus are relevant for a particular DSI. We create an evaluation data set by crawling DSI-specific web domains and then compare different strategies for building our DSI classifier for text in three languages: English, Spanish and Dutch. We use pre-trained (multilingual) language models to perform the classification, with zero-shot classification for Spanish and Dutch. The results are promising: we are able to classify DSIs with between 70% and 80% accuracy, even without in-language training data. A manual annotation of the data revealed that we can also find DSI-specific data in crawled texts from general web domains with reasonable accuracy. We publicly release all data, predictions and code to allow future investigation into whether exploiting this DSI-specific data actually leads to improved performance on particular applications, such as machine translation.

Proceedings - University of Groningen

University of Groningen

ARTS repository - University of Groningen

Dissertations of the University of Groningen

MaCoCu: Massive collection and curation of monolingual and bilingual data: focus on under-resourced languages

Author: Bañón Marta
Chichirău Mălina
Esplà-Gomis Miquel
Forcada Mikel L.
Galiano Jiménez Aarón
Kuzman Taja
Ljubešić Nikola
Pla Sempere Leopoldo
Ramírez Sánchez Gema
Rupnik Peter
Suchomel Vít
Toral Antonio
van Noord Rik
Zaragoza Bernabeu Jaume
Publication venue: European Association for Machine Translation (EAMT)
Publication date: 01/06/2023

We present the most relevant results of the project MaCoCu: Massive collection and curation of monolingual and bilingual data: focus on under-resourced languages in its second year. Parallel and monolingual corpora have been produced for eleven low-resourced European languages by crawling large amounts of textual data from selected top-level domains of the Internet; both human and automatic evaluation show their usefulness. In addition, several large language models pretrained on MaCoCu data have been published, as well as the code used to collect and curate the data. This action has received funding from the European Union’s Connecting Europe Facility 2014-2020 - CEF Telecom, under Grant Agreement No. INEA/CEF/ICT/A2020/2278341.

Repositorio Institucional de la Universidad de Alicante

MaCoCu: Massive collection and curation of monolingual and bilingual data: focus on under-resourced languages

Author: Bañón Marta
Esplà-Gomis Miquel
Forcada Mikel L.
García-Romero Cristian
Kuzman Taja
Ljubešić Nikola
Ramírez-Sánchez Gema
Rupnik Peter
Pla Sempere Leopoldo
Suchomel Vít
Toral Antonio
van der Werff Tobias
van Noord Rik
Zaragoza Jaume
Publication venue: European Association for Machine Translation
Publication date: 01/01/2022

ARTS repository - University of Groningen

MaCoCu: Massive collection and curation of monolingual and bilingual data: focus on under-resourced languages

Author: Bañón Marta
Esplà-Gomis Miquel
Forcada Mikel L.
García-Romero Cristian
Kuzman Taja
Ljubešić Nikola
Ramírez-Sánchez Gema
Rupnik Peter
Pla Sempere Leopoldo
Suchomel Vít
Toral Antonio
van der Werff Tobias
van Noord Rik
Zaragoza Jaume
Publication venue: European Association for Machine Translation
Publication date: 01/01/2022

We introduce the project MaCoCu: Massive collection and curation of monolingual and bilingual data: focus on under-resourced languages, funded by the Connecting Europe Facility, which is aimed at building monolingual and parallel corpora for under-resourced European languages. The approach followed consists of crawling large amounts of textual data from selected top-level domains of the Internet, and then applying a curation and enrichment pipeline. In addition to corpora, the project will release the free/open-source web crawling and curation software used.

ARTS repository - University of Groningen