32 research outputs found

    Reconeixement Automàtic de Notació Musical

    We often encounter musical scores printed or handwritten on paper. To take advantage of current technology (distribution, playback, management) for such music, it must first be digitised. Score-editing tools can be used for this purpose, but it would be more convenient for the user if this task were performed automatically. This work proposes building a system that imports the image of a score and automatically detects and classifies the musical information it contains. The work covers the implementation of automatic recognition techniques, as well as the subsequent visualisation and digital encoding of the score.
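    A minimal sketch of one common first step in such a pipeline (not taken from this work): binarising the scanned score and isolating the staff lines with horizontal morphology in OpenCV, so that the remaining symbols can later be segmented and classified. The file path and parameters are illustrative.

        # Illustrative sketch: staff-line extraction as a preprocessing step for OMR.
        import cv2
        import numpy as np

        def extract_staff_lines(image_path: str) -> np.ndarray:
            """Return a binary mask containing (mostly) the staff lines."""
            gray = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
            if gray is None:
                raise FileNotFoundError(image_path)
            # Invert so ink is bright, then binarise adaptively (scans vary in lighting).
            binary = cv2.adaptiveThreshold(
                255 - gray, 255, cv2.ADAPTIVE_THRESH_MEAN_C, cv2.THRESH_BINARY, 15, -2
            )
            # A long, thin horizontal kernel keeps staff lines and drops note heads.
            kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (gray.shape[1] // 30, 1))
            return cv2.morphologyEx(binary, cv2.MORPH_OPEN, kernel)

        staff_mask = extract_staff_lines("score_page.png")  # hypothetical input image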

    OpusCleaner and OpusTrainer, open source toolkits for training Machine Translation and Large language models

    Developing high-quality machine translation systems is a labour-intensive, challenging and confusing process for newcomers to the field. We present a pair of tools, OpusCleaner and OpusTrainer, that aim to simplify the process, reduce the amount of work and lower the entry barrier for newcomers. OpusCleaner is a data downloading, cleaning, and preprocessing toolkit. It is designed to allow researchers to quickly download, visualise and preprocess bilingual (or monolingual) data that comes from many different sources, each with different quality, issues, and unique filtering/preprocessing requirements. OpusTrainer is a data scheduling and data augmenting tool aimed at building large-scale, robust machine translation systems and large language models. It features deterministic data mixing from many different sources, on-the-fly data augmentation and more. Using these tools, we showcase how to create high-quality machine translation models that are robust to noisy user input, as well as multilingual models and terminology-aware models.
    Comment: Code on GitHub: https://github.com/hplt-project/OpusCleaner and https://github.com/hplt-project/OpusTrainer
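    A minimal sketch of the two OpusTrainer ideas mentioned above, deterministic data mixing and on-the-fly augmentation, written as plain Python for illustration only; it does not use the actual OpusTrainer configuration format or API, and all corpus names, ratios and the noise rule are made up.

        import random

        def mix_and_augment(sources, weights, n_lines, seed=1):
            """Deterministically interleave sentence pairs from several corpora and
            apply simple on-the-fly noise to the source side (illustrative only)."""
            rng = random.Random(seed)          # fixed seed -> reproducible schedule
            names = list(sources)
            for _ in range(n_lines):
                name = rng.choices(names, weights=weights, k=1)[0]
                src, tgt = rng.choice(sources[name])
                if rng.random() < 0.1:         # 10% of lines: upper-case noise, so the
                    src = src.upper()          # model also sees "shouty" user input
                yield src, tgt

        corpora = {                            # hypothetical toy corpora
            "clean": [("hello world", "hallo wereld")],
            "crawled": [("good morning", "goedemorgen")],
        }
        for pair in mix_and_augment(corpora, weights=[0.8, 0.2], n_lines=5):
            print(pair)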

    MaCoCu: Massive collection and curation of monolingual and bilingual data: focus on under-resourced languages

    We present the most relevant results from the second year of the project MaCoCu: Massive collection and curation of monolingual and bilingual data: focus on under-resourced languages. Parallel and monolingual corpora have been produced for eleven low-resourced European languages by crawling large amounts of textual data from selected top-level domains of the Internet; both human and automatic evaluation show their usefulness. In addition, several large language models pretrained on MaCoCu data have been published, as well as the code used to collect and curate the data.
    This action has received funding from the European Union's Connecting Europe Facility 2014-2020 - CEF Telecom, under Grant Agreement No. INEA/CEF/ICT/A2020/2278341.

    Generación automática de paráfrasis basada en redes neuronales

    A paraphrase is a restatement of the meaning of a text or passage using other words. Paraphrasing has many applications, such as rewording texts while writing, giving alternative translations for a target sentence, identifying similar sentences, obtaining synonyms, or expanding search queries to match additional information. In order to support all those applications, the aim of the project is to build a system that can provide paraphrases of a given phrase. To build that system, we will explore different state-of-the-art techniques based on neural networks and, more specifically, inspired by recent work in neural machine translation. First, we will perform an unsupervised task that focuses on the generation of sentence embeddings (vectors of real numbers) representing semantic information in a continuous space. To generate sentence embeddings we will use large corpora, with millions of sentences from publicly available books and from subtitles of TV series, films and documentaries. The embeddings will then be tested in terms of semantic relatedness (the degree of similarity between two sentences) and paraphrase identification (whether two sentences are paraphrases). Finally, we will build a paraphrase generation model that uses these embeddings to improve its performance.
    Zaragoza Bernabeu, J. (2019). Neural Paraphrasing Generation System. http://hdl.handle.net/10251/130303
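    A hedged sketch of the two evaluation tasks described above: scoring semantic relatedness as the cosine similarity of sentence embeddings, and thresholding that score for paraphrase identification. It uses an off-the-shelf encoder from the sentence-transformers library rather than the embeddings trained in this work; the model name and threshold are illustrative.

        import numpy as np
        from sentence_transformers import SentenceTransformer  # pip install sentence-transformers

        model = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative off-the-shelf encoder

        def relatedness(a: str, b: str) -> float:
            """Semantic relatedness as cosine similarity of the two sentence embeddings."""
            va, vb = model.encode([a, b])
            return float(np.dot(va, vb) / (np.linalg.norm(va) * np.linalg.norm(vb)))

        def is_paraphrase(a: str, b: str, threshold: float = 0.8) -> bool:
            """Paraphrase identification by thresholding the relatedness score."""
            return relatedness(a, b) >= threshold

        print(is_paraphrase("He bought a car.", "He purchased an automobile."))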

    Ukrainian web corpus MaCoCu-uk 1.0

    The Ukrainian web corpus MaCoCu-uk 1.0 was built by crawling the ".ua" and ".укр" internet top-level domains in 2022, extending the crawl dynamically to other domains as well. The crawler is available at https://github.com/macocu/MaCoCu-crawler. Considerable effort was devoted to cleaning the extracted text to provide a high-quality web corpus. This was achieved by removing boilerplate (https://corpus.tools/wiki/Justext) and near-duplicated paragraphs (https://corpus.tools/wiki/Onion), and by discarding very short texts as well as texts that are not in the target language. The dataset is characterized by extensive metadata which allows filtering the dataset based on text quality and other criteria (https://github.com/bitextor/monotextor), making the corpus highly useful for corpus linguistics studies, as well as for training language models and other language technologies. In XML format, each document is accompanied by the following metadata: title, crawl date, URL, domain, file type of the original document, distribution of languages inside the document, and a fluency score based on a language model. The text of each document is divided into paragraphs, each accompanied by metadata indicating whether the paragraph is a heading, its quality (labels, such as “short” or “good”, assigned based on paragraph length, URL and stopword density via the jusText tool - https://corpus.tools/wiki/Justext), its fluency (a score between 0 and 1, assigned with the Monocleaner tool - https://github.com/bitextor/monocleaner), the automatically identified language of the text in the paragraph, and whether the paragraph contains sensitive information (identified via the Biroamer tool - https://github.com/bitextor/biroamer). The corpus can be easily read with the prevert parser (https://pypi.org/project/prevert/).
    Notice and take down: Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please: (1) clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted; (2) clearly identify the copyrighted work claimed to be infringed; (3) clearly identify the material that is claimed to be infringing and provide information reasonably sufficient to allow us to locate the material; (4) write to the contact person for this resource, whose email is available in the full item record. We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
    This action has received funding from the European Union's Connecting Europe Facility 2014-2020 - CEF Telecom, under Grant Agreement No. INEA/CEF/ICT/A2020/2278341. This communication reflects only the author’s view. The Agency is not responsible for any use that may be made of the information it contains.
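    A hedged sketch of the kind of quality filtering this metadata enables: keeping only paragraphs labelled “good” whose fluency score clears a threshold. The element and attribute names ("p", "quality", "fluency") and the file name are assumptions for illustration only; the exact schema is documented with the corpus and can also be read with the prevert parser.

        # Illustrative sketch; tag/attribute names and file name are assumed, not the official schema.
        import xml.etree.ElementTree as ET

        def good_paragraphs(path, min_fluency=0.9):
            """Yield paragraph texts labelled 'good' with a fluency score above the threshold."""
            for _, elem in ET.iterparse(path, events=("end",)):
                if elem.tag == "p":
                    if elem.get("quality") == "good" and float(elem.get("fluency") or 0) >= min_fluency:
                        yield (elem.text or "").strip()
                    elem.clear()  # drop processed paragraphs to keep memory bounded on a large corpus

        for text in good_paragraphs("MaCoCu-uk-1.0.xml"):  # hypothetical file name
            print(text)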

    Macedonian web corpus MaCoCu-mk 2.0

    The Macedonian web corpus MaCoCu-mk 2.0 was built by crawling the ".mk" and ".мкд" internet top-level domains in 2021, extending the crawl dynamically to other domains as well. The crawler is available at https://github.com/macocu/MaCoCu-crawler. Considerable effort was devoted to cleaning the extracted text to provide a high-quality web corpus. This was achieved by removing boilerplate (https://corpus.tools/wiki/Justext) and near-duplicated paragraphs (https://corpus.tools/wiki/Onion), and by discarding very short texts as well as texts that are not in the target language. The dataset is characterized by extensive metadata which allows filtering the dataset based on text quality and other criteria (https://github.com/bitextor/monotextor), making the corpus highly useful for corpus linguistics studies, as well as for training language models and other language technologies. In XML format, each document is accompanied by the following metadata: title, crawl date, URL, domain, file type of the original document, distribution of languages inside the document, and a fluency score based on a language model. The text of each document is divided into paragraphs, each accompanied by metadata indicating whether the paragraph is a heading, its quality (labels, such as “short” or “good”, assigned based on paragraph length, URL and stopword density via the jusText tool - https://corpus.tools/wiki/Justext), its fluency (a score between 0 and 1, assigned with the Monocleaner tool - https://github.com/bitextor/monocleaner), the automatically identified language of the text in the paragraph, and whether the paragraph contains sensitive information (identified via the Biroamer tool - https://github.com/bitextor/biroamer). As opposed to the previous version, this version has more accurate metadata on the languages of the texts, achieved by using Google's Compact Language Detector 2 (CLD2) (https://github.com/CLD2Owners/cld2), a high-performance language detector supporting many languages. Other tools used for web corpora creation and curation have been updated as well, resulting in an even cleaner, as well as larger, corpus. The corpus can be easily read with the prevert parser (https://pypi.org/project/prevert/).
    Notice and take down: Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please: (1) clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted; (2) clearly identify the copyrighted work claimed to be infringed; (3) clearly identify the material that is claimed to be infringing and provide information reasonably sufficient to allow us to locate the material; (4) write to the contact person for this resource, whose email is available in the full item record. We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
    This action has received funding from the European Union's Connecting Europe Facility 2014-2020 - CEF Telecom, under Grant Agreement No. INEA/CEF/ICT/A2020/2278341. This communication reflects only the author’s view. The Agency is not responsible for any use that may be made of the information it contains.

    Bulgarian web corpus MaCoCu-bg 2.0

    The Bulgarian web corpus MaCoCu-bg 2.0 was built by crawling the ".bg" and ".бг" internet top-level domains in 2021, extending the crawl dynamically to other domains as well. The crawler is available at https://github.com/macocu/MaCoCu-crawler. Considerable effort was devoted to cleaning the extracted text to provide a high-quality web corpus. This was achieved by removing boilerplate (https://corpus.tools/wiki/Justext) and near-duplicated paragraphs (https://corpus.tools/wiki/Onion), and by discarding very short texts as well as texts that are not in the target language. The dataset is characterized by extensive metadata which allows filtering the dataset based on text quality and other criteria (https://github.com/bitextor/monotextor), making the corpus highly useful for corpus linguistics studies, as well as for training language models and other language technologies. In XML format, each document is accompanied by the following metadata: title, crawl date, URL, domain, file type of the original document, distribution of languages inside the document, and a fluency score based on a language model. The text of each document is divided into paragraphs, each accompanied by metadata indicating whether the paragraph is a heading, its quality (labels, such as “short” or “good”, assigned based on paragraph length, URL and stopword density via the jusText tool - https://corpus.tools/wiki/Justext), its fluency (a score between 0 and 1, assigned with the Monocleaner tool - https://github.com/bitextor/monocleaner), the automatically identified language of the text in the paragraph, and whether the paragraph contains sensitive information (identified via the Biroamer tool - https://github.com/bitextor/biroamer). As opposed to the previous version, this version has more accurate metadata on the languages of the texts, achieved by using Google's Compact Language Detector 2 (CLD2) (https://github.com/CLD2Owners/cld2), a high-performance language detector supporting many languages. Other tools used for web corpora creation and curation have been updated as well, resulting in an even cleaner corpus. The corpus can be easily read with the prevert parser (https://pypi.org/project/prevert/).
    Notice and take down: Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please: (1) clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted; (2) clearly identify the copyrighted work claimed to be infringed; (3) clearly identify the material that is claimed to be infringing and provide information reasonably sufficient to allow us to locate the material; (4) write to the contact person for this resource, whose email is available in the full item record. We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
    This action has received funding from the European Union's Connecting Europe Facility 2014-2020 - CEF Telecom, under Grant Agreement No. INEA/CEF/ICT/A2020/2278341. This communication reflects only the author’s view. The Agency is not responsible for any use that may be made of the information it contains.

    Serbian web corpus MaCoCu-sr 1.0

    The Serbian web corpus MaCoCu-sr 1.0 was built by crawling the ".rs" and ".срб" internet top-level domains in 2021 and 2022, extending the crawl dynamically to other domains as well. The crawler is available at https://github.com/macocu/MaCoCu-crawler. Considerable effort was devoted to cleaning the extracted text to provide a high-quality web corpus. This was achieved by removing boilerplate (https://corpus.tools/wiki/Justext) and near-duplicated paragraphs (https://corpus.tools/wiki/Onion), and by discarding very short texts as well as texts that are not in the target language. The dataset is characterized by extensive metadata which allows filtering the dataset based on text quality and other criteria (https://github.com/bitextor/monotextor), making the corpus highly useful for corpus linguistics studies, as well as for training language models and other language technologies. In XML format, each document is accompanied by the following metadata: title, crawl date, URL, domain, file type of the original document, distribution of languages inside the document, and a fluency score based on a language model. The text of each document is divided into paragraphs, each accompanied by metadata indicating whether the paragraph is a heading, its quality (labels, such as “short” or “good”, assigned based on paragraph length, URL and stopword density via the jusText tool - https://corpus.tools/wiki/Justext), its fluency (a score between 0 and 1, assigned with the Monocleaner tool - https://github.com/bitextor/monocleaner), the automatically identified language of the text in the paragraph, and whether the paragraph contains sensitive information (identified via the Biroamer tool - https://github.com/bitextor/biroamer). The corpus can be easily read with the prevert parser (https://pypi.org/project/prevert/).
    Notice and take down: Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please: (1) clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted; (2) clearly identify the copyrighted work claimed to be infringed; (3) clearly identify the material that is claimed to be infringing and provide information reasonably sufficient to allow us to locate the material; (4) write to the contact person for this resource, whose email is available in the full item record. We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
    This action has received funding from the European Union's Connecting Europe Facility 2014-2020 - CEF Telecom, under Grant Agreement No. INEA/CEF/ICT/A2020/2278341. This communication reflects only the author’s view. The Agency is not responsible for any use that may be made of the information it contains.

    Croatian-English parallel corpus MaCoCu-hr-en 2.0

    The Croatian-English parallel corpus MaCoCu-hr-en 2.0 was built by crawling the “.hr” internet top-level domain in 2021 and 2022, extending the crawl dynamically to other domains as well. The whole crawling process was carried out by the MaCoCu crawler (https://github.com/macocu/MaCoCu-crawler). Websites containing documents in both target languages were identified and processed using the tool Bitextor (https://github.com/bitextor/bitextor). Considerable effort was devoted to cleaning the extracted text to provide a high-quality parallel corpus. This was achieved by removing boilerplate and near-duplicated paragraphs, as well as documents that are not in one of the targeted languages. Document and segment alignment as implemented in Bitextor were carried out, and Bifixer (https://github.com/bitextor/bifixer) and Bicleaner AI (https://github.com/bitextor/bicleaner-ai) were used for fixing, cleaning, and deduplicating the final version of the corpus.
    The corpus is available in three formats: two sentence-level formats, TXT and TMX, and a document-level TXT format. TMX is an XML-based format and TXT is a tab-separated format. Both consist of pairs of source and target segments (one or several sentences) and additional metadata. The following metadata is included in both sentence-level formats:
    - source and target document URL;
    - paragraph ID, which includes information on the position of the sentence in the paragraph and in the document (e.g., “p35:77s1/3”, meaning “paragraph 35 out of 77, sentence 1 out of 3”);
    - quality score as provided by the tool Bicleaner AI (the likelihood of a pair of sentences being mutual translations, given as a score between 0 and 1);
    - similarity score as provided by the sentence alignment tool Bleualign (value between 0 and 1);
    - personal information identification (“biroamer-entities-detected”): segments containing personal information are flagged, so final users of the corpus can decide whether to use these segments;
    - translation direction and machine translation identification (“translation-direction”): the source segment in each segment pair was identified using a probabilistic model (https://github.com/RikVN/TranslationDirection), which also determines whether the translation was produced by a machine-translation system;
    - a DSI class (“dsi”): information on whether the segment is connected to any of the Digital Service Infrastructure (DSI) classes (e.g., cybersecurity, e-health, e-justice, open-data-portal) defined by the Connecting Europe Facility (https://github.com/RikVN/DSI);
    - English language variant: the language variant of English (British or American, using a lexicon-based English variety classifier - https://pypi.org/project/abclf/), identified at document and domain level.
    Furthermore, the sentence-level TXT format provides additional metadata:
    - web domain of the text;
    - source and target document title;
    - the date when the original file was retrieved;
    - the original type of the file (e.g., “html”) from which the sentence was extracted;
    - paragraph quality (labels, such as “short” or “good”, assigned based on paragraph length, URL and stopword density via the jusText tool - https://corpus.tools/wiki/Justext);
    - information on whether the sentence is a heading in the original document.
    The document-level TXT format provides pairs of documents identified to contain parallel data. In addition to the parallel documents (in base64 format), the corpus includes the following metadata: source and target document URL, a DSI category, and the English language variant (British or American). As opposed to the previous version, this version has more accurate metadata on the languages of the texts, achieved by using Google's Compact Language Detector 2 (CLD2) (https://github.com/CLD2Owners/cld2), a high-performance language detector supporting many languages. Other tools used for web corpora creation and curation have been updated as well, resulting in an even cleaner corpus. The new version also provides additional metadata, such as the position of the sentence in the paragraph and document, and information on whether the sentence is related to a DSI. Moreover, the corpus is now also provided in a document-level format.
    Notice and take down: Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please: (1) clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted; (2) clearly identify the copyrighted work claimed to be infringed; (3) clearly identify the material that is claimed to be infringing and provide information reasonably sufficient to allow us to locate the material; (4) write to the contact person for this resource, whose email is available in the full item record. We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
    This action has received funding from the European Union's Connecting Europe Facility 2014-2020 - CEF Telecom, under Grant Agreement No. INEA/CEF/ICT/A2020/2278341. This communication reflects only the author’s view. The Agency is not responsible for any use that may be made of the information it contains.
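    A hedged sketch of filtering the TMX release by the Bicleaner AI quality score described above. The tu/tuv/seg/prop elements are standard TMX; the prop type name used for the score ("score-bicleaner-ai") and the file name are assumptions for illustration, so check the release documentation for the exact field names.

        # Illustrative sketch; the score prop name and file name are assumed.
        import xml.etree.ElementTree as ET

        def high_quality_pairs(tmx_path, min_score=0.7):
            """Yield (source, target, score) for translation units above the quality threshold."""
            for _, tu in ET.iterparse(tmx_path, events=("end",)):
                if tu.tag != "tu":
                    continue
                props = {p.get("type"): p.text for p in tu.findall("prop")}
                score = float(props.get("score-bicleaner-ai") or 0)   # assumed prop name
                segs = [tuv.findtext("seg") for tuv in tu.findall("tuv")]
                if len(segs) == 2 and all(segs) and score >= min_score:
                    yield segs[0], segs[1], score
                tu.clear()  # free memory as we stream through the file

        for hr, en, score in high_quality_pairs("MaCoCu-hr-en-2.0.tmx"):  # hypothetical file name
            print(f"{score:.2f}\t{hr}\t{en}")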

    Turkish-English parallel corpus MaCoCu-tr-en 2.0

    The Turkish-English parallel corpus MaCoCu-tr-en 2.0 was built by crawling the “.tr” and “.cy” internet top-level domains in 2021, extending the crawl dynamically to other domains as well. The whole crawling process was carried out by the MaCoCu crawler (https://github.com/macocu/MaCoCu-crawler). Websites containing documents in both target languages were identified and processed using the tool Bitextor (https://github.com/bitextor/bitextor). Considerable effort was devoted to cleaning the extracted text to provide a high-quality parallel corpus. This was achieved by removing boilerplate and near-duplicated paragraphs, as well as documents that are not in one of the targeted languages. Document and segment alignment as implemented in Bitextor were carried out, and Bifixer (https://github.com/bitextor/bifixer) and Bicleaner AI (https://github.com/bitextor/bicleaner-ai) were used for fixing, cleaning, and deduplicating the final version of the corpus.
    The corpus is available in three formats: two sentence-level formats, TXT and TMX, and a document-level TXT format. TMX is an XML-based format and TXT is a tab-separated format. Both consist of pairs of source and target segments (one or several sentences) and additional metadata. The following metadata is included in both sentence-level formats:
    - source and target document URL;
    - paragraph ID, which includes information on the position of the sentence in the paragraph and in the document (e.g., “p35:77s1/3”, meaning “paragraph 35 out of 77, sentence 1 out of 3”);
    - quality score as provided by the tool Bicleaner AI (the likelihood of a pair of sentences being mutual translations, given as a score between 0 and 1);
    - similarity score as provided by the sentence alignment tool Bleualign (value between 0 and 1);
    - personal information identification (“biroamer-entities-detected”): segments containing personal information are flagged, so final users of the corpus can decide whether to use these segments;
    - translation direction and machine translation identification (“translation-direction”): the source segment in each segment pair was identified using a probabilistic model (https://github.com/RikVN/TranslationDirection), which also determines whether the translation was produced by a machine-translation system;
    - a DSI class (“dsi”): information on whether the segment is connected to any of the Digital Service Infrastructure (DSI) classes (e.g., cybersecurity, e-health, e-justice, open-data-portal) defined by the Connecting Europe Facility (https://github.com/RikVN/DSI);
    - English language variant: the language variant of English (British or American, using a lexicon-based English variety classifier - https://pypi.org/project/abclf/), identified at document and domain level.
    Furthermore, the sentence-level TXT format provides additional metadata:
    - web domain of the text;
    - source and target document title;
    - the date when the original file was retrieved;
    - the original type of the file (e.g., “html”) from which the sentence was extracted;
    - paragraph quality (labels, such as “short” or “good”, assigned based on paragraph length, URL and stopword density via the jusText tool - https://corpus.tools/wiki/Justext);
    - information on whether the sentence is a heading in the original document.
    The document-level TXT format provides pairs of documents identified to contain parallel data. In addition to the parallel documents (in base64 format), the corpus includes the following metadata: source and target document URL, a DSI category, and the English language variant (British or American). As opposed to the previous version, this version has more accurate metadata on the languages of the texts, achieved by using Google's Compact Language Detector 2 (CLD2) (https://github.com/CLD2Owners/cld2), a high-performance language detector supporting many languages. Other tools used for web corpora creation and curation have been updated as well, resulting in an even cleaner corpus. The new version also provides additional metadata, such as the position of the sentence in the paragraph and document, and information on whether the sentence is related to a DSI. Moreover, the corpus is now also provided in a document-level format.
    Notice and take down: Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please: (1) clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted; (2) clearly identify the copyrighted work claimed to be infringed; (3) clearly identify the material that is claimed to be infringing and provide information reasonably sufficient to allow us to locate the material; (4) write to the contact person for this resource, whose email is available in the full item record. We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
    This action has received funding from the European Union's Connecting Europe Facility 2014-2020 - CEF Telecom, under Grant Agreement No. INEA/CEF/ICT/A2020/2278341. This communication reflects only the author’s view. The Agency is not responsible for any use that may be made of the information it contains.