Search CORE

68 research outputs found

HindEnCorp – Hindi-English and Hindi-only Corpus for Machine Translation

Author: Bojar Ondřej
Diatka Vojtěch
Rychlý Pavel
Straňák Pavel
Suchomel Vít
Tamchyna Aleš
Zeman Daniel
Publication venue
Publication date: 01/01/2014
Field of study

We present HindEnCorp, a parallel corpus of Hindi and English, and HindMonoCorp, a monolingual corpus of Hindi in their release version 0.5. Both corpora were collected from web sources and preprocessed primarily for the training of statistical machine translation systems. HindEnCorp consists of 274k parallel sentences (3.9 million Hindi and 3.8 million English tokens). HindMonoCorp amounts to 787 million tokens in 44 million sentences. Both the corpora are freely available for non-commercial research and their preliminary release has been used by numerous participants of the WMT 2014 shared translation task

Biblio at Institute of Formal and Applied Linguistics

MaCoCu: Massive collection and curation of monolingual and bilingual data: focus on under-resourced languages

Author: Bañón Marta
Chichirău Mălina
Esplà-Gomis Miquel
Forcada Mikel L.
Galiano Jiménez Aarón
Kuzman Taja
Ljubešić Nikola
Pla Sempere Leopoldo
Ramírez Sánchez Gema
Rupnik Peter
Suchomel Vít
Toral Antonio
van Noord Rik
Zaragoza Bernabeu Jaume
Publication venue: European Association for Machine Translation (EAMT)
Publication date: 01/06/2023
Field of study

We present the most relevant results of the project MaCoCu: Massive collection and curation of monolingual and bilingual data: focus on under-resourced languages in its second year. Parallel and monolingual corpora have been produced for eleven low-resourced European languages by crawling large amounts of textual data from selected top-level domains of the Internet; both human and automatic evaluation show its usefulness. In addition, several large language models pretrained on MaCoCu data have been published, as well as the code used to collect and curate the data.This action has received funding from the European Union’s Connecting Europe Facility 2014-2020 - CEF Telecom, under Grant Agreement No. INEA/CEF/ICT/A2020/2278341

Repositorio Institucional de la Universidad de Alicante

MaCoCu:Massive collection and curation of monolingual and bilingual data: focus on under-resourced languages

Author: Bañón Marta
Esplà-Gomis Miquel
Forcada Mikel L.
García-Romero Cristian
Kuzman Taja
Ljubešić Nikola
Ramírez-Sánchez Gema
Rupnik Peter
Sempere Leopoldo Pla
Suchomel Vít
Toral Antonio
van der Werff Tobias
van Noord Rik
Zaragoza Jaume
Publication venue: European Association for Machine Translation
Publication date: 01/01/2022
Field of study

ARTS repository - University of Groningen

MaCoCu:Massive collection and curation of monolingual and bilingual data: focus on under-resourced languages

Author: Bañón Marta
Esplà-Gomis Miquel
Forcada Mikel L.
García-Romero Cristian
Kuzman Taja
Ljubešić Nikola
Ramírez-Sánchez Gema
Rupnik Peter
Sempere Leopoldo Pla
Suchomel Vít
Toral Antonio
van der Werff Tobias
van Noord Rik
Zaragoza Jaume
Publication venue: European Association for Machine Translation
Publication date: 01/01/2022
Field of study

We introduce the project MaCoCu: Massive collection and curation of monolingual and bilingual data: focus on under-resourced languages, funded by the Connecting Europe Facility, which is aimed at building monolingual and parallel corpora for under-resourced European languages. The approach followed consists of crawling large amounts of textual data from selected top-level domains of the Internet, and then applying a curation and enrichment pipeline. In addition to corpora, the project will release the free/open-source web crawling and curation software used.</p

ARTS repository - University of Groningen

On-line diagnostic of drives of machine tools that produce gearbox parts for Škoda cars.

Author: Suchomel Vít
Publication venue
Publication date: 20/01/2019
Field of study

DSpace@TUL

Czech Web Corpus 2017 (csTenTen17)

Author: Suchomel Vít
Publication venue: Lexical Computing CZ s.r.o.
Publication date: 07/12/2018
Field of study

The Czech Web Corpus 2017 (csTenTen17) is a Czech corpus made up of texts collected from the Internet, mostly from the Czech national top level domain ".cz". The data was crawled by web crawler SpiderLing (https://corpus.tools/wiki/SpiderLing). The data was cleaned by removing boilerplate (using https://corpus.tools/wiki/Justext), removing near-duplicate paragraphs (by https://corpus.tools/wiki/Onion) and discarding paragraphs not in the target language. The corpus was POS annotated by morphological analyser Majka using this POS tagset: https://www.sketchengine.eu/tagset-reference-for-czech/. Text sources: General web, Wikipedia. Time span of crawling: May, October and November 2017, October and November 2016, October and November 2015. The Czech Wikipedia part was downloaded in November 2017. Data format: Plain text, vertical (one token per line), gzip compressed. There are the following structures in the vertical: Documents (, usually corresponding to web pages), paragraphs (), sentences () and word join markers (, a "glue" tag indicating that there was no space between the surrounding tokens in the original text). Document metadata: src (the source of the data), title (the title of the web page), url (the URL of the document), crawl_date (the date of downloading the document). Paragraph metadata: heading ("1" if the paragraph is a heading, usually to elements in the original HTML data). Block elements in the case of an HTML source or double blank lines in the case of other source formats were used as paragraph separators. An internal heuristic tool was used to mark sentence breaks. The tab-separated positional attributes are: word form, morphological annotation, lem-POS (the base form of the word, i.e. the lemma, with a part of speech suffix) and gender respecting lemma (nouns and adjectives only). Please cite the following paper when using the corpus for your research: Suchomel, Vít. csTenTen17, a Recent Czech Web Corpus. In Recent Advances in Slavonic Natural Language Processing, pp. 111–123. 2018. (https://nlp.fi.muni.cz/raslan/raslan18.pdf#page=119

LINDAT/CLARIN digital library at the Institute of Formal and Applied Linguistics (ÚFAL), Faculty of Mathematics and Physics, Charles University

Model of a diesel-generator

Author: Suchomel Vít
Publication venue: Vysoké učení technické v Brně. Fakulta elektrotechniky a komunikačních technologií
Publication date: 01/01/2017
Field of study

This work deals with principles and methods of drive-train regulation. Furthermore it deals with simulation of drive train via Simulink in a program Matlab. The work is done in cooperation with Siemens Drasov. A created model will be used as an educational model for employees of company. The model will allow better and more graphic education. In the beginning part is drive-train generally descripted. Next part is especially about a synchronous generator. This work also explains a working principle of generator and excitation. There is a cooperation of generator with electrical network described and transformation of synchronous generator into a model of general electric machine showed, substitute scheme is in d, q coordinates deduced in the next part. There are simulations of different states of generator solved in a practical part of the work. At first there is simulation of generator’s load change by using the created model done and further there are simulations of short circuits in stator winding realized. Finally there are simulations of connecting the synchronous generator into infinite bus for ideal and unfavorable conditions done. All simulation models include lots of oscilloscopes, which allow time courses to display, currents, shaft’s revolutions, torques and so on

Digital library of Brno University of Technology

National Repository of Grey Literature