41 research outputs found

    OpusFilter : A Configurable Parallel Corpus Filtering Toolbox

    This paper introduces OpusFilter, a flexible and modular toolbox for filtering parallel corpora. It implements a number of components based on heuristic filters, language identification libraries, character-based language models, and word alignment tools, and it can easily be extended with custom filters. Bitext segments can be ranked according to their quality or domain match using single features or a logistic regression model that can be trained without manually labeled training data. We demonstrate the effectiveness of OpusFilter on a Finnish-English news translation task based on noisy web-crawled training data. Applying our tool improves translation quality while significantly reducing the size of the training data, and it clearly outperforms an alternative ranking given in the crawled data set. Furthermore, we show the ability of OpusFilter to perform data selection for domain adaptation. Peer reviewed.
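As a rough illustration of what a heuristic filter over bitext segments can look like, the following sketch keeps a sentence pair only if both sides are non-empty, neither side is overly long, and the token-length ratio stays below a threshold. This is not OpusFilter's API: the function names `length_ratio`, `keep_pair` and `filter_bitext`, and the thresholds `max_ratio` and `max_len`, are hypothetical and chosen only for this example.

```python
def length_ratio(src: str, tgt: str) -> float:
    """Ratio of the longer to the shorter segment, counted in whitespace tokens."""
    n_src, n_tgt = len(src.split()), len(tgt.split())
    if min(n_src, n_tgt) == 0:
        return float("inf")  # an empty side can never be a good match
    return max(n_src, n_tgt) / min(n_src, n_tgt)


def keep_pair(src: str, tgt: str, max_ratio: float = 3.0, max_len: int = 100) -> bool:
    """Hypothetical heuristic filter: reject empty, overlong, or badly unbalanced pairs."""
    n_src, n_tgt = len(src.split()), len(tgt.split())
    if n_src == 0 or n_tgt == 0:
        return False
    if max(n_src, n_tgt) > max_len:
        return False
    return length_ratio(src, tgt) <= max_ratio


def filter_bitext(pairs, **thresholds):
    """Return only the (source, target) pairs that pass keep_pair."""
    return [(s, t) for s, t in pairs if keep_pair(s, t, **thresholds)]


if __name__ == "__main__":
    noisy = [
        ("Hyvää huomenta", "Good morning"),
        ("Kyllä", "Click here to read the full terms and conditions"),
        ("", "Untranslated line"),
    ]
    for src, tgt in filter_bitext(noisy):
        print(src, "|||", tgt)
```

A real pipeline would combine several such signals (language identification, language-model scores, alignment scores) rather than rely on length alone.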

    OpusTools and Parallel Corpus Diagnostics

    The 12th edition of the Language Resources and Evaluation Conference was cancelled due to the Covid-19 pandemic. This paper introduces OpusTools, a package for downloading and processing parallel corpora included in the OPUS corpus collection. The package implements tools for accessing compressed data in their archived release format and makes it possible to easily convert between common formats. OpusTools also includes tools for language identification and data filtering as well as tools for importing data from various sources into the OPUS format. We show the use of these tools in parallel corpus creation and data diagnostics. The latter is especially useful for identifying potential problems and errors in the extensive data set. Using these tools, we can now monitor the validity of data sets and improve the overall quality and consistency of the data collection. Peer reviewed.

    Vasikkasaari - The Common island of Helsinki

    Vasikkasaari is an island off Kruunuvuorenranta in Helsinki, 3.5 kilometres from Kauppatori (the Market Square). From the 1940s to the 1970s, Vasikkasaari served as a lively, communal summer retreat, particularly for the working-class population of Kallio. Today a smaller but still active community of cottage owners remains on the island. A detailed plan was drawn up for Vasikkasaari in the early 2000s, but nothing has been built on the main island since then. Helsinki has become more active in developing its maritime areas; in the 2010s several islands were opened to the public. Vasikkasaari leads a quiet life, but development of the island is likely in the future. This master's thesis examines the possibilities for developing Vasikkasaari from the perspectives of topical global issues, urbanisation, sustainability and community. The thesis is divided into four parts. The first part, "1. What now? - Finland-Helsinki-Sea", deals with the phenomenon of urbanisation and with the character and local significance of Helsinki's archipelago and shores. Part "2. How now? - The city dwellers' archipelago" considers the significance of community in the Helsinki archipelago and introduces urban commons activity, how well it works, and its possibilities in an island environment. The third part, "3. Where to now? – analysis", builds a picture of Vasikkasaari as a communal development site in Helsinki by examining the island's identity, environment and current plans. The final part, "4. Here now. Applications", offers strategic ideas for the socially and environmentally sustainable development of Vasikkasaari. It outlines a vision of Vasikkasaari's future based on amending the current detailed plan and on various measures suited to the island and their phasing. The thesis was carried out as an independent project. The inspiration for the work, however, came from the author's earlier work on the City of Helsinki's Maritime Master Plan. The topic and views of the thesis have also been influenced by Vasikkasaari's topical position in the development of the archipelago

    Open Translation Models, Tools and Services

    Publisher Copyright: © 2023, The Author(s). The ambition of the Open Translation Models, Tools and Services (OPUSMT) project is to develop state-of-the-art neural machine translation (NMT) models that can be freely distributed and applied in research as well as professional applications. The goal is to pre-train translation models on a large scale on openly available parallel data and to create a catalogue of such resources for streamlined integration and deployment. For the latter, we also implement and improve web services and computer-assisted translation (CAT) tools that can be used in online interfaces and professional workflows. Furthermore, we want to enable the re-use of models to avoid repeating costly training procedures from scratch, and thereby contribute to reducing the carbon footprint of MT research and development. The ELG pilot project focused on European minority languages, improved translation quality in low-resource settings, and the integration of MT services in the ELG infrastructure. Peer reviewed.

    Paraphrase Detection on Noisy Subtitles in Six Languages

    We perform automatic paraphrase detection on subtitle data from the Opusparcus corpus comprising six European languages: German, English, Finnish, French, Russian, and Swedish. We train two types of supervised sentence embedding models: a word-averaging (WA) model and a gated recurrent averaging network (GRAN) model. We find that GRAN outperforms WA and is more robust to noisy training data. Better results are obtained with more and noisier data than with less and cleaner data. Additionally, we experiment on other datasets, without reaching the same level of performance, because of domain mismatch between training and test data. Peer reviewed.
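To make the word-averaging idea concrete, here is a minimal sketch, assuming a sentence embedding is the plain mean of its word vectors and that paraphrase candidates are compared by cosine similarity. The two-dimensional `TOY_VECTORS` are invented for illustration and are not trained; the models in the paper are learned with supervision, which this sketch does not attempt to reproduce.

```python
import math

# Hypothetical, hand-made word vectors used only for this illustration.
TOY_VECTORS = {
    "hi": [1.0, 0.0],
    "hello": [0.9, 0.1],
    "there": [0.0, 1.0],
    "bye": [-1.0, 0.0],
}


def average_embedding(tokens, vectors, dim):
    """Sentence embedding as the mean of the known word vectors."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return [0.0] * dim  # no known words: fall back to the zero vector
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]


def cosine(u, v):
    """Cosine similarity; defined as 0.0 when either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def similarity(sent_a, sent_b, vectors=TOY_VECTORS, dim=2):
    """Score two sentences by the cosine of their averaged embeddings."""
    emb_a = average_embedding(sent_a.lower().split(), vectors, dim)
    emb_b = average_embedding(sent_b.lower().split(), vectors, dim)
    return cosine(emb_a, emb_b)


if __name__ == "__main__":
    print(similarity("hi there", "hello there"))
    print(similarity("hi there", "bye"))
```

Averaging ignores word order, which is one reason a recurrent variant such as GRAN may capture more than a pure averaging model.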

    The OPUS Resource Repository : An Open Package for Creating Parallel Corpora and Machine Translation Services

    This paper presents a flexible and powerful system for creating parallel corpora and for running neural machine translation services. Our package provides a scalable data repository backend that offers transparent data pre-processing pipelines and automatic alignment procedures that facilitate the compilation of extensive parallel data sets from a variety of sources. Moreover, we develop a web-based interface that constitutes an intuitive frontend for end-users of the platform. The whole system can easily be distributed over virtual machines and implements a sophisticated permission system with secure connections and a flexible database for storing arbitrary metadata. Furthermore, we provide an interface for neural machine translation that can run as a service on virtual machines and that incorporates a connection to the data repository software. Peer reviewed.

    The University of Helsinki Submission to the IWSLT2020 Offline Speech Translation Task

    This paper describes the University of Helsinki Language Technology group’s participation in the IWSLT 2020 offline speech translation task, addressing the translation of English audio into German text. In line with this year’s task objective, we train both cascade and end-to-end systems for spoken language translation. We opt for an end-to-end multitasking architecture with shared internal representations and a cascade approach that follows a standard procedure consisting of ASR, correction, and MT stages. We also describe the experiments that served as a basis for the submitted systems. Our experiments reveal that multitask training with shared internal representations is not only possible but also allows for knowledge transfer across modalities. Peer reviewed.

    Annotation of subtitle paraphrases using a new web tool

    This paper analyzes the manual annotation effort carried out to produce Opusparcus, the Open Subtitles Paraphrase Corpus for six European languages. Within the scope of the project, a new web-based annotation tool was created. We discuss the design choices behind the tool as well as the setup of the annotation task. We also evaluate the annotations obtained. Two independent annotators had to decide to what extent two sentences meant approximately the same thing. The sentences originate from subtitles of movies and TV shows, which constitute an interesting genre of mostly colloquial language. Inter-annotator agreement was found to be on par with a well-known previous paraphrase resource from the news domain, the Microsoft Research Paraphrase Corpus (MSRPC). Our annotation tool is open source. The tool can be used for closed projects with restricted access and controlled user authentication as well as for open crowdsourced projects, in which anyone can participate and user identification is based on IP addresses. Peer reviewed.

    The FISKMÖ Project : Resources and Tools for Finnish-Swedish Machine Translation and Cross-Linguistic Research

    This paper presents FISKMÖ, a project that focuses on the development of resources and tools for cross-linguistic research and machine translation between Finnish and Swedish. The goal of the project is the compilation of a massive parallel corpus out of translated material collected from web sources, public and private organisations and language service providers in Finland with its two official languages. The project also aims at the development of open and freely accessible translation services for those two languages, both for general-purpose and for domain-specific use. We have released new data sets with over 3 million translation units, a benchmark test set for MT development, pre-trained neural MT models with high coverage and competitive performance, and a self-contained MT plugin for a popular CAT tool. The latter enables offline translation without dependencies on external services, making it possible to work with highly sensitive data without compromising security. Peer reviewed.