25 research outputs found

    Translation crowdsourcing: creating a multilingual corpus of online educational content

    Get PDF
    The present work describes a multilingual corpus of online content in the educational domain, i.e. Massive Open Online Course material, ranging from course forum text to subtitles of online video lectures, that has been developed via large-scale crowdsourcing. The English source text is manually translated into 11 European and BRIC languages using the CrowdFlower platform. During the process several challenges arose which mainly involved the in-domain text genre, the large text volume, the idiosyncrasies of each target language, the limitations of the crowdsourcing platform, as well as the quality assurance and workflow issues of the crowdsourcing process. The corpus constitutes a product of the EU-funded TraMOOC project and is utilised in the project in order to train, tune and test machine translation engines

    Enhancing Access to Online Education: Quality Machine Translation of MOOC Content

    Get PDF
    Contains fulltext : 162505.pdf (publisher's version ) (Open Access)The International Conference on Language Resources and Evaluation (LREC) 2016, 23 mei 201

    Improving Machine Translation of Educational Content via Crowdsourcing

    Get PDF
    The limited availability of in-domain training data is a major issue in the training of application-specific neural machine translation models. Professional outsourcing of bilingual data collections is costly and often not feasible. In this paper we analyze the influence of using crowdsourcing as a scalable way to obtain translations of target in-domain data having in mind that the translations can be of a lower quality. We apply crowdsourcing with carefully designed quality controls to create parallel corpora for the educational domain by collecting translations of texts from MOOCs from English to eleven languages, which we then use to fine-tune neural machine translation models previously trained on general-domain data. The results from our research indicate that crowdsourced data collected with proper quality controls consistently yields performance gains over general-domain baseline systems, and systems fine-tuned with pre-existing in-domain corpora

    Ensemble and Deep Learning for Language-Independent Automatic Selection of Parallel Data

    No full text
    Machine translation is used in many applications in everyday life. Due to the increase of translated documents that need to be organized as useful or not (for building a translation model), the automated categorization of texts (classification), is a popular research field of machine learning. This kind of information can be quite helpful for machine translation. Our parallel corpora (English-Greek and English-Italian) are based on educational data, which are quite difficult to translate. We apply two state of the art architectures, Random Forest (RF) and Deeplearnig4j (DL4J), to our data (which constitute three translation outputs). To our knowledge, this is the first time that deep learning architectures are applied to the automatic selection of parallel data. We also propose new string-based features that seem to be effective for the classifier, and we investigate whether an attribute selection method could be used for better classification accuracy. Experimental results indicate an increase of up to 4% (compared to our previous work) using RF and rather satisfactory results using DL4J

    Learning of syntactic dependencies and development of modern greek grammars

    No full text
    The thesis aims firstly at the acquisition of syntactic information (detection of verb complements, acquisition of verb subcategorization frames (SF), detection of the boundaries and the semantic type of clauses) automatically from Modern Greek and English text corpora with the use of various state-of-the-art and novel machine learning techniques, and, secondly, at the theoretical description of the Greek syntax through formal grammatical theories like Unification Grammar and Head-driven Phrase Structure Grammar. The thesis has been based on the following novel axes: 1. Corpus pre-processing has been limited to the use of minimum linguistic resources to ensure the portability of the presented methodologies to languages that are poorly equipped with resources. 2. Due to the low pre-processing level, a significant amount of noise appears in the data, which is dealt with One-sided Sampling. Examples that do not contribute to the learning process are detected and removed. The final data set is clean and learning performance improves significantly. 3. The importance of the acquired information is proven. The importance of complements is shown by the improvement in the performance of the SF acquisition process after the incorporation of complement information. The importance of the acquired SF lexicon is shown by its incorporation in a shallow syntactic parser and the increase of the performance of the latter. 4. The methods are applied on Modern Greek and on English to show their portability across different languages and to allow for an interesting rough comparison between the two languages. The results are very satisfactory, comparable to, and in some cases better than, approaches utilizing sophisticated resources for pre-processing.Η παρούσα διατριβή έχει ως σκοπό της, πρώτον, την ανάκτηση συντακτικής πληροφορίας (αναγνώριση συμπληρωμάτων ρημάτων, ανάκτηση πλαισίων υποκατηγοριοποίησης (ΠΥ) ρημάτων, αναγνώριση των ορίων και του είδους των προτάσεων) αυτόματα μέσα από ελληνικά και αγγλικά σώματα κειμένων με την χρήση ποικίλων και καινοτόμων τεχνικών μηχανικής μάθησης και, δεύτερον, την θεωρητική περιγραφή της ελληνικής σύνταξης μέσω τυπικών γλωσσολογικών φορμαλισμών, όπως η γραμματική Ενοποίησης και η γραμματική Φραστικής Δομής Οδηγούμενη από τον Κύριο Όρο. Η διατριβή κινήθηκε πάνω στους εξής καινοτόμους άξονες: 1. Η προεπεξεργασία των σωμάτων κειμένων βασίστηκε σε ελάχιστους γλωσσολογικούς πόρους για να είναι δυνατή η μεταφορά των μεθόδων σε γλώσσες φτωχές σε υποδομή. 2. Η αντιμετώπιση του θορύβου που υπεισέρχεται στα δεδομένα εξ αιτίας της χρήσης ελάχιστων πόρων πραγματοποιείται με Μονόπλευρη Δειγματοληψία. Εντοπίζονται αυτόματα παραδείγματα δεδομένων που δεν προσφέρουν στην μάθηση και αφαιρούνται. Τα τελικά δεδομένα είναι πιο καθαρά και η απόδοση της μάθησης βελτιώνεται πολύ. 3. Αποδεικνύεται η χρησιμότητα της εξαχθείσας πληροφορίας. Η χρησιμότητα των συμπληρωμάτων φαίνεται από την αύξηση της απόδοσης της διαδικασίας ανάκτησης ΠΥ με την χρήση τους. Η χρησιμότητα των εξαγόμενων ΠΥ φαίνεται από την αύξηση της απόδοσης ενός ρηχού συντακτικού αναλυτή με την χρήση τους. 4. Οι μέθοδοι εφαρμόζονται και στα Αγγλικά και στα Ελληνικά για να φανεί η μεταφερσιμότητά τους σε διαφορετικές γλώσσες και για να πραγματοποιηθεί μια ενδιαφέρουσα σχετική σύγκριση ανάμεσα στις δύο γλώσσες. Τα αποτελέσματα είναι πολύ ενθαρρυντικά, συγκρίσιμα με, και σε πολλές περιπτώσεις καλύτερα από, προσεγγίσεις που χρησιμοποιούν εξελιγμένα εργαλεία προεπεξεργασίας

    The Dual-Coding and Multimedia Learning Theories: Film Subtitles as a Vocabulary Teaching Tool

    No full text
    The use of multimedia has often been suggested as a teaching tool in foreign language teaching and learning. In foreign language education, exciting new multimedia applications have appeared over the last years, especially for young learners, but many of these do not seem to produce the desired effect in language development. This article looks into the theories of dual-coding (DCT) and multimedia learning (CTML) as the theoretical basis for the development of more effective digital tools with the use of films and subtitling. Bilingual dual-coding is also presented as a means of indirect access from one language to another and the different types of subtitling are explored regarding their effectiveness, especially in the field of short-term and long-term vocabulary recall and development. Finally, the article looks into some new alternative audiovisual tools that actively engage learners with films and subtitling, tailored towards vocabulary learning
    corecore