Stabilizing knowledge through standards - A perspective for the humanities
Standards are commonly said to generate mixed feelings among scientists: they
are often seen as failing to reflect the state of the art in a given domain
and as a hindrance to scientific creativity. Yet scientists should, in
principle, be best placed to bring their expertise into standards development,
being more neutral on issues that are typically tied to competing industrial
interests. Even if developing standards in the humanities might seem more
complex still, we will show how this can be made feasible through the
experience gained within both the Text Encoding Initiative consortium and the
International Organization for Standardization. Taking the specific case of
lexical resources, we will try to show how this work brings about new ideas
for designing future research infrastructures in the human and social
sciences.
Dodging the Data Bottleneck: Automatic Subtitling with Automatically Segmented ST Corpora
Speech translation for subtitling (SubST) is the task of automatically
translating speech data into well-formed subtitles by inserting subtitle breaks
compliant with specific display guidelines. As in speech translation
(ST), model training requires parallel data comprising audio inputs paired with
their textual translations. In SubST, however, the text also has to be
annotated with subtitle breaks. So far, this requirement has represented a
bottleneck for system development, as confirmed by the dearth of publicly
available SubST corpora. To fill this gap, we propose a method to convert
existing ST corpora into SubST resources without human intervention. We build a
segmenter model that automatically segments texts into proper subtitles by
exploiting audio and text in a multimodal fashion, achieving high segmentation
quality in zero-shot conditions. Comparative experiments with SubST systems
respectively trained on manual and automatic segmentations result in similar
performance, showing the effectiveness of our approach. Comment: Accepted to AACL 202
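The segmentation task this abstract describes, inserting breaks so that subtitles comply with display guidelines, can be illustrated with a toy rule-based baseline. This is a minimal sketch only: the paper's actual segmenter is a learned multimodal model, and the <eol>/<eob> break markers, the 42-character line limit, and the two-lines-per-block constraint used here are assumptions, not details taken from the abstract.

```python
MAX_CHARS = 42       # assumed display guideline: max characters per line
LINES_PER_BLOCK = 2  # assumed: max lines per subtitle block

def segment(text: str) -> str:
    """Greedily insert <eol> (line break) and <eob> (block break)
    markers so that no subtitle line exceeds MAX_CHARS characters."""
    out, line_len, lines_in_block = [], 0, 1
    for word in text.split():
        candidate = line_len + (1 if line_len else 0) + len(word)
        if line_len and candidate > MAX_CHARS:
            if lines_in_block < LINES_PER_BLOCK:
                out.append("<eol>")   # break the line within the block
                lines_in_block += 1
            else:
                out.append("<eob>")   # close the block, start a new one
                lines_in_block = 1
            line_len = len(word)
        else:
            line_len = candidate
        out.append(word)
    out.append("<eob>")               # final block boundary
    return " ".join(out)
```

For example, `segment("hello world")` yields `"hello world <eob>"`, while longer inputs are broken into 42-character lines grouped two per block.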
Customizing Information Capture and Access
This article presents a customizable architecture for software agents that capture and access information in large, heterogeneous, distributed electronic repositories. The key idea is to exploit underlying structure at various levels of granularity to build high-level indices with task-specific interpretations. Information agents construct such indices and are configured as a network of reusable modules called structure detectors and segmenters. We illustrate our architecture with the design and implementation of smart information filters in two contexts: retrieving stock market data from Internet newsgroups and retrieving technical reports from Internet FTP sites.
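The modular design described above, reusable segmenters that split raw repository content into units and structure detectors that assign task-specific interpretations, can be sketched as a small filtering pipeline. All names below (`paragraph_segmenter`, `stock_quote_detector`, `filter_agent`) are hypothetical illustrations of the concept, not modules from the article.

```python
from typing import Callable, Iterable, List

# A segmenter splits raw text into candidate segments; a structure
# detector decides whether a segment matches a task-specific pattern.
Segmenter = Callable[[str], List[str]]
Detector = Callable[[str], bool]

def paragraph_segmenter(raw: str) -> List[str]:
    """Split a newsgroup-style document into paragraph segments."""
    return [p.strip() for p in raw.split("\n\n") if p.strip()]

def stock_quote_detector(segment: str) -> bool:
    """Toy detector: flag segments that look like stock quotes."""
    return "$" in segment and any(c.isdigit() for c in segment)

def filter_agent(docs: Iterable[str], segmenter: Segmenter,
                 detector: Detector) -> List[str]:
    """Information agent: index only segments the detector accepts."""
    return [seg for doc in docs for seg in segmenter(doc) if detector(seg)]
```

Because segmenters and detectors share simple function signatures, they can be recombined freely, which is the reusability the abstract's "network of reusable modules" refers to.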
MT for subtitling: User evaluation of post-editing productivity
This paper presents a user evaluation of machine translation and post-editing for TV subtitles. Based on a process study in which 12 professional subtitlers translated and post-edited subtitles, we compare effort in terms of task time and number of keystrokes. We also discuss examples of specific subtitling features, such as condensation, and how these features may have affected the post-editing results. In addition to overall MT quality, segmentation and timing of the subtitles are found to be important issues to be addressed in future work. Peer reviewed.
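The two effort measures the study compares, task time and keystrokes, reduce to simple per-condition averages. The sketch below uses invented placeholder numbers purely to show the computation; they are not the paper's results, and the variable names are hypothetical.

```python
from statistics import mean

# Invented per-subtitler sessions: time in seconds and keystroke counts
# for translation from scratch vs. post-editing MT output.
translate = [{"time": 95, "keys": 410}, {"time": 88, "keys": 395}]
post_edit = [{"time": 62, "keys": 180}, {"time": 70, "keys": 210}]

def avg_effort(sessions):
    """Mean task time and keystrokes over a list of sessions."""
    return {"time": mean(s["time"] for s in sessions),
            "keys": mean(s["keys"] for s in sessions)}

# Relative savings of post-editing over translating from scratch.
savings = {k: 1 - avg_effort(post_edit)[k] / avg_effort(translate)[k]
           for k in ("time", "keys")}
```

With these toy numbers, post-editing saves roughly 28% of task time and 52% of keystrokes; the study's actual findings depend on its measured data.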