Stabilizing knowledge through standards - A perspective for the humanities
Standards are commonly said to generate mixed feelings among scientists: they
are often seen as failing to reflect the state of the art in a given domain
and as a hindrance to scientific creativity. Yet scientists should, in
principle, be best placed to bring their expertise into standards development,
being more neutral on issues that are typically tied to competing industrial
interests. Even if developing standards in the humanities might seem more
complex still, we will show how this can be made feasible through the
experience gained within both the Text Encoding Initiative consortium and the
International Organization for Standardization. Taking the specific case of
lexical resources, we will try to show how this work brings about new ideas
for designing future research infrastructures in the human and social
sciences.
Dodging the Data Bottleneck: Automatic Subtitling with Automatically Segmented ST Corpora
Speech translation for subtitling (SubST) is the task of automatically
translating speech data into well-formed subtitles by inserting subtitle breaks
compliant with specific display guidelines. As in speech translation
(ST), model training requires parallel data comprising audio inputs paired with
their textual translations. In SubST, however, the text also has to be
annotated with subtitle breaks. So far, this requirement has represented a
bottleneck for system development, as confirmed by the dearth of publicly
available SubST corpora. To fill this gap, we propose a method to convert
existing ST corpora into SubST resources without human intervention. We build a
segmenter model that automatically segments texts into proper subtitles by
exploiting audio and text in a multimodal fashion, achieving high segmentation
quality in zero-shot conditions. Comparative experiments with SubST systems
respectively trained on manual and automatic segmentations result in similar
performance, showing the effectiveness of our approach. Comment: Accepted to AACL 202
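The segmentation task this abstract describes, inserting breaks so that subtitles comply with display guidelines, can be illustrated with a toy rule-based baseline. This is a minimal sketch only: the paper's actual segmenter is a learned multimodal model, and the <eol>/<eob> break markers, the 42-character line limit, and the two-lines-per-block constraint used here are assumptions, not details taken from the abstract.

```python
MAX_CHARS = 42       # assumed display guideline: max characters per line
LINES_PER_BLOCK = 2  # assumed: max lines per subtitle block

def segment(text: str) -> str:
    """Greedily insert <eol> (line break) and <eob> (block break)
    markers so that no subtitle line exceeds MAX_CHARS characters."""
    out, line_len, lines_in_block = [], 0, 1
    for word in text.split():
        candidate = line_len + (1 if line_len else 0) + len(word)
        if line_len and candidate > MAX_CHARS:
            if lines_in_block < LINES_PER_BLOCK:
                out.append("<eol>")   # break the line within the block
                lines_in_block += 1
            else:
                out.append("<eob>")   # close the block, start a new one
                lines_in_block = 1
            line_len = len(word)
        else:
            line_len = candidate
        out.append(word)
    out.append("<eob>")               # final block boundary
    return " ".join(out)
```

For example, `segment("hello world")` yields `"hello world <eob>"`, while longer inputs are broken into 42-character lines grouped two per block.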
Customizing Information Capture and Access
This article presents a customizable architecture for software agents that capture and access information in large, heterogeneous, distributed electronic repositories. The key idea is to exploit underlying structure at various levels of granularity to build high-level indices with task-specific interpretations. Information agents construct such indices and are configured as a network of reusable modules called structure detectors and segmenters. We illustrate our architecture with the design and implementation of smart information filters in two contexts: retrieving stock market data from Internet newsgroups and retrieving technical reports from Internet FTP sites.
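The modular design described above, reusable segmenters that split raw repository content into units and structure detectors that assign task-specific interpretations, can be sketched as a small filtering pipeline. All names below (`paragraph_segmenter`, `stock_quote_detector`, `filter_agent`) are hypothetical illustrations of the concept, not modules from the article.

```python
from typing import Callable, Iterable, List

# A segmenter splits raw text into candidate segments; a structure
# detector decides whether a segment matches a task-specific pattern.
Segmenter = Callable[[str], List[str]]
Detector = Callable[[str], bool]

def paragraph_segmenter(raw: str) -> List[str]:
    """Split a newsgroup-style document into paragraph segments."""
    return [p.strip() for p in raw.split("\n\n") if p.strip()]

def stock_quote_detector(segment: str) -> bool:
    """Toy detector: flag segments that look like stock quotes."""
    return "$" in segment and any(c.isdigit() for c in segment)

def filter_agent(docs: Iterable[str], segmenter: Segmenter,
                 detector: Detector) -> List[str]:
    """Information agent: index only segments the detector accepts."""
    return [seg for doc in docs for seg in segmenter(doc) if detector(seg)]
```

Because segmenters and detectors share simple function signatures, they can be recombined freely, which is the reusability the abstract's "network of reusable modules" refers to.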
MT for subtitling: User evaluation of post-editing productivity
This paper presents a user evaluation of machine translation and post-editing for TV subtitles. Based on a process study in which 12 professional subtitlers translated and post-edited subtitles, we compare effort in terms of task time and number of keystrokes. We also discuss examples of specific subtitling features, such as condensation, and how these features may have affected the post-editing results. In addition to overall MT quality, segmentation and timing of the subtitles are found to be important issues to be addressed in future work. Peer reviewed.
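The two effort measures the study compares, task time and keystrokes, reduce to simple per-condition averages. The sketch below uses invented placeholder numbers purely to show the computation; they are not the paper's results, and the variable names are hypothetical.

```python
from statistics import mean

# Invented per-subtitler sessions: time in seconds and keystroke counts
# for translation from scratch vs. post-editing MT output.
translate = [{"time": 95, "keys": 410}, {"time": 88, "keys": 395}]
post_edit = [{"time": 62, "keys": 180}, {"time": 70, "keys": 210}]

def avg_effort(sessions):
    """Mean task time and keystrokes over a list of sessions."""
    return {"time": mean(s["time"] for s in sessions),
            "keys": mean(s["keys"] for s in sessions)}

# Relative savings of post-editing over translating from scratch.
savings = {k: 1 - avg_effort(post_edit)[k] / avg_effort(translate)[k]
           for k in ("time", "keys")}
```

With these toy numbers, post-editing saves roughly 28% of task time and 52% of keystrokes; the study's actual findings depend on its measured data.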