705 research outputs found

    Stabilizing knowledge through standards - A perspective for the humanities

    Get PDF
    It is usual to consider that standards generate mixed feelings among scientists. They are often seen as not really reflecting the state of the art in a given domain and a hindrance to scientific creativity. Still, scientists should theoretically be at the best place to bring their expertise into standard developments, being even more neutral on issues that may typically be related to competing industrial interests. Even if it could be thought of as even more complex to think about developping standards in the humanities, we will show how this can be made feasible through the experience gained both within the Text Encoding Initiative consortium and the International Organisation for Standardisation. By taking the specific case of lexical resources, we will try to show how this brings about new ideas for designing future research infrastructures in the human and social sciences

    Dodging the Data Bottleneck: Automatic Subtitling with Automatically Segmented ST Corpora

    Get PDF
    Speech translation for subtitling (SubST) is the task of automatically translating speech data into well-formed subtitles by inserting subtitle breaks compliant to specific displaying guidelines. Similar to speech translation (ST), model training requires parallel data comprising audio inputs paired with their textual translations. In SubST, however, the text has to be also annotated with subtitle breaks. So far, this requirement has represented a bottleneck for system development, as confirmed by the dearth of publicly available SubST corpora. To fill this gap, we propose a method to convert existing ST corpora into SubST resources without human intervention. We build a segmenter model that automatically segments texts into proper subtitles by exploiting audio and text in a multimodal fashion, achieving high segmentation quality in zero-shot conditions. Comparative experiments with SubST systems respectively trained on manual and automatic segmentations result in similar performance, showing the effectiveness of our approach.Comment: Accepted to AACL 202

    Dodging the Data Bottleneck: Automatic Subtitling with Automatically Segmented ST Corpora

    Get PDF
    Speech translation for subtitling (SubST) is the task of automatically translating speech data into well-formed subtitles by inserting subtitle breaks compliant to specific displaying guidelines. Similar to speech translation (ST), model training requires parallel data comprising audio inputs paired with their textual translations. In SubST, however, the text has to be also annotated with subtitle breaks. So far, this requirement has represented a bottleneck for system development, as confirmed by the dearth of publicly available SubST corpora. To fill this gap, we propose a method to convert existing ST corpora into SubST resources without human intervention. We build a segmenter model that automatically segments texts into proper subtitles by exploiting audio and text in a multimodal fashion, achieving high segmentation quality in zero-shot conditions. Comparative experiments with SubST systems respectively trained on manual and automatic segmentations result in similar performance, showing the effectiveness of our approach

    Customizing Information Capture and Access

    Get PDF
    This article presents a customizable architecture for software agents that capture and access information in large, heterogeneous, distributed electronic repositories. The key idea is to exploit underlying structure at various levels of granularity to build high-level indices with task-specific interpretations. Information agents construct such indices and are configured as a network of reusable modules called structure detectors and segmenters. We illustrate our architecture with the design and implementation of smart information filters in two contexts: retrieving stock market data from Internet newsgroups and retrieving technical reports from Internet FTP sites

    MT for subtitling : User evaluation of post-editing productivity

    Get PDF
    This paper presents a user evaluation of machine translation and post-editing for TV subtitles. Based on a process study where 12 professional subtitlers translated and post-edited subtitles, we compare effort in terms of task time and number of keystrokes. We also discuss examples of specific subtitling features like condensation, and how these features may have affected the post-editing results. In addition to overall MT quality, segmentation and timing of the subtitles are found to be important issues to be addressed in future work.Peer reviewe
    • …
    corecore