404 research outputs found

    Dodging the Data Bottleneck: Automatic Subtitling with Automatically Segmented ST Corpora

    Speech translation for subtitling (SubST) is the task of automatically translating speech data into well-formed subtitles by inserting subtitle breaks compliant with specific display guidelines. As in speech translation (ST), model training requires parallel data comprising audio inputs paired with their textual translations. In SubST, however, the text also has to be annotated with subtitle breaks. So far, this requirement has been a bottleneck for system development, as confirmed by the dearth of publicly available SubST corpora. To fill this gap, we propose a method to convert existing ST corpora into SubST resources without human intervention. We build a segmenter model that automatically segments texts into proper subtitles by exploiting audio and text in a multimodal fashion, achieving high segmentation quality in zero-shot conditions. Comparative experiments with SubST systems trained on manual and automatic segmentations, respectively, yield similar performance, showing the effectiveness of our approach. Comment: Accepted to AACL 2022.
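
    To make the notion of subtitle breaks concrete, here is a minimal, text-only sketch that greedily inserts <eol> (end-of-line) and <eob> (end-of-block) markers into a translated sentence. It only illustrates the output format of the task; it is not the paper's multimodal segmenter, and the 42-characters-per-line and two-lines-per-block limits are assumed display guidelines rather than values taken from the abstract.

    # Minimal sketch: greedy, text-only insertion of subtitle break markers.
    # Assumptions: 42 characters per line and 2 lines per block (common
    # subtitling guidelines). The paper's actual segmenter is a learned
    # multimodal model and is not reproduced here.
    MAX_CHARS_PER_LINE = 42
    MAX_LINES_PER_BLOCK = 2

    def greedy_subtitle_breaks(words):
        """Insert <eol> (new line) and <eob> (new block) markers greedily."""
        out = []
        line_len = 0        # characters on the current subtitle line
        lines_in_block = 1  # lines already opened in the current block
        for word in words:
            needed = len(word) if line_len == 0 else line_len + 1 + len(word)
            if line_len > 0 and needed > MAX_CHARS_PER_LINE:
                if lines_in_block < MAX_LINES_PER_BLOCK:
                    out.append("<eol>")  # break to a new line, same block
                    lines_in_block += 1
                else:
                    out.append("<eob>")  # close the block, start a new one
                    lines_in_block = 1
                line_len = len(word)
            else:
                line_len = needed
            out.append(word)
        out.append("<eob>")              # close the final block
        return " ".join(out)

    sentence = ("this is a long translated sentence that would not fit on a "
                "single subtitle line and therefore needs well placed breaks")
    print(greedy_subtitle_breaks(sentence.split()))

    A learned segmenter, as described in the abstract, would replace these fixed character thresholds with break decisions predicted from both the audio and the text.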

    Design and Evaluation of Machine Translation Systems for Practical Applications (実応用を志向した機械翻訳システムの設計と評価)

    Tohoku University doctoral thesis, Doctor of Information Sciences.

    Modeling information structure in a cross-linguistic perspective

    This study makes substantial contributions to both the theoretical and computational treatment of information structure, with a specific focus on creating natural language processing applications such as multilingual machine translation systems. It first provides cross-linguistic findings with regard to information structure meanings and markings. Building upon these findings, the proposed model represents information structure within the HPSG/MRS framework using Individual Constraints. The primary goal of the study is to create a multilingual grammar model of information structure for the LinGO Grammar Matrix system. The study explores the construction of a grammar library for creating customized grammars that incorporate information structure and illustrates how the information structure-based model improves the performance of transfer-based machine translation.

    Multilingual Information Extraction

    Knowledge organization

    Since Svenonius analyzed the research base in bibliographic control in 1990, the intervening years have seen major shifts in the focus of information organization in academic libraries. New technologies continue to reshape the nature and content of catalogs, stretch the boundaries of classification research, and provide new alternatives for the organization of information. Research studies have rigorously analyzed the structure of the Anglo-American Cataloguing Rules using entity-relationship modeling and expanded on the bibliographic and authority relationship research to develop new data models (Functional Requirements for Bibliographic Records [FRBR] and Functional Requirements and Numbering of Authority Records [FRANAR]). Applied research into the information organization process has led to the development of cataloguing tools and harvesting applications for bibliographic data collection and automatic record creation. A growing international perspective has focused research on multilingual subject access, transliteration problems in surrogate records, and user studies to improve Online Public Access Catalog (OPAC) displays for large retrieval sets resulting from federated searches. The need to organize local and remote electronic resources led to metadata research that developed general and domain-specific metadata schemes. Ongoing research in this area focuses on record structures and architectural models to enable interoperability among the various schemes and differing application platforms. Research in the area of subject access and classification is strong, covering areas such as vocabulary mapping, automatic facet construction and deconstruction for Web resources, development of expert systems for automatic classification, dynamically altered classificatory structures linked to domain-specific thesauri, cross-cultural conceptual structures in classification, identification of semantic relationships for vocabulary mapped to classification systems, and the expanded use of traditional classification systems as switching languages in the global Web environment. Finally, descriptive research into library and information science (LIS) education and curricula for knowledge organization continues. All of this research is applicable to knowledge organization in academic and research libraries. This chapter examines this body of research in depth, describes the research methodologies employed, and identifies lacunae in need of further research.

    A Survey on Semantic Processing Techniques

    Semantic processing is a fundamental research domain in computational linguistics. In the era of powerful pre-trained language models and large language models, the advancement of research in this domain appears to be decelerating. However, the study of semantics is multi-dimensional in linguistics, and the research depth and breadth of computational semantic processing can be largely improved with new technologies. In this survey, we analyze five semantic processing tasks, namely word sense disambiguation, anaphora resolution, named entity recognition, concept extraction, and subjectivity detection. We study relevant theoretical research in these fields, advanced methods, and downstream applications. We connect the surveyed tasks with downstream applications because this may inspire future scholars to fuse these low-level semantic processing tasks with high-level natural language processing tasks. The review of theoretical research may also inspire new tasks and technologies in the semantic processing domain. Finally, we compare the different semantic processing techniques and summarize their technical trends, application trends, and future directions. Comment: Published in Information Fusion, Volume 101, 2024, 101988, ISSN 1566-2535. The equal-contribution mark is missing in the published version due to the publication policies; please contact Prof. Erik Cambria for details.
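
    As a small illustration of one of the surveyed low-level tasks feeding a downstream application, the sketch below runs named entity recognition and collects typed mentions that a higher-level system could consume. spaCy and its en_core_web_sm model are assumptions chosen for the example, not tools discussed or endorsed by the survey.

    # Illustrative sketch: named entity recognition as a low-level semantic
    # processing step. Assumes spaCy is installed and the model has been
    # downloaded with: python -m spacy download en_core_web_sm
    import spacy

    nlp = spacy.load("en_core_web_sm")
    doc = nlp("The survey was published in Information Fusion in 2024.")

    # Typed entity mentions that a downstream task (e.g., question answering
    # or knowledge-base population) could build on.
    entities = [(ent.text, ent.label_) for ent in doc.ents]
    print(entities)  # exact labels depend on the model version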

    Research in the Language, Information and Computation Laboratory of the University of Pennsylvania

    This report takes its name from the Computational Linguistics Feedback Forum (CLiFF), an informal discussion group for students and faculty. However, the scope of the research covered in this report is broader than the title might suggest; this is the yearly report of the LINC Lab, the Language, Information and Computation Laboratory of the University of Pennsylvania. It may at first be hard to see the threads that bind together the work presented here, work by faculty, graduate students, and postdocs in the Computer Science and Linguistics Departments and the Institute for Research in Cognitive Science. It includes prototypical Natural Language fields such as Combinatory Categorial Grammars, Tree Adjoining Grammars, syntactic parsing, and the syntax-semantics interface, but it extends to statistical methods, plan inference, instruction understanding, intonation, causal reasoning, free word order languages, geometric reasoning, medical informatics, connectionism, and language acquisition. Naturally, this introduction cannot spell out all the connections between these abstracts; we invite you to explore them on your own. In fact, with this issue it's easier than ever to do so: this document is accessible on the "information superhighway". Just call up http://www.cis.upenn.edu/~cliff-group/94/cliffnotes.html In addition, you can find many of the papers referenced in the CLiFF Notes on the net. Most can be obtained by following links from the authors' abstracts in the web version of this report. The abstracts describe the researchers' many areas of investigation, explain their shared concerns, and present some interesting work in Cognitive Science. We hope its new online format makes the CLiFF Notes a more useful and interesting guide to Computational Linguistics activity at Penn.