
    SEM-POS: Grammatically and Semantically Correct Video Captioning

    Generating grammatically and semantically correct captions in video captioning is a challenging task. The captions generated by existing methods are either produced word by word without regard to grammatical structure or miss key information from the input videos. To address these issues, we introduce a novel global-local fusion network with a Global-Local Fusion Block (GLFB) that encodes and fuses features from different parts of speech (POS) components with visual-spatial features. We use novel combinations of different POS components - 'determinant + subject', 'auxiliary verb', 'verb', and 'determinant + object' - to supervise the POS blocks Det + Subject, Aux Verb, Verb, and Det + Object, respectively. The global-local fusion network together with the POS blocks helps align the visual features with the language description to generate grammatically and semantically correct captions. Extensive qualitative and quantitative experiments on the benchmark MSVD and MSRVTT datasets demonstrate that the proposed approach generates more grammatically and semantically correct captions than existing methods, achieving a new state of the art. Ablations on the POS blocks and the GLFB demonstrate the impact of these contributions on the proposed method.
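    As a rough illustration of the kind of fusion the abstract describes, below is a minimal sketch of a cross-attention-based global-local fusion step in which POS-block embeddings attend over visual-spatial features. The class name, feature dimensions, and layer layout are assumptions made for illustration, not the authors' released architecture.

    ```python
    import torch
    import torch.nn as nn

    class GlobalLocalFusionBlock(nn.Module):
        """Fuses per-POS ("local") language features with global visual-spatial features (illustrative)."""

        def __init__(self, dim: int = 512, num_heads: int = 8):
            super().__init__()
            # Cross-attention: POS-block queries attend over visual-spatial features.
            self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
            self.norm1 = nn.LayerNorm(dim)
            self.norm2 = nn.LayerNorm(dim)
            self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.ReLU(), nn.Linear(4 * dim, dim))

        def forward(self, pos_feats: torch.Tensor, visual_feats: torch.Tensor) -> torch.Tensor:
            # pos_feats:    (B, P, D), one embedding per POS block (Det+Subject, Aux Verb, Verb, Det+Object)
            # visual_feats: (B, N, D), visual-spatial features from the video encoder
            attended, _ = self.cross_attn(pos_feats, visual_feats, visual_feats)
            fused = self.norm1(pos_feats + attended)      # residual + norm
            return self.norm2(fused + self.ffn(fused))    # position-wise feed-forward

    # Toy usage: 4 POS blocks attending over 32 visual tokens.
    block = GlobalLocalFusionBlock()
    out = block(torch.randn(2, 4, 512), torch.randn(2, 32, 512))
    print(out.shape)  # torch.Size([2, 4, 512])
    ```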

    Beyond Generic: Enhancing Image Captioning with Real-World Knowledge using Vision-Language Pre-Training Model

    Current captioning approaches tend to generate correct but "generic" descriptions that lack real-world knowledge, e.g., named entities and contextual information. Considering that Vision-Language Pre-Training (VLP) models master massive amounts of such knowledge from large-scale web-harvested data, it is promising to utilize the generalizability of VLP models to incorporate knowledge into image descriptions. However, using VLP models faces challenges: zero-shot inference suffers from knowledge hallucination that leads to low-quality descriptions, while the generic bias in downstream task fine-tuning hinders the VLP model from expressing knowledge. To address these concerns, we propose a simple yet effective method called Knowledge-guided Replay (K-Replay), which enables the retention of pre-training knowledge during fine-tuning. Our approach consists of two parts: (1) a knowledge prediction task on automatically collected replay exemplars to continuously awaken the VLP model's memory of knowledge, thus preventing the model from collapsing into the generic pattern; (2) a knowledge distillation constraint to improve the faithfulness of generated descriptions and hence alleviate knowledge hallucination. To evaluate knowledge-enhanced descriptions, we construct a novel captioning benchmark, KnowCap, containing knowledge of landmarks, famous brands, special foods and movie characters. Experimental results show that our approach effectively incorporates knowledge into descriptions, outperforming a strong VLP baseline by 20.9 points (78.7 -> 99.6) in CIDEr score and 20.5 percentage points (34.0% -> 54.5%) in knowledge recognition accuracy. Our code and data are available at https://github.com/njucckevin/KnowCap. (Comment: Accepted at ACM Multimedia (ACMMM) 2023)
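    For a concrete picture of the two-part objective the abstract describes, here is a minimal sketch of a K-Replay-style fine-tuning loss: the downstream captioning loss, plus a knowledge-prediction term on replay exemplars and a distillation term against the frozen pre-trained VLP model. All function and tensor names, and the simple weighted sum, are illustrative assumptions rather than the released implementation.

    ```python
    import torch
    import torch.nn.functional as F

    def kreplay_loss(model, frozen_vlp, downstream_batch, replay_batch,
                     lambda_know: float = 1.0, lambda_distill: float = 1.0):
        # (1) Standard fine-tuning objective on the downstream captioning data.
        logits = model(downstream_batch["inputs"])                      # (B, T, V)
        caption_loss = F.cross_entropy(logits.flatten(0, 1),
                                       downstream_batch["targets"].flatten())

        # (2) Knowledge prediction on replay exemplars keeps pre-training knowledge
        #     (named entities, landmarks, ...) active during fine-tuning.
        replay_logits = model(replay_batch["inputs"])
        knowledge_loss = F.cross_entropy(replay_logits.flatten(0, 1),
                                         replay_batch["knowledge_targets"].flatten())

        # (3) Distillation toward the frozen VLP teacher discourages hallucinated
        #     knowledge by keeping the fine-tuned distribution close to the teacher's.
        with torch.no_grad():
            teacher_logits = frozen_vlp(replay_batch["inputs"])
        distill_loss = F.kl_div(F.log_softmax(replay_logits, dim=-1),
                                F.softmax(teacher_logits, dim=-1),
                                reduction="batchmean")

        return caption_loss + lambda_know * knowledge_loss + lambda_distill * distill_loss
    ```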

    Exploring Multiliteracies and Other Approaches to Second Language Teaching

    This teaching portfolio offers a selection from the author's graduate coursework, teaching experience, and research undertaken while enrolled in the Utah State University Master of Second Language Teaching (MSLT) program. The documents included reflect her pedagogical approach and teaching practice, developed across a variety of professional contexts, including teaching English and French as a second language. This portfolio includes: reflections on the author's teaching environment, a teaching philosophy statement, a professional development peer observation, a reflection paper that demonstrates the author's experiences teaching with stories within the multiliteracies framework, specifically multimodal fairy tales on The Fable Cottage platform, and finally, a consideration of future career goals related to language learning and teaching.

    Survey of the State of the Art in Natural Language Generation: Core tasks, applications and evaluation

    This paper surveys the current state of the art in Natural Language Generation (NLG), defined as the task of generating text or speech from non-linguistic input. A survey of NLG is timely in view of the changes the field has undergone over the past decade or so, especially the rise of new (usually data-driven) methods and new applications of NLG technology. This survey therefore aims to (a) give an up-to-date synthesis of research on the core tasks in NLG and the architectures in which such tasks are organised; (b) highlight a number of relatively recent research topics that have arisen partly as a result of growing synergies between NLG and other areas of artificial intelligence; (c) draw attention to the challenges in NLG evaluation, relating them to similar challenges faced in other areas of Natural Language Processing, with an emphasis on different evaluation methods and the relationships between them. (Comment: Published in the Journal of AI Research (JAIR), volume 61, pp 75-170. 118 pages, 8 figures, 1 table)

    Subtitling for the Deaf and Hard-of-Hearing - The Reception of Moulin Rouge! as a case study

    In audiovisual translation, the viewer, who is in a special position, must always be taken into account alongside the constraints of space and time. Subtitling conventions vary depending on whether one examines foreign-language DVD films or Finnish-language television channels. The purpose of this study is to examine how the subtitles of a musical film are produced for deaf and hard-of-hearing viewers. The material consists of the English-language subtitles for the deaf and hard-of-hearing of the film Moulin Rouge!, together with a survey designed by the author, Musikaalien tekstitys kuuroille (Subtitling of Musicals for the Deaf), which was distributed online to Finnish deaf and hard-of-hearing viewers in the summer of 2012. The thesis analyses the description of music, exclamations and background sounds in the DVD film Moulin Rouge!, as well as condensation and reformulation. The last-mentioned chapter is divided into three subsections, which examine the omission of words, the reformulation of words, phatic communication, and simple tenses and the condensation of sentences. The study found English to be the dominant subtitling language for the deaf and hard-of-hearing on the DVD market; Finnish viewers therefore have to read the special subtitles of films in English. On the other hand, the Finnish broadcaster Yle's channels TV1, TV2 and Yle Teema already offer a substantial amount of subtitling aimed at deaf and hard-of-hearing audiences, so Finnish viewers also benefit from special subtitles in certain programmes.

    Open Visual Knowledge Extraction via Relation-Oriented Multimodality Model Prompting

    Images contain rich relational knowledge that can help machines understand the world. Existing methods for visual knowledge extraction often rely on a pre-defined format (e.g., sub-verb-obj tuples) or vocabulary (e.g., relation types), restricting the expressiveness of the extracted knowledge. In this work, we present a first exploration of a new paradigm of open visual knowledge extraction. To achieve this, we present OpenVik, which consists of an open relational region detector that detects regions potentially containing relational knowledge, and a visual knowledge generator that generates format-free knowledge by prompting a large multimodality model with the detected region of interest. We also explore two data enhancement techniques for diversifying the generated format-free visual knowledge. Extensive knowledge quality evaluations highlight the correctness and uniqueness of the open visual knowledge extracted by OpenVik. Moreover, integrating our extracted knowledge into various visual reasoning applications shows consistent improvements, indicating the real-world applicability of OpenVik. (Comment: Accepted to NeurIPS 2023)
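    The two-stage pipeline the abstract outlines (detect relational regions, then prompt a multimodality model per region) can be sketched as follows. `Region`, `detector`, and `generator` are hypothetical placeholders standing in for OpenVik's actual detector and prompting code, not the released implementation.

    ```python
    from dataclasses import dataclass
    from typing import List, Tuple
    from PIL import Image

    @dataclass
    class Region:
        box: Tuple[int, int, int, int]   # (left, top, right, bottom) in pixels
        score: float                     # detector confidence that the region is "relational"

    def extract_open_knowledge(image: Image.Image,
                               detector,
                               generator,
                               prompt: str = "Describe the relation in this region:") -> List[str]:
        """Return one free-form knowledge sentence per detected relational region (illustrative)."""
        knowledge = []
        for region in detector(image):        # stage 1: open relational region detection
            crop = image.crop(region.box)     # region of interest
            text = generator(crop, prompt)    # stage 2: prompt the large multimodality model
            knowledge.append(text.strip())
        return knowledge
    ```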

    Sign Languages, Translation, and Interpreting: Creative Practices in Audiovisual Content

    This article explores current creative practices involving the representation of sign languages, sign language interpreting, sign language translation (Napier and Leeson 2016; HBB4ALL 2017; CNLSE 2017; Tamayo 2022), and sign language live translation (Tamayo 2022) in audiovisual content. To that end, the concept of creative sign language is reviewed, along with previous publications on the matter. Subsequently, the implementation of creativity at different production stages, and the use of different resources when sign languages are present in audiovisual content, are discussed by analyzing selected innovative examples (mostly of practices in Spain). Finally, a taxonomy is proposed that takes into account not only internal creativity (that which is inherent to sign languages), but also collaborative and external creativity. Conclusions focus on how creative practices can expand our understanding of different art expressions, human communication, and inclusion, and can help establish new and meaningful connections among them. This work is part of the consolidated research group TRALIMA/ITZULIK (IT1209–19), recognized as such by the Basque Government, as well as the ALMA research network (RED 2018–102475-T), recognized by the Ministry of Science, Innovation and Universities of Spain. This work is also part of The Quality of Live Subtitling (QuaLiSub), a regional, national, and international study funded by the Spanish Ministry of Science and Innovation (ref. PID2020–117738RB-I00).