
    SEM-POS: Grammatically and Semantically Correct Video Captioning

    Generating grammatically and semantically correct captions in video captioning is a challenging task. The captions generated by existing methods are either produced word by word without regard to grammatical structure or miss key information from the input videos. To address these issues, we introduce a novel global-local fusion network with a Global-Local Fusion Block (GLFB) that encodes and fuses features from different parts of speech (POS) components with visual-spatial features. We use novel combinations of different POS components - 'determinant + subject', 'auxiliary verb', 'verb', and 'determinant + object' - to supervise the POS blocks Det + Subject, Aux Verb, Verb, and Det + Object, respectively. The global-local fusion network together with the POS blocks helps align the visual features with the language description to generate grammatically and semantically correct captions. Extensive qualitative and quantitative experiments on the benchmark MSVD and MSRVTT datasets demonstrate that the proposed approach generates more grammatically and semantically correct captions than existing methods, achieving a new state of the art. Ablations on the POS blocks and the GLFB demonstrate the impact of these contributions on the proposed method.
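    As a rough illustration of the kind of fusion the abstract describes, below is a minimal sketch of a cross-attention-based global-local fusion step in which POS-block embeddings attend over visual-spatial features. The class name, feature dimensions, and layer layout are assumptions made for illustration, not the authors' released architecture.

    ```python
    import torch
    import torch.nn as nn

    class GlobalLocalFusionBlock(nn.Module):
        """Fuses per-POS ("local") language features with global visual-spatial features (illustrative)."""

        def __init__(self, dim: int = 512, num_heads: int = 8):
            super().__init__()
            # Cross-attention: POS-block queries attend over visual-spatial features.
            self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
            self.norm1 = nn.LayerNorm(dim)
            self.norm2 = nn.LayerNorm(dim)
            self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.ReLU(), nn.Linear(4 * dim, dim))

        def forward(self, pos_feats: torch.Tensor, visual_feats: torch.Tensor) -> torch.Tensor:
            # pos_feats:    (B, P, D), one embedding per POS block (Det+Subject, Aux Verb, Verb, Det+Object)
            # visual_feats: (B, N, D), visual-spatial features from the video encoder
            attended, _ = self.cross_attn(pos_feats, visual_feats, visual_feats)
            fused = self.norm1(pos_feats + attended)      # residual + norm
            return self.norm2(fused + self.ffn(fused))    # position-wise feed-forward

    # Toy usage: 4 POS blocks attending over 32 visual tokens.
    block = GlobalLocalFusionBlock()
    out = block(torch.randn(2, 4, 512), torch.randn(2, 32, 512))
    print(out.shape)  # torch.Size([2, 4, 512])
    ```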

    Beyond Generic: Enhancing Image Captioning with Real-World Knowledge using Vision-Language Pre-Training Model

    Current captioning approaches tend to generate correct but "generic" descriptions that lack real-world knowledge, e.g., named entities and contextual information. Considering that Vision-Language Pre-Training (VLP) models master massive amounts of such knowledge from large-scale web-harvested data, it is promising to utilize the generalizability of VLP models to incorporate knowledge into image descriptions. However, using VLP models faces challenges: zero-shot inference suffers from knowledge hallucination that leads to low-quality descriptions, while the generic bias in downstream task fine-tuning hinders the VLP model from expressing knowledge. To address these concerns, we propose a simple yet effective method called Knowledge-guided Replay (K-Replay), which enables the retention of pre-training knowledge during fine-tuning. Our approach consists of two parts: (1) a knowledge prediction task on automatically collected replay exemplars to continuously awaken the VLP model's memory of knowledge, thus preventing the model from collapsing into the generic pattern; (2) a knowledge distillation constraint to improve the faithfulness of generated descriptions and hence alleviate knowledge hallucination. To evaluate knowledge-enhanced descriptions, we construct a novel captioning benchmark, KnowCap, containing knowledge of landmarks, famous brands, special foods and movie characters. Experimental results show that our approach effectively incorporates knowledge into descriptions, outperforming a strong VLP baseline by 20.9 points (78.7 -> 99.6) in CIDEr score and 20.5 percentage points (34.0% -> 54.5%) in knowledge recognition accuracy. Our code and data are available at https://github.com/njucckevin/KnowCap. (Comment: Accepted at ACM Multimedia (ACMMM) 2023)
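    For a concrete picture of the two-part objective the abstract describes, here is a minimal sketch of a K-Replay-style fine-tuning loss: the downstream captioning loss, plus a knowledge-prediction term on replay exemplars and a distillation term against the frozen pre-trained VLP model. All function and tensor names, and the simple weighted sum, are illustrative assumptions rather than the released implementation.

    ```python
    import torch
    import torch.nn.functional as F

    def kreplay_loss(model, frozen_vlp, downstream_batch, replay_batch,
                     lambda_know: float = 1.0, lambda_distill: float = 1.0):
        # (1) Standard fine-tuning objective on the downstream captioning data.
        logits = model(downstream_batch["inputs"])                      # (B, T, V)
        caption_loss = F.cross_entropy(logits.flatten(0, 1),
                                       downstream_batch["targets"].flatten())

        # (2) Knowledge prediction on replay exemplars keeps pre-training knowledge
        #     (named entities, landmarks, ...) active during fine-tuning.
        replay_logits = model(replay_batch["inputs"])
        knowledge_loss = F.cross_entropy(replay_logits.flatten(0, 1),
                                         replay_batch["knowledge_targets"].flatten())

        # (3) Distillation toward the frozen VLP teacher discourages hallucinated
        #     knowledge by keeping the fine-tuned distribution close to the teacher's.
        with torch.no_grad():
            teacher_logits = frozen_vlp(replay_batch["inputs"])
        distill_loss = F.kl_div(F.log_softmax(replay_logits, dim=-1),
                                F.softmax(teacher_logits, dim=-1),
                                reduction="batchmean")

        return caption_loss + lambda_know * knowledge_loss + lambda_distill * distill_loss
    ```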

    Exploring Multiliteracies and Other Approaches to Second Language Teaching

    This teaching portfolio offers a selection from the author's graduate coursework, teaching experience, and research undertaken while enrolled in the Utah State University Master of Second Language Teaching (MSLT) program. The documents included reflect her pedagogical approach and teaching practice, developed across a variety of professional contexts, including teaching English and French as a second language. This portfolio includes: reflections on the author's teaching environment, a teaching philosophy statement, a professional development peer observation, a reflection paper that demonstrates the author's experiences teaching with stories within the multiliteracies framework, specifically multimodal fairy tales on The Fable Cottage platform, and finally, a consideration of future career goals related to language learning and teaching.

    Survey of the State of the Art in Natural Language Generation: Core tasks, applications and evaluation

    This paper surveys the current state of the art in Natural Language Generation (NLG), defined as the task of generating text or speech from non-linguistic input. A survey of NLG is timely in view of the changes the field has undergone over the past decade or so, especially the rise of new (usually data-driven) methods and new applications of NLG technology. This survey therefore aims to (a) give an up-to-date synthesis of research on the core tasks in NLG and the architectures in which such tasks are organised; (b) highlight a number of relatively recent research topics that have arisen partly as a result of growing synergies between NLG and other areas of artificial intelligence; (c) draw attention to the challenges in NLG evaluation, relating them to similar challenges faced in other areas of Natural Language Processing, with an emphasis on different evaluation methods and the relationships between them. (Comment: Published in the Journal of AI Research (JAIR), volume 61, pp 75-170. 118 pages, 8 figures, 1 table)

    Subtitling for the Deaf and Hard-of-Hearing - The Reception of Moulin Rouge! as a case study

    In audiovisual translation, the viewer, who is in a special position, must always be taken into account alongside the constraints of space and time. Subtitling conventions vary depending on whether one examines foreign-language DVD films or Finnish-language television channels. The purpose of this study is to examine how the subtitles of a musical film are produced for deaf and hard-of-hearing viewers. The material consists of the English-language subtitles for the deaf and hard-of-hearing of the film Moulin Rouge!, together with a survey designed by the author, Musikaalien tekstitys kuuroille (Subtitling of Musicals for the Deaf), which was distributed online to Finnish deaf and hard-of-hearing viewers in the summer of 2012. The thesis analyses the description of music, exclamations and background sounds in the DVD film Moulin Rouge!, as well as condensation and reformulation. The last-mentioned chapter is divided into three subsections, which examine the omission of words, the reformulation of words, phatic communication, and simple tenses and the condensation of sentences. The study found English to be the dominant subtitling language for the deaf and hard-of-hearing on the DVD market; Finnish viewers therefore have to read the special subtitles of films in English. On the other hand, the Finnish broadcaster Yle's channels TV1, TV2 and Yle Teema already offer a substantial amount of subtitling aimed at deaf and hard-of-hearing audiences, so Finnish viewers also benefit from special subtitles in certain programmes.

    Open Visual Knowledge Extraction via Relation-Oriented Multimodality Model Prompting

    Images contain rich relational knowledge that can help machines understand the world. Existing methods for visual knowledge extraction often rely on a pre-defined format (e.g., sub-verb-obj tuples) or vocabulary (e.g., relation types), restricting the expressiveness of the extracted knowledge. In this work, we present a first exploration of a new paradigm of open visual knowledge extraction. To achieve this, we present OpenVik, which consists of an open relational region detector that detects regions potentially containing relational knowledge, and a visual knowledge generator that generates format-free knowledge by prompting a large multimodality model with the detected region of interest. We also explore two data enhancement techniques for diversifying the generated format-free visual knowledge. Extensive knowledge quality evaluations highlight the correctness and uniqueness of the open visual knowledge extracted by OpenVik. Moreover, integrating our extracted knowledge into various visual reasoning applications shows consistent improvements, indicating the real-world applicability of OpenVik. (Comment: Accepted to NeurIPS 2023)
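    The two-stage pipeline the abstract outlines (detect relational regions, then prompt a multimodality model per region) can be sketched as follows. `Region`, `detector`, and `generator` are hypothetical placeholders standing in for OpenVik's actual detector and prompting code, not the released implementation.

    ```python
    from dataclasses import dataclass
    from typing import List, Tuple
    from PIL import Image

    @dataclass
    class Region:
        box: Tuple[int, int, int, int]   # (left, top, right, bottom) in pixels
        score: float                     # detector confidence that the region is "relational"

    def extract_open_knowledge(image: Image.Image,
                               detector,
                               generator,
                               prompt: str = "Describe the relation in this region:") -> List[str]:
        """Return one free-form knowledge sentence per detected relational region (illustrative)."""
        knowledge = []
        for region in detector(image):        # stage 1: open relational region detection
            crop = image.crop(region.box)     # region of interest
            text = generator(crop, prompt)    # stage 2: prompt the large multimodality model
            knowledge.append(text.strip())
        return knowledge
    ```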

    Sign Languages, Translation, and Interpreting: Creative Practices in Audiovisual Content

    This article explores current creative practices involving the representation of sign languages, sign language interpreting, sign language translation (Napier and Leeson 2016; HBB4ALL 2017; CNLSE 2017; Tamayo 2022), and sign language live translation (Tamayo 2022) in audiovisual content. To that end, the concept of creative sign language is reviewed, along with previous publications on the matter. Subsequently, the implementation of creativity at different production stages, and the use of different resources when sign languages are present in audiovisual content, are discussed by analyzing selected innovative examples (mostly of practices in Spain). Finally, a taxonomy is proposed that takes into account not only internal creativity (that which is inherent to sign languages), but also collaborative and external creativity. Conclusions focus on how creative practices can expand our understanding of different art expressions, human communication, and inclusion, and can help establish new and meaningful connections among them. This work is part of the consolidated research group TRALIMA/ITZULIK (IT1209–19), recognized as such by the Basque Government, as well as the ALMA research network (RED 2018–102475-T), recognized by the Ministry of Science, Innovation and Universities of Spain. This work is also part of The Quality of Live Subtitling (QuaLiSub), a regional, national, and international study funded by the Spanish Ministry of Science and Innovation (ref. PID2020–117738RB-I00).