SEM-POS: Grammatically and Semantically Correct Video Captioning
Generating grammatically and semantically correct captions in video
captioning is a challenging task. Captions generated by existing
methods are either produced word by word, without regard to grammatical structure, or
miss key information from the input videos. To address these issues, we
introduce a novel global-local fusion network, with a Global-Local Fusion Block
(GLFB) that encodes and fuses features from different parts of speech (POS)
components with visual-spatial features. We use novel combinations of different
POS components - 'determiner + subject', 'auxiliary verb', 'verb', and
'determiner + object' - for supervision of the POS blocks Det + Subject, Aux
Verb, Verb, and Det + Object, respectively. The novel global-local fusion
network together with POS blocks helps align the visual features with language
description to generate grammatically and semantically correct captions.
Extensive qualitative and quantitative experiments on benchmark MSVD and MSRVTT
datasets demonstrate that the proposed approach generates more grammatically
and semantically correct captions compared to the existing methods, achieving
the new state-of-the-art. Ablations on the POS blocks and the GLFB demonstrate
the impact of these contributions on the proposed method.
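As a concrete illustration of the supervision targets above, a caption such as "a man is playing the guitar" would be split into the four POS components. The sketch below is hypothetical and not the authors' code; the function name and the example split are assumptions for illustration only.

```python
# Hypothetical illustration of the four POS-component supervision
# targets described above; not the authors' implementation.

def pos_components(caption_split):
    """Bundle the four supervision targets used for the POS blocks:
    Det + Subject, Aux Verb, Verb, and Det + Object."""
    det_subject, aux_verb, verb, det_object = caption_split
    return {
        "det_subject": det_subject,  # e.g. "a man"
        "aux_verb": aux_verb,        # e.g. "is"
        "verb": verb,                # e.g. "playing"
        "det_object": det_object,    # e.g. "the guitar"
    }

targets = pos_components(["a man", "is", "playing", "the guitar"])
```

Each component would then supervise its corresponding POS block during training.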
Beyond Generic: Enhancing Image Captioning with Real-World Knowledge using Vision-Language Pre-Training Model
Current captioning approaches tend to generate correct but "generic"
descriptions that lack real-world knowledge, e.g., named entities and
contextual information. Considering that Vision-Language Pre-Training (VLP)
models acquire a massive amount of such knowledge from large-scale web-harvested data, it is
promising to exploit the generalizability of VLP models to incorporate
knowledge into image descriptions. However, using VLP models faces challenges:
zero-shot inference suffers from knowledge hallucination, which leads to
low-quality descriptions, while the generic bias introduced by downstream fine-tuning
hinders the VLP model from expressing its knowledge. To address these concerns, we
propose a simple yet effective method called Knowledge-guided Replay
(K-Replay), which enables the retention of pre-training knowledge during
fine-tuning. Our approach consists of two parts: (1) a knowledge prediction
task on automatically collected replay exemplars to continuously awaken the VLP
model's memory about knowledge, thus preventing the model from collapsing into
the generic pattern; (2) a knowledge distillation constraint to improve the
faithfulness of generated descriptions, thereby alleviating the knowledge
hallucination. To evaluate knowledge-enhanced descriptions, we construct a
novel captioning benchmark KnowCap, containing knowledge of landmarks, famous
brands, special foods and movie characters. Experimental results show that our
approach effectively incorporates knowledge into descriptions, outperforming
the strong VLP baseline by 20.9 points (78.7 -> 99.6) in CIDEr score and 20.5
percentage points (34.0% -> 54.5%) in knowledge recognition accuracy. Our code
and data are available at https://github.com/njucckevin/KnowCap.
Comment: Accepted at ACM Multimedia (ACMMM) 202
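The two-part approach described above amounts to adding two auxiliary terms to the standard captioning loss during fine-tuning. The sketch below is a minimal illustration under assumed names and weights; the actual formulation and values are in the paper and repository.

```python
# Minimal sketch of a K-Replay-style fine-tuning objective: the standard
# captioning loss plus (1) a knowledge prediction loss on replay exemplars
# and (2) a knowledge distillation constraint. The weights here are
# illustrative assumptions, not the paper's values.

def k_replay_objective(caption_loss, replay_loss, distill_loss,
                       replay_weight=1.0, distill_weight=0.5):
    """Combine the three loss terms into a single scalar objective."""
    return (caption_loss
            + replay_weight * replay_loss      # keeps pre-training knowledge awake
            + distill_weight * distill_loss)   # curbs knowledge hallucination

total = k_replay_objective(caption_loss=2.0, replay_loss=1.0, distill_loss=0.8)
```

The replay term discourages collapse into the generic pattern, while the distillation term anchors generations to the pre-trained model's knowledge.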
Exploring Multiliteracies and Other Approaches to Second Language Teaching
This teaching portfolio offers a selection from the author’s graduate coursework, teaching experience, and research undertaken while enrolled in the Utah State University Master of Second Language Teaching (MSLT) program. The documents included are a reflection of her pedagogical approach and teaching practice, developed through varying contexts of professional experience, including teaching English and French as a second language. This portfolio includes: reflections on the author’s teaching environment, a teaching philosophy statement, a professional development peer observation, a reflection paper that demonstrates the author’s experiences teaching with stories within the context of the multiliteracies framework, specifically multimodal fairy tales with The Fable Cottage platform, and finally, a consideration of future career goals related to language learning and teaching.
Survey of the State of the Art in Natural Language Generation: Core tasks, applications and evaluation
This paper surveys the current state of the art in Natural Language
Generation (NLG), defined as the task of generating text or speech from
non-linguistic input. A survey of NLG is timely in view of the changes that the
field has undergone over the past decade or so, especially in relation to new
(usually data-driven) methods, as well as new applications of NLG technology.
This survey therefore aims to (a) give an up-to-date synthesis of research on
the core tasks in NLG and the architectures adopted in which such tasks are
organised; (b) highlight a number of relatively recent research topics that
have arisen partly as a result of growing synergies between NLG and other areas
of artificial intelligence; (c) draw attention to the challenges in NLG
evaluation, relating them to similar challenges faced in other areas of Natural
Language Processing, with an emphasis on different evaluation methods and the
relationships between them.
Comment: Published in Journal of AI Research (JAIR), volume 61, pp 75-170. 118
pages, 8 figures, 1 table
Subtitling for the Deaf and Hard-of-Hearing - The Reception of Moulin Rouge! as a case study
In audiovisual translation, the viewer's special position must always be taken into account, in addition to the constraints of space and time. Subtitling conventions vary depending on whether one examines foreign-language DVD films or Finnish-language television channels. The purpose of this study is to examine how the subtitles of a musical film are produced for deaf and hard-of-hearing viewers. The material consists of the English subtitles for the deaf and hard-of-hearing of the film Moulin Rouge!, together with a survey designed by the author, Subtitling of Musicals for the Deaf, which was distributed online to Finnish deaf and hard-of-hearing viewers in the summer of 2012.
The thesis analyzes the DVD film Moulin Rouge! in terms of music description, exclamations, background sounds, and condensation and rephrasing. The last-mentioned section is divided into subsections examining the omission of words, the rephrasing of words, phatic communication, and simple tenses and sentence condensation.
The study found English to be the dominant subtitling language for the deaf and hard-of-hearing on the DVD market. Finnish viewers are thus forced to read these special subtitles for films in English. On the other hand, Yle's channels TV1, TV2 and Yle Teema already offer plenty of subtitling aimed at deaf and hard-of-hearing viewers, so Finnish viewers also benefit from special subtitles in certain programmes.
Open Visual Knowledge Extraction via Relation-Oriented Multimodality Model Prompting
Images contain rich relational knowledge that can help machines understand
the world. Existing methods on visual knowledge extraction often rely on the
pre-defined format (e.g., sub-verb-obj tuples) or vocabulary (e.g., relation
types), restricting the expressiveness of the extracted knowledge. In this
work, we take a first step toward a new paradigm of open visual knowledge
extraction. To achieve this, we present OpenVik, which consists of an open
relational region detector to detect regions potentially containing relational
knowledge and a visual knowledge generator that generates format-free knowledge
by prompting the large multimodality model with the detected region of
interest. We also explore two data enhancement techniques for diversifying the
generated format-free visual knowledge. Extensive knowledge quality evaluations
highlight the correctness and uniqueness of the extracted open visual knowledge
by OpenVik. Moreover, integrating our extracted knowledge across various visual
reasoning applications shows consistent improvements, indicating the real-world
applicability of OpenVik.
Comment: Accepted to NeurIPS 202
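The two-stage pipeline described above can be sketched as follows. All function names and outputs here are hypothetical placeholders with stubbed results, not the released OpenVik code or API.

```python
# Illustrative two-stage sketch of open visual knowledge extraction:
# Stage 1 detects regions likely to contain relational knowledge;
# Stage 2 prompts a large multimodality model with each detected
# region to generate format-free knowledge. Both stages are stubbed.

def detect_relational_regions(image):
    """Stage 1 (stub): an open relational region detector would return
    bounding boxes of regions potentially containing relational knowledge."""
    return [{"bbox": (0, 0, 64, 64)}, {"bbox": (32, 32, 96, 96)}]

def generate_format_free_knowledge(image, region):
    """Stage 2 (stub): a visual knowledge generator would prompt a large
    multimodality model with the detected region of interest."""
    x, y, *_ = region["bbox"]
    return f"open knowledge for region at ({x}, {y})"

def extract_open_visual_knowledge(image):
    return [generate_format_free_knowledge(image, region)
            for region in detect_relational_regions(image)]

knowledge = extract_open_visual_knowledge(image=None)
```

Because the generator is prompted rather than constrained to a fixed tuple format or relation vocabulary, the extracted knowledge stays format-free.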
Sign Languages, Translation, and Interpreting: Creative Practices in Audiovisual Content
This article explores current creative practices involving the representation
of sign languages, sign language interpreting, sign language
translation (Napier and Leeson 2016; HBB4ALL 2017; CNLSE
2017; Tamayo 2022), and sign language live translation (Tamayo
2022) in audiovisual content. To that end, a review of the concept
creative sign language and a review of previous publications on
the matter will be provided. Subsequently, the implementation of
creativity at different production stages, and the use of different resources
when sign languages are present in audiovisual content, will
be discussed by analyzing some selected innovative examples (mostly
of practices in Spain). Finally, a taxonomy will be proposed that takes into account
not only internal creativity (that is, creativity inherent to sign languages), but also
collaborative and external creativity. Conclusions will focus on how creative
practices can expand our understanding of different art expressions,
human communication, and inclusion, and can help establish
new and meaningful connections among them.
This work is part of the consolidated research group TRALIMA/ITZULIK (IT1209–19), recognized as such by the Basque
Government as well as the ALMA research network (RED 2018–102475-T) recognized by the Ministry of Science, Innovation and
Universities of Spain. This work is also part of The Quality of Live Subtitling (QuaLiSub), a regional, national, and international study funded by the Spanish Ministry of Science and Innovation (ref. PID2020–117738RB-I00).