1,968 research outputs found

    Effectiveness of textually-enhanced captions on Chinese high-school EFL learners’ incidental vocabulary learning

    This study employed a mixed-methods approach to investigate the impact of textually-enhanced captions on EFL learners’ incidental vocabulary gains and their perceptions of caption usefulness in a multi-modal learning environment. 133 low-intermediate Chinese high-school EFL learners were randomly assigned to English captions with highlighted target words and L1 glosses (ECL1), Chinese and English captions (CEC), Chinese and English captions with highlighted target words (CECGW), or no captions (NC). The quasi-experimental findings detected no significant differences among the caption types for vocabulary form recognition, while ECL1 was the most effective for meaning recall and meaning recognition. Caption type and learners’ language proficiency exerted medium-to-large effects on meaning recall and meaning recognition. The qualitative data suggested that participants generally viewed captioned videos positively, with some variability in perceptions of the concurrent presentation of information. The saliency of L1 glosses could direct viewers’ attention to the semantic features of a word and reinforce sound-form-meaning connections. Videos lacking L1 glosses of target words had relatively little effect on vocabulary learning, and more textual input did not necessarily result in greater vocabulary gains. Pedagogical implications are proposed for teachers’ use of L1 in captioned videos to enhance learning effectiveness.

    Temporal Deformable Convolutional Encoder-Decoder Networks for Video Captioning

    Video captioning is widely regarded as a fundamental but challenging task in both computer vision and artificial intelligence. The prevalent approach is to map an input video to a variable-length output sentence in a sequence-to-sequence manner via a Recurrent Neural Network (RNN). Nevertheless, RNN training still suffers to some degree from the vanishing/exploding gradient problem, making optimization difficult. Moreover, the inherently recurrent dependency in an RNN prevents parallelization within a sequence during training and therefore limits computation. In this paper, we present a novel design, Temporal Deformable Convolutional Encoder-Decoder Networks (dubbed TDConvED), that fully employs convolutions in both the encoder and decoder networks for video captioning. Technically, we exploit convolutional block structures that compute intermediate states over a fixed number of inputs and stack several blocks to capture long-term relationships. The encoder structure is further equipped with temporal deformable convolution to enable free-form deformation of the temporal sampling. Our model also capitalizes on a temporal attention mechanism for sentence generation. Extensive experiments are conducted on the MSVD and MSR-VTT video captioning datasets, and superior results are reported compared to conventional RNN-based encoder-decoder techniques. More remarkably, TDConvED increases CIDEr-D performance from 58.8% to 67.2% on MSVD.
    Comment: AAAI 201
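    To make the encoder-decoder idea concrete, below is a minimal sketch (not the paper’s exact TDConvED architecture, which additionally learns deformable temporal offsets) of a gated temporal convolutional encoder over frame features followed by temporal attention; tensor shapes and hyperparameters are illustrative assumptions.

    ```python
    # Minimal sketch: convolutional encoding of frame features plus temporal attention.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class GatedTemporalConvBlock(nn.Module):
        """One convolutional block over a fixed temporal window, gated with a GLU."""
        def __init__(self, dim: int, kernel_size: int = 3):
            super().__init__()
            # Produce 2*dim channels so F.glu can split them into value and gate halves.
            self.conv = nn.Conv1d(dim, 2 * dim, kernel_size, padding=kernel_size // 2)

        def forward(self, x):                              # x: (batch, time, dim)
            h = self.conv(x.transpose(1, 2))               # (batch, 2*dim, time)
            h = F.glu(h, dim=1).transpose(1, 2)            # gated linear unit
            return x + h                                   # residual connection

    class ConvEncoderWithAttention(nn.Module):
        """Stack conv blocks to widen the temporal receptive field, then let a
        decoder query attend over the encoded frame features."""
        def __init__(self, dim: int = 512, num_blocks: int = 2):
            super().__init__()
            self.blocks = nn.ModuleList(GatedTemporalConvBlock(dim) for _ in range(num_blocks))

        def forward(self, frames, query):                  # frames: (B, T, D), query: (B, D)
            h = frames
            for block in self.blocks:
                h = block(h)
            scores = torch.einsum("btd,bd->bt", h, query) / h.size(-1) ** 0.5
            weights = scores.softmax(dim=-1)               # temporal attention weights
            return torch.einsum("bt,btd->bd", weights, h)  # attended video context

    # Usage: encode 26 sampled frame features and attend with a decoder query.
    frames = torch.randn(4, 26, 512)
    query = torch.randn(4, 512)
    print(ConvEncoderWithAttention()(frames, query).shape)  # torch.Size([4, 512])
    ```

    Because every block sees only a fixed window, stacking blocks is what captures longer-range temporal relationships, and all positions within a sequence can be computed in parallel, unlike an RNN.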

    Video summarisation: A conceptual framework and survey of the state of the art

    Video summaries provide condensed and succinct representations of the content of a video stream through a combination of still images, video segments, graphical representations and textual descriptors. This paper presents a conceptual framework for video summarisation derived from the research literature and used as a means of surveying that literature. The framework distinguishes between video summarisation techniques (the methods used to process content from a source video stream to produce a summary of that stream) and video summaries (the outputs of those techniques). Video summarisation techniques are considered within three broad categories: internal (analysing information sourced directly from the video stream), external (analysing information not sourced directly from the video stream) and hybrid (analysing a combination of internal and external information). Video summaries are considered as a function of the type of content they are derived from (object, event, perception or feature based) and the functionality offered to the user for their consumption (interactive or static, personalised or generic). It is argued that video summarisation would benefit from greater incorporation of external information, particularly unobtrusively sourced user-based information, in order to overcome longstanding challenges such as the semantic gap and to provide video summaries with greater relevance to individual users.
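    Purely to make the taxonomy concrete, here is a minimal sketch that encodes the survey’s categories as Python enums and a small dataclass; the names and the example summary are illustrative assumptions, not part of the survey.

    ```python
    # Minimal encoding of the survey's summarisation taxonomy.
    from dataclasses import dataclass
    from enum import Enum, auto

    class TechniqueType(Enum):
        INTERNAL = auto()   # analyses information sourced directly from the video stream
        EXTERNAL = auto()   # analyses information not sourced from the stream (e.g. user data)
        HYBRID = auto()     # combines internal and external information

    class ContentBasis(Enum):
        OBJECT = auto()
        EVENT = auto()
        PERCEPTION = auto()
        FEATURE = auto()

    @dataclass
    class VideoSummary:
        derived_from: ContentBasis
        technique: TechniqueType
        interactive: bool     # interactive vs. static consumption
        personalised: bool    # personalised vs. generic

    # Example: a static, generic keyframe summary built from low-level features.
    print(VideoSummary(ContentBasis.FEATURE, TechniqueType.INTERNAL,
                       interactive=False, personalised=False))
    ```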

    Multi-modal Dense Video Captioning

    Dense video captioning is the task of localizing interesting events in an untrimmed video and producing a textual description (caption) for each localized event. Most previous work in dense video captioning is based solely on visual information and completely ignores the audio track. However, audio, and speech in particular, are vital cues for a human observer in understanding an environment. In this paper, we present a new dense video captioning approach that is able to utilize any number of modalities for event description. Specifically, we show how the audio and speech modalities may improve a dense video captioning model. We apply an automatic speech recognition (ASR) system to obtain a temporally aligned textual description of the speech (similar to subtitles) and treat it as a separate input alongside the video frames and the corresponding audio track. We formulate the captioning task as a machine translation problem and utilize the recently proposed Transformer architecture to convert the multi-modal input data into textual descriptions. We demonstrate the performance of our model on the ActivityNet Captions dataset. The ablation studies indicate a considerable contribution from the audio and speech components, suggesting that these modalities contain substantial information complementary to the video frames. Furthermore, we provide an in-depth analysis of the ActivityNet Captions results by leveraging the category tags obtained from the original YouTube videos. Code is publicly available: github.com/v-iashin/MDVC
    Comment: To appear in the proceedings of CVPR Workshops 2020; Code: https://github.com/v-iashin/MDVC; Project Page: https://v-iashin.github.io/mdv
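    As a rough illustration of the idea, the sketch below feeds video features and ASR tokens into a standard Transformer to generate caption tokens. It is not the authors’ MDVC implementation: the feature dimensions, vocabulary size and fusion by simple concatenation along the time axis are all assumptions made for brevity.

    ```python
    # Minimal multi-modal captioning sketch: fuse video features with ASR text.
    import torch
    import torch.nn as nn

    class MultiModalCaptioner(nn.Module):
        def __init__(self, vocab_size: int = 10000, d_model: int = 512):
            super().__init__()
            self.video_proj = nn.Linear(1024, d_model)            # project visual features
            self.asr_embed = nn.Embedding(vocab_size, d_model)    # embed ASR tokens
            self.word_embed = nn.Embedding(vocab_size, d_model)   # embed caption tokens
            self.transformer = nn.Transformer(d_model=d_model, batch_first=True)
            self.out = nn.Linear(d_model, vocab_size)

        def forward(self, video_feats, asr_tokens, caption_tokens):
            # Concatenate the two modalities along the time axis to form one memory.
            memory_in = torch.cat([self.video_proj(video_feats),
                                   self.asr_embed(asr_tokens)], dim=1)
            tgt = self.word_embed(caption_tokens)
            # Causal mask so each caption position only attends to earlier words.
            mask = self.transformer.generate_square_subsequent_mask(tgt.size(1))
            h = self.transformer(memory_in, tgt, tgt_mask=mask)
            return self.out(h)   # (batch, caption_len, vocab_size) logits

    # Usage with toy inputs: 30 frame features, 20 ASR tokens, 12 caption tokens.
    model = MultiModalCaptioner()
    logits = model(torch.randn(2, 30, 1024),
                   torch.randint(0, 10000, (2, 20)),
                   torch.randint(0, 10000, (2, 12)))
    print(logits.shape)  # torch.Size([2, 12, 10000])
    ```

    Treating the ASR transcript as just another input sequence is what lets the same machinery scale to any number of modalities.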

    Structuring lecture videos for distance learning applications. ISMSE

    This paper presents a novel automatic approach to structuring and indexing lecture videos for distance learning applications. By structuring video content, we can support both topic indexing and semantic querying of multimedia documents. Our aim is to link the discussion topics extracted from the electronic slides with their associated video and audio segments. The two major techniques in our approach are video text analysis and speech recognition. Initially, a video is partitioned into shots based on slide transitions. For each shot, the embedded video text is detected, reconstructed and segmented as high-resolution foreground text for recognition by commercial OCR software. The recognized text can then be matched with its associated slide for video indexing. Meanwhile, both phrases (titles) and keywords (content) are extracted from the electronic slides and spotted in the speech signal. The spotted phrases and keywords are further utilized as queries to retrieve the most similar slide for speech indexing.
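    A minimal sketch of the matching step, not the paper’s implementation: recognised on-screen text from a shot is compared against each slide by simple token overlap (Jaccard score) so the shot can be indexed by its most similar slide. The slide list, OCR string and scoring rule are illustrative assumptions.

    ```python
    # Minimal sketch: index a video shot by its best-matching electronic slide.
    import re

    def tokens(text: str) -> set[str]:
        """Lower-case word tokens, standing in for OCR/ASR output."""
        return set(re.findall(r"[a-z0-9]+", text.lower()))

    def best_matching_slide(ocr_text: str, slides: list[str]) -> int:
        """Return the index of the slide whose text overlaps most (Jaccard)
        with the text recognised in the current video shot."""
        shot = tokens(ocr_text)
        scores = []
        for slide in slides:
            s = tokens(slide)
            union = shot | s
            scores.append(len(shot & s) / len(union) if union else 0.0)
        return max(range(len(slides)), key=scores.__getitem__)

    # Usage: match a shot whose on-screen text was recognised by OCR.
    slides = ["Introduction to Speech Recognition",
              "Hidden Markov Models for acoustic modelling",
              "Evaluation: word error rate"]
    print(best_matching_slide("hidden markov models acoustic model", slides))  # 1
    ```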

    Subtitles for the Deaf and Hard of Hearing: immersion through creative language [the Stranger Things case]

    The objective of this study is to explore the linguistic characteristics of Subtitles for the Deaf and Hard of Hearing (SDH), with a focus on their creativity. Specifically, the study aims to ascertain how the use of linguistic creativity in the description of music and sound effects affects enjoyment. To accomplish this objective, all sound and music descriptors from the final episode of the Netflix show Stranger Things, "The Piggyback", were classified based on Tsaousi's (2015) taxonomy according to their exegetic, narrative, contextual and emotive functions. The analysis aims to identify patterns that could establish a relationship between creative description and immersion, taking into account the feedback provided by the audience through social media and the expectations of the new public. The ultimate objective of this research is to approach SDH from a new angle by exploring the possibilities of linguistic creativity and demonstrating how they result in a more immersive and enjoyable experience.