1,069 research outputs found
A Dataset for Movie Description
Descriptive video service (DVS) provides linguistic descriptions of movies
and allows visually impaired people to follow a movie along with their peers.
Such descriptions are by design mainly visual and thus naturally form an
interesting data source for computer vision and computational linguistics. In
this work, we propose a novel dataset containing transcribed DVS temporally
aligned to full-length HD movies. In addition, we also collected the aligned
movie scripts used in prior work and compare the two
different sources of descriptions. In total the Movie Description dataset
contains a parallel corpus of over 54,000 sentences and video snippets from 72
HD movies. We characterize the dataset by benchmarking different approaches for
generating video descriptions. Comparing DVS to scripts, we find that DVS is
far more visual and describes precisely what is shown rather than what should
happen according to the scripts created prior to movie production.
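To make the structure of such a parallel corpus concrete, a minimal Python sketch of one aligned (snippet, sentence) entry is given below; the field names and example values are illustrative assumptions, not the released data format.

    # A minimal sketch (field names and values are assumptions, not the released
    # format) of one entry pairing a transcribed DVS sentence with the time
    # interval of its aligned video snippet in a full-length movie.
    from dataclasses import dataclass

    @dataclass
    class AlignedSnippet:
        movie: str          # movie identifier
        start_sec: float    # start of the aligned interval within the movie
        end_sec: float      # end of the aligned interval
        source: str         # "DVS" or "script"
        sentence: str       # description of what is shown in the interval

    entry = AlignedSnippet(
        movie="example_movie",  # placeholder identifier
        start_sec=754.2,
        end_sec=758.9,
        source="DVS",
        sentence="She hurries down the rain-soaked street.",
    )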
Movie Description
Audio Description (AD) provides linguistic descriptions of movies and allows
visually impaired people to follow a movie along with their peers. Such
descriptions are by design mainly visual and thus naturally form an interesting
data source for computer vision and computational linguistics. In this work we
propose a novel dataset containing transcribed ADs temporally aligned to
full-length movies. In addition, we also collected and aligned the movie
scripts used in prior work and compare the two sources of descriptions. In
total the Large Scale Movie Description Challenge (LSMDC) contains a parallel
corpus of 118,114 sentences and video clips from 202 movies. First we
characterize the dataset by benchmarking different approaches for generating
video descriptions. Comparing ADs to scripts, we find that ADs are indeed more
visual and describe precisely what is shown rather than what should happen
according to the scripts created prior to movie production. Furthermore, we
present and compare the results of several teams who participated in a
challenge organized in the context of the workshop "Describing and
Understanding Video & The Large Scale Movie Description Challenge (LSMDC)", at
ICCV 2015.
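As an illustration of how generated clip descriptions can be scored against reference AD sentences, below is a minimal Python sketch using BLEU; the metric choice and the example sentences are assumptions for illustration only, not necessarily the challenge's official evaluation protocol.

    # A minimal sketch of scoring a generated clip description against its aligned
    # reference AD sentence with a standard n-gram metric (BLEU here, purely as an
    # illustration; the sentences below are made up for the example).
    from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

    def score_caption(generated: str, reference: str) -> float:
        """BLEU of one generated caption against one reference description."""
        smooth = SmoothingFunction().method1  # avoids zero scores for short captions
        return sentence_bleu([reference.lower().split()],
                             generated.lower().split(),
                             smoothing_function=smooth)

    print(score_caption("A man walks across the empty street.",
                        "Someone walks across the empty street at night."))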
Ambient awareness on a sidewalk for visually impaired
Safe navigation by avoiding obstacles is vital for the visually impaired while walking on a sidewalk. There are both static and dynamic obstacles to avoid. Detecting, monitoring, and estimating the threat posed by obstacles remain challenging. It is also imperative that the system be energy efficient and low cost. An additional challenge in designing an interactive system capable of providing useful feedback is minimizing users' cognitive load. We started the development of the prototype system by classifying obstacles and providing feedback. To overcome the limitations of the classification-based system, we adopted an image annotation framework to describe the scene, which may or may not include obstacles. Both solutions partially addressed safe navigation but were found to be ineffective in providing meaningful feedback and suffered from issues with the diurnal cycle. To address these limitations, we introduce the notion of free-path and the threat level imposed by static or dynamic obstacles. This solution reduced the overhead of obstacle detection and helped in designing meaningful feedback. Affording users a natural conversation through an interactive, dialog-enabled interface was found to promote safer navigation. In this dissertation, we modeled the free-path and threat level using a reinforcement learning (RL) framework. We built the RL model in the Gazebo robot simulation environment and implemented it on a handheld device. A natural conversation model was created using data collected through a Wizard of Oz approach. Together, the RL model and the conversational agent model resulted in a handheld assistive device called the Augmented Guiding Torch (AGT). The AGT provides improved mobility over the white cane by providing ambient awareness through natural conversation. It can inform visually impaired users about obstacles it is helpful to be warned about ahead of time, e.g., a construction site, scooter, crowd, car, bike, or large hole. Using the RL framework, the robot avoided over 95% of obstacles. Visually impaired users avoided over 85% of obstacles with the help of the AGT on a 500-foot U-shaped sidewalk. The findings of this dissertation support the effectiveness of augmented guiding through RL for navigation and obstacle avoidance by visually impaired users.
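A toy Python sketch of how free-path and threat level might be cast as a tabular Q-learning problem is shown below; the state encoding, action set, rewards, and the stand-in environment are all assumptions for illustration and do not reproduce the dissertation's Gazebo setup.

    # Illustrative tabular Q-learning over discretized (free_path, threat) states.
    # All encodings, rewards, and the toy environment are assumptions.
    import random
    from collections import defaultdict

    ACTIONS = ["keep_straight", "veer_left", "veer_right", "stop_and_warn"]

    def simulate_step(state, action):
        """Toy stand-in for the simulator: returns (next_state, reward).
        A real setup would get these from Gazebo or live sensor input."""
        free_path, threat = state
        if threat >= 2 and action == "stop_and_warn":
            return (2, 0), 1.0                      # warning at high threat is rewarded
        if threat == 0 and action == "keep_straight":
            return (min(free_path + 1, 2), 0), 1.0  # clear path: keep walking
        if action in ("veer_left", "veer_right"):
            return (2, max(threat - 1, 0)), 0.2     # veering reduces the threat a bit
        return (free_path, threat), -1.0            # otherwise: collision-risk penalty

    def train(episodes=2000, alpha=0.1, gamma=0.9, eps=0.1):
        """Q-learning over (free_path, threat) states, each discretized to 0..2."""
        Q = defaultdict(float)
        for _ in range(episodes):
            state = (random.randint(0, 2), random.randint(0, 2))
            for _ in range(20):
                if random.random() < eps:
                    action = random.choice(ACTIONS)                      # explore
                else:
                    action = max(ACTIONS, key=lambda a: Q[(state, a)])   # exploit
                nxt, reward = simulate_step(state, action)
                best_next = max(Q[(nxt, a)] for a in ACTIONS)
                Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
                state = nxt
        return Q

    policy = train()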
TS-RGBD Dataset: a Novel Dataset for Theatre Scenes Description for People with Visual Impairments
Computer vision has long been used to help visually impaired people move
around their environment and avoid obstacles and falls. Existing solutions are
limited to either indoor or outdoor scenes, which restricts the kinds of places
and scenes visually disabled people can be in, including entertainment venues
such as theatres. Furthermore, most of the proposed computer-vision-based
methods rely on RGB benchmarks to train their models, resulting in limited
performance due to the absence of the depth modality.
In this paper, we propose a novel RGB-D dataset containing theatre scenes
with ground-truth human actions and dense caption annotations for image
captioning and human action recognition: the TS-RGBD dataset. It includes three
types of data: RGB, depth, and skeleton sequences, captured by Microsoft
Kinect.
We test image captioning models on our dataset, as well as some skeleton-based
human action recognition models, in order to extend the range of environment
types where a visually disabled person can be, by detecting human actions and
textually describing the appearance of regions of interest in theatre scenes.
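For a concrete picture of how a sample with RGB, depth, skeleton, caption, and action annotations could be organized, here is a minimal Python sketch; the field names, array shapes, and joint count are assumptions, not the dataset's released format.

    # A minimal sketch of one TS-RGBD-style sample for the two tasks mentioned
    # (image captioning and skeleton-based action recognition). Field names,
    # array shapes, and the joint count are assumptions, not the released format.
    from dataclasses import dataclass
    from typing import List
    import numpy as np

    @dataclass
    class TheatreSceneSample:
        rgb: np.ndarray       # H x W x 3 colour frame
        depth: np.ndarray     # H x W depth map from the Kinect sensor
        skeleton: np.ndarray  # T x J x 3 joint positions over T frames, J joints
        captions: List[str]   # dense captions for regions of interest
        action_label: str     # ground-truth human action

    sample = TheatreSceneSample(
        rgb=np.zeros((480, 640, 3), dtype=np.uint8),
        depth=np.zeros((480, 640), dtype=np.float32),
        skeleton=np.zeros((30, 25, 3), dtype=np.float32),  # e.g. 25 joints (Kinect v2)
        captions=["an actor stands at the centre of the stage"],
        action_label="standing",
    )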
- …