4,191 research outputs found
YouTube-ASL: A Large-Scale, Open-Domain American Sign Language-English Parallel Corpus
Machine learning for sign languages is bottlenecked by data. In this paper,
we present YouTube-ASL, a large-scale, open-domain corpus of American Sign
Language (ASL) videos and accompanying English captions drawn from YouTube.
With ~1000 hours of videos and >2500 unique signers, YouTube-ASL is ~3x as
large and has ~10x as many unique signers as the largest prior ASL dataset. We
train baseline models for ASL to English translation on YouTube-ASL and
evaluate them on How2Sign, where we achieve a new finetuned state of the art of
12.39 BLEU and, for the first time, report zero-shot results.
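The BLEU score reported above measures n-gram overlap between a system translation and a reference. A minimal, self-contained sketch of the metric, simplified with a crude smoothing term (real evaluations normally use a standard implementation such as sacreBLEU, so treat the exact numbers as illustrative):

```python
import math
from collections import Counter

def ngram_counts(tokens, n):
    """Counts of all n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(hypothesis, reference, max_n=4):
    """Sentence-level BLEU (0-100) with a tiny additive smoothing term."""
    hyp, ref = hypothesis.split(), reference.split()
    log_precisions = []
    for n in range(1, max_n + 1):
        hyp_ngrams = ngram_counts(hyp, n)
        ref_ngrams = ngram_counts(ref, n)
        overlap = sum((hyp_ngrams & ref_ngrams).values())  # clipped matches
        total = max(sum(hyp_ngrams.values()), 1)
        log_precisions.append(math.log((overlap + 1e-9) / total))
    # brevity penalty discourages overly short hypotheses
    bp = 1.0 if len(hyp) >= len(ref) else math.exp(1 - len(ref) / max(len(hyp), 1))
    return 100 * bp * math.exp(sum(log_precisions) / max_n)
```

An identical hypothesis and reference score close to 100, while a hypothesis with no word overlap scores near 0.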
AutoAD II: The Sequel -- Who, When, and What in Movie Audio Description
Audio Description (AD) is the task of generating descriptions of visual
content, at suitable time intervals, for the benefit of visually impaired
audiences. For movies, this presents notable challenges -- AD must occur only
during existing pauses in dialogue, should refer to characters by name, and
ought to aid understanding of the storyline as a whole. To this end, we develop
a new model for automatically generating movie AD, given CLIP visual features
of the frames, the cast list, and the temporal locations of the speech;
addressing all three of the 'who', 'when', and 'what' questions: (i) who -- we
introduce a character bank consisting of the character's name, the actor that
played the part, and a CLIP feature of their face, for the principal cast of
each movie, and demonstrate how this can be used to improve naming in the
generated AD; (ii) when -- we investigate several models for determining
whether an AD should be generated for a time interval or not, based on the
visual content of the interval and its neighbours; and (iii) what -- we
implement a new vision-language model for this task, that can ingest the
proposals from the character bank, whilst conditioning on the visual features
using cross-attention, and demonstrate how this improves over previous
architectures for AD text generation in an apples-to-apples comparison.
Comment: ICCV 2023. Project page: https://www.robots.ox.ac.uk/vgg/research/autoad
Using serious games for learning sign language combining video, enhanced interactivity and VR technology
One in every six people in the UK suffers from hearing loss, either as a condition they were born with or as a disorder they acquired during their life. 900,000 people in the UK are severely or profoundly deaf, and according to a 2013 study by Action On Hearing Loss UK, only 17 percent of this population can use British Sign Language (BSL). That leaves a massive proportion of people with a hearing impediment who do not use sign language struggling in social interaction and suffering from emotional distress, and an even larger proportion of hearing people who cannot communicate with those of the deaf community. This paper presents a serious game (SG) that aims to close the communication gap between hearing people and people with a hearing impediment by providing a tool that facilitates BSL learning, targeting the adult population. The paper presents the theoretical framework supporting adult learning, based on which an SG using Virtual Reality (VR) technology has been developed. It explains the experimental framework of the study and presents the creation of the research instruments to facilitate the study, comprising an SG that integrates video and conventional video-based educational material. It reports and analyses the study results, which demonstrate the advantage of the SG in effectively supporting users learning a set of BSL signs, and it presents qualitative outcomes that inform the further development of the game to serve learning needs. The paper closes with conclusions, directions for further development of this educational resource, and future studies.
Towards Student Engagement Analytics: Applying Machine Learning to Student Posts in Online Lecture Videos
The use of online learning environments in higher education is becoming ever more prevalent with the inception of MOOCs (Massive Open Online Courses) and the increase in online and flipped courses at universities. Although the online systems used to deliver course content make education more accessible, students often express frustration with the lack of assistance during online lecture videos. Instructors express concern that students are not engaging with the course material in online environments, and rely on affordances within these systems to figure out what students are doing. With many online learning environments storing log data about students' usage of these systems, research into learning analytics, that is, the measurement, collection, analysis, and reporting of data about learning and its contexts, can help inform instructors about student learning in the online context.
This thesis aims to lay the groundwork for learning analytics that provide instructors with high-level student engagement data in online learning environments. Recent research has shown that instructors using these systems are concerned about their lack of awareness of student engagement, and educational psychology has shown that engagement is necessary for student success. Specifically, this thesis explores the feasibility of applying machine learning to categorize student posts by their level of engagement. These engagement categories are derived from the ICAP framework, which categorizes overt student behaviors into four tiers of engagement: Interactive, Constructive, Active, and Passive. Contributions include showing which natural language features are most indicative of engagement, exploring whether this machine learning method can be generalized to many courses, and using previous research to develop mockups of what analytics using data from this machine learning method might look like.
TPA-Net: Generate A Dataset for Text to Physics-based Animation
Recent breakthroughs in Vision-Language (V&L) joint research have achieved
remarkable results in various text-driven tasks. High-quality Text-to-video
(T2V), a task that has been long considered mission-impossible, was proven
feasible with reasonably good results in latest works. However, the resulting
videos often have undesired artifacts largely because the system is purely
data-driven and agnostic to the physical laws. To tackle this issue and further
push T2V towards high-level physical realism, we present an autonomous data
generation technique and a dataset, which intend to narrow the gap with a large
number of multi-modal, 3D Text-to-Video/Simulation (T2V/S) data. In the
dataset, we provide high-resolution 3D physical simulations for both solids and
fluids, along with textual descriptions of the physical phenomena. We take
advantage of state-of-the-art physical simulation methods (i) Incremental
Potential Contact (IPC) and (ii) Material Point Method (MPM) to simulate
diverse scenarios, including elastic deformations, material fractures,
collisions, turbulence, etc. Additionally, high-quality, multi-view rendering
videos are supplied for the benefit of T2V, Neural Radiance Fields (NeRF), and
other communities. This work is the first step towards fully automated
Text-to-Video/Simulation (T2V/S). Live examples and subsequent work are at
https://sites.google.com/view/tpa-net
Generative Disco: Text-to-Video Generation for Music Visualization
Visuals are a core part of our experience of music, owing to the way they can
amplify the emotions and messages conveyed through the music. However, creating
music visualization is a complex, time-consuming, and resource-intensive
process. We introduce Generative Disco, a generative AI system that helps
generate music visualizations with large language models and text-to-image
models. Users select intervals of music to visualize and then parameterize that
visualization by defining start and end prompts. These prompts are warped
between and generated according to the beat of the music for audioreactive
video. We introduce design patterns for improving generated videos:
"transitions", which express shifts in color, time, subject, or style, and
"holds", which encourage visual emphasis and consistency. A study with
professionals showed that the system was enjoyable, easy to explore, and highly
expressive. We conclude on use cases of Generative Disco for professionals and
how AI-generated content is changing the landscape of creative work
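The "warping" between a start and an end prompt, paced by the music's beats, can be illustrated with a hypothetical sketch: beat times inside the selected interval are mapped to interpolation weights, and each beat's prompt embedding is a blend of the two endpoint embeddings. The function names and the simple linear blend are assumptions for illustration, not the paper's actual implementation:

```python
import numpy as np

def beat_weights(start_time, end_time, beat_times):
    """Map each beat inside [start_time, end_time] to a 0..1 warp weight."""
    span = end_time - start_time
    return [(t, (t - start_time) / span)
            for t in beat_times if start_time <= t <= end_time]

def warp_embeddings(start_emb, end_emb, weights):
    """Linearly blend the two endpoint prompt embeddings at each beat.
    Real text-to-image systems often interpolate in a model's latent space."""
    return [(t, (1 - a) * start_emb + a * end_emb) for t, a in weights]
```

For a four-second interval with a beat every second, this yields weights 0, 0.25, 0.5, 0.75, 1.0, so the generated frames drift from the start prompt to the end prompt in time with the music.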
Multimodal Indexing of Presentation Videos
This thesis presents four novel methods to help users efficiently and effectively retrieve information from unstructured and unsourced multimedia sources, in particular the increasing amount and variety of presentation videos such as those in e-learning, conference recordings, corporate talks, and student presentations. We demonstrate a system to summarize, index, and cross-reference such videos, and measure the quality of the produced indexes as perceived by the end users. We introduce four major semantic indexing cues: text, speaker faces, graphics, and mosaics, going beyond standard tag-based searches and simple video playback. This work aims at recognizing visual content "in the wild", where the system cannot rely on any additional information besides the video itself. For text, within a scene-text detection and recognition framework, we present a novel locally optimal adaptive binarization algorithm, implemented with integral histograms. It determines an optimal threshold that maximizes the between-class variance within a subwindow, with computational complexity independent of the size of the window itself. We obtain character recognition rates of 74%, as validated against ground truth for 8 presentation videos spanning 1 hour and 45 minutes, which almost doubles the baseline performance of an open-source OCR engine. For speaker faces, we detect, track, match, and finally select a humanly preferred face icon per speaker, based on three quality measures: resolution, amount of skin, and pose. We register 87% agreement (51 out of 58 speakers) between the face indexes automatically generated from three unstructured presentation videos of approximately 45 minutes each, and human preferences recorded through Mechanical Turk experiments.
For diagrams, we locate graphics inside frames showing a projected slide, cluster them with an online algorithm based on a combination of visual and temporal information, and select and color-correct their representatives to match human preferences recorded through Mechanical Turk experiments. We register 71% accuracy (57 out of 81 unique diagrams properly identified, selected, and color-corrected) on three hours of video containing five different presentations. For mosaics, we combine two existing stitching measures to extend video images into a world coordinate system. A set of frames to be registered into a mosaic is sampled according to the PTZ camera movement, which is computed through least-squares estimation starting from the luminance constancy assumption. A local-features-based stitching algorithm is then applied to estimate the homography among a set of video frames, and median blending is used to render pixels in overlapping regions of the mosaic. For two of these indexes, namely faces and diagrams, we present two novel MTurk-derived user data collections to determine viewer preferences, and show that they are matched in selection by our methods. The net result of this thesis allows users to search, inside a video collection as well as within a single video clip, for a segment of a presentation by professor X on topic Y, containing graph Z.
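The binarization idea described above, a window-local threshold maximizing between-class variance (Otsu's criterion) computed from an integral histogram so the per-window cost does not depend on window size, can be sketched as follows. The function names and the 256-bin layout are illustrative assumptions, not the thesis's actual code:

```python
import numpy as np

def integral_histogram(img, bins=256):
    """Integral histogram: ih[y, x, b] = #pixels with value b in img[:y+1, :x+1]."""
    h, w = img.shape
    onehot = np.zeros((h, w, bins), dtype=np.int64)
    onehot[np.arange(h)[:, None], np.arange(w)[None, :], img] = 1
    return onehot.cumsum(axis=0).cumsum(axis=1)

def window_hist(ih, y0, x0, y1, x1):
    """Histogram of img[y0:y1, x0:x1] in O(bins), independent of window size."""
    hist = ih[y1 - 1, x1 - 1].copy()
    if y0 > 0:
        hist -= ih[y0 - 1, x1 - 1]
    if x0 > 0:
        hist -= ih[y1 - 1, x0 - 1]
    if y0 > 0 and x0 > 0:
        hist += ih[y0 - 1, x0 - 1]  # inclusion-exclusion corner
    return hist

def otsu_threshold(hist):
    """Threshold maximizing between-class variance over a window histogram."""
    bins = np.arange(len(hist))
    total = hist.sum()
    w0 = hist.cumsum()                  # pixels at or below each threshold
    w1 = total - w0
    cum_mass = (hist * bins).cumsum()
    mu0 = cum_mass / np.maximum(w0, 1)  # mean of the lower class
    mu1 = (cum_mass[-1] - cum_mass) / np.maximum(w1, 1)
    between = w0.astype(float) * w1 * (mu0 - mu1) ** 2
    return int(between.argmax())
```

Building the integral histogram once lets every subwindow's histogram, and hence its locally optimal threshold, be recovered with a constant number of array operations, which matches the complexity claim in the abstract.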
Exploring a Culture of Learning with Technology: An Ethnographic Content Analysis of the Activity of Learning with Educational iPad Apps
This study explored the culture of learning with educational iPad apps using
activity theory as a guiding framework. First, the top nine educational apps were tracked
in the Top Charts section of Apple’s App Store for a duration of four months. The nine
sampled apps, selected based on their frequency of appearance, included Toca Hair Salon
2, Stack the States, Endless Alphabet, Mickey Mouse Clubhouse: Wildlife Count Along,
Wild Kratts Creature Power World Adventure, Wallykazam! Letter and Word Magic,
Starfall Learn to Read, Dr. Panda’s Restaurant 2, and Bug Art. The descriptions, version
updates, app content, and customer reviews for each app were digitized, coded, and
analyzed in Dedoose using the Activity Checklist. Additionally, instructional analysis
diagrams were developed to provide insight into the user interface and actions. Results of
the study were presented in the form of nine portraits. The overview and relevant
instructional characteristics were detailed for each app. The final chapter examined the
broader implications of the app experience. The technology, the instruction, the adult
guide, and the App Store were identified as mediating factors that contributed to the
dynamic app culture
EBook Exploration: How EBooks Support Emergent Literacy
Abstract
This research study explores how eBooks support young children's emergent literacy development. Specifically, through a home-based qualitative active inquiry, it focuses on what kinds of eBooks and modes are available for young children, how eBooks motivate or engage students to read and write, and how they support students' decoding and comprehension skills. This study took place during hour-long tutoring sessions held twice per week with two elementary-aged siblings in an Upstate New York middle-class home. The collected data included informal and field notes, student artifacts, comprehension conversations, and student interviews. One student enjoyed reading the eBooks and was motivated by them, while the other preferred reading paper books and was not motivated by the eBooks.
It was found that some features of eBooks supported students' decoding and comprehension, while some modes of eBooks did not. Pre-teaching of eReader features and previewing the eBook helped the students comprehend the stories. Student comprehension was aided by the narration features of the eReaders; however, animations in TumbleBooks interfered with one student's comprehension. Use of the Table of Contents and picture cues also contributed to their understanding of eBooks. Finding an eBook at Student One's reading level was challenging. Both students lost track of the words on the page at times. Technological issues interfered with book reading several times. The Read to Me narration options helped both students with word decoding, especially the beginning reader. More research is needed on how eBooks support students' decoding and on how beneficial the narration features of eBooks are to beginning readers.