4,191 research outputs found
YouTube-ASL: A Large-Scale, Open-Domain American Sign Language-English Parallel Corpus
Machine learning for sign languages is bottlenecked by data. In this paper,
we present YouTube-ASL, a large-scale, open-domain corpus of American Sign
Language (ASL) videos and accompanying English captions drawn from YouTube.
With ~1000 hours of videos and >2500 unique signers, YouTube-ASL is ~3x as
large and has ~10x as many unique signers as the largest prior ASL dataset. We
train baseline models for ASL to English translation on YouTube-ASL and
evaluate them on How2Sign, where we achieve a new finetuned state of the art of
12.39 BLEU and, for the first time, report zero-shot results.
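The BLEU score reported above measures n-gram overlap between a system translation and a reference. A minimal, self-contained sketch of the metric, simplified with a crude smoothing term (real evaluations normally use a standard implementation such as sacreBLEU, so treat the exact numbers as illustrative):

```python
import math
from collections import Counter

def ngram_counts(tokens, n):
    """Counts of all n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(hypothesis, reference, max_n=4):
    """Sentence-level BLEU (0-100) with a tiny additive smoothing term."""
    hyp, ref = hypothesis.split(), reference.split()
    log_precisions = []
    for n in range(1, max_n + 1):
        hyp_ngrams = ngram_counts(hyp, n)
        ref_ngrams = ngram_counts(ref, n)
        overlap = sum((hyp_ngrams & ref_ngrams).values())  # clipped matches
        total = max(sum(hyp_ngrams.values()), 1)
        log_precisions.append(math.log((overlap + 1e-9) / total))
    # brevity penalty discourages overly short hypotheses
    bp = 1.0 if len(hyp) >= len(ref) else math.exp(1 - len(ref) / max(len(hyp), 1))
    return 100 * bp * math.exp(sum(log_precisions) / max_n)
```

An identical hypothesis and reference score close to 100, while a hypothesis with no word overlap scores near 0.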
AutoAD II: The Sequel -- Who, When, and What in Movie Audio Description
Audio Description (AD) is the task of generating descriptions of visual
content, at suitable time intervals, for the benefit of visually impaired
audiences. For movies, this presents notable challenges -- AD must occur only
during existing pauses in dialogue, should refer to characters by name, and
ought to aid understanding of the storyline as a whole. To this end, we develop
a new model for automatically generating movie AD, given CLIP visual features
of the frames, the cast list, and the temporal locations of the speech;
addressing all three of the 'who', 'when', and 'what' questions: (i) who -- we
introduce a character bank consisting of the character's name, the actor that
played the part, and a CLIP feature of their face, for the principal cast of
each movie, and demonstrate how this can be used to improve naming in the
generated AD; (ii) when -- we investigate several models for determining
whether an AD should be generated for a time interval or not, based on the
visual content of the interval and its neighbours; and (iii) what -- we
implement a new vision-language model for this task, that can ingest the
proposals from the character bank, whilst conditioning on the visual features
using cross-attention, and demonstrate how this improves over previous
architectures for AD text generation in an apples-to-apples comparison.
Comment: ICCV 2023. Project page: https://www.robots.ox.ac.uk/vgg/research/autoad
Using serious games for learning sign language combining video, enhanced interactivity and VR technology
One in every six people in the UK suffers from hearing loss, either as a condition they were born with or as a disorder they acquired during their life. 900,000 people in the UK are severely or profoundly deaf, and according to a 2013 study by Action On Hearing Loss UK, only 17 percent of this population can use British Sign Language (BSL). That leaves a massive proportion of people with a hearing impediment who do not use sign language struggling in social interaction and suffering from emotional distress, and an even larger proportion of hearing people who cannot communicate with those of the deaf community. This paper presents a serious game (SG) that aims to close the communication gap between hearing people and people with a hearing impediment by providing a tool that facilitates BSL learning, targeting the adult population. The paper presents the theoretical framework supporting adult learning, based on which an SG using Virtual Reality (VR) technology has been developed. It explains the experimental framework of the study and presents the creation of the research instruments to facilitate the study, comprising an SG that integrates video and conventional video-based educational material. It reports and analyses the study results, which demonstrate the advantage of the SG in effectively supporting users learning a set of BSL signs, and it presents qualitative outcomes that inform the further development of the game to serve learning needs. The paper closes with conclusions, directions for further development of this educational resource, and future studies.
Towards Student Engagement Analytics: Applying Machine Learning to Student Posts in Online Lecture Videos
The use of online learning environments in higher education is becoming ever more prevalent with the inception of MOOCs (Massive Open Online Courses) and the increase in online and flipped courses at universities. Although the online systems used to deliver course content make education more accessible, students often express frustration with the lack of assistance during online lecture videos. Instructors express concern that students are not engaging with the course material in online environments, and rely on affordances within these systems to figure out what students are doing. With many online learning environments storing log data about students' usage of these systems, research into learning analytics, that is, the measurement, collection, analysis, and reporting of data about learning and its contexts, can help inform instructors about student learning in the online context.
This thesis aims to lay the groundwork for learning analytics that provide instructors with high-level student engagement data in online learning environments. Recent research has shown that instructors using these systems are concerned about their lack of awareness of student engagement, and educational psychology has shown that engagement is necessary for student success. Specifically, this thesis explores the feasibility of applying machine learning to categorize student posts by their level of engagement. These engagement categories are derived from the ICAP framework, which categorizes overt student behaviors into four tiers of engagement: Interactive, Constructive, Active, and Passive. Contributions include showing which natural language features are most indicative of engagement, exploring whether this machine learning method can be generalized to many courses, and using previous research to develop mockups of what analytics using data from this machine learning method might look like.
TPA-Net: Generate A Dataset for Text to Physics-based Animation
Recent breakthroughs in Vision-Language (V&L) joint research have achieved
remarkable results in various text-driven tasks. High-quality Text-to-video
(T2V), a task that has been long considered mission-impossible, was proven
feasible with reasonably good results in latest works. However, the resulting
videos often have undesired artifacts largely because the system is purely
data-driven and agnostic to the physical laws. To tackle this issue and further
push T2V towards high-level physical realism, we present an autonomous data
generation technique and a dataset, which intend to narrow the gap with a large
number of multi-modal, 3D Text-to-Video/Simulation (T2V/S) data. In the
dataset, we provide high-resolution 3D physical simulations for both solids and
fluids, along with textual descriptions of the physical phenomena. We take
advantage of state-of-the-art physical simulation methods (i) Incremental
Potential Contact (IPC) and (ii) Material Point Method (MPM) to simulate
diverse scenarios, including elastic deformations, material fractures,
collisions, turbulence, etc. Additionally, high-quality, multi-view rendering
videos are supplied for the benefit of T2V, Neural Radiance Fields (NeRF), and
other communities. This work is the first step towards fully automated
Text-to-Video/Simulation (T2V/S). Live examples and subsequent work are at
https://sites.google.com/view/tpa-net
Generative Disco: Text-to-Video Generation for Music Visualization
Visuals are a core part of our experience of music, owing to the way they can
amplify the emotions and messages conveyed through the music. However, creating
music visualization is a complex, time-consuming, and resource-intensive
process. We introduce Generative Disco, a generative AI system that helps
generate music visualizations with large language models and text-to-image
models. Users select intervals of music to visualize and then parameterize that
visualization by defining start and end prompts. These prompts are warped
between and generated according to the beat of the music for audioreactive
video. We introduce design patterns for improving generated videos:
"transitions", which express shifts in color, time, subject, or style, and
"holds", which encourage visual emphasis and consistency. A study with
professionals showed that the system was enjoyable, easy to explore, and highly
expressive. We conclude on use cases of Generative Disco for professionals and
how AI-generated content is changing the landscape of creative work
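The "warping" between a start and an end prompt, paced by the music's beats, can be illustrated with a hypothetical sketch: beat times inside the selected interval are mapped to interpolation weights, and each beat's prompt embedding is a blend of the two endpoint embeddings. The function names and the simple linear blend are assumptions for illustration, not the paper's actual implementation:

```python
import numpy as np

def beat_weights(start_time, end_time, beat_times):
    """Map each beat inside [start_time, end_time] to a 0..1 warp weight."""
    span = end_time - start_time
    return [(t, (t - start_time) / span)
            for t in beat_times if start_time <= t <= end_time]

def warp_embeddings(start_emb, end_emb, weights):
    """Linearly blend the two endpoint prompt embeddings at each beat.
    Real text-to-image systems often interpolate in a model's latent space."""
    return [(t, (1 - a) * start_emb + a * end_emb) for t, a in weights]
```

For a four-second interval with a beat every second, this yields weights 0, 0.25, 0.5, 0.75, 1.0, so the generated frames drift from the start prompt to the end prompt in time with the music.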
Multimodal Indexing of Presentation Videos
This thesis presents four novel methods to help users efficiently and effectively retrieve information from unstructured and unsourced multimedia sources, in particular the increasing amount and variety of presentation videos such as those in e-learning, conference recordings, corporate talks, and student presentations. We demonstrate a system to summarize, index, and cross-reference such videos, and measure the quality of the produced indexes as perceived by the end users. We introduce four major semantic indexing cues: text, speaker faces, graphics, and mosaics, going beyond standard tag-based searches and simple video playback. This work aims at recognizing visual content "in the wild", where the system cannot rely on any additional information besides the video itself. For text, within a scene-text detection and recognition framework, we present a novel locally optimal adaptive binarization algorithm, implemented with integral histograms. It determines an optimal threshold that maximizes the between-class variance within a subwindow, with computational complexity independent of the size of the window itself. We obtain character recognition rates of 74%, as validated against ground truth for 8 presentation videos spanning 1 hour and 45 minutes, which almost doubles the baseline performance of an open-source OCR engine. For speaker faces, we detect, track, match, and finally select a humanly preferred face icon per speaker, based on three quality measures: resolution, amount of skin, and pose. We register 87% agreement (51 out of 58 speakers) between the face indexes automatically generated from three unstructured presentation videos of approximately 45 minutes each, and human preferences recorded through Mechanical Turk experiments.
For diagrams, we locate graphics inside frames showing a projected slide, cluster them with an online algorithm based on a combination of visual and temporal information, and select and color-correct their representatives to match human preferences recorded through Mechanical Turk experiments. We register 71% accuracy (57 out of 81 unique diagrams properly identified, selected, and color-corrected) on three hours of video containing five different presentations. For mosaics, we combine two existing stitching measures to extend video images into a world coordinate system. A set of frames to be registered into a mosaic is sampled according to the PTZ camera movement, which is computed through least-squares estimation starting from the luminance constancy assumption. A local-features-based stitching algorithm is then applied to estimate the homography among a set of video frames, and median blending is used to render pixels in overlapping regions of the mosaic. For two of these indexes, namely faces and diagrams, we present two novel MTurk-derived user data collections to determine viewer preferences, and show that they are matched in selection by our methods. The net result of this thesis allows users to search, inside a video collection as well as within a single video clip, for a segment of a presentation by professor X on topic Y, containing graph Z.
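The binarization idea described above, a window-local threshold maximizing between-class variance (Otsu's criterion) computed from an integral histogram so the per-window cost does not depend on window size, can be sketched as follows. The function names and the 256-bin layout are illustrative assumptions, not the thesis's actual code:

```python
import numpy as np

def integral_histogram(img, bins=256):
    """Integral histogram: ih[y, x, b] = #pixels with value b in img[:y+1, :x+1]."""
    h, w = img.shape
    onehot = np.zeros((h, w, bins), dtype=np.int64)
    onehot[np.arange(h)[:, None], np.arange(w)[None, :], img] = 1
    return onehot.cumsum(axis=0).cumsum(axis=1)

def window_hist(ih, y0, x0, y1, x1):
    """Histogram of img[y0:y1, x0:x1] in O(bins), independent of window size."""
    hist = ih[y1 - 1, x1 - 1].copy()
    if y0 > 0:
        hist -= ih[y0 - 1, x1 - 1]
    if x0 > 0:
        hist -= ih[y1 - 1, x0 - 1]
    if y0 > 0 and x0 > 0:
        hist += ih[y0 - 1, x0 - 1]  # inclusion-exclusion corner
    return hist

def otsu_threshold(hist):
    """Threshold maximizing between-class variance over a window histogram."""
    bins = np.arange(len(hist))
    total = hist.sum()
    w0 = hist.cumsum()                  # pixels at or below each threshold
    w1 = total - w0
    cum_mass = (hist * bins).cumsum()
    mu0 = cum_mass / np.maximum(w0, 1)  # mean of the lower class
    mu1 = (cum_mass[-1] - cum_mass) / np.maximum(w1, 1)
    between = w0.astype(float) * w1 * (mu0 - mu1) ** 2
    return int(between.argmax())
```

Building the integral histogram once lets every subwindow's histogram, and hence its locally optimal threshold, be recovered with a constant number of array operations, which matches the complexity claim in the abstract.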
Exploring a Culture of Learning with Technology: An Ethnographic Content Analysis of the Activity of Learning with Educational iPad Apps
This study explored the culture of learning with educational iPad apps using
activity theory as a guiding framework. First, the top nine educational apps were tracked
in the Top Charts section of Apple’s App Store for a duration of four months. The nine
sampled apps, selected based on their frequency of appearance, included Toca Hair Salon
2, Stack the States, Endless Alphabet, Mickey Mouse Clubhouse: Wildlife Count Along,
Wild Kratts Creature Power World Adventure, Wallykazam! Letter and Word Magic,
Starfall Learn to Read, Dr. Panda’s Restaurant 2, and Bug Art. The descriptions, version
updates, app content, and customer reviews for each app were digitized, coded, and
analyzed in Dedoose using the Activity Checklist. Additionally, instructional analysis
diagrams were developed to provide insight into the user interface and actions. Results of
the study were presented in the form of nine portraits. The overview and relevant
instructional characteristics were detailed for each app. The final chapter examined the
broader implications of the app experience. The technology, the instruction, the adult
guide, and the App Store were identified as mediating factors that contributed to the
dynamic app culture
EBook Exploration: How EBooks Support Emergent Literacy
Abstract
This research study explores how eBooks support young children's emergent literacy development. Specifically, through a home-based qualitative active inquiry, it focuses on what kinds of eBooks and modes are available for young children, how eBooks motivate or engage students to read and write, and how they support students' decoding and comprehension skills. This study took place during hour-long tutoring sessions held twice per week with two elementary-aged siblings in an Upstate New York middle-class home. The collected data included informal and field notes, student artifacts, comprehension conversations, and student interviews. One student enjoyed reading the eBooks and was motivated by them, while the other preferred reading paper books and was not motivated by the eBooks.
It was found that some features of eBooks supported students' decoding and comprehension, while some modes of eBooks did not. Pre-teaching of eReader features and previewing the eBook helped the students comprehend the stories. Student comprehension was aided by the narration features of the eReaders; however, animations in TumbleBooks interfered with one student's comprehension. Use of the Table of Contents and picture cues also contributed to their understanding of eBooks. Finding an eBook at Student One's reading level was challenging. Both students lost track of the words on the page at times. Technological issues interfered with book reading several times. The Read to Me narration options helped both students with word decoding, especially the beginning reader. More research is needed on how eBooks support students' decoding and on how beneficial the narration features of eBooks are to beginning readers.