
    audialText: Improving communication accessibility for the deaf through automatic voice-recognition and wearable smart-technology.

    Whether you are ordering food at a restaurant, asking for directions, or receiving a phone call from a family member, it is apparent that human communication is an important part of everyday life. Those who are deaf have limited communication accessibility compared to their hearing counterparts and, by default, obtain less public information and face more obstacles during social interactions. This thesis project will attempt to bridge this communication gap through the exploration of human interactions with user interface (UI) and user experience (UX) design. The goal is to design and develop an application concept for wearable smart-technology that will utilize voice-recognition software to improve common communication interactions for the deaf. It will also play a role in improving incidental learning, literacy, and language comprehension for the deaf. This research will validate the need for increased accessibility, study human interactions, explore existing applications, and visualize potential technological solutions. It will also explore the language and literacy development of deaf individuals. It will be user-centered in its approach, using polls and surveys to help drive certain aspects of the application’s concept, user experience, and features. As a result of the research discoveries, an application concept will be designed strategically, developed conceptually, communicated visually, and finally prototyped on a digital platform in the form of a motion graphic.

    Word Importance Modeling to Enhance Captions Generated by Automatic Speech Recognition for Deaf and Hard of Hearing Users

    People who are deaf or hard-of-hearing (DHH) benefit from sign-language interpreting or live captioning (with a human transcriptionist) to access spoken information. However, such services are not legally required, affordable, or available in many settings, e.g., impromptu small-group meetings in the workplace or online video content that has not been professionally captioned. As Automatic Speech Recognition (ASR) systems improve in accuracy and speed, it is natural to investigate the use of these systems to assist DHH users in a variety of tasks. But ASR systems are still not perfect, especially in realistic conversational settings, which raises issues of trust and acceptance of these systems within the DHH community. To overcome these challenges, our work focuses on: (1) building metrics for accurately evaluating the quality of automatic captioning systems, and (2) designing interventions for improving the usability of captions for DHH users. The first part of this dissertation describes our research on methods for identifying words that are important for understanding the meaning of a conversational turn within transcripts of spoken dialogue. Such knowledge about the relative importance of words in spoken messages can be used in evaluating ASR systems (in part 2 of this dissertation) or creating new applications for DHH users of captioned video (in part 3 of this dissertation). We found that models which consider both the acoustic properties of spoken words and text-based features (e.g., pre-trained word embeddings) are more effective at predicting the semantic importance of a word than models that utilize only one of these types of features. The second part of this dissertation describes studies to understand DHH users' perception of the quality of ASR-generated captions; the goal of this work was to validate the design of automatic metrics for evaluating captions in real-time applications for these users. Such a metric could facilitate comparison of various ASR systems and help determine the suitability of specific ASR systems for supporting communication for DHH users. We designed experimental studies to elicit feedback on the quality of captions from DHH users, and we developed and evaluated automatic metrics for predicting the usability of automatically generated captions for these users. We found that metrics that consider the importance of each word in a text are more effective at predicting the usability of imperfect text captions than the traditional Word Error Rate (WER) metric. The final part of this dissertation describes research on importance-based highlighting of words in captions as a way to enhance the usability of captions for DHH users. Similar to highlighting in static texts (e.g., textbooks or electronic documents), highlighting in captions involves changing the appearance of some text in a caption so that readers can attend to the most important bits of information quickly. Despite the known benefits of highlighting in static texts, the usefulness of highlighting in captions for DHH users remains largely unexplored. For this reason, we conducted experimental studies with DHH participants to understand the benefits of importance-based highlighting in captions and their preferences regarding different design configurations for highlighting. We found that DHH users subjectively preferred highlighting in captions, and they reported higher readability and understandability scores and lower task-load scores when viewing videos with captions containing highlighting compared to videos without highlighting. Further, in partial contrast to recommendations in prior research on highlighting in static texts (which had not been based on experimental studies with DHH users), we found that DHH participants preferred boldface, word-level, non-repeating highlighting in captions.
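    The abstract does not reproduce the dissertation's metrics, but the core idea of weighting caption errors by word importance can be sketched briefly. The alignment method, importance scores, and weighting scheme below are illustrative assumptions in Python rather than the authors' actual models or formulas; the sketch only contrasts plain WER, where every word counts equally, with a metric in which errors on semantically important words cost more.

```python
# Sketch: plain WER vs. an importance-weighted error rate (illustrative only).
from difflib import SequenceMatcher

def align_errors(ref_words, hyp_words):
    """Approximately align reference and hypothesis words; return the total error
    count and the indices of reference words that were substituted or deleted."""
    errors, missed = 0, []
    for tag, i1, i2, j1, j2 in SequenceMatcher(None, ref_words, hyp_words).get_opcodes():
        if tag == "replace":
            errors += max(i2 - i1, j2 - j1)
            missed.extend(range(i1, i2))
        elif tag == "delete":
            errors += i2 - i1
            missed.extend(range(i1, i2))
        elif tag == "insert":
            errors += j2 - j1
    return errors, missed

def wer(ref_words, hyp_words):
    """Plain word error rate: every word counts the same."""
    errors, _ = align_errors(ref_words, hyp_words)
    return errors / max(len(ref_words), 1)

def importance_weighted_error(ref_words, hyp_words, importance):
    """Errors on high-importance words count more (hypothetical weighting)."""
    _, missed = align_errors(ref_words, hyp_words)
    total = sum(importance.get(w, 0.5) for w in ref_words)
    lost = sum(importance.get(ref_words[i], 0.5) for i in missed)
    return lost / max(total, 1e-9)

if __name__ == "__main__":
    ref = "the meeting moved to tuesday afternoon".split()
    hyp = "the meeting moved to choose day afternoon".split()
    importance = {"tuesday": 1.0, "meeting": 0.9, "afternoon": 0.7, "moved": 0.6}
    print(f"WER: {wer(ref, hyp):.2f}")
    print(f"importance-weighted error: {importance_weighted_error(ref, hyp, importance):.2f}")
```

    In this toy example both metrics flag the same misrecognition, but the weighted metric penalizes it heavily because the lost word ("tuesday") carries most of the utterance's meaning, which is the intuition behind preferring importance-aware metrics over WER for DHH caption users.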

    Introducing Handwriting into a Multimodal LaTeX Formula Editor

    Handwriting has been shown to be a useful input modality for math. However, math recognizers are imperfect, especially when recognizing complex expressions. Instead of improving the recognizer itself, we explore ways to best visualize the recognizer's output to help the user fix recognition mistakes more efficiently. To do this, we propose changes to the visual editing operations in MathDeck, a math-aware search engine and formula editor, as well as the addition of an n-best list of results for each symbol in the recognizer's output. We present two experiments: one to identify effective ways for users to fix recognition errors, and one to test whether these changes help novices input formulas more efficiently than they would without handwriting as an input modality. In the first experiment, users had the option to fix errors with an in-place drop-down menu of alternate symbols, a side symbol-correction panel, or by typing the symbols themselves or dragging them from a symbol palette. Most users preferred to fix the errors manually by typing the correct symbols or using the symbol palette. In the second experiment, participants entered formulas using handwriting and/or LaTeX. We found evidence suggesting that novices can input formulas faster when they have access to handwriting, but experts still do better when they can simply type LaTeX.
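    The paper's MathDeck interface is not shown here, but the idea of attaching an n-best list of alternate symbols to each element of the recognizer's output, and letting the user swap in a correction, can be sketched in a few lines. The RecognizedSymbol class, the scores, and the example expression below are hypothetical illustrations, not MathDeck's actual data structures or API.

```python
# Sketch: per-symbol n-best alternates for correcting recognizer output (illustrative).
from dataclasses import dataclass, field

@dataclass
class RecognizedSymbol:
    """One recognized symbol plus its ranked alternates (the n-best list)."""
    best: str
    alternates: list[tuple[str, float]] = field(default_factory=list)  # (latex, score)

    def correct_to(self, latex: str) -> None:
        """Apply a user's correction, e.g. a symbol picked from a drop-down of alternates."""
        self.best = latex

def to_latex(symbols: list[RecognizedSymbol]) -> str:
    """Render the current best hypothesis for every symbol as a LaTeX string."""
    return " ".join(s.best for s in symbols)

if __name__ == "__main__":
    # Suppose the recognizer misread "x^2" as "x^z"; the true symbol is an alternate.
    expr = [
        RecognizedSymbol("x"),
        RecognizedSymbol("^"),
        RecognizedSymbol("z", alternates=[("2", 0.41), ("z", 0.38), ("7", 0.12)]),
    ]
    print(to_latex(expr))      # x ^ z
    expr[2].correct_to("2")    # user picks "2" from the n-best drop-down
    print(to_latex(expr))      # x ^ 2
```

    Exposing the ranked alternates is what makes an in-place drop-down or a side correction panel possible: the user chooses among candidates the recognizer already considered rather than retyping from scratch.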

    Foundations and Recent Trends in Multimodal Machine Learning: Principles, Challenges, and Open Questions

    Multimodal machine learning is a vibrant multi-disciplinary research field that aims to design computer agents with intelligent capabilities such as understanding, reasoning, and learning through integrating multiple communicative modalities, including linguistic, acoustic, visual, tactile, and physiological messages. With the recent interest in video understanding, embodied autonomous agents, text-to-image generation, and multisensor fusion in application domains such as healthcare and robotics, multimodal machine learning has brought unique computational and theoretical challenges to the machine learning community, given the heterogeneity of data sources and the interconnections often found between modalities. However, the breadth of progress in multimodal research has made it difficult to identify the common themes and open questions in the field. By synthesizing a broad range of application domains and theoretical frameworks from both historical and recent perspectives, this paper provides an overview of the computational and theoretical foundations of multimodal machine learning. We start by defining two key principles, modality heterogeneity and interconnections, that have driven subsequent innovations, and propose a taxonomy of six core technical challenges: representation, alignment, reasoning, generation, transference, and quantification, covering historical and recent trends. Recent technical achievements are presented through the lens of this taxonomy, allowing researchers to understand the similarities and differences across new approaches. We end by motivating several open problems for future research as identified by our taxonomy.
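    As a concrete illustration of the survey's "representation" challenge, the sketch below contrasts two textbook ways of combining modalities. It is a toy example with random features and an untrained scorer, not a method from the paper; the feature dimensions and the toy_classifier helper are assumptions made only for illustration.

```python
# Sketch: early vs. late fusion of two modalities (toy example, illustrative only).
import numpy as np

rng = np.random.default_rng(0)

def toy_classifier(dim: int):
    """A random linear scorer standing in for a trained per-modality or joint model."""
    w = rng.normal(size=dim)
    return lambda x: float(1 / (1 + np.exp(-x @ w)))

# Per-modality feature vectors (e.g., an acoustic embedding and a visual embedding).
audio_feat = rng.normal(size=128)
visual_feat = rng.normal(size=512)

# Early fusion: concatenate the features, then apply one joint model.
early_model = toy_classifier(128 + 512)
early_score = early_model(np.concatenate([audio_feat, visual_feat]))

# Late fusion: score each modality separately, then combine the predictions.
audio_model, visual_model = toy_classifier(128), toy_classifier(512)
late_score = 0.5 * audio_model(audio_feat) + 0.5 * visual_model(visual_feat)

print(f"early fusion score: {early_score:.3f}, late fusion score: {late_score:.3f}")
```

    Early fusion lets the joint model exploit interactions between modalities, while late fusion keeps per-modality models simple and robust to a missing channel; much of the representation and transference literature the survey organizes sits between these two extremes.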

    The Effects of Audiovisual Input on Second Language Learning: A Meta-Analysis

    This meta-analysis investigates the contributions of viewing audiovisual input to second language (L2) learning. We calculated 75 effect sizes from 56 experiments (n = 1954). We assessed the effects of audiovisual input on language learning using a within-group (pre-post) meta-analytic approach and examined the extent to which fifteen moderator variables influenced the results. Several methodologically and pedagogically relevant results were found: a) there was a medium effect of audiovisual input on L2 learning (g = 1.01); b) no differences were found between the effects of viewing audiovisual input on different areas of L2 learning (vocabulary, grammar, pronunciation, speaking, and listening proficiency); and c) video category had a significant impact on L2 learning, with entertainment-focused videos (e.g., TV series, movies, and mixed videos) yielding lower effects than educational videos (e.g., TED Talks, documentaries, and language-focused videos). These findings, along with future research directions for L2 learning through audiovisual input, are discussed.
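    To make the effect-size language concrete, the sketch below computes one common form of a within-group (pre-post) standardized mean difference with Hedges' small-sample correction. The meta-analysis's exact computation (for example, how the pre-post correlation or the choice of standardizer is handled) is not specified in this abstract, and the numbers in the example are made up.

```python
# Sketch: a within-group (pre-post) Hedges' g, one common formulation (illustrative).
def hedges_g_within(pre_mean, post_mean, pre_sd, n):
    """Standardize the pre-to-post gain by the pretest SD, then apply
    Hedges' small-sample bias correction."""
    d = (post_mean - pre_mean) / pre_sd        # standardized gain (Cohen's d)
    j = 1 - 3 / (4 * (n - 1) - 1)              # bias-correction factor
    return j * d

if __name__ == "__main__":
    # Hypothetical vocabulary pre/post-test results from one viewing study (made-up numbers).
    g = hedges_g_within(pre_mean=12.0, post_mean=18.5, pre_sd=6.4, n=30)
    print(f"g = {g:.2f}")
```

    Note that within-group gains are interpreted against field-specific benchmarks (which is why g = 1.01 can be described as a medium effect here) rather than the conventional between-group thresholds.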

    Language and Perceptual Categorization in Computational Visual Recognition

    Computational visual recognition, or giving computers the ability to understand images as well as humans do, is a core problem in Computer Vision. Traditional recognition systems often describe visual content by producing a set of isolated labels, object locations, or even by trying to annotate every pixel in an image with a category. People instead describe the visual world using language. The rich, visually descriptive language produced by people incorporates information from human intuition, world knowledge, visual saliency, and common sense that goes beyond detecting individual visual concepts like objects, attributes, or scenes. Moreover, due to the rising popularity of social media, there exist billions of images with associated text on the web, yet systems that can leverage this type of annotation or try to connect language and vision are scarce. In this dissertation, we propose new approaches that explore the connections between language and vision at several levels of detail by combining techniques from Computer Vision and Natural Language Understanding. We first present a data-driven technique for understanding and generating image descriptions using natural language, including automatically collecting a large-scale dataset of images with visually descriptive captions. Then we introduce a system for retrieving short, visually descriptive phrases for describing some part or aspect of an image, and a simple technique to generate full image descriptions by stitching short phrases together. Next we introduce an approach for collecting and generating referring expressions for objects in natural scenes at a much larger scale than previous studies. Finally, we describe methods for learning how to name objects by using intuitions from perceptual categorization related to basic-level and entry-level categories. The main contribution of this thesis is in advancing our knowledge of how to leverage language and intuitions from human perception to create visual recognition systems that can better learn from and communicate with people.
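    The dissertation learns entry-level names from lexical resources and web-scale caption text; the toy sketch below only illustrates the underlying intuition of preferring the name people would actually use for a fine-grained recognition label. The mapping table, frequency values, and function names are invented for illustration and are not the thesis's models or data.

```python
# Sketch: picking an entry-level name for a fine-grained label (toy, illustrative only).
fine_to_candidate = {
    "grampus griseus": "dolphin",
    "king penguin": "penguin",
    "yellow cab": "taxi",
}

# Hypothetical "how often people say this" scores (e.g., from caption statistics).
naming_frequency = {"dolphin": 5.2, "grampus griseus": 0.3, "penguin": 5.0,
                    "taxi": 5.6, "yellow cab": 1.1}

def entry_level_name(fine_label: str) -> str:
    """Return whichever of the fine-grained label or its mapped name people use more often."""
    candidate = fine_to_candidate.get(fine_label, fine_label)
    return max({fine_label, candidate}, key=lambda w: naming_frequency.get(w, 0.0))

if __name__ == "__main__":
    print(entry_level_name("grampus griseus"))  # dolphin
    print(entry_level_name("taxi"))             # taxi (already an entry-level name)
```

    The point of the example is simply that a recognizer's most specific correct label is often not the label a person would produce, which is why naming models benefit from frequency and hypernym information rather than raw category labels.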

    Salient stills

    Thesis (M.S.)--Massachusetts Institute of Technology, Dept. of Architecture, 1992. Includes bibliographical references (leaves 67-70). By Laura A. Teodosio.

    Multi-modal surrogates for retrieving and making sense of videos: is synchronization between the multiple modalities optimal?

    Video surrogates can help people quickly make sense of the content of a video before downloading it or seeking more detailed information. Visual and audio features of a video are primary information carriers and might become important components of video retrieval and video sense-making. Over the past decades, most research and development efforts on video surrogates have focused on visual features of the video, and comparatively little work has been done on audio surrogates and their pros and cons in aiding users' retrieval and sense-making of digital videos. Even less work has been done on multi-modal surrogates, in which more than one modality is employed for consuming the surrogates, for example, the audio and visual modalities. This research examined the effectiveness of a number of multi-modal surrogates and investigated whether synchronization between the audio and visual channels is optimal. A user study was conducted to evaluate six different surrogates on a set of six recognition and inference tasks to answer two main research questions: (1) How do automatically-generated multi-modal surrogates compare to manually-generated ones in video retrieval and video sense-making? and (2) Does synchronization between multiple surrogate channels enhance or inhibit video retrieval and video sense-making? Forty-eight participants took part in the study, in which the surrogates were measured on the time participants spent experiencing the surrogates, the time participants spent on the tasks, participants' performance accuracy on the tasks, participants' confidence in their task responses, and participants' subjective ratings of the surrogates. On average, the uncoordinated surrogates were more helpful than the coordinated ones, and the manually-generated surrogates were more helpful than the automatically-generated ones only in terms of task completion time. Participants' subjective ratings were more favorable for the coordinated surrogate C2 (Magic A + V) and the uncoordinated surrogate U1 (Magic A + Storyboard V) with respect to usefulness, usability, enjoyment, and engagement. The post-session questionnaire comments demonstrated participants' preference for the coordinated surrogates, but the comments also revealed the value of having uncoordinated sensory channels.