
    A survey on mouth modeling and analysis for Sign Language recognition

    © 2015 IEEE. Around 70 million Deaf people worldwide use Sign Languages (SLs) as their native languages. At the same time, they often have limited reading/writing skills in the spoken language. This puts them at a severe disadvantage in many contexts, including education, work, and the use of computers and the Internet. Automatic Sign Language Recognition (ASLR) can support the Deaf in many ways, e.g. by enabling the development of systems for Human-Computer Interaction in SL and for translation between signed and spoken language. Research in ASLR usually revolves around the automatic understanding of manual signs. Recently, the ASLR research community has started to appreciate the importance of non-manuals, since they are related to the lexical meaning of a sign, its syntax and its prosody. Non-manuals include body and head pose, movement of the eyebrows and the eyes, as well as blinks and squints. Arguably, the mouth is one of the most involved parts of the face in non-manuals. Mouth actions related to ASLR can be either mouthings, i.e. visual syllables produced with the mouth while signing, or non-verbal mouth gestures. Both are very important in ASLR. In this paper, we present the first survey on mouth non-manuals in ASLR. We start by showing why mouth motion is important in SL and reviewing the relevant techniques that exist within ASLR. Since limited research has been conducted on the automatic analysis of mouth motion in the context of ASLR, we proceed by surveying relevant techniques from the areas of automatic mouth expression and visual speech recognition which can be applied to the task. Finally, we conclude by presenting the challenges and potentials of the automatic analysis of mouth motion in the context of ASLR.

    Detection of major ASL sign types in continuous signing for ASL recognition

    In American Sign Language (ASL), as well as in other signed languages, different classes of signs (e.g., lexical signs, fingerspelled signs, and classifier constructions) have different internal structural properties. Continuous sign recognition accuracy can be improved through the use of distinct recognition strategies, as well as different training datasets, for each class of signs. For these strategies to be applied, continuous signing video needs to be segmented into parts corresponding to particular classes of signs. In this paper, we present a multiple instance learning-based segmentation system that accurately labels 91.27% of the video frames of 500 continuous utterances (including 7 different subjects) from the publicly accessible NCSLGR corpus (Neidle and Vogler, 2012). The system uses novel feature descriptors derived from both motion and shape statistics of the regions of high local motion. The system does not require a hand tracker.
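The multiple instance learning idea underlying the segmentation system above can be illustrated with a minimal, hypothetical sketch (the function name and threshold are illustrative, not taken from the paper): a "bag" of frame-level instance scores is labelled positive as soon as any single instance is sufficiently confident.

```python
def bag_is_positive(instance_scores, threshold=0.5):
    """Multiple-instance rule: a bag (e.g. a span of video frames) is
    labelled positive if at least one instance score clears the threshold."""
    return max(instance_scores) >= threshold

# A span containing one confident frame is labelled positive as a whole.
print(bag_is_positive([0.1, 0.2, 0.8]))   # True
print(bag_is_positive([0.1, 0.2, 0.3]))   # False
```

Training under this rule lets weak, bag-level labels supervise frame-level predictions without per-frame annotation.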

    Hollywood in Homes: Crowdsourcing Data Collection for Activity Understanding

    Computer vision has a great potential to help our daily lives by searching for lost keys, watering flowers or reminding us to take a pill. To succeed at such tasks, computer vision methods need to be trained on real and diverse examples of our daily dynamic scenes. Because most such scenes are not particularly exciting, they typically do not appear on YouTube, in movies or in TV broadcasts. So how do we collect sufficiently many diverse but boring samples representing our lives? We propose a novel Hollywood in Homes approach to collect such data. Instead of shooting videos in the lab, we ensure diversity by distributing and crowdsourcing the whole process of video creation, from script writing to video recording and annotation. Following this procedure we collect a new dataset, Charades, with hundreds of people recording videos in their own homes, acting out casual everyday activities. The dataset is composed of 9,848 annotated videos with an average length of 30 seconds, showing activities of 267 people from three continents. Each video is annotated with multiple free-text descriptions, action labels, action intervals and classes of interacted objects. In total, Charades provides 27,847 video descriptions, 66,500 temporally localized intervals for 157 action classes and 41,104 labels for 46 object classes. Using this rich data, we evaluate and provide baseline results for several tasks, including action recognition and automatic description generation. We believe that the realism, diversity, and casual nature of this dataset will present unique challenges and new opportunities for the computer vision community.

    Zero-Shot Sign Language Recognition: Can Textual Data Uncover Sign Languages?

    We introduce the problem of zero-shot sign language recognition (ZSSLR), where the goal is to leverage models learned over the seen sign class examples to recognize instances of unseen signs. To this end, we propose to utilize the readily available descriptions in sign language dictionaries as an intermediate-level semantic representation for knowledge transfer. We introduce a new benchmark dataset called ASL-Text that consists of 250 sign language classes and their accompanying textual descriptions. Compared to ZSL datasets in other domains (such as object recognition), our dataset consists of a limited number of training examples for a large number of classes, which imposes a significant challenge. We propose a framework that operates over the body and hand regions by means of 3D-CNNs, and models longer temporal relationships via bidirectional LSTMs. By leveraging the descriptive text embeddings along with these spatio-temporal representations within a zero-shot learning framework, we show that textual data can indeed be useful in uncovering sign languages. We anticipate that the introduced approach and the accompanying dataset will provide a basis for further exploration of this new zero-shot learning problem.
    Comment: To appear in the British Machine Vision Conference (BMVC) 2019.
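The core zero-shot step described above can be sketched in a minimal, hypothetical form (function names and toy vectors are illustrative, not the paper's actual model): a video of an unseen sign is labelled with the class whose textual-description embedding lies closest to the video embedding.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = lambda v: math.sqrt(sum(x * x for x in v))
    return dot / (norm(a) * norm(b))

def zero_shot_predict(video_emb, class_text_embs):
    """Assign the unseen-class label whose text embedding is most similar
    (by cosine similarity) to the video embedding."""
    return max(class_text_embs, key=lambda c: cosine(video_emb, class_text_embs[c]))

# Toy vectors standing in for 3D-CNN/BiLSTM video features and text embeddings.
text_embs = {"BOOK": [1.0, 0.0, 0.0], "HOUSE": [0.0, 1.0, 0.0]}
print(zero_shot_predict([0.9, 0.1, 0.0], text_embs))  # BOOK
```

Because classification reduces to nearest-neighbour search in the shared embedding space, no labelled video examples of the unseen classes are needed at test time.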

    Watch, read and lookup: learning to spot signs from multiple supervisors

    The focus of this work is sign spotting: given a video of an isolated sign, our task is to identify whether and where it has been signed in a continuous, co-articulated sign language video. To achieve this sign spotting task, we train a model using multiple types of available supervision by: (1) watching existing sparsely labelled footage; (2) reading associated subtitles (readily available translations of the signed content), which provide additional weak supervision; (3) looking up words (for which no co-articulated labelled examples are available) in visual sign language dictionaries to enable novel sign spotting. These three tasks are integrated into a unified learning framework using the principles of Noise Contrastive Estimation and Multiple Instance Learning. We validate the effectiveness of our approach on low-shot sign spotting benchmarks. In addition, we contribute a machine-readable British Sign Language (BSL) dictionary dataset of isolated signs, BSLDict, to facilitate study of this task. The dataset, models and code are available at our project page.
    Comment: Appears in Asian Conference on Computer Vision 2020 (ACCV 2020), oral presentation. 29 pages.
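Noise Contrastive Estimation, one of the two learning principles named above, can be sketched in a generic InfoNCE-style form (this is a standard formulation, not necessarily the paper's exact objective): training pushes the similarity score of a matching dictionary/continuous-video pair above the scores of negative pairs.

```python
import math

def info_nce_loss(pos_score, neg_scores):
    """Generic InfoNCE-style loss: the negative log softmax probability of
    the positive pair's score among one positive and several negatives."""
    scores = [pos_score] + list(neg_scores)
    m = max(scores)  # shift scores for numerical stability
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_z - pos_score

# The loss shrinks as the positive pair outscores the negatives.
print(info_nce_loss(5.0, [0.0, 0.5]) < info_nce_loss(1.0, [0.0, 0.5]))  # True
```

Minimising this loss is what lets sparsely labelled footage, subtitles and dictionary examples all supervise the same similarity function.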

    BSL-1K: Scaling up co-articulated sign language recognition using mouthing cues

    Recent progress in fine-grained gesture and action classification, and in machine translation, points to the possibility of automated sign language recognition becoming a reality. A key stumbling block in making progress towards this goal is a lack of appropriate training data, stemming from the high complexity of sign annotation and a limited supply of qualified annotators. In this work, we introduce a new scalable approach to data collection for sign recognition in continuous videos. We make use of weakly-aligned subtitles for broadcast footage together with a keyword spotting method to automatically localise sign instances for a vocabulary of 1,000 signs in 1,000 hours of video. We make the following contributions: (1) We show how to use mouthing cues from signers to obtain high-quality annotations from video data; the result is the BSL-1K dataset, a collection of British Sign Language (BSL) signs of unprecedented scale. (2) We show that we can use BSL-1K to train strong sign recognition models for co-articulated signs in BSL, and that these models additionally form excellent pretraining for other sign languages and benchmarks; we exceed the state of the art on both the MSASL and WLASL benchmarks. Finally, (3) we propose new large-scale evaluation sets for the tasks of sign recognition and sign spotting and provide baselines which we hope will serve to stimulate research in this area.
    Comment: Appears in European Conference on Computer Vision 2020 (ECCV 2020). 28 pages.
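The keyword-spotting step above can be illustrated with a hypothetical sketch (the function name, window and threshold are illustrative): within the frame window of a weakly-aligned subtitle containing the keyword, pick the frame where a mouthing-based spotter is most confident, and keep it only if it clears a confidence threshold.

```python
def spot_sign(frame_probs, subtitle_window, threshold=0.5):
    """Localise a sign instance from mouthing cues: search the frames covered
    by a weakly-aligned subtitle and return (frame, prob) for the most
    confident frame, or None when no frame clears the threshold."""
    start, end = subtitle_window
    best = max(range(start, end), key=lambda f: frame_probs[f])
    if frame_probs[best] >= threshold:
        return best, frame_probs[best]
    return None

probs = [0.1, 0.2, 0.9, 0.3, 0.1]   # per-frame keyword confidence (toy values)
print(spot_sign(probs, (0, 5)))      # (2, 0.9)
print(spot_sign(probs, (3, 5)))      # None
```

Applying a rule like this across 1,000 hours of subtitled footage is what makes annotation at this scale feasible without manual frame labelling.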

    Sign language video retrieval with free-form textual queries

    Systems that can efficiently search collections of sign language videos have been highlighted as a useful application of sign language technology. However, the problem of searching videos beyond individual keywords has received limited attention in the literature. To address this gap, in this work we introduce the task of sign language retrieval with textual queries: given a written query (e.g. a sentence) and a large collection of sign language videos, the objective is to find the signing video that best matches the written query. We propose to tackle this task by learning cross-modal embeddings on the recently introduced large-scale How2Sign dataset of American Sign Language (ASL). We identify that a key bottleneck in the performance of the system is the quality of the sign video embedding, which suffers from a scarcity of labelled training data. We therefore propose SPOT-ALIGN, a framework for interleaving iterative rounds of sign spotting and feature alignment to expand the scope and scale of available training data. We validate the effectiveness of SPOT-ALIGN for learning a robust sign video embedding through improvements in both sign recognition and the proposed video retrieval task. This work was supported by the project PID2020-117142GB-I00, funded by MCIN/AEI/10.13039/501100011033, the ANR project CorVis ANR-21-CE23-0003-01, and gifts from Google and Adobe. AD received support from la Caixa Foundation (ID 100010434), fellowship code LCF/BQ/IN18/11660029.
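Once cross-modal embeddings are learned, retrieval itself reduces to ranking, which can be sketched as follows (function and video names are hypothetical; the real system operates on learned high-dimensional embeddings):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = lambda v: math.sqrt(sum(x * x for x in v))
    return dot / (norm(a) * norm(b))

def retrieve(query_emb, video_embs, k=2):
    """Rank candidate sign videos by cosine similarity between the text-query
    embedding and each video embedding; return the top-k video ids."""
    ranked = sorted(video_embs,
                    key=lambda vid: cosine(query_emb, video_embs[vid]),
                    reverse=True)
    return ranked[:k]

videos = {
    "vid_a": [0.9, 0.1],
    "vid_b": [0.1, 0.9],
    "vid_c": [0.7, 0.7],
}
print(retrieve([1.0, 0.0], videos, k=2))  # ['vid_a', 'vid_c']
```

Because both modalities live in the same space, the same ranking machinery supports free-form sentence queries rather than only individual keywords.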

    Polysemous English phrasal verbs: EFL textbook distribution, students' receptive and productive knowledge and teachers' beliefs in the Greek Cypriot context

    Formulaic sequences such as idioms, collocations and phrasal verbs constitute an essential part of English vocabulary and a crucial element of foreign language learners’ communicative competence. While substantial research has been carried out on idioms and collocations, comparatively fewer studies have focused on phrasal verbs, despite the great difficulty they pose to foreign language learners and although phrasal verbs are considered necessary for native-like fluency. This thesis aims to fill this gap by exploring: i) phrasal verb distribution in English foreign language textbooks; ii) English language learners’ knowledge of phrasal verbs; and iii) English foreign language teachers’ beliefs about phrasal verb learning and teaching. The first study examined the occurrence and recurrence of phrasal verbs in six English foreign language textbooks in order to shed some light on what seems to be an under-researched area. Research has shown that phrasal verbs are polysemous and can have more than one meaning sense. Gardner and Davies (2007) estimated that each of the 100 most frequently used phrasal verbs in the British National Corpus has on average 5.6 meaning senses, while Garnier and Schmitt (2015) concluded that each of the 150 most frequently used phrasal verbs in the Corpus of Contemporary American English has on average two meaning senses. Nonetheless, no research so far has explored the way the various phrasal verb meaning senses are treated in contemporary English foreign language textbooks. To fill this gap, the first study explored the distribution of phrasal verbs and their frequently used meaning senses (based on native-speaker corpus indications) in the textbooks.
The results of this study highlight the need for textbook writers to i) adopt a more scientifically based and systematic selection process, taking into consideration the polysemous nature of phrasal verbs, and ii) provide more opportunities for repetition, an essential component of vocabulary acquisition. The second study explored 100 English foreign language learners’ productive and receptive knowledge of a sample of high-frequency phrasal verbs and phrasal verb meaning senses. Participants were tested at the form-recall and form-recognition levels of mastery, and the effects of frequency (based on textbook and corpus indications) and of a number of language engagement factors on knowledge were examined. Twenty participants also took part in an interview to validate the form-recall test items. Results showed that participants had a rather weak knowledge of phrasal verbs. Consistent with previous findings, the robust effect of frequency and of engagement in leisure activities, such as reading and watching English films, was further supported. The third study investigated English foreign language teachers’ beliefs about phrasal verb teaching and learning. Following a qualitative approach, twenty teachers took part in semi-structured interviews in order to gain insights into their beliefs about phrasal verbs. Analysis of the results indicated that all teachers considered phrasal verbs to be one of the most challenging features of English vocabulary. Nonetheless, conflicting results about the importance of phrasal verbs were found, as non-native-speaker teachers seemed to consider phrasal verbs a less important element of English vocabulary, while all native-speaker teachers stressed the importance of learning them. This study concluded that teachers’ beliefs about phrasal verbs were differentially affected by years of teaching experience, L1 background and students’ proficiency level.
Overall, the results of these studies stress foreign language learners’ lack of phrasal verb knowledge and highlight the need for better treatment of this word combination in foreign language teaching contexts. My research results may prove useful to second language researchers, textbook writers and material designers, as well as to foreign language teachers. It is hoped that polysemous phrasal verbs will receive more attention in the field of Applied Linguistics and that future efforts will try to improve the quality of textbooks and provide foreign language teachers with the necessary support for phrasal verb teaching.