83 research outputs found

    A new framework for sign language recognition based on 3D handshape identification and linguistic modeling

    Full text link
    Current approaches to sign recognition by computer generally have at least some of the following limitations: they rely on laboratory conditions for sign production, are limited to a small vocabulary, rely on 2D modeling (and therefore cannot deal with occlusions and off-plane rotations), and/or achieve limited success. Here we propose a new framework that (1) provides a new tracking method less dependent than others on laboratory conditions and able to deal with variations in background and skin regions (such as the face, forearms, or other hands); (2) allows for identification of 3D hand configurations that are linguistically important in American Sign Language (ASL); and (3) incorporates statistical information reflecting linguistic constraints in sign production. For purposes of large-scale computer-based sign language recognition from video, the ability to distinguish hand configurations accurately is critical. Our current method estimates the 3D hand configuration to distinguish among 77 hand configurations linguistically relevant for ASL. Constraining the problem in this way makes recognition of 3D hand configuration more tractable and provides the information specifically needed for sign recognition. Further improvements are obtained by incorporation of statistical information about linguistic dependencies among handshapes within a sign derived from an annotated corpus of almost 10,000 sign tokens

    Linguistically-driven framework for computationally efficient and scalable sign recognition

    Full text link
    We introduce a new general framework for sign recognition from monocular video using limited quantities of annotated data. The novelty of the hybrid framework we describe here is that we exploit state-of-the art learning methods while also incorporating features based on what we know about the linguistic composition of lexical signs. In particular, we analyze hand shape, orientation, location, and motion trajectories, and then use CRFs to combine this linguistically significant information for purposes of sign recognition. Our robust modeling and recognition of these sub-components of sign production allow an efficient parameterization of the sign recognition problem as compared with purely data-driven methods. This parameterization enables a scalable and extendable time-series learning approach that advances the state of the art in sign recognition, as shown by the results reported here for recognition of isolated, citation-form, lexical signs from American Sign Language (ASL)

    Scalable ASL sign recognition using model-based machine learning and linguistically annotated corpora

    Get PDF
    We report on the high success rates of our new, scalable, computational approach for sign recognition from monocular video, exploiting linguistically annotated ASL datasets with multiple signers. We recognize signs using a hybrid framework combining state-of-the-art learning methods with features based on what is known about the linguistic composition of lexical signs. We model and recognize the sub-components of sign production, with attention to hand shape, orientation, location, motion trajectories, plus non-manual features, and we combine these within a CRF framework. The effect is to make the sign recognition problem robust, scalable, and feasible with relatively smaller datasets than are required for purely data-driven methods. From a 350-sign vocabulary of isolated, citation-form lexical signs from the American Sign Language Lexicon Video Dataset (ASLLVD), including both 1- and 2-handed signs, we achieve a top-1 accuracy of 93.3% and a top-5 accuracy of 97.9%. The high probability with which we can produce 5 sign candidates that contain the correct result opens the door to potential applications, as it is reasonable to provide a sign lookup functionality that offers the user 5 possible signs, in decreasing order of likelihood, with the user then asked to select the desired sign

    NEW shared & interconnected ASL resources: SignStream® 3 Software; DAI 2 for web access to linguistically annotated video corpora; and a sign bank

    Get PDF
    2017 marked the release of a new version of SignStream® software, designed to facilitate linguistic analysis of ASL video. SignStream® provides an intuitive interface for labeling and time-aligning manual and non-manual components of the signing. Version 3 has many new features. For example, it enables representation of morpho-phonological information, including display of handshapes. An expanding ASL video corpus, annotated through use of SignStream®, is shared publicly on the Web. This corpus (video plus annotations) is Web-accessible—browsable, searchable, and downloadable—thanks to a new, improved version of our Data Access Interface: DAI 2. DAI 2 also offers Web access to a brand new Sign Bank, containing about 10,000 examples of about 3,000 distinct signs, as produced by up to 9 different ASL signers. This Sign Bank is also directly accessible from within SignStream®, thereby boosting the efficiency and consistency of annotation; new items can also be added to the Sign Bank. Soon to be integrated into SignStream® 3 and DAI 2 are visualizations of computer-generated analyses of the video: graphical display of eyebrow height, eye aperture, an

    Detection of major ASL sign types in continuous signing for ASL recognition

    Get PDF
    In American Sign Language (ASL) as well as other signed languages, different classes of signs (e.g., lexical signs, fingerspelled signs, and classifier constructions) have different internal structural properties. Continuous sign recognition accuracy can be improved through use of distinct recognition strategies, as well as different training datasets, for each class of signs. For these strategies to be applied, continuous signing video needs to be segmented into parts corresponding to particular classes of signs. In this paper we present a multiple instance learning-based segmentation system that accurately labels 91.27% of the video frames of 500 continuous utterances (including 7 different subjects) from the publicly accessible NCSLGR corpus (Neidle and Vogler, 2012). The system uses novel feature descriptors derived from both motion and shape statistics of the regions of high local motion. The system does not require a hand tracker

    The Importance of 3D Motion Trajectories for Computer-based Sign Recognition

    Get PDF
    Computer-based sign language recognition from video is a challenging problem because of the spatiotemporal complexities inherent in sign production and the variations within and across signers. However, linguistic information can help constrain sign recognition to make it a more feasible classification problem. We have previously explored recognition of linguistically significant 3D hand configurations, as start and end handshapes represent one major component of signs; others include hand orientation, place of articulation in space, and movement. Thus, although recognition of handshapes (on one or both hands) at the start and end of a sign is essential for sign identification, it is not sufficient. Analysis of hand and arm movement trajectories can provide additional information critical for sign identification. In order to test the discriminative potential of the hand motion analysis, we performed sign recognition based exclusively on hand trajectories while holding the handshape constant. To facilitate this evaluation, we captured a collection of videos involving signs with a constant handshape produced by multiple subjects; and we automatically annotated the 3D motion trajectories. 3D hand locations are normalized in accordance with invariant properties of ASL movements. We trained time-series learning-based models for different signs of constant handshape in our dataset using the normalized 3D motion trajectories. Results show significant computer-based sign recognition accuracy across subjects and across a diverse set of signs. Our framework demonstrates the discriminative power and importance of 3D hand motion trajectories for sign recognition, given known handshapes

    Exploiting phonological constraints for handshape recognition in sign language video

    Full text link
    The ability to recognize handshapes in signing video is essential in algorithms for sign recognition and retrieval. Handshape recognition from isolated images is, however, an insufficiently constrained problem. Many handshapes share similar 3D configurations and are indistinguishable for some hand orientations in 2D image projections. Additionally, significant differences in handshape appearance are induced by the articulated structure of the hand and variants produced by different signers. Linguistic rules involved in the production of signs impose strong constraints on the articulations of the hands, yet, little attention has been paid towards exploiting these constraints in previous works on sign recognition. Among the different classes of signs in any signed language, lexical signs constitute the prevalent class. Morphemes (or, meaningful units) for signs in this class involve a combination of particular handshapes, palm orientations, locations for articulation, and movement type. These are thus analyzed by many sign linguists as analogues of phonemes in spoken languages. Phonological constraints govern the ways in which phonemes combine in American Sign Language (ASL), as in other signed and spoken languages; utilizing these constraints for handshape recognition in ASL is the focus of the proposed thesis. Handshapes in monomorphemic lexical signs are specified at the start and end of the sign. The handshape transition within a sign are constrained to involve either closing or opening of the hand (i.e., constrained to exclusively use either folding or unfolding of the palm and one or more fingers). Furthermore, akin to allophonic variations in spoken languages, both inter- and intra- signer variations in the production of specific handshapes are observed. We propose a Bayesian network formulation to exploit handshape co-occurrence constraints also utilizing information about allophonic variations to aid in handshape recognition. We propose a fast non-rigid image alignment method to gain improved robustness to handshape appearance variations during computation of observation likelihoods in the Bayesian network. We evaluate our handshape recognition approach on a large dataset of monomorphemic lexical signs. We demonstrate that leveraging linguistic constraints on handshapes results in improved handshape recognition accuracy. As part of the overall project, we are collecting and preparing for dissemination a large corpus (three thousand signs from three native signers) of ASL video annotated with linguistic information such as glosses, morphological properties and variations, and start/end handshapes associated with each ASL sign

    Towards an Integrative Information Society: Studies on Individuality in Speech and Sign

    Get PDF
    The flow of information within modern information society has increased rapidly over the last decade. The major part of this information flow relies on the individual’s abilities to handle text or speech input. For the majority of us it presents no problems, but there are some individuals who would benefit from other means of conveying information, e.g. signed information flow. During the last decades the new results from various disciplines have all suggested towards the common background and processing for sign and speech and this was one of the key issues that I wanted to investigate further in this thesis. The basis of this thesis is firmly within speech research and that is why I wanted to design analogous test batteries for widely used speech perception tests for signers – to find out whether the results for signers would be the same as in speakers’ perception tests. One of the key findings within biology – and more precisely its effects on speech and communication research – is the mirror neuron system. That finding has enabled us to form new theories about evolution of communication, and it all seems to converge on the hypothesis that all communication has a common core within humans. In this thesis speech and sign are discussed as equal and analogical counterparts of communication and all research methods used in speech are modified for sign. Both speech and sign are thus investigated using similar test batteries. Furthermore, both production and perception of speech and sign are studied separately. An additional framework for studying production is given by gesture research using cry sounds. Results of cry sound research are then compared to results from children acquiring sign language. These results show that individuality manifests itself from very early on in human development. Articulation in adults, both in speech and sign, is studied from two perspectives: normal production and re-learning production when the apparatus has been changed. Normal production is studied both in speech and sign and the effects of changed articulation are studied with regards to speech. Both these studies are done by using carrier sentences. Furthermore, sign production is studied giving the informants possibility for spontaneous speech. The production data from the signing informants is also used as the basis for input in the sign synthesis stimuli used in sign perception test battery. Speech and sign perception were studied using the informants’ answers to questions using forced choice in identification and discrimination tasks. These answers were then compared across language modalities. Three different informant groups participated in the sign perception tests: native signers, sign language interpreters and Finnish adults with no knowledge of any signed language. This gave a chance to investigate which of the characteristics found in the results were due to the language per se and which were due to the changes in modality itself. As the analogous test batteries yielded similar results over different informant groups, some common threads of results could be observed. Starting from very early on in acquiring speech and sign the results were highly individual. However, the results were the same within one individual when the same test was repeated. This individuality of results represented along same patterns across different language modalities and - in some occasions - across language groups. As both modalities yield similar answers to analogous study questions, this has lead us to providing methods for basic input for sign language applications, i.e. signing avatars. This has also given us answers to questions on precision of the animation and intelligibility for the users – what are the parameters that govern intelligibility of synthesised speech or sign and how precise must the animation or synthetic speech be in order for it to be intelligible. The results also give additional support to the well-known fact that intelligibility in fact is not the same as naturalness. In some cases, as shown within the sign perception test battery design, naturalness decreases intelligibility. This also has to be taken into consideration when designing applications. All in all, results from each of the test batteries, be they for signers or speakers, yield strikingly similar patterns, which would indicate yet further support for the common core for all human communication. Thus, we can modify and deepen the phonetic framework models for human communication based on the knowledge obtained from the results of the test batteries within this thesis.Siirretty Doriast

    A Representation of Selected Nonmanual Signals in American Sign Language

    Get PDF
    Computer-generated three-dimensional animation holds great promise for synthesizing utterances in American Sign Language (ASL) that are not only grammatical, but believable by members of the Deaf community. Animation poses several challenges stemming from the massive amounts of data necessary to specify the movement of three-dimensional geometry, and there is no current system that facilitates the synthesis of nonmanual signals. However, the linguistics of ASL can aid in surmounting the challenge by providing structure and rules for organizing the data. This work presents a first method for representing ASL linguistic and extralinguistic processes that involve the face. Any such representation must be capable of expressing the subtle nuances of ASL. Further, it must be able to represent co-occurrences because many ASL signs require that two or more nonmanual signals be used simultaneously. In fact simultaneity of multiple nonmanual signals can occur on the same facial feature. Additionally, such a system should allow both binary and incremental nonmanual signals to display the full range of adjectival and adverbial modifiers. Validating such a representation requires both the affirmation that nonmanual signals are indeed necessary in the animation of ASL, and the evaluation of the effectiveness of the new representation in synthesizing nonmanual signals. In this study, members of the Deaf community viewed animations created with the new representation and answered questions concerning the influence of selected nonmanual signals on the perceived meaning of the synthesized utterances. Results reveal that, not only is the representation capable of effectively portraying nonmanual signals, but also that it can be used to combine various nonmanual signals in the synthesis of complete ASL sentences. In a study with Deaf users, participants viewing synthesized animations consistently identified the intended nonmanual signals correctly
    • …
    corecore