6 research outputs found

    Scalable ASL sign recognition using model-based machine learning and linguistically annotated corpora

    We report on the high success rates of our new, scalable, computational approach for sign recognition from monocular video, exploiting linguistically annotated ASL datasets with multiple signers. We recognize signs using a hybrid framework combining state-of-the-art learning methods with features based on what is known about the linguistic composition of lexical signs. We model and recognize the sub-components of sign production, with attention to hand shape, orientation, location, and motion trajectories, as well as non-manual features, and we combine these within a CRF framework. The effect is to make the sign recognition problem robust, scalable, and feasible with considerably smaller datasets than are required for purely data-driven methods. From a 350-sign vocabulary of isolated, citation-form lexical signs from the American Sign Language Lexicon Video Dataset (ASLLVD), including both 1- and 2-handed signs, we achieve a top-1 accuracy of 93.3% and a top-5 accuracy of 97.9%. The high probability with which we can produce 5 sign candidates that contain the correct result opens the door to potential applications: it is reasonable to provide a sign lookup function that offers the user 5 possible signs, in decreasing order of likelihood, and asks the user to select the desired sign.
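
    The 93.3% top-1 / 97.9% top-5 figures motivate exactly the candidate-list interface the abstract describes. As a rough illustration of that lookup step, the sketch below ranks recognizer scores and returns the five most likely glosses; the function name, toy vocabulary, and score values are hypothetical, not part of the ASLLVD system.

        import numpy as np

        def top_k_candidates(sign_scores, vocabulary, k=5):
            """Return the k most likely signs, in decreasing order of score."""
            order = np.argsort(sign_scores)[::-1][:k]
            return [(vocabulary[i], float(sign_scores[i])) for i in order]

        # Toy usage: recognizer scores over a tiny vocabulary (hypothetical values).
        vocab = ["BOOK", "HOUSE", "MOTHER", "FINISH", "SCHOOL", "DEAF"]
        scores = np.array([0.02, 0.61, 0.05, 0.18, 0.09, 0.05])
        print(top_k_candidates(scores, vocab))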

    Linguistically-driven framework for computationally efficient and scalable sign recognition

    We introduce a new general framework for sign recognition from monocular video using limited quantities of annotated data. The novelty of the hybrid framework described here is that we exploit state-of-the-art learning methods while also incorporating features based on what we know about the linguistic composition of lexical signs. In particular, we analyze hand shape, orientation, location, and motion trajectories, and then use CRFs to combine this linguistically significant information for purposes of sign recognition. Our robust modeling and recognition of these sub-components of sign production allow a more efficient parameterization of the sign recognition problem than purely data-driven methods. This parameterization enables a scalable and extendable time-series learning approach that advances the state of the art in sign recognition, as shown by the results reported here for recognition of isolated, citation-form lexical signs from American Sign Language (ASL).
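
    To make the role of the CRF concrete, here is a minimal sketch of combining per-frame linguistic sub-component estimates (hand shape, orientation, location, movement) in a linear-chain CRF, assuming the third-party sklearn-crfsuite package is installed; the feature names, toy sequences, and segment labels are illustrative placeholders, not the authors' actual parameterization.

        import sklearn_crfsuite

        def frame_features(handshape, orientation, location, movement):
            """Represent one frame's sub-component estimates as a CRF feature dict."""
            return {"handshape": handshape, "orientation": orientation,
                    "location": location, "movement": movement}

        # Two toy annotated sequences; each frame is labelled with its temporal
        # segment within the sign (start / middle / end), purely for illustration.
        X_train = [
            [frame_features("B", "palm-out", "neutral", "none"),
             frame_features("B", "palm-out", "neutral", "arc"),
             frame_features("B", "palm-down", "neutral", "none")],
            [frame_features("5", "palm-in", "chin", "none"),
             frame_features("5", "palm-in", "chest", "straight"),
             frame_features("5", "palm-in", "chest", "none")],
        ]
        y_train = [["start", "middle", "end"], ["start", "middle", "end"]]

        crf = sklearn_crfsuite.CRF(algorithm="lbfgs", c1=0.1, c2=0.1, max_iterations=50)
        crf.fit(X_train, y_train)
        print(crf.predict(X_train))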

    A benchmark of dynamic versus static methods for facial action unit detection

    Action Units (AUs) are activations of local facial muscle groups that unfold over time to constitute a natural facial expression event. Detecting AU occurrences can therefore be framed as recognizing the temporally consecutive, evolving movements of these muscle groups. Detecting AUs automatically offers clear benefits because it can exploit both static and dynamic facial features. Our work makes three contributions. First, we extract features using Local Binary Patterns, Local Phase Quantisation, and the dynamic texture descriptor LPQTOP, together with two network models drawn from different CNN architectures, for local deep visual learning in AU image analysis. Second, we cascade the LPQTOP feature vector with a Long Short-Term Memory (LSTM) network to encode longer-term temporal information. We also found that stacking an LSTM on top of a CNN is important for learning temporal information, as it combines the spatial and temporal schemes simultaneously, and we hypothesised that unsupervised Slow Feature Analysis can extract invariant information from dynamic textures. Third, we compare continuous scoring predictions from LPQTOP with SVM, LPQTOP with LSTM, and AlexNet. An extensive comparative performance evaluation was carried out on the Enhanced CK dataset. Overall, the results indicate that the CNN approach is very promising and surpasses all other methods.
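
    As a rough illustration of the "LSTM stacked on top of a CNN" idea, the PyTorch sketch below encodes each frame with a small CNN and models the sequence of frame features with an LSTM before a per-AU sigmoid head. The layer sizes, 64x64 grayscale input, and 12-AU output are assumptions made for the example, not the configuration evaluated in the paper.

        import torch
        import torch.nn as nn

        class CnnLstmAUDetector(nn.Module):
            def __init__(self, num_aus=12, feat_dim=128, hidden_dim=64):
                super().__init__()
                # Per-frame spatial encoder (assumes grayscale 64x64 face crops).
                self.cnn = nn.Sequential(
                    nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
                    nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
                    nn.Flatten(),
                    nn.Linear(32 * 16 * 16, feat_dim), nn.ReLU(),
                )
                # Temporal model over the sequence of per-frame features.
                self.lstm = nn.LSTM(feat_dim, hidden_dim, batch_first=True)
                # One sigmoid output per action unit (multi-label detection).
                self.head = nn.Linear(hidden_dim, num_aus)

            def forward(self, clips):                # clips: (batch, time, 1, 64, 64)
                b, t = clips.shape[:2]
                feats = self.cnn(clips.reshape(b * t, *clips.shape[2:]))
                out, _ = self.lstm(feats.reshape(b, t, -1))
                return torch.sigmoid(self.head(out[:, -1]))  # AU probabilities

        model = CnnLstmAUDetector()
        dummy = torch.randn(2, 8, 1, 64, 64)         # 2 clips of 8 frames each
        print(model(dummy).shape)                    # torch.Size([2, 12])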

    Variable-state latent conditional random fields for facial expression recognition and action unit detection

    Automatic recognition of facial expressions of emotion, and detection of facial action units (AUs), from videos depends critically on modeling their dynamics. These dynamics are characterized by changes in the temporal phases (onset-apex-offset) and intensity of emotions/AUs, the appearance of which varies considerably among subjects, making the recognition/detection task very challenging. While state-of-the-art Latent Conditional Random Fields (LCRF) efficiently encode these dynamics by modeling structural information (e.g., temporal consistency and ordinal constraints), their latent states are restricted to being either unordered (nominal) or fully ordered (ordinal). Such a restriction is often too limiting: in the case of AU detection, for instance, sequences in which an AU is active may be better described using ordinal latent states (corresponding to AU intensity levels), while sequences in which that AU is not active may be better described using unordered (nominal) latent states. To this end, we propose the Variable-state LCRF model, which automatically selects the optimal latent-state type (nominal or ordinal) for each sequence from each target class. This unsupervised adaptation of the model to individual sequence or subject contexts opens the possibility of improved model fitting and, subsequently, enhanced predictive performance. Our experiments on four public expression databases (CK+, AFEW, MMI and GEMEP-FERA) show that the proposed model consistently outperforms state-of-the-art methods for both facial expression recognition and action unit detection from image sequences.
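
    A toy numpy sketch of the central selection step: score each sequence under two latent-state assumptions, nominal (free transitions) versus ordinal (state indices may only stay or increase, mimicking intensity levels), and keep whichever fits better. The Gaussian emissions, three latent states, and max-sum scoring are simplifying assumptions, not the actual Variable-state LCRF objective.

        import numpy as np

        STATE_MEANS = np.array([0.0, 1.0, 2.0])  # hypothetical per-state emission means

        def best_path_score(obs, ordinal):
            """Max-sum (Viterbi-style) score of the best latent-state path for obs."""
            n = len(STATE_MEANS)
            log_emit = -0.5 * (obs[:, None] - STATE_MEANS[None, :]) ** 2
            trans = np.zeros((n, n))
            if ordinal:                       # forbid moving to a lower intensity level
                trans[np.tril_indices(n, k=-1)] = -np.inf
            score = log_emit[0].copy()
            for t in range(1, len(obs)):
                score = (score[:, None] + trans).max(axis=0) + log_emit[t]
            return score.max()

        def select_state_type(obs):
            """Pick the latent-state type (nominal vs ordinal) that fits obs better."""
            nominal = best_path_score(obs, ordinal=False)
            ordinal = best_path_score(obs, ordinal=True)
            return "ordinal" if ordinal >= nominal else "nominal"

        print(select_state_type(np.array([0.1, 0.9, 2.1, 1.9])))  # rising AU intensity
        print(select_state_type(np.array([2.0, 0.1, 1.8, 0.2])))  # no ordinal structure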