
    A Non-Anatomical Graph Structure for isolated hand gesture separation in continuous gesture sequences

    Continuous Hand Gesture Recognition (CHGR) has been extensively studied by researchers in the last few decades. Recently, a model has been presented to deal with the challenge of detecting the boundaries of isolated gestures in a continuous gesture video [17]. To enhance the performance of the model presented in [17] and to replace its handcrafted feature extractor, we propose a GCN model and combine it with stacked Bi-LSTM and Attention modules to capture the temporal information in the video stream. Considering the breakthroughs of GCN models for the skeleton modality, we propose a two-layer GCN model to enrich the 3D hand skeleton features. Finally, the class probabilities of each isolated gesture are fed to the post-processing module borrowed from [17]. Furthermore, we replace the anatomical graph structure with several non-anatomical graph structures. Due to the lack of a large dataset including both continuous gesture sequences and the corresponding isolated gestures, three public datasets, Dynamic Hand Gesture Recognition (DHGR), RKS-PERSIANSIGN, and ASLVID, are used for evaluation. Experimental results show the superiority of the proposed model in detecting isolated gesture boundaries in continuous gesture sequences.
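
    As a rough illustration of the pipeline the abstract describes, the sketch below combines a two-layer graph convolution over 3D hand-skeleton joints with stacked Bi-LSTMs and temporal attention. It is a minimal PyTorch sketch under assumed dimensions (21 joints, 100 classes, a learnable non-anatomical adjacency), not the authors' implementation.

```python
# Hypothetical sketch (not the authors' code) of the described pipeline:
# a two-layer GCN over 3D hand-skeleton joints, followed by stacked
# Bi-LSTMs with temporal attention. Joint count, class count, layer sizes,
# and the learnable adjacency are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class GraphConv(nn.Module):
    """One graph-convolution layer: X' = A_hat @ X @ W, applied per frame."""
    def __init__(self, in_dim, out_dim, num_joints):
        super().__init__()
        self.lin = nn.Linear(in_dim, out_dim)
        # Learnable (non-anatomical) adjacency, normalised with softmax.
        self.adj = nn.Parameter(torch.randn(num_joints, num_joints))

    def forward(self, x):            # x: (batch, time, joints, feat)
        a_hat = torch.softmax(self.adj, dim=-1)
        return F.relu(torch.einsum('ij,btjf->btif', a_hat, self.lin(x)))


class GestureClassifier(nn.Module):
    def __init__(self, num_joints=21, num_classes=100, hidden=128):
        super().__init__()
        self.gcn = nn.Sequential(GraphConv(3, 64, num_joints),
                                 GraphConv(64, 64, num_joints))
        self.bilstm = nn.LSTM(64 * num_joints, hidden, num_layers=2,
                              batch_first=True, bidirectional=True)
        self.attn = nn.Linear(2 * hidden, 1)
        self.head = nn.Linear(2 * hidden, num_classes)

    def forward(self, skel):         # skel: (batch, time, joints, 3)
        h = self.gcn(skel)
        h = h.flatten(2)             # (batch, time, joints * 64)
        h, _ = self.bilstm(h)
        w = torch.softmax(self.attn(h), dim=1)   # temporal attention weights
        ctx = (w * h).sum(dim=1)                 # weighted sum over time
        return self.head(ctx)                    # per-gesture class logits


if __name__ == '__main__':
    model = GestureClassifier()
    clip = torch.randn(2, 40, 21, 3)             # 2 clips, 40 frames, 21 joints
    print(model(clip).shape)                     # torch.Size([2, 100])
```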

    A Neural Multi-sequence Alignment TeCHnique (NeuMATCH)

    The alignment of heterogeneous sequential data (video to text) is an important and challenging problem. Standard techniques for this task, including Dynamic Time Warping (DTW) and Conditional Random Fields (CRFs), suffer from inherent drawbacks. Mainly, the Markov assumption implies that, given the immediate past, future alignment decisions are independent of further history. The separation between similarity computation and alignment decision also prevents end-to-end training. In this paper, we propose an end-to-end neural architecture where alignment actions are implemented as moving data between stacks of Long Short-term Memory (LSTM) blocks. This flexible architecture supports a large variety of alignment tasks, including one-to-one, one-to-many, skipping unmatched elements, and (with extensions) non-monotonic alignment. Extensive experiments on semi-synthetic and real datasets show that our algorithm outperforms state-of-the-art baselines. Comment: Accepted at CVPR 2018 (Spotlight). The arXiv file includes the paper and the supplemental material.
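
    The stack-based decoding idea can be pictured with the toy loop below: one stack holds the video clips, another holds the sentences, and at each step an action either matches the two stack tops or pops an unmatched element. The policy function is a placeholder for the learned LSTM action classifier, and the action set and similarity function are illustrative assumptions, not NeuMATCH's actual design.

```python
# Minimal sketch of a stack-based alignment loop (not the authors' code).
# Three illustrative actions: MATCH pairs the stack tops, POP_VIDEO and
# POP_TEXT skip an unmatched element. `policy` stands in for the LSTM
# blocks that encode the stacks and choose an action.
from collections import deque

MATCH, POP_VIDEO, POP_TEXT = 'match', 'pop_video', 'pop_text'


def policy(video_top, text_top, similarity, threshold=0.5):
    # Placeholder for the learned action classifier: match if similar enough,
    # otherwise pop from one side (the tie-break rule here is arbitrary).
    if similarity(video_top, text_top) >= threshold:
        return MATCH
    return POP_VIDEO if len(video_top) < len(text_top) else POP_TEXT


def align(video_clips, sentences, similarity):
    video, text = deque(video_clips), deque(sentences)
    pairs = []
    while video and text:
        action = policy(video[0], text[0], similarity)
        if action == MATCH:
            pairs.append((video.popleft(), text.popleft()))
        elif action == POP_VIDEO:
            video.popleft()        # skip an unmatched clip
        else:
            text.popleft()         # skip an unmatched sentence
    return pairs


if __name__ == '__main__':
    sim = lambda clip, sent: 1.0 if clip.lower() in sent.lower() else 0.0
    clips = ['intro', 'goal', 'credits']
    sents = ['the intro scene', 'the final goal']
    print(align(clips, sents, sim))
    # [('intro', 'the intro scene'), ('goal', 'the final goal')]
```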

    Interactive Emirate Sign Language E-Dictionary Based on Deep Learning Recognition Models

    According to the Ministry of Community Development database in the United Arab Emirates (UAE), about 3,065 people with disabilities are hearing disabled (Emirates News Agency - Ministry of Community Development). Hearing-impaired people find it difficult to communicate with the rest of society. They usually need Sign Language (SL) interpreters, but as the number of hearing-impaired individuals grows, the number of SL interpreters remains almost non-existent. In addition, specialized schools lack a unified SL dictionary, which can be linked to the diglossic nature of the Arabic language, in which many dialects co-exist. Moreover, there is not sufficient research work on Arabic SL in general, which can be linked to the lack of unification in Arabic Sign Language. Hence, we present an Emirate Sign Language (ESL) electronic dictionary (e-Dictionary), consisting of four features, namely Dictation, Alpha Webcam, Vocabulary, and Spell, and two datasets (letters and vocabulary/sentences), to help the community explore and unify ESL. The vocabulary/sentences dataset was recorded with an Azure Kinect and includes 127 signs and 50 sentences, making a total of 708 clips performed by 4 Emirati signers with hearing loss. All the signs were reviewed for compliance by the head of the Community Development Authority in the UAE. The ESL e-Dictionary integrates state-of-the-art methods, i.e., Google's Automatic Speech Recognition API, a YOLOv8 model trained on our dataset, and an algorithm inspired by the bag-of-words model. Experimental results demonstrated the usability of the e-Dictionary in real time on laptops. The vocabulary/sentences dataset will be publicly offered in the near future for research purposes.
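
    The Dictation feature chains speech recognition with a vocabulary lookup; the sketch below shows only that last, bag-of-words-inspired lookup step. The toy vocabulary and clip paths are hypothetical, and this is not the project's code.

```python
# Hedged sketch of the bag-of-words-inspired lookup step: a dictated
# sentence is tokenised and matched against the sign vocabulary so the
# corresponding clips can be played back. Vocabulary entries and clip
# paths below are illustrative assumptions.

def match_signs(sentence, vocabulary):
    """Return the known signs found in the dictated sentence, in order."""
    tokens = sentence.lower().split()
    return [tok for tok in tokens if tok in vocabulary]


if __name__ == '__main__':
    # Toy vocabulary mapping sign labels to (hypothetical) clip files.
    vocab = {'hello': 'clips/hello.mp4',
             'school': 'clips/school.mp4',
             'thanks': 'clips/thanks.mp4'}
    dictated = 'Hello how was school today'
    for sign in match_signs(dictated, vocab):
        print('play', vocab[sign])
```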

    Deep Architectures for Visual Recognition and Description

    In recent times, digital media content is inherently multimedia in nature, consisting of text, audio, image, and video. Several outstanding Computer Vision (CV) problems are being successfully solved with the help of modern Machine Learning (ML) techniques. Plenty of research work has already been carried out in the fields of Automatic Image Annotation (AIA), Image Captioning, and Video Tagging. Video Captioning, i.e., automatic description generation from digital video, however, is a different and more complex problem altogether. This study compares various existing video captioning approaches and attempts their classification and analysis based on different parameters, viz., the type of captioning method (generation/retrieval), the type of learning model employed, the desired length of the generated description, etc. This dissertation also critically analyzes the existing benchmark datasets used in various video captioning models and the evaluation metrics for assessing the final quality of the generated video descriptions. A detailed study of important existing models, highlighting their comparative advantages and disadvantages, is also included. In this study, a novel approach to video captioning on the Microsoft Video Description (MSVD) and Microsoft Video-to-Text (MSR-VTT) datasets is proposed, using supervised learning techniques to train a deep combinational framework that achieves better-quality video captioning by predicting semantic tags. We develop simple shallow CNNs (2D and 3D) as feature extractors, Deep Neural Networks (DNNs) and Bidirectional LSTMs (BiLSTMs) as tag prediction models, and a Recurrent Neural Network (LSTM) as the language model. The aim of the work was to provide an alternative route to generating captions from videos via semantic tag prediction and to deploy simpler, shallower deep architectures with lower memory requirements, so that the developed models remain stable and viable options as the scale of the data increases. This study also successfully employed deep architectures such as the Convolutional Neural Network (CNN) to speed up the automation of hand gesture recognition and classification for the sign language of the Indian classical dance form 'Bharatnatyam'. This hand gesture classification is primarily aimed at (1) building a novel dataset of 2D single-hand gestures belonging to 27 classes, collected from (i) the Google search engine (Google Images), (ii) YouTube videos (dynamic, with background considered), and (iii) professional artists under staged environment constraints (plain backgrounds); (2) exploring the effectiveness of CNNs for identifying and classifying the single-hand gestures by optimizing the hyperparameters; and (3) evaluating the impacts of transfer learning and double transfer learning, a novel concept explored for achieving higher classification accuracy.
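
    The combinational framework described above can be pictured roughly as follows: pre-extracted clip features feed a tag predictor, and the predicted semantic tags seed an LSTM language model that emits the caption. This is a minimal PyTorch sketch with assumed feature dimensions, tag count, and vocabulary size, not the dissertation's models.

```python
# Illustrative sketch (not the dissertation's code) of the combinational
# framework: CNN clip features -> BiLSTM tag predictor -> tag-conditioned
# LSTM language model. All dimensions and the vocabulary size are assumed.
import torch
import torch.nn as nn


class TagPredictor(nn.Module):
    def __init__(self, feat_dim=512, hidden=256, num_tags=300):
        super().__init__()
        self.bilstm = nn.LSTM(feat_dim, hidden, batch_first=True,
                              bidirectional=True)
        self.out = nn.Linear(2 * hidden, num_tags)

    def forward(self, feats):            # feats: (batch, frames, feat_dim)
        h, _ = self.bilstm(feats)
        return torch.sigmoid(self.out(h.mean(dim=1)))   # multi-label tag scores


class CaptionDecoder(nn.Module):
    def __init__(self, num_tags=300, vocab=5000, hidden=256):
        super().__init__()
        self.embed = nn.Embedding(vocab, hidden)
        self.init_h = nn.Linear(num_tags, hidden)        # tags seed the LSTM state
        self.lstm = nn.LSTM(hidden, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab)

    def forward(self, tags, tokens):     # tokens: (batch, seq_len) word ids
        h0 = torch.tanh(self.init_h(tags)).unsqueeze(0)
        c0 = torch.zeros_like(h0)
        h, _ = self.lstm(self.embed(tokens), (h0, c0))
        return self.out(h)               # next-word logits per position


if __name__ == '__main__':
    feats = torch.randn(2, 20, 512)      # 2 clips, 20 frames of CNN features
    tags = TagPredictor()(feats)
    logits = CaptionDecoder()(tags, torch.randint(0, 5000, (2, 12)))
    print(logits.shape)                  # torch.Size([2, 12, 5000])
```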

    Project Hermes: The Socially Assistive Tour-Guiding Robot

    With the reduced availability of a labor force for non-technical tasks, service robotics is increasingly used in place of human labor to handle these tasks. There have been various studies on the impact of using robotics in a sociological context. The use of service robots in social and labor environments highlights the need for cohesive Human-Robot Interaction (HRI). In this senior design project, we examine the considerations involved in using a service robot in place of a human for tasks that are normally reserved for humans. These tasks raise design considerations for performing emotion-centric activities and for delivering an effective and efficient service. Under the codename Project Hermes, we developed a tour-guide robot that provides an interactive routine. Using the robot's array of sensors and motors, the routine navigates from one room to another, provides an audible explanation of each room, answers visitor questions, and moves on. With its embedded microphones, the robot is capable of limited interaction with humans, providing feedback and performing tasks accordingly. Once the core functionalities are developed, Hermes will be evaluated in a real-world environment to gather data and feedback. With all these considerations in hand, the design of the service robot needs to cover many of these areas within our framework. To address this need, we outline the relevant ideas and considerations for the task.
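
    The tour routine amounts to a simple visit-explain-answer loop; the sketch below shows one possible shape for it. The robot interface, room list, and scripts are hypothetical stand-ins for the platform's navigation, speech, and microphone APIs, not the project's code.

```python
# A minimal sketch, under assumed interfaces, of the tour routine the
# abstract outlines: visit each room, play its explanation, take a few
# questions, then move on. All names below are hypothetical.

TOUR = [
    ('Lobby', 'Welcome to the lobby, where visitors check in.'),
    ('Lab', 'This lab hosts the robotics research group.'),
    ('Auditorium', 'Talks and demos are held in this auditorium.'),
]


def run_tour(robot, stops=TOUR, questions_per_stop=2):
    for room, script in stops:
        robot.navigate_to(room)              # drive to the next waypoint
        robot.say(script)                    # audible room explanation
        for _ in range(questions_per_stop):  # limited Q&A before moving on
            question = robot.listen(timeout_s=10)
            if not question:
                break
            robot.say(robot.answer(question))
    robot.say('That concludes the tour. Thank you for visiting!')


class ConsoleRobot:
    """Console stand-in so the routine can be exercised without hardware."""
    def navigate_to(self, room): print(f'[nav] moving to {room}')
    def say(self, text): print(f'[tts] {text}')
    def listen(self, timeout_s=10): return None   # no microphone in this stub
    def answer(self, question): return 'Let me look that up.'


if __name__ == '__main__':
    run_tour(ConsoleRobot())
```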

    Dwell-free input methods for people with motor impairments

    Millions of individuals affected by disorders or injuries that cause severe motor impairments have difficulty performing compound manipulations using traditional input devices. This thesis first explores how effective various assistive technologies are for people with motor impairments. The following questions are studied: (1) What activities are performed? (2) What tools are used to support these activities? (3) What are the advantages and limitations of these tools? (4) How do users learn about and choose assistive technologies? (5) Why do users adopt or abandon certain tools? A qualitative study of fifteen people with motor impairments indicates that users have strong needs for efficient text entry and communication tools that are not met by existing technologies. To address these needs, this thesis proposes three dwell-free input methods designed to improve the efficacy of target selection and text entry based on eye-tracking and head-tracking systems: (1) the Target Reverse Crossing selection mechanism, (2) the EyeSwipe eye-typing interface, and (3) the HGaze Typing interface. With Target Reverse Crossing, a user moves the cursor into a target and reverses over a goal to select it. This mechanism is significantly more efficient than dwell-time selection. Target Reverse Crossing is then adapted in EyeSwipe to delineate the start and end of a word that is eye-typed with a gaze path connecting the intermediate characters (as with traditional gesture typing). When compared with a dwell-based virtual keyboard, EyeSwipe affords higher text entry rates and a more comfortable interaction. Finally, HGaze Typing adds head gestures to gaze-path-based text entry to enable simple and explicit command activations. Results from a user study demonstrate that HGaze Typing has better performance and user satisfaction than a dwell-time method.
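
    To make the reverse-crossing idea concrete, the sketch below fires a selection when the cursor enters a target and then reverses back out across the boundary it came in through, rather than dwelling inside it for a fixed time. It is an illustrative simplification with assumed rectangle geometry and a made-up cursor path, not the thesis implementation.

```python
# Illustrative simplification of reverse-crossing selection (not the
# thesis code): a target is selected when the cursor enters it and then
# reverses back out across the edge it entered through.
from dataclasses import dataclass


@dataclass
class Target:
    x: float
    y: float
    w: float
    h: float

    def contains(self, px, py):
        return self.x <= px <= self.x + self.w and self.y <= py <= self.y + self.h


class ReverseCrossingSelector:
    def __init__(self, target):
        self.target = target
        self.inside = False
        self.entry_edge = None     # which edge the cursor entered through

    def _nearest_edge(self, px, py):
        t = self.target
        dists = {'left': abs(px - t.x), 'right': abs(t.x + t.w - px),
                 'top': abs(py - t.y), 'bottom': abs(t.y + t.h - py)}
        return min(dists, key=dists.get)

    def update(self, px, py):
        """Feed cursor samples; returns True on a reverse-crossing selection."""
        now_inside = self.target.contains(px, py)
        if now_inside and not self.inside:
            self.entry_edge = self._nearest_edge(px, py)   # just crossed in
        selected = (self.inside and not now_inside
                    and self._nearest_edge(px, py) == self.entry_edge)
        self.inside = now_inside
        return selected


if __name__ == '__main__':
    sel = ReverseCrossingSelector(Target(100, 100, 80, 40))
    path = [(90, 120), (110, 120), (130, 120), (110, 120), (90, 120)]
    print([sel.update(x, y) for x, y in path])   # last sample triggers selection
```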