293 research outputs found

    Hand tracking and bimanual movement understanding

    Get PDF
    Bimanual movements are a subset ot human movements in which the two hands move together in order to do a task or imply a meaning A bimanual movement appearing in a sequence of images must be understood in order to enable computers to interact with humans in a natural way This problem includes two main phases, hand tracking and movement recognition. We approach the problem of hand tracking from a neuroscience point ot view First the hands are extracted and labelled by colour detection and blob analysis algorithms In the presence of the two hands one hand may occlude the other occasionally Therefore, hand occlusions must be detected in an image sequence A dynamic model is proposed to model the movement of each hand separately Using this model in a Kalman filtering proccss the exact starting and end points of hand occlusions are detected We exploit neuroscience phenomena to understand the beha\ tour of the hands during occlusion periods Based on this, we propose a general hand tracking algorithm to track and reacquire the hands over a movement including hand occlusion The advantages of the algorithm and its generality are demonstrated in the experiments. In order to recognise the movements first we recognise the movement of a hand Using statistical pattern recognition methods (such as Principal Component Analysis and Nearest Neighbour) the static shape of each hand appearing in an image is recognised A Graph- Matching algorithm and Discrete Midden Markov Models (DHMM) as two spatio-temporal pattern recognition techniques are imestigated tor recognising a dynamic hand gesture For recognising bimanual movements we consider two general forms ot these movements, single and concatenated periodic We introduce three Bayesian networks for recognising die movements The networks are designed to recognise and combinc the gestures of the hands in order to understand the whole movement Experiments on different types ot movement demonstrate the advantages and disadvantages of each network

    BEST: BERT Pre-Training for Sign Language Recognition with Coupling Tokenization

    Full text link
    In this work, we are dedicated to leveraging the BERT pre-training success and modeling the domain-specific statistics to fertilize the sign language recognition~(SLR) model. Considering the dominance of hand and body in sign language expression, we organize them as pose triplet units and feed them into the Transformer backbone in a frame-wise manner. Pre-training is performed via reconstructing the masked triplet unit from the corrupted input sequence, which learns the hierarchical correlation context cues among internal and external triplet units. Notably, different from the highly semantic word token in BERT, the pose unit is a low-level signal originally located in continuous space, which prevents the direct adoption of the BERT cross-entropy objective. To this end, we bridge this semantic gap via coupling tokenization of the triplet unit. It adaptively extracts the discrete pseudo label from the pose triplet unit, which represents the semantic gesture/body state. After pre-training, we fine-tune the pre-trained encoder on the downstream SLR task, jointly with the newly added task-specific layer. Extensive experiments are conducted to validate the effectiveness of our proposed method, achieving new state-of-the-art performance on all four benchmarks with a notable gain.Comment: Accepted by AAAI 2023 (Oral

    Human-robot interaction and computer-vision-based services for autonomous robots

    Get PDF
    L'Aprenentatge per Imitació (IL), o Programació de robots per Demostració (PbD), abasta mètodes pels quals un robot aprèn noves habilitats a través de l'orientació humana i la imitació. La PbD s'inspira en la forma en què els éssers humans aprenen noves habilitats per imitació amb la finalitat de desenvolupar mètodes pels quals les noves tasques es poden transferir als robots. Aquesta tesi està motivada per la pregunta genèrica de "què imitar?", Que es refereix al problema de com extreure les característiques essencials d'una tasca. Amb aquesta finalitat, aquí adoptem la perspectiva del Reconeixement d'Accions (AR) per tal de permetre que el robot decideixi el què cal imitar o inferir en interactuar amb un ésser humà. L'enfoc proposat es basa en un mètode ben conegut que prové del processament del llenguatge natural: és a dir, la bossa de paraules (BoW). Aquest mètode s'aplica a grans bases de dades per tal d'obtenir un model entrenat. Encara que BoW és una tècnica d'aprenentatge de màquines que s'utilitza en diversos camps de la investigació, en la classificació d'accions per a l'aprenentatge en robots està lluny de ser acurada. D'altra banda, se centra en la classificació d'objectes i gestos en lloc d'accions. Per tant, en aquesta tesi es demostra que el mètode és adequat, en escenaris de classificació d'accions, per a la fusió d'informació de diferents fonts o de diferents assajos. Aquesta tesi fa tres contribucions: (1) es proposa un mètode general per fer front al reconeixement d'accions i per tant contribuir a l'aprenentatge per imitació; (2) la metodologia pot aplicar-se a grans bases de dades, que inclouen diferents modes de captura de les accions; i (3) el mètode s'aplica específicament en un projecte internacional d'innovació real anomenat Vinbot.El Aprendizaje por Imitación (IL), o Programación de robots por Demostración (PbD), abarca métodos por los cuales un robot aprende nuevas habilidades a través de la orientación humana y la imitación. La PbD se inspira en la forma en que los seres humanos aprenden nuevas habilidades por imitación con el fin de desarrollar métodos por los cuales las nuevas tareas se pueden transferir a los robots. Esta tesis está motivada por la pregunta genérica de "qué imitar?", que se refiere al problema de cómo extraer las características esenciales de una tarea. Con este fin, aquí adoptamos la perspectiva del Reconocimiento de Acciones (AR) con el fin de permitir que el robot decida lo que hay que imitar o inferir al interactuar con un ser humano. El enfoque propuesto se basa en un método bien conocido que proviene del procesamiento del lenguaje natural: es decir, la bolsa de palabras (BoW). Este método se aplica a grandes bases de datos con el fin de obtener un modelo entrenado. Aunque BoW es una técnica de aprendizaje de máquinas que se utiliza en diversos campos de la investigación, en la clasificación de acciones para el aprendizaje en robots está lejos de ser acurada. Además, se centra en la clasificación de objetos y gestos en lugar de acciones. Por lo tanto, en esta tesis se demuestra que el método es adecuado, en escenarios de clasificación de acciones, para la fusión de información de diferentes fuentes o de diferentes ensayos. Esta tesis hace tres contribuciones: (1) se propone un método general para hacer frente al reconocimiento de acciones y por lo tanto contribuir al aprendizaje por imitación; (2) la metodología puede aplicarse a grandes bases de datos, que incluyen diferentes modos de captura de las acciones; y (3) el método se aplica específicamente en un proyecto internacional de innovación real llamado Vinbot.Imitation Learning (IL), or robot Programming by Demonstration (PbD), covers methods by which a robot learns new skills through human guidance and imitation. PbD takes its inspiration from the way humans learn new skills by imitation in order to develop methods by which new tasks can be transmitted to robots. This thesis is motivated by the generic question of “what to imitate?” which concerns the problem of how to extract the essential features of a task. To this end, here we adopt Action Recognition (AR) perspective in order to allow the robot to decide what has to be imitated or inferred when interacting with a human kind. The proposed approach is based on a well-known method from natural language processing: namely, Bag of Words (BoW). This method is applied to large databases in order to obtain a trained model. Although BoW is a machine learning technique that is used in various fields of research, in action classification for robot learning it is far from accurate. Moreover, it focuses on the classification of objects and gestures rather than actions. Thus, in this thesis we show that the method is suitable in action classification scenarios for merging information from different sources or different trials. This thesis makes three contributions: (1) it proposes a general method for dealing with action recognition and thus to contribute to imitation learning; (2) the methodology can be applied to large databases which include different modes of action captures; and (3) the method is applied specifically in a real international innovation project called Vinbot

    Feature Extraction and Recognition for Human Action Recognition

    Get PDF
    How to automatically label videos containing human motions is the task of human action recognition. Traditional human action recognition algorithms use the RGB videos as input, and it is a challenging task because of the large intra-class variations of actions, cluttered background, possible camera movement, and illumination variations. Recently, the introduction of cost-effective depth cameras provides a new possibility to address difficult issues. However, it also brings new challenges such as noisy depth maps and time alignment. In this dissertation, effective and computationally efficient feature extraction and recognition algorithms are proposed for human action recognition. At the feature extraction step, two novel spatial-temporal feature descriptors are proposed which can be combined with local feature detectors. The first proposed descriptor is the Shape and Motion Local Ternary Pattern (SMltp) descriptor which can dramatically reduced the number of features generated by dense sampling without sacrificing the accuracy. In addition, the Center-Symmetric Motion Local Ternary Pattern (CS-Mltp) descriptor is proposed, which describes the spatial and temporal gradients-like features. Both descriptors (SMltp and CS-Mltp) take advantage of the Local Binary Pattern (LBP) texture operator in terms of tolerance to illumination change, robustness in homogeneous region and computational efficiency. For better feature representation, this dissertation presents a new Dictionary Learning (DL) method to learn an overcomplete set of representative vectors (atoms) so that any input feature can be approximated by a linear combination of these atoms with minimum reconstruction error. Instead of simultaneously learning one overcomplete dictionary for all classes, we learn class-specific sub-dictionaries to increase the discrimination. In addition, the group sparsity and the geometry constraint are added to the learning process to further increase the discriminative power, so that features are well reconstructed by atoms from the same class and features from the same class with high similarity will be forced to have similar coefficients. To evaluate the proposed algorithms, three applications including single view action recognition, distributed multi-view action recognition, and RGB-D action recognition have been explored. Experimental results on benchmark datasets and comparative analyses with the state-of-the-art methods show the effectiveness and merits of the proposed algorithms

    Automatic Video-based Analysis of Human Motion

    Get PDF

    Automatic Sign Language Recognition from Image Data

    Get PDF
    Tato práce se zabývá problematikou automatického rozpoznávání znakového jazyka z obrazových dat. Práce představuje pět hlavních přínosů v oblasti tvorby systému pro rozpoznávání, tvorby korpusů, extrakci příznaků z rukou a obličeje s využitím metod pro sledování pozice a pohybu rukou (tracking) a modelování znaků s využitím menších fonetických jednotek (sub-units). Metody využité v rozpoznávacím systému byly využity i k tvorbě vyhledávacího nástroje "search by example", který dokáže vyhledávat ve videozáznamech podle obrázku ruky. Navržený systém pro automatické rozpoznávání znakového jazyka je založen na statistickém přístupu s využitím skrytých Markovových modelů, obsahuje moduly pro analýzu video dat, modelování znaků a dekódování. Systém je schopen rozpoznávat jak izolované, tak spojité promluvy. Veškeré experimenty a vyhodnocení byly provedeny s vlastními korpusy UWB-06-SLR-A a UWB-07-SLR-P, první z nich obsahuje 25 znaků, druhý 378. Základní extrakce příznaků z video dat byla provedena na nízkoúrovňových popisech obrazu. Lepších výsledků bylo dosaženo s příznaky získaných z popisů vyšší úrovně porozumění obsahu v obraze, které využívají sledování pozice rukou a metodu pro segmentaci rukou v době překryvu s obličejem. Navíc, využitá metoda dokáže interpolovat obrazy s obličejem v době překryvu a umožňuje tak využít metody pro extrakci příznaků z obličeje, které by během překryvu nefungovaly, jako např. metoda active appearance models (AAM). Bylo porovnáno několik různých metod pro extrakci příznaků z rukou, jako např. local binary patterns (LBP), histogram of oriented gradients (HOG), vysokoúrovnové lingvistické příznaky a nové navržená metoda hand shape radial distance function (hRDF). Bylo také zkoumáno využití menších fonetických jednotek, než jsou celé znaky, tzv. sub-units. Pro první krok tvorby těchto jednotek byl navržen iterativní algoritmus, který tyto jednotky automaticky vytváří analýzou existujících dat. Bylo ukázáno, že tento koncept je vhodný pro modelování a rozpoznávání znaků. Kromě systému pro rozpoznávání je v práci navržen a představen systém "search by example", který funguje jako vyhledávací systém pro videa se záznamy znakového jazyka a může být využit například v online slovnících znakového jazyka, kde je v současné době složité či nemožné v takovýchto datech vyhledávat. Tento nástroj využívá metody, které byly použity v rozpoznávacím systému. Výstupem tohoto vyhledávacího nástroje je seřazený seznam videí, které obsahují stejný nebo podobný tvar ruky, které zadal uživatel, např. přes webkameru.Katedra kybernetikyObhájenoThis thesis addresses several issues of automatic sign language recognition, namely the creation of vision based sign language recognition framework, sign language corpora creation, feature extraction, making use of novel hand tracking with face occlusion handling, data-driven creation of sub-units and "search by example" tool for searching in sign language corpora using hand images as a search query. The proposed sign language recognition framework, based on statistical approach incorporating hidden Markov models (HMM), consists of video analysis, sign modeling and decoding modules. The framework is able to recognize both isolated signs and continuous utterances from video data. All experiments and evaluations were performed on two own corpora, UWB-06-SLR-A and UWB-07-SLR-P, the first containing 25 signs and second 378. As a baseline feature descriptors, low level image features are used. It is shown that better performance is gained by higher level features that employ hand tracking, which resolve occlusions of hands and face. As a side effect, the occlusion handling method interpolates face area in the frames during the occlusion and allows to use face feature descriptors that fail in such a case, for instance features extracted from active appearance models (AAM) tracker. Several state-of-the-art appearance-based feature descriptors were compared for tracked hands, such as local binary patterns (LBP), histogram of oriented gradients (HOG), high-level linguistic features or newly proposed hand shape radial distance function (denoted as hRDF) that enhances the feature description of hand-shape like concave regions. The concept of sub-units, that uses HMM models based on linguistic units smaller than whole sign and covers inner structures of the signs, was investigated in the proposed iterative method that is a first required step for data-driven construction of sub-units, and shows that such a concept is suitable for sign modeling and recognition tasks. Except of experiments in the sign language recognition, additional tool \textit{search by example} was created and evaluated. This tool is a search engine for sign language videos. Such a system can be incorporated into an online sign language dictionary where it is difficult to search in the sign language data. This proposed tool employs several methods which were examined in the sign language recognition task and allows to search in the video corpora based on an user-given query that consists of one or multiple images of hands. As a result, an ordered list of videos that contain the same or similar hand configurations is returned

    Action-Gons: Action recognition with a discriminative dictionary of structured elements with varying granularity

    Get PDF
    LNCS v. 9007 entitled: Computer Vision -- ACCV 2014: 12th Asian Conference on Computer ..., Part 5This paper presents “Action-Gons”, a middle level representation for action recognition in videos. Actions in videos exhibit a reasonable level of regularity seen in human behavior, as well as a large degree of variation. One key property of action, compared with image scene, might be the amount of interaction among body parts, although scenes also observe structured patterns in 2D images. Here, we study highorder statistics of the interaction among regions of interest in actions and propose a mid-level representation for action recognition, inspired by the Julesz school of n-gon statistics. We propose a systematic learning process to build an over-complete dictionary of “Action-Gons”. We first extract motion clusters, named as action units, then sequentially learn a pool of action-gons with different granularities modeling different degree of interactions among action units. We validate the discriminative power of our learned action-gons on three challenging video datasets and show evident advantages over the existing methods. © Springer International Publishing Switzerland 2015.postprin

    Spatiotemporal visual analysis of human actions

    No full text
    In this dissertation we propose four methods for the recognition of human activities. In all four of them, the representation of the activities is based on spatiotemporal features that are automatically detected at areas where there is a significant amount of independent motion, that is, motion that is due to ongoing activities in the scene. We propose the use of spatiotemporal salient points as features throughout this dissertation. The algorithms presented, however, can be used with any kind of features, as long as the latter are well localized and have a well-defined area of support in space and time. We introduce the utilized spatiotemporal salient points in the first method presented in this dissertation. By extending previous work on spatial saliency, we measure the variations in the information content of pixel neighborhoods both in space and time, and detect the points at the locations and scales for which this information content is locally maximized. In this way, an activity is represented as a collection of spatiotemporal salient points. We propose an iterative linear space-time warping technique in order to align the representations in space and time and propose to use Relevance Vector Machines (RVM) in order to classify each example into an action category. In the second method proposed in this dissertation we propose to enhance the acquired representations of the first method. More specifically, we propose to track each detected point in time, and create representations based on sets of trajectories, where each trajectory expresses how the information engulfed by each salient point evolves over time. In order to deal with imperfect localization of the detected points, we augment the observation model of the tracker with background information, acquired using a fully automatic background estimation algorithm. In this way, the tracker favors solutions that contain a large number of foreground pixels. In addition, we perform experiments where the tracked templates are localized on specific parts of the body, like the hands and the head, and we further augment the tracker’s observation model using a human skin color model. Finally, we use a variant of the Longest Common Subsequence algorithm (LCSS) in order to acquire a similarity measure between the resulting trajectory representations, and RVMs for classification. In the third method that we propose, we assume that neighboring salient points follow a similar motion. This is in contrast to the previous method, where each salient point was tracked independently of its neighbors. More specifically, we propose to extract a novel set of visual descriptors that are based on geometrical properties of three-dimensional piece-wise polynomials. The latter are fitted on the spatiotemporal locations of salient points that fall within local spatiotemporal neighborhoods, and are assumed to follow a similar motion. The extracted descriptors are invariant in translation and scaling in space-time. Coupling the neighborhood dimensions to the scale at which the corresponding spatiotemporal salient points are detected ensures the latter. The descriptors that are extracted across the whole dataset are subsequently clustered in order to create a codebook, which is used in order to represent the overall motion of the subjects within small temporal windows.Finally,we use boosting in order to select the most discriminative of these windows for each class, and RVMs for classification. The fourth and last method addresses the joint problem of localization and recognition of human activities depicted in unsegmented image sequences. Its main contribution is the use of an implicit representation of the spatiotemporal shape of the activity, which relies on the spatiotemporal localization of characteristic ensembles of spatiotemporal features. The latter are localized around automatically detected salient points. Evidence for the spatiotemporal localization of the activity is accumulated in a probabilistic spatiotemporal voting scheme. During training, we use boosting in order to create codebooks of characteristic feature ensembles for each class. Subsequently, we construct class-specific spatiotemporal models, which encode where in space and time each codeword ensemble appears in the training set. During testing, each activated codeword ensemble casts probabilistic votes concerning the spatiotemporal localization of the activity, according to the information stored during training. We use a Mean Shift Mode estimation algorithm in order to extract the most probable hypotheses from each resulting voting space. Each hypothesis corresponds to a spatiotemporal volume which potentially engulfs the activity, and is verified by performing action category classification with an RVM classifier
    corecore