    Analysis of the hands in egocentric vision: A survey

    Egocentric vision (a.k.a. first-person vision, FPV) applications have thrived over the past few years, thanks to the availability of affordable wearable cameras and large annotated datasets. The position of the wearable camera (usually mounted on the head) allows recording exactly what the camera wearers have in front of them, in particular hands and manipulated objects. This intrinsic advantage enables the study of the hands from multiple perspectives: localizing hands and their parts within the images; understanding what actions and activities the hands are involved in; and developing human-computer interfaces that rely on hand gestures. In this survey, we review the literature that focuses on the hands using egocentric vision, categorizing the existing approaches into: localization (where are the hands or parts of them?); interpretation (what are the hands doing?); and application (e.g., systems that use egocentric hand cues to solve a specific problem). Moreover, a list of the most prominent datasets with hand-based annotations is provided.

    To Design and Analyze an Method for Unsupervised Recurrent All-Pairs Optical Flow Field Transform Algorithm

    SMURF is a powerful technique for unsupervised learning of optical flow that closes the gap with supervised methods, exhibits good cross-dataset generalization, and even allows for "zero-shot" depth estimation. SMURF introduces several important improvements: full-image warping for learning to predict out-of-frame motion, multi-frame self-supervision for improved flow estimates in occluded regions, and most importantly, modifications to the unsupervised losses and data augmentation that allow the RAFT architecture to operate in an unsupervised setting. These developments, in our opinion, take unsupervised optical flow one step closer to becoming truly practical, enabling optical flow models trained on unlabeled videos to deliver accurate pixel matching in areas where labeled data is lacking.
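As a rough illustration of the warping-based self-supervision mentioned above, the following sketch warps the second frame back to the first with a given flow field and measures a plain L1 photometric error. It is a minimal NumPy example under stated assumptions, not SMURF's actual losses or its full-image warping: the function names `warp_backward` and `photometric_loss` are hypothetical, images are single-channel, and samples that fall outside the frame are simply masked out.

```python
import numpy as np


def warp_backward(image, flow):
    """Sample `image` (H, W) at positions shifted by `flow` (H, W, 2), stored as (dx, dy).

    Uses bilinear interpolation. Returns the warped image and a boolean mask
    marking pixels whose sampling position lies inside the frame.
    """
    h, w = image.shape
    ys, xs = np.mgrid[0:h, 0:w].astype(np.float64)
    x = xs + flow[..., 0]
    y = ys + flow[..., 1]
    valid = (x >= 0) & (x <= w - 1) & (y >= 0) & (y <= h - 1)

    # Clamp so that indexing is always legal; invalid pixels are masked later.
    x = np.clip(x, 0, w - 1)
    y = np.clip(y, 0, h - 1)
    x0 = np.floor(x).astype(int)
    y0 = np.floor(y).astype(int)
    x1 = np.minimum(x0 + 1, w - 1)
    y1 = np.minimum(y0 + 1, h - 1)
    wx = x - x0
    wy = y - y0

    top = (1 - wx) * image[y0, x0] + wx * image[y0, x1]
    bottom = (1 - wx) * image[y1, x0] + wx * image[y1, x1]
    return (1 - wy) * top + wy * bottom, valid


def photometric_loss(frame1, frame2, flow):
    """Mean absolute difference between frame1 and frame2 warped by `flow`."""
    warped, valid = warp_backward(frame2, flow)
    if not valid.any():
        return 0.0
    return float(np.abs(frame1 - warped)[valid].mean())
```

A flow field that explains the motion between the two frames yields a low loss, which is the signal an unsupervised method can minimize without ground-truth flow labels.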

    Deep Learning-Based Human Pose Estimation: A Survey

    Human pose estimation aims to locate the human body parts and build human body representation (e.g., body skeleton) from input data such as images and videos. It has drawn increasing attention during the past decade and has been utilized in a wide range of applications including human-computer interaction, motion analysis, augmented reality, and virtual reality. Although the recently developed deep learning-based solutions have achieved high performance in human pose estimation, there still remain challenges due to insufficient training data, depth ambiguities, and occlusion. The goal of this survey paper is to provide a comprehensive review of recent deep learning-based solutions for both 2D and 3D pose estimation via a systematic analysis and comparison of these solutions based on their input data and inference procedures. More than 240 research papers since 2014 are covered in this survey. Furthermore, 2D and 3D human pose estimation datasets and evaluation metrics are included. Quantitative performance comparisons of the reviewed methods on popular datasets are summarized and discussed. Finally, the challenges involved, applications, and future research directions are discussed. We also provide a regularly updated project page: https://github.com/zczcwh/DL-HPE

    Visual Representation Learning with Limited Supervision

    The quality of a Computer Vision system is proportional to the rigor of data representation it is built upon. Learning expressive representations of images is therefore the centerpiece to almost every computer vision application, including image search, object detection and classification, human re-identification, object tracking, pose understanding, image-to-image translation, and embodied agent navigation, to name a few. Deep Neural Networks are most often seen among the modern methods of representation learning. The limitation is, however, that deep representation learning methods require extremely large amounts of manually labeled data for training. Clearly, annotating vast amounts of images for various environments is infeasible due to cost and time constraints. This requirement of obtaining labeled data is a prime restriction on the pace of development of visual recognition systems. In order to cope with the exponentially growing amounts of visual data generated daily, machine learning algorithms have to at least strive to scale at a similar rate. The second challenge is that the learned representations have to generalize to novel objects, classes, environments and tasks in order to accommodate the diversity of the visual world. Despite the ever-growing number of recent publications tangentially addressing the topic of learning generalizable representations, efficient generalization is yet to be achieved. This dissertation attempts to tackle the problem of learning visual representations that can generalize to novel settings while requiring few labeled examples. In this research, we study the limitations of the existing supervised representation learning approaches and propose a framework that improves the generalization of learned features by exploiting visual similarities between images which are not captured by provided manual annotations. 
Furthermore, to mitigate the common requirement of large scale manually annotated datasets, we propose several approaches that can learn expressive representations without human-attributed labels, in a self-supervised fashion, by grouping highly-similar samples into surrogate classes based on progressively learned representations. The development of computer vision as a science is preconditioned upon the seamless ability of a machine to record and disentangle pictures' attributes that were expected to only be conceived by humans. As such, particular interest was dedicated to the ability to analyze the means of artistic expression and style, which is a more complex task than merely breaking an image down to colors and pixels. The ultimate test for this ability is the task of style transfer, which involves altering the style of an image while keeping its content. An effective solution of style transfer requires learning an image representation that allows disentangling image style and content. Moreover, particular artistic styles come with idiosyncrasies that affect which content details should be preserved and which discarded. Another pitfall here is that it is impossible to get pixel-wise annotations of style and how the style should be altered. We address this problem by proposing an unsupervised approach that enables encoding the image content in the way required by a particular style. The proposed approach exchanges the style of an input image by first extracting the content representation in a style-aware way and then rendering it in a new style using a style-specific decoder network, achieving compelling results in image and video stylization. Finally, we combine supervised and self-supervised representation learning techniques for the task of human and animal pose understanding. 
The proposed method enables transfer of the representation learned for recognition of human poses to proximal mammal species without using labeled animal images. This approach is not limited to dense pose estimation and could potentially enable autonomous agents, from robots to self-driving cars, to retrain themselves and adapt to novel environments based on learning from previous experiences.

    A deep learning solution for real-time human motion decoding in smart walkers

    Integrated master's dissertation in Biomedical Engineering (specialization in Medical Electronics). The treatment of gait impairments has increasingly relied on rehabilitation therapies which benefit from the use of smart walkers. These walkers still lack advanced and seamless Human-Robot Interaction that intuitively understands the intentions of human motion, empowering the user’s recovery and autonomy while reducing the physician’s effort. This dissertation proposes the development of a deep learning solution to tackle the human motion decoding problem in smart walkers, using only lower-body vision information from a camera stream, mounted on the WALKit Smart Walker, a smart walker prototype for rehabilitation purposes. Different deep learning frameworks were designed for early human motion recognition and detection. A custom acquisition method, including a smart walker’s automatic driving algorithm and labelling procedure, was also designed to enable further training and evaluation of the proposed frameworks. Facing a 4-class (stop, walk, turn right/left) classification problem, a deep learning convolutional model with an attention mechanism achieved the best results: an offline f1-score of 99.61%, an online calibrated instantaneous precision higher than 97% and a human-centred focus slightly higher than 30%. Promising results were attained for early human motion detection, with enhancements in the focus of the proposed architectures. However, further improvements are still needed to achieve a more reliable solution for integration in a smart walker’s control strategy, based on the human motion intentions.
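To make the reported f1-score concrete, the sketch below computes an F1 score for a 4-class problem like the one described (stop, walk, turn right/left). The abstract does not say how the per-class scores were averaged, so the unweighted (macro) average used here is an assumption, and the label strings and the function name `macro_f1` are hypothetical.

```python
from collections import Counter

# Hypothetical label names for the four motion classes named in the abstract.
CLASSES = ("stop", "walk", "turn_right", "turn_left")


def macro_f1(y_true, y_pred, classes=CLASSES):
    """Unweighted mean of per-class F1 scores.

    Per class: F1 = 2*TP / (2*TP + FP + FN). A class with no true or
    predicted samples contributes 0.0 to the average.
    """
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative

    scores = []
    for c in classes:
        denom = 2 * tp[c] + fp[c] + fn[c]
        scores.append(2 * tp[c] / denom if denom else 0.0)
    return sum(scores) / len(scores)
```

For example, misclassifying one of two "stop" frames as "walk" lowers the F1 of both classes, even though only one prediction is wrong.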

    The Wits intelligent teaching system (WITS): a smart lecture theatre to assess audience engagement

    A Thesis submitted to the Faculty of Science, University of the Witwatersrand, Johannesburg, in fulfilment of the requirements for the degree of Doctor of Philosophy, 2017. The utility of lectures is directly related to the engagement of the students therein. To ensure the value of lectures, one needs to be certain that they are engaging to students. In small classes experienced lecturers develop an intuition of how engaged the class is as a whole and can then react appropriately to remedy the situation through various strategies such as breaks or changes in style, pace and content. As both the number of students and size of the venue grow, this type of contingent teaching becomes increasingly difficult and less precise. Furthermore, relying on intuition alone gives no way to recall and analyse previous classes or to objectively investigate trends over time. To address these problems this thesis presents the WITS INTELLIGENT TEACHING SYSTEM (WITS) to highlight disengaged students during class. A web-based, mobile application called Engage was developed to try to elicit anonymous engagement information directly from students. The majority of students were unwilling or unable to self-report their engagement levels during class. This stems from a number of cultural and practical issues related to social display rules, unreliable internet connections, data costs, and distractions. This result highlights the need for a non-intrusive system that does not require the active participation of students. A non-intrusive, computer vision and machine learning based approach is therefore proposed. To support the development thereof, a labelled video dataset of students was built by recording a number of first year lectures. Students were labelled across a number of affects – including boredom, frustration, confusion, and fatigue – but poor inter-rater reliability meant that these labels could not be used as ground truth. 
Based on manual coding methods identified in the literature, a number of actions, gestures, and postures were identified as proxies of behavioural engagement. These proxies are then used in an observational checklist to mark students as engaged or not. A Support Vector Machine (SVM) was trained on Histograms of Oriented Gradients (HOG) to classify the students based on the identified behaviours. The results suggest a high temporal correlation of a single subject’s video frames. This leads to extremely high accuracies on seen subjects. However, this approach generalised poorly to unseen subjects and more careful feature engineering is required. The use of Convolutional Neural Networks (CNNs) improved the classification accuracy substantially, both over a single subject and when generalising to unseen subjects. While more computationally expensive than the SVM, the CNN approach lends itself to parallelism using Graphics Processing Units (GPUs). With GPU hardware acceleration, the system is able to run in near real-time and with further optimisations a real-time classifier is feasible. The classifier provides engagement values, which can be displayed to the lecturer live during class. This information is displayed as an Interest Map which highlights spatial areas of disengagement. The lecturer can then make informed decisions about how to progress with the class, what teaching styles to employ, and on which students to focus. An Interest Map was presented to lecturers and professors at the University of the Witwatersrand yielding 131 responses. The vast majority of respondents indicated that they would like to receive live engagement feedback during class, that they found the Interest Map an intuitive visualisation tool, and that they would be interested in using such technology. 
Contributions of this thesis include the development of a labelled video dataset; the development of a web based system to allow students to self-report engagement; the development of cross-platform, open-source software for spatial, action and affect labelling; the application of Histogram of Oriented Gradient based Support Vector Machines, and Deep Convolutional Neural Networks to classify this data; the development of an Interest Map to intuitively display engagement information to presenters; and finally an analysis of acceptance of such a system by educators.

    Coping with Data Scarcity in Deep Learning and Applications for Social Good

    The recent years are experiencing an extremely fast evolution of the Computer Vision and Machine Learning fields: several application domains benefit from the newly developed technologies and industries are investing a growing amount of money in Artificial Intelligence. Convolutional Neural Networks and Deep Learning substantially contributed to the rise and the diffusion of AI-based solutions, creating the potential for many disruptive new businesses. The effectiveness of Deep Learning models is grounded by the availability of a huge amount of training data. Unfortunately, data collection and labeling is an extremely expensive task in terms of both time and costs; moreover, it frequently requires the collaboration of domain experts. In the first part of the thesis, I will investigate some methods for reducing the cost of data acquisition for Deep Learning applications in the relatively constrained industrial scenarios related to visual inspection. I will primarily assess the effectiveness of Deep Neural Networks in comparison with several classical Machine Learning algorithms requiring a smaller amount of data to be trained. Hereafter, I will introduce a hardware-based data augmentation approach, which leads to a considerable performance boost taking advantage of a novel illumination setup designed for this purpose. Finally, I will investigate the situation in which acquiring a sufficient number of training samples is not possible, in particular the most extreme situation: zero-shot learning (ZSL), which is the problem of multi-class classification when no training data is available for some of the classes. Visual features designed for image classification and trained offline have been shown to be useful for ZSL to generalize towards classes not seen during training. 
Nevertheless, I will show that recognition performance on unseen classes can be sharply improved by learning ad hoc semantic embeddings (the pre-defined list of present and absent attributes that represent a class) and visual features, to increase the correlation between the two geometrical spaces and ease the metric learning process for ZSL. In the second part of the thesis, I will present some successful applications of state-of-the-art Computer Vision, Data Analysis and Artificial Intelligence methods. I will illustrate some solutions developed during the 2020 Coronavirus Pandemic for controlling the disease evolution and for reducing virus spreading. I will describe the first publicly available dataset for the analysis of face-touching behavior that we annotated and distributed, and I will illustrate an extensive evaluation of several computer vision methods applied to the produced dataset. Moreover, I will describe the privacy-preserving solution we developed for estimating the “Social Distance” and its violations, given a single uncalibrated image in unconstrained scenarios. I will conclude the thesis with a Computer Vision solution developed in collaboration with the Egyptian Museum of Turin for digitally unwrapping mummies by analyzing their CT scans, to support archaeologists during mummy analysis and avoid the devastating and irreversible process of physically unwrapping the bandages to remove amulets and jewels from the body.

    Analysis of 3D human gait reconstructed with a depth camera and mirrors

    The problem of assessing human gait has received great attention in the literature, since gait analysis is one of the key components in healthcare. Marker-based and multi-camera systems are widely employed to deal with this problem. However, such systems usually require specific equipment with a high price and/or high computational cost. In order to reduce the cost of devices, we focus on a gait analysis system which employs only one depth sensor. The principle of our work is similar to multi-camera systems, but the collection of cameras is replaced by one depth sensor and mirrors. Each mirror in our setup plays the role of a camera which captures the scene from a different viewpoint. Since we use only one camera, the synchronization step can be avoided and the cost of devices is also reduced. Our studies can be separated into two categories: 3D reconstruction and gait analysis. The result of the former category is used as the input of the latter one. Our system for 3D reconstruction is built with a depth camera and two mirrors. Two types of depth sensor, which are distinguished by their scheme of depth estimation, have been employed in our work. 
With the structured light (SL) technique integrated into the Kinect 1, we perform the 3D reconstruction based on geometrical optics. In order to increase the level of detail of the 3D reconstructed model, the Kinect 2 with time-of-flight (ToF) depth measurement is used for image acquisition instead of the previous generation. However, due to multiple reflections on the mirrors, depth distortion occurs in our setup. We thus propose a simple approach for reducing such distortion before applying geometrical optics to reconstruct a point cloud of the 3D object. For the task of gait analysis, we propose various alternative approaches focusing on the problem of gait normality/symmetry measurement. They are expected to be useful for clinical treatments such as monitoring a patient's recovery after surgery. These methods consist of model-free and model-based approaches that have different pros and cons. In this dissertation, we present 3 methods that directly process point clouds reconstructed from the previous work. The first one uses cross-correlation of left and right half-bodies to assess gait symmetry while the other ones employ deep auto-encoders to measure gait normality.
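The idea of using cross-correlation as a symmetry measure can be sketched on a simplified input. The thesis applies it to reconstructed point clouds of the left and right half-bodies; the example below instead assumes each side has already been reduced to a 1-D signal over time (a hypothetical per-frame feature), and the function name `symmetry_score` is illustrative only.

```python
import numpy as np


def symmetry_score(left, right):
    """Peak normalized cross-correlation between two 1-D signals.

    Returns (score, lag). `score` is at most 1.0, with 1.0 meaning the
    signals have identical shape after shifting. A positive `lag` means
    `left` trails `right` by that many samples.
    """
    left = np.asarray(left, dtype=float)
    right = np.asarray(right, dtype=float)
    l = left - left.mean()
    r = right - right.mean()
    denom = np.sqrt((l ** 2).sum() * (r ** 2).sum())
    if denom == 0:
        return 0.0, 0  # at least one signal is constant

    # np.correlate in "full" mode covers lags -(len(r)-1) .. len(l)-1.
    xcorr = np.correlate(l, r, mode="full") / denom
    lags = np.arange(-(len(r) - 1), len(l))
    k = int(np.argmax(xcorr))
    return float(xcorr[k]), int(lags[k])
```

Under these assumptions, a symmetric gait would produce left and right signals of similar shape and hence a score close to 1, while the lag indicates the phase offset between the two sides.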

    Proceedings XXI Congresso SIAMOC 2021

    XXI Annual Congress of SIAMOC, held online on 30 September and 1 October 2021. As is tradition, the congress aims to be an occasion for enrichment and mutual exchange, from both a scientific and a human point of view. It will cover the classic topics of movement analysis, such as the development and application of methods for studying movement in the clinical context, as well as highly topical themes such as telerehabilitation and telemonitoring.