139 research outputs found

    Apprentissage profond de formes manuscrites pour la reconnaissance et le repérage efficace de l'écriture dans les documents numérisés

    Despite considerable effort from the document analysis community, defining a robust representation for handwritten shapes remains a major challenge. Such a representation cannot be defined explicitly by a set of rules; it must instead be obtained through intelligent extraction of high-level features from document images. In this thesis, deep learning models are investigated for the automatic representation of handwritten shapes. The representations produced by these models are used to build a system for recognizing and spotting individual words in documents. The choice to process words individually is motivated by the fact that any text can be segmented into a set of separate words. In a first contribution, a deep unsupervised representation is proposed for the task of handwritten word spotting. This representation is based on the spherical k-means clustering algorithm, which is used to build a hierarchy of parametric functions that encode document images. This representation has several advantages. First, it is defined in an unsupervised manner, which removes the need for annotated training data. Second, it is fast to compute and compact in size, which allows words to be spotted efficiently. In a second contribution, an end-to-end model is developed for handwritten word recognition. This model consists of a convolutional neural network that takes a word image as input and outputs a representation of the recognized text. The text is represented as a set of bidirectional character sub-sequences forming a hierarchy. This representation differs from existing approaches in the literature and offers several advantages over them.
In particular, it is binary and has a fixed size, which makes it robust to the length of the text. It also captures the distribution of character sub-sequences in the training corpus, and thus allows the trained model to transfer this knowledge to new words containing the same sub-sequences. In a third and final contribution, an end-to-end model is proposed to solve the spotting and recognition tasks simultaneously. This model jointly embeds word texts and word images in a single vector space. An image is projected into this space by a convolutional neural network trained to detect the different shapes of characters. Likewise, a word is projected into this space by a recurrent neural network. The proposed model is trained so that the image of a word and its text are projected to the same point. In the learned vector space, the spotting and recognition tasks can be handled efficiently as a nearest-neighbor search problem.
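
    The nearest-neighbor formulation in the third contribution can be illustrated with a minimal sketch. It is not the thesis implementation: the three-dimensional vectors, the word list, the image names, the helper names (`recognize`, `spot`) and the choice of cosine similarity are all assumptions made for illustration. In the thesis the embeddings would come from the trained convolutional and recurrent networks.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# Hypothetical embeddings standing in for the outputs of the text network.
text_embeddings = {
    "lettre": [0.9, 0.1, 0.0],
    "maison": [0.1, 0.9, 0.1],
    "papier": [0.0, 0.2, 0.9],
}
# Hypothetical embeddings standing in for the outputs of the image network.
image_embeddings = {
    "img_001": [0.8, 0.2, 0.1],
    "img_002": [0.1, 0.1, 0.95],
    "img_003": [0.2, 0.85, 0.0],
}

def recognize(image_vec, lexicon):
    """Recognition: return the lexicon word whose embedding is closest to the image."""
    return max(lexicon, key=lambda word: cosine(image_vec, lexicon[word]))

def spot(query_word, lexicon, images, top_k=2):
    """Spotting: rank word images by similarity to the embedding of a text query."""
    query_vec = lexicon[query_word]
    ranked = sorted(images, key=lambda name: cosine(query_vec, images[name]), reverse=True)
    return ranked[:top_k]

print(recognize(image_embeddings["img_001"], text_embeddings))
print(spot("papier", text_embeddings, image_embeddings, top_k=1))
```

    Because both tasks reduce to the same similarity search, a single index over the shared space could serve recognition (image to text) and spotting (text to image).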

    HUMAN ACTIVITY RECOGNITION FROM EGOCENTRIC VIDEOS AND ROBUSTNESS ANALYSIS OF DEEP NEURAL NETWORKS

    In recent years, there has been a significant amount of research on human activity classification relying either on Inertial Measurement Unit (IMU) data or on data from static cameras providing a third-person view. There has been relatively less work using wearable cameras, which provide an egocentric (first-person) view of the environment as seen by the wearer. Using only IMU data limits the variety and complexity of the activities that can be detected. Deep machine learning has achieved great success in image and video processing in recent years, and neural-network-based models provide improved accuracy in multiple fields of computer vision. However, there has been relatively less work on designing specific models to improve performance on egocentric image and video tasks. As deep neural networks keep improving accuracy on computer vision tasks, their robustness and resilience should be improved as well so that they can be applied in safety-critical areas such as autonomous driving. Motivated by these considerations, the first part of the thesis addresses the problem of human activity detection and classification from egocentric cameras. First, a new method is presented to count the number of footsteps and compute the total traveled distance by using data from the IMU sensors and camera of a smartphone. By incorporating data from multiple sensor modalities and calculating the length of each step, instead of using preset stride lengths and assuming equal-length steps, the proposed method provides much higher accuracy than commercially available step-counting apps. After footstep counting, more complicated human activities, such as the steps of preparing a recipe and sitting on a sofa, are taken into consideration. Multiple classification methods, both non-deep-learning and deep-learning-based, are presented, which employ both egocentric camera and IMU data.
Then, a Genetic Algorithm-based approach is employed to set the parameters of an activity classification network autonomously, and its performance is compared with that of empirically set parameters. Next, a new framework is introduced to reduce the computational cost of temporal human activity recognition from egocentric videos while maintaining accuracy at a comparable level. The actor-critic model of reinforcement learning is applied to optical flow data to locate a bounding box around the region of interest, which is then used for clipping a sub-image from a video frame. A shallow and a deeper 3D convolutional neural network are designed to process the original image and the clipped image region, respectively. Next, a systematic method is introduced that autonomously and simultaneously optimizes multiple parameters of any deep neural network by using a bi-generative adversarial network (Bi-GAN) to guide a genetic algorithm (GA). The proposed Bi-GAN allows the autonomous exploitation and choice of the number of neurons for the fully-connected layers, and the number of filters for the convolutional layers, from a large range of values. The Bi-GAN involves two generators, and two different models compete and improve each other progressively with a GAN-based strategy to optimize the networks during a GA evolution. In this analysis, three different neural network layer types and datasets are taken into consideration. First, 3D convolutional layers for the ModelNet40 dataset: the proposed approach is applied to a 3D convolutional network using ModelNet40, a dataset of 3D point clouds, where the goal is to perform shape classification over 40 shape classes. Second, LSTM layers for the UCI HAR dataset: UCI HAR is composed of Inertial Measurement Unit (IMU) data captured during the activities of standing, sitting, laying, walking, walking upstairs and walking downstairs.
These activities were performed by 30 subjects, and the 3-axial linear acceleration and 3-axial angular velocity were collected at a constant rate of 50 Hz. Third, 2D convolutional layers for the Chars74k dataset: Chars74k contains 64 classes (0-9, A-Z, a-z), with 7705 characters obtained from natural images, 3410 characters hand-drawn on a tablet PC and 62992 characters synthesised from computer fonts, giving a total of over 74K images. In the final part of the thesis, the robustness and resilience of neural network models are investigated with respect to adversarial examples (AEs) and autonomous driving conditions. The transferability of adversarial examples across a wide range of real-world computer vision tasks, including image classification, explicit content detection, optical character recognition (OCR), and object detection, is investigated. This represents the cybercriminal’s situation, where an ensemble of different detection mechanisms needs to be evaded all at once. A novel Dispersion Reduction (DR) attack is designed, a practical attack that overcomes existing attacks’ limitation of requiring task-specific loss functions by targeting the “dispersion” of an internal feature map. In the autonomous driving scenario, adversarial machine learning attacks against the complete visual perception pipeline are studied. A novel attack technique, tracker hijacking, is presented that can effectively fool Multi-Object Tracking (MOT) using AEs on object detection. Using this technique, successful AEs on as few as one single frame can move an existing object into or out of the headway of an autonomous vehicle, causing potential safety hazards.
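
    The idea of letting a genetic algorithm choose the number of fully-connected neurons and convolutional filters can be sketched as follows. This is a plain GA only: the Bi-GAN guidance described in the abstract is omitted. The search ranges, the mutation steps, and above all `toy_fitness` (a hypothetical stand-in for the validation accuracy of a trained network) are assumptions chosen so that the example runs on its own.

```python
import random

FC_RANGE = (16, 512)    # assumed search range for fully-connected neurons
CONV_RANGE = (8, 128)   # assumed search range for convolutional filters

def toy_fitness(genome):
    """Hypothetical stand-in for validation accuracy; it peaks at (256, 64)."""
    fc, conv = genome
    return -((fc - 256) / 256) ** 2 - ((conv - 64) / 64) ** 2

def clip(value, bounds):
    """Keep a gene inside its allowed range."""
    return max(bounds[0], min(bounds[1], value))

def mutate(genome, rng):
    """Perturb both genes by a small random integer step."""
    fc, conv = genome
    return (clip(fc + rng.randint(-32, 32), FC_RANGE),
            clip(conv + rng.randint(-8, 8), CONV_RANGE))

def crossover(a, b, rng):
    """Take one gene from each parent."""
    return (a[0], b[1]) if rng.random() < 0.5 else (b[0], a[1])

def evolve(generations=30, pop_size=12, seed=0):
    rng = random.Random(seed)
    population = [(rng.randint(*FC_RANGE), rng.randint(*CONV_RANGE))
                  for _ in range(pop_size)]
    history = []  # best fitness at the start of each generation
    for _ in range(generations):
        population.sort(key=toy_fitness, reverse=True)
        history.append(toy_fitness(population[0]))
        # Truncation selection: the better half survives unchanged (elitism).
        parents = population[: pop_size // 2]
        children = [mutate(crossover(rng.choice(parents), rng.choice(parents), rng), rng)
                    for _ in range(pop_size - len(parents))]
        population = parents + children
    return max(population, key=toy_fitness), history

best, history = evolve()
print(best, toy_fitness(best))
```

    In a real setting each fitness evaluation would require training and validating a network, which is the cost that guidance mechanisms such as the one in the abstract aim to reduce.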

    Transfer learning for multi-channel time-series Human Activity Recognition

    Abstract for the PhD thesis "Transfer Learning for Multi-Channel Time-Series Human Activity Recognition". Methods of human activity recognition (HAR) have been developed for the purpose of automatically classifying recordings of human movements into a set of activities. Capturing, evaluating, and analysing sequential data to recognise human activities accurately is critical for many applications in pervasive and ubiquitous computing, e.g., mobile- or ambient-assisted living, smart homes, activities of daily living, health support and rehabilitation, sports, automotive surveillance, and Industry 4.0. For example, HAR is particularly interesting for optimisation in those industries where manual work remains dominant. HAR takes as input signals from videos or from multi-channel time-series, e.g., human joint measurements from marker-based motion capture systems and inertial measurements recorded by wearables or on-body devices. Wearables have become relevant as they extend the potential of HAR beyond constrained or laboratory settings. This thesis focuses on HAR using multi-channel time-series. Multi-channel time-series HAR is, in general, a challenging classification task, because human activities and movements show a large variation. Humans carry out semantically very distinct activities in a similar manner; conversely, they carry out similar activities in many different ways. Furthermore, multi-channel time-series HAR datasets suffer from the class-unbalance problem, with more samples of certain activities than of others. This problem strongly depends on the annotation. Moreover, the definitions of human activities used for annotation are not standardised. Methods based on Deep Neural Networks (DNNs) are prevalent for multi-channel time-series HAR. Nevertheless, the performance of DNNs has not increased as significantly as in other fields such as image classification or segmentation.
DNNs show low sample efficiency as they learn the temporal structure of activities entirely from data. For supervised DNNs, the scarcity of annotated data is the primary concern. Annotated data of human behaviour is scarce and costly to obtain, and the annotation process demands enormous resources. Additionally, annotation reliability varies because annotations can be subject to human errors or to unclear and insufficiently elaborated annotation protocols. Transfer learning has been used to cope with a limited amount of annotated data, overfitting, zero-shot learning or the classification of unseen human activities, and the class-unbalance problem. Transfer learning can alleviate the scarcity of annotated data: learnt parameters and feature representations from a specific source domain are transferred to a target domain. It thereby extends the usability of large annotated datasets from source domains to related problems. This thesis proposes a general transfer learning approach to improve automatic multi-channel time-series HAR. The proposed method combines a semantic attribute representation of activities with a specific deep neural network. It handles situations where the source and target domains differ, i.e., the sensor space and the set of activities change, without needing a large amount of annotated data from the target domain. The method considers different levels of transferability. First, an architecture handles a variety of dataset configurations with regard to the number of devices and their type; it creates fixed-size representations of sensor recordings that are representative of the human limbs. These networks process sequences of movements of the human limbs, either from poses or from inertial measurements.
Second, it introduces a search for semantic attribute representations that favourably represent signal segments for recognising human activities in unknown scenarios, where datasets contain only activity annotations and lack human-annotated semantic attributes. Third, it covers transferability from a variety of source datasets. The method takes advantage of a large human-pose dataset as a source domain, which was created during the development of this thesis. Furthermore, synthetic inertial measurements are derived from sequences of human poses, taken either from a marker-based motion capture system or from video-based and pose-based HAR datasets. The latter specifically use the annotated pixel coordinates of human poses as multi-channel time-series data. Real inertial measurements and these synthetic measurements are then deployed as a source domain for parameter transfer learning. Experimentation on different target datasets demonstrates that the proposed transfer learning method improves performance, most evidently when only a proportion of their training material is used. This outcome suggests that the temporal convolutional filters are rather general, as they learn local temporal relations of human movements related to the semantic attributes, independent of the number of devices and their type. A human-limb-oriented deep architecture and an evolutionary algorithm provide an off-the-shelf predictor of semantic attributes that can be deployed directly in a new target scenario. Closely related problems can be addressed directly by manually specifying the attribute-to-activity relations, without the need for a search through an evolutionary algorithm. In addition, the learnt convolutional filters are activity-class dependent. Hence, the classification performance on the activities shared among the datasets improves.
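
    Manually specified attribute-to-activity relations can be turned into a classifier in a few lines, as sketched below. This is a sketch under assumptions, not the method of the thesis: the attribute names, the binary attribute table, the example predictions and the use of binary cross-entropy as the matching score are all hypothetical. A network is assumed to output one probability per semantic attribute for each signal segment.

```python
import math

# Hypothetical semantic attributes and a manually specified attribute-to-activity table.
ATTRIBUTES = ["legs_moving", "arms_moving", "upright", "vertical_displacement"]
ATTRIBUTE_TABLE = {
    "walking":          [1, 1, 1, 0],
    "walking_upstairs": [1, 1, 1, 1],
    "sitting":          [0, 0, 0, 0],
    "standing":         [0, 0, 1, 0],
}

def binary_cross_entropy(pred, target, eps=1e-7):
    """Mean binary cross-entropy between predicted probabilities and a 0/1 target."""
    total = 0.0
    for p, t in zip(pred, target):
        p = min(max(p, eps), 1 - eps)  # avoid log(0)
        total -= t * math.log(p) + (1 - t) * math.log(1 - p)
    return total / len(pred)

def classify(pred, table):
    """Return the activity whose attribute vector best matches the prediction."""
    return min(table, key=lambda activity: binary_cross_entropy(pred, table[activity]))

# Assumed attribute probabilities, as a network might predict for two segments.
print(classify([0.9, 0.8, 0.95, 0.1], ATTRIBUTE_TABLE))
print(classify([0.1, 0.2, 0.9, 0.05], ATTRIBUTE_TABLE))
```

    With this arrangement, adding a new activity to a target scenario only requires a new row in the table, provided the attribute predictor transfers.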

    Detection of Driver Drowsiness and Distraction Using Computer Vision and Machine Learning Approaches

    Drowsiness and distracted driving are leading factors in most car crashes and near-crashes. This research study investigates the application of both conventional computer vision and deep learning approaches to the detection of drowsiness and distraction in drivers. In the first part of this MPhil research study, conventional computer vision approaches were studied to develop a robust drowsiness and distraction detection system based on yawning detection, head pose detection and eye blinking detection. These algorithms were implemented using existing human-crafted features. Experiments on detection and classification were performed with small image datasets to evaluate and measure the performance of the system. It was observed that the use of human-crafted features together with a robust classifier such as an SVM gives better performance than previous approaches. Although the results were satisfactory, there are many drawbacks and challenges associated with conventional computer vision approaches, such as the definition and extraction of human-crafted features, which make these conventional algorithms subjective in nature and less adaptive in practice. In contrast, deep learning approaches automate the feature selection process and can be trained to learn the most discriminative features without any human input. In the second half of this research study, the use of deep learning approaches for the detection of distracted driving was investigated. It was observed that the advantages of the applied methodology for distraction detection include the contribution of CNN enhancement to better pattern recognition accuracy and its ability to learn features from various regions of a human body simultaneously.
The performance of four deep convolutional network architectures (AlexNet, ResNet, MobileNet and NASNet) was compared, triplet training was investigated, and the impact of combining a support vector classifier (SVC) with a trained deep net was explored. The images used in the experiments with the deep nets are from the State Farm Distracted Driver Detection dataset hosted on Kaggle, each of which captures the entire body of a driver. The best results were obtained with NASNet trained using triplet loss and combined with an SVC. It was observed that one of the advantages of deep learning approaches is their ability to learn discriminative features from various regions of a human body simultaneously. This ability has enabled deep learning approaches to reach human-level accuracy.
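
    For readers unfamiliar with triplet training, the commonly used triplet margin loss can be sketched as below. The abstract does not state which variant, distance, or margin was used, so the Euclidean distance, the margin of 1.0 and the two-dimensional example embeddings are assumptions. The loss is zero once the anchor is closer to the positive (same class) than to the negative (different class) by at least the margin.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two equal-length embeddings."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Hinge on d(anchor, positive) - d(anchor, negative) + margin."""
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)

anchor = [0.0, 0.0]
# Negative is far away: the triplet is already satisfied.
print(triplet_loss(anchor, positive=[0.0, 1.0], negative=[3.0, 4.0]))  # 0.0
# Negative is as close as the positive: the loss equals the margin.
print(triplet_loss(anchor, positive=[0.0, 1.0], negative=[1.0, 0.0]))  # 1.0
```

    Embeddings trained this way tend to cluster by class, which is one reason a separate classifier such as an SVC can be fitted on top of them.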