6 research outputs found

    Pattern Recognition - 40th German Conference, GCPR 2018, Stuttgart, Germany, October 9-12, 2018, Proceedings

    No full text

    Information Maximization Clustering via Multi-View Self-Labelling

    Full text link
    Image clustering is a particularly challenging computer vision task that aims to generate annotations without human supervision. Recent advances focus on the use of self-supervised learning strategies in image clustering, first learning valuable semantics and then clustering the image representations. These multi-phase algorithms, however, increase the computational time, and their final performance is reliant on the first stage. Extending the self-supervised approach, we propose a novel single-phase clustering method that simultaneously learns meaningful representations and assigns the corresponding annotations. This is achieved by integrating a discrete representation into the self-supervised paradigm through a classifier network. Specifically, the proposed clustering objective employs mutual information and maximizes the dependency between the integrated discrete representation and a discrete probability distribution. The discrete probability distribution is derived through the self-supervised process by comparing the learnt latent representation with a set of trainable prototypes. To enhance the learning performance of the classifier, we jointly apply the mutual information across multi-crop views. Our empirical results show that the proposed framework outperforms state-of-the-art techniques, with average accuracies of 89.1% and 49.0% on the CIFAR-10 and CIFAR-100/20 datasets, respectively. Finally, the proposed method also demonstrates attractive robustness to parameter settings, making it readily applicable to other datasets.
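    The abstract does not give the objective in code, but a minimal PyTorch sketch of a symmetric mutual-information clustering loss between two views of the same batch (in the spirit of such objectives) is shown below. The function name `mutual_information_loss` and the two-view formulation are illustrative assumptions; the paper's prototype comparison and multi-crop weighting are not reproduced here.

```python
import torch

def mutual_information_loss(p1: torch.Tensor, p2: torch.Tensor,
                            eps: float = 1e-8) -> torch.Tensor:
    """Negative mutual information between the cluster assignments of
    two augmented views.

    p1, p2: (N, K) soft cluster probabilities for two views of the
    same N images (e.g., outputs of a classifier head).
    """
    # Joint distribution over cluster pairs, averaged over the batch.
    joint = p1.t() @ p2 / p1.size(0)     # (K, K), sums to 1
    joint = (joint + joint.t()) / 2      # symmetrize the two views
    joint = joint.clamp(min=eps)

    # Marginals of the joint distribution.
    pi = joint.sum(dim=1, keepdim=True)  # (K, 1)
    pj = joint.sum(dim=0, keepdim=True)  # (1, K)

    # I(X; Y) = sum_ij joint * (log joint - log pi - log pj)
    mi = (joint * (joint.log() - pi.log() - pj.log())).sum()
    return -mi                           # minimize the negative MI
```

    Maximizing this quantity pushes assignments of the two views to agree while keeping the marginal cluster distribution spread out, which is what discourages the degenerate solution of putting every image in one cluster.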

    Reconocimiento de la actividad humana mediante aprendizaje profundo en imágenes de vídeo y sobre dataset multimodal

    Get PDF
    Human Activity Recognition (HAR) has garnered a lot of attention due to the growing demand for video analysis applied to the medical field. However, predicting activities in a video sequence is not trivial, since numerous factors, such as lighting or the viewpoint, affect recognition. The purpose of this work is to carry out Human Activity Recognition using Deep Learning, more specifically through a neural network that classifies image sequences. 3D convolutional layers are used to extract image features, and residual blocks are used to mitigate the vanishing-gradient problem observed in networks with a large number of layers. Previous works have estimated poses in the same video sequences and have also carried out HAR through Deep Learning using data acquired from sensors. Due to the growing popularity of optical capture systems for data acquisition, a large number of benchmark datasets have emerged. Nevertheless, this work focuses on the recognition of activities relevant to the medical field; consequently, the dataset employed is the one acquired by the research group. In total, 13 activities carried out by 37 different subjects have been classified. The network has been trained both from scratch and by transferring learning from a previously trained model. Using a pre-trained model allows the network to reach convergence faster, saving computational cost. In addition, the results exhibit the limitations of recognizing data from optical capture systems, such as the difficulty of classifying activities with reduced movement, or bimanual activities.
    Departamento de Teoría de la Señal y Comunicaciones e Ingeniería Telemática. Grado en Ingeniería de Tecnologías de Telecomunicación.
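    The abstract names its two main architectural ingredients, 3D convolutions and residual blocks, without detailing them. Below is a minimal PyTorch sketch of a basic 3D-convolutional residual block of the kind commonly used for video classification; the class name and layer sizes are illustrative assumptions, not the thesis's exact architecture.

```python
import torch
import torch.nn as nn

class ResidualBlock3D(nn.Module):
    """Basic ResNet-style residual block with 3D convolutions,
    operating on clips shaped (N, C, T, H, W)."""

    def __init__(self, in_channels: int, out_channels: int, stride: int = 1):
        super().__init__()
        self.conv1 = nn.Conv3d(in_channels, out_channels, kernel_size=3,
                               stride=stride, padding=1, bias=False)
        self.bn1 = nn.BatchNorm3d(out_channels)
        self.conv2 = nn.Conv3d(out_channels, out_channels, kernel_size=3,
                               stride=1, padding=1, bias=False)
        self.bn2 = nn.BatchNorm3d(out_channels)
        self.relu = nn.ReLU(inplace=True)

        # 1x1x1 projection so the skip connection matches in shape
        # when the channel count or stride changes.
        self.downsample = None
        if stride != 1 or in_channels != out_channels:
            self.downsample = nn.Sequential(
                nn.Conv3d(in_channels, out_channels, kernel_size=1,
                          stride=stride, bias=False),
                nn.BatchNorm3d(out_channels),
            )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        identity = x if self.downsample is None else self.downsample(x)
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        # The additive skip path keeps gradients flowing in deep stacks,
        # which is the vanishing-gradient mitigation the abstract refers to.
        return self.relu(out + identity)
```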

    Exploring Enhanced Motion Modeling Methods for Action Recognition

    Get PDF
    This thesis addresses three key issues in action recognition through enhanced motion modeling: handling complex motion variations, improving pseudo-label quality in semi-supervised settings, and incorporating explicit motion modeling into transformers. First, we propose to capture proper motion information, since motion dynamics such as moving tempo and action amplitude may vary considerably across video clips. To this end, we introduce a Motion Diversification and Selection (MoDS) module that generates diversified spatiotemporal motion features and selects the most appropriate motion representation for categorizing the input video. Second, we propose to improve pseudo-label quality in semi-supervised action recognition. Previous methods use only a single network to generate pseudo labels, and a single network is limited in capturing different motion patterns simultaneously. To this end, we advocate jointly training a pair of heterogeneous networks, i.e., a 2D CNN and a 3D CNN, to characterize different specific motion patterns simultaneously; we then utilize a label-propagation strategy within and across these networks to refine the pseudo labels (a simplified sketch follows below). Third, we propose to perform explicit motion modeling for transformers. We observe that transformer-based methods underperform on motion-sensitive datasets, indicating their limited capacity for temporal modeling. We also note that the conventional motion representation, the cost volume, is quite similar to the affinity matrix defined in self-attention, yet possesses powerful motion-modeling capacity. We therefore examine the essential properties of the cost volume for effective motion modeling and integrate them into self-attention to enhance the motion representation. Comprehensive experiments on widely used datasets confirm the effectiveness of the proposed methods, which prove superior to other advanced methods under different scenarios.
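    As a minimal PyTorch sketch of the second idea, the snippet below fuses the predictions of two heterogeneous networks on unlabeled clips into confidence-filtered pseudo labels. The function name, the simple prediction averaging, and the 0.8 threshold are illustrative assumptions; the thesis's actual label-propagation refinement within and across the networks is not reproduced here.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def fused_pseudo_labels(logits_2d: torch.Tensor,
                        logits_3d: torch.Tensor,
                        threshold: float = 0.8):
    """Fuse the predictions of a 2D CNN and a 3D CNN on a batch of
    unlabeled clips into pseudo labels, keeping only confident ones.

    logits_2d, logits_3d: (N, num_classes) raw outputs of the two networks.
    Returns (labels, mask): pseudo labels and a boolean mask marking
    the clips confident enough to be used for training.
    """
    # Average the two networks' class distributions so that motion
    # patterns captured by either architecture contribute to the label.
    probs = (F.softmax(logits_2d, dim=1) + F.softmax(logits_3d, dim=1)) / 2
    conf, labels = probs.max(dim=1)
    mask = conf >= threshold  # discard low-confidence pseudo labels
    return labels, mask
```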