    Development of New Models for Vision-Based Human Activity Recognition

    Action recognition methods enable intelligent systems to recognize human actions in daily-life videos. However, many action recognition methods yield noticeable misclassification rates due to the large variations within videos of the same class and to changes in viewpoint, scale and background. To reduce the misclassification rate, we propose a new video representation method that captures the temporal evolution of the action happening in the whole video, a new method for human hand segmentation and a new method for human activity recognition in still images.
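
    The abstract does not spell out the representation itself, but the general idea of summarising a whole video's temporal evolution in one descriptor can be illustrated with rank pooling over per-frame CNN features. The sketch below is a minimal Python illustration, assuming a ResNet-18 backbone and a ridge-regression ranking objective; it is not the thesis's actual method.

```python
# Hypothetical sketch: encode the temporal evolution of a video as a single
# descriptor by rank-pooling per-frame CNN features. Backbone and objective
# are illustrative assumptions, not the thesis's exact representation.
import torch
import torchvision.models as models

def frame_features(frames: torch.Tensor) -> torch.Tensor:
    """frames: (T, 3, 224, 224) -> (T, 512) features from a ResNet-18 backbone."""
    backbone = models.resnet18(weights=None)
    backbone.fc = torch.nn.Identity()            # drop the classifier head
    backbone.eval()
    with torch.no_grad():
        return backbone(frames)

def temporal_evolution_descriptor(feats: torch.Tensor) -> torch.Tensor:
    """Solve a ranking regression: find w such that <w, v_t> ~ t, where v_t is
    the running mean of frame features up to time t. The weight vector w
    summarises how the features evolve across the whole video."""
    T = feats.shape[0]
    v = torch.cumsum(feats, dim=0) / torch.arange(1, T + 1).unsqueeze(1)
    t = torch.arange(1, T + 1, dtype=torch.float32)
    lam = 1.0                                    # ridge regulariser
    A = v.T @ v + lam * torch.eye(v.shape[1])    # closed-form ridge solution
    return torch.linalg.solve(A, v.T @ t)        # (512,) video-level descriptor

video = torch.randn(16, 3, 224, 224)             # 16 dummy RGB frames
descriptor = temporal_evolution_descriptor(frame_features(video))
print(descriptor.shape)                          # torch.Size([512])
```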

    Deep understanding of shopper behaviours and interactions using RGB-D vision

    In retail environments, understanding how shoppers move through a store's spaces and interact with products is very valuable. While the retail environment has several characteristics that favour computer vision, such as reasonable lighting, the large number and diversity of products sold, together with the potential ambiguity of shoppers' movements, mean that accurately measuring shopper behaviour remains challenging. Over the past years, machine-learning and feature-based tools for people counting, interaction analytics and re-identification were developed with the aim of modelling shopper behaviour from occlusion-free RGB-D cameras in a top-view configuration. With the shift to multimedia big data, machine-learning approaches evolved into deep-learning approaches, which deal with the complexities of human behaviour more powerfully and efficiently. This paper introduces VRAI, a novel deep-learning application that uses three convolutional neural networks to count the people passing or stopping in the camera area, perform top-view re-identification and measure shopper–shelf interactions from a single RGB-D video stream with near-real-time performance. The framework is evaluated on three new, publicly available datasets: TVHeads for people counting, HaDa for shopper–shelf interactions and TVPR2 for people re-identification. The experimental results show that the proposed methods significantly outperform all competing state-of-the-art methods (99.5% accuracy on people counting, 92.6% on interaction classification and 74.5% on re-identification), yielding significant insights for implicit and extensive shopper behaviour analysis in marketing applications.
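
    As a rough illustration of the three-network design, the following hedged PyTorch sketch wires one top-view RGB-D frame into three small convolutional heads for counting, re-identification and interaction classification. The architectures, interaction classes and input size are placeholder assumptions; the paper's actual VRAI networks are not detailed in the abstract.

```python
# Placeholder sketch of the three-task setup: one RGB-D top-view frame feeds
# three CNNs for counting, re-identification and shopper-shelf interaction.
import torch
import torch.nn as nn

def small_cnn(out_dim: int) -> nn.Module:
    return nn.Sequential(
        nn.Conv2d(4, 16, 3, stride=2, padding=1), nn.ReLU(),   # 4 = RGB + depth
        nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
        nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        nn.Linear(32, out_dim),
    )

counter = small_cnn(1)        # regresses the number of people in view
reid = small_cnn(128)         # 128-d embedding for top-view re-identification
interaction = small_cnn(3)    # assumed classes, e.g. {none, pick-up, put-back}

frame = torch.randn(1, 4, 240, 320)              # one dummy RGB-D frame
count = counter(frame)
embedding = nn.functional.normalize(reid(frame), dim=1)
action_logits = interaction(frame)
```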

    Vision Grid Transformer for Document Layout Analysis

    Document pre-trained models and grid-based models have proven very effective on various tasks in Document AI. However, for the document layout analysis (DLA) task, existing document pre-trained models, even those pre-trained in a multi-modal fashion, usually rely on either textual features or visual features. Grid-based models for DLA are multi-modal but largely neglect the effect of pre-training. To fully leverage multi-modal information and exploit pre-training techniques to learn better representations for DLA, in this paper we present VGT, a two-stream Vision Grid Transformer, in which a Grid Transformer (GiT) is proposed and pre-trained for 2D token-level and segment-level semantic understanding. Furthermore, a new dataset named D⁴LA, so far the most diverse and detailed manually annotated benchmark for document layout analysis, is curated and released. Experimental results show that the proposed VGT model achieves new state-of-the-art results on DLA tasks, e.g. PubLayNet (95.7% → 96.2%), DocBank (79.6% → 84.1%) and D⁴LA (67.7% → 68.8%). The code and models, as well as the D⁴LA dataset, will be made publicly available at https://github.com/AlibabaResearch/AdvancedLiterateMachinery
    Comment: Accepted by ICCV 2023
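
    A minimal sketch of the grid stream's core idea, assuming OCR tokens with normalised page positions: embed each token and scatter it into a 2D grid aligned with the page image, so the text can be fused with visual feature maps. Grid size, embedding width and the simple concatenation fusion are illustrative assumptions, not VGT's actual configuration.

```python
# Hedged sketch of a two-stream fusion: an image stream plus a "grid" stream
# that places OCR token embeddings at their spatial positions on the page.
import torch
import torch.nn as nn

class TextGrid(nn.Module):
    """Scatter token embeddings into an H x W grid aligned with the page image."""
    def __init__(self, vocab_size=30522, dim=64, grid_hw=(56, 56)):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.grid_hw = grid_hw

    def forward(self, token_ids, token_boxes):
        # token_boxes: (N, 2) normalised (x, y) centre of each OCR token
        H, W = self.grid_hw
        grid = torch.zeros(H, W, self.embed.embedding_dim)
        ys = (token_boxes[:, 1] * (H - 1)).long()
        xs = (token_boxes[:, 0] * (W - 1)).long()
        grid[ys, xs] = self.embed(token_ids)
        return grid.permute(2, 0, 1)             # (dim, H, W), image-like

grid_stream = TextGrid()
tokens = torch.randint(0, 30522, (10,))          # dummy OCR token ids
boxes = torch.rand(10, 2)                        # dummy normalised positions
text_map = grid_stream(tokens, boxes)            # (64, 56, 56)
visual_map = torch.randn(64, 56, 56)             # stand-in for a ViT feature map
fused = torch.cat([visual_map, text_map], dim=0) # naive concat fusion
```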

    Multimodal Side-Tuning for Document Classification

    In this paper, we propose to exploit the side-tuning framework for multimodal document classification. Side-tuning is a recently introduced network-adaptation methodology that addresses some of the problems of previous approaches: it overcomes the model rigidity and catastrophic forgetting that affect transfer learning by fine-tuning. The proposed solution uses off-the-shelf deep-learning architectures, leveraging the side-tuning framework to combine a base model with a tandem of two side networks. We show that side-tuning can also be employed successfully when different data sources are considered, e.g. text and images in document classification. The experimental results show that this approach pushes the limit for document classification accuracy further with respect to the state of the art.
    Comment: 2020 25th International Conference on Pattern Recognition (ICPR)
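
    Side-tuning itself has a compact formulation: a frozen pre-trained base network and a trainable side network are blended through a learned weight, so the base is never overwritten (no catastrophic forgetting) while the side network supplies the task-specific adaptation. A minimal single-modality PyTorch sketch, with stand-in linear networks rather than the paper's actual backbones:

```python
# Minimal side-tuning sketch: output = a * base(x) + (1 - a) * side(x),
# with a frozen base and a learned blending weight a.
import torch
import torch.nn as nn

class SideTuned(nn.Module):
    def __init__(self, base: nn.Module, side: nn.Module):
        super().__init__()
        self.base, self.side = base, side
        for p in self.base.parameters():
            p.requires_grad = False              # base stays frozen
        self.alpha = nn.Parameter(torch.zeros(1))  # blending is learned

    def forward(self, x):
        a = torch.sigmoid(self.alpha)
        return a * self.base(x) + (1 - a) * self.side(x)

feat_dim, n_classes = 256, 10
base = nn.Linear(feat_dim, n_classes)            # stand-in for a pre-trained net
side = nn.Linear(feat_dim, n_classes)            # lightweight trainable side net
model = SideTuned(base, side)
logits = model(torch.randn(4, feat_dim))
```

    For the multimodal case described above, one side network per modality (text and image) would be attached to the base in the same fashion.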

    SelfDocSeg: A Self-Supervised vision-based Approach towards Document Segmentation

    Document layout analysis is a well-known problem in the document-research community and has been explored extensively, yielding a multitude of solutions ranging from text mining and recognition to graph-based representation, visual feature extraction, etc. However, most existing works have ignored a crucial fact: the scarcity of labeled data. With growing internet connectivity in personal life, an enormous number of documents have become available in the public domain, making data annotation a tedious task. We address this challenge using self-supervision and, unlike the few existing self-supervised document segmentation approaches that use text mining and textual labels, we use a completely vision-based approach in pre-training, without any ground-truth label or its derivative. Instead, we generate pseudo-layouts from the document images to pre-train an image encoder to learn document object representation and localization in a self-supervised framework, before fine-tuning it with an object detection model. We show that our pipeline sets a new benchmark in this context and performs on par with existing methods and supervised counterparts, if it does not outperform them. The code is made publicly available at: https://github.com/MaitySubhajit/SelfDocSeg
    Comment: Accepted at The 17th International Conference on Document Analysis and Recognition (ICDAR 2023)
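
    The pseudo-layout generation is the label-free ingredient here. As one plausible, purely illustrative way to obtain such pseudo-layouts (the paper's exact procedure may differ), classical image processing already yields layout-like region boxes from a raw page:

```python
# Illustrative pseudo-layout extraction: binarise the page, dilate to merge
# words into blocks, and take connected components as pseudo-region boxes.
# This only shows how layout-like targets can be obtained without annotations.
import cv2
import numpy as np

def pseudo_layout_boxes(page: np.ndarray) -> list[tuple[int, int, int, int]]:
    """page: grayscale document image -> list of (x, y, w, h) pseudo-regions."""
    _, binary = cv2.threshold(page, 0, 255,
                              cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
    kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (25, 9))
    blocks = cv2.dilate(binary, kernel)          # merge words/lines into blocks
    n, _, stats, _ = cv2.connectedComponentsWithStats(blocks)
    boxes = []
    for i in range(1, n):                        # label 0 is the background
        x, y, w, h, area = stats[i]
        if area > 100:                           # drop specks
            boxes.append((x, y, w, h))
    return boxes

page = 255 * np.ones((1000, 800), np.uint8)      # blank dummy page
page[100:130, 100:700] = 0                       # fake text line
print(pseudo_layout_boxes(page))
```

    Boxes of this kind can then serve as pre-training targets for the image encoder before fine-tuning with an object detector.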