11 research outputs found

    Inflated 3D ConvNet context analysis for violence detection

    Get PDF
    According to the Wall Street Journal, one billion surveillance cameras will be deployed around the world by 2021. Such an amount of footage can hardly be managed by humans. Using an Inflated 3D ConvNet as backbone, this paper introduces a novel automatic violence detection approach that outperforms state-of-the-art proposals. Most of those proposals include a pre-processing step that restricts attention to the regions of interest in the scene, i.e., those actually containing a human subject. In this regard, this paper also reports the results of an extensive analysis of whether and how the context affects the performance of the adopted classifier. The experiments show that context-free footage yields a substantial deterioration of the classifier performance (2% to 5%) on publicly available datasets. However, they also demonstrate that performance stabilizes in context-free settings, regardless of the level of context restriction applied. Finally, a cross-dataset experiment investigates the generalizability of results obtained in a single-collection experiment (same dataset used for training and testing) to cross-collection settings (different datasets used for training and testing).
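    To make the classification setup concrete, the sketch below uses a small 3D convolutional network as a stand-in for the Inflated 3D ConvNet backbone (it is not the paper's model), together with a context-masking step that zeroes out everything outside hypothetical person bounding boxes, mimicking the "context-free" pre-processing discussed above.

        # Minimal sketch (not the paper's code): a tiny 3D ConvNet standing in for
        # the I3D backbone, plus a context-masking step keeping only person regions.
        import torch
        import torch.nn as nn

        class TinyI3DLike(nn.Module):
            """Stand-in for an Inflated 3D ConvNet backbone with a binary head."""
            def __init__(self, num_classes: int = 2):
                super().__init__()
                self.features = nn.Sequential(
                    nn.Conv3d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
                    nn.MaxPool3d((1, 2, 2)),
                    nn.Conv3d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
                    nn.AdaptiveAvgPool3d(1),
                )
                self.classifier = nn.Linear(32, num_classes)

            def forward(self, clip: torch.Tensor) -> torch.Tensor:
                # clip: (batch, channels, frames, height, width)
                x = self.features(clip).flatten(1)
                return self.classifier(x)

        def mask_context(clip: torch.Tensor, boxes) -> torch.Tensor:
            """Zero out everything outside the given (x1, y1, x2, y2) person boxes,
            mimicking a 'context-free' pre-processing step."""
            masked = torch.zeros_like(clip)
            for x1, y1, x2, y2 in boxes:
                masked[..., y1:y2, x1:x2] = clip[..., y1:y2, x1:x2]
            return masked

        clip = torch.rand(1, 3, 16, 112, 112)                   # one 16-frame RGB clip
        context_free = mask_context(clip, [(20, 10, 90, 100)])  # hypothetical detection
        logits = TinyI3DLike()(context_free)
        print(logits.softmax(dim=1))                            # P(non-violent), P(violent)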

    Gait Analysis for Gender Classification in Forensics

    Get PDF
    Gender Classification (GC) is a natural ability of human beings. Recent improvements in computer vision provide the possibility to extract information for different classification/recognition purposes. Gender is a soft biometric useful in video surveillance, especially in uncontrolled contexts such as low-light environments, with arbitrary poses, facial expressions, occlusions, and motion blur. In this work we present a methodology for the construction of a gait analyzer. The methodology is divided into three major steps: (1) data extraction, where body keypoints are extracted from video sequences; (2) feature creation, where body features are constructed from the body keypoints; and (3) classifier selection, where such data are used to train four different classifiers in order to determine the one that performs best. The results are analyzed on the Gotcha dataset, in which both the subject and the camera are in motion.
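    A minimal sketch of this three-step pipeline follows, assuming keypoint sequences have already been extracted as (frames, joints, 2) arrays; the stride-style features, the ankle indices, and the four candidate classifiers are illustrative choices, not the paper's exact configuration.

        # Steps 2 and 3 of the pipeline on synthetic stand-in data.
        import numpy as np
        from sklearn.model_selection import cross_val_score
        from sklearn.linear_model import LogisticRegression
        from sklearn.svm import SVC
        from sklearn.ensemble import RandomForestClassifier
        from sklearn.neighbors import KNeighborsClassifier

        def gait_features(seq: np.ndarray) -> np.ndarray:
            """Step 2: build simple body features (stride-like distances) from
            2D keypoints; indices 11 and 14 are assumed to be the ankles."""
            ankle_dist = np.linalg.norm(seq[:, 11] - seq[:, 14], axis=1)
            return np.array([ankle_dist.mean(), ankle_dist.std(), ankle_dist.max()])

        # Step 1 is assumed done: random data stands in for extracted keypoints.
        rng = np.random.default_rng(0)
        X = np.array([gait_features(rng.random((60, 25, 2))) for _ in range(100)])
        y = rng.integers(0, 2, size=100)                  # 0 = female, 1 = male

        # Step 3: compare candidate classifiers and keep the best-performing one.
        candidates = {
            "logreg": LogisticRegression(max_iter=1000),
            "svm": SVC(),
            "rf": RandomForestClassifier(),
            "knn": KNeighborsClassifier(),
        }
        scores = {name: cross_val_score(clf, X, y, cv=5).mean()
                  for name, clf in candidates.items()}
        best = max(scores, key=scores.get)
        print(best, scores[best])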

    Gotcha-I: A Multiview Human Videos Dataset

    Get PDF
    The growing need for security in large open spaces has led to the need to capture video of people in different contexts and illumination conditions and with multiple biometric traits, such as head pose, body gait, eyes, nose, mouth, and more. All these traits are useful for multibiometric identification or person re-identification in a video surveillance context. Body Worn Cameras (BWCs) are used by police forces in different countries all around the world, and their use is growing significantly. This raises the need to develop new recognition methods that exploit multibiometric traits for person re-identification. The purpose of this work is to present a new video dataset called Gotcha-I. This dataset has been obtained using several mobile cameras so as to resemble the data produced by BWCs. The dataset includes videos from 62 subjects in indoor and outdoor environments to address both security and surveillance problems. In these videos, subjects may behave differently, e.g., moving freely, following a path, climbing stairs, or avoiding the camera. The dataset is composed of 493 videos, including a set of 180° videos of the face of each subject in the dataset. Furthermore, already processed data are provided, such as the 3D model of the face of each subject with all the head poses in pitch, yaw, and roll, and the body keypoint coordinates of the gait for each video frame. An application of gender recognition performed on Gotcha-I is also shown, confirming the usefulness and innovativeness of the proposed dataset.
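    As a small worked example of using the released head-pose annotations, the sketch below converts a (pitch, yaw, roll) triple into a rotation matrix, e.g., to compare head poses across videos; the degree units and the axis composition order are assumptions, since the dataset description does not state a convention.

        # Compose R = Rz(roll) @ Ry(yaw) @ Rx(pitch) from Euler angles in degrees.
        import numpy as np

        def head_pose_matrix(pitch: float, yaw: float, roll: float) -> np.ndarray:
            """Axis convention is an assumption, not specified by the Gotcha-I paper."""
            p, y, r = np.radians([pitch, yaw, roll])
            rx = np.array([[1, 0, 0],
                           [0, np.cos(p), -np.sin(p)],
                           [0, np.sin(p),  np.cos(p)]])
            ry = np.array([[ np.cos(y), 0, np.sin(y)],
                           [0, 1, 0],
                           [-np.sin(y), 0, np.cos(y)]])
            rz = np.array([[np.cos(r), -np.sin(r), 0],
                           [np.sin(r),  np.cos(r), 0],
                           [0, 0, 1]])
            return rz @ ry @ rx

        print(head_pose_matrix(10.0, -25.0, 5.0))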

    AveRobot: An audio-visual dataset for people re-identification and verification in human-robot interaction

    Get PDF
    Intelligent technologies have pervaded our daily life, making it easier for people to complete their activities. One emerging application involves the use of robots for assisting people in various tasks (e.g., visiting a museum). In this context, it is crucial to enable robots to correctly identify people. Existing robots often use facial information to establish the identity of a person of interest. However, the face alone may not offer enough relevant information due to variations in pose, illumination, resolution, and recording distance. Other biometric modalities, such as the voice, can improve the recognition performance in these conditions. However, the existing datasets in robotic scenarios usually do not include the audio cue and tend to suffer from one or more limitations: most of them are acquired under controlled conditions, limited in the number of identities or samples per user, collected by the same recording device, and/or not freely available. In this paper, we propose AveRobot, an audio-visual dataset of 111 participants vocalizing short sentences under robot assistance scenarios. The collection took place in a three-floor building through eight different cameras with built-in microphones. The performance of face and voice re-identification and verification was evaluated on this dataset with deep learning baselines, and compared against audio-visual datasets from diverse scenarios. The results show that AveRobot is a challenging dataset for people re-identification and verification.
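    To illustrate the verification task evaluated on the dataset, the sketch below applies a generic embedding-based baseline (not AveRobot's exact models): a probe face or voice embedding is accepted if its cosine similarity to the enrolled template exceeds a tuned threshold; the embedding size and threshold value are arbitrary placeholders.

        # Generic embedding-based verification baseline on synthetic embeddings.
        import numpy as np

        def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
            return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

        def verify(probe: np.ndarray, enrolled: np.ndarray, threshold: float = 0.7) -> bool:
            """Accept the claimed identity if the probe matches the enrolled template."""
            return cosine_similarity(probe, enrolled) >= threshold

        rng = np.random.default_rng(1)
        enrolled = rng.standard_normal(512)                    # stored template embedding
        genuine = enrolled + 0.1 * rng.standard_normal(512)    # same-person probe
        impostor = rng.standard_normal(512)                    # different-person probe
        print(verify(genuine, enrolled), verify(impostor, enrolled))

    In practice the threshold would be tuned on a development set (e.g., at the equal error rate), and the same scoring applies whether the embeddings come from the face, the voice, or a fused model.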

    Gender classification on 2D human skeleton

    No full text
    Soft biometrics has become a trending research topic over the past decade. In recent years, the rise of new technologies such as wearable camera devices has introduced a new challenge into the gender classification problem. In this sense, the ability to classify gender not from an image but from 2D estimated skeleton points is considered in this paper. Our experiments show that human gender can be classified by considering only the body pose information. The proposed method has shown remarkable performance on a dataset where both subjects and camera are in movement.
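    A minimal sketch of the idea, assuming each sample is a single (joints, 2) skeleton: keypoints are centred on an assumed mid-hip joint and scale-normalised before being fed to a simple classifier; the joint index, the classifier choice, and the synthetic labels are illustrative only.

        # Pose normalisation plus a simple classifier on flattened 2D keypoints.
        import numpy as np
        from sklearn.linear_model import LogisticRegression

        HIP = 8  # assumed index of the mid-hip keypoint in the skeleton layout

        def normalise_pose(skel: np.ndarray) -> np.ndarray:
            centred = skel - skel[HIP]                        # translation invariance
            scale = np.linalg.norm(centred, axis=1).max() + 1e-8
            return (centred / scale).ravel()                  # scale invariance, flatten

        rng = np.random.default_rng(2)
        X = np.array([normalise_pose(rng.random((25, 2))) for _ in range(200)])
        y = rng.integers(0, 2, size=200)                      # synthetic labels for the demo

        clf = LogisticRegression(max_iter=1000).fit(X[:150], y[:150])
        print("held-out accuracy:", clf.score(X[150:], y[150:]))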

    Deep Multi-biometric Fusion for Audio-Visual User Re-Identification and Verification

    No full text
    From border controls to personal devices, from online exam proctoring to human-robot interaction, biometric technologies are empowering individuals and organizations with convenient and secure authentication and identification services. However, most biometric systems leverage only a single modality, and may face challenges related to acquisition distance, environmental conditions, data quality, and computational resources. Combining evidence from multiple sources at a certain level (e.g., sensor, feature, score, or decision) of the recognition pipeline may mitigate some limitations of common uni-biometric systems. Such a fusion has rarely been investigated at the intermediate level, i.e., when uni-biometric model parameters are jointly optimized during training. In this chapter, we propose a multi-biometric model training strategy that digests face and voice traits in parallel, and we explore how it helps to improve recognition performance in re-identification and verification scenarios. To this end, we design a neural architecture for jointly embedding face and voice data, and we experiment with several training losses and audio-visual datasets. The idea is to exploit the relation between voice characteristics and facial morphology, so that face and voice uni-biometric models help each other to recognize people when trained jointly. Extensive experiments on four real-world datasets show that the biometric feature representation of a jointly trained uni-biometric model performs better than the one computed by the same uni-biometric model trained alone. Moreover, the recognition results are further improved by embedding face and voice data into a single shared representation of the two modalities. The proposed fusion strategy generalizes well to unseen and unheard users, and should be considered a feasible solution that improves model performance. We expect that this chapter will support the biometric community in shaping research on deep audio-visual fusion in real-world contexts.
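    The sketch below illustrates the general idea of intermediate-level audio-visual fusion: two uni-biometric branches are merged into a single shared embedding and optimized jointly through an identity head; the layer sizes, input features, and loss are illustrative and do not reproduce the chapter's architecture.

        # Two-branch face/voice model with a shared embedding, trained jointly.
        import torch
        import torch.nn as nn

        class AudioVisualEmbedder(nn.Module):
            def __init__(self, face_dim=512, voice_dim=256, embed_dim=128, num_ids=100):
                super().__init__()
                self.face_branch = nn.Sequential(nn.Linear(face_dim, embed_dim), nn.ReLU())
                self.voice_branch = nn.Sequential(nn.Linear(voice_dim, embed_dim), nn.ReLU())
                self.fusion = nn.Linear(2 * embed_dim, embed_dim)   # shared representation
                self.id_head = nn.Linear(embed_dim, num_ids)        # identity classifier

            def forward(self, face_feat, voice_feat):
                joint = torch.cat([self.face_branch(face_feat),
                                   self.voice_branch(voice_feat)], dim=1)
                embedding = self.fusion(joint)   # used for re-identification/verification
                return embedding, self.id_head(embedding)

        model = AudioVisualEmbedder()
        face = torch.rand(8, 512)                     # e.g. pre-extracted face features
        voice = torch.rand(8, 256)                    # e.g. pre-extracted voice features
        labels = torch.randint(0, 100, (8,))
        _, logits = model(face, voice)
        loss = nn.CrossEntropyLoss()(logits, labels)  # gradients flow to both branches
        loss.backward()
        print(float(loss))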