    CentralNet: a Multilayer Approach for Multimodal Fusion

    This paper proposes a novel multimodal fusion approach, aiming to produce the best possible decisions by integrating information coming from multiple media. While most past multimodal approaches work either by projecting the features of different modalities into the same space or by coordinating the representations of each modality through constraints, our approach borrows from both visions. More specifically, assuming each modality can be processed by a separate deep convolutional network, allowing decisions to be taken independently for each modality, we introduce a central network linking the modality-specific networks. This central network not only provides a common feature embedding but also regularizes the modality-specific networks through multi-task learning. The proposed approach is validated on 4 different computer vision tasks, on which it consistently improves the accuracy of existing multimodal fusion approaches.
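    A minimal PyTorch-style sketch of the central-network idea described above: each modality keeps its own network and decision head, while a central network mixes their hidden layers with learnable weights and adds a fused decision head, so everything can be trained jointly in a multi-task fashion. Class names, layer sizes, and the exact weighting scheme are illustrative assumptions, not the authors' implementation.

```python
# Sketch of the CentralNet idea: modality-specific networks plus a central
# network that mixes their hidden states with learnable scalar weights.
# All sizes and names below are illustrative assumptions.
import torch
import torch.nn as nn

class ModalityNet(nn.Module):
    def __init__(self, in_dim, hidden_dim, num_classes):
        super().__init__()
        self.layer1 = nn.Linear(in_dim, hidden_dim)
        self.layer2 = nn.Linear(hidden_dim, hidden_dim)
        self.head = nn.Linear(hidden_dim, num_classes)      # per-modality decision

    def forward(self, x):
        h1 = torch.relu(self.layer1(x))
        h2 = torch.relu(self.layer2(h1))
        return [h1, h2], self.head(h2)

class CentralNet(nn.Module):
    def __init__(self, modality_nets, hidden_dim, num_classes, num_layers=2):
        super().__init__()
        self.modality_nets = nn.ModuleList(modality_nets)
        self.central_layers = nn.ModuleList(
            [nn.Linear(hidden_dim, hidden_dim) for _ in range(num_layers)])
        # learnable weights balancing the central state and each modality's features
        self.alphas = nn.Parameter(torch.ones(num_layers, len(modality_nets) + 1))
        self.head = nn.Linear(hidden_dim, num_classes)       # fused decision

    def forward(self, inputs):
        outputs = [net(x) for net, x in zip(self.modality_nets, inputs)]
        hiddens = [h for h, _ in outputs]                    # per-modality hidden layers
        logits = [y for _, y in outputs]                     # per-modality decisions
        central = torch.zeros_like(hiddens[0][0])
        for l, layer in enumerate(self.central_layers):
            mix = self.alphas[l, 0] * central
            for m, hs in enumerate(hiddens):
                mix = mix + self.alphas[l, m + 1] * hs[l]
            central = torch.relu(layer(mix))
        # a multi-task loss would combine the fused logits with each entry of `logits`
        return self.head(central), logits

# Example: fuse a 2048-d visual feature with a 128-d audio feature (toy dimensions).
nets = [ModalityNet(2048, 256, 10), ModalityNet(128, 256, 10)]
model = CentralNet(nets, hidden_dim=256, num_classes=10)
fused, per_modality = model([torch.randn(4, 2048), torch.randn(4, 128)])
```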

    Empowerment through journalism: social change through youth media production in northeast Brazil [abstract]

    Abstract only available. Journalism is a process through which people can begin to understand their realities, and it can be used as a powerful force in democratic societies for or against change. Specifically, youth journalism engages students in identifying themes that elicit social and emotional involvement and a high level of motivation to participate. This thesis explores how journalism can be used as a tool of empowerment, building the capacity of youth to become aware of their own realities and to communicate these realities to others through a newspaper. I also explore how the production is linked to social justice by analyzing how it allows the youth of Daruê Malungo, a Center for Arts and Education in Recife, Brazil, to examine the visible and invisible systems shaping their interactions and identities. My methodology for this research included teaching a journalism class grounded in Paulo Freire's theory in the Pedagogy of the Oppressed and the development of a newspaper made by the students. I argue that the newspaper produced by the students at Daruê Malungo allowed them to navigate experiences of difference in terms of race, class, privilege, and oppression. Their production was linked to social justice because it was a cry, “um lamento,” as the students decided to name their newspaper, for social action in terms of the racial prejudice that still surrounds them, the violence and drug problems in their community, the lack of education they receive, the pollution and abuse of the environment, and an explanation of how they express themselves through their culture. This journalism production created a space for youth development and empowerment, in which students said they no longer had to remain silent: they were given the opportunity to tell their community, their country, and the world what was important to them and why they wanted change. School for International Training

    Exploiting feature representations through similarity learning, post-ranking and ranking aggregation for person re-identification

    Person re-identification has received special attention from the human analysis community in the last few years. To address the challenges in this field, many researchers have proposed different strategies, which basically exploit either cross-view invariant features or cross-view robust metrics. In this work, we propose to exploit a post-ranking approach and to combine different feature representations through ranking aggregation. Spatial information, which potentially benefits the person matching, is represented using a 2D body model, from which color and texture information are extracted and combined. We also consider background/foreground information, automatically extracted via a Deep Decompositional Network, and the usage of Convolutional Neural Network (CNN) features. To describe the matching between images we use the polynomial feature map, also taking into account local and global information. The Discriminant Context Information Analysis based post-ranking approach is used to improve the initial ranking lists. Finally, the Stuart ranking aggregation method is employed to combine complementary ranking lists obtained from different feature representations. Experimental results demonstrate that we improve the state of the art on the VIPeR and PRID450s datasets, achieving 67.21% and 75.64% top-1 rank recognition rate, respectively, as well as obtaining competitive results on the CUHK01 dataset. Comment: Preprint submitted to Image and Vision Computing.
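    The final fusion step, combining complementary ranking lists, can be illustrated with a plain mean-rank (Borda-style) aggregation. This is a simplified stand-in for the Stuart aggregation method the abstract actually names, and the gallery IDs and ranking lists below are made up for illustration only.

```python
# Simplified rank aggregation for person re-identification: several feature
# representations each produce a ranking of gallery identities for a probe,
# and the lists are fused by mean rank (a Borda-style stand-in for the
# Stuart method mentioned in the abstract). Data below are illustrative.
import numpy as np

def aggregate_rankings(ranking_lists):
    """ranking_lists: list of arrays, each an ordering of gallery IDs
    (best match first) produced by one feature representation."""
    gallery = sorted(set(int(g) for g in ranking_lists[0]))
    ranks = np.zeros((len(ranking_lists), len(gallery)))
    for i, order in enumerate(ranking_lists):
        for pos, gid in enumerate(order):
            ranks[i, gallery.index(int(gid))] = pos
    mean_rank = ranks.mean(axis=0)                   # lower is better
    return [gallery[j] for j in np.argsort(mean_rank)]

# Example: three representations rank four gallery identities for one probe.
lists = [np.array([2, 0, 3, 1]), np.array([0, 2, 1, 3]), np.array([2, 3, 0, 1])]
print(aggregate_rankings(lists))                     # -> [2, 0, 3, 1]
```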

    Duplications in nomenclature

    Peer Reviewed. http://deepblue.lib.umich.edu/bitstream/2027.42/62822/1/389539a0.pdf

    Overcoming Calibration Problems in Pattern Labeling with Pairwise Ratings: Application to Personality Traits

    We address the problem of calibration of workers whose task is to label patterns with continuous variables, which arises for instance when labeling images or videos of humans with continuous traits. Worker bias is particularly difficult to evaluate and correct when many workers contribute just a few labels each, a situation arising typically when labeling is crowd-sourced. In the scenario of labeling short videos of people facing a camera with personality traits, we evaluate the feasibility of the pairwise ranking method to alleviate bias problems. Workers are exposed to pairs of videos at a time and must order them by preference. The variable levels are reconstructed by fitting a Bradley-Terry-Luce model with maximum likelihood. This method may, at first sight, seem prohibitively expensive because, for N videos, up to p = N(N−1)/2 pairs must potentially be processed by workers rather than N videos. However, by performing extensive simulations, we determine an empirical law for the scaling of the number of pairs needed as a function of the number of videos in order to achieve a given accuracy of score reconstruction, and we show that the pairwise method is affordable. We apply the method to the labeling of a large-scale dataset of 10,000 videos used in the ChaLearn Apparent Personality Trait challenge.
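    A minimal sketch of the reconstruction step described above: latent per-video scores are recovered from pairwise preferences by maximum-likelihood fitting of a Bradley-Terry-Luce model. The simple gradient-ascent loop, learning rate, and toy data are illustrative assumptions rather than the authors' implementation.

```python
# Fit a Bradley-Terry-Luce model to pairwise preferences by maximum likelihood:
# each item i gets a latent score s_i with P(i beats j) = sigmoid(s_i - s_j).
import numpy as np

def fit_btl(num_items, pairs, iters=2000, lr=0.05):
    """pairs: list of (winner, loser) index tuples collected from workers."""
    scores = np.zeros(num_items)
    for _ in range(iters):
        grad = np.zeros(num_items)
        for w, l in pairs:
            p = 1.0 / (1.0 + np.exp(-(scores[w] - scores[l])))  # P(w beats l)
            grad[w] += 1.0 - p          # push the winner's score up
            grad[l] -= 1.0 - p          # push the loser's score down
        scores += lr * grad / len(pairs)
        scores -= scores.mean()         # remove the model's free offset
    return scores

# Toy example: 4 videos; workers preferred 0 over 1, 0 over 2, 2 over 1, 3 over 1.
print(fit_btl(4, [(0, 1), (0, 2), (2, 1), (3, 1)]))
```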

    Convolutional Neural Network Super Resolution for Face Recognition in Surveillance Monitoring

    Due to the importance of security in society, monitoring activities and recognizing specific people through surveillance video cameras play an important role. One of the main issues in such activity arises from the fact that cameras do not meet the resolution requirements of many face recognition algorithms. To solve this issue, in this paper we propose a new system which super-resolves the image using a deep convolutional network, followed by Hidden Markov Model and Singular Value Decomposition based face recognition. The proposed system has been tested on many well-known face databases such as the FERET, HeadPose, and Essex University databases, as well as our recently introduced iCV Face Recognition database (iCV-F). The experimental results show that the recognition rate improves considerably after applying the super resolution.
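    An SRCNN-style sketch of the kind of super-resolution network the abstract describes, applied to a bicubic-upscaled low-resolution face crop before the recognition stage. The layer widths and kernel sizes follow a common SRCNN configuration and are assumptions, not the paper's exact architecture.

```python
# Small convolutional super-resolution network (SRCNN-style): sharpen an
# already-upscaled low-resolution face crop before face recognition.
import torch
import torch.nn as nn

class SuperResolutionCNN(nn.Module):
    def __init__(self, channels=1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(channels, 64, kernel_size=9, padding=4),  # patch extraction
            nn.ReLU(inplace=True),
            nn.Conv2d(64, 32, kernel_size=5, padding=2),        # non-linear mapping
            nn.ReLU(inplace=True),
            nn.Conv2d(32, channels, kernel_size=5, padding=2),  # reconstruction
        )

    def forward(self, upscaled_lowres):
        return self.net(upscaled_lowres)

# Example: enhance a bicubic-upscaled 64x64 grayscale face crop (toy input).
model = SuperResolutionCNN()
restored = model(torch.randn(1, 1, 64, 64))
print(restored.shape)                                           # (1, 1, 64, 64)
```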

    Video Transformers: A Survey

    Transformer models have shown great success handling long-range interactions, making them a promising tool for modeling video. However, they lack inductive biases and scale quadratically with input length. These limitations are further exacerbated when dealing with the high dimensionality introduced by the temporal dimension. While there are surveys analyzing the advances of Transformers for vision, none focus on an in-depth analysis of video-specific designs. In this survey, we analyze the main contributions and trends of works leveraging Transformers to model video. Specifically, we first delve into how videos are handled at the input level. Then, we study the architectural changes made to deal with video more efficiently, reduce redundancy, re-introduce useful inductive biases, and capture long-term temporal dynamics. In addition, we provide an overview of different training regimes and explore effective self-supervised learning strategies for video. Finally, we conduct a performance comparison on the most common benchmark for Video Transformers (i.e., action classification), finding them to outperform 3D ConvNets even with less computational complexity.
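    As a concrete illustration of the input-level designs such a survey covers, the sketch below embeds a clip as spatio-temporal "tubelet" tokens with a single 3D convolution and feeds them to a standard Transformer encoder. All sizes are toy assumptions, and the code is not tied to any particular model discussed in the survey.

```python
# Tubelet embedding: turn a video clip into spatio-temporal tokens for a
# Transformer encoder. All dimensions below are toy assumptions.
import torch
import torch.nn as nn

class TubeletEmbedding(nn.Module):
    def __init__(self, dim=192, tubelet=(2, 16, 16), channels=3):
        super().__init__()
        # one token per tube of 2 frames x 16 x 16 pixels
        self.proj = nn.Conv3d(channels, dim, kernel_size=tubelet, stride=tubelet)

    def forward(self, video):                    # video: (batch, channels, T, H, W)
        tokens = self.proj(video)                # (batch, dim, T', H', W')
        return tokens.flatten(2).transpose(1, 2) # (batch, num_tokens, dim)

embed = TubeletEmbedding()
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=192, nhead=3, batch_first=True),
    num_layers=2)
clip = torch.randn(1, 3, 16, 224, 224)           # 16-frame RGB clip
tokens = embed(clip)                             # (1, 8 * 14 * 14 = 1568 tokens, 192)
print(encoder(tokens).shape)                     # torch.Size([1, 1568, 192])
```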