CentralNet: a Multilayer Approach for Multimodal Fusion
This paper proposes a novel multimodal fusion approach, aiming to produce the
best possible decisions by integrating information coming from multiple media.
While most past multimodal approaches work either by projecting the features of
different modalities into the same space, or by coordinating the
representations of each modality through the use of constraints, our approach
borrows from both visions. More specifically, assuming each modality can be
processed by a separate deep convolutional network, allowing decisions to be
taken independently from each modality, we introduce a central network linking
the modality-specific networks. This central network not only provides a common
feature embedding but also regularizes the modality-specific networks through
multi-task learning. The proposed approach is validated on 4 different computer
vision tasks, on which it consistently improves the accuracy of existing
multimodal fusion approaches.
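The central-network idea in the abstract above can be sketched as a per-layer fusion step: the next central representation is a learned weighted sum of the previous central features and each modality's hidden features. This is a minimal illustrative sketch, not the paper's exact architecture; the function name and weights are assumptions.

```python
def central_fusion(central_prev, modality_feats, weights):
    """One CentralNet-style fusion step (illustrative sketch).

    central_prev: previous central feature vector (list of floats).
    modality_feats: list of per-modality feature vectors, same length
                    as central_prev.
    weights: one scalar for the central features plus one per modality.
    Returns the next central feature vector as a weighted sum.
    """
    assert len(weights) == len(modality_feats) + 1
    # Start from the weighted previous central features...
    out = [weights[0] * c for c in central_prev]
    # ...then add each modality's weighted contribution elementwise.
    for w, feat in zip(weights[1:], modality_feats):
        out = [o + w * f for o, f in zip(out, feat)]
    return out
```

In the paper the weights are trained jointly with the modality-specific networks; here they are plain scalars so the fusion rule itself is visible.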
Empowerment through journalism: social change through youth media production in northeast Brazil [abstract]
Journalism is a process through which people can begin to understand their realities, and it can be a powerful force in democratic societies for or against change. Specifically, youth journalism engages students in identifying themes that elicit social and emotional involvement and a high level of motivation to participate. This thesis explores how journalism can be used as a tool of empowerment, building the capacity of youth to become aware of their own realities and communicate these realities to others through a newspaper. I also explore how the production is linked to social justice by analyzing how it allows the youth of Daruê Malungo, a Center for Arts and Education in Recife, Brazil, to examine the visible and invisible systems shaping their interactions and identities. My methodology for this research included teaching a journalism class using Paulo Freire's theory in the Pedagogy of the Oppressed and the development of a newspaper made by the students. I argue that the newspaper produced by the students at Daruê Malungo allowed them to navigate experiences of difference in terms of race, class, privilege, and oppression. Their production was linked to social justice because it was a cry, "um lamento" as the students decided to name their newspaper, for social action in terms of the racial prejudice that still surrounds them, the violence and drug problems in their community, the lack of education they receive, the pollution and abuse of the environment, and an explanation of how they express themselves through their culture. This journalism production created a space for youth development and empowerment, in which students said they were no longer afraid to speak: they were given the opportunity to tell their community, their country, and the world what was important to them and why they wanted change. School for International Training
Exploiting feature representations through similarity learning, post-ranking and ranking aggregation for person re-identification
Person re-identification has received special attention from the human analysis
community in the last few years. To address the challenges in this field, many
researchers have proposed different strategies, which basically exploit either
cross-view invariant features or cross-view robust metrics. In this work, we
propose to exploit a post-ranking approach and combine different feature
representations through ranking aggregation. Spatial information, which
potentially benefits the person matching, is represented using a 2D body model,
from which color and texture information are extracted and combined. We also
consider background/foreground information, automatically extracted via Deep
Decompositional Network, and the usage of Convolutional Neural Network (CNN)
features. To describe the matching between images we use the polynomial feature
map, also taking into account local and global information. The Discriminant
Context Information Analysis based post-ranking approach is used to improve
initial ranking lists. Finally, the Stuart ranking aggregation method is
employed to combine complementary ranking lists obtained from different feature
representations. Experimental results demonstrate that we improve the
state-of-the-art on the VIPeR and PRID450s datasets, achieving top-1
recognition rates of 67.21% and 75.64%, respectively, as well as obtaining
competitive results on the CUHK01 dataset. Comment: Preprint submitted to Image and Vision Computing.
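The final step of the pipeline above combines complementary ranking lists from different feature representations. The paper uses the Stuart aggregation method; as a simple stand-in to show the general idea, the sketch below aggregates by mean rank (function name and interface are hypothetical).

```python
def aggregate_rankings(rankings):
    """Combine several ranking lists into one (mean-rank sketch,
    standing in for the Stuart method used in the paper).

    rankings: list of lists, each an ordering of the same gallery ids
              with the best match first.
    Returns the ids sorted by their average position across lists.
    """
    ids = rankings[0]
    # Average each id's position (0 = best) over all ranking lists.
    mean_rank = {g: sum(r.index(g) for r in rankings) / len(rankings)
                 for g in ids}
    return sorted(ids, key=lambda g: mean_rank[g])
```

An id ranked near the top by most feature representations ends up near the top of the aggregate list, even if one representation misranks it.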
Duplications in nomenclature
Overcoming Calibration Problems in Pattern Labeling with Pairwise Ratings: Application to Personality Traits
We address the problem of calibration of workers whose task is to label patterns with continuous variables, which arises for instance in labeling images or videos of humans with continuous traits. Worker bias is particularly difficult to evaluate and correct when many workers contribute just a few labels, a situation arising typically when labeling is crowd-sourced. In the scenario of labeling short videos of people facing a camera with personality traits, we evaluate the feasibility of the pairwise ranking method to alleviate bias problems. Workers are exposed to pairs of videos at a time and must order them by preference. The variable levels are reconstructed by fitting a Bradley-Terry-Luce model with maximum likelihood. This method may, at first sight, seem prohibitively expensive because for N videos, p=N(N−1)/2 pairs must potentially be processed by workers rather than N videos. However, by performing extensive simulations, we determine an empirical law for the scaling of the number of pairs needed as a function of the number of videos in order to achieve a given accuracy of score reconstruction, and show that the pairwise method is affordable. We apply the method to the labeling of a large-scale dataset of 10,000 videos used in the ChaLearn Apparent Personality Trait challenge.
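The score reconstruction described above can be sketched concretely: under the Bradley-Terry-Luce model, item w beats item l with probability sigmoid(s_w − s_l), and the scores s can be fit by maximizing the likelihood of the observed pairwise outcomes. The gradient-ascent fitter below is a minimal sketch assuming comparisons arrive as (winner, loser) index pairs; the function name and hyperparameters are illustrative.

```python
import math

def fit_btl(n_items, comparisons, iters=1000, lr=1.0):
    """Fit Bradley-Terry-Luce scores by maximum likelihood via
    gradient ascent (minimal sketch).

    comparisons: list of (winner, loser) index pairs; for N items,
    up to N(N-1)/2 distinct pairs may be compared.
    """
    s = [0.0] * n_items
    n = float(len(comparisons))
    for _ in range(iters):
        grad = [0.0] * n_items
        for w, l in comparisons:
            # P(w beats l) under the BTL model.
            p = 1.0 / (1.0 + math.exp(-(s[w] - s[l])))
            grad[w] += (1.0 - p) / n
            grad[l] -= (1.0 - p) / n
        s = [si + lr * g for si, g in zip(s, grad)]
        # Scores are identifiable only up to a shift; center them.
        mean = sum(s) / n_items
        s = [si - mean for si in s]
    return s
```

The log-likelihood is concave, so plain gradient ascent with a modest step size recovers the latent ordering once each pair has been compared enough times, which is exactly the trade-off the abstract's simulations quantify.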
Convolutional Neural Network Super Resolution for Face Recognition in Surveillance Monitoring
Due to the importance of security in society, monitoring activities and recognizing specific people through surveillance video cameras play an important role. One of the main issues in such activity arises from the fact that cameras do not meet the resolution requirements of many face recognition algorithms. In order to solve this issue, in this paper we propose a new system which super-resolves the image using a deep convolutional network, followed by Hidden Markov Model and Singular Value Decomposition based face recognition. The proposed system has been tested on many well-known face databases such as the FERET, HeadPose, and Essex University databases, as well as our recently introduced iCV Face Recognition database (iCV-F). The experimental results show that the recognition rate improves considerably after applying super resolution.
Video Transformers: A Survey
Transformer models have shown great success handling long-range interactions,
making them a promising tool for modeling video. However, they lack inductive
biases and scale quadratically with input length. These limitations are further
exacerbated by the high dimensionality introduced with the temporal dimension.
While there are surveys analyzing the advances of Transformers for vision, none
focus on an in-depth analysis of video-specific designs. In this survey we
analyze the main contributions and trends of works leveraging Transformers to
model video. Specifically, we first delve into how videos are handled at the
input level. Then, we study the architectural changes made to deal with video
more efficiently, reduce redundancy, re-introduce useful inductive biases, and
capture long-term temporal dynamics. In addition, we provide an overview of
different training regimes and explore effective self-supervised learning
strategies for video. Finally, we conduct a performance comparison on the most
common benchmark for Video Transformers (i.e., action classification), finding
them to outperform 3D ConvNets even with less computational complexity.
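The quadratic scaling mentioned in the abstract is easy to see in a back-of-the-envelope count: self-attention computes one score per (query, key) token pair, and for video the token count is frames times patches per frame. The numbers below (14x14 ViT-style patches) are illustrative assumptions.

```python
def attention_score_count(frames, patches_per_frame):
    """Number of pairwise attention scores in one full self-attention
    layer over a video clip: quadratic in the token count."""
    tokens = frames * patches_per_frame
    return tokens * tokens  # one score per (query, key) pair

# Doubling the clip length quadruples the attention cost.
base = attention_score_count(8, 196)     # e.g. 8 frames, 14x14 patches
double = attention_score_count(16, 196)
```

This is why the video-specific designs surveyed above focus on reducing redundancy and restricting attention patterns along the temporal dimension.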
- …