Deep Metric Learning for Computer Vision: A Brief Overview
Objective functions that optimize deep neural networks play a vital role in
creating an enhanced feature representation of the input data. Although
cross-entropy-based loss formulations have been extensively used in a variety
of supervised deep-learning applications, these methods tend to be less
adequate when there is large intra-class variance and low inter-class variance
in the input data distribution. Deep Metric Learning seeks to measure the
similarity between data samples by learning a representation function that maps
them into a representative embedding space. It
leverages carefully designed sampling strategies and loss functions that aid in
optimizing the generation of a discriminative embedding space even for
distributions having low inter-class and high intra-class variances. In this
chapter, we will provide an overview of recent progress in this area and
discuss state-of-the-art Deep Metric Learning approaches.
Comment: Book Chapter Published In Handbook of Statistics, Special Issue - Deep Learning 48, 5
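The chapter surveys many such loss formulations; as a minimal illustration of the kind of objective involved, here is a numpy sketch of the classic triplet margin loss, one standard Deep Metric Learning loss (not necessarily the chapter's specific formulations):

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Triplet margin loss: pull the positive toward the anchor and
    push the negative at least `margin` farther away than the positive."""
    d_pos = np.linalg.norm(anchor - positive)  # anchor-positive distance
    d_neg = np.linalg.norm(anchor - negative)  # anchor-negative distance
    return max(d_pos - d_neg + margin, 0.0)

# Toy embeddings: the positive lies close to the anchor, the negative far away.
a = np.array([1.0, 0.0])
p = np.array([0.9, 0.1])
n = np.array([-1.0, 0.0])
print(triplet_loss(a, p, n))  # 0.0: the margin constraint is already satisfied
```

Minimizing this loss over many (anchor, positive, negative) triplets is one way to shape an embedding space so that intra-class distances shrink and inter-class distances grow, which is exactly the failure mode of plain cross-entropy the abstract describes.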
CoNAN: Conditional Neural Aggregation Network For Unconstrained Face Feature Fusion
Face recognition from image sets acquired under unregulated and uncontrolled
settings, such as at large distances, low resolutions, varying viewpoints,
illumination, pose, and atmospheric conditions, is challenging. Face feature
aggregation, which involves aggregating a set of N feature representations
present in a template into a single global representation, plays a pivotal role
in such recognition systems. Existing works in traditional face feature
aggregation either utilize metadata or high-dimensional intermediate feature
representations to estimate feature quality for aggregation. However,
generating high-quality metadata or style information is not feasible for
extremely low-resolution faces captured in long-range and high-altitude
settings. To overcome these limitations, we propose a feature distribution
conditioning approach called CoNAN for template aggregation. Specifically, our
method aims to learn a context vector conditioned over the distribution
information of the incoming feature set, which is utilized to weigh the
features based on their estimated informativeness. The proposed method produces
state-of-the-art results on long-range unconstrained face recognition datasets
such as BTS and DroneSURF, validating the advantages of such an aggregation
strategy.
Comment: Paper accepted at IJCB 202
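CoNAN's learned conditioning network is not reproduced here; the sketch below only illustrates the general recipe the abstract describes — softmax-weighting an (N, D) feature set by per-feature informativeness scores and summing to a single template descriptor. The score values are hypothetical stand-ins for what the learned context vector would produce:

```python
import numpy as np

def aggregate(features, scores):
    """Collapse an (N, D) feature set into one D-dim template descriptor,
    weighting each feature by a softmax over its informativeness score."""
    w = np.exp(scores - scores.max())  # shift for numerical stability
    w = w / w.sum()                    # softmax weights, sum to 1
    return w @ features                # weighted sum -> (D,)

feats = np.array([[1.0, 0.0],   # e.g. a sharp, frontal face
                  [0.0, 1.0]])  # e.g. a blurry, low-resolution face
scores = np.array([2.0, -2.0])  # higher score = judged more informative
g = aggregate(feats, scores)    # dominated by the high-scoring feature
```

The design point is that the weighting depends only on the features themselves, so no metadata or style information is needed for very low-resolution inputs.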
DIOR: Dataset for Indoor-Outdoor Reidentification -- Long Range 3D/2D Skeleton Gait Collection Pipeline, Semi-Automated Gait Keypoint Labeling and Baseline Evaluation Methods
In recent times, there is an increased interest in the identification and
re-identification of people at long distances, such as from rooftop cameras,
UAV cameras, street cams, and others. Such recognition needs to go beyond face
and use whole-body markers such as gait. However, datasets to train and test
such recognition algorithms are not widely prevalent, and fewer are labeled.
This paper introduces DIOR -- a framework for data collection and semi-automated
annotation, together with a dataset of 14 subjects and 1.649 million RGB
frames with 3D/2D skeleton gait labels, including 200 thousand frames from a
long-range camera. Our approach leverages advanced 3D computer vision
techniques to attain pixel-level accuracy in indoor settings with motion
capture systems. Additionally, for outdoor long-range settings, we remove the
dependency on motion capture systems and adopt a low-cost, hybrid 3D computer
vision and learning pipeline with only 4 low-cost RGB cameras, successfully
achieving precise skeleton labeling on far-away subjects, even when their
height is limited to a mere 20-25 pixels within an RGB frame. On publication,
we will make our pipeline open for others to use.
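The abstract's hybrid multi-camera labeling pipeline is not public yet; as a hedged sketch of the core geometric step such a multi-view pipeline typically relies on, here is Direct Linear Transform (DLT) triangulation of a keypoint from several calibrated RGB views. The camera matrices below are toy values for illustration, not DIOR's actual calibration:

```python
import numpy as np

def triangulate(projections, points_2d):
    """Direct Linear Transform: recover a 3D point from its 2D
    observations in several calibrated cameras.
    projections: list of 3x4 camera matrices P_i
    points_2d:   list of (u, v) pixel observations, one per camera"""
    rows = []
    for P, (u, v) in zip(projections, points_2d):
        rows.append(u * P[2] - P[0])   # each view contributes two
        rows.append(v * P[2] - P[1])   # linear constraints on X
    _, _, vt = np.linalg.svd(np.asarray(rows))
    X = vt[-1]                         # null-space vector of the system
    return X[:3] / X[3]                # de-homogenize

def project(P, X):
    """Project a 3D point with camera matrix P to pixel coordinates."""
    x = P @ np.append(X, 1.0)
    return x[:2] / x[2]

# Two toy cameras looking down z, offset along x (hypothetical calibration).
P1 = np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = np.hstack([np.eye(3), np.array([[-1.0], [0.0], [0.0]])])
X_true = np.array([0.5, 0.2, 2.0])
X_hat = triangulate([P1, P2], [project(P1, X_true), project(P2, X_true)])
```

With four cameras instead of two, the system is overdetermined and the SVD gives a least-squares estimate, which is what makes labeling 20-25 pixel subjects tractable without a motion capture system.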
Hear The Flow: Optical Flow-Based Self-Supervised Visual Sound Source Localization
Learning to localize the sound source in videos without explicit annotations
is a novel area of audio-visual research. Existing work in this area focuses on
creating attention maps to capture the correlation between the two modalities
to localize the source of the sound. In a video, oftentimes, the objects
exhibiting movement are the ones generating the sound. In this work, we capture
this characteristic by modeling the optical flow in a video as a prior to
better aid in localizing the sound source. We further demonstrate that the
addition of flow-based attention substantially improves visual sound source
localization. Finally, we benchmark our method on standard sound source
localization datasets and achieve state-of-the-art performance on the
SoundNet-Flickr and VGG Sound Source datasets. Code:
https://github.com/denfed/heartheflow
Comment: Accepted to WACV 202
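The paper's flow-based attention is learned end-to-end; the sketch below only captures the stated prior in its simplest form — modulating an audio-visual similarity map by normalized optical-flow magnitude so that moving regions are up-weighted as likely sound sources. The array shapes and the multiplicative combination are illustrative assumptions, not the paper's architecture:

```python
import numpy as np

def flow_prior_attention(av_sim, flow):
    """Modulate an audio-visual similarity map with an optical-flow prior.
    av_sim: (H, W) audio-visual similarity map
    flow:   (H, W, 2) optical flow field (dx, dy per pixel)"""
    mag = np.linalg.norm(flow, axis=-1)          # per-pixel motion magnitude
    prior = mag / (mag.max() + 1e-8)             # normalize to [0, 1]
    att = av_sim * prior                         # flow-modulated attention
    return att / (att.sum() + 1e-8)              # normalize to a distribution

# Toy 2x2 example: similarity is uniform, only the top-left pixel moves.
sim = np.ones((2, 2))
flow = np.zeros((2, 2, 2))
flow[0, 0] = [3.0, 4.0]                          # magnitude 5
att = flow_prior_attention(sim, flow)            # mass concentrates at (0, 0)
```

The example shows why the prior helps: when the audio-visual correlation alone is ambiguous (uniform here), motion disambiguates the source location.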
Digital Image Enhancement using Normalization Techniques and their application to Palm Leaf Manuscripts
Palm leaves were one of the earliest forms of writing media, and their use as writing material in South and Southeast Asia has been recorded from as early as the fifth century B.C. until as recently as the late 19th century. Palm leaf manuscripts relating to art and architecture, mathematics, astronomy, astrology, and medicine, dating back several hundreds of years, are still available for reference today thanks to many ongoing efforts for the preservation of ancient documents by libraries and universities around the world. Palm leaf manuscripts typically last a few centuries, but with time the palm leaves degrade and the writing becomes too illegible to be useful in any form. Image processing techniques can help enhance the images of these manuscripts so as to enable retrieval of the written text from these degraded documents. In this paper, we propose a set of transform-based methods for enhancing digital images of palm leaf manuscripts. The methods first approximate the background of a gray scale image using one of two models – piece-wise linear or nonlinear. The background approximations are designed to overcome unevenness of the document background. The background normalization algorithms are then applied to the component channel images of a color palm leaf image. We also propose two local adaptive normalization algorithms for extracting enhanced gray scale images from color palm leaf images. The techniques are tested on a set of palm leaf images from various sources, and the preliminary results show significant improvement in readability. The techniques can also be used to enhance images of ancient, historical, degraded papyrus and paper documents.
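The paper's piece-wise linear and nonlinear background models are not reproduced here; this sketch substitutes a simple box-filter background estimate to show the overall normalize-by-background idea on a synthetic, unevenly lit "leaf":

```python
import numpy as np

def normalize_background(img, win=15):
    """Divide a gray scale image by a smoothed estimate of its background
    to flatten uneven illumination (box-filter estimate, standing in for
    the paper's piece-wise linear / nonlinear fits)."""
    h, w = img.shape
    pad = win // 2
    padded = np.pad(img.astype(float), pad, mode='edge')
    bg = np.empty((h, w))
    for i in range(h):                 # local mean = crude background model
        for j in range(w):
            bg[i, j] = padded[i:i + win, j:j + win].mean()
    norm = img / (bg + 1e-8)           # ~1 on background, well below 1 on ink
    return np.clip(norm / norm.max(), 0, 1)

# Synthetic leaf: bright background with a left-to-right illumination ramp
# and a dark "ink" text line across the middle.
x = np.linspace(120, 230, 64)
img = np.tile(x, (64, 1))
img[30:34, :] = 40
flat = normalize_background(img)       # ramp flattened, ink still dark
```

After normalization the background rows become nearly constant while the ink stays dark, which is the property that makes subsequent text retrieval from degraded documents feasible.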
Text Extraction from Gray Scale Historical Document Images Using Adaptive Local Connectivity Map
This paper presents an algorithm that uses an adaptive local connectivity map for retrieving text lines from complex handwritten documents, such as handwritten historical manuscripts. The algorithm is designed to solve the particularly difficult problems seen in handwritten documents: fluctuating text lines, touching or crossing text lines, and low-quality images that do not lend themselves easily to binarization. The algorithm is based on connectivity features similar to local projection profiles, which can be extracted directly from gray scale images. The proposed technique is robust and has been tested on a set of complex historical handwritten documents, such as Newton’s and Galileo’s manuscripts. Preliminary testing shows a successful location rate of above 95% for the test set.
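As a rough illustration of the idea (not the paper's exact algorithm), the sketch below computes a local-projection-style connectivity map directly on a gray scale image, then thresholds row averages to locate candidate text lines; the window size and threshold are arbitrary choices for the toy example:

```python
import numpy as np

def local_connectivity_map(gray, win=9):
    """For each pixel, sum the 'ink' (inverted intensity) over a horizontal
    window, directly on the gray scale image — no binarization needed.
    Rows passing through text accumulate high connectivity values."""
    ink = 1.0 - gray                   # dark text -> high ink response
    h, w = gray.shape
    pad = win // 2
    padded = np.pad(ink, ((0, 0), (pad, pad)), mode='edge')
    acm = np.empty((h, w))
    for j in range(w):
        acm[:, j] = padded[:, j:j + win].sum(axis=1)
    return acm

def text_line_rows(acm, thresh):
    """Rows whose mean connectivity exceeds `thresh` -> candidate lines."""
    return np.where(acm.mean(axis=1) > thresh)[0]

# Toy page: light background with two dark, slightly noisy text bands.
rng = np.random.default_rng(0)
page = np.full((40, 60), 0.9)
page[8:12] = 0.2 + 0.05 * rng.random((4, 60))
page[25:29] = 0.2 + 0.05 * rng.random((4, 60))
rows = text_line_rows(local_connectivity_map(page), thresh=3.0)
```

Working on the gray scale image directly is the key design choice: fluctuating or touching lines that defeat binarization still produce clear ridges in the connectivity map.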