2,804 research outputs found
Speech Processing in Computer Vision Applications
Deep learning has been recently proven to be a viable asset in determining features in the field of Speech Analysis. Deep learning methods like Convolutional Neural Networks facilitate the expansion of specific feature information in waveforms, allowing networks to create more feature dense representations of data. Our work attempts to address the problem of re-creating a face given a speaker\u27s voice and speaker identification using deep learning methods. In this work, we first review the fundamental background in speech processing and its related applications. Then we introduce novel deep learning-based methods to speech feature analysis. Finally, we will present our deep learning approaches to speaker identification and speech to face synthesis. The presented method can convert a speaker audio sample to an image of their predicted face. This framework is composed of several chained together networks, each with an essential step in the conversion process. These include Audio embedding, encoding, and face generation networks, respectively. Our experiments show that certain features can map to the face and that with a speaker\u27s voice, DNNs can create their face and that a GUI could be used in conjunction to display a speaker recognition network\u27s data
One-to-many face recognition with bilinear CNNs
The recent explosive growth in convolutional neural network (CNN) research
has produced a variety of new architectures for deep learning. One intriguing
new architecture is the bilinear CNN (B-CNN), which has shown dramatic
performance gains on certain fine-grained recognition problems [15]. We apply
this new CNN to the challenging new face recognition benchmark, the IARPA Janus
Benchmark A (IJB-A) [12]. It features faces from a large number of identities
in challenging real-world conditions. Because the face images were not
identified automatically using a computerized face detection system, it does
not have the bias inherent in such a database. We demonstrate the performance
of the B-CNN model beginning from an AlexNet-style network pre-trained on
ImageNet. We then show results for fine-tuning using a moderate-sized and
public external database, FaceScrub [17]. We also present results with
additional fine-tuning on the limited training data provided by the protocol.
In each case, the fine-tuned bilinear model shows substantial improvements over
the standard CNN. Finally, we demonstrate how a standard CNN pre-trained on a
large face database, the recently released VGG-Face model [20], can be
converted into a B-CNN without any additional feature training. This B-CNN
improves upon the CNN performance on the IJB-A benchmark, achieving 89.5%
rank-1 recall.Comment: Published version at WACV 201
What-and-Where to Match: Deep Spatially Multiplicative Integration Networks for Person Re-identification
Matching pedestrians across disjoint camera views, known as person
re-identification (re-id), is a challenging problem that is of importance to
visual recognition and surveillance. Most existing methods exploit local
regions within spatial manipulation to perform matching in local
correspondence. However, they essentially extract \emph{fixed} representations
from pre-divided regions for each image and perform matching based on the
extracted representation subsequently. For models in this pipeline, local finer
patterns that are crucial to distinguish positive pairs from negative ones
cannot be captured, and thus making them underperformed. In this paper, we
propose a novel deep multiplicative integration gating function, which answers
the question of \emph{what-and-where to match} for effective person re-id. To
address \emph{what} to match, our deep network emphasizes common local patterns
by learning joint representations in a multiplicative way. The network
comprises two Convolutional Neural Networks (CNNs) to extract convolutional
activations, and generates relevant descriptors for pedestrian matching. This
thus, leads to flexible representations for pair-wise images. To address
\emph{where} to match, we combat the spatial misalignment by performing
spatially recurrent pooling via a four-directional recurrent neural network to
impose spatial dependency over all positions with respect to the entire image.
The proposed network is designed to be end-to-end trainable to characterize
local pairwise feature interactions in a spatially aligned manner. To
demonstrate the superiority of our method, extensive experiments are conducted
over three benchmark data sets: VIPeR, CUHK03 and Market-1501.Comment: Published at Pattern Recognition, Elsevie
- …