Hierarchical Cross-Modal Talking Face Generation with Dynamic Pixel-Wise Loss
We devise a cascade GAN approach to generate talking face video, which is
robust to different face shapes, view angles, facial characteristics, and noisy
audio conditions. Instead of learning a direct mapping from audio to video
frames, we propose first to transfer audio to high-level structure, i.e., the
facial landmarks, and then to generate video frames conditioned on the
landmarks. Compared to a direct audio-to-image approach, our cascade approach
avoids fitting spurious correlations between audiovisual signals that are
irrelevant to the speech content. Humans are sensitive to temporal
discontinuities and subtle artifacts in video. To avoid such pixel jittering
problems and to force the network to focus on audiovisual-correlated regions,
we propose a novel dynamically adjustable pixel-wise loss with an attention
mechanism. Furthermore, to generate a sharper image with well-synchronized
facial movements, we propose a novel regression-based discriminator structure,
which considers sequence-level information along with frame-level information.
Thoughtful experiments on several datasets and real-world samples demonstrate
significantly better results obtained by our method than the state-of-the-art
methods in both quantitative and qualitative comparisons.
NEVIS'22: A Stream of 100 Tasks Sampled from 30 Years of Computer Vision Research
We introduce the Never Ending VIsual-classification Stream (NEVIS'22), a
benchmark consisting of a stream of over 100 visual classification tasks,
sorted chronologically and extracted from papers sampled uniformly from
computer vision proceedings spanning the last three decades. The resulting
stream reflects what the research community thought was meaningful at any point
in time. Despite being limited to classification, the resulting stream has a
rich diversity of tasks from OCR, to texture analysis, crowd counting, scene
recognition, and so forth. The diversity is also reflected in the wide range of
dataset sizes, spanning over four orders of magnitude. Overall, NEVIS'22 poses
an unprecedented challenge for current sequential learning approaches due to
the scale and diversity of tasks, yet with a low entry barrier as it is limited
to a single modality and each task is a classical supervised learning problem.
Moreover, we provide a reference implementation including strong baselines and
a simple evaluation protocol to compare methods in terms of their trade-off
between accuracy and compute. We hope that NEVIS'22 can be useful to
researchers working on continual learning, meta-learning, AutoML and more
generally sequential learning, and help these communities join forces towards
more robust and efficient models that efficiently adapt to a never ending
stream of data. Implementations have been made available at
https://github.com/deepmind/dm_nevis
Object-Centric Open-Vocabulary Image-Retrieval with Aggregated Features
The task of open-vocabulary object-centric image retrieval involves the
retrieval of images containing a specified object of interest, delineated by an
open-set text query. As working on large image datasets becomes standard,
solving this task efficiently has gained significant practical importance.
Applications include targeted performance analysis of retrieved images using
ad-hoc queries and hard example mining during training. Recent advancements in
contrastive-based open vocabulary systems have yielded remarkable
breakthroughs, facilitating large-scale open vocabulary image retrieval.
However, these approaches use a single global embedding per image, thereby
constraining the system's ability to retrieve images containing relatively
small object instances. Alternatively, incorporating local embeddings from
detection pipelines faces scalability challenges, making it unsuitable for
retrieval from large databases.
In this work, we present a simple yet effective approach to object-centric
open-vocabulary image retrieval. Our approach aggregates dense embeddings
extracted from CLIP into a compact representation, essentially combining the
scalability of image retrieval pipelines with the object identification
capabilities of dense detection methods. We demonstrate the effectiveness of
our scheme on this task by achieving significantly better results than global
feature approaches on three datasets, increasing accuracy by up to 15 mAP
points. We further integrate our scheme into a large-scale retrieval framework
and demonstrate our method's advantages in terms of scalability and
interpretability.
Comment: BMVC 202
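The contrast the abstract draws between a single global embedding and a compact set of aggregated local embeddings can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the 3-D vectors stand in for CLIP patch and text embeddings, and `aggregate` uses plain contiguous pooling as a stand-in for the aggregation step, which the abstract does not specify.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]

def aggregate(dense, k):
    """Pool dense patch embeddings into at most k vectors.

    Hypothetical stand-in: patches are averaged in contiguous groups.
    """
    size = math.ceil(len(dense) / k)
    return [mean_vector(dense[i:i + size]) for i in range(0, len(dense), size)]

# Toy image: seven background patches and one small-object patch.
background = [1.0, 0.0, 0.0]
small_object = [0.0, 1.0, 0.0]
dense = [background] * 5 + [small_object] + [background] * 2

query = [0.0, 1.0, 0.0]  # toy text embedding of the object of interest

# One global vector per image: the small object is averaged away.
global_score = cosine(mean_vector(dense), query)

# A few aggregated vectors per image: score by the best-matching one.
local_score = max(cosine(v, query) for v in aggregate(dense, k=4))

print(f"global: {global_score:.3f}, aggregated: {local_score:.3f}")
```

In this toy setup the aggregated representation scores the image much higher for the query than the global mean does, while still storing only four vectors instead of eight.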
Pedestrian Attribute Recognition: A Survey
Recognizing pedestrian attributes is an important task in the computer vision
community because it plays an important role in video surveillance. Many
algorithms have been proposed to handle this task. The goal of this paper is to
review existing works, both those using traditional methods and those based on
deep learning networks. First, we introduce the background of pedestrian
attribute recognition (PAR, for short), including the fundamental concepts of
pedestrian attributes and the corresponding challenges. Second, we introduce
existing benchmarks, including popular datasets and evaluation criteria. Third,
we analyse the concepts of multi-task learning and multi-label learning, and
explain the relations between these two learning paradigms and pedestrian
attribute recognition. We also review some popular network architectures that
have been widely applied in the deep learning community. Fourth, we analyse
popular solutions for this task, such as attribute-group and part-based
approaches. Fifth, we show some applications that take pedestrian
attributes into consideration and achieve better performance. Finally, we
summarize this paper and give several possible research directions for
pedestrian attribute recognition. The project page of this paper can be found
at https://sites.google.com/view/ahu-pedestrianattributes/.
Comment: Check our project page for a high-resolution version of this survey:
https://sites.google.com/view/ahu-pedestrianattributes
Machine Learning: When and Where the Horses Went Astray?
Machine Learning is usually defined as a subfield of AI concerned with
information extraction from raw data sets. Despite its common acceptance and
widespread recognition, this definition is wrong and groundless. Meaningful
information does not belong to the data that bear it. It belongs to the
observers of the data and it is a shared agreement and a convention among them.
Therefore, this private information cannot be extracted from the data by any
means. Therefore, all further attempts of Machine Learning apologists to
justify their funny business are inappropriate.
Comment: The paper is accepted to be published in the Machine Learning serie
of the InTec
Efficient 3D data compression through parameterization of free-form surface patches
This paper presents a new method for 3D data compression based on parameterization of surface patches. The technique is applied to data that can be defined as single-valued functions; this is the case for 3D patches obtained using standard 3D scanners. The method defines a number of mesh cutting planes, and the intersections of these planes with the mesh define a set of sampling points. These points have an explicit structure that allows us to define both the x and y coordinates parametrically. The z values are interpolated using high-degree polynomials, and results show that compression rates over 99% are achieved while preserving the quality of the mesh.
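As a rough illustration of the idea, the sketch below treats one cutting-plane profile of a single-valued surface: x is implied by the sample index, so only polynomial coefficients for z need to be stored. This is a minimal sketch under stated assumptions, not the paper's implementation: the profile data, the cubic degree, and the helper names `fit_polynomial` and `evaluate` are all hypothetical, and the fit uses plain least squares via normal equations.

```python
def fit_polynomial(xs, zs, degree):
    """Least-squares polynomial fit via normal equations (adequate for low degree)."""
    n = degree + 1
    # Normal equations (A^T A) c = A^T z for the Vandermonde matrix A[i][j] = xs[i]**j.
    ata = [[sum(x ** (r + c) for x in xs) for c in range(n)] for r in range(n)]
    atz = [sum(z * x ** r for x, z in zip(xs, zs)) for r in range(n)]
    # Gaussian elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(ata[r][col]))
        ata[col], ata[pivot] = ata[pivot], ata[col]
        atz[col], atz[pivot] = atz[pivot], atz[col]
        for r in range(col + 1, n):
            factor = ata[r][col] / ata[col][col]
            for c in range(col, n):
                ata[r][c] -= factor * ata[col][c]
            atz[r] -= factor * atz[col]
    # Back substitution; coeffs[j] multiplies x**j.
    coeffs = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(ata[r][c] * coeffs[c] for c in range(r + 1, n))
        coeffs[r] = (atz[r] - tail) / ata[r][r]
    return coeffs

def evaluate(coeffs, x):
    """Evaluate the polynomial at x using Horner's rule."""
    result = 0.0
    for c in reversed(coeffs):
        result = result * x + c
    return result

# Hypothetical profile: 100 samples where one cutting plane meets the mesh.
num_points = 100
step = 1.0 / (num_points - 1)
xs = [i * step for i in range(num_points)]  # x is recoverable from the index alone
zs = [0.5 + 0.2 * x - 0.3 * x ** 2 + 0.1 * x ** 3 for x in xs]

coeffs = fit_polynomial(xs, zs, degree=3)
max_error = max(abs(evaluate(coeffs, x) - z) for x, z in zip(xs, zs))
stored_fraction = len(coeffs) / num_points  # numbers kept vs. original z samples

print(f"kept {len(coeffs)} of {num_points} values, max error {max_error:.2e}")
```

Real scanned profiles would need higher degrees, as the abstract notes, and a numerically sturdier fitting routine than normal equations.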