Delving Deep into the Sketch and Photo Relation
"Sketches drawn by humans can play a similar role to photos in terms of conveying shape, posture as well as fine-grained information, and this fact has stimulated one line of cross-domain research that is related to sketch and photo, including sketch-based photo synthesis and retrieval. In this thesis, we aim to further investigate the relationship between sketch and photo. More specifically, we study certain under- explored traits in this relationship, and propose novel applications to reinforce the understanding of sketch and photo relation.Our exploration starts with the problem of sketch-based photo synthesis, where the unique trait of non-rigid alignment between sketch and photo is overlooked in existing research. We then carry on with our investigation from a new angle to study whether sketch can facilitate photo classifier generation. Building upon this, we continue to explore how sketch and photo are linked together on a more fine-grained level by tackling with the sketch-based photo segmenter prediction. Furthermore, we address the data scarcity issue identified in nearly all sketch-photo-related applications by examining their inherent correlation in the semantic aspect using sketch-based image retrieval (SBIR) as a test-bed. In general, we make four main contributions to the research on relationship between sketch and photo.Firstly, to mitigate the effect of deformation in sketch-based photo synthesis, we introduce the spatial transformer network to our image-image regression framework, which subtly deals with non-rigid alignment between the sketches and photos. The qualitative and quantitative experiments consistently reveal the superior quality of our synthesised photos over those generated by existing approaches.Secondly, sketch-based photo classifier generation is achieved with a novel model regression network, which maps the sketch to the parameters of photo classification model. It is shown that our model regression network is able to generalise across categories and photo classifiers for novel classes not involved in training are just a sketch away. Comprehensive experiments illustrate the promising performance of the generated binary and multi-class photo classifiers, and demonstrate that sketches can also be employed to enhance the granularity of existing photo classifiers.Thirdly, to achieve the goal of sketch-based photo segmentation, we propose a photo segmentation model generation algorithm that predicts the weights of a deep photo segmentation network according to the input sketch. The results confirm that one single sketch is the only prerequisite for unseen category photo segmentation, and the segmentation performance can be further improved by utilising sketch that is aligned with the object to be segmented in shape and position.Finally, we present an unsupervised representation learning framework for SBIR, the purpose of which is to eliminate the barrier imposed by data annotation scarcity. Prototype and memory bank reinforced joint distribution optimal transport is integrated into the unsupervised representation learning framework, so that the mapping between the sketches and photos could be automatically detected to learn a semantically meaningful yet domain-agnostic feature space. Extensive experiments and feature visualisation validate the efficacy of our proposed algorithm.
Rethink Cross-Modal Fusion in Weakly-Supervised Audio-Visual Video Parsing
Existing works on weakly-supervised audio-visual video parsing adopt the hybrid attention network (HAN) as the multi-modal embedding to capture cross-modal context. HAN embeds the audio and visual modalities with a shared network, where cross-attention is performed at the input. However, such an early-fusion method highly entangles the two non-fully-correlated modalities and leads to sub-optimal performance in detecting single-modality events. To deal with this problem, we propose a messenger-guided mid-fusion transformer that reduces uncorrelated cross-modal context in the fusion. The messengers condense the full cross-modal context into a compact representation that preserves only useful cross-modal information. Furthermore, because microphones capture audio events from all directions while cameras only record visual events within a restricted field of view, unaligned cross-modal context from audio occurs more frequently in visual event prediction. We thus propose cross-audio prediction consistency to suppress the impact of irrelevant audio information on visual event prediction. Experiments consistently illustrate the superior performance of our framework compared to existing state-of-the-art methods.

Comment: WACV 202
Generalized Few-Shot Point Cloud Segmentation Via Geometric Words
Existing fully-supervised point cloud segmentation methods struggle in dynamic testing environments with emerging new classes. Few-shot point cloud segmentation algorithms address this problem by learning to adapt to new classes at the cost of segmentation accuracy on the base classes, which severely impedes their practicality. This largely motivates us to present the first attempt at a more practical paradigm of generalized few-shot point cloud segmentation, which requires the model to generalize to new categories with only a few support point clouds while simultaneously retaining the capability to segment base classes. We propose geometric words to represent geometric components shared between the base and novel classes, and incorporate them into a novel geometric-aware semantic representation to facilitate better generalization to the new classes without forgetting the old ones. Moreover, we introduce geometric prototypes to guide the segmentation with geometric prior knowledge. Extensive experiments on S3DIS and ScanNet consistently illustrate the superior performance of our method over baseline methods. Our code is available at: https://github.com/Pixie8888/GFS-3DSeg_GWs.

Comment: Accepted by ICCV 202
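As a rough illustration of the geometric-word idea, the snippet below augments per-point features with a soft mixture over a shared codebook (shapes, temperature, and the concatenation scheme are assumptions; the paper's actual geometric-aware semantic representation may differ):

```python
import torch
import torch.nn.functional as F

def geometric_aware_features(point_feats, geo_words, tau=0.1):
    """Mix shared 'geometric words' into per-point semantic features.
    point_feats: (N, D) per-point features; geo_words: (K, D) codebook
    assumed to be learned over the base classes."""
    # cosine similarity between each point and each geometric word
    sim = F.normalize(point_feats, dim=-1) @ F.normalize(geo_words, dim=-1).T  # (N, K)
    assign = F.softmax(sim / tau, dim=-1)        # soft word assignment
    geo_component = assign @ geo_words           # (N, D) geometry mixture
    return torch.cat([point_feats, geo_component], dim=-1)  # (N, 2D)
```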
End-To-End Semi-supervised Learning for Differentiable Particle Filters
Recent advances in incorporating neural networks into particle filters provide the flexibility needed to apply particle filters in large-scale real-world applications. The dynamic and measurement models in this framework are learnable through the differentiable implementation of particle filters. Past efforts in optimising such models often require knowledge of the true states, which can be expensive to obtain or even unavailable in practice. In this paper, in order to reduce the demand for annotated data, we present an end-to-end learning objective based upon the maximisation of a pseudo-likelihood function, which can improve state estimation when a large portion of the true states is unknown. We assess the performance of the proposed method on state estimation tasks in robotics with simulated and real-world datasets.

Comment: Accepted in ICRA 202
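A common way to realise such an objective, shown below as a hedged sketch rather than the paper's exact estimator, is to maximise the log of the particle-weighted measurement likelihood at unlabelled time steps:

```python
import torch

def pseudo_loglik(log_obs_lik, log_weights):
    """Per-step pseudo-likelihood for unlabelled time steps: the log of the
    weighted average measurement likelihood over particles (a standard
    marginal-likelihood surrogate; an assumption, not the paper's exact form).
    log_obs_lik, log_weights: (B, N) for N particles."""
    # normalise the particle weights in log space
    log_w = log_weights - torch.logsumexp(log_weights, dim=-1, keepdim=True)
    return torch.logsumexp(log_w + log_obs_lik, dim=-1)  # (B,)

# training objective on unlabelled steps: maximise the pseudo-likelihood
# loss = -pseudo_loglik(log_p_y_given_x, log_w).mean()
```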
3D Building Reconstruction from Monocular Remote Sensing Images with Multi-level Supervisions
3D building reconstruction from monocular remote sensing images is an important and challenging research problem that has received increasing attention in recent years, owing to its low cost of data acquisition and availability for large-scale applications. However, existing methods rely on expensive 3D-annotated samples for fully-supervised training, restricting their application to large-scale cross-city scenarios. In this work, we propose MLS-BRN, a multi-level supervised building reconstruction network that can flexibly utilize training samples with different annotation levels to achieve better reconstruction results in an end-to-end manner. To alleviate the demand for full 3D supervision, we design two new modules, the Pseudo Building Bbox Calculator and the Roof-Offset guided Footprint Extractor, as well as new tasks and training strategies for different types of samples. Experimental results on several public and new datasets demonstrate that our proposed MLS-BRN achieves competitive performance using far fewer 3D-annotated samples, and significantly improves footprint extraction and 3D reconstruction performance compared with the current state-of-the-art. The code and datasets of this work will be released at https://github.com/opendatalab/MLS-BRN.git.

Comment: accepted by CVPR 202
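One way to picture multi-level supervision is a loss that activates terms only for the annotations a sample actually carries. The sketch below is illustrative; the level names, prediction heads, and loss choices are assumptions rather than MLS-BRN's actual tasks:

```python
import torch
import torch.nn.functional as F

def multi_level_loss(pred, sample):
    """Compose training losses by annotation level (illustrative).
    sample['level'] in {'footprint', 'height', 'full3d'} is an assumed
    convention; footprint masks are taken to be the cheapest, always-present
    annotation."""
    loss = F.binary_cross_entropy_with_logits(
        pred['footprint'], sample['footprint'])          # always supervised
    if sample['level'] in ('height', 'full3d'):          # building height known
        loss = loss + F.l1_loss(pred['height'], sample['height'])
    if sample['level'] == 'full3d':                      # full 3D annotation
        loss = loss + F.l1_loss(pred['roof_offset'], sample['roof_offset'])
    return loss
```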
Sketch-based Video Object Segmentation: Benchmark and Analysis
Reference-based video object segmentation is an emerging topic that aims to segment the target object in each video frame, as referred to by a given reference such as a language expression or a photo mask. However, language expressions can sometimes be vague in conveying an intended concept, and ambiguous when similar objects in one frame are hard to distinguish by language. Meanwhile, photo masks are costly to annotate and less practical to provide in a real application. This paper introduces a new task of sketch-based video object segmentation, an associated benchmark, and a strong baseline. Our benchmark includes three datasets, Sketch-DAVIS16, Sketch-DAVIS17 and Sketch-YouTube-VOS, which exploit human-drawn sketches as an informative yet low-cost reference for video object segmentation. We build on STCN, a popular baseline for the semi-supervised VOS task, and evaluate the most effective design for incorporating a sketch reference. Experimental results show that sketches are more effective and more annotation-efficient than other references, such as photo masks, language and scribbles.

Comment: BMVC 202
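One plausible way to plug a sketch into an STCN-style pipeline (an assumed design for illustration, not the benchmark's confirmed one) is to rasterise the sketch and let it stand in for the first-frame mask that the value encoder normally consumes alongside the frame:

```python
import torch
import torch.nn as nn

# Stand-in for STCN's value encoder: it normally takes the frame plus a
# 1-channel first-frame mask; here the rasterised sketch fills that slot.
value_encoder = nn.Conv2d(3 + 1, 64, 3, padding=1)

def init_memory(first_frame, sketch):
    """first_frame: (B, 3, H, W) RGB; sketch: (B, 1, H, W) rasterised strokes.
    Returns the initial value memory for subsequent frame-to-memory matching."""
    return value_encoder(torch.cat([first_frame, sketch], dim=1))
```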
Mobile phone short video use negatively impacts attention functions: an EEG study
Short-form video platforms have become seamlessly integrated into daily routines, yet it is important to recognize their potential adverse effects on both physical and mental health. Prior research has identified a detrimental impact of excessive short-form video consumption on attentional behavior, but the underlying neural mechanisms remain unexplored. In the current study, we aimed to investigate the effect of short-form video use on attentional functions, measured through the attention network test (ANT). A total of 48 participants, consisting of 35 females and 13 males with a mean age of 21.8 years, were recruited. The mobile phone short video addiction tendency questionnaire (MPSVATQ) and the self-control scale (SCS) were administered to assess short-video usage behavior and self-control ability. Electroencephalogram (EEG) data were recorded during completion of the ANT task. Correlation analysis showed a significant negative relationship between MPSVATQ scores and the theta power index reflecting executive control in the prefrontal region (r = −0.395, p = 0.007); this result was not observed with the theta power index computed from resting-state EEG data. Furthermore, a significant negative correlation was identified between MPSVATQ and SCS outcomes (r = −0.320, p = 0.026). These results suggest that an increased tendency toward mobile phone short video addiction could negatively impact self-control and diminish executive control within the realm of attentional functions. This study sheds light on the adverse consequences of short video consumption and underscores the importance of developing interventions to mitigate short video addiction.
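For readers who want to reproduce this style of analysis, the snippet below recomputes a Pearson brain-behaviour correlation on synthetic stand-in data (the variable names and data are illustrative, not the study's actual pipeline):

```python
import numpy as np
from scipy.stats import pearsonr

# Synthetic stand-ins: questionnaire scores vs. a task-related frontal theta
# power index for 48 participants (matching the study's sample size only).
rng = np.random.default_rng(0)
mpsvatq = rng.normal(50, 10, size=48)                       # addiction-tendency scores
theta_power = -0.4 * (mpsvatq - 50) + rng.normal(0, 8, 48)  # synthetic theta index

r, p = pearsonr(mpsvatq, theta_power)
print(f"r = {r:.3f}, p = {p:.3f}")  # the study itself reports r = -0.395, p = 0.007
```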
Sketch-a-Classifier: Sketch-based Photo Classifier Generation
Contemporary deep learning techniques have made image recognition a reasonably reliable technology. However, training effective photo classifiers typically takes numerous examples, which limits image recognition's scalability and its applicability to scenarios where images may not be available. This has motivated investigation into zero-shot learning, which addresses the issue via knowledge transfer from other modalities such as text. In this paper we investigate an alternative approach to synthesizing image classifiers: almost directly from a user's imagination, via free-hand sketch. This approach does not require the category to be nameable or describable via attributes, as in zero-shot learning. We achieve this by training a model regression network to map from free-hand sketch space to the space of photo classifiers. It turns out that this mapping can be learned in a category-agnostic way, allowing photo classifiers for new categories to be synthesized by a user with no need for annotated training photos. We also demonstrate that this modality of classifier generation can be used to enhance the granularity of an existing photo classifier, or as a complement to name-based zero-shot learning.

Comment: published in CVPR 2018 as a spotlight
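In code, the core idea reduces to regressing classifier parameters from a sketch embedding and applying them to photo features. The following is a minimal sketch under assumed feature dimensions and layer sizes, not the paper's exact architecture:

```python
import torch
import torch.nn as nn

class ModelRegressionNet(nn.Module):
    """Map a sketch embedding to the weights of a binary photo classifier
    (a minimal sketch of the idea; dimensions are assumptions)."""
    def __init__(self, sketch_dim=512, photo_dim=2048):
        super().__init__()
        self.regressor = nn.Sequential(
            nn.Linear(sketch_dim, 1024), nn.ReLU(),
            nn.Linear(1024, photo_dim + 1),  # weight vector plus bias
        )

    def forward(self, sketch_emb, photo_feats):
        # sketch_emb: (B, sketch_dim); photo_feats: (N, photo_dim)
        params = self.regressor(sketch_emb)      # (B, photo_dim + 1)
        w, b = params[:, :-1], params[:, -1:]    # generated classifier params
        return photo_feats @ w.T + b.T           # (N, B) classification scores
```

Because the regressor is trained across many categories, at test time a single novel-class sketch yields a usable photo classifier without any annotated training photos.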