Learning Binary Residual Representations for Domain-specific Video Streaming
We study domain-specific video streaming. Specifically, we target a streaming
setting where the videos to be streamed from a server to a client are all in
the same domain and they have to be compressed to a small size for low-latency
transmission. Several popular video streaming services, such as the video game
streaming services of GeForce Now and Twitch, fall in this category. While
conventional video compression standards such as H.264 are commonly used for
this task, we hypothesize that one can leverage the property that the videos
are all in the same domain to achieve better video quality. Based on this
hypothesis, we propose a novel video compression pipeline. Specifically, we
first apply H.264 to compress domain-specific videos. We then train a novel
binary autoencoder to encode the leftover domain-specific residual information
frame-by-frame into binary representations. These binary representations are
then compressed and sent to the client together with the H.264 stream. In our
experiments, we show that our pipeline yields consistent gains over standard
H.264 compression across several benchmark datasets while using the same
channel bandwidth.
Comment: Accepted in AAAI'18. Project website at
https://research.nvidia.com/publication/2018-02_Learning-Binary-Residua
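As a rough illustration of the residual-coding idea described above, a minimal sketch in PyTorch follows; the layer widths, the sign-based binarizer with a straight-through gradient, and all class names are assumptions made for exposition, not the authors' released architecture.

    import torch
    import torch.nn as nn

    class Binarizer(torch.autograd.Function):
        # Maps features to {-1, +1} bits; gradient passes straight through.
        @staticmethod
        def forward(ctx, x):
            return torch.sign(x)
        @staticmethod
        def backward(ctx, grad_output):
            return grad_output

    class BinaryResidualAE(nn.Module):
        # Hypothetical autoencoder for the residual left after H.264 decoding.
        def __init__(self, bits=32):
            super().__init__()
            self.encoder = nn.Sequential(
                nn.Conv2d(3, 64, 4, stride=2, padding=1), nn.ReLU(),
                nn.Conv2d(64, bits, 4, stride=2, padding=1), nn.Tanh())
            self.decoder = nn.Sequential(
                nn.ConvTranspose2d(bits, 64, 4, stride=2, padding=1), nn.ReLU(),
                nn.ConvTranspose2d(64, 3, 4, stride=2, padding=1))
        def forward(self, residual):
            code = Binarizer.apply(self.encoder(residual))  # binary code to transmit
            return self.decoder(code), code

    # Usage sketch: residual = original_frame - h264_decoded_frame; the client
    # adds the decoded residual back onto the H.264 frame it already received.
    model = BinaryResidualAE()
    residual = torch.randn(1, 3, 64, 64)
    recon, code = model(residual)
    loss = torch.mean((recon - residual) ** 2)  # simple reconstruction objective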
Weakly-supervised Caricature Face Parsing through Domain Adaptation
A caricature is an artistic rendering of a person's picture in which certain
striking characteristics are abstracted or exaggerated to create a humorous
or sarcastic effect. For numerous caricature-related applications such as
attribute recognition and caricature editing, face parsing is an essential
pre-processing step that provides a complete facial structure understanding.
However, current state-of-the-art face parsing methods require large amounts
of pixel-level labeled data, and collecting such annotations for caricatures
is tedious and labor-intensive. For real photos, in contrast, numerous labeled
face parsing datasets already exist. We therefore formulate caricature face
parsing as a domain adaptation problem, where real photos serve as the source
domain and caricatures as the target. Specifically, we first leverage a
spatial-transformer-based
network to enable shape domain shifts. A feed-forward style transfer network is
then utilized to capture texture-level domain gaps. With these two steps, we
synthesize face caricatures from real photos, and thus we can use parsing
ground truths of the original photos to learn the parsing model. Experimental
results on the synthetic and real caricatures demonstrate the effectiveness of
the proposed domain adaptation algorithm. Code is available at:
https://github.com/ZJULearning/CariFaceParsing.
Comment: Accepted in ICIP 2019, code and model are available at
https://github.com/ZJULearning/CariFaceParsing
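To make the two-step synthesis concrete, a rough PyTorch sketch is given below; the shape_warp and style_transfer modules, their interfaces, and the warping details are hypothetical stand-ins for the spatial transformer and feed-forward style transfer networks, not the released code.

    import torch
    import torch.nn.functional as F

    def synthesize_caricature(photo, label, shape_warp, style_transfer):
        # photo: (B, 3, H, W) real photo; label: (B, 1, H, W) integer parsing map.
        # Step 1: the spatial transformer predicts a sampling grid that
        # exaggerates facial shape (hypothetical interface).
        grid = shape_warp(photo)                               # (B, H, W, 2)
        warped_photo = F.grid_sample(photo, grid, align_corners=False)
        # Warp the parsing ground truth with the same grid so labels stay aligned.
        warped_label = F.grid_sample(label.float(), grid, mode='nearest',
                                     align_corners=False).long()
        # Step 2: feed-forward style transfer closes the texture-level domain gap.
        synthetic = style_transfer(warped_photo)
        return synthetic, warped_label

    # The parsing network is then trained on (synthetic, warped_label) pairs
    # with an ordinary cross-entropy loss, e.g.
    # loss = F.cross_entropy(parser(synthetic), warped_label.squeeze(1))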
Deep Image Harmonization
Compositing is one of the most common operations in photo editing. To
generate realistic composites, the appearances of foreground and background
need to be adjusted to make them compatible. Previous approaches to harmonize
composites have focused on learning statistical relationships between
hand-crafted appearance features of the foreground and background, which is
especially unreliable when the contents of the two layers are vastly different.
In this work, we propose an end-to-end deep convolutional neural network for
image harmonization, which can capture both the context and semantic
information of the composite images during harmonization. We also introduce an
efficient way to collect large-scale and high-quality training data that can
facilitate the training process. Experiments on the synthesized dataset and
real composite images show that the proposed network outperforms previous
state-of-the-art methods.
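A minimal encoder-decoder sketch of this kind of harmonization network, assuming PyTorch, is shown below: the composite and its foreground mask are fed in together and the adjusted image is regressed, with channel counts and depth chosen purely for illustration rather than taken from the paper.

    import torch
    import torch.nn as nn

    class HarmonizationNet(nn.Module):
        def __init__(self):
            super().__init__()
            # Encoder aggregates context from the whole composite (image + mask).
            self.encoder = nn.Sequential(
                nn.Conv2d(4, 64, 4, stride=2, padding=1), nn.ReLU(),
                nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.ReLU())
            # Decoder reconstructs an appearance-adjusted image.
            self.decoder = nn.Sequential(
                nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(),
                nn.ConvTranspose2d(64, 3, 4, stride=2, padding=1), nn.Sigmoid())

        def forward(self, composite, mask):
            x = torch.cat([composite, mask], dim=1)  # mask marks the pasted foreground
            return self.decoder(self.encoder(x))

    net = HarmonizationNet()
    composite = torch.rand(1, 3, 256, 256)
    mask = torch.rand(1, 1, 256, 256)
    harmonized = net(composite, mask)                # (1, 3, 256, 256)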
Delving into Motion-Aware Matching for Monocular 3D Object Tracking
Recent advances of monocular 3D object detection facilitate the 3D
multi-object tracking task based on low-cost camera sensors. In this paper, we
find that the motion cue of objects along different time frames is critical in
3D multi-object tracking, which is less explored in existing monocular-based
approaches. To this end, we propose MoMA-M3T, a motion-aware framework for
monocular 3D MOT that consists of three motion-aware components. First, we
represent the possible movement of an object relative to all object tracklets
in the feature space as its motion features. Then, we model each historical
object tracklet along the time axis from a spatial-temporal perspective via a
motion transformer. Finally, we
propose a motion-aware matching module to associate historical object tracklets
and current observations as final tracking results. We conduct extensive
experiments on the nuScenes and KITTI datasets to demonstrate that our MoMA-M3T
achieves competitive performance against state-of-the-art methods. Moreover,
the proposed tracker is flexible and can be easily plugged into existing
image-based 3D object detectors without re-training. Code and models are
available at https://github.com/kuanchihhuang/MoMA-M3T.
Comment: Accepted by ICCV 2023. Code is available at
https://github.com/kuanchihhuang/MoMA-M3T
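For intuition, the tracklet-to-detection matching step might look like the following sketch, assuming PyTorch and SciPy; the feature dimensions, box parameterization, similarity measure, and Hungarian assignment are assumptions made for illustration, not the released MoMA-M3T code.

    import torch
    import torch.nn as nn
    from scipy.optimize import linear_sum_assignment

    class MotionMatcher(nn.Module):
        def __init__(self, dim=64):
            super().__init__()
            # A small transformer encoder stands in for the motion transformer
            # that models each tracklet's motion history.
            layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
            self.motion_transformer = nn.TransformerEncoder(layer, num_layers=2)
            self.trk_proj = nn.Linear(7, dim)  # 3D box: (x, y, z, w, l, h, yaw)
            self.det_proj = nn.Linear(7, dim)

        def forward(self, tracklet_hist, detections):
            # tracklet_hist: (T, L, 7) motion history per tracklet; detections: (D, 7)
            trk = self.motion_transformer(self.trk_proj(tracklet_hist))[:, -1]  # (T, dim)
            det = self.det_proj(detections)                                     # (D, dim)
            cost = torch.cdist(trk, det)         # pairwise matching cost, (T, D)
            rows, cols = linear_sum_assignment(cost.detach().numpy())
            return list(zip(rows.tolist(), cols.tolist()))  # (tracklet, detection) pairs

    matcher = MotionMatcher()
    matches = matcher(torch.randn(5, 10, 7), torch.randn(6, 7))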