44 research outputs found
Lester: rotoscope animation through video object segmentation and tracking
This article introduces Lester, a novel method to automatically synthesise
retro-style 2D animations from videos. The method approaches the challenge
mainly as an object segmentation and tracking problem. Video frames are
processed with the Segment Anything Model (SAM) and the resulting masks are
tracked through subsequent frames with DeAOT, a method of hierarchical
propagation for semi-supervised video object segmentation. The geometry of the
masks' contours is simplified with the Douglas-Peucker algorithm. Finally,
facial traits, pixelation and a basic shadow effect can be optionally added.
The results show that the method exhibits an excellent temporal consistency and
can correctly process videos with different poses and appearances, dynamic
shots, partial shots and diverse backgrounds. The proposed method provides a
simpler and more deterministic approach than video-to-video translation
pipelines based on diffusion models, which suffer from temporal consistency
problems and do not cope well with pixelated and schematic outputs. The method
is also much more practical than techniques based on 3D human pose estimation,
which require custom handcrafted 3D models and are very limited with respect to
the type of scenes they can process.
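The contour simplification step above uses the standard Douglas-Peucker algorithm. As a minimal illustration of that step (not the authors' implementation; the tolerance value and function names here are hypothetical), the algorithm recursively discards points that lie within a tolerance of the chord between two endpoints:

```python
import math

def perpendicular_distance(pt, start, end):
    """Distance from pt to the infinite line through start and end."""
    (x0, y0), (x1, y1), (x2, y2) = pt, start, end
    dx, dy = x2 - x1, y2 - y1
    if dx == 0 and dy == 0:
        return math.hypot(x0 - x1, y0 - y1)
    return abs(dy * x0 - dx * y0 + x2 * y1 - y2 * x1) / math.hypot(dx, dy)

def douglas_peucker(points, epsilon):
    """Simplify a polyline: keep only points farther than epsilon
    from the chord between the current endpoints."""
    if len(points) < 3:
        return list(points)
    # Find the point farthest from the chord between the endpoints.
    dmax, index = 0.0, 0
    for i in range(1, len(points) - 1):
        d = perpendicular_distance(points[i], points[0], points[-1])
        if d > dmax:
            dmax, index = d, i
    if dmax <= epsilon:
        # Everything between the endpoints is negligible: drop it.
        return [points[0], points[-1]]
    # Otherwise keep the farthest point and recurse on both halves.
    left = douglas_peucker(points[:index + 1], epsilon)
    right = douglas_peucker(points[index:], epsilon)
    return left[:-1] + right
```

A larger epsilon yields coarser, more "retro" contours, which is why the simplification tolerance directly controls the schematic look of the output.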
Convolutional Neural Networks Quantization with Attention
It has been proven that, compared to using 32-bit floating-point numbers in
the training phase, Deep Convolutional Neural Networks (DCNNs) can operate with
low precision during inference, thereby saving memory space and power
consumption. However, quantizing networks is always accompanied by an accuracy
decrease. Here, we propose a method, double-stage Squeeze-and-Threshold
(double-stage ST). It uses the attention mechanism to quantize networks and
achieve state-of-the-art results. Using our method, the 3-bit model can achieve
accuracy that exceeds the accuracy of the full-precision baseline model. The
proposed double-stage ST activation quantization is easy to apply: inserting it
before the convolution.
Comment: Preprint of an article published in International Journal of Neural
Systems, [10.1142/S0129065722500514] © World Scientific Publishing Company,
https://www.worldscientific.com/doi/10.1142/S0129065722500514
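The double-stage ST module itself is not reproduced here; as a hedged sketch of where such an activation quantizer sits, the following shows a plain uniform fake-quantization function inserted before a convolution, with `alpha` standing in for the learned clipping threshold (a fixed placeholder in this example):

```python
import numpy as np

def fake_quantize(x, bits=3, alpha=None):
    """Uniform fake quantization of non-negative (post-ReLU) activations.

    alpha is the clipping threshold; in attention-based schemes such as
    Squeeze-and-Threshold it is produced by the network, here it is a
    fixed placeholder derived from the tensor range.
    """
    if alpha is None:
        alpha = float(np.max(np.abs(x)))  # naive per-tensor range
    levels = 2 ** bits - 1                # e.g. 7 levels above zero for 3 bits
    x_clipped = np.clip(x, 0.0, alpha)    # clip activations to [0, alpha]
    scale = alpha / levels
    q = np.round(x_clipped / scale)       # snap to the integer grid
    return q * scale                      # dequantized ("fake") values
```

Low-bit activations produced this way feed the subsequent convolution, which is the "inserting it before the convolution" placement the abstract describes.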
Video Question Answering on Screencast Tutorials
This paper presents a new video question answering task on screencast
tutorials. We introduce a dataset including question, answer and context
triples from the tutorial videos for a software product. Unlike other video
question answering works, all the answers in our dataset are grounded to the
domain knowledge base. A one-shot recognition algorithm is designed to extract
the
visual cues, which helps enhance the performance of video question answering.
We also propose several baseline neural network architectures based on various
aspects of video contexts from the dataset. The experimental results
demonstrate that our proposed models significantly improve the question
answering performance by incorporating multi-modal contexts and domain
knowledge.
MobileOne: An Improved One millisecond Mobile Backbone
Efficient neural network backbones for mobile devices are often optimized for
metrics such as FLOPs or parameter count. However, these metrics may not
correlate well with latency of the network when deployed on a mobile device.
Therefore, we perform extensive analysis of different metrics by deploying
several mobile-friendly networks on a mobile device. We identify and analyze
architectural and optimization bottlenecks in recent efficient neural networks
and provide ways to mitigate these bottlenecks. To this end, we design an
efficient backbone MobileOne, with variants achieving an inference time under 1
ms on an iPhone12 with 75.9% top-1 accuracy on ImageNet. We show that MobileOne
achieves state-of-the-art performance within the efficient architectures while
being many times faster on mobile. Our best model obtains similar performance
on ImageNet as MobileFormer while being 38x faster. Our model obtains 2.3%
better top-1 accuracy on ImageNet than EfficientNet at similar latency.
Furthermore, we show that our model generalizes to multiple tasks - image
classification, object detection, and semantic segmentation with significant
improvements in latency and accuracy as compared to existing efficient
architectures when deployed on a mobile device. Code and models are available
at https://github.com/apple/ml-mobileone
Comment: Accepted at CVPR 202
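The central observation above is that FLOPs and parameter counts can diverge from measured latency, so on-device timing is what matters. A minimal, generic timing harness for such measurements (an illustrative sketch, not the authors' benchmarking code) might look like:

```python
import time

def median_latency_ms(fn, warmup=10, runs=100):
    """Median wall-clock latency of fn() in milliseconds.

    Warmup iterations are discarded so caches and lazy
    initialization do not skew the measurement; the median is
    reported because it is robust to scheduling outliers.
    """
    for _ in range(warmup):
        fn()
    times = []
    for _ in range(runs):
        t0 = time.perf_counter()
        fn()
        times.append((time.perf_counter() - t0) * 1000.0)
    times.sort()
    return times[len(times) // 2]
```

Comparing models by a statistic like this on the target device, rather than by FLOPs alone, is the kind of analysis the abstract argues for.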
Does the dataset meet your expectations? Explaining sample representation in image data
Since the behavior of a neural network model is adversely affected by a lack of diversity in training data, we present a method that identifies and explains such deficiencies. When a dataset is labeled, we note that annotations alone are capable of providing a human-interpretable summary of sample diversity. This allows explaining any lack of diversity as the mismatch found when comparing the actual distribution of annotations in the dataset with an expected distribution of annotations, specified manually to capture essential label diversity. While, in many practical cases, labeling (samples → annotations) is expensive, its inverse, simulation (annotations → samples), can be cheaper. By mapping the expected distribution of annotations into test samples using parametric simulation, we present a method that explains sample representation using the mismatch in diversity between simulated and collected data. We then apply the method to examine a dataset of geometric shapes to qualitatively and quantitatively explain sample representation in terms of comprehensible aspects such as size, position, and pixel brightness.
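The core comparison the abstract describes, between an observed distribution of annotations and a manually specified expected one, can be sketched as follows (an illustrative implementation; the function name and the use of total variation distance are assumptions, not taken from the paper):

```python
from collections import Counter

def diversity_mismatch(annotations, expected):
    """Compare observed annotation frequencies with an expected distribution.

    annotations: iterable of labels observed in the dataset.
    expected: dict mapping each label to its expected probability.
    Returns the per-label deviation (observed - expected) and the
    total variation distance, a single mismatch score in [0, 1].
    """
    counts = Counter(annotations)
    total = sum(counts.values())
    observed = {label: counts.get(label, 0) / total for label in expected}
    deviation = {label: observed[label] - p for label, p in expected.items()}
    tvd = 0.5 * sum(abs(d) for d in deviation.values())
    return deviation, tvd
```

A nonzero per-label deviation points at which aspect of diversity (e.g. a shape size or position bin) is under- or over-represented, which is the kind of explanation the method aims to produce.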
Semi-Supervised Domain Generalization for Object Detection via Language-Guided Feature Alignment
Existing domain adaptation (DA) and generalization (DG) methods in object
detection enforce feature alignment in the visual space but face challenges
like object appearance variability and scene complexity, which make it
difficult to distinguish between objects and achieve accurate detection. In
this paper, we are the first to address the problem of semi-supervised domain
generalization by exploring vision-language pre-training and enforcing feature
alignment through the language space. We employ a novel Cross-Domain
Descriptive Multi-Scale Learning (CDDMSL) aiming to maximize the agreement
between descriptions of an image presented with different domain-specific
characteristics in the embedding space. CDDMSL significantly outperforms
existing methods, achieving 11.7% and 7.5% improvement in DG and DA settings,
respectively. Comprehensive analysis and ablation studies confirm the
effectiveness of our method, positioning CDDMSL as a promising approach for
domain generalization in object detection tasks.
Comment: Accepted at BMVC 202
TinyHD: Efficient video saliency prediction with heterogeneous decoders using hierarchical maps distillation
Video saliency prediction has recently attracted attention of the research community, as it is an upstream task for several practical applications. However, current solutions are particularly computationally demanding, especially due to the wide usage of spatio-temporal 3D convolutions. We observe that, while different model architectures achieve similar performance on benchmarks, visual variations between predicted saliency maps are still significant. Inspired by this intuition, we propose a lightweight model that employs multiple simple heterogeneous decoders and adopts several practical approaches to improve accuracy while keeping computational costs low, such as hierarchical multi-map knowledge distillation, multi-output saliency prediction, unlabeled auxiliary datasets and channel reduction with teacher assistant supervision. Our approach achieves saliency prediction accuracy on par with or better than state-of-the-art methods on the DHF1K, UCF-Sports and Hollywood2 benchmarks, while significantly enhancing the efficiency of the model.
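The multi-map knowledge distillation mentioned above compares student and teacher saliency maps at several scales. A hedged sketch of such a distillation loss (an assumption about its general shape, not the paper's exact formulation) treats each map as a probability distribution and averages a per-scale KL divergence:

```python
import numpy as np

def kd_map_loss(student_maps, teacher_maps, eps=1e-8):
    """Average per-scale KL divergence between normalized saliency maps.

    student_maps / teacher_maps: lists of 2D arrays, one per scale.
    Each map is normalized to sum to 1 so it can be treated as a
    spatial probability distribution before computing KL(teacher || student).
    """
    losses = []
    for s, t in zip(student_maps, teacher_maps):
        s = s / (s.sum() + eps)   # normalize each map to a distribution
        t = t / (t.sum() + eps)
        losses.append(np.sum(t * np.log((t + eps) / (s + eps))))
    return float(np.mean(losses))
```

Driving this loss to zero at every scale pushes the lightweight student's maps toward the teacher's, which is how distillation lets the small model keep accuracy while cutting compute.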