44 research outputs found

    Lester: rotoscope animation through video object segmentation and tracking

    This article introduces Lester, a novel method to automatically synthesize retro-style 2D animations from videos. The method approaches the challenge mainly as an object segmentation and tracking problem. Video frames are processed with the Segment Anything Model (SAM) and the resulting masks are tracked through subsequent frames with DeAOT, a method of hierarchical propagation for semi-supervised video object segmentation. The geometry of the masks' contours is simplified with the Douglas-Peucker algorithm. Finally, facial traits, pixelation and a basic shadow effect can be optionally added. The results show that the method exhibits excellent temporal consistency and can correctly process videos with different poses and appearances, dynamic shots, partial shots and diverse backgrounds. The proposed method provides a simpler and more deterministic approach than video-to-video translation pipelines based on diffusion models, which suffer from temporal consistency problems and do not cope well with pixelated and schematic outputs. The method is also much more practical than techniques based on 3D human pose estimation, which require custom handcrafted 3D models and are very limited with respect to the type of scenes they can process.
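The contour simplification step named in this abstract can be illustrated with a minimal sketch of the Douglas-Peucker algorithm. This is not Lester's code: the function names, the open-polyline input format and the `epsilon` tolerance are assumptions made for illustration, and a closed mask contour would first need to be split into open polylines.

```python
import math


def perpendicular_distance(point, start, end):
    """Distance from `point` to the infinite line through `start` and `end`."""
    (x, y), (x1, y1), (x2, y2) = point, start, end
    dx, dy = x2 - x1, y2 - y1
    length = math.hypot(dx, dy)
    if length == 0.0:
        # Degenerate case: start and end coincide.
        return math.hypot(x - x1, y - y1)
    return abs(dy * (x - x1) - dx * (y - y1)) / length


def douglas_peucker(points, epsilon):
    """Simplify an open polyline, keeping points farther than `epsilon`
    from the chord joining the current endpoints."""
    if len(points) < 3:
        return list(points)
    start, end = points[0], points[-1]
    # Find the interior point farthest from the chord start-end.
    max_dist, index = 0.0, 0
    for i in range(1, len(points) - 1):
        dist = perpendicular_distance(points[i], start, end)
        if dist > max_dist:
            max_dist, index = dist, i
    if max_dist <= epsilon:
        # Every interior point is close enough: keep only the endpoints.
        return [start, end]
    # Otherwise keep the farthest point and recurse on both halves.
    left = douglas_peucker(points[: index + 1], epsilon)
    right = douglas_peucker(points[index:], epsilon)
    return left[:-1] + right  # drop the duplicated split point


nearly_flat = [(0, 0), (1, 0.05), (2, 0), (3, 0.05), (4, 0)]
with_peak = [(0, 0), (2, 3), (4, 0)]
simplified_flat = douglas_peucker(nearly_flat, 0.1)
simplified_peak = douglas_peucker(with_peak, 1.0)
```

A larger `epsilon` removes more vertices and gives a more schematic outline; a smaller one stays closer to the original contour.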

    Convolutional Neural Networks Quantization with Attention

    It has been proven that, compared to using 32-bit floating-point numbers in the training phase, Deep Convolutional Neural Networks (DCNNs) can operate with low precision during inference, thereby saving memory space and power consumption. However, quantizing networks is always accompanied by an accuracy decrease. Here, we propose a method, double-stage Squeeze-and-Threshold (double-stage ST). It uses the attention mechanism to quantize networks and achieves state-of-the-art results. Using our method, the 3-bit model can achieve accuracy that exceeds the accuracy of the full-precision baseline model. The proposed double-stage ST activation quantization is easy to apply: it is inserted before the convolution. Comment: Preprint of an article published in International Journal of Neural Systems, DOI 10.1142/S0129065722500514, © World Scientific Publishing Company, https://www.worldscientific.com/doi/10.1142/S0129065722500514

    Video Question Answering on Screencast Tutorials

    This paper presents a new video question answering task on screencast tutorials. We introduce a dataset including question, answer and context triples from the tutorial videos for a software product. Unlike other video question answering works, all the answers in our dataset are grounded to the domain knowledge base. A one-shot recognition algorithm is designed to extract the visual cues, which helps enhance the performance of video question answering. We also propose several baseline neural network architectures based on various aspects of video contexts from the dataset. The experimental results demonstrate that our proposed models significantly improve question answering performance by incorporating multi-modal contexts and domain knowledge.

    MobileOne: An Improved One millisecond Mobile Backbone

    Efficient neural network backbones for mobile devices are often optimized for metrics such as FLOPs or parameter count. However, these metrics may not correlate well with the latency of the network when deployed on a mobile device. Therefore, we perform extensive analysis of different metrics by deploying several mobile-friendly networks on a mobile device. We identify and analyze architectural and optimization bottlenecks in recent efficient neural networks and provide ways to mitigate these bottlenecks. To this end, we design an efficient backbone, MobileOne, with variants achieving an inference time under 1 ms on an iPhone12 with 75.9% top-1 accuracy on ImageNet. We show that MobileOne achieves state-of-the-art performance among efficient architectures while being many times faster on mobile. Our best model obtains similar performance on ImageNet as MobileFormer while being 38x faster. Our model obtains 2.3% better top-1 accuracy on ImageNet than EfficientNet at similar latency. Furthermore, we show that our model generalizes to multiple tasks (image classification, object detection, and semantic segmentation) with significant improvements in latency and accuracy compared to existing efficient architectures when deployed on a mobile device. Code and models are available at https://github.com/apple/ml-mobileone. Comment: Accepted at CVPR 202

    Does the dataset meet your expectations? Explaining sample representation in image data

    Since the behavior of a neural network model is adversely affected by a lack of diversity in training data, we present a method that identifies and explains such deficiencies. When a dataset is labeled, we note that annotations alone are capable of providing a human-interpretable summary of sample diversity. This allows explaining any lack of diversity as the mismatch found when comparing the actual distribution of annotations in the dataset with an expected distribution of annotations, specified manually to capture essential label diversity. While, in many practical cases, labeling (samples → annotations) is expensive, its inverse, simulation (annotations → samples), can be cheaper. By mapping the expected distribution of annotations into test samples using parametric simulation, we present a method that explains sample representation using the mismatch in diversity between simulated and collected data. We then apply the method to examine a dataset of geometric shapes to qualitatively and quantitatively explain sample representation in terms of comprehensible aspects such as size, position, and pixel brightness.
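The idea of comparing the actual distribution of annotations with a manually specified expected one can be sketched as below. This is only an illustration of the labeled-dataset case and not the paper's method: the function name `annotation_mismatch`, the categorical labels and the simple per-label difference are assumptions, and the simulation-based part of the abstract is not covered.

```python
from collections import Counter


def annotation_mismatch(annotations, expected):
    """Return, per label, the observed share minus the expected share.

    `annotations` is a list of categorical labels found in the dataset;
    `expected` maps each label to its hand-specified target proportion.
    Positive values mean over-representation, negative values
    under-representation. (Hypothetical helper for illustration.)
    """
    if not annotations:
        raise ValueError("annotations must not be empty")
    counts = Counter(annotations)
    total = len(annotations)
    labels = sorted(set(counts) | set(expected))
    return {
        label: counts.get(label, 0) / total - expected.get(label, 0.0)
        for label in labels
    }


# Toy dataset: shape sizes, with an expected 50/50 split.
observed = ["small"] * 8 + ["large"] * 2
target = {"small": 0.5, "large": 0.5}
mismatch = annotation_mismatch(observed, target)
```

In this toy case the "small" label is over-represented by about 0.3 and "large" under-represented by the same amount, which is the kind of readable summary the abstract describes.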

    Semi-Supervised Domain Generalization for Object Detection via Language-Guided Feature Alignment

    Existing domain adaptation (DA) and generalization (DG) methods in object detection enforce feature alignment in the visual space but face challenges like object appearance variability and scene complexity, which make it difficult to distinguish between objects and achieve accurate detection. In this paper, we are the first to address the problem of semi-supervised domain generalization by exploring vision-language pre-training and enforcing feature alignment through the language space. We employ a novel Cross-Domain Descriptive Multi-Scale Learning (CDDMSL) aiming to maximize the agreement between descriptions of an image presented with different domain-specific characteristics in the embedding space. CDDMSL significantly outperforms existing methods, achieving 11.7% and 7.5% improvement in DG and DA settings, respectively. Comprehensive analysis and ablation studies confirm the effectiveness of our method, positioning CDDMSL as a promising approach for domain generalization in object detection tasks. Comment: Accepted at BMVC 202

    TinyHD: Efficient video saliency prediction with heterogeneous decoders using hierarchical maps distillation

    Video saliency prediction has recently attracted the attention of the research community, as it is an upstream task for several practical applications. However, current solutions are particularly computationally demanding, especially due to the wide usage of spatio-temporal 3D convolutions. We observe that, while different model architectures achieve similar performance on benchmarks, visual variations between predicted saliency maps are still significant. Inspired by this intuition, we propose a lightweight model that employs multiple simple heterogeneous decoders and adopts several practical approaches to improve accuracy while keeping computational costs low, such as hierarchical multi-map knowledge distillation, multi-output saliency prediction, unlabeled auxiliary datasets and channel reduction with teacher assistant supervision. Our approach achieves saliency prediction accuracy on par with or better than state-of-the-art methods on the DHF1K, UCF-Sports and Hollywood2 benchmarks, while significantly enhancing the efficiency of the model.
