SBNet: Sparse Blocks Network for Fast Inference
Conventional deep convolutional neural networks (CNNs) apply convolution
operators uniformly in space across all feature maps for hundreds of layers;
this incurs a high computational cost for real-time applications. For many
problems such as object detection and semantic segmentation, we are able to
obtain a low-cost computation mask, either from a priori problem knowledge, or
from a low-resolution segmentation network. We show that such computation masks
can be used to reduce computation in the high-resolution main network. Variants
of sparse activation CNNs have previously been explored on small-scale tasks
and showed no degradation in terms of object classification accuracy, but often
measured gains in terms of theoretical FLOPs without realizing a practical
speed-up when compared to highly optimized dense convolution implementations.
In this work, we leverage the sparsity structure of computation masks and
propose a novel tiling-based sparse convolution algorithm. We verified the
effectiveness of our sparse CNN on LiDAR-based 3D object detection, and we
report significant wall-clock speed-ups compared to dense convolution without
noticeable loss of accuracy. Comment: 10 pages, CVPR 2018
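
The tiling idea is straightforward to prototype. Below is a minimal PyTorch sketch of mask-driven, tile-based sparse convolution in the spirit of the paper's gather/convolve/scatter scheme; the tile size, the halo handling, and all names (e.g., `sparse_block_conv`) are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def sparse_block_conv(x, weight, mask, tile=16):
    """x: (N, C, H, W); weight: (Cout, C, 3, 3); mask: (N, 1, H, W), binary.
    Assumes H and W are divisible by `tile` (an assumption of this sketch)."""
    N, C, H, W = x.shape
    pad = weight.shape[-1] // 2
    xp = F.pad(x, (pad, pad, pad, pad))  # halo so each tile's conv matches dense output
    out = torch.zeros(N, weight.shape[0], H, W, device=x.device, dtype=x.dtype)

    # One flag per tile: a tile is active if any mask pixel inside it is set.
    active = F.max_pool2d(mask.float(), kernel_size=tile, stride=tile)
    idx = active.squeeze(1).nonzero(as_tuple=False)  # (K, 3) rows of (n, ty, tx)
    if idx.numel() == 0:
        return out

    # Gather active tiles (plus halo) into one dense batch, convolve, scatter back.
    patches = [
        xp[n, :, ty * tile: ty * tile + tile + 2 * pad,
                 tx * tile: tx * tile + tile + 2 * pad]
        for n, ty, tx in idx.tolist()
    ]
    res = F.conv2d(torch.stack(patches), weight)  # (K, Cout, tile, tile)
    for k, (n, ty, tx) in enumerate(idx.tolist()):
        out[n, :, ty * tile: ty * tile + tile,
                  tx * tile: tx * tile + tile] = res[k]
    return out
```

The speed-up comes from running one dense convolution over only the K active tiles instead of the whole feature map, while the per-tile halo of `pad` pixels keeps the outputs inside active tiles identical to those of a dense convolution.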
BIM: Block-Wise Self-Supervised Learning with Masked Image Modeling
Like masked language modeling (MLM) in natural language processing, masked
image modeling (MIM) aims to extract valuable insights from image patches to
enhance the feature extraction capabilities of the underlying deep neural
network (DNN). In contrast to other training paradigms such as supervised
learning and unsupervised contrastive learning, MIM pretraining typically
demands significant computational resources to accommodate large training
batches (e.g., 4096). The resulting memory and
computation requirements pose a considerable challenge to its broad adoption.
To mitigate this, we introduce a novel learning framework,
termed Block-Wise Masked Image Modeling (BIM). This framework involves
decomposing the MIM tasks into several sub-tasks with independent computation
patterns, resulting in block-wise back-propagation operations instead of the
traditional end-to-end approach. Our proposed BIM maintains superior
performance compared to conventional MIM while greatly reducing peak memory
consumption. Moreover, BIM naturally enables the concurrent training of
numerous DNN backbones of varying depths. This leads to the creation of
multiple trained DNN backbones, each tailored to different hardware platforms
with distinct computing capabilities. This approach significantly reduces
computational costs in comparison with training each DNN backbone individually.
Our framework offers a promising solution for resource-constrained MIM
training.
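
A minimal sketch of what block-wise back-propagation can look like is given below, assuming each encoder block is paired with its own lightweight decoder head and a local reconstruction loss on masked patches; the class and function names are illustrative assumptions, not the paper's code.

```python
import torch
import torch.nn as nn

class LocalBlock(nn.Module):
    """One encoder block plus a lightweight local decoder head (assumption)."""
    def __init__(self, dim):
        super().__init__()
        self.encoder = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.decoder = nn.Linear(dim, dim)

    def forward(self, x):
        return self.encoder(x)

def bim_step(blocks, optimizers, tokens, target, mask):
    """tokens: (B, L, D) embeddings of a masked image; target: (B, L, D) clean
    patch targets; mask: (B, L) bool, True where a patch was masked out."""
    x = tokens
    for block, opt in zip(blocks, optimizers):
        x = block(x)
        pred = block.decoder(x)                    # local reconstruction
        loss = ((pred - target)[mask] ** 2).mean() # loss on masked patches only
        opt.zero_grad()
        loss.backward()   # gradients stay within this block: block-wise backprop
        opt.step()
        x = x.detach()    # cut the graph so the next block starts fresh
    return x
```

Because each block's computation graph is freed before the next block runs, peak activation memory scales with one block rather than the whole network; and since the features emitted after block k are trained in their own right, they can serve as a depth-k backbone, which is one way the concurrent training of backbones of varying depths can fall out of this scheme.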
LifelongMemory: Leveraging LLMs for Answering Queries in Long-form Egocentric Videos
In this paper we introduce LifelongMemory, a new framework for accessing
long-form egocentric videographic memory through natural language question
answering and retrieval. LifelongMemory generates concise video activity
descriptions of the camera wearer and leverages the zero-shot capabilities of
pretrained large language models to perform reasoning over long-form video
context. Furthermore, LifelongMemory uses a confidence and explanation module
to produce confident, high-quality, and interpretable answers. Our approach
achieves state-of-the-art performance on the EgoSchema benchmark for question
answering and is highly competitive on the natural language query (NLQ)
challenge of Ego4D. Code is available at
https://github.com/Agentic-Learning-AI-Lab/lifelong-memory
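
The pipeline the abstract describes (caption, then reason with a frozen LLM, then filter by confidence) can be sketched as follows; `caption_clip` and `llm_complete` are hypothetical stand-ins for a captioning model and an LLM completion call, not the project's actual interface (see the linked repository for the real code), and the JSON reply format is likewise an assumption of this sketch.

```python
import json

def answer_query(clips, question, caption_clip, llm_complete):
    # Step 1: condense the long video into a short natural-language activity log.
    log = "\n".join(
        f"[{i * 30}s] {caption_clip(clip)}" for i, clip in enumerate(clips)
    )
    # Step 2: ask a pretrained LLM to reason over the textual log (zero-shot).
    prompt = (
        "You are given an activity log of a camera wearer.\n"
        f"Log:\n{log}\n\n"
        f"Question: {question}\n"
        'Reply in JSON: {"answer": ..., "confidence": 0-1, "explanation": ...}'
    )
    reply = json.loads(llm_complete(prompt))  # assumes the LLM returns valid JSON
    # Step 3: suppress low-confidence answers; these could be retried or flagged,
    # mirroring the role of the confidence and explanation module.
    if reply["confidence"] < 0.5:
        reply["answer"] = None
    return reply
```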