Stream Fusion, to Completeness
Stream processing is mainstream (again): Widely-used stream libraries are now
available for virtually all modern OO and functional languages, from Java to C#
to Scala to OCaml to Haskell. Yet expressivity and performance are still
lacking. For instance, the popular, well-optimized Java 8 streams do not
support the zip operator and are still an order of magnitude slower than
hand-written loops. We present the first approach that represents the full
generality of stream processing and eliminates overheads, via the use of
staging. It is based on an unusually rich semantic model of stream interaction.
We support any combination of zipping, nesting (or flat-mapping), sub-ranging,
filtering, and mapping of finite or infinite streams. Our model captures
idiosyncrasies that a programmer uses in optimizing stream pipelines, such as
rate differences and the choice of a "for" vs. a "while" loop. Our approach
delivers hand-written-like code, but automatically. It explicitly avoids the
reliance on black-box optimizers and sufficiently-smart compilers, offering
highest, guaranteed and portable performance. Our approach relies on high-level
concepts that are then readily mapped into an implementation. Accordingly, we
have two distinct implementations: an OCaml stream library, staged via
MetaOCaml, and a Scala library for the JVM, staged via LMS. In both cases, we
derive libraries richer and simultaneously many tens of times faster than past
work. We greatly exceed the performance of the standard stream libraries
available in Java, Scala and OCaml, including the well-optimized Java 8 streams.
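To make the kind of pipeline the abstract describes concrete, here is a minimal sketch in plain Python generators (not the paper's staged OCaml/Scala code) of a declarative zip/flat-map/filter/map pipeline next to the hand-written fused loop that a staging-based library would aim to generate; all names and the example data are illustrative assumptions.

```python
# A minimal, language-agnostic sketch (plain Python, not the paper's staged
# OCaml/Scala code) of the two forms a stream-fusion framework relates:
# a declarative pipeline built from map/filter/flat-map/zip combinators,
# and the single fused loop a staging-based library would generate from it.

def pipeline(xs, ys):
    """Declarative version: each combinator allocates an intermediate generator."""
    squares = (x * x for x in xs)                       # map
    evens = (s for s in squares if s % 2 == 0)          # filter
    expanded = (e + d for e in evens for d in (0, 1))   # flat-map (nesting)
    return sum(a * b for a, b in zip(expanded, ys))     # zip + fold

def fused(xs, ys):
    """Hand-written-style fused loop: no intermediate streams, one pass."""
    total = 0
    it = iter(ys)
    for x in xs:
        s = x * x
        if s % 2 != 0:
            continue
        for d in (0, 1):
            try:
                y = next(it)
            except StopIteration:
                return total
            total += (s + d) * y
    return total

if __name__ == "__main__":
    xs, ys = range(10), range(100)
    assert pipeline(xs, ys) == fused(xs, ys)
    print(pipeline(xs, ys))
```

The declarative version allocates one intermediate generator per combinator; the per-element overhead of those intermediaries is what a staging-based library removes by generating code shaped like the second form.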
Two-Stream Transformer Architecture for Long Video Understanding
Pure vision transformer architectures are highly effective for short video
classification and action recognition tasks. However, due to the quadratic
complexity of self-attention and their lack of inductive bias, transformers are
resource intensive and suffer from data inefficiencies. Long-form video
understanding tasks amplify the data and memory efficiency problems of
transformers, making current approaches infeasible to implement in data- or
memory-restricted domains. This paper introduces an efficient Spatio-Temporal
Attention Network (STAN), which uses a two-stream transformer architecture to
model dependencies
between static image features and temporal contextual features. Our proposed
approach can classify videos up to two minutes in length on a single GPU, is
data efficient, and achieves SOTA performance on several long video
understanding tasks.
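The abstract's core idea (two transformer streams, one summarizing static per-frame appearance and one summarizing temporal context, fused for classification over pre-extracted frame features) can be sketched as follows in PyTorch; the module names, dimensions, mean pooling, and concatenation-based fusion are illustrative assumptions, not the STAN architecture itself.

```python
# Minimal PyTorch sketch of a generic two-stream design for long videos:
# per-frame (static) features are summarized by one transformer encoder,
# temporal context by another, and the two summaries are fused for
# classification. All names, sizes, and the concat fusion are assumptions.
import torch
import torch.nn as nn

class TwoStreamVideoClassifier(nn.Module):
    def __init__(self, feat_dim=768, d_model=256, n_heads=4, n_layers=2, n_classes=10):
        super().__init__()
        self.proj = nn.Linear(feat_dim, d_model)  # project pre-extracted frame features
        make_encoder = lambda: nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True),
            num_layers=n_layers,
        )
        self.static_stream = make_encoder()    # order-agnostic appearance summary
        self.temporal_stream = make_encoder()  # order-aware temporal-context summary
        self.pos = nn.Parameter(torch.zeros(1, 4096, d_model))  # up to 4096 frames
        self.head = nn.Linear(2 * d_model, n_classes)

    def forward(self, frame_feats):            # (batch, n_frames, feat_dim)
        x = self.proj(frame_feats)
        t = frame_feats.shape[1]
        static = self.static_stream(x).mean(dim=1)                          # no position info
        temporal = self.temporal_stream(x + self.pos[:, :t]).mean(dim=1)    # with position info
        return self.head(torch.cat([static, temporal], dim=-1))             # fuse and classify

if __name__ == "__main__":
    model = TwoStreamVideoClassifier()
    clip = torch.randn(2, 300, 768)   # e.g. 2 videos, 300 pre-extracted frame features each
    print(model(clip).shape)          # torch.Size([2, 10])
```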
SODFormer: Streaming Object Detection with Transformer Using Events and Frames
The DAVIS camera, which streams two complementary sensing modalities of
asynchronous events and frames, has increasingly been used to address major
object detection
challenges (e.g., fast motion blur and low-light). However, how to effectively
leverage rich temporal cues and fuse two heterogeneous visual streams remains a
challenging endeavor. To address this challenge, we propose a novel streaming
object detector with Transformer, namely SODFormer, which is the first to integrate
events and frames to continuously detect objects in an asynchronous manner.
Technically, we first build a large-scale multimodal neuromorphic object
detection dataset (i.e., PKU-DAVIS-SOD) with 1080.1k manual labels. Then, we
design a spatiotemporal Transformer architecture that detects objects by solving
an end-to-end sequence prediction problem, where the novel temporal Transformer
module leverages rich temporal cues from two visual streams to improve the
detection performance. Finally, an asynchronous attention-based fusion module
is proposed to integrate the two heterogeneous sensing modalities and exploit
their complementary advantages; it can be queried at any time to locate objects,
breaking through the limited output frequency of synchronized frame-based fusion
strategies. The results show that the proposed SODFormer
outperforms four state-of-the-art methods and our eight baselines by a
significant margin. We also show that our unifying framework works well even in
cases where the conventional frame-based camera fails, e.g., high-speed motion
and low-light conditions. Our dataset and code are available at
https://github.com/dianzl/SODFormer.
Comment: 18 pages, 15 figures, in IEEE Transactions on Pattern Analysis and
Machine Intelligence
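As a rough illustration of the fusion pattern the abstract describes (a set of queries that cross-attends to tokens from both the event stream and the frame stream, so detections can be produced whenever either stream is queried), here is a minimal PyTorch sketch; the module names, additive fusion, and DETR-style query and output heads are assumptions, not the SODFormer implementation.

```python
# Minimal PyTorch sketch of attention-based fusion of two heterogeneous
# streams (e.g. event-based and frame-based features), where a set of
# detection queries cross-attends to both modalities. Module names, shapes,
# and the additive fusion are illustrative assumptions.
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    def __init__(self, d_model=256, n_heads=8, n_queries=100, n_classes=3):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(n_queries, d_model))  # learned object queries
        self.attn_events = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.attn_frames = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cls_head = nn.Linear(d_model, n_classes + 1)  # classes + "no object"
        self.box_head = nn.Linear(d_model, 4)               # (cx, cy, w, h)

    def forward(self, event_tokens, frame_tokens):
        # event_tokens: (B, Ne, d), frame_tokens: (B, Nf, d); either stream may be
        # fresher than the other, so the fusion can be queried at any timestamp.
        b = event_tokens.shape[0]
        q = self.queries.unsqueeze(0).expand(b, -1, -1)
        from_events, _ = self.attn_events(q, event_tokens, event_tokens)
        from_frames, _ = self.attn_frames(q, frame_tokens, frame_tokens)
        fused = from_events + from_frames        # simple additive fusion (assumption)
        return self.cls_head(fused), self.box_head(fused).sigmoid()

if __name__ == "__main__":
    fusion = CrossModalFusion()
    ev = torch.randn(1, 500, 256)    # tokenized asynchronous event features
    fr = torch.randn(1, 196, 256)    # tokenized latest-frame features
    logits, boxes = fusion(ev, fr)
    print(logits.shape, boxes.shape)  # torch.Size([1, 100, 4]) torch.Size([1, 100, 4])
```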
Multimodal Pretraining Unmasked: A Meta-Analysis and a Unified Framework of Vision-and-Language BERTs
Large-scale pretraining and task-specific fine-tuning is now the standard
methodology for many tasks in computer vision and natural language processing.
Recently, a multitude of methods have been proposed for pretraining vision and
language BERTs to tackle challenges at the intersection of these two key areas
of AI. These models can be categorised into either single-stream or dual-stream
encoders. We study the differences between these two categories, and show how
they can be unified under a single theoretical framework. We then conduct
controlled experiments to discern the empirical differences between five V&L
BERTs. Our experiments show that training data and hyperparameters are
responsible for most of the differences between the reported results, but they
also reveal that the embedding layer plays a crucial role in these massive
models.
Comment: To appear in TACL 202
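The single-stream vs. dual-stream distinction the meta-analysis draws can be summarized schematically: a single-stream encoder runs joint self-attention over the concatenated vision and language tokens, while a dual-stream encoder keeps separate towers that exchange information through cross-attention. The sketch below is illustrative only; layer counts, dimensions, and the residual cross-attention wiring are assumptions and do not correspond to any specific V&L BERT.

```python
# Schematic contrast between the two encoder families: single-stream
# (one shared encoder over concatenated vision+language tokens) versus
# dual-stream (two towers exchanging information via cross-attention).
# Dimensions and layer counts are illustrative assumptions.
import torch
import torch.nn as nn

def encoder(d=256, heads=4, layers=2):
    return nn.TransformerEncoder(
        nn.TransformerEncoderLayer(d, heads, batch_first=True), num_layers=layers
    )

class SingleStream(nn.Module):
    def __init__(self, d=256):
        super().__init__()
        self.enc = encoder(d)

    def forward(self, txt, img):                       # (B, Lt, d), (B, Lv, d)
        return self.enc(torch.cat([txt, img], dim=1))  # joint self-attention over both

class DualStream(nn.Module):
    def __init__(self, d=256, heads=4):
        super().__init__()
        self.txt_enc, self.img_enc = encoder(d), encoder(d)
        self.txt2img = nn.MultiheadAttention(d, heads, batch_first=True)
        self.img2txt = nn.MultiheadAttention(d, heads, batch_first=True)

    def forward(self, txt, img):
        t, v = self.txt_enc(txt), self.img_enc(img)    # modality-specific self-attention
        t2, _ = self.txt2img(t, v, v)                  # text queries attend to vision
        v2, _ = self.img2txt(v, t, t)                  # vision queries attend to text
        return t + t2, v + v2

if __name__ == "__main__":
    txt, img = torch.randn(2, 16, 256), torch.randn(2, 36, 256)
    print(SingleStream()(txt, img).shape)              # torch.Size([2, 52, 256])
    t, v = DualStream()(txt, img)
    print(t.shape, v.shape)                            # (2, 16, 256) (2, 36, 256)
```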
Dual-Stream Attention Transformers for Sewer Defect Classification
We propose a dual-stream multi-scale vision transformer (DS-MSHViT)
architecture that processes RGB and optical flow inputs for efficient sewer
defect classification. Unlike existing methods that combine the predictions of
two separate networks trained on each modality, we jointly train a single
network with two branches for RGB and motion. Our key idea is to use
self-attention regularization to harness the complementary strengths of the RGB
and motion streams. The motion stream alone struggles to generate accurate
attention maps, as motion images lack the rich visual features present in RGB
images. To address this, we introduce an attention consistency loss between
the dual streams. By leveraging motion cues through a self-attention
regularizer, we align and enhance RGB attention maps, enabling the network to
concentrate on pertinent input regions. We evaluate our method on a public
dataset and cross-validate its performance on a novel dataset. Our
method outperforms existing models that utilize either convolutional neural
networks (CNNs) or multi-scale hybrid vision transformers (MSHViTs) without
employing attention regularization between the two streams.
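A minimal sketch of an attention-consistency regularizer of the kind the abstract describes is given below; the KL-divergence form, the choice of which stream's attention map serves as the (stop-gradient) target, and the way the term is weighted into the total loss are assumptions, not the paper's exact formulation.

```python
# Minimal sketch of an attention-consistency regularizer between two streams:
# the loss penalizes disagreement between the RGB-branch and motion-branch
# attention maps from matching layers. The KL form and the stop-gradient
# target choice are illustrative assumptions.
import torch
import torch.nn.functional as F

def attention_consistency_loss(attn_rgb, attn_motion, eps=1e-8):
    """attn_*: (B, heads, N, N) attention maps from matching layers of each branch."""
    p = attn_rgb.detach() + eps    # treat one branch's attention as the (stop-grad) target
    q = attn_motion + eps
    p = p / p.sum(dim=-1, keepdim=True)   # renormalize rows to valid distributions
    q = q / q.sum(dim=-1, keepdim=True)
    return F.kl_div(q.log(), p, reduction="batchmean")

if __name__ == "__main__":
    a_rgb = torch.softmax(torch.randn(2, 4, 49, 49), dim=-1)
    a_mot = torch.softmax(torch.randn(2, 4, 49, 49), dim=-1)
    consistency = attention_consistency_loss(a_rgb, a_mot)
    # total = classification_loss + lambda_consistency * consistency  (lambda is a hyperparameter)
    print(consistency)
```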