Indexing Highly Repetitive String Collections
Two decades ago, a breakthrough in indexing string collections made it
possible to represent them within their compressed space while at the same time
offering indexed search functionality. As this new technology permeated
applications such as bioinformatics, string collections experienced growth
that outpaces Moore's Law and challenges our ability to handle them even in
compressed form. Fortunately, it turns out that many of these rapidly
growing string collections are highly repetitive, so that their information
content is orders of magnitude lower than their plain size. The statistical
compression methods used for classical collections, however, are blind to this
repetitiveness, so a new set of techniques has been developed to exploit it
properly. The resulting indexes form a new generation of
data structures able to handle the huge repetitive string collections that we
are facing.
In this survey we cover the algorithmic developments that have led to these
data structures. We describe the distinct compression paradigms that have been
used to exploit repetitiveness, the fundamental algorithmic ideas that form the
base of all the existing indexes, and the various structures that have been
proposed, comparing them in both theoretical and practical terms. We conclude
with the current challenges in this fascinating field.
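
The abstract names several compression paradigms without detailing them. As a
rough illustration (not taken from the survey), the toy Python sketch below
computes a greedy LZ77-style parse, one classic way to exploit repetitiveness:
each phrase is the longest substring that already occurred earlier, plus one
fresh character, so a highly repetitive text factors into very few phrases.
All names here are illustrative assumptions.

    def lz77_parse(s):
        # Greedy LZ77-style factorization: each phrase is the longest
        # substring starting at position i that already occurs in s[:i],
        # extended by one fresh character.
        phrases, i = [], 0
        while i < len(s):
            length = 0
            # grow the match while the candidate still occurs in the prefix
            while i + length < len(s) and s[i:i + length + 1] in s[:i]:
                length += 1
            phrases.append(s[i:i + length + 1])
            i += length + 1
        return phrases

    text = "ACGT" * 500                      # a 2,000-character repetitive text
    print(len(text), len(lz77_parse(text)))  # 2000 13: about a dozen phrases

Because phrase lengths roughly double as the parse proceeds over periodic
text, the phrase count grows only logarithmically with the text length.
Indexes built on such parses (and on related paradigms such as grammar
compression or the run-length BWT, which the survey covers) can search the
collection without ever decompressing it.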
A Unified View of Long-Sequence Models towards Modeling Million-Scale Dependencies
Ever since their conception, Transformers have supplanted traditional
sequence models in many tasks, such as NLP, image classification, and
video/audio processing, thanks to their fast training and superior performance. Much
of the merit is attributable to positional encoding and multi-head attention.
However, Transformers fall short in learning long-range dependencies, mainly
because their time and space complexity scale quadratically with context
length. Consequently, over the past five years, a myriad of methods have been
proposed to make Transformers more efficient. In this work, we first take a
step back, study and compare existing solutions to long-sequence modeling in
terms of their pure mathematical formulation. Specifically, we summarize them
using a unified template, given their shared nature of token mixing. Through
benchmarks, we then demonstrate that long context length does yield better
performance, albeit in an application-dependent manner, and that traditional
Transformer models fall short in taking advantage of long-range dependencies. Next, inspired by
emerging sparse models of huge capacity, we propose a machine learning system
for handling million-scale dependencies. As a proof of concept, we evaluate the
performance of one essential component of this system, namely, the distributed
multi-head attention. We show that our algorithm can scale up attention
computation using four GeForce RTX 4090 GPUs, compared to the vanilla
multi-head attention mechanism. We believe this study is an instrumental step
towards modeling million-scale dependencies.
Comment: 20 pages, 7 figures
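
The abstract does not spell out how the distributed multi-head attention is
implemented. The single-process Python sketch below only illustrates the
property that makes head-level distribution possible: heads attend
independently over their own slices of Q, K, and V, so they can be sharded
across devices and gathered at the end. All names, the round-robin sharding,
and the simulated devices are assumptions for illustration, not the paper's
algorithm.

    import numpy as np

    def softmax(x, axis=-1):
        x = x - x.max(axis=axis, keepdims=True)
        e = np.exp(x)
        return e / e.sum(axis=axis, keepdims=True)

    def attention(q, k, v):
        # Scaled dot-product attention for one head: the n x n score
        # matrix is the source of the quadratic time/space cost.
        scores = q @ k.T / np.sqrt(q.shape[-1])
        return softmax(scores) @ v

    def sharded_mha(q, k, v, n_heads, n_devices):
        # Heads are independent, so they can be partitioned across devices;
        # here each "device" is simulated by a loop iteration, and the final
        # concatenation stands in for an all-gather of the head outputs.
        d_head = q.shape[-1] // n_heads
        out = [None] * n_heads
        for dev in range(n_devices):
            for h in range(dev, n_heads, n_devices):   # round-robin shard
                sl = slice(h * d_head, (h + 1) * d_head)
                out[h] = attention(q[:, sl], k[:, sl], v[:, sl])
        return np.concatenate(out, axis=-1)

    rng = np.random.default_rng(0)
    n, d = 256, 64                       # sequence length, model width
    q, k, v = (rng.standard_normal((n, d)) for _ in range(3))
    print(sharded_mha(q, k, v, n_heads=8, n_devices=4).shape)  # (256, 64)

Since each head's output depends only on its own slice of the inputs, the
only cross-device communication required is the final gather, which is what
makes head-level sharding an attractive way to spread attention computation
across GPUs.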