Abstract. The effective use of parallel computing resources to speed up algorithms in current multi-core parallel architectures remains a difficult challenge, with ease of programming playing a key role in the eventual success of various parallel architectures. In this paper we consider an alternative view of parallelism in the form of an ultra-wide word processor. We introduce the Ultra-Wide Word architecture and model, an extension of the word-ram model that allows for constant time operations on thousands of bits in parallel. Word parallelism as exploited by the word-ram model does not suffer from the more difficult aspects of parallel programming, namely synchronization and concurrency. For the standard word-ram algorithms, the speedups obtained are moderate, as they are limited by the word size. We argue that a large class of word-ram algorithms can be implemented in the Ultra-Wide Word model, obtaining speedups comparable to multi-threaded computations while keeping the simplicity of programming of the sequential ram model. We show that this is the case by describing implementations of Ultra-Wide Word algorithms for dynamic programming and string searching. In addition, we show that the Ultra-Wide Word model can be used to implement a nonstandard memory architecture, which enables the sidestepping of lower bounds of important data structure problems such as priority queues and dynamic prefix sums. While similar ideas about operating on large words have been mentioned before in the context of multimedia processors [37] , it is only recently that an architecture like the one we propose has become feasible and that details can be worked out.
Introduction
In the last few years, multi-core architectures have become the dominant commercial hardware platform. The potential of these architectures to improve performance through parallelism remains to be fully attained, as effectively using all cores on a single application has proven to be a difficult challenge. In this paper we introduce the Ultra-Wide Word architecture and model of computation, an alternate view of parallelism for a modern architecture in the form of an ultra-wide word processor. This can be implemented by replacing one or more cores of a multi-core chip with a very wide word Arithmetic Logic Unit (alu) that can perform operations on a very large number of bits in parallel.
The idea of executing operations on a large number of bits simultaneously has been successfully exploited in different forms. In Very Long Instruction Word (VLIW) architectures [17] , several instructions can be encoded in one wide word and executed in one single parallel instruction. Vector processors allow the execution of one instruction on multiple elements simultaneously, implementing Single-Instruction-Multiple-Data (SIMD) parallelism. This form of parallelism led to the design of supercomputers such as the Cray architecture family [36] and is now present in Graphics Processing Units (GPUs) as well as in Streaming SIMD Extensions (SSE) to scalar processors.
In 2003, Thorup [37] observed that certain instructions present in some SSE implementations were particularly useful for operating on large integers and speeding up algorithms for combinatorial problems. To a certain extent, some of the ideas in the Ultra-Wide Word architecture are presaged in the paper by Thorup, which was written in the context of multimedia processors. Our architecture was developed independently and differs in several aspects (see discussion in Section 2.3), but it is motivated by similar considerations.
As CPU hardware advances, so does the model used in theory to analyze it. The increase in word size was reflected in the word-ram model in which algorithm performance is given as a function of the input size n and the word size w, with the common assumption that w = Θ(log n). In its simplest version, the word-ram model allows the same operations as the traditional ram model. Algorithms in this model take advantage of bit-level parallelism through packing various elements in one word and operating on them simultaneously. Although similar to vector processing, the word-ram provides more flexibility in that the layout of data in a word depends on the algorithm and data elements can be packed in an arbitrary way. Unlike VLIW architectures, the Ultra-Wide Word model we propose is not concerned with the compiler identifying operations which can be done in parallel but rather with achieving large speedups in implementations of word-ram algorithms through operations on thousands of bits in parallel.
As multi-core chip designs evolve, chip vendors try to determine the best way to use the available area on the chip, and the options traditionally are an increased number of cores or larger caches. We believe that the current stage in processor design allows for the inclusion of an architecture such as the one we propose. In addition, ease of programming is a major hurdle to the eventual success of parallel and multi-core architectures. In contrast, bit parallelism as exploited by the word-ram model does not suffer from this drawback: there is a large selection of word-ram algorithms (see, e.g., [2, 26, 24, 12] ) that readily benefit from bit parallelism without having to deal with the more difficult aspects of concurrency such as mutual exclusion, synchronization, and resource contention. In this sense, the advantage of an on-chip ultra-wide word architecture is that it can enable word-ram algorithms to achieve speedups comparable to those of multi-threaded computations, while at the same time keeping the simplicity of sequential programming that is inherent to the ram model. We argue that this is the case by showing several examples of implementations of word-ram algorithms using the wide word, usually with simple modifications to existing algorithms, and extending the ideas and techniques from the word-ram model.
In terms of the actual architecture, we envision the ultra-wide alu together with multi-cores on the same chip. Thus, the Ultra-Wide Word architecture adds to the computing power of current architectures. The results we present in this paper, however, do not use multi-core parallelism.
Summary of Results
We introduce the Ultra-Wide Word architecture and model, which extends the $w$-bit word-ram model by adding an alu that operates on $w^2$-bit words. We show that several broad classes of algorithms can be implemented in this model. In particular:
- We describe Ultra-Wide Word implementations of dynamic programming algorithms for the subset sum problem, the knapsack problem, the longest common subsequence problem, as well as many generalizations of these problems. Each of these algorithms illustrates a different technique (or combination of techniques) for translating an implementation of an algorithm in the word-ram model to the Ultra-Wide Word model. In all these cases we obtain a $w$-fold speedup over word-ram algorithms.
- We also describe Ultra-Wide Word implementations of popular string searching algorithms: the Shift-And/Shift-Or algorithms [4, 40] and the Boyer-Moore-Horspool algorithm [28]. Again, we obtain a $w$-fold speedup over the original algorithms.
- Finally, we show that the Ultra-Wide Word model is powerful enough to simulate a non-standard memory architecture in which bytes can overlap, which we shall call fs-ram [18]. This allows us to implement data structures and algorithms that circumvent known lower bounds for the word-ram model.
The rest of this paper is organized as follows. In Section 2 we describe the Ultra-Wide architecture and model of computation. We show in Section 3 how to simulate the fs-ram memory architecture. In Sections 4 and 5 we present uw-ram implementations of algorithms for dynamic programming and string searching. We present concluding remarks in Section 6.
The Ultra-Wide Word-RAM Model
The Ultra-Wide word-ram model (uw-ram) we propose is an extension of the word-ram model. We briefly review here the key features of the word-ram.
Algorithms in the word-RAM model
The word-ram is a variant of the ram model in which a word has length $w$ bits, and the contents of memory are integers in the range $\{0, \ldots, 2^w - 1\}$ [24]. This implies that $w \ge \log n$, where $n$ is the size of the input, and a common assumption is $w = \Theta(\log n)$ (see, e.g., [32, 8]). The word-ram includes the usual load, store, and jump instructions of the ram model, allowing for immediate operands and for direct and indirect addressing. In this model, arithmetic operations on two words are modulo $2^w$, and the instruction set includes left and right shift operations (equal to multiplication and division by powers of two) and boolean operations. All instructions take constant time to execute. There are different versions of the word-ram model depending on the instruction set assumed to be available. The restricted model is limited to addition, subtraction, left and right shifts, and boolean operations AND, OR, and NOT. These instructions augmented with multiplication constitute the multiplication model. Finally, the AC$^0$ model assumes that all functions computable by an unbounded fan-in circuit of polynomial size (in $w$) and constant depth are available in the instruction set and execute in constant time. This definition includes all instructions from the restricted model and excludes multiplication. We refer the reader to the survey by Hagerup [24] for a more extended description of the model and a discussion of its practicality.
Word-ram algorithms exploit word-level parallelism by operating on various elements simultaneously using instructions on $w$-bit words. There are various algorithms for fundamental problems that take advantage of word-level parallelism or a bounded universe, some of which fit into the word-ram model although they are not explicitly designed for it [3]. Much attention has been given to sorting and searching, for which known lower bounds in the comparison model do not carry over to the word-ram model [20]. For example, in a word-ram model with multiplication, sorting $n$ words can be done in $O(n \log\log n)$ time and $O(n)$ space deterministically [26], and in expected $O(n\sqrt{\log\log n})$ time and $O(n)$ space using randomization [27]. Word-ram techniques have also been applied in many different areas, such as succinct data structures [29, 32], computational geometry [12, 13], and text indexing [22].
Ultra-Wide RAM
The Ultra-Wide word-ram model (uw-ram) extends the word-ram model by introducing an ultra-wide alu with $w^2$-bit wide words, where $w$ is the number of bits in a word-ram word. The ultra-wide alu supports the basic operations available in a word-ram on the entire word at once. As in the word-ram model, the available set of instructions can be assumed to be those of the restricted, multiplication, or AC$^0$ models. For the results in this paper we assume the instructions of the restricted model (addition, subtraction, left and right shift, and bitwise boolean operations), plus two non-standard straightforward AC$^0$ operations that we describe at the end of this subsection.
The model maintains the standard $w$-bit alu as well as $w$-bit memory addressing. In general, we use the parameter $w$ for the word size in the description and analysis of algorithms, although in some cases we explicitly assume $w = \Theta(\log n)$. In terms of real world parameters, the wide word in the ultra-wide alu would presently have between 1,000 and 10,000 bits and could increase even further in the future. In reality, the addition of an alu that supports operations on thousands of bits would require appropriate adjustments to the data and instruction caches of a processor as well as to the instruction pipeline implementation. Similarly to the abstractions made by the ram and word-ram models, the uw-ram model ignores the effects of these and other architectural features and assumes that the execution of instructions on ultra-wide words is as efficient as the execution of operations on regular $w$-bit words, up to constant factors.
Provided that the uw-ram supports the same operations as the word-ram, the techniques to achieve bit-level parallelism in the word-ram extend directly to the uw-ram. However, since the word-ram assumes that a word can be read from memory in constant time, many operations in word-ram algorithms can be implemented through constant time table lookups. For example, counting the number of set bits in a word of $w = \log n$ bits can be implemented through two table lookups to a precomputed table that stores the number of set bits for each number of $(\log n)/2$ bits. The space used by the table is $\sqrt{n}$ words. We cannot expect to achieve the same constant time lookup operation with words of $w^2$ bits since the size of the lookup tables would be prohibitive. However, the memory access operations of our model allow for the implementation of simultaneous table lookups of several $w$-bit words within a wide word, as we shall explain below.
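The bit-counting example above can be sketched as follows. This is only an illustration under stated assumptions: the helper names build_popcount_table and popcount and the toy word size of 16 bits are ours, not part of the model.

```python
def build_popcount_table(half_bits):
    """Precompute table[x] = number of set bits of x for every half_bits-bit x."""
    table = [0] * (1 << half_bits)
    for x in range(1, 1 << half_bits):
        # x has the bits of x >> 1 plus its own lowest bit.
        table[x] = table[x >> 1] + (x & 1)
    return table


def popcount(x, table, half_bits):
    """Count the set bits of a (2 * half_bits)-bit word with two table lookups."""
    mask = (1 << half_bits) - 1
    low = x & mask                   # lower half of the word
    high = (x >> half_bits) & mask   # upper half of the word
    return table[low] + table[high]


HALF = 8                             # assumed toy value: w = 16, so w/2 = 8
TABLE = build_popcount_table(HALF)   # 2^(w/2) = sqrt(2^w) entries
```

With $w = 16$ the table has $2^8 = 256$ entries, matching the $\sqrt{n}$ bound for $w = \log n$. The same table would need $2^{w^2/2}$ entries for a wide word, which is why the lookups have to be done per block instead.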
We first introduce some notation. Let $W$ denote a $w^2$-bit word. Let $W[i]$ denote the $i$-th bit of $W$, and let $W[i..j]$ denote the contiguous subword of $W$ from bit $i$ to bit $j$, inclusive. The least significant bit of $W$ is $W[0]$, and thus $W = \sum_{i=0}^{w^2-1} 2^i \, W[i]$.
For the sake of memory access operations, we divide $W$ into $w$-bit blocks. Let $W_j$ denote the $j$-th contiguous block of $w$ bits in $W$, for $0 \le j \le w - 1$, and let $W_j[i]$ denote the $i$-th bit within $W_j$. Thus, $W_j[i] = W[jw + i]$.
The division of a wide word into blocks is solely intended for certain memory access operations, but basic operations of the model have no notion of block boundaries. In the description of operations with wide words we generally refer to variables with uppercase letters, whereas we use lowercase to refer to regular variables that use one $w$-bit word. Fig. 1 shows a representation of a wide word, depicting bits with increasing significance from left to right. Thus, in this representation, shifts to the left (right) by $i$ are equivalent to division (multiplication) by $2^i$. In addition, we use 0 to denote a wide word with value 0. We use standard C-like notation for the operations and ('&'), or ('|'), not ('∼') and shifts ('<<', '>>').
Memory Access Operations In this architecture w (not necessarily contiguous) words from memory can be transferred into the w blocks of a wide word W in constant time. These blocks can be written to memory in parallel as well. As with PRAM algorithms, the memory access type of the model can be assumed to allow or disallow concurrent reads and writes. For the results in this paper we assume the Concurrent-Read-Exclusive-Write (CREW) model. The memory access operations that involve wide words are of three types: block, word, and content. We describe read accesses (write accesses are analogous). A block access loads a single w-bit word from memory into a given block of a wide word. A word access loads w contiguous w-bit words from memory into an entire wide word in constant time. Finally, a content access uses the contents of a wide word W as addresses to load (possibly non-contiguous) words of memory simultaneously: for each block j within W , this operation loads from memory the w-bit word whose address is W j (plus possibly a base address). The specifics of read and write operations are shown in Table 1 .
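The three kinds of read access described above can be mimicked in software as in the sketch below. It is an illustration only: the function names, the toy word size W_BITS = 8, and the representation of a wide word as a Python integer are assumptions of ours, and the loops stand in for what the model performs in a single constant-time step.

```python
W_BITS = 8                    # assumed toy word size w; a wide word has w blocks
MASK = (1 << W_BITS) - 1      # mask selecting one w-bit block


def get_block(W, j):
    """Return block W_j of the wide word W."""
    return (W >> (j * W_BITS)) & MASK


def set_block(W, j, value):
    """Return a copy of W with block W_j replaced by value."""
    shift = j * W_BITS
    return (W & ~(MASK << shift)) | ((value & MASK) << shift)


def read_block(W, j, mem, addr):
    """Block access: load the single word mem[addr] into block j of W."""
    return set_block(W, j, mem[addr])


def read_word(mem, base):
    """Word access: load w contiguous words mem[base .. base + w - 1]."""
    W = 0
    for j in range(W_BITS):
        W = set_block(W, j, mem[base + j])
    return W


def read_content(W, mem, base=0):
    """Content access: block j receives mem[base + W_j] (a parallel gather)."""
    out = 0
    for j in range(W_BITS):
        out = set_block(out, j, mem[base + get_block(W, j)])
    return out
```

The content access is what makes simultaneous table lookups possible: if each block of W holds an index into a table stored at address base, one call performs w lookups at once.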
Note that accessing several (possibly non-contiguous) words from memory simultaneously is an assumption that is already made by any shared memory multiprocessing model. While, in reality, simultaneous access to all addresses in actual physical memory (e.g., DRAM) might not be possible, in shared memory systems, such as multi-core processors, the slowdown is mitigated by truly parallel access to private and shared caches, and thus the assumption is reasonable. We therefore follow this assumption in the same spirit.
In fact, for $w$ equal to the regular word size (32 or 64 bits), the choice of $w$ blocks of $w$ bits each for the wide word alu was judiciously made to provide the model with a feasible memory access implementation. Having $w^2$ lines to memory is well within the realm of the possible, as this is of the same order of magnitude (a factor of 2 or 8) as the bus widths of modern GPUs, some of which feature bus widths of 512 bits (e.g., FirePro W9100 [1] or Nvidia GeForce GTX 285 [21], see also [38, 39]). We note that a more general model could feature a wide word with $k$ blocks of $w$ bits each, where $k$ is a parameter, which can be adjusted in reality according to the feasibility of implementation of parallel memory accesses. Although described for $w$ blocks, the algorithms presented in this paper can easily be adapted to work with $k$ blocks instead. Naturally, the speedups obtained would depend on the number of blocks assumed, but also on the memory bandwidth of the architecture. A practical implementation with a large number of blocks would likely suffer slowdowns due to congestion in the memory bus. We believe that an implementation with $k$ equal to 32 or 64 can be realized with truly parallel memory access, leading to significant speedups.
Table 1. Wide word memory access operations of the uw-ram. mem denotes regular ram memory, which is indexed by addresses to words, and base is some base address.
Fig. 2. The compress operation takes a wide word $W$ whose set bits are restricted to the first bit of each block and compresses them to the first block of a wide word.
UW-RAM Subroutines
We now describe some operations that will be used throughout the uw-ram implementations that we describe in later sections. A procedure called compress serves to bring together bits from all blocks into one block in constant time, while a procedure called spread is the inverse function. Both operations can be implemented by straightforward constant-depth circuits. We will also use parallel comparators, a standard technique used in word-ram algorithms [24] (see details in Appendix A). Although these are all the subroutines that we need for the results in this paper, other operations of similar complexity could be defined if they prove useful.
- Compress: Let $W$ be a wide word in which all bits are zero except possibly for the first bit of each block. The compress operation copies the first bit of each block of $W$ to the first block of a word $X$. I.e., if $X = \mathrm{compress}(W)$, then $X[j] = W_j[0]$ for $0 \le j < w$, and $X[j] = 0$ for $j \ge w$ (see Fig. 2).
- Spread: This operation is the inverse of the compress operation. It takes a word $W$ whose set bits are all in the first block and spreads them across blocks of a word $X$ so that $X_j[0] = W[j]$ for $0 \le j < w$.
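A software sketch of the two subroutines follows, assuming a toy word size W_BITS = 8 and wide words represented as Python integers with bit 0 as the least significant bit. The loops only emulate what the model assumes to be a constant-depth circuit.

```python
W_BITS = 8  # assumed toy word size w; a wide word has w * w bits


def compress(W):
    """Gather bit 0 of every block of W into the first block: X[j] = W_j[0]."""
    X = 0
    for j in range(W_BITS):
        first_bit = (W >> (j * W_BITS)) & 1   # W_j[0] is bit j*w of W
        X |= first_bit << j
    return X


def spread(W):
    """Inverse of compress: X_j[0] = W[j] for the w bits of the first block."""
    X = 0
    for j in range(W_BITS):
        bit = (W >> j) & 1                    # W[j]
        X |= bit << (j * W_BITS)              # place it at bit 0 of block j
    return X
```

For instance, spread(0b101) sets bit 0 of blocks 0 and 2, and compress maps that wide word back to 0b101.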
Relation to Other Models
There exist various models and architectures that exploit the execution of instructions on a large number of bits simultaneously. In Very Long Instruction Word (VLIW) architectures [17], several, possibly different, instructions can be encoded in one wide word and executed in parallel. It is usually the compiler's job to determine which instructions of a program can be executed safely in parallel. In contrast, in the uw-ram model it is up to the algorithm designer to specify how parallelism in the ultra-wide word should be used. In addition, the wide word can only execute one type of instruction at a time. In this sense, the uw-ram is closer to a vector processor, in which a single instruction is executed on various data items, implementing SIMD parallelism. However, while vector processors operate on fields which are independent of each other, the ultra-wide alu in the uw-ram is really one wide word of thousands of bits that treats its contents as one data object. An exception to this are the memory access instructions, which load and store data in blocks within the wide word so that the wide word alu can interact with regular $w$-bit data. It is of course possible to use the ultra-wide word to implement a vectorized operation; however, as instructions in the uw-ram operate on the entire word, it is up to the algorithm designer to deal with carries and other interference between fields. Moreover, the length of a field in the uw-ram is variable, as it depends on the algorithm's choice. In that sense, the uw-ram is a more flexible model.
Many modern processors support some form of SIMD parallelism with vectors of a small number of fields (e.g., Intel's SSE). Depending on the architecture, some of the available operations include inter-field instructions such as shuffle (which permutes fields in a vector), pack and unpack (equivalent to our compress and spread operations), inter-field shifts, or global sum (which sums all fields in the vector). The power of multimedia processors was studied by Thorup [37], who modeled these processors as vectors of $k$ fields of $\ell$ bits each. Thorup showed that standard global operations on $(k \times \ell)$-bit words can be implemented using vector instructions and inter-field operations in constant time, and argued that this enables the implementation of fundamental combinatorial algorithms such as sorting, hashing, and algorithms for minimum spanning trees on $(k \times \ell)$-bit integers.
In contrast to Thorup's work, our main interest is in using the ultra wide word to deal with inputs of regular w-bit data objects and to speed up algorithms by being able to operate on more of these objects simultaneously. Moreover, we assume that the wide-word alu supports the standard operations on the full word from the outset, with no need to simulate them using vector operations. Finally, we explore the consequences of indirect memory addressing at the field level, a feature that is not mentioned in Thorup's model.
The uw-ram model can also be related to Multiple-Instruction-Multiple-Data (MIMD) models, and in particular to the PRAM. Although the uw-ram alu can only execute one instruction on the wide word, it is conceivable to devise a simulation of a PRAM algorithm on the uw-ram. Each block of the wide word in the uw-ram acts like a PRAM processor. Since the uw-ram can only execute one type of instruction at a time, each parallel step of the PRAM algorithm is executed in s/w steps on the uw-ram, where s is the number of different instructions involved in the PRAM algorithm. For a constant number of different PRAM instructions and a non-constant number of uw-ram blocks $w$, this simulation results in a constant overhead in time (compared to the PRAM algorithm running on $\Theta(w)$ processors). However, if such a simulation were to be done in any practical implementation of these two models, the actual slowdown would be significant and most instructions would execute serially (as the number of different PRAM instructions is of the same order of magnitude as $w$). On the other hand, any uw-ram algorithm that runs in time $t + q$, where $q$ is the number of compress operations and $t$ is the number of steps involved in the rest of the operations, can be simulated in time $O(t + q \log w)$ on a PRAM with $w$ processors, as $\log w$ steps are necessary to simulate a compress operation.
Fig. 3. Yggdrasil memory layout [10]: each node in a complete binary tree is an fs-ram bit and registers are defined as paths from a leaf to the root. For example, register 3 contains bits $B_{11}$, $B_5$, $B_2$, and $B_1$ (shaded nodes).
Although simulations between the uw-ram and other models exist, the idea of introducing the uw-ram is to achieve larger speedups with word-ram algorithms, keeping the programming techniques of this model. In practice, the implementations of PRAM algorithms are usually on asynchronous multi-cores, in which programmers must deal with concurrency issues. The advantage of our model is that we can avoid these issues while obtaining similar speedups to those of multi-cores.
Simulation of FS-RAM
In the standard ram model of computation, memory is organized in registers or words, each word containing a set of bits. Any bit in a word belongs to that word only. In contrast, in the fs-ram model [18] (also known as Random Access Machine with Byte Overlap, or rambo) words can overlap, that is, a single bit of memory can belong to several words. The topology of the memory, i.e., a specification of which bits are contained in which words, defines a particular variant of the fs-ram model. Variants of this model have been used to sidestep lower bounds for important data structure problems [10, 11].
We show how the uw-ram can be used to implement memory access operations for any given fs-ram of word size at most w bits in constant time. Thus, the time bounds of any algorithm in the fs-ram model carry over directly to the uw-ram. Note that each fs-ram layout requires a different specialized hardware implementation, whereas a uw-ram architecture can simulate any fs-ram layout without further changes to its memory architecture.
Implementing FS-RAM Operations in the UW-RAM
Let $B_1, \ldots, B_B$ denote the bits of fs-ram memory. A particular fs-ram memory layout can be defined by the registers and the bits contained in them [9]. For example, in the Yggdrasil model in Fig. 3
in the example) [10] . In order to implement memory access operations on a given fs-ram using the uw-ram, we need to represent the memory layout of fs-ram in standard ram. Assume an fs-ram memory of r registers of b ≤ w bits each and B ≤ br distinct fs-ram bits. We assume that the fs-ram layout is given as a table R that stores, for each register and bit within the register, the number of the corresponding fs-ram bit.
We assume $R$ is stored in row-major order. We simply store the value of each fs-ram bit $B_i$ in a different $w$-bit entry of an array $A$ in ram, i.e., $A[i] = B_i$. We could store more than one bit in each word of $A$; however, this representation allows us to avoid having to serialize concurrent writes to the same word.
Given an index t of a register of an fs-ram represented by R, we can read the values of each bit of reg 
Algorithm 1 fs-ram read(t)
Since the read and write operations described above are sufficient to implement any operation that uses fs-ram memory (any other operation is implemented in ram), we have the following result.
Algorithm 2 fs-ram write(t, B = B i0 . . .
Theorem 1. Let $R$ be any fs-ram memory layout of $r$ registers of at most $b$ bits each and $B$ distinct fs-ram bits, with $b \le w$ and $\log B \le w$. Let $\mathcal{A}$ be any fs-ram algorithm that uses $R$ and runs in time $T$. Algorithm $\mathcal{A}$ can be implemented in the uw-ram to run in time $O(T)$, using $rb + B$ additional words of ram.
Proof. Table R indicating the fs-ram bit identifier for each register and bit within register can be stored in rb words of ram, while the values of each bit can be stored in B words of ram. Since both fs-ram read and fs-ram write are constant time operations, any t-time operation that uses fs-ram memory can be implemented in uw-ram in the same time t.
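To make the simulation concrete, the sketch below emulates register reads and writes for the Yggdrasil layout with $M = 16$ (so $r = 8$, $b = 4$, $B = 15$). It is our own illustration under stated assumptions, not a transcription of Algorithms 1 and 2: the names fsram_read and fsram_write are hypothetical, a wide word is modelled as a plain Python list of blocks, and each commented step stands for one constant-time uw-ram operation (word access, content access, compress or spread).

```python
B_BITS = 4      # b: bits per fs-ram register (Yggdrasil layout with M = 16)
NUM_REGS = 8    # r = M / 2 registers

# Layout table R in row-major order: entry t * b + k is the number of the
# fs-ram bit that is bit k of register t. Register t is the path from leaf
# node NUM_REGS + t up to the root (node 1) of a complete binary tree.
R = []
for t in range(NUM_REGS):
    node = NUM_REGS + t
    for _ in range(B_BITS):
        R.append(node)
        node //= 2

# A[i] holds the value of fs-ram bit B_i in its own word (A[0] is unused).
A = [0] * 16


def fsram_read(t):
    """Return the b bits of register t packed into one word."""
    addrs = R[t * B_BITS:(t + 1) * B_BITS]    # word access: row t of R
    bits = [A[i] for i in addrs]              # content access into A
    return sum(bit << k for k, bit in enumerate(bits))   # compress


def fsram_write(t, value):
    """Store the low b bits of value into register t."""
    addrs = R[t * B_BITS:(t + 1) * B_BITS]    # word access: row t of R
    bits = [(value >> k) & 1 for k in range(B_BITS)]     # spread
    for i, bit in zip(addrs, bits):           # content write into A
        A[i] = bit
```

Because registers 2 and 3 share the fs-ram bits $B_5$, $B_2$ and $B_1$, a write to register 3 is visible when register 2 is read, which is exactly the overlap that the fs-ram relies on. The bits of a single register have distinct addresses in A, so the content write never targets the same word twice.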
Constant Time Priority Queue
Brodnik et al. [10] use the Yggdrasil fs-ram memory layout to implement priority queue operations in constant time using $3M - 1$ bits of space ($2M$ of ordinary memory and $M - 1$ of fs-ram memory), where $M$ is the size of the universe. This problem has non-constant lower bounds for several models, including an $\Omega(\min\{\lg\lg M / \lg\lg\lg M, \lg N / \lg\lg N\})$ lower bound in the ram model when the memory is restricted to $N^{O(1)}$, where $N$ is the number of elements in the set to be maintained [6]. For a universe of size $M = 2^m$, for some $m$, the Yggdrasil fs-ram layout consists of $r = M/2$ registers of $b = \log M$ bits each, and $B = M - 1$ distinct fs-ram bits (Fig. 3 is an example with $M = 16$). Thus, applying Theorem 1 we obtain the following result:
Corollary 1. The discrete extended priority queue problem can be solved in the uw-ram in $O(1)$ time per operation using $2M + w(M/2)\log M + w(M - 1)$ bits, thus in $O(M \log M)$ words of ram.
Constant Time Dynamic Prefix Sums
Brodnik et al. [11] use a modified version of the Yggdrasil fs-ram to solve the dynamic prefix sums problem in constant time. This problem consists of maintaining an array $A$ of size $N$ over a universe of size $M$ that supports the operations update($j$, $d$), which sets $A[j]$ to $A[j] \oplus d$, and retrieve($j$), which returns $\bigoplus_{i=0}^{j} A[i]$ [19, 11], where $\oplus$ is any associative binary operation. This fs-ram implementation sidesteps lower bounds in various models: there is an $\Omega(\log N)$ lower bound on algebraic complexity [19] as well as in the semi-group model of computation [25], and an $\Omega(\log N / \log\log N)$ information-theoretic lower bound [19].
The result of Brodnik et al. [11] uses a complete binary tree on top of array A as leaves. The tree is similar to the one used in the priority queue problem, but it differs in that only internal nodes store any information and in that there are m = log M bits stored in each node. This tree is stored in a variant of the Yggdrasil memory called m-Yggdrasil, in which each register corresponds again to a path from a leaf to the root, but this time each node stores not only one bit but the m bits containing the sum of all values in the leaves of the left subtree of that node [11] . It is assumed that nm ≤ w, where n = log N and w is the size of the word in bits. Thus, an entire path from leaf to root fits in a word and can be accessed in constant time. An update or retrieve operation consists of retrieving the values along a path in the tree and processing them in constant time using bit-parallelism and table lookup operations. The space used by the lookup table can be reduced at the expense of an increased time for the retrieve operation. In general, both operations can be supported in time O(ι + [11] .
In order to represent the m-Yggdrasil memory in our model, we treat each bit of a node in the tree as a separate fs-ram bit. Thus, the fs-ram memory has r = N registers of b = nm bits each, and there are B = (N − 1)m distinct bits to be stored. Hence, by Theorem 1 we have: 
Dynamic Programming
In this section we describe uw-ram implementations of dynamic programming algorithms for the subset sum, knapsack, and longest common subsequence problems. A word-ram algorithm that only uses bit parallelism can be translated directly to the uw-ram. The algorithm for subset sum is an example of this. In general, however, word-ram algorithms that use lookup tables cannot be directly extended to $w^2$ bits, as this would require a mechanism to address $\Theta(w^2)$-bit words in memory as well as lookup tables of prohibitively large size. Hence, extra work is required to simulate table lookup operations. The knapsack implementation that we present is a good example of such a case.
Subset Sum
Given a set $S = \{a_1, a_2, \ldots, a_n\}$ of nonnegative integers (weights) and an integer $t$ (capacity), the subset sum problem is to find $S' \subseteq S$ such that $\sum_{a_i \in S'} a_i = t$. The optimization version asks for the solution of maximum weight which does not exceed $t$ [14]. This problem is NP-hard, but it can be solved in pseudopolynomial time via dynamic programming in $O(nt)$ time, using the following recurrence [7]: for each $0 \le i \le n$ and $0 \le j \le t$, $C_{i,j} = 1$ if and only if there is a subset of the elements $\{a_1, \ldots, a_i\}$ that adds up to $j$. Thus, $C_{0,0} = 1$, $C_{0,j} = 0$ for all $j > 0$, and $C_{i,j} = 1$ if $C_{i-1,j} = 1$ or $C_{i-1,j-a_i} = 1$ ($C_{i,j} = 0$ for any $j < 0$). The problem admits a solution if $C_{n,t} = 1$.
Pisinger [35] gives an algorithm that implements this recursion in the word-ram with word size $w$ by representing up to $w$ entries of a row of $C$ in one word. Using bit parallelism, $w$ bits of a row can be updated simultaneously in constant time from the entries of the previous row: $C_i$ is updated by computing the bitwise or of $C_{i-1}$ with a copy of $C_{i-1}$ shifted by $a_i$ positions towards higher indices [35]. Assuming $w = \Theta(\log t)$, this approach leads to an $O(nt/\log t)$ time solution in $O(t/\log t)$ space. The actual elements in $S$ that form the solution can be recovered with the same space and time bounds with a recursive technique by Pferschy [34].
This algorithm can be implemented directly in the uw-ram: entries of row $C_i$ are stored contiguously in memory; thus, we can load and operate on $w^2$ bits in $O(1)$ time when updating each row. Hence, the uw-ram implementation runs in $O(nt/\log^2 t)$ time using the same $O(t/\log t)$ space (number of $w$-bit words).
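The bit-parallel row update can be sketched with Python's arbitrary-precision integers standing in for a row of $C$ packed into wide words. This is an illustration under assumptions of ours: the function name subset_sum is hypothetical, bit $j$ of row holds $C_{i,j}$, and Python's `<<` moves bits towards higher indices regardless of how the wide word is drawn.

```python
def subset_sum(weights, t):
    """Decide whether some subset of weights adds up to exactly t."""
    mask = (1 << (t + 1)) - 1   # keep only the entries C_{i,0} .. C_{i,t}
    row = 1                     # row C_0: only the empty sum 0 is reachable
    for a in weights:
        # C_{i,j} = C_{i-1,j} or C_{i-1,j-a}: or the row with itself
        # shifted by a positions towards higher indices.
        row |= (row << a) & mask
    return bool((row >> t) & 1)  # entry C_{n,t}
```

Each iteration touches all $t + 1$ entries of a row with one shift and one or. On a machine with $b$-bit words this costs about $t/b$ word operations per row, which gives the $w$-fold gain when $b$ grows from $w$ to $w^2$.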
Knapsack
Given a set $S$ of $n$ elements with weights and values, the knapsack problem asks for a subset of $S$ of maximum value such that the total weight does not exceed a given capacity bound $b$. Let $S = \{(w_i, v_i) \mid 1 \le i \le n\}$, where $w_i$ and $v_i$ are the weight and value of the $i$-th element. Like subset sum, this problem is NP-hard but can be solved in pseudopolynomial time using the following recurrence [7]: let $C_{i,j}$ be the maximum value of a solution containing elements in the subset $S_i = \{(w_k, v_k) \mid 1 \le k \le i\}$ with maximum capacity $j$. Then, $C_{0,j} = 0$ for all $0 \le j \le b$, and $C_{i,j} = \max\{C_{i-1,j}, C_{i-1,j-w_i} + v_i\}$, where the second term applies only when $j \ge w_i$. The value of the optimal solution is $C_{n,b}$. This leads to a dynamic program that runs in $O(nb)$ time.
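For reference, the sequential $O(nb)$ recurrence can be sketched as follows. This is the textbook dynamic program, not Pisinger's word-ram algorithm; the name `knapsack_value` and the `(weight, value)` pair representation are assumptions made for this sketch.

```python
def knapsack_value(items, b):
    """Maximum total value of a subset of `items` with total weight <= b.

    `items` is a list of (weight, value) pairs with nonnegative integers.
    row[j] plays the role of C_{i,j}; rows are overwritten in place.
    """
    row = [0] * (b + 1)                       # C_{0,j} = 0 for all j
    for weight, value in items:
        # Traverse j downwards so row[j - weight] still holds C_{i-1,*}.
        for j in range(b, weight - 1, -1):
            row[j] = max(row[j], row[j - weight] + value)
    return row[b]


if __name__ == "__main__":
    print(knapsack_value([(1, 1), (3, 4), (4, 5), (5, 7)], 7))
```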
The word-ram algorithm by Pisinger [35] represents partial solutions of the dynamic programming table with two binary tables $g$ and $h$ and operates on $O(w)$ entries at a time. More specifically, $g_{i,u} = 1$ and $h_{i,v} = 1$ if and only if there is a solution with weight $u$ and value $v$ that is not dominated by another solution in $C_{i,*}$ (i.e., there is no entry $C_{i,u'}$ such that $u' < u$ and $C_{i,u'} \ge v$). Pisinger shows how to update each entry of $g$ and $h$ with a constant time procedure, which can be encoded as a constant size lookup table $T$. A new lookup table $T^{\alpha}$ is obtained as the $\alpha$-fold product of the original table $T$. Thus, $\alpha$ entries of $g$ and $h$ can be computed in constant time. Setting $\alpha = w/10$, an entire row of $g$ and $h$ can be computed in $O(m/w)$ time and $O(m/w)$ space [35], where $m$ is the maximum of the capacity $b$ and the value of the optimal solution. The optimal solution can then be computed in $O(nm/w)$ time.
Compared to the subset sum algorithm, which relies mainly on bit-parallel operations, this word-ram algorithm for knapsack relies on precomputation and use of lookup tables to achieve a $w$-fold speedup. While we cannot precompute a composition of $\Theta(w^2)$ lookup tables to compute $\Theta(w^2)$ entries of $g$ and $h$ at a time, we can use the same tables with $\alpha = w/10$ as in Pisinger's algorithm and use the read content operation of the uw-ram to make $w$ simultaneous lookups to the table. Since the entries in row $i$ of $h$ and $g$ depend only on entries in row $i-1$, there are no dependencies between entries in the same row.
One difficulty is that in order to compute the entries in row $i$ in parallel we must first preprocess row $i-1$ in both $h$ and $g$, such that we can return the number of one bits in both $g_{i-1,0}, \ldots, g_{i-1,j}$ and $h_{i-1,0}, \ldots, h_{i-1,j}$ in $O(1)$ time for any column $j \in \{0, \ldots, m-1\}$; that is, we need the prefix sums of the one bits in row $i-1$. Note that this is not the dynamic problem described in Section 3.3, but a static prefix sums problem. Furthermore, since the algorithm is the same for both $g$ and $h$, we describe the computation for $g$ alone.
Static Prefix Sums
We divide $g_{i-1}$ into blocks of $w$ contiguous bits and compute the number of ones in each block $g_{i-1,k}, \ldots, g_{i-1,k+w-1}$ for $k \in \{0, w, 2w, \ldots, \lfloor m/w \rfloor w\}$ using a lookup table. We store the results in an array $A$ of length $\lceil m/w \rceil$, with $A[k]$ storing the number of ones in the $k$-th block. Next, we compute the prefix sums $A'$ of $A$ in two steps. We divide $A$ into subarrays of $w$ consecutive entries. Let $A_i$ denote the subarray $A[iw, iw + w - 1]$, for $i \in \{0, 1, \ldots, |A|/w - 1\}$.
The first step is to compute the prefix sums $A'_i$ of each subarray $A_i$, i.e., $A'_i[k] = \sum_{j=0}^{k} A_i[j]$ for $0 \le k \le w-1$. Using the $w$ blocks of a wide word, we can operate on $w$ entries at a time. Consider the first $w$ consecutive subarrays $A_0, A_1, \ldots, A_{w-1}$. In order to compute $A'_0, \ldots, A'_{w-1}$, for each $0 \le k \le w-1$, we use the $i$-th block of the wide word to compute $A'_i[k]$, thus computing the entries for all $0 \le i \le w-1$ simultaneously. Each entry is computed in constant time, since $A'_i[k] = A'_i[k-1] + A_i[k]$.
Hence, we can compute the prefix sums of $w$ subarrays in $O(w)$ time. After computing the first $w$ subarrays we continue with the second group, and so on. Thus, we compute all prefix sums of the $O(|A|/w)$ subarrays in $O(|A|/w)$ time. The second step is to update each subarray of $A'$ by adding to each entry the last entry of the previous subarray, i.e., we set $A'_i[k] \leftarrow A'_i[k] + A'_{i-1}[w-1]$ for all $i = 1, \ldots, |A'|/w - 1$ (in increasing value of $i$). This can also be done for $w$ entries at once, but this time we use the blocks of the wide word to update all entries of one subarray simultaneously. Thus, sequentially for each $i = 1, \ldots, |A'|/w - 1$ we update $A'_i$ in $O(1)$ time, and hence $A'$ is updated in $O(|A|/w)$ time.
At this point, $A'$ contains the prefix sums of $A$, which took $O(|A|/w) = O(m/w^2)$ time to compute. Fig. 4 shows an example of this procedure.
Fig. 4. Example of computing prefix sums in the uw-ram with $w = 3$ and $m = 23$ (panels: Step 1 and Step 2). Numbers in parentheses indicate the parallel step number when computing $A'$ and underlined entries indicate the entries computed in that step.
Let $f$ be the number of ones in $g_{i-1,\lfloor j/w \rfloor w}, \ldots, g_{i-1,j}$, which can be computed using the lookup table. To compute the number of ones in $g_{i-1,0}, \ldots, g_{i-1,j}$ we return $A'[\lfloor j/w \rfloor - 1] + f$ (taking $A'[-1] = 0$).
Thus, each row of $g$ and $h$ takes $O(m/w^2)$ time to compute, and since there are $n$ rows, the total time to compute $g$ and $h$ (and hence the optimal solution) on the uw-ram is $O(nm/w^2)$. This achieves a $w$-fold speedup over Pisinger's word-ram solution.
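The two-step static prefix sums procedure can be simulated sequentially as below. This is a sketch under stated assumptions: the helper names (`block_counts`, `block_prefix_sums`, `rank_ones`) are hypothetical, the array of block counts is padded with zeros to a multiple of $w$, and each innermost loop marked as "one wide-word operation" stands for a single parallel step of the model.

```python
from itertools import accumulate


def block_counts(bits, w):
    """Number of ones in each block of w bits, padded to a multiple of w entries."""
    counts = [sum(bits[k:k + w]) for k in range(0, len(bits), w)]
    counts += [0] * (-len(counts) % w)
    return counts


def block_prefix_sums(A, w):
    """Prefix sums of A computed with the two-step scheme (len(A) % w == 0)."""
    P = list(A)
    num_sub = len(A) // w
    # Step 1: prefix sums inside each subarray, w subarrays per group.
    for group in range(0, num_sub, w):
        subs = range(group, min(group + w, num_sub))
        for k in range(1, w):
            for i in subs:                    # one wide-word operation
                P[i * w + k] = P[i * w + k - 1] + A[i * w + k]
    # Step 2: add the last entry of the previous subarray, in increasing i.
    for i in range(1, num_sub):
        carry = P[i * w - 1]
        for k in range(w):                    # one wide-word operation
            P[i * w + k] += carry
    return P


def rank_ones(bits, P, w, j):
    """Number of ones in bits[0..j], using the block prefix sums P."""
    block = j // w
    f = sum(bits[block * w:j + 1])            # lookup-table step in the paper
    return (P[block - 1] if block > 0 else 0) + f


if __name__ == "__main__":
    bits = [1, 0, 1, 1, 0, 0, 1, 1, 1, 0, 1, 0, 0, 1, 1, 1, 0, 1, 1, 0, 0, 1, 1]
    A = block_counts(bits, 3)
    P = block_prefix_sums(A, 3)
    print(P == list(accumulate(A)), rank_ones(bits, P, 3, 10))
```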
Generalizations of Subset Sum and Knapsack Problems
Pisinger [35] uses the techniques of the word-ram algorithm for subset sum and knapsack to obtain a word-ram algorithm for computing a path in a layered network: given a graph G = (V, E), a source s ∈ V and a terminal t ∈ V , and a weight for each edge, is there a path of weight b from s to t? Again, this algorithm translates directly to a uw-ram algorithm, thus yielding a w-fold speedup over the word-ram algorithm. Pisinger further uses the algorithms for the problems above to implement word-ram solutions for other generalizations of subset sum and knapsack problems, such as: the bounded subset sum and knapsack problems (each element can be chosen a bounded number of times), the multiple choice subset sum and knapsack problems (the set of numbers is divided in classes and the target sum must be matched with one number of each class), the unbounded subset sum and knapsack problems (each element can be chosen an arbitrary number of times), the change-making problem, and, finally, the two-partition problem. uw-ram implementations for all these generalizations are direct and yield a w-fold speedup over the word-ram algorithms (recall that w = Ω(log n)).
Longest Common Subsequence
The final dynamic programming problem we examine is that of computing the longest common subsequence (LCS) of two string sequences (Definition 1).
Definition 1 (LCS). Given a sequence of symbols $X = x_1 x_2 \ldots x_m$, a sequence $Z = z_1 z_2 \ldots z_k$ is a subsequence of $X$ if there exists an increasing sequence of indices $i_1, i_2, \ldots, i_k$ such that for all $1 \le j \le k$, $x_{i_j} = z_j$ [14]. Let $\Sigma$ be a finite alphabet of symbols, and let $\sigma = |\Sigma|$. Given two sequences $X = x_1 x_2 \ldots x_m$ and $Y = y_1 y_2 \ldots y_n$, where $x_i, y_j \in \Sigma$, the Longest Common Subsequence problem asks for a sequence $Z = z_1 z_2 \ldots z_k$ of maximum length such that $Z$ is a subsequence of both $X$ and $Y$.
This problem can be solved via a classic dynamic programming algorithm in O(nm) time [14] . We describe a uw-ram algorithm for LCS based on an algorithm by Masek and Paterson [31] . We note that there exist other approaches to solving the LCS problem with bit-parallelism (e.g., [15] ) that could also be adapted to work in the uw-ram. The approach we show here is a good example of bit parallelism combined with the parallel lookup power of the model, which we use to implement the Four Russians technique.
The base algorithm, which mainly relies on bit parallelism, leads to Theorem 2. We then extend the algorithm with the Four Russians technique to achieve further speedups, obtaining Theorem 3. 
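Let $c_{i,j}$ denote the length of the LCS of the prefixes $x_1 \ldots x_i$ and $y_1 \ldots y_j$. These values satisfy the classic recurrence [14]:
$$c_{i,j} = \begin{cases} 0 & \text{if } i = 0 \text{ or } j = 0, \\ c_{i-1,j-1} + 1 & \text{if } i, j > 0 \text{ and } x_i = y_j, \\ \max\{c_{i-1,j},\, c_{i,j-1}\} & \text{otherwise.} \end{cases} \qquad (1)$$
The length of the LCS is $c_{m,n}$, which can be computed in $O(mn)$ time. Consider an $(m+1) \times (n+1)$ table $C$ storing the values $c_{i,j}$. The idea of the uw-ram algorithm is to compute various entries of this table in parallel. We assume $w = \Theta(\max\{\log n, \log m\})$.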
Let $d_k$ denote the values in the $k$-th diagonal of table $C$, that is, $d_k = \{c_{i,j} \mid i + j = k\}$. Since a value in a cell $(i, j)$ with $i, j > 0$ depends only on the values of cells $(i-1, j)$, $(i-1, j-1)$ and $(i, j-1)$, all values in the same diagonal $d_k$ can be computed in parallel. Thus, we use the wide word to compute various entries of a diagonal in constant time. Since each value in a cell might use up to $\min\{\log n, \log m\}$ bits, each value might use up to an entire block of the wide word (if $\log m = \Theta(\log n)$); thus, $w$ cells can be computed in parallel. Since the total number of cells is $O(mn)$ and the critical path of the table has $m + n + 1$ cells, this approach takes $O(mn/w + m + n)$ parallel time, resulting in a speedup of $w$. However, we can obtain better speedups by using fewer bits per entry of the table, which enables us to operate on more values in parallel. To this end, instead of storing the actual values of the partial longest common subsequences, we store differences between consecutive values as described in [31] for the related string edit distance problem.
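Let $V$ and $H$ denote the tables of vertical and horizontal differences of values in $C$, respectively. Entries in these tables are defined as $V_{i,j} = c_{i,j} - c_{i-1,j}$ and $H_{i,j} = c_{i,j} - c_{i,j-1}$ for $1 \le i \le m$ and $1 \le j \le n$. Fig. 5 shows the tables $C$, $V$, and $H$ for an example pair of input sequences. We adapt Corollary 1 in [31] for the computation of $V$ and $H$:
Proposition 1. For all $1 \le i \le m$ and $1 \le j \le n$, taking $V_{i,0} = H_{0,j} = 0$ and writing $[x_i = y_j]$ for the value that is 1 if $x_i = y_j$ and 0 otherwise,
$$H_{i,j} = \max\{[x_i = y_j] - V_{i,j-1},\ 0,\ H_{i-1,j} - V_{i,j-1}\}, \qquad V_{i,j} = \max\{[x_i = y_j] - H_{i-1,j},\ 0,\ V_{i,j-1} - H_{i-1,j}\}.$$
Proof. Directly from Recurrence (1) we obtain $V_{i,j} = 1 - H_{i-1,j}$ if $x_i = y_j$ and $V_{i,j} = \max\{0, V_{i,j-1} - H_{i-1,j}\}$ otherwise. Symmetrically, $H_{i,j} = 1 - V_{i,j-1}$ if $x_i = y_j$ and $H_{i,j} = \max\{0, H_{i-1,j} - V_{i,j-1}\}$ otherwise. It is easy to verify from the definition of longest common subsequence and Recurrence (1) that $0 \le H_{i,j} \le 1$ and $0 \le V_{i,j} \le 1$ for all $i, j$, which implies that the maximum in $\max\{[x_i = y_j] - V_{i,j-1},\ 0,\ H_{i-1,j} - V_{i,j-1}\}$ is equal to the first term if $x_i = y_j$ and to the second or third terms otherwise. The same argument applies to $V_{i,j}$.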
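As a sanity check, the difference recurrences $H_{i,j} = \max\{[x_i = y_j] - V_{i,j-1}, 0, H_{i-1,j} - V_{i,j-1}\}$ and $V_{i,j} = \max\{[x_i = y_j] - H_{i-1,j}, 0, V_{i,j-1} - H_{i-1,j}\}$ can be compared against the full table $C$ in a few lines of Python. The function names are hypothetical and the code is sequential; it only illustrates that the difference tables carry the same information as $C$.

```python
def lcs_table(X, Y):
    """Full (m+1) x (n+1) table of LCS lengths c[i][j]."""
    m, n = len(X), len(Y)
    c = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if X[i - 1] == Y[j - 1]:
                c[i][j] = c[i - 1][j - 1] + 1
            else:
                c[i][j] = max(c[i - 1][j], c[i][j - 1])
    return c


def difference_tables(X, Y):
    """Horizontal (H) and vertical (V) difference tables; row/column 0 are zero."""
    m, n = len(X), len(Y)
    H = [[0] * (n + 1) for _ in range(m + 1)]
    V = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            eq = 1 if X[i - 1] == Y[j - 1] else 0
            H[i][j] = max(eq - V[i][j - 1], 0, H[i - 1][j] - V[i][j - 1])
            V[i][j] = max(eq - H[i - 1][j], 0, V[i][j - 1] - H[i - 1][j])
    return H, V


if __name__ == "__main__":
    H, V = difference_tables("abbab", "aabbba")
    # The LCS length is the sum of the horizontal differences in the last row.
    print(sum(H[-1][1:]))
```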
We compute tables $H$ and $V$ according to Proposition 1 diagonal by diagonal using bit parallelism in the wide word. Assume an alphabet $\Sigma = \{0, 1, 2, \ldots, \sigma - 1\}$ with $\log \sigma \le w - 1$. Although all entries in tables $H$ and $V$ are either 0 or 1, we will use fields of $O(\log \sigma)$ bits to store these values, since we can only compare at most $w^2/\log \sigma$ symbols simultaneously in the wide word. We divide the wide word $W$ in $f$-bit fields with $f = \max(\lceil \log \sigma \rceil, 2) + 1$. Each field will be used to store both symbols and intermediate results for the computation of the diagonals of $H$ and $V$, plus an additional bit to serve as a test bit in order to implement fieldwise comparisons as described in Appendix A. We require at least 3 bits because although all entries in tables $H$ and $V$ use one bit, intermediate results in calculations can result in values of $-1$. Thus, we require 2 bits to represent values $-1$, 0, and 1, and a test or sentinel bit to prevent carry bits resulting from subtractions from interfering with neighboring fields. We represent $-1$ in two's complement. It is not hard to extend the techniques for comparisons and maxima to the case of positive and negative numbers [24].
Let $H_k$ and $V_k$ denote the $k$-th diagonal of $H$ and $V$, respectively, i.e., $H_k = \{H_{i,j} \mid i + j = k\}$ and $V_k = \{V_{i,j} \mid i + j = k\}$. Consider table $H$. We will operate with each diagonal $H_k$ using $\lceil |H_k|/\ell \rceil$ wide words, where $\ell = w^2/f$. Let $f_0, \ldots, f_{\ell-1}$ denote the fields within a wide word in increasing order of bit significance. In each wide word, cells of $H_k$ will be stored in increasing order of column, i.e., if $H_{i,j}$ is stored in field $f_r$, then $f_{r+1}$ stores $H_{i-1,j+1}$. In order to compute each diagonal we must compare the relevant entries of strings $X$ and $Y$. We assume that each symbol of $X$ and $Y$ is stored using $\lceil \log \sigma \rceil + 1$ bits (including the test bit) and that $X$ is stored in reverse order. $X$ and $Y$ can be preprocessed in $O(m + n)$ time to arrange this representation, which will allow us to do constant-time parallel comparisons of symbols for each diagonal by loading contiguous words of memory in wide words.
Consider a diagonal $H_k$. Assume that the entire diagonal fits in a wide word $W$. This will not be the case for most diagonals, but we describe this case for simplicity; in general, the update is implemented as a sequence of steps, each updating a portion of the diagonal that fits in a wide word. We update the entries of $H_k$ as follows:
1. We load the symbols of the relevant substrings of $X$ and $Y$ into words $W_X$ and $W_Y$, with the substring of $X$ in reverse order. More specifically, for a diagonal $k$, $W_Y = y_{j_1} y_{j_1+1} \ldots y_{j_2}$, where $j_1 = k - \min(|X|, k - 1)$ and $j_2 = \min(|Y|, k)$, and $W_X = x_{i_2} x_{i_2-1} \ldots x_{i_1}$ with $i_2 = k - j_1$ and $i_1 = k - j_2$. We subtract $W_Y$ from $W_X$, mask out all non-zero results and write a 1 in each field that resulted in 0. We store the resulting word in $W_{eq}$, where each field corresponding to a cell $(i, j)$ stores a 1 if $x_i = y_j$ and a 0 otherwise (this can be implemented through comparisons as described in Appendix A).
2. We load $V_{k-1}$ into a word $W_V$ and subtract it from $W_{eq}$ to obtain $[x_i = y_j] - V_{i,j-1}$ for all $i, j$ in $H_k$ simultaneously and store the result in $W_1$.
3. We load $H_{k-1}$ into a word $W_H$ and subtract $W_V$ from it to obtain $H_{i-1,j} - V_{i,j-1}$ for all $i, j$ in $H_k$, storing the result in $W_2$.
4. Finally, using fieldwise comparisons, we obtain the fieldwise maximum of $W_1$, $W_2$ and the word 0. The resulting word is $H_k$.
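The four steps above can be mimicked with Python lists standing in for the fields of a wide word. This is only a sketch under stated assumptions: the name `lcs_length_by_diagonals` is hypothetical, dictionaries replace the packed diagonals, base cases are read as zeros, and each list comprehension corresponds to one fieldwise operation that the uw-ram would perform in constant time.

```python
def lcs_length_by_diagonals(X, Y):
    """LCS length computed diagonal by diagonal from difference values."""
    m, n = len(X), len(Y)
    H, V = {}, {}                                  # (i, j) -> 0/1; missing = base case 0
    for k in range(2, m + n + 1):
        # Cells (i, j) of diagonal k with i + j = k, in increasing column order.
        cells = [(k - j, j) for j in range(max(1, k - m), min(n, k - 1) + 1)]
        w_eq = [1 if X[i - 1] == Y[j - 1] else 0 for i, j in cells]   # step 1
        w_v = [V.get((i, j - 1), 0) for i, j in cells]                # from V_{k-1}
        w_h = [H.get((i - 1, j), 0) for i, j in cells]                # from H_{k-1}
        w1 = [e - v for e, v in zip(w_eq, w_v)]                       # step 2
        w2 = [h - v for h, v in zip(w_h, w_v)]                        # step 3
        new_h = [max(a, b, 0) for a, b in zip(w1, w2)]                # step 4
        # The analogous computation for V_k.
        new_v = [max(e - h, v - h, 0) for e, h, v in zip(w_eq, w_h, w_v)]
        for cell, hv, vv in zip(cells, new_h, new_v):
            H[cell], V[cell] = hv, vv
    # c_{m,n} is the sum of the horizontal differences along the last row.
    return sum(H.get((m, j), 0) for j in range(1, n + 1))


if __name__ == "__main__":
    print(lcs_length_by_diagonals("abbab", "aabbba"))
```

A space-efficient version would keep only the most recent diagonals instead of whole dictionaries; the dictionaries are used here purely for readability.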
All the operations described above can be implemented in constant time. The procedure to compute $V_k$ is analogous. Note that the entries corresponding to base cases in the first row and column in the LCS table correspond to the base cases of the horizontal and vertical vectors, respectively. When computing diagonals $H_k$ with $k \le n + 1$ and $V_k$ with $k \le m + 1$, the entries corresponding to base cases are not computed from previous diagonals but should be added appropriately at the end of $H_k$ and beginning of $V_k$. Example 1 shows how to compute $H_6$ from $H_5$ and $V_5$ (in gray) in Fig. 5 with the above procedure.
Example 1. Let $X = abbab$ and $Y = aabbba$ be two strings. Fig. 5 shows the entries of the dynamic programming table for computing the LCS of $X$ and $Y$, as well as the values of horizontal and vertical differences.
In this example $\sigma = 2$, thus we use one bit for each symbol ('a'=0, 'b'=1), but we use $f = 3$ bits per field. Consider the diagonal $H_6$ in table $H$ (in dark gray). We now illustrate how to obtain $H_6$ from $H_5$ and $V_5$ (in light gray). In what follows we represent the number in each field in decimal and do not include the details of fieldwise comparison and maxima.
Adding this time over all $m + n$ diagonals yields the total time. For the space, each diagonal is represented in $f/w^2$ wide words, where $f = O(\log \sigma)$ is the number of bits per field. Since we can compute each diagonal $H_k$ and $V_k$ using only $H_{k-1}$ and $V_{k-1}$, we only need to store 4 diagonals at any given time. Since the maximum length of a diagonal is $\min(n, m) + 1$ and each wide word can be stored in $w$ regular words of memory, the result follows.
Recovering a Longest Common Subsequence
It is known that given a dynamic programming table storing the values of the LCS between strings $X$ and $Y$, one can recover the actual subsequence by starting from $c_{m,n}$ and following the path through the cells corresponding to the values used when computing each value $c_{i,j}$ according to Recurrence (1): if $x_i = y_j$, then we add $x_i$ to the LCS and continue with cell $(i-1, j-1)$; otherwise the path follows the cell corresponding to the maximum of $c_{i-1,j}$ or $c_{i,j-1}$. Although Algorithm 3 does not compute the actual LCS table, a path of an LCS can be easily computed using tables $H$ and $V$. The path starts at cell $(m, n)$ (of either table). Then, to
Four Russians Technique
The computation of the longest common subsequence in the uw-ram can be made even faster by combining the diagonal-by-diagonal order of computation described above with the Four Russians technique. The Four Russians technique [3] was used by Masek and Paterson to speed up the computation of the string edit problem (and also the LCS) in a ram with indirect addressing [31]. The technique consists of dividing the dynamic programming table in blocks of size $t \times t$ cells. In a precomputation phase, all possible blocks are computed and stored as a data structure indexed by the first row and column of each block. The LCS can then be computed by looking up relevant values of the table one block at a time using the data structure. In a ram with indirect addressing and under a suitable value of $t$, the last row and column of a block can be obtained by looking up the entry corresponding to the first row and column of that block in constant time. This technique yields a speedup of $O(t^2)$ with respect to computing all cells in the table, for a total time of $O(n^2/t^2)$ (for two strings of length $n$) plus the time for the precomputation of all blocks.
By setting t = O(log n) (for a constant alphabet size) and encoding the table with difference vectors, the precomputation time can be absorbed by the time to compute the main table (see [31, 23] for a more detailed description of the technique).
We can use the power of parallel memory accesses of the uw-ram to speed up the computation of the LCS even further by looking up blocks in parallel, in a similar fashion to the diagonal-by-diagonal approach described above. For simplicity, assume $m = n$. Using the same encoding for $H$ and $V$, we first precompute all possible blocks of $H$ and $V$ of size $t \times t$. Since a block is completely determined by its first column and row, whose values are in $\{0, 1\}$, and the two substrings of length $t$ (over an alphabet of size $\sigma$), there are $O((2\sigma)^{2t})$ possible blocks. Note that we can encode each cell now with one bit, since we do not need to do symbol comparisons in parallel. Each block can be computed in $O(t^2)$ time with the standard sequential algorithm, so the precomputation time is $O((2\sigma)^{2t} t^2)$. We set $t = (\log_{2\sigma} n)/2$, and thus the precomputation time is $O(n \log^2 n)$ [23]. Since $t \le w/2$, we can use each block of the wide word to look up the entry for each block by using a parallel lookup operation. Thus, as described previously, we can compute tables $H$ and $V$ in diagonals of blocks, computing $\min(\ell, w)$ blocks simultaneously in a diagonal of length $\ell$ blocks. There are $(n/t)^2$ blocks to compute and the critical path of the table has length $n/t$ blocks. Therefore, the computation of $H$ and $V$ can be carried out in time $O(n^2/(t^2 w) + n/t) = O(n^2 \log^2 \sigma/w^3 + n \log \sigma/w)$, since $t = \Theta(w/\log \sigma)$. This result is summarized by Theorem 3:
Theorem 3. The length of the LCS of two strings $X$ and $Y$ of length $n$ over an alphabet of size $\sigma$ can be computed in the uw-ram in $O(n^2 \log^2(\sigma)/w^3 + n \log(\sigma)/w)$ time. For $\sigma = O(1)$ and $w = \Theta(\log n)$ this time is $O(n^2/\log^3 n)$.
String Searching
Another example of a problem where a large class of algorithms can be sped up in the uw-ram is string searching. Given a text $T$ of length $n$ and a pattern $P$ of length $m$, both over an alphabet $\Sigma$, string searching consists of reporting all the occurrences of $P$ in $T$. We focus here on on-line searching, that is, with no preprocessing of the text (though preprocessing of the pattern is allowed), and we assume in general that $n \gg m$. We use two classic algorithms for this problem to illustrate different ways of obtaining speedups via parallel operations in the wide word. More specifically, we obtain speedups of $w = \Omega(\log n)$ for uw-ram implementations of the Shift-And and Shift-Or algorithms [4, 40], and the Boyer-Moore-Horspool algorithm [28]. For a string $S$, let $S[i]$ denote its $i$-th character, and let $S[i..j]$ be the substring of $S$ from position $i$ to $j$. Indices start at 1.
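In the Shift-And algorithm, the set of active states after reading each text character $T[j]$ is kept in a bit vector $D$ of $m$ bits, which is updated as $D \leftarrow ((D \ll 1) \mid 1) \mathbin{\&} B[T[j]]$, where, for each symbol $\sigma \in \Sigma$, $B[\sigma]$ is a bit vector with set bits in the positions of the occurrences of $\sigma$ in $P$. The OR with a 1 corresponds to the initial state always being active to allow a match to start at any position. The Shift-Or algorithm is similar but it saves this operation by representing active states with zeros instead of ones.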
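For reference, the classic sequential Shift-And algorithm can be sketched as follows, with a Python integer as the state vector. The function name `shift_and` and the dictionary `B` of per-character masks are assumptions of this sketch; it is the standard word-ram formulation, not one of the uw-ram variants. Reported positions are 1-based, as in the text.

```python
def shift_and(text, pattern):
    """Return the 1-based starting positions of all occurrences of pattern in text."""
    m = len(pattern)
    if m == 0:
        return []
    # B[c] has bit i set iff pattern[i] == c (bit 0 = first pattern character).
    B = {}
    for i, ch in enumerate(pattern):
        B[ch] = B.get(ch, 0) | (1 << i)
    accept = 1 << (m - 1)          # bit of the final state
    D = 0                          # no active states initially
    occurrences = []
    for pos, ch in enumerate(text, start=1):
        # Shift in the always-active initial state, then filter by the character.
        D = ((D << 1) | 1) & B.get(ch, 0)
        if D & accept:
            occurrences.append(pos - m + 1)
    return occurrences


if __name__ == "__main__":
    print(shift_and("abracadabra", "abra"))
```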
We describe two uw-ram algorithms for Shift-And that illustrate different techniques, noting that the uw-ram implementation of Shift-Or is analogous. We obtain the following theorem:
Theorem 4. Given a text $T$ of length $n$ and a pattern $P$ of length $m$, we can find the occ occurrences of $P$ in $T$ in the uw-ram in time $O(nm/w^2 + n/w + occ)$.
∈ Σ, and that $w \ge \log(n + m)$. In order to report matches at each step in time proportional to the number of matches (and not the number of blocks), we move directly to blocks with matching positions by using a function that for every word of length $w$ returns an array $A$ with the positions of set bits. For example, for $w = 5$ and $x = 01011$, $A = [1, 3, 4]$. We do this by table lookup in a table with $(w/2)$-bit entries, whose space is $O(2^{w/2} w)$ words, which for $w = \log n$ is $O(\sqrt{n} \log n)$.
(there is a match) or a mismatch is found. Either way, the window is then shifted so that $T[i + m - 1]$ is aligned with the last occurrence of this character in $P$ (not counting $P[m]$). The worst case running time of bmh is $O(nm)$ (when the entire window is checked for all window positions) but on average the window can be shifted by more than one character, making the running time $O(n)$ [5]. In the uw-ram, we can take advantage of the wide word to make several character comparisons in parallel, thus achieving a $w$-fold speedup over the worst case behaviour of bmh. A recent SIMD-based implementation of bmh using SSE4.2 on Intel i5 and Xeon processors [30] is evidence of the practicality of this approach. First, we divide each wide word in $f$-bit fields so that each field contains one character, thus $f = \lceil \log \sigma \rceil$. At each position of the window, we do a fieldwise comparison between a wide word containing the characters of the text and one containing the characters of the pattern. We do this simply by subtracting both words.
Since we only care if all symbols in the words match, we only need to check if the result is zero, without having to worry about carries crossing fields (and hence we do not need a test bit). We shift the window to the next position if the result is not zero. Note that this check can be done in constant time, and it is quite simple as we do not need to identify where there was a mismatch. Thus in each window we can compare up to $w^2/f$ symbols in parallel, and hence the running time in the worst case becomes $O(mn \log \sigma/w^2 + 1)$. We show the pseudocode in Algorithm 6 which, again, is based on the pseudocode of this algorithm presented in [33, Chapter 2.3.2]. Note that for a given input the distance of the shifts is exactly the same as in the original version of the algorithm, and therefore the average running time remains the same. Note as well that the average running time can be reduced by using each block to search in disjoint parts of the text at the expense of increasing the worst case time to $O(mn \log \sigma/w + 1)$ due to the reduction in the number of characters that can be compared simultaneously.
Theorem 5. Given $T$ of length $n$ and $P$ of length $m$ over an alphabet of size $\sigma$, we can find the occurrences of $P$ in $T$ with a uw-ram implementation of BMH in $O(mn \log \sigma/w^2 + 1)$ time in the worst case and $O(n)$ time on average.
Conclusions
We introduced the Ultra-Wide Word architecture and model and showed that several classes of algorithms can be readily implemented in this model to achieve a speedup of Ω(log n) over traditional word-ram algorithms. The examples we describe already show the potential of this model to enable parallel implementations of existing algorithms with speedups comparable to those of multi-core computations. We believe that this architecture could also serve to simplify many existing word-ram algorithms that in practice do not perform well due to large constant factors. We conjecture as well that this model will lead to new efficient algorithms and data structures that can sidestep existing lower bounds.
Algorithm 6 BMH(T, P, n = |T |, m = |P |, Σ). For simplicity, we assume that w divides m log σ. We assume also that T and P are represented with log σ bits per symbol. We still use T [i] to denote one character, which can be easily obtained from the packed representation in constant time (the same applies to the actual address of starting characters of substrings). 
