Permutation multiplication (or permutation composition) is perhaps the simplest of all algorithms in computer science. Yet for large permutations, the standard algorithm is not the fastest for disk or for flash, and surprisingly, it is not even the fastest algorithm for RAM on recent multi-core CPUs. On a recent commodity eight-core machine we demonstrate a novel algorithm that is 50% faster than the traditional algorithm. For larger permutations on flash or disk, the novel algorithm is orders of magnitude faster. A disk-parallel algorithm is demonstrated that can multiply two permutations with 12.8 billion points using 16 parallel local disks of a cluster in under one hour. Such large permutations are important in computational group theory, where they arise as the result of the well-known Todd-Coxeter coset enumeration algorithm. The novel algorithm emphasizes several passes of streaming access to the data instead of the traditional single pass using random access to the data. Similar novel algorithms are presented for permutation inverse and permutation multiplication by an inverse, thus providing a complete library of the underlying permutation operations needed for computations with permutation groups.
INTRODUCTION
Algorithms are introduced for efficiently executing the basic permutation operations for large permutations, permutations that range in size from 4 million points to permutations with billions of points.
The standard permutation algorithm is: All experiments are performed on random permutations. In this regime, almost every iteration incurs a cache miss.
The size of the permutation dictates the preferred architecture. At the high end of our regime (billions of points), the preferred architecture consists of parallel disks. Using parallel disks, we are able to efficiently multiply permutations with 12.8 billion points in under one hour using the 16 local disks of a 16-node cluster. (Table 4 ).
In the case of flash memory, it took under one hour to multiply two permutation with 2.5 billion points using a single machine with two solid state flash disks in a RAID configuration (see Table 2 ).
In the case of RAM, one has a choice of using a multithreaded algorithm or multiple independent single-threaded processes. Both regimes of computation are useful. Where independent computations from a parameter sweep are performed, or where a parallelization of the higher algorithm is available, independent single-threaded processes are preferred. Where a single inherently sequential algorithm is the goal, the multi-threaded algorithm is preferred.
Experimental results show a 50% speedup in both cases. The novel algorithm has its primary advantage for permutations large enough that they overflow the CPU cache. In the case of a multi-threaded algorithm, we demonstrate the speedup on a recent eight-core commodity computer for permutations with 32 million points (see Table 8 ). In the case of single-threaded processes, we run eight competing processes simultaneously, and demonstrate the same 50% speedup over the traditional permutation algorithm. In this single-threaded case, the speedup is observed for permutations with as few as 4 million points (see Table 7 ).
Similar algorithms are also presented for permutation inverse and permutation multiplication by inverse. This completes the standard suite of permutation primitives required by packages that support permutation algorithms, such as GAP [6] .
The importance of these new methods for computational group theory is immediately evident by considering a previous permutation computation of one of the authors. In 2003, a group membership permutation computation for Thompson's group was reported. Thompson's group acts on 143,127,000 points [4] . Those 143 million points from seven years earlier are well within the regime of interest discussed in this paper: between 4 million points and billions of points. That computation now fits on today's commodity computers, including the in-RAM technique of this paper, and would be expected to produce a result 50% faster.
In addition to permutations being given directly, permutations arise frequently as the output of a Todd-Coxeter coset enumeration algorithm. There are several excellent descriptions of this algorithm [1, 5, 13, 17] . In those cases, the first description of the group is as a finite presentation, and one employs coset enumeration to convert this into a more tractable permutation representation. The group can then be efficiently analyzed through such algorithms as Sims's original polynomial-time group membership and the rich library that has grown up around it. Examples of such large coset enumerations include parallel coset enumeration [2] used to find a permutation representation of Lyons's group on 8,835,156 points, sequential coset enumeration [7] used to find a different permutation representation of Lyons's group on 8,835,156 points, and a result [8] finding a permutation representation of Thompson's group on 143,127,000 points.
Problem Description
In addition to the problem of permutation multiplication, two other standard permutation operations are typically supported by permutation subroutine packages: permutation inverse and permutation multiplication by an inverse. The last problem, X −1 Y , is often included as a primitive operation because there exists a more efficient implementation than composing inverse with permutation multiplication:
More formally, the problems are: Let X and Y be two arrays with the same number of elements N , both indexed from 0 to N − 1, such that:
Compute the values of another array, Z, with N elements, defined as follows:
Problem 1.3 (Multiply by Inverse).
Compute the result of multiplying a permutation by an inverse X −1 × Y :
Other Problems
While a full discussion is beyond the scope of this paper, we also note that the new algorithms presented for permutation multiplication also apply to object rearrangement:
When the size of an object remains small compared to the size of a disk block, flash block, or cache line, then the algorithm can be used on disk, flash, or RAM, respectively. Further, the algorithm described here generalizes in an obvious way when Y is near to a permutation, but whose values may include duplicate entries from {0 . . . N − 1}, while omitting other entries from {0 . . . N − 1}.
Terminology.
In this paper we present three permutation multiplication algorithms for architectures with at least two levels of memory, in increasing order of performance: the "external sort algorithm", the "buckets algorithm" and the "implicit indices algorithm".
The terminology "fast-memory/slow-memory" refers to an algorithm which uses slow-memory as the slower, much larger lower-level memory (the one on which the permutation arrays are stored), and fast-memory as the faster, much smaller higher-level memory (which cannot hold the entire permutation arrays).
Organization of the Paper.
The rest of the paper is organized as follows: Section 2 presents related work, Sections 3 and 4 present our new fast algorithms, along with some theoretical considerations on their performance. Section 5 presents new fast algorithms for permutation inverse and multiplication by an inverse. Section 6 presents formulas for the optimal running time, under the assumption that the CPU cores are infinitely fast and that the single bus from CPU to RAM is the only bottleneck (or time to access flash memory or disk). Section 7 presents the experimental results, followed by the conclusion in Section 8.
Overview of the Algorithms.
Six algorithms are presented. Algorithms 1 and 2 are intended solely to explore the design space. Algorithms 1 and 2 are disk-based permutation multiplication algorithms using external sorting and a simple buckets technique, respectively. Algorithm 3 reviews an older method for permutation multiplication [3, 4] , here called implicit indices. Algorithm 4 constitutes the central novelty of this work. It presents a multi-threaded parallel permutation multiplication algorithm. Tables 4 and 5 , along with Section 3.2, present a generalization to parallel distributed disks. Algorithms 5 and 6 review older algorithms for permutation inverse and multiplication by inverse [3, 4] , that are analogous to Algorithm 3. The generalization to the multi-threaded case (analogous to Algorithm 4) is omitted for lack of space, but experimental results are presented in Table 8 .
Section 6 presents a new timing analysis applicable to Algorithms 3, 4, 5 and 6 and their parallel generalizations.
RELATED WORK
The current work builds upon [3] . In that work, the authors present a fast RAM-based permutation algorithm that worked well on the Pentium 4, due in part to the 128-byte cache line on that CPUs. Most later CPUs have 64-byte cache lines, and so that algorithm, which is reviewed in this paper as Algorithm 3, later achieved mixed results. Algorithm 3 was also used as a sequential disk-based algorithm in [4] . Related sequential algorithms for permutation inverse and permutation multiplication by inverse were also described in [3, 4] .
For lower-level memory data, some of the main ideas of disk-based computing [14, 16] have been used successfully in recent years to solve or make progress on important problems in computational group theory [9, 10, 14, 15] , where the size of the data is too large for one RAM subsystem or even the aggregate RAM of a cluster.
The memory gap and memory wall phenomena are very important for understanding the reasons behind the efficiency of our new algorithms and the limitations of both our new algorithms and the traditional algorithms. These phenomena are well-known in literature [3, 18] . All the algorithms we describe, whether traditional or new, are memorybound for certain parameters.
PERMUTATION MULTIPLICATION USING EXTERNAL MEMORY
New algorithms for large permutations are presented. For many problems in computational group theory, the size of a permutation is in the range of tens to hundreds of gigabytes.
The first case presented below deals with permutations that fit on a single disk, with a permutation occupying at least 10 GB of space, but not more than 50 GB. These same algorithms can be run on flash memory. Both disk and flash are types of external memory in wide use today. Table 2 presents experimental results obtained by running our implicit indices algorithm both on flash and on disk. In the following three subsections one can replace disk with flash and everything remains correct.
Local Disk and Flash
The traditional implementation for permutation multiplication would be:
Using this implementation would be impractical. For large enough pseudo-random permutations, most array accesses are to random locations on disk. Thus a memory page would be swapped in from disk at almost every array element access. On most current systems a memory page is on the order of 4 KB. If the element size is 8 bytes, then for each 8 bytes the traditional algorithm accesses the system would actually transfer 4 KB of data, which results in a 4 KB/8 bytes = 512 times ratio of transferred to useful data. This was indeed observed for naive permutation multiplication running in virtual memory (see Table 3 ).
A few important notions are defined before discussing the details of the three new algorithms for external memory.
Definition 1. System and Algorithm Parameters
The values in each permutation array X, Y and Z can be represented on β bytes. Hlms = the size of the higher-level memory component, in number of elements of β bytes.
Any arrays used in the algorithms can be divided into blocks of length Bl = (Hlms/2) number of elements. Two blocks must simultaneously fit in Hlms.
N b = N/Bl is the total number of blocks in an array.
Using External Sort
The disk-based permutation multiplication method using external sorting is described in Algorithm 1.
Using the concept of buckets that fit in RAM, one can significantly improve the performance of the algorithm. RAM buckets are an alternative to external sorting which trades the n log n running time of sorting for random access within RAM. RAM buckets have significantly sped up computations that previously used external sorting [11] . 
Using RAM Buckets
The RAM buckets method is described in Algorithm 2. The RAM bucket size has to be chosen such that two RAM buckets simultaneously fit in RAM. Considering that both the index i and the value X[i] are represented using the same number of bytes, one needs 2×N/Hlms buckets (here Hlms is the size of RAM). for each index i in this bucket do 6:
Save the pair (j,
Load buckets D b and Z b into RAM. 10:
for each index i in this bucket do 11:
Algorithm 2 presents a few important improvements over Algorithm 1. Note that in phase 2 of Algorithm 2, there is no need to save the index in the buckets of array Y , since it is implicit in the ordering. Thus a bucket of array Y occupies twice as little space as a bucket of pairs (i, X[i]). In phase 3, Z is also divided into 2 × N/Hlms − 1 buckets, and all indices from the j-th bucket of D correspond to positions in the j-th bucket of Z. Algorithm 2 completely eliminates sorting and, in practice, shows a 4 times (or more) speedup over the External Sort-based algorithm if the computation is disk-bound (see Table 5 , the 1 node case).
Both algorithms 1 and 2 need to save the index of each value of the X permutation, thus resulting in disk arrays as large as twice the size of the initial arrays. The implicit indices RAM/disk algorithm (Algorithm 3) avoids saving the indices to disk arrays. for each index i in this bucket do 6: 
With Implicit Indices
The implicit indices version runs about twice as fast as the buckets version (see Table 4 ). The implicit indices RAM/ disk algorithm performs the following steps: a sequential read of the X array and a sequential write of the D (temporary) array in phase 1 (2 sequential accesses); a sequential read of the D array, a sequential read of the Y array and a sequential write of the D array (3 sequential accesses); and a sequential read of the X array, a sequential read of the D array and a sequential write of the Z array (3 sequential accesses). In total, there are 8 sequential accesses
It is interesting to compare the running time of the implicit indices algorithm and the running time of a permutation multiplication algorithm that we implemented in Roomy [12] , which uses Algorithm 2. Roomy is a general framework for disk-based and parallel disk-based computing which provides a high-level API for manipulating large amounts of data. The disk-based implicit indices algorithm is generally twice as fast as the Roomy implementation.
Many Disks
Here we describe how the three disk-based algorithms for permutation multiplication, presented in Section 3.1, can be used with the many disks in a cluster of computers.
Serial permutation multiplication using external sort is described in Algorithm 1. To parallelize it, all arrays are first split into sub-arrays, each of which is placed on the disk of a single compute node in the cluster. All operations on those arrays are performed in parallel. In cases where one node generates data that references a sub-array on another node, that data is first sent over the network, then saved to disk. In our implementation, there is a separate thread of execution on each node that handles the writing of this remote data to the local disk. Finally, there is a synchronization point after each phase, to insure that all nodes are done with one phase before beginning the next.
Permutation multiplication using buckets (Algorithm 2) is made parallel in the same way. The arrays are already split into sub-arrays (buckets), and the same methods are used for data distribution, parallel processing, and synchronization.
There is one additional modification necessary to parallelize permutation multiplication using implicit indices (Algorithm 3). Because the algorithm depends on the specific ordering of elements in each bucket, the buckets can not be written to in parallel. This is solved in the same way that Algorithm 4 extends Algorithm 3: each bucket is further split into sub-buckets, so that each node has its own sub-bucket to write to. Unlike the multi-threaded RAM case, the parallel disk case does not need an extra phase to compute the sizes of the sub-buckets, since the buckets are represented with files, which are dynamically sized.
PERMUTATION MULTIPLICATION IN RAM
The traditional permutation multiplication algorithm for cache/RAM can be trivially-parallelized. Each thread processes a contiguous region of the X[] permutation array. Although this incurs frequent cache misses, it tends to scale linearly on current commodity computers until one goes beyond four cores. This is because the single bus to RAM becomes saturated by the pressure of the several cores. In Table 8 of Section 7, one sees this happening approximately with 3 threads for permutation multiplication and for 4 threads for inverse and multiplication by inverse.
Algorithm 3 of Section 3 presented a single-threaded diskbased algorithm to overcome the many page faults. The same algorithm can be implemented for cache/RAM to minimize cache misses. That algorithm's cache/RAM version is preferred for permutation algorithms that can be parallelized at a higher level and then call a single-thread permutation multiplication algorithm. Here, we consider a multithreaded version for the case when the higher level algorithm does not parallelize well. The corresponding results at the level of eight cores are presented in Table 7 in Section 7. As described in the extrapolation in Section 7.3, both the new single-threaded and the new multi-threaded algorithms are expected to have an even greater advantage at the 16-core and higher level in the future.
Algorithm 4 provides the multi-threaded version for multiplication using cache/RAM. Intuitively, it operates by splitting the buckets of Algorithm 3 into sub-buckets. Within a given bucket, each thread "owns" a contiguous region (a subbucket) for which it has responsibility. Algorithm 4 requires one extra phase (Phase 1) in order to determine in advance the size of the sub-bucket to allocate for each thread.
Some alternative designs were also explored. A brief summary of the alternatives considered is presented along with our reasons for rejecting them. • Using pthread private data via " thread" (problem: uses too much memory).
• Using pthread locks to synchronize memory access (problem: synchronization delays).
• Using an atomic add operation to a single global counter (problem: internally, it still uses a lock).
• Exploiting L1 cache via a two-level algorithm, similar to two-level external sort (problem: delays due to extra passes).
Section 7.3 presents experimental results for the cache/ RAM multi-threaded implicit indices algorithm.
PERMUTATION INVERSE. MULTIPLICATION BY AN INVERSE
While Algorithms 5 and 6 are not new [3, 4] , their multithreaded generalizations analogous to Algorithm 4 are novel. Experimental results for running permutation inverse and multiplication by an inverse, as well as theoretical estimates for these runs, can be found in Table 8 .
Permutation Inverse.
The traditional algorithm for permutation inverse is:
The bottleneck is still the random access (this time write access) to the Y array. Permutation Multiplication by an Inverse.
For the multiply-by-inverse, the traditional algorithm is:
At the end of the loop 
PERFORMANCE ANALYSIS
The analysis presented here can be used to estimate the running time for the implicit indices algorithms, when using any 2-level memory hierarchy, including cache/RAM, RAM/ flash, RAM/disk. The implicit indices algorithms include Algorithm 3, its generalization to Algorithm 4, Algorithms 5 and 6, and their parallel generalizations. We refer to permutation multiplication as PM, to permutation inverse as PI, and to permutation multiplication by an inverse as PMI. The next three formulas estimate the running time when memory is the bottleneck for PM, PI, and PMI, respectively. Note that in the case of cache/RAM N/Lrb must be added to each formula, due to the extra pass. 
EXPERIMENTAL RESULTS

Local disk and flash
Tests were ran on an AMD Phenom 9550 Quad-Core at 2.2 GHz with 4 GB of RAM, running Fedora Linux with kernel version 2.6.29. The machine has both a disk drive (Seagate Barracuda 7200.10 250GB) and 2 RAID-ed flash SSD drives (2 × INTEL SSD SSDSA2MH080G1GC, 80 GB each). Table 1 contains the measured system parameters of this machine. Table 1 also contains the measured system parameters for one of the disks of the cluster that was used to run the "parallel RAM/parallel disk" algorithms. The parallel disk bandwidth assumes that network bandwidth is not a limiting factor. Table 4 shows this to be the case for permutation arrays of size up to 25 GB. Table 2 shows a comparison between the new RAM/disk algorithm and the new RAM/flash algorithm, both based on implicit indices. The estimates from the formulas of Section 6 are also presented, to confirm that the algorithm is limited by the bandwidth of disk and flash. Table 3 details our findings about the traditional permutation multiplication algorithm ran in virtual memory on the same machine. The experimental results confirmed our expectations: when the working set is at least twice the size of available RAM, using the traditional algorithm in virtual 
Many disks
These experiments were run on a cluster of computers, each with two dual-core 2.0 GHz Intel Xeon 5130 CPUs and 16 GB of RAM, a locally attached 500 GB disk, running Linux kernel version 2.6.9. The network used a Dell PowerConnect 3348 Fast Ethernet switch. Only one process was used per node, to avoid competition for the single disk. Tables 4 and 5 give a comparison of the three disk-based permutation algorithms presented in Section 3.1, based on: external sorting; RAM buckets; and implicit indices. 6838 3228 * Table 4 shows the results of using 16 nodes of a cluster, with permutation sizes ranging from 800 million elements (6 GB) to 12.8 billion elements (95 GB). In general, the three algorithms scale roughly linearly with permutation size. The most notable exception is a 5-fold increase in the running times of the bucket and implicit indices algorithms when moving from 24 GB to 48 GB permutations. We believe that this is due to network traffic on an older Fast Ethernet switch. Until that point, the bottleneck was likely disk bandwidth. The sorting based algorithm does not see a similar effect because its time is dominated by the in-RAM sorting process, not inter-node communications. Table 5 shows the results of using between 1 and 16 nodes of the cluster, with permutations having 1.6 billion elements (12 GB). Again, the time for each algorithm scales roughly linearly with the number of nodes. The non-linear scaling when moving from 2 to 4 nodes is likely due to the bottleneck moving between disk and the network.
In general, the bucket algorithm takes about 1.5 to 2 times longer than the implicit indices algorithm, with the largest differences occurring with larger permutations and more parallelism. The implicit indices algorithm is more efficient because of the smaller amount of data that must be saved to disk. The sorting based algorithm takes roughly 5 to 10 times longer than the implicit indices algorithm, largely due to the time needed to sort data in RAM.
RAM
For cache/RAM, the performance of permutation multiplication, inverse and multiplication by an inverse was demonstrated on a recent 8-core commodity machine: two Quadcore Intel Xeon E5410 CPUs running at 2.33 GHz, with a total of 24 MB L2 cache -12 MB L2 cache per socket and 16 GB of RAM made up of four memory modules. Table 6 lists the system parameters measured on this system. Table 7 concerns the case of independent permutation computations running in parallel, with one computation per core. We believe that the traditional algorithm is close to saturating the bandwidth from CPU to RAM, both in the case of 8 threads and 8 processes. Table 8 provides confirming evidence of bandwidth saturation in comparing 4 threads versus 8 threads. As described in Section 6, the new algorithm is more bandwidth-efficient. We see that benefit for 8 processes but not for 8 threads. We speculate that is due to cache poisoning as the threads compete for the same cache. The new permutation multiplication algorithm is faster by about 50% than the traditional algorithm for permutations of 32 million elements or more, when using 8 threads. Our new algorithm is also faster than performing 8 multithreaded traditional permutation multiplications in a row by at least a factor of 1.6. In contrast, when using only one thread (with seven cores idle), the time represents a mixture of RAM bandwidth and CPU power. Hence, the traditional and new algorithms have similar performance.
Extrapolation on memory bandwidth results.
In the near future, commodity machines will continue to gain additional CPU cores at a rate based on Moore's Law. But the number of memory modules on the motherboard is likely to remain fixed (while the density of each memory module continues to rise). Hence the memory bandwidth is unlikely to grow significantly. Table 8 shows the times for the traditional algorithm already approaching an asymptotic value for the transition from 4 threads to 8 threads. Furthermore, the timings for 8 threads is close to the timing for the theoretically optimal case for bandwidth limited computation.
The new algorithm shows a significant improvement in time in the transition from 4 threads to 8 threads. In the case of permutation multiplication, the timing for 8 threads approaches that of the theoretically optimal memory bandwidth limited case. On the other hand, the algorithms for permutation inverse and permutation multiplication by an inverse show the potential for additional improvements in timings as more cores become available. This is seen by comparing the numbers for 8 threads and the optimal case.
CONCLUSIONS
New algorithms were presented for multiplication of large permutations for disk and flash (Section 3), for the aggregate disks of a cluster (Section 3.2) and a multi-threaded algorithm for RAM (Section 4). These algorithms make permutation multiplication a practical operation for large permutations that do not fit in RAM. Further, the multi-threaded cache/RAM implicit indices algorithm clearly outperforms the trivially-parallel traditional algorithm when using multiple threads on machines with many cores. Table 8 : Running times (seconds) of our new implicit indices permutation multiplication for cache/RAM. As explained, we need a machine with at least 8 cores working on the new algorithm in parallel for the CPU to be a less significant factor. Element size is 4 bytes. A bucket here is a cache line, the block size is variable between runs. The values in the column labeled "Optimal" are derived from the equations in Section 6 using values based on 
