# Sorting Omega Networks Simulated with P Systems: Optimal Data Layouts 

Rodica Ceterchi ${ }^{1}$, Mario J. Pérez-Jiménez ${ }^{2}$, Alexandru Ioan Tomescu ${ }^{1}$<br>${ }^{1}$ Faculty of Mathematics and Computer Science, University of Bucharest Academiei 14, RO-010014, Bucharest, Romania<br>${ }^{2}$ Research Group on Natural Computing<br>Department of Computer Science and Artificial Intelligence University of Sevilla<br>Avda. Reina Mercedes s/n, 41012 Sevilla, Spain<br>E-mails: rceterchi@gmail.com, marper@us.es, alexandru.tomescu@gmail.com

Summary. The paper introduces some sorting networks and their simulation with P systems, in which each processor/membrane can hold more than one piece of data, and perform operations on them internally. Several data layouts are discussed in this context, and an optimal one is proposed, together with its implementation as a P system with dynamic communication graphs.

## 1 Introduction

Paper [9] proposed two models to sort a sequence of $N$ numbers, based on the bitonic sorting network. The first one consisted of $N$ membranes, each storing two numbers; one number was an element of the sequence, and the other one was an auxiliary register used to route values. A number $x$ was codified as the number of appearances of a symbol $a$ in each membrane. Moreover, the membranes were arranged on a 2D-mesh, where only communication between neighboring membranes on the mesh is permitted. This model, which uses a variant of P systems called P systems with dynamic communication graphs (see [8]), closely follows the implementation of the bitonic sort on the 2D-mesh.

The second model consisted of only one membrane, where all the $N$ numbers were encoded as occurrences of $N$ different symbols. Restrictions on communication were no longer imposed, as if the underlying communication graph were the complete graph.

In this paper we introduce a model in between the two. First of all, observe that the first model has the advantage of a codifying alphabet of fixed size, while the second has the advantage of a small communication overhead. The model we put forth in this paper captures these two benefits. Each membrane holds a fixed number of values, and each of the membranes can communicate with any other.

Additionally, in order to minimize the communication between membranes, we use a periodic remap of values to membranes, according to the steps of the omega network.

The problem of mapping values to processors has been previously addressed in the context of parallel sorting algorithms. The bitonic sorting network, which can sort $N$ keys in time $O\left(\log ^{2} N\right)$, is probably one of the most well-known parallel sorting algorithms. However, modern architectures differ greatly from the theoretical models under which such good results were obtained. As coarse-grained processors can store internally more than one value, the following problem arises: how to map $N$ keys to $P$ processors $(N>P)$, such that inter-processor communication is minimized. In the bitonic sorting algorithm, and for $N \geq P^{2}$, the solution given in $[13,14]$ consisted in alternating a blocked layout with a cyclic layout, performing thus the minimal number of remaps. This paper gives an optimal mapping strategy for the bitonic sort for any $N>P$, and then applies this result to P Systems.

The paper is organized as follows. Section 2 presents preliminaries on bitonic sorting networks and defines omega networks. Section 3 approaches the problem of mapping $N$ keys among $P$ processors, each processor manipulating $n=N / P$ keys, such that overall communication is minimized. Optimal data layouts for the omega network are proposed along the lines of [20], and some essential results are proved about them. Section 4 discusses internal processing in one processor, and how we model it in our implementation with P systems. Section 5 introduces the P system which simulates the omega network with optimal data layouts, and the algorithms which generate the sequence of dynamic communication graphs of this model. Complexity issues are addressed at the end of Sections 3 and 5.

## 2 Preliminaries on Bitonic Sorting Networks and Omega Networks

A bitonic sequence is a concatenation of two monotonic sequences, one ascending, and the other one descending, or a sequence such that a cyclic shift of its elements would put them in such a form.

The key components of a bitonic network are the bitonic splitters and the bitonic mergers. The splitter of size $N$ takes as input a bitonic sequence of length $N$ and partitions it in two bitonic sequences of equal length, such that all the elements in the first sequence are smaller than (or greater than) all the elements in the second sequence. A bitonic merger of size $N$ consists of a splitter of size $N$ and of two mergers of size $N / 2$, of opposite direction. It accepts as input a bitonic sequence and sorts it in ascending or descending order (direction).

As any sequence of two numbers is bitonic, the sorting network uses bitonic mergers of increasing size and alternating direction to construct bitonic sequences of increasing length. The last such merger, of size $N$, renders the whole sequence of $N$ numbers sorted.
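The construction described above can be sketched in a few lines of Python. This is a minimal sequential sketch of the textbook recursive formulation, not the wiring of the network in the figures; the function names `bitonic_merge` and `bitonic_sort` are illustrative, and the input length is assumed to be a power of two.

```python
def bitonic_merge(seq, ascending=True):
    """Sort a bitonic sequence (length a power of two) in the given direction."""
    n = len(seq)
    if n == 1:
        return list(seq)
    half = n // 2
    lo, hi = list(seq[:half]), list(seq[half:])
    # Splitter: compare-exchange element i with element i + n/2.
    for i in range(half):
        if (lo[i] > hi[i]) == ascending:
            lo[i], hi[i] = hi[i], lo[i]
    # Both halves are bitonic again; merge them recursively.
    return bitonic_merge(lo, ascending) + bitonic_merge(hi, ascending)


def bitonic_sort(seq, ascending=True):
    """Sort a sequence whose length is a power of two."""
    n = len(seq)
    if n == 1:
        return list(seq)
    half = n // 2
    # An ascending run followed by a descending run is bitonic.
    first = bitonic_sort(seq[:half], True)
    second = bitonic_sort(seq[half:], False)
    return bitonic_merge(first + second, ascending)
```

For example, `bitonic_sort([5, 1, 7, 3, 0, 6, 2, 4])` builds two sorted runs of opposite direction and hands their concatenation to the final merger of size 8.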


Fig. 1. A bitonic sorting network of size $N=8$. The network can be partitioned in three stages, each containing bitonic mergers of size 2, 4, and 8, respectively.

Fig. 2. Network devices: (a) increasing comparator; (b) decreasing comparator.

Following [15] it is customary to represent a network as an ordered set of $N$ lines (wires) connected by a set of compare-exchange devices (comparators, for brevity). A comparator has two input terminals, $a$ and $b$, and produces two output terminals $c$ and $d$. If the comparator is increasing, Fig. 2(a), then $c=\min (a, b)$ and $d=\max (a, b)$, while if the comparator is decreasing, Fig. 2(b), $c=\max (a, b)$ and $d=\min (a, b)$. A bitonic sorting network for $N=8$ is represented in Fig. 1.

We introduce some more notations regarding the serial and parallel connections of networks $T_{1}$ and $T_{2}$, of size $N$. Their serial connection, $T_{1} T_{2}$, is a network in which the $i$-th output terminal of $T_{1}$ is connected to the $i$-th input terminal of $T_{2}$. The parallel connection, $T_{1} \circ T_{2}$, is the union of $T_{1}$ and $T_{2}$, with terminal $i$ of $T_{1}$ becoming terminal $i$ of $T_{1} \circ T_{2}$, and terminal $i$ of $T_{2}$ becoming terminal $i+N$ of $T_{1} \circ T_{2}(i=0, \ldots, N-1)$.

Definition 1 (Omega network, Fig. 3(d)). Let $D_{k}, k \geq 1$ be a one-step network of $N=2^{k}$ lines with a device between the pair of lines $(i, i+N / 2)$, for $i=0 \ldots N / 2-1$. Then the omega network $O M_{k}$ is recursively defined as $O M_{k}=D_{k}\left(O M_{k-1} \circ O M_{k-1}\right)$.
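To make the recursion of Definition 1 concrete, the following sketch lists the devices of $O M_{k}$ step by step. The function name `omega_steps` and the representation of a step as a list of line pairs are assumptions made for illustration; $O M_{0}$ is taken to be a single line with no devices.

```python
def omega_steps(k):
    """Return the steps of OM_k; each step is a list of (i, j) line pairs."""
    if k == 0:
        return []  # a single line carries no devices
    n = 1 << k
    half = n // 2
    # D_k: one device between lines i and i + N/2.
    d_k = [(i, i + half) for i in range(half)]
    sub = omega_steps(k - 1)
    # Parallel connection: the second copy of OM_{k-1} is shifted by N/2 lines.
    rest = [step + [(i + half, j + half) for (i, j) in step] for step in sub]
    # Serial connection: D_k comes first, then the two sub-networks.
    return [d_k] + rest
```

Calling `omega_steps(3)` returns three steps in which the linked lines are 4, 2 and 1 positions apart, respectively.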

In [6] the striking similarity between the bitonic merger (Fig. 3(a), 3(b)) and the balanced merger (Fig. 3(c)) is investigated. Although prior research [11] showed that there is no permutation of lines to transform the bitonic merger into a balanced merger, a framework is developed under which it is shown that the two mergers are isomorphic graphs, also isomorphic to the graph of the omega network (Fig. 3(d)).

Fig. 3. The bitonic merger, the balanced merger, and the omega network of size 8

As a serial connection of $\log N$ identical networks in the class of omega networks forms a sorting network [6], in what follows we will concentrate mainly on the omega network.

## 3 How They Communicate

A sorting network is a fine-grained theoretical model, containing exactly one input key on each wire. Additionally, comparators require communication between wires, which can sometimes be more time consuming than the comparison operation itself $[1,3,10,16]$. When redesigning parallel sorting algorithms for coarse-grained PRAM, one has to pay particular attention to both communication and computation.

Given $N$ keys and $P$ processors $(N>P)$, we have to map $n=N / P$ keys to each processor, such that overall communication is minimized. Ionescu and Schauser [13, 14] investigated this problem for the bitonic sorting algorithm. As initially suggested in [10], they proposed a smart periodical switch between a blocked layout and a cyclic layout. They observed that in each stage of the sorting algorithm, the last $\log n$ steps can be performed locally under a blocked layout, while under the cyclic layout the first $\log n$ steps are local. A necessary condition for the two layouts to span enough depth to cover an entire stage of the network is $N \geq P^{2}$. In addition, the two layouts are particular to the sorting network being implemented. We shall see, for example, that the balanced merger [11, 12], which, as the bitonic merger, belongs to the class of omega networks, also admits data layouts optimizing overall communication.

An approach from the opposite side was put forth by Lee and Batcher [17]. They used a parity strategy for a shared-memory model with $N=2 P$ to store even-parity keys in local memory, while only odd-parity keys were recirculated. This decreased by a factor of 2 the number of shared memory references.

The main contribution of this paper is a general scheme to map $N$ values to $P$ processors, for any $N>P$ and for any sorting network with the topology of the omega network. Our idea captures the essence from the alternating smart layout of [14], and makes it generally applicable, even when $N<P^{2}$. The number of data layouts is no longer two, but it depends on the granularity of the processors.

### 3.1 Optimal Data Layouts for the Omega Network

In the following, without explicitly mentioning it, we assume we have to sort $N=2^{k}$ keys using $P$ processors, $N>P$, each processor holding $n=N / P$ keys. Any number $i \in\left\{0, \ldots, 2^{k}-1\right\}$ has a bit representation $i=a_{1} a_{2} \cdots a_{k}$, $a_{1}$ being the most significant bit, and $a_{k}$ the least significant one. To simplify notation, we say that a sequence of bits $a_{j} \cdots a_{i}$, where $i, j \in\{1, \ldots, k\}$ and $j>i$, stands for the void sequence. The number of parallel steps of $O M_{k}$ is $k$, and step $t$ of the omega network $O M_{k}$ contains devices linking lines whose bit representations differ in bit $t$, with $1 \leq t \leq k$. For any $t \in\{1, \ldots, k\}$, consider the function $b c_{t}:\left\{0,1, \ldots, 2^{k}-1\right\} \longrightarrow\left\{0,1, \ldots, 2^{k}-1\right\}$, the bit complement of the $t$-th bit, defined by $b c_{t}\left(a_{1} a_{2} \cdots a_{t} \cdots a_{k}\right)=a_{1} a_{2} \cdots \bar{a}_{t} \cdots a_{k}$. The function $b c_{t}$ is bijective and is an involution, that is, $b c_{t}\left(b c_{t}(i)\right)=i$.
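Since bit 1 is the most significant bit in this notation, $b c_{t}$ amounts to an exclusive-or with $2^{k-t}$. A minimal sketch follows; the name `bc` and the explicit parameter `k` are illustrative choices.

```python
def bc(t, i, k):
    """Complement bit t of the k-bit number i (bit 1 is the most significant)."""
    if not 1 <= t <= k:
        raise ValueError("t must satisfy 1 <= t <= k")
    # Bit t, counted from the most significant end, has weight 2**(k - t).
    return i ^ (1 << (k - t))
```

With `k = 5`, `bc(1, 0, 5)` flips the most significant bit of `00000`, and applying `bc` twice with the same `t` gives back the original number.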

First, we give a formal definition of a data layout.

Definition 2 (Data layout). A data layout of $N$ values to $P$ processors is a function $\mathcal{D}:\{0, \ldots, N-1\} \rightarrow\{0, \ldots, P-1\}$.

We introduce the following data layouts, as suggested in $[10,14]$.

(a) An omega network of size 32. Lines marked with the same shape are assigned to the same processor in one data layout.

| First layout | Second layout | Third layout |
| :---: | :---: | :---: |
| $00000=0$ | $00000=0$ | $00000=0$ |
| $10000=16$ | $00100=4$ | $00010=2$ |
| $01000=8$ | $00010=2$ | $00001=1$ |
| $11000=24$ | $00110=6$ | $00011=3$ |

(b) Keys mapped to processor 0 in each of the three data layouts

Fig. 4. Three data layouts for the omega network $O M_{5}$.

Definition 3 (Blocked layout). A blocked layout for mapping $N$ keys on $P$ processors is a function $\mathcal{D}^{b}:\{0, \ldots, N-1\} \rightarrow\{0, \ldots, P-1\}$, such that $\mathcal{D}^{b}(i)=$ $\lfloor i / n\rfloor$, where $n=N / P$.

Definition 4 (Cyclic layout). A cyclic layout for mapping $N$ keys on $P$ processors is a function $\mathcal{D}^{c}:\{0, \ldots, N-1\} \rightarrow\{0, \ldots, P-1\}$, such that $\mathcal{D}^{c}(i)=i \bmod P$.
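Definitions 3 and 4 translate directly into code. In the sketch below the helper `keys_of`, which lists the keys held by one processor, is a hypothetical name introduced only for illustration, and $P$ is assumed to divide $N$.

```python
def blocked_layout(i, N, P):
    """Definition 3: key i goes to processor floor(i / n), with n = N / P."""
    n = N // P
    return i // n


def cyclic_layout(i, N, P):
    """Definition 4: key i goes to processor i mod P."""
    return i % P


def keys_of(layout, proc, N, P):
    """List the keys that a layout function assigns to processor proc."""
    return [i for i in range(N) if layout(i, N, P) == proc]
```

For $N=32$ and $P=8$, the blocked layout gives processor 0 four consecutive keys, while the cyclic layout gives it keys that are $P$ positions apart.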

We note that Definition 5 in [13], and the definition for the cyclic layout indicated in Section 2.1 of [14], are incorrect: if we map the $i$-th key to processor $i \bmod n$, where $n=N / P$, then the processor index ranges over $\{0, \ldots, n-1\}$, so we need $n \leq P$, which implies $N \leq P^{2}$, which clearly is not the case considered.

In a blocked layout, the first $\log N-\log n$ steps require remote communication, while the last $\log n$ steps are local. In a cyclic layout, the situation is reversed: the first $\log N-\log n$ steps are local, while the last $\log n$ steps are remote. The idea proposed in [14] when mapping $N \geq P^{2}$ values in the bitonic sort is to periodically switch between the two layouts, such that all steps are local. Moreover, as the stages in a bitonic sort have increasing size, the authors propose an improved "smart" remap such that a layout spans multiple stages of the algorithm, achieving a total of $\log P+1$ remaps.

Our paper better highlights the reasoning behind these remaps, in the case of the bitonic sort. Consider the omega network $O M_{k}$, and suppose we choose to map key 0 to processor 0. If each processor can hold $2^{m}$ values, which other keys are mapped to processor 0? As we can see, at step 1 we have a device linking line 0 with line $0+2^{k-1}$. At step 2 we have a device linking line 0 with line $2^{k-2}$, and a device linking line $2^{k-1}$ and line $2^{k-1}+2^{k-2}$. We also note that in step 1 lines $2^{k-2}$ and $2^{k-1}+2^{k-2}$ were also linked with a device. We continue until step $m$, where we identify $2^{m}$ lines linked by $2^{m-1}$ devices. It would be natural to map these lines to processor 0, as all comparisons at step $m$ are local. However, one more question remains: are all comparisons at steps 1 through $m-1$ also local? As we shall see, the answer is yes.

The following lemma follows directly from the definition of $O M_{k}$.

Lemma 1. At each step $1 \leq t \leq k$ of $O M_{k}$, and for any $0 \leq i<2^{k}$, line $i$ is linked by a device only with line $b c_{t}(i)$.

Lemma 2. In $O M_{k}$, for any $0 \leq i<2^{k-m}, 1 \leq m \leq k$ and $0 \leq t \leq k-m$, in steps $t+1, \ldots, t+m$ there is no device linking lines in the set $P_{i}^{t, m}=\left\{a_{1} a_{2} \cdots a_{k} \mid a_{1} \cdots a_{t} a_{t+m+1} \cdots a_{k}=i\right.$, where $a_{1} \cdots a_{k}$ is a bit representation $\}$ with lines from $\left\{0, \ldots, 2^{k}-1\right\} \backslash P_{i}^{t, m}$.

Proof. Suppose there are $1 \leq r \leq m, l \in P_{i}^{t, m}$ and $l^{\prime} \notin P_{i}^{t, m}$ such that at step $t+r$ there is a device linking $l$ and $l^{\prime}$. From Lemma 1 we have that $l^{\prime}=b c_{t+r}(l)$. Since $t+1 \leq t+r \leq t+m$, complementing bit $t+r$ leaves the bits $a_{1} \cdots a_{t} a_{t+m+1} \cdots a_{k}$ unchanged, which implies $l^{\prime} \in P_{i}^{t, m}$, a contradiction.

We can therefore derive the data layouts for the omega network. Suppose we have $N=2^{k}, n=2^{m}$, and $P=2^{k-m}$. We first assign to each processor $P_{i}$ all values in the set $P_{i}^{0, m}$, for $0 \leq i \leq P-1$. By Lemma 2 we have that the first $\log n=m$ steps are entirely local. After $m$ steps, we remap to each processor $P_{i}$ all the values in the set $P_{i}^{m, m}$, and perform the next $m$ steps locally, and so on. We can now give the definition of our proposed data layout.

Definition 5. Given $N=2^{k}$ keys and $P=2^{k-m}$ processors, which can store $n=2^{m}$ values, $m \geq 1$, the sequence of optimal data layouts consists of $\lceil\log N / \log n\rceil=\lceil k / m\rceil$ data layouts. In each data layout $\mathcal{D}_{s}, 0 \leq s \leq\lceil k / m\rceil-1$, values in the set $P_{i}^{s m, m}$ are mapped to processor $P_{i}$, for all $0 \leq i<2^{k-m}$. More formally, for any $0 \leq u<2^{k}$ such that $u \in P_{i}^{s m, m}$, we have $\mathcal{D}_{s}(u)=i$.
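A small sketch may help to see Definition 5 at work. The names `num_layouts`, `optimal_layout` and `keys_on_processor` are illustrative. One point is an assumption of this sketch rather than part of the definition: when $m$ does not divide $k$, the window of the last layout is clamped to the last $m$ bits (offset $\min(s m, k-m)$), which is the choice that reproduces the keys shown in Fig. 4(b).

```python
def num_layouts(k, m):
    """Number of data layouts, ceil(k / m)."""
    return -(-k // m)


def optimal_layout(u, s, k, m):
    """Processor index of line u in layout D_s (bit 1 is the most significant)."""
    # Assumption: clamp the window so that the last layout still fits in k bits.
    t = min(s * m, k - m)
    bits = format(u, "0{}b".format(k))
    # Bits t+1 .. t+m are free inside a processor; the others name the processor.
    proc_bits = bits[:t] + bits[t + m:]
    return int(proc_bits, 2) if proc_bits else 0


def keys_on_processor(proc, s, k, m):
    """All lines mapped to processor proc in layout D_s."""
    return [u for u in range(1 << k) if optimal_layout(u, s, k, m) == proc]
```

For $k=5$ and $m=2$, listing `keys_on_processor(0, s, 5, 2)` for $s=0,1,2$ yields the three groups of keys of Fig. 4(b), in increasing numerical order.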

The following is a consequence of Lemma 1 of [14].

Lemma 3. The maximum number of successive steps of the omega network that can be executed locally, under any data layout, is $\log n$, where $n=N / P$.


Fig. 5. The three data layouts for the omega network in Figure 4(a).

In each data layout $\mathcal{D}_{s}, 0 \leq s \leq\lceil k / m\rceil-2, \log n=m$ steps are local. For $s=\lceil k / m\rceil-1$, the remaining $k-m(\lceil k / m\rceil-1)$ steps of the network are local (that is, $k \bmod m$ steps when $m$ does not divide $k$, and $m$ steps otherwise). From Lemma 3 we have that the proposed data layouts for the omega network are optimal.

In the case $N \geq P^{2}$, we notice that $2 m \geq k$, hence two data layouts are enough to cover the whole omega network. However, they do not coincide with $\mathcal{D}^{b}$ or $\mathcal{D}^{c}$, as in the blocked layout, the last $m$ stages are local, while in the cyclic layout, the first $k-m$ stages are local.

### 3.2 Computation Complexity

In each data layout, a processor holds $n$ values and performs $\log n$ steps locally, taking time $O(n \log n)$. As we have $\lceil\log N / \log n\rceil$ data layouts, we get an overall time complexity of the omega network of $O(n \log N)$. From [6] we have that a serial connection of $\log N$ omega networks of size $N$ is enough to sort a sequence of $N$ numbers. Hence, the complexity to sort $N$ numbers using $P$ processors, each holding $n=N / P$ values, using our proposed data layouts, is $O\left(n \log ^{2} N\right)$.
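The two bounds can be checked term by term. The short derivation below assumes $n \geq 2$, so that $\log n \geq 1$ and the ceiling at most doubles the ratio $\log N / \log n \geq 1$:

$$
\left\lceil\frac{\log N}{\log n}\right\rceil \cdot O(n \log n) \leq 2 \frac{\log N}{\log n} \cdot O(n \log n)=O(n \log N), \qquad \log N \cdot O(n \log N)=O\left(n \log ^{2} N\right) .
$$

For instance, with $N=2^{20}$ and $n=2^{10}$ there are two data layouts, each costing $O\left(2^{10} \cdot 10\right)$ local operations per omega network.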

This remark has a quite profound significance. In the fine-grained theoretical model we have $n=1$, and its complexity is $O\left(\log ^{2} N\right)$. The complexity of the network using a more coarse-grained model depends linearly on the degree of parallelism of the model. At the opposite end, when $n=N$ and the entire sorting network is simulated locally, we have a complexity of $O\left(N \log ^{2} N\right)$, which is worse than $O(N \log N)$, the complexity of most sequential sorting algorithms. It would be desirable to choose $n$ such that this bound is not surpassed in the parallel model. We impose $n \log ^{2} N \leq N \log N$, which implies $n \leq N / \log N$.

An algorithm to find the minimum of a bitonic sequence of size $n$ in time $O(\log n)$ was introduced in [14]. This gives a time complexity of each data layout of $O(n)$. In the case of a network obtained from a serial connection of bitonic mergers, this observation gives an overall time complexity of $O\left(\frac{n}{\log n} \log ^{2} N\right)$.

## 4 What Happens Inside One Processor/Membrane

One processor (and the membrane which simulates it) will be capable of holding $n=N / P=2^{m}$ pieces of data. We label the data with indices in the set $\{0,1, \cdots, n-1\}$. For any such index we consider its writing as a binary string of length $m$, for instance $i=x_{1} x_{2} \cdots x_{t} \cdots x_{m}$.

Inside one processor, several comparisons are performed, in parallel, between the $n$ pieces of data, in the following manner: for every bit $t$ (starting with 1, the most significant bit, and ending with $m$) we compare and exchange if necessary (to obtain an increasing order) all pairs of values codified with $a_{i}$ and $a_{b c_{t}(i)}$. More precisely, we have the following algorithm to be performed inside each processor/membrane:

$$
\begin{array}{l}
\textbf{for } t \leftarrow 1 \textbf{ to } m \textbf{ do} \\
\quad \textbf{forall } i<b c_{t}(i) \textbf{ in parallel do} \\
\qquad \operatorname{compare}\left(a_{i}, a_{b c_{t}(i)}\right) ;
\end{array}
$$

Algorithm 1: A parallel algorithm for the bitonic merger

where by compare $\left(a_{i}, a_{j}\right)$ we denote sorting in ascending order the values codified by $a_{i}$ and $a_{j}$, i.e. we end by having the minimum of the two values codified by $a_{i}$ and the maximum by $a_{j}$.

The procedure compare $\left(a_{i}, a_{j}\right)$ works in a membrane in the following manner: let $s_{i}, s_{j}$ and $t_{i}, t_{j}$ be four auxiliary symbols, for the sources and the targets of a comparator. The set of rules

$$
\left\{a_{k} \rightarrow s_{k} \mid k=i, j\right\} \cup\left\{s_{i} s_{j} \rightarrow t_{i} t_{j}, s_{i} \rightarrow t_{j}, s_{j} \rightarrow t_{j}\right\} \cup\left\{t_{k} \rightarrow a_{k} \mid k=i, j\right\}
$$

implement an increasing comparator between values codified by $a_{i}$ and $a_{j}$. We first rewrite the $a$'s to $s$'s, next we have the comparator which writes the minimum to $t_{i}$ and the maximum to $t_{j}$, and then we rewrite these back to $a_{i}$ and $a_{j}$ respectively.
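The effect of these rules on multiplicities can be traced with a small sketch. It represents the membrane contents as a dictionary from an index $k$ to the number of copies of $a_{k}$, and it assumes that the pairing rule $s_{i} s_{j} \rightarrow t_{i} t_{j}$ is applied as many times as possible before the single-symbol rules consume what is left; this priority is an assumption of the sketch, and the function name `compare` simply mirrors the procedure in the text.

```python
def compare(counts, i, j):
    """Increasing comparator on the multiplicities of a_i and a_j."""
    out = dict(counts)
    # Phase 1: a_k -> s_k for k in {i, j}.
    s_i, s_j = out.get(i, 0), out.get(j, 0)
    # Phase 2: s_i s_j -> t_i t_j consumes min(s_i, s_j) pairs (assumed priority),
    # then the leftover copies of s_i or s_j are all rewritten to t_j.
    pairs = min(s_i, s_j)
    t_i = pairs
    t_j = pairs + (s_i - pairs) + (s_j - pairs)
    # Phase 3: t_k -> a_k for k in {i, j}.
    out[i], out[j] = t_i, t_j
    return out
```

Starting from five copies of $a_{0}$ and two copies of $a_{1}$, `compare({0: 5, 1: 2}, 0, 1)` leaves two copies of $a_{0}$ and five copies of $a_{1}$: two pairs are rewritten together, and the three unmatched copies all end up in $t_{1}$.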

For all the comparisons which are to be done in parallel, take auxiliary alphabets $S=\left\{s_{0}, \cdots, s_{n-1}\right\}$ and $T=\left\{t_{0}, \cdots, t_{n-1}\right\}$. We rewrite all initial symbols to symbols in $S$ :

$$
\left\{a_{i} \rightarrow s_{i} \mid i=0,1, \cdots, n-1\right\}
$$

Next we put the comparators between appropriate pairs:

$$
\left\{s_{i} s_{j} \rightarrow t_{i} t_{j}, s_{i} \rightarrow t_{j}, s_{j} \rightarrow t_{j} \mid i=0,1, \cdots, n-1, i<j=b c_{t}(i)\right\}
$$

Then we rewrite back to the original alphabet:

$$
\left\{t_{i} \rightarrow a_{i} \mid i=0,1, \cdots, n-1\right\}
$$

The parallel comparisons at each step $t$

```
forall \(i<b c_{t}(i)\) in parallel do
    compare\(\left(a_{i}, a_{b c_{t}(i)}\right)\);
```

will thus be simulated in a membrane $P$ by the rules

$$
\begin{aligned}
&\left\{a_{i} \rightarrow s_{i} \mid i=0,1, \cdots, n-1\right\} \cup \\
& \cup\left\{s_{i} s_{j} \rightarrow t_{i} t_{j}, s_{i} \rightarrow\right.\left.t_{j}, s_{j} \rightarrow t_{j} \mid i=0,1, \cdots, n-1, i<j=b c_{t}(i)\right\} \cup \\
& \cup\left\{t_{i} \rightarrow a_{i} \mid i=0,1, \cdots, n-1\right\}
\end{aligned}
$$
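The net effect of running these rule sets for $t=1, \ldots, m$ inside one membrane can be sketched on plain integers. The sketch below abstracts away the symbols and only tracks the values; the name `local_omega` is illustrative, and the sequential inner loop stands in for the parallel application of rules, which is safe because the pairs compared in one step are disjoint.

```python
def local_omega(values):
    """Run the m steps of Algorithm 1 on the n = 2**m values of one membrane."""
    a = list(values)
    n = len(a)
    m = n.bit_length() - 1
    if n < 1 or n != 1 << m:
        raise ValueError("the number of values must be a power of two")
    for t in range(1, m + 1):
        mask = 1 << (m - t)  # weight of bit t, bit 1 being the most significant
        for i in range(n):
            j = i ^ mask  # j = bc_t(i)
            # Increasing comparator on the pair (i, bc_t(i)) with i < bc_t(i).
            if i < j and a[i] > a[j]:
                a[i], a[j] = a[j], a[i]
    return a
```

On a bitonic input such as `[1, 3, 5, 7, 6, 4, 2, 0]` the three steps behave like a bitonic merger of size 8 and return the values in increasing order.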

## 5 A P System which Simulates the Omega Network

In this section we introduce a P system with dynamic communication [7], along the same general lines as the model proposed in $[8,9]$. For each of the processors $\mathcal{P}_{i}, i \in\{0,1, \ldots, P-1\}$, we have an associated membrane, which we label $i$. The graphs we consider are sub-graphs of the complete graph, $K_{P}$, or of the identity graph.

Note that at a certain step of the sorting algorithm not all edges are involved in communication. Therefore we call active sub-graphs of $K_{P}$ those graphs containing only such edges. We also introduce the identity graph, with

$$
\begin{gathered}
V(I d)=\{0,1, \ldots, P-1\} \\
E(I d)=\{(i, i) \mid 0 \leq i \leq P-1\}
\end{gathered}
$$

for modeling internal processing steps.

In order to describe the evolution of such a P system, we use pairs of the type [graph, rules]. Here graph is a sub-graph of $K_{P}$ or of $I d$, and rules is a mapping from the set of all edges of graph, $E$(graph), to the set of all symbol/object rewriting rules for routing or comparison operations.

The formal definition of the P system is

$$
\begin{aligned}
\Pi= & \left(V=\left\{a_{0}, \ldots, a_{n-1}\right\} \cup \mathcal{A},\left\langle\left[a_{0}^{x_{0}^{0}}, a_{1}^{x_{1}^{0}}, \ldots, a_{n-1}^{x_{n-1}^{0}}\right]_{0}, \ldots,\right.\right. \\
& {\left.\left.\left[a_{0}^{x_{0}^{P-1}}, a_{1}^{x_{1}^{P-1}}, \ldots, a_{n-1}^{x_{n-1}^{P-1}}\right]_{P-1}\right\rangle, R_{\mu}\right), }
\end{aligned}
$$

where the membrane indices are $\{0,1, \ldots, P-1\}$. The alphabet $\left\{a_{0}, \ldots, a_{n-1}\right\}$ is of fixed size, and the set $\mathcal{A}$ contains the auxiliary symbols necessary to simulate the omega network, as indicated in Section 4. Numbers $x_{i}^{j}$ with $0 \leq i \leq n-1$ are the values stored on the wires mapped to processor $j, 0 \leq j \leq P-1$ in the first data layout. Each of them is codified as the number of occurrences of a symbol $a_{i}$ inside membrane $j$. Finally, $R_{\mu}$ is the finite sequence of pairs [graph,rules] which guides the computation.

We will see in the sequel that $R_{\mu}$ is generated algorithmically, by concatenating sequences of pairs [graph, rules $]^{3}$.

Lemma 4. Given $N=2^{k}$ keys and $P=2^{k-m}$ membranes, which can store $n=$ $2^{m}$ values, $m \geq 1$, after the computation for the data layout $\mathcal{D}_{s}$ is finished, symbol $a_{i}$ of membrane $j$ codifies the value corresponding to wire $u \in\{0, \ldots, N-1\}$, where the bit representation of $u$ is $u=j_{1} \ldots j_{s m} i_{1} \ldots i_{m} j_{s m+1} \ldots j_{k-m}$. By $j_{1} \ldots j_{k-m}$ and by $i_{1} \ldots i_{m}$ we denoted the bit representations of $j$, and $i$, respectively.

Proof. The proof is immediate by Definitions 1, 5 and Lemma 2.

We observe that the remap of values from one data layout to the next can be done in $P+1$ steps. When passing from data layout $\mathcal{D}_{s-1}$ to $\mathcal{D}_{s}$, with $0<s \leq\lceil k / m\rceil-1$, in each step $j, 0 \leq j \leq P-1$, membrane $j$ sends its contents along the edges of the communication graph $C_{s}^{j}$. To avoid collisions in the destination membranes, it also performs a rewriting of symbols from $a_{t}$ to $a_{t}^{\prime}$, for all $t \in\{0, \ldots, n-1\}$. In the last step $P+1$, all auxiliary symbols $a_{t}^{\prime}$ will be rewritten back to $a_{t}$ in all membranes, and the local computation can begin in each membrane.

We give below two algorithms generating the communication graphs $C_{s}^{j}$, and the rules associated to each edge.

```
for \(j \leftarrow 0\) to \(P-1\) do
    \(E\left(C_{s}^{j}\right) \leftarrow \emptyset ;\)
    let \(j\) have bit representation \(j_{1} \cdots j_{k-m}\);
    for \(i \leftarrow 0\) to \(n-1\) do
        let \(i\) have bit representation \(i_{1} \cdots i_{m}\);
        // the destination membrane of the value encoded by \(a_{i}\) in membrane \(j\)
        \(z \leftarrow j_{1} \cdots j_{(s-1) m} i_{1} \cdots i_{m} j_{s m+1} \cdots j_{k-m}\);
        // the destination symbol of the value encoded by \(a_{i}\) in membrane \(j\)
        \(t \leftarrow j_{(s-1) m+1} \cdots j_{s m}\);
        \(E\left(C_{s}^{j}\right) \leftarrow E\left(C_{s}^{j}\right) \cup\{(j, z)\} ;\)
        \(\operatorname{rules}_{C_{s}^{j}}((j, z)) \leftarrow a_{i} \rightarrow a_{t}^{\prime} ;\)
```

Algorithm 2: Generation of the sequence of $P$ communication graphs when passing from data layout $\mathcal{D}_{s-1}$ to $\mathcal{D}_{s}$, with $0<s \leq\lceil k / m\rceil-1$.

for $j \leftarrow 0$ to $P-1$ do rules-endcomm $((j, j)):=\left\{a_{i}^{\prime} \rightarrow a_{i} \mid 0 \leq i \leq n-1\right\} ;$

Algorithm 3: Generation of the rules associated to the identity graph which rewrite back the auxiliary symbols $a_{t}^{\prime}$ when passing from any data layout $\mathcal{D}_{s-1}$ to $\mathcal{D}_{s}$, with $0<s \leq\lceil k / m\rceil-1$.
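The destinations used during a remap can also be derived directly from Lemma 4: read off the wire held by symbol $a_{i}$ of membrane $j$ under $\mathcal{D}_{s-1}$, then split the bits of that wire according to $\mathcal{D}_{s}$. The sketch below does exactly that; the names `wire_of` and `remap_destination` are illustrative, and the sketch assumes $(s+1) m \leq k$, so it does not cover a shorter final layout.

```python
def wire_of(j, i, s, k, m):
    """Wire held by symbol a_i of membrane j under layout D_s (Lemma 4)."""
    jb = format(j, "0{}b".format(k - m))
    ib = format(i, "0{}b".format(m))
    # u = j_1 .. j_{sm}  i_1 .. i_m  j_{sm+1} .. j_{k-m}
    return int(jb[:s * m] + ib + jb[s * m:], 2)


def remap_destination(j, i, s, k, m):
    """Membrane z and symbol index t receiving a_i of membrane j for D_{s-1} -> D_s."""
    if s < 1 or (s + 1) * m > k:
        raise ValueError("sketch assumes 1 <= s and (s + 1) * m <= k")
    u = format(wire_of(j, i, s - 1, k, m), "0{}b".format(k))
    # Under D_s, bits sm+1 .. sm+m select the symbol; the other bits select the membrane.
    z = u[:s * m] + u[s * m + m:]
    t = u[s * m:s * m + m]
    return int(z, 2), int(t, 2)
```

With $k=4$, $m=2$ and $s=1$ the remap is a transposition: the value in symbol $a_{i}$ of membrane $j$ moves to symbol $a_{j}$ of membrane $i$.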

We assume that the sequence denoted by $\operatorname{SimOM}$ is the sequence of pairs [graph, rules] which simulates the omega network of size $n, O M_{m}\left(n=2^{m}\right)$. Its construction was indicated in Section 4 and is expressed algorithmically below.

```
\(\operatorname{SimOM} \leftarrow \lambda ;\)
for \(t \leftarrow 1\) to \(m=\log n\) do
    forall \(p \leftarrow 0\) to \(P-1\) in parallel do
        \(\operatorname{rules}_{t, 1}((p, p)) \leftarrow\left\{a_{i} \rightarrow s_{i} \mid i=0,1, \ldots n-1\right\} ;\)
        \(\operatorname{rules}_{t, 2}((p, p)) \leftarrow\left\{s_{i} s_{j} \rightarrow t_{i} t_{j}, s_{i} \rightarrow t_{j}, s_{j} \rightarrow t_{j} \mid i=\right.\)
        \(\left.0,1, \ldots, n-1, i<j=b c_{t}(i)\right\}\);
        \(\operatorname{rules}_{t, 3}((p, p)) \leftarrow\left\{t_{i} \rightarrow a_{i} \mid i=0,1, \cdots n-1\right\} ;\)
    SimOM \(\leftarrow \operatorname{SimOM} \cdot\left[I d\right.\), rules \(\left._{t, 1}\right] \cdot\left[I d\right.\), rules \(\left._{t, 2}\right] \cdot\left[I d\right.\), rules \(\left._{t, 3}\right] ;\)
```

Algorithm 4: Generation of the sequence $\operatorname{SimOM}$ which simulates the omega network of size $n$.

We can now give the algorithm which generates the whole sequence $R_{\mu}$ guiding the computation.

```
\(R_{\mu} \leftarrow \lambda ;\)
for \(s \leftarrow 1\) to \(\lceil k / m\rceil-1\) do
    \(R_{\mu} \leftarrow R_{\mu} \cdot \operatorname{SimOM} ;\)
    for \(j \leftarrow 0\) to \(P-1\) do
        \(R_{\mu} \leftarrow R_{\mu} \cdot\left[C_{s}^{j}\right.\), rules \(\left._{C_{s}^{j}}\right] ;\)
    \(R_{\mu} \leftarrow R_{\mu} \cdot[I d\), rules-endcomm \(] ;\)
\(R_{\mu} \leftarrow R_{\mu} \cdot \operatorname{SimOM} ;\)
```

Algorithm 5: Generation of the sequence $R_{\mu}$ which guides the computation.
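The shape of the resulting sequence is easy to inspect with a sketch that keeps only labels for the pairs [graph, rules]. The function name `build_schedule` and the string labels are hypothetical placeholders for the actual graphs and rule sets; only the order and the number of pairs follow Algorithm 5.

```python
def build_schedule(k, m):
    """Return R_mu as a list of (graph label, rules label) pairs."""
    P = 1 << (k - m)
    layouts = -(-k // m)  # ceil(k / m)
    # SimOM: three rewriting phases for each of the m local steps.
    sim_om = [("Id", "rules_{},{}".format(t, phase))
              for t in range(1, m + 1) for phase in (1, 2, 3)]
    schedule = []
    for s in range(1, layouts):
        schedule += sim_om
        # One communication graph per sending membrane j.
        schedule += [("C_{}^{}".format(s, j), "send") for j in range(P)]
        schedule.append(("Id", "rules-endcomm"))
    schedule += sim_om
    return schedule
```

For $k=4$ and $m=2$ the schedule has two blocks of six local pairs separated by four communication pairs and one rewriting pair.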

### 5.1 Computation complexity

Observe that the length of the sequence $\operatorname{SimOM}$ is $3 \log n$. As we have $\frac{\log N}{\log n}$ data layouts, and in each data layout $3 \log n$ steps are needed for $\operatorname{SimOM}$ and another $P+1$ steps are needed for communication, the length of $R_{\mu}$ is $3 \log N+\frac{(P+1) \log N}{\log n}=O\left(\log N+\frac{N \log N}{n \log n}\right)$, since $P=N / n$. A sorting network can be obtained by a serial connection of $\log N$ omega networks, hence our model can sort in time $O\left(\log ^{2} N+\frac{N \log ^{2} N}{n \log n}\right)$. Note that when $n=N$ all computation is local, and the complexity is the best possible, $O\left(\log ^{2} N\right)$. When $n=2$ the complexity increases to $O\left(N \log ^{2} N\right)$.

## References

1. A. Aggarwal, A.K. Chandra, M. Snir, "Communication Complexity of PRAMs", Theoretical Computer Science, vol. 71, no.1, pp. 3-28, Mar. 1990.
2. M. Ajtai, J. Komlos, and E. Szemeredi, "An $O(N \log N)$ Sorting Network", Proc. 15th Ann. ACM Symp. Theory of Computing, pp. 1-9, May 1983.
3. A. Alexandrov, M. Ionescu, K.E. Schauser, C. Scheiman, "LogGP: Incorporating Long Messages into the LogP model", Journal of parallel and distributed computing, vol. 44, no. 1, pp. 71-79, 1997.
4. A. Alhazov, D. Sburlan, "Static Sorting P Systems", Chapter 8 in Applications of Membrane Computing, (G. Ciobanu, Gh. Păun, M.J. Pérez Jiménez Eds.), Springer, 2005.
5. K.E. Batcher, "Sorting networks and their applications", Proc. AFIPS Spring Joint Comput. Conf., vol. 32, pp. 307-314, Apr. 1968.
6. G. Bilardi, "Merging and Sorting Networks with the Topology of the Omega Network", IEEE Transactions on Computers, vol. 38, no. 10, pp. 1396-1403, Oct. 1989.
7. R. Ceterchi, C. Martín-Vide, "Dynamic P Systems", LNCS, vol. 2597, pp. 146-186, 2003.
8. R. Ceterchi, M.J. Pérez Jiménez, "On two-dimensional mesh networks and their simulation with P systems", LNCS, vol. 3365, pp. 259-277, 2005.
9. R. Ceterchi, M.J. Pérez Jiménez, A.I. Tomescu, "Simulating the Bitonic Sort Using P Systems", G. Eleftherakis et al. (Eds.): WMC8 2007, LNCS, vol. 4860, pp. 172-192, 2007.
10. D.E. Culler, R.M. Karp, D.A. Patterson, A. Sahay, K.E. Schauser, E. Santos, R. Subramonian, and T. von Eicken, "LogP: Towards a Realistic Model of Parallel Computation", Proc. Fourth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pp.1-12, May 1993.
11. M. Dowd, Y. Perl, M. Saks, L. Rudolph, "The balanced sorting network", Proc. Second annual ACM symp. on Principles of distributed computing, pp. 161-172, 1983.
12. M. Dowd, Y. Perl, M. Saks, L. Rudolph, "The periodic balanced sorting network", JACM, vol. 36, no. 4, pp. 738-757, 1989.
13. M.F. Ionescu, "Optimizing Parallel Bitonic Sort", Tech. Report TRCS96-14, Dept. of Comp. Sci., Univ. of California, Santa Barbara, July 1996.
14. M.F. Ionescu, K.E. Schauser, "Optimizing parallel bitonic sort", Proc. 11th Int'l Parallel Processing Symp., pp. 303-309, 1997.
15. D.E. Knuth, The art of computer programming, volume 3: sorting and searching, second ed. Redwood City, CA: Addison Wesley Longman, 1998.
16. C. Kruskal, L. Rudolph, M. Snir. "A complexity theory of efficient parallel algorithms", Theoretical Computer Science, vol.71, no.1, pp. 95-132, Mar. 1990.
17. J.D. Lee, K.E. Batcher, "Minimizing Communication in the Bitonic Sort", IEEE Trans. on Parallel and Distributed Systems, vol. 11, no. 5, pp. 459-474, May 2000.
18. F. Leighton, "Tight Bounds on the Complexity of Parallel Sorting," IEEE Trans. Computers, vol. 34, no. 4, pp. 344-354, Apr. 1985.
19. M.S. Paterson, "Improved Sorting Networks with $O(\log N)$ Depth," Algorithmica, vol. 5, pp. 75-92, 1990.
20. A.I. Tomescu, "Optimal Data Layouts for Omega Networks", manuscript.

[^0]:    ${ }^{3}$ We denote the empty sequence by $\lambda$, and the concatenation of two sequences by ".".

