Abstract
Introduction
Architects of multiprocessor systems who wish to evaluate their novel ideas face a contradictory requirement: the larger the systems to be designed are, the faster their simulators must run. They would be satisfied if their architectural (or instruction-level) simulators ran as fast as those for uniprocessors, which achieve a small slowdown of, for example, 50. In reality, however, the simulation of a 16-way SMP machine runs 16 (or more) times as slowly as the simulation of its processor element. Thus if they try to simulate the SMP with a program that would take one minute on the real SMP, they have to wait 800 = 50 × 16 minutes, or more than half a day, until the simulator completes the work.
A natural approach to this problem is to simulate the SMP in parallel, exploiting its inherent parallelism and mapping each processor-cache pair (or group of them) onto a simulator node. Since only the references missing a coherent cache are transferred to other caches and the memory, the frequency of events transferred among simulator nodes is so low that this may appear to be an excellent solution. However, when we try to make the events properly ordered and timed, we find the job is hard because the simulated processes (processor-cache pairs) are too tightly coupled to apply PDES techniques that rely on long latency between processes. We also find that optimistic methods do not help because the process state is too large and complicated to roll back on an out-of-order event arrival.
Thus, in designing our distributed simulator Shaman, we chose a path different from the one that PDES techniques suggest. The simulator runs on a PC cluster, but the event ordering and timing are processed by a single back-end node that simulates the physical behavior of the target memory system including coherent caches. The other nodes, the front-end nodes, work in parallel to simulate the instructions executed by the target multiprocessor and to generate the shared memory references given to the back-end. The execution of the instructions needs the shared memory, of course, but what is needed is only its logical behavior, such as the results of load and store operations. Thus we adopted a software distributed shared memory (DSM) technique, which has been shown to work efficiently in a distributed memory environment such as a PC cluster.
This approach would fail if the front-ends had to give all the memory references to the back-end for the memory system simulation. However, the references necessary for the simulation are only those that miss the coherent caches. Thus, in our previous report [3], we proposed a reference filtering method that eliminates definitely cache-hitting references by partially simulating the caches in the front-end. In that report, we proved the correctness of the filtering and showed that it passes 1.4 % or less of the total references. We therefore proceeded with the implementation of the Shaman simulator, which is the theme of this paper.
The rest of the paper is organized as follows. Section 2 summarizes related work to show the background of our research. Section 3 gives an overview of the Shaman simulator and the reference filtering. Section 4 discusses the implementation issues in detail. The performance evaluation results are shown in Section 5. Finally, Section 6 concludes the paper with a brief discussion of our future direction.
Related Works
A large number of execution-driven simulators for shared memory multiprocessors have been proposed and developed, as shown in the extensive survey by Uhlig and Mudge [14]. Most of them run on uniprocessors and typically have two components: a front-end that executes a workload program and generates memory references, and a back-end that simulates the behavior of a target memory system by consuming the references generated by the front-end. The front-end either emulates instructions of the target machine [7, 15] or directly executes instrumented code [12, 11]. In both cases, state-of-the-art simulators achieve a small slowdown of 10 to 100 [14].
Although multiprocessor simulation has inherent parallelism, only a few parallel or distributed simulators have been proposed. This is mainly because the processors to be simulated are too complicated to apply distributed asynchronous simulation techniques such as that proposed by Chandy and Misra [2] and TimeWarp [4]. For example, if we implement a uniprocessor simulator of slowdown 100 on each 1 GHz PC node of a cluster and connect the nodes by a LAN of 100 µs broadcast latency including software overhead, we have to tolerate 1,000 simulated clocks from the occurrence of an inter-node event to its reception, even if the local clocks of all the nodes are synchronized. This latency and the amount of state to be saved for rollback are obviously too large for an efficient TimeWarp implementation.
The other problem is that the processor-cache pairs of a shared memory multiprocessor communicate frequently even though the coherent caches achieve a low miss rate. For example, if a workload executes a load/store every eight cycles and its cache miss ratio including coherence misses is 2 %, each processor misses once every 400 cycles and a 16-way SMP has a bus transaction every 25 cycles. This means that each simulator node of the PC cluster described above must receive a broadcast message every 2.5 µs, which is obviously too frequent for any existing high-performance cluster.
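The two estimates above can be reproduced with a few lines of arithmetic. The following sketch uses only the figures quoted in the text; the variable names are ours and purely illustrative.

```python
# Figures quoted in the text (variable names are illustrative).
host_clock_hz = 1e9        # 1 GHz PC node
slowdown = 100             # host cycles spent per simulated cycle
lan_latency_s = 100e-6     # 100 us broadcast latency

# Simulated clocks that elapse on a node during one LAN broadcast.
simulated_hz = host_clock_hz / slowdown
clocks_per_broadcast = lan_latency_s * simulated_hz

cycles_per_mem_ref = 8     # one load/store every eight cycles
miss_ratio = 0.02          # including coherence misses
processors = 16

# Simulated cycles between bus transactions of the whole SMP.
cycles_between_misses_per_cpu = cycles_per_mem_ref / miss_ratio
cycles_between_bus_transactions = cycles_between_misses_per_cpu / processors

# Host time between broadcast messages that each simulator node must absorb.
host_interval_us = cycles_between_bus_transactions / simulated_hz * 1e6

print(clocks_per_broadcast, cycles_between_bus_transactions, host_interval_us)
```

The printed values correspond, up to floating-point rounding, to the 1,000 simulated clocks, 25 cycles and 2.5 µs discussed above.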
Thus parallelized simulators such as MPTrace [6] and SimOS [11] rely on the shared memory mechanism of their hosts to simulate the behavior of the target shared memory. This approach works well for the execution of the workload, but not for the accuracy of simulated timing. For example, MPTrace requires complete information on all the references to be analyzed by its detached back-end; the trace is still large after compaction using a "road map", and the compaction makes it hard to analyze. The parallel binary translation mode of SimOS can simulate coherent cache behavior on the fly, but the simulated timing is inaccurate because it assumes a fixed cache miss penalty, ignoring the effect of bus/memory access conflicts. It also ignores the nondeterminism of cache state transitions caused by accesses to a falsely shared cache block.
One of the few successful parallel simulators that achieve both high speed and timing accuracy is the Wisconsin Wind Tunnel (WWT) [10]. WWT adopts a kind of conservative PDES method in which all the simulator nodes synchronize every small time quantum Q, which is the minimum delay between two target caches. Thus the success of WWT depends on a relatively large Q (100 cycles) and the high-speed barrier mechanism of its host CM-5, which is hardly available on PC clusters. Its successor WWT-II [8], implemented on a NOW system, also showed good performance using a similar method, but its Q is also large and the configuration of the NOW host, 67 MHz SuperSPARCs connected by 1 Gb/sec Myrinet, hardly maps to current technology trends such as 2 GHz Pentium IV processors and Gigabit Ethernet.
Overview of the Shaman Simulator

Target System and Workload Program
As shown in Figure 1, a target system of Shaman is either a centralized or distributed shared memory multiprocessor having coherent caches (C) attached to processors (P) and an interconnection network of any configuration. So far Shaman emulates a simple single-pipeline processor of the SPARC instruction set architecture, but other ISAs and/or execution mechanisms can be simulated by replacing its emulation module.
The cache is, preferably, of a write-back and write-invalidate type that we refer to as an MSI-extension cache [3], which has at least three cache block states M, S and I (Modified, Shared and Invalid) [13] and may have other states such as E (Exclusive). Write-through and/or write-update caches can be simulated, but high performance cannot be expected because the reference filtering is ineffective for them. Other configuration parameters, such as capacity, associativity, block (or line) size, unified or separate I/D caches, and single- or multiple-level hierarchy, may vary with the target system.
A workload program is a multi-threaded program using the Solaris or POSIX thread library. The input to Shaman is a Solaris binary executable with symbolic debugging information. Since Shaman simulates shared memory accesses by the lazy release consistency (LRC) mechanism [5] as described in Section 3.3, the workload program must have the data-race-free property defined by Adve and Hill [1]. This means, in practice, that both acquire ($S_A$) and release ($S_R$) operations (e.g. those of a lock) have to be visible to Shaman as (a part of) synchronization primitives. Note that these requirements on the workload do not constrain the memory model of the target multiprocessor. The target memory model may be stronger than release consistency; thus, for example, weak and sequential consistency models are acceptable.
Additionally, the program has to be deterministic in the following sense in order to make the simulation meaningful.
Definition 1
With a given number of processors, a program is deterministic if and only if every memory operation performed outside of synchronization primitives gives the same result in all its executions.
Note that non-determinism is allowed inside of synchronization primitives such as spin locks.
It is widely known that most practical parallel programs are data-race-free. As for determinism, many of them, including most SPMD programs, fall into this category.
System Configuration
The target system is mapped onto Shaman as shown in Figure 2. Shaman has a number of front-end nodes that execute a workload program in parallel, generating memory references. Each front-end node, for example a PC in a cluster, is responsible for one or more target processors. Hereafter we simply refer to a target processor, or its image mapped onto a front-end node, as a processor.
Shared memory access is simulated using LRC software distributed shared memory (DSM) which is briefly discussed in Section 3.3. Thus a program executed in parallel on the target system is also executed in parallel by front-end nodes for efficient simulation. Moreover, a front-end node switches its context (processor) only when it encounters a synchronization primitive to minimize scheduling overhead.
Memory references generated by front-end nodes are sent to the single back-end node. If we sent all the references, however, the network link to the back-end node and the reference processing on the node would become a severe performance bottleneck. Therefore, each front-end node has a filtering cache ($C_f$) for each processor to reduce the references to be sent to the back-end by eliminating those that hit the cache. Synchronization primitives having nondeterminism, such as spin locks, are also sent to the back-end so that it can simulate their real behavior, such as the number of spins. Each reference and synchronization primitive sent to the back-end has a local time-stamp.
The back-end has target caches ($C_t$) of the same configuration as the target system caches. Each reference is examined with the target cache to determine whether it really misses the cache or modifies a shared block. Since all the references that potentially cause cache misses, including shared block modifications, are reported to the back-end, as discussed in Section 3.4, the back-end obtains all the events triggering memory transactions. Then the back-end reschedules each event to convert its local time to global time, simulating the network and memory behavior.
Software DSM with LRC
Here we briefly explain the LRC mechanism by which the software DSM of Shaman is managed. LRC was proposed by Keleher, and its details can be found in his dissertation [5].
The memory space is subdivided into pages, which are the unit of coherence management. When a page is modified by a reference $m$ in a processor, the modification is notified to any processor that performs a reference $m'$ such that $m \xrightarrow{hb1} m'$, where $\xrightarrow{hb1}$ is the happens-before-1 relation defined as the irreflexive transitive closure of the program order and paired $S_R$/$S_A$ operations such as a lock transfer [1].
In order to notify a processor of the modification by the reference $m$ properly, the LRC mechanism works as follows. Each processor maintains its local intervals, a new one being created each time $S_A$ or $S_R$ is executed. To the $i$-th interval of a processor $p$, denoted $\sigma_p^i$, write-notices are attached to indicate the pages modified by $p$ in the time segment corresponding to the interval. A write-notice in $\sigma_p^i$ for a page $\pi$ [...] and invalidates pages corresponding to the write-notices attached to the intervals in $S_p$ because they are modified by happens-before-1 predecessors of $\sigma_a^i$. After that, when the processor $a$ refers to an invalidated page, it looks up the write-notices of the page in $\Sigma_a^i$ to find the processors that most recently modified the page (footnote 3). Then diffs for the page are retrieved from those processors and applied to the page to make it valid.
Footnote 1: The diff $\Delta_p^i(\pi)$ of the original LRC is the difference between the contents of $\pi$ at the beginning and the end of $\sigma_p^i$, but we made the definition somewhat stricter, as discussed in Section 4.1.
Footnote 2: The original LRC assures only [...], but it is easy to assure they are logically equivalent.
Footnote 3: If the page is falsely shared, there may be multiple most recent modifiers.
Reference Filtering
As stated in Section 3.2, Shaman is capable of simulating a target cache $C_t$ of write-back and write-invalidate type with the MSI protocol or its (reasonable) extension. For a target cache with block size $l$, $w$-way set associativity and capacity $w\gamma$, the front-end has a direct-mapped filtering cache $C_f$ with block size $l$ and capacity $\gamma$ using the MSI protocol. As proved by Puzak [9], a compulsory, capacity or conflict miss of $C_t$ always causes a miss of $C_f$. Thus if we assure that a coherence miss of $C_t$ also causes a miss of $C_f$, we may conclude that the filtering by $C_f$ eliminates only references hitting $C_t$.
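The inclusion property attributed to Puzak can be checked empirically on a single processor, ignoring coherence. The sketch below is only an illustration under our own assumptions: LRU replacement in $C_t$, hypothetical parameter values, and function names (`simulate`, `filter_never_hides_a_miss`) that do not come from Shaman.

```python
import random
from collections import OrderedDict

def simulate(trace, block, sets, ways):
    """Return one hit/miss flag per reference for an LRU set-associative cache."""
    cache = [OrderedDict() for _ in range(sets)]
    hits = []
    for addr in trace:
        blk = addr // block
        s = cache[blk % sets]
        hit = blk in s
        if hit:
            s.move_to_end(blk)            # mark as most recently used
        else:
            if len(s) == ways:
                s.popitem(last=False)     # evict the least recently used block
            s[blk] = True
        hits.append(hit)
    return hits

def filter_never_hides_a_miss(trace, block=16, gamma=1024, ways=4):
    """True if every reference that hits C_f also hits C_t."""
    sets = gamma // block
    target_hits = simulate(trace, block, sets, ways)   # C_t: capacity ways * gamma
    filter_hits = simulate(trace, block, sets, 1)      # C_f: capacity gamma, direct-mapped
    return all(t for t, f in zip(target_hits, filter_hits) if f)

random.seed(0)
trace = [random.randrange(1 << 16) for _ in range(20000)]
print(filter_never_hides_a_miss(trace))
```

The check succeeds because both caches use the same set index, so the single block held by a set of $C_f$ is always the most recently used block of the corresponding set of $C_t$.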
To do that, we define the state transition of a memory block b cached in C f of a processor p as follows.
1. When p retrieves a diff containing a modification of b, its state turns to I.
2. When p gives a diff containing a modification of b to another processor retrieving it and this is the first transmission of the diff generated by p itself, the state of b turns to S if it was in state M.
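The two transition rules above can be sketched as follows. This is a minimal illustration, not the Shaman implementation: the class and method names are hypothetical, the default geometry is borrowed from the evaluation setup (64 KB, 16 B blocks), and the hit/miss policy in `access` (a load hits in S or M, a store hits only in M) is our assumption about an MSI filter.

```python
from enum import Enum

class State(Enum):
    I = "I"
    S = "S"
    M = "M"

class FilterCache:
    """Direct-mapped filtering cache C_f for one simulated processor."""

    def __init__(self, capacity=64 * 1024, block=16):
        self.block = block
        self.sets = capacity // block
        self.tag = [None] * self.sets
        self.state = [State.I] * self.sets

    def _slot(self, addr):
        blk = addr // self.block
        return blk, blk % self.sets

    def access(self, addr, is_store):
        """Return True if the reference must be passed to the back-end."""
        blk, i = self._slot(addr)
        present = self.tag[i] == blk and self.state[i] is not State.I
        if present and (not is_store or self.state[i] is State.M):
            return False                  # definite hit: filtered out
        # Miss, or store to a block held in S: pass it on and update C_f.
        self.tag[i] = blk
        self.state[i] = State.M if is_store else State.S
        return True

    def on_diff_retrieved(self, addrs):
        """Rule 1: blocks modified by a retrieved diff turn to I."""
        for addr in addrs:
            blk, i = self._slot(addr)
            if self.tag[i] == blk:
                self.state[i] = State.I

    def on_diff_first_given(self, addrs):
        """Rule 2: on the first transmission of our own diff, M turns to S."""
        for addr in addrs:
            blk, i = self._slot(addr)
            if self.tag[i] == blk and self.state[i] is State.M:
                self.state[i] = State.S
```

With these hooks, a store following `on_diff_first_given` or any reference following `on_diff_retrieved` is passed to the back-end again, which is how a coherence miss of $C_t$ is made to appear as a miss of $C_f$.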
As we proved in [3], this transition rule assures that the filtering of the references to memory blocks is proper if they are data-race-free blocks, defined as follows. [...] Otherwise, the block is racing.
Unfortunately, even a data-race-free program may have references to a racing block because a block is not a logical data structure but a set of individual words. For example, a block may be falsely shared by two processors and written by them asynchronously. Thus we have to detect racing blocks in order to deactivate the filtering of the references to them and pass all such references to the back-end so that it correctly simulates their cache state transitions.
In the literature [3] , we proved the following algorithm classifies a block as data-race-free at the end of an execution.
1. Each memory block $b$ in processor $p$ has a field named last-access-interval, denoted $\tau_p(b)$, whose value is $i$ iff the most recent reference to $b$ is performed in $\sigma_p^i$.
2. To each diff $\Delta_p^i(\pi)$, the vector time-stamp $v_p^i$ is attached. When a processor $q$ retrieves a diff $\Delta_p^i(\pi)$ and applies it to its page $\pi$, a block $b$ included in the diff is marked as [...]
3. If a block is not marked racing at the end of an execution, the block is data-race-free.
Thus our simulated execution of a workload consists of the following two phases.
phase-1 The front-end nodes execute the workload and filter references to generate those given to the back-end, supposing all the blocks are data-race-free. If a node finds a racing block, it broadcasts the detection to all other nodes and the back-end to stop the reference generation and processing. The execution, however, continues in order to mark all the racing blocks. At the end of the execution, the addresses of marked blocks are exchanged among all the front-end nodes to mark all their copies. If no marked blocks are found, the simulation completes, omitting the second phase.
phase-2 The front-end nodes re-execute the program, filtering only the references to data-race-free blocks. In this phase, the back-end is given sufficient references to simulate the target memory system.
Implementation Issues
Creating a Diff
In Section 3.3, we defined a diff of a page $\pi$ created by a processor $p$ in its $i$-th interval as an encoding of the memory locations of $\pi$ modified in the interval and the values of those locations at the end of the interval. The diff in the original LRC and most other software DSM systems, however, represents the difference between the contents of $\pi$ at the beginning and the end of the interval, as the name implies. Thus these systems create a diff as follows.
1. At the first store to a page in an interval, a copy of the page, named the Twin, is created. The second and subsequent stores are simply performed on the original page.
2. At the end of the interval, the page is compared with its Twin. If the value of a location is different from that in the Twin, the location and the value become a component of the diff.
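The two steps above can be sketched as follows. This is a simplified illustration with byte-granularity pages held in a `bytearray`; the function names are hypothetical, and the memory access trap that detects the first store in a real DSM is replaced here by an explicit call.

```python
def first_store(page, twins, page_id):
    """Step 1: on the first store in an interval, keep a copy (the Twin)."""
    if page_id not in twins:
        twins[page_id] = bytes(page)

def make_twin_diff(page, twin):
    """Step 2: compare the page with its Twin at the end of the interval."""
    return {offset: value
            for offset, (value, old) in enumerate(zip(page, twin))
            if value != old}

page = bytearray(8)
twins = {}
first_store(page, twins, page_id=0)   # would be triggered by the access trap
page[3] = 7                           # first store
page[5] = 9                           # subsequent store, performed freely
diff = make_twin_diff(page, twins[0])
print(diff)
```

Only locations whose final value differs from the Twin appear in the resulting diff.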
This method works well for those DSM systems because it is sufficient for them that a processor retrieving the diff obtains the value of the modified location. Moreover the method has a practically important property that the first store is detected by the memory access trap and the subsequent stores are freely performed.
In our case, however, this Twin method cannot be applied because we need the fact that a location is modified. For example, if the value of a location $x$ changes from $a$ to $b$ and then back to $a$, the diff created from the Twin does not have a component for the location, as shown in Figure 3(a). Therefore, we create a diff from a bit-vector attached to each page that indicates which locations are modified in an interval, as shown in Figure 3(b). That is, the bit-vector is cleared at the beginning of the interval, each store to a location turns the corresponding bit on, and at the end of the interval the bit-vector is coupled with the values of the modified locations and the vector time-stamp for the racing block detection to form the diff.
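The difference between the two methods can be seen in a small sketch. The class below is a hypothetical illustration of the bit-vector approach (one flag per location, stored as a Python list rather than packed bits), not the actual Shaman data structure.

```python
class TrackedPage:
    """A page with a per-location modification flag (the bit-vector)."""

    def __init__(self, size):
        self.data = [0] * size
        self.modified = [False] * size

    def begin_interval(self):
        # The bit-vector is cleared at the beginning of the interval.
        self.modified = [False] * len(self.data)

    def store(self, loc, value):
        # Each store turns the corresponding bit on.
        self.data[loc] = value
        self.modified[loc] = True

    def end_interval(self, vector_timestamp):
        # Couple the bit-vector with the final values and the time-stamp.
        values = {loc: self.data[loc]
                  for loc, m in enumerate(self.modified) if m}
        return {"bits": list(self.modified), "values": values,
                "vts": tuple(vector_timestamp)}

page = TrackedPage(4)
page.data[2] = 5                  # location x initially holds a = 5
twin = list(page.data)            # what the Twin method would keep
page.begin_interval()
page.store(2, 6)                  # a -> b
page.store(2, 5)                  # b -> a
diff = page.end_interval((1, 0))
twin_diff = {i: v for i, (v, o) in enumerate(zip(page.data, twin)) if v != o}
```

Here `diff["values"]` records location 2 even though its final value equals the initial one, whereas `twin_diff` is empty.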
Note that the temporal overhead of turning a bit on at each store is acceptably small because our DSM is for instruction-level simulation, in which a store is simulated by tens of host instructions, whereas it would be unacceptably large in an ordinary DSM, in which a store is a single native store instruction. Even if the overhead is not negligible, a significant part of it is compensated by the elimination of the creation and comparison of the Twin.
As for the spatial overhead, it can be reduced by using a 32-bit word-wise vector rather than a byte-wise one, which is required in general but only infrequently. In our implementation, the default is the word-wise vector, which makes the spatial overhead 1/32 of the page size; it is replaced by the general byte-wise vector when a store-byte or store-halfword instruction is performed on the page. Moreover, when the vector is coupled with the values and the vector time-stamp, it can be shrunk further to a $2^n$-word-wise vector if the length of the shortest sequence of modified or unmodified words is $2^n$. The extreme case, one bit per page, is not rare, especially when the workload is a scientific application.
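The overhead figure and the shrinking step can be illustrated as follows. The function `coarsen` reflects our reading of the shrinking rule (choose the coarsest power-of-two granularity at which every aligned group of words is uniformly modified or unmodified) and assumes a vector length that is a power of two; the names are hypothetical.

```python
PAGE_SIZE = 4096                         # bytes, as in the evaluation setup
WORD = 4                                 # bytes per 32-bit word

word_vector_bytes = PAGE_SIZE // WORD // 8   # one bit per word, packed
byte_vector_bytes = PAGE_SIZE // 8           # one bit per byte, packed

def coarsen(bits):
    """Return (granularity, shrunk vector) for a word-wise bit-vector."""
    g = 1
    while g < len(bits):
        step = 2 * g
        if len(bits) % step:
            break
        groups = [bits[i:i + step] for i in range(0, len(bits), step)]
        if any(len(set(grp)) != 1 for grp in groups):
            break                        # some aligned group is mixed
        g = step
    return g, [bits[i] for i in range(0, len(bits), g)]

print(PAGE_SIZE // word_vector_bytes)    # overhead ratio of the word-wise vector
print(coarsen([True] * 4 + [False] * 4))
print(coarsen([True] * 1024))            # whole page modified: one bit per page
```

When every word of a page is modified, the vector collapses to a single bit, which is the extreme case mentioned above.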
Merging Diffs
One problem of the diff-based LRC is that a large number of diffs are created if an application repeats modify-and-synchronize on a page. For example, in LU decomposition, one of our benchmarks, a matrix element $a_{ij}$ is modified $\min(i, j)$ times by a processor, but the value referred to by the other processors is only the last one. This consumes not only a large amount of memory on the processor that modifies the page, but also network bandwidth when the diffs are retrieved.
Figure 4. Merging Diffs
One solution to the problem is the home-based LRC [17], in which a page has a fixed home node. When a node modifies a page, it does not keep the diff but sends the diff to the home node, requesting it to update the master copy of the page. When a node tries to read the page after the receipt of a write-notice, it does not retrieve the diffs but obtains the whole page from the home node. Since this mechanism works quite well if the home node of a page is the sole modifier, programmers, compilers and/or DSM runtime systems carefully tune the mapping of data onto pages and of pages onto home nodes.
In our case, however, this method cannot be applied because the diff is essential for the filter cache state maintenance and the racing block detection. Another non-essential but important difference between our DSM and ordinary ones is that our workload is for shared memory multiprocessors and thus is not tuned for DSM. Thus we have to cope with the fact that a significantly large number of pages have multiple writers.
Our solution is that a processor merges the value parts of its diffs of a page when it receives a retrieval request for them. Since our diff representation has the bit-vector, we may simply perform bitwise OR operations on the vectors of all the requested diffs to obtain the locations that the processor modified; their most recent values are found in the current page image. In fact, an individual diff does not need to have a value part, because the value part can be generated from the OR result and the current image of the page itself. Thus the diff to be transferred, denoted $\mathrm{diff}_T$ to distinguish it from each elemental diff $\mathrm{diff}_E$ created by the modifier, consists of the ORed bit-vector, the value part, and the set of $\mathrm{diff}_E$, each of which is a pair of a bit-vector and a vector time-stamp, as shown in Figure 4 (vector time-stamps are omitted in the figure). This implementation technique reduces the size not only of $\mathrm{diff}_T$ but also of each $\mathrm{diff}_E$.
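The merging step can be sketched as follows. This is an illustration only: the function name and the dictionary layout are hypothetical, and each bit-vector is assumed to be packed into a Python integer (bit k set means location k was modified).

```python
def merge_diffs(page, elemental_diffs):
    """Build diff_T from the requested diff_E list and the current page image.

    Each diff_E is assumed to be a (bit_vector, vector_timestamp) pair.
    """
    merged_bits = 0
    for bits, _vts in elemental_diffs:
        merged_bits |= bits                    # bitwise OR of all bit-vectors
    # The value part is taken from the current page image, not from each diff_E.
    values = {loc: page[loc]
              for loc in range(len(page))
              if (merged_bits >> loc) & 1}
    return {"bits": merged_bits, "values": values,
            "diff_E": list(elemental_diffs)}

page = [10, 11, 12, 13]
requested = [(0b0011, (1, 0)), (0b0110, (2, 0))]
diff_t = merge_diffs(page, requested)
print(bin(diff_t["bits"]), diff_t["values"])
```

Location 1 is covered by both elemental diffs, yet its value is transferred only once, which is where the reduction in traffic comes from.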
Note that the bit-vectors and the vector time-stamps themselves cannot easily be merged, because the racing block detection using them must be safe and as accurate as possible. If we merged them in the positive direction of the time axis, the time-stamp value seen by the diff retriever would be incorrectly large and we would fail to mark racing blocks. If the merging direction were the opposite, an inappropriately small time-stamp value would cause a data-race-free block to be marked as racing. A possible optimization would be to merge a $\mathrm{diff}_E$ into its direct predecessor if the bit-vectors satisfy $\lnot b_{new} \lor b_{old}$ for each bit $b_{new}$ of the $\mathrm{diff}_E$ to be merged and the corresponding bit $b_{old}$ of the predecessor, but we have not implemented it.
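The bitwise condition for this unimplemented optimization is equivalent to requiring that every location modified in the newer diff was also modified in its predecessor. A hypothetical one-line check, again with bit-vectors packed into integers, might look like this:

```python
def can_merge_into_predecessor(new_bits: int, old_bits: int) -> bool:
    """True if (not b_new) or b_old holds for every bit position."""
    return (new_bits & ~old_bits) == 0

print(can_merge_into_predecessor(0b0101, 0b0111))
print(can_merge_into_predecessor(0b1000, 0b0111))
```

The first call satisfies the condition because the newer vector is a subset of the older one; the second does not because it sets a bit that the predecessor lacks.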
Another optimization reduces the cost of creating $\mathrm{diff}_T$ when it is retrieved by two or more processors. We implemented a simple cache of $\mathrm{diff}_T$: each page may hold the most recent $\mathrm{diff}_T$ transferred or retrieved. A diff retrieval hits this diff-cache if the set of write-notices completely matches the set of $\mathrm{diff}_E$ contained in the cached $\mathrm{diff}_T$.
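The hit condition of the diff-cache can be sketched as an exact set comparison. The class below is a hypothetical layout in which each write-notice (and the corresponding elemental diff) is identified by a (processor, interval) pair; it is not taken from the Shaman sources.

```python
class DiffCache:
    """Per-page cache holding the most recent diff_T."""

    def __init__(self):
        self.key = None        # identifiers of the diff_E contained in diff_T
        self.diff_t = None

    def lookup(self, write_notices):
        """Hit only if the requested set matches the cached diff_E set exactly."""
        if self.diff_t is not None and frozenset(write_notices) == self.key:
            return self.diff_t
        return None

    def fill(self, write_notices, diff_t):
        """Replace the cached diff_T with the most recent one."""
        self.key = frozenset(write_notices)
        self.diff_t = diff_t

cache = DiffCache()
cache.fill([("r", 1), ("r", 2)], "diff_T for intervals 1 and 2")
print(cache.lookup([("r", 2), ("r", 1)]))   # same set, order irrelevant: hit
print(cache.lookup([("r", 2)]))             # partial match: miss
```

A partial match is treated as a miss because the cached value part may include modifications that the requester should not yet observe.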
Sharing Diffs
A front-end node may have two or more processors to be simulated. This is unique to our DSM for the simulation because ordinary DSM systems have no reason to have multiple processes on a node.
Since the processors allocated on a front-end node logically share the memory, it is natural and safe for them to share the values of a page on a node. However, sharing the other parts of the page is not always safe. Obviously, the last-access-interval of a block has to be owned individually by each processor. The write-notices are also private to each processor because a processor may have its own happens-before-1 relationship with other processors.
The remaining components of the page, $\mathrm{diff}_E$ and the cached $\mathrm{diff}_T$, can be shared by the processors on a node. To do that, a $\mathrm{diff}_E$ is linked from the write-notice corresponding to it, rather than attached to the page directly, as shown in Figure 5. In part (a) of the figure, the processors $p_1$ and $p_2$ on the node $n$ are in intervals $m_1$ and $m_2$, whose own happens-before-1 predecessors are $\sigma_q^i$ and $\sigma_q^j$ respectively. The $\mathrm{diff}_E$ of the page $\pi$ from these intervals have already been retrieved and applied. The intervals of $p_1$ and $p_2$ also have common happens-before-1 predecessors $\sigma_r^k$ and $\sigma_r^{k+1}$. The processor $p_1$ has already retrieved the $\mathrm{diff}_E$ of these two intervals and the corresponding $\mathrm{diff}_T$ is cached, while $p_2$ has not.
Then $p_2$ accesses the page $\pi$ and finds that the page is invalid for it because its own most recent write-notice $w_r^{k+1}(\pi)$ does not have the corresponding $\mathrm{diff}_E$. Thus it snoops the most recent write-notices of $p_1$ before sending the diff retrieval request to $r$, and finds the most recent write-notices of $p_1$ [...] the filter cache invalidation and racing block marking have already in the node $n$. Therefore, $p_2$ utilizes each $\mathrm{diff}_E$ for the invalidation and marking, and draws a link from the corresponding write-notice to the $\mathrm{diff}_E$ to share it, giving the result shown in Figure 5(b). The cached $\mathrm{diff}_T$ attached to the page remains unchanged.
Figure 5. Sharing Diffs (legend: LAI = Last Access Interval; [π+l] = value of the l-th byte of page π, where the page size is P bytes)
In the snooping of the write-notices, we abandon the sharing when we find a mismatch between the most recent write-notice sequence of the snooper and that of the snoopee. Although the snoopee could have a matching write-notice somewhere in its sequence, it is hard to bound the search process and we believe the probability of such a discovery is small. For a similar reason, when a node has three or more processors, we check the most recent write-notice of each processor cyclically and choose the first match for the sequence check, eliminating other possible candidates.
If the snooping fails to obtain all the required $\mathrm{diff}_E$, the retrieval request is sent to the producer of the first mismatching write-notice. Then the cached $\mathrm{diff}_T$ is replaced by the retrieval result.
Performance Evaluation
Environment and Workload
We have implemented the Shaman simulator on a PC cluster whose nodes have a 1 GHz Pentium III and 256 MB of memory and are connected by Gigabit Ethernet. We used N = 1, 2, 4, 8 and 16 front-end nodes together with one back-end node, i.e. N + 1 nodes in total, to simulate a bus-connected 16-way shared memory multiprocessor. The page size of the DSM is 4 KB.
Each processor of the target system executes the SPARC instruction set with a single pipeline. The coherent data cache is a single-level direct-mapped MESI cache, and its capacity and block size are 64 KB and 16 B respectively. The minimum latency of a cache-missing load/store is 10 cycles if the missed block is transferred from another cache, and 30 cycles otherwise (i.e. from memory). Bus and/or memory wait cycles are added to the minimum latency if the cache-missing access has to wait for the completion of other accesses. The latency of other instructions and of cache-hitting loads/stores is one cycle. Although the target system is kept quite simple for rapid development, it is large and functional enough to evaluate the basic performance of Shaman.
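The latency rule for cache-missing accesses can be illustrated with a small sketch. The 10- and 30-cycle figures are those given above; the bus model (a single bus that is held for the whole minimum latency of an access) and the function name are our assumptions for illustration, not a description of the actual back-end.

```python
CACHE_TO_CACHE = 10   # cycles, block supplied by another cache
FROM_MEMORY = 30      # cycles, block supplied by memory

def miss_latency(issue_time, bus_free_at, from_cache):
    """Return (latency, new_bus_free_at) for one cache-missing load/store.

    Assumption: the bus is occupied for the whole minimum latency.
    """
    minimum = CACHE_TO_CACHE if from_cache else FROM_MEMORY
    wait = max(0, bus_free_at - issue_time)    # wait for earlier accesses
    return wait + minimum, issue_time + wait + minimum

latency, bus_free_at = miss_latency(100, 0, from_cache=True)
print(latency, bus_free_at)
latency2, bus_free_at2 = miss_latency(105, bus_free_at, from_cache=False)
print(latency2, bus_free_at2)
```

In the second call the access is issued while the bus is still busy, so five wait cycles are added to the 30-cycle minimum.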
Two workloads are chosen from the SPLASH-2 kernels [16]. One is LU decomposition (LU for short) of a 256 × 256 double precision matrix subdivided into 16 × 16 sub-matrices. The other is FFT of $2^{16}$ double precision complex numbers. Although both programs are coded cache-consciously, they are not intended to be executed on a DSM cluster but are coded and compiled for shared memory multiprocessors. As reported in [3], the reference filtering effectively eliminates references, passing only 0.5 % in LU and 1.6 % in FFT. No racing blocks are detected in the execution of either workload.
Evaluation Result
Table 1 and Table 2 show the execution time of each simulator configuration of N = 1, 2, 4, 8 and 16. The first row [...] Note that 'SH' is only applicable to N = 1 to 8 because nothing is shared in N = 16. It is applicable to the single node case N = 1 because copying diffs on the node may be removed. Figure 6 and Figure 7 show the execution speed $V_N$, which is defined as in [8] [...] where $c_i$ is the number of simulated clocks of the processor $p_i$ and $t_j$ is the execution time of the node $n_j$. The baseline performance of one front-end node, 51 MC/sec (million simulated clocks per second) for LU and 74 MC/sec for FFT, is excellent even taking into account that the target system is quite simple. The one-node performance of 'MR' and 'MR+DC' for FFT is lower than the baseline because these techniques aim to move burdens from communication to computation. Adding the technique 'SH' is effective especially in FFT, the performance of which is improved by 40 %, because copying diffs from a holder processor to a retriever is removed.
As for the parallel performance, 335 MC/sec for LU and 392 MC/sec for FFT on 16 nodes with the 'MR+DC' technique is also excellent. The parallel speed-up from N = 1 to N = 16, 6.6 for LU and 5.3 for FFT, is not brilliant but reasonable because we use N + 1 = 2 and N + 1 = 17 nodes respectively.
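The quoted speed-ups follow directly from the reported speeds. The short calculation below also relates them to the ratio of host nodes actually used (17 versus 2); that normalization is our own reading of the remark about N + 1 nodes, not a metric defined in this paper.

```python
baseline = {"LU": 51, "FFT": 74}       # MC/sec with one front-end node
parallel = {"LU": 335, "FFT": 392}     # MC/sec with 16 front-end nodes, 'MR+DC'

# Speed-up from N = 1 to N = 16.
speedup = {k: parallel[k] / baseline[k] for k in baseline}

# Host nodes used: N + 1 = 17 versus N + 1 = 2 (illustrative normalization).
host_node_ratio = 17 / 2
relative_to_hosts = {k: speedup[k] / host_node_ratio for k in baseline}

for k in baseline:
    print(k, round(speedup[k], 1), round(relative_to_hosts[k], 2))
```

Measured against the 8.5-fold increase in host nodes rather than the 16-fold increase in front-end nodes, the speed-ups look considerably more favorable.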
The implementation technique 'MR' is the most effective for improving the parallel performance of LU. This technique improves the 8- and 16-node performance by 9 and 10 % respectively. As stated in Section 4.2, the baseline implementation creates a large number of diffs of a page and sends them to diff retrievers, ignoring the fact that only the newest one is necessary. Since 'MR' almost completely eliminates the transfer of unnecessary diffs, leaving only the bit-vectors and time-stamps, network traffic is reduced to approximately 30 %.
This technique is also effective for FFT with N = 8 and 16, in which performance is improved by 6 and 11 % respectively, but it degrades the performance for N = 1 to 4. The technique (a) reduces the memory space for diffs and (b) reduces the network traffic if multiple diffs for a page are transferred, but (c) increases the cost of making the $\mathrm{diff}_T$ to be transferred. In FFT, since an element of a shared array written in one synchronization period is always referred to in the next period, (b) has no effect and thus the performance degrades for N = 2 and 4 due to (c). For N = 8 and 16, the effect (a) outweighs (c) by reducing the working set size and thus (probably) improving the cache hit ratio. Adding the technique 'DC' is expected to reduce the cost of (c), and it does so for N = 8 and 16, improving the performance by 4 and 5 % respectively. Further analysis is required to find the reason why 'DC' is ineffective (in fact, it affects the performance negatively) for N = 2 and 4.
The effect of the technique 'SH' is very impressive in FFT. It improves the performance for N = 2, 4 and 8 by 88, 72 and 6 % respectively. FFT performs an all-to-all communication to transpose a matrix subdivided by the row-wise block decomposition method. Since the elements of the matrix are ordered row-wise and the size of a row is 4 KB (256 double precision complex numbers), or exactly one page, all the processors on a node try to read the same set of pages (rows) for the transposition. Therefore, after a processor on a node has retrieved a set of diffs, other processors on the node may use the set without inter-node communication, resulting in an impressive performance improvement when the node has many processors, i.e. when N is small.
Conclusions
In this paper, we discussed the Shaman distributed simulator of shared memory multiprocessors. Shaman runs on a PC cluster that consists of multiple front-end nodes, each of which simulates the instructions executed by one or more processors of the target using the LRC software DSM technique, and a back-end node that simulates the target memory system. The DSM implementation, which plays an important role in the reference filtering, is discussed in detail. The performance of Shaman is high: 335 MC/sec for LU decomposition and 392 MC/sec for FFT with 16 front-end nodes. This demonstrates that our novel idea of distributed architectural simulation can be implemented efficiently. The results also show that our DSM implementation techniques effectively improve the parallel performance.
Our main future work is to evaluate the performance of Shaman while varying its parameters. We need to show that Shaman works well with various target systems having advanced processor architectures, hardware DSM, a large number of processors, and so on. Evaluation with larger and more complicated workloads is also planned. Additionally, improvements to the Shaman implementation, such as applying various techniques of ordinary DSM systems and parallelizing the back-end, are left for future work.
