Introduction
As single-processor speed increases slow down, multi-core machines are becoming standard in environments ranging from desktop and portable computing up to servers. This makes parallelism an important concern: to achieve high performance on the new architectures, one must be able to extract parallelism from the application. Unfortunately, this is often difficult with the standard von Neumann approach to computing, motivating aggressive research on alternative models such as TRIPS [14] and RAW [20]. A popular alternative was the DataFlow model, where instructions can execute as soon as their input operands are available [7, 6, 13], instead of following the program order. Although dataflow ideas are in use, e.g., in Tomasulo's algorithm [19], actual DataFlow machines were never popular. Arguably, a major problem with them is that it was hard to support the memory semantics required by imperative languages. Moving to DataFlow thus required both a new architecture and a new programming language.
Swanson's WaveScalar architecture is a radical rethinking of DataFlow concepts that addresses this problem by providing a memory interface that executes memory accesses according to the program order [17, 18, 16]. The key idea is that computation is divided into waves, such that each wave runs as a dataflow computation, but the sequencing of waves guarantees the traditional memory access order. It thus becomes possible to run imperative programs on a dataflow machine and still obtain significant speedups. Simulation results for WaveScalar show that, for single-threaded programs, performance can be comparable with conventional superscalar designs and even CMPs, but with 30% less area [16]. For multithreaded applications, WaveScalar performs 2 to 11 times faster than CMPs [16].
The tiled organization of the WaveScalar architecture and the placement algorithms allow an efficient use of the processor's resources.
In the original WaveScalar design, memory requests from different waves always follow program order. This can be more restrictive than current designs, where write buffering and read reordering are used extensively. Work by the WaveScalar group shows that such techniques can indeed be very useful in the WaveScalar context [16]. In this work we present and study an even more aggressive approach. We propose an optimistic speculative memory access algorithm in which different waves can access the memory caches out of order. If conflicts arise, speculative waves are forced to undo their memory accesses. The approach was inspired by the observation that waves can be regarded as memory transactions, so the work on Transactional Memories should apply. Notice that although it allows more parallelism, this approach requires more complex hardware (on the other hand, such hardware could also be used to support actual transactions).
We evaluated the Transactional WaveCache on a set of benchmarks from Spec 2000, Mediabench and Mibench (telecomm). Speedups ranging from 1.31 to 2.24 were observed when the benchmark does not perform many emulated function calls. Results ranging from a low speedup of 1.1 down to a slowdown of 0.96 were observed in the opposite case or when memory concurrency was high.
Section 2 discusses the Transactional Memory concepts that influenced the creation of the Transactional WaveCache. Section 3 describes the WaveScalar instruction set, memory interface and processor architecture. Section 4 presents the Transactional WaveCache and Section 5 presents and discusses experiments and results. Conclusions and future work are presented in Section 6.
Transactional Memories
The term Transactional Memory (TM) was coined by Herlihy and Moss [9] as "a new multiprocessor architecture intended to make lock-free synchronization as efficient (and easy to use) as conventional techniques based on mutual exclusion". The motivation was to avoid the complexity of lock-based programming. The advent of CMPs (single-Chip MultiProcessors) has increased the demand for parallel models that are easier to program and has thus made the subject popular.
A transaction is an atomic block of code. In TM, transactions can execute speculatively without blocking. To do so, each transaction must keep a record of its changes to memory and to the register file. If there is a conflict between two transactions, one of them is aborted and re-executed. If a transaction finishes without conflicts, it commits. There are four major problems that must be considered when creating a Transactional Memory solution: data versioning, conflict detection, nesting and virtualization.
In eager data versioning, an undo log keeps a backup of the memory regions and registers changed by transactions, and it is used to restore them in case of an abort. If a commit happens, the undo log is erased. In lazy data versioning, a write buffer keeps the changes made by each transaction. Those changes are applied to memory and registers upon a commit and erased on an abort. Conflict detection can also be eager or lazy. In the former, conflicts are detected as soon as they happen; in the latter, they are only detected when a transaction finishes (it will commit if no conflicts were detected).
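To make the difference between the two versioning policies concrete, the following Python sketch models memory as a dictionary. It is an illustration only: the class and method names are hypothetical, registers are ignored, and conflict detection is left out.

```python
_ABSENT = object()  # marks an address that held no value before the write


class EagerTransaction:
    """Eager versioning: write in place, keep old values in an undo log."""

    def __init__(self, memory):
        self.memory = memory
        self.undo_log = []  # (address, previous value), oldest first

    def write(self, addr, value):
        self.undo_log.append((addr, self.memory.get(addr, _ABSENT)))
        self.memory[addr] = value

    def commit(self):
        self.undo_log.clear()  # memory is already up to date

    def abort(self):
        # Undo newest first so that the oldest backup is the one that remains.
        for addr, old in reversed(self.undo_log):
            if old is _ABSENT:
                self.memory.pop(addr, None)
            else:
                self.memory[addr] = old
        self.undo_log.clear()


class LazyTransaction:
    """Lazy versioning: buffer writes privately, apply them on commit."""

    def __init__(self, memory):
        self.memory = memory
        self.write_buffer = {}

    def write(self, addr, value):
        self.write_buffer[addr] = value

    def read(self, addr):
        # A transaction must see its own buffered writes.
        return self.write_buffer.get(addr, self.memory.get(addr))

    def commit(self):
        self.memory.update(self.write_buffer)
        self.write_buffer.clear()

    def abort(self):
        self.write_buffer.clear()  # memory was never touched
```

In this sketch an eager commit is cheap and an abort is expensive, while the lazy variant has the opposite cost profile.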
One further problem in implementing TM is that CPUs may start new transactions while still executing another transaction, the so-called nested transactions. Such transactions may be treated as a single transaction (flattening), or may be treated independently (closed and open nesting [11, 12]). Last, Transactional Memory systems should ensure the correct execution of programs even when a transaction exceeds the scheduler time slice, the cache and memory capacities, or the number of independent nesting levels allowed by the hardware [3].
Transactional Memory mechanisms may be implemented in hardware [9, 2, 3], where higher performance can be achieved, but complete solutions to the nesting and virtualization problems may increase the design complexity and make it too expensive. Software implementations [15, 8, 1] allow more complex solutions with lower performance. Hybrid implementations [10, 4, 1] try to get the best of both worlds.
WaveScalar
WaveScalar is the DataFlow architecture used to implement the Transactional WaveCache presented in this work. This Section briefly describes the WaveScalar instruction set, memory interface and processor architecture.
Instruction Set
A DataFlow graph describes a program in the DataFlow model. The nodes in the graph are instructions: instructions are intelligent in the sense that they have an associated functional unit. Edges represent the operands exchanged between instructions. The WaveScalar Instruction Set starts from the Alpha ISA [5] . The main difference is that branches must be transformed into a mechanism to select consumers of values: 
Waves
Waves are connected, acyclic fragments of the control flow graph with a single entrance. Waves extend hyper-blocks in that they can also contain joins. The WaveScalar compiler (or binary translator) partitions a program into a set of maximal waves. Waves are then ordered through Wave-ordering annotations.
Notice that different iterations of a loop may execute in parallel in DataFlow mode if there are no dependencies between their instructions. When an instruction in the loop receives an operand, there must be a way to identify the iteration for which the operand is destined. To accomplish that, every operand carries a tag that indicates the iteration number, or Wave number. The Wave-Advance instruction advances the wave numbers of the input operands of a wave. It is associated with every input argument, and is often merged with the next instruction to reduce overhead.
Memory-ordering
During compilation, all memory operations further receive a key < P, C, S > (Predecessor, Current and Successor) that allows the memory system to establish a chain connecting all memory requests in a Wave. A memory request can only be executed if the previous request in the chain and all memory requests from the previous Wave have already been executed. When a memory operation is the first or the last of a Wave, P="." or S=".", respectively (the wildcard "." denotes a nonexistent operation). Operations that occur before and after a branch block have S="?" and P="?", respectively (the wildcard "?" denotes an unknown operation). If there are no memory operations in one of the paths of a branch, there is no way to establish a chain between the operations that come before and after the branch block. To solve this problem, a MemNop instruction must be inserted in that path. Figure 1 shows a piece of code with an IF-THEN-ELSE block (a) and the related DataFlow graph (b). The Wave-ordering annotations of the memory operations are also shown, with the dashed lines indicating the chain that is formed between them.
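The chaining rule can be illustrated with a small Python sketch. It is a simplified model under stated assumptions: it covers a single wave only (the requirement that previous waves have finished is not modelled), requests are plain dictionaries, and the function names and the sequence numbers in the example are hypothetical rather than taken from Figure 1.

```python
def links(prev, nxt):
    """True when the annotations prove that nxt directly follows prev."""
    # The link can be established from either side, because one of the
    # two annotations may hold the "?" wildcard around a branch.
    return prev["S"] == nxt["C"] or nxt["P"] == prev["C"]


def issue_order(requests):
    """Return the C numbers in the order in which the wave may issue them.

    requests may arrive in any order; a request is issued only when it is
    the first of the wave (P == ".") or when it links to the request that
    was issued last.
    """
    pending = list(requests)
    issued = []
    last = None
    progress = True
    while pending and progress:
        progress = False
        for req in pending:
            ready = req["P"] == "." if last is None else links(last, req)
            if ready:
                issued.append(req["C"])
                pending.remove(req)
                last = req
                progress = True
                break
    return issued


# A load before an IF-THEN-ELSE, a store on the taken path, a load after it.
before = {"P": ".", "C": 1, "S": "?"}
taken = {"P": 1, "C": 2, "S": 4}
after = {"P": "?", "C": 4, "S": "."}
```

With these requests, `issue_order([after, taken, before])` issues 1, 2, 4 regardless of arrival order. If the taken path contributed no request at all, `issue_order([after, before])` would stop after request 1, which is the situation that the MemNop instruction is meant to avoid.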
Wave-ordered memory incorporates simple run-time memory-disambiguation inside each wave by taking advantage of the fact that the address of a Store is sometimes ready before the data value. When this happens, the memory system can safely proceed with future memory operations to different addresses. Stores are broken in two requests: Store-Address-Request and Store-Data-Request. Store-Address-Requests arriving without the corresponding Store-Data-Request are inserted in partial store queues to hold future operations to the same address. Other operations that access the same address are inserted in the same queue. Once the Store-Data-Request arrives for that address, the memory system can apply all the operations in the partial 
The WaveScalar architecture
The WaveScalar architecture is called WaveCache and it comprises all the hardware, except main memory, required to run a WaveScalar program. It is designed as a scalable grid of identical dataflow processing elements (PEs), the wave-ordered memory hardware, and a hierarchical interconnect to support communication. The Cluster, made up of four Domains, is the building block of the WaveCache. It has an L1 cache, the StoreBuffer that interfaces with the wave-ordered memory, and a Switch to provide intra- and inter-Cluster communication. Each Domain has eight PEs grouped in Pods of two PEs each. The Clusters are replicated across the die, forming a matrix that is connected to the L2 cache. Figure 2 (reproduced, with permission of the copyright owner, from [16]) shows an overview of the WaveCache.
The PEs implement the DataFlow firing rule and thus execute instructions. Each PE has an ALU, memory structures to hold operands, logic to control execution and communication, and an instruction buffer. When a program executes in WaveScalar, multiple instructions are mapped to the same PE by a placement algorithm. As the program evolves, some instructions become unnecessary and are replaced by new ones. The firing rule ensures that an instruction is executed only when all its operands are available. Every PE has a matching table that holds all operands destined to instructions mapped to that PE, until the instructions are ready to be fired, causing those operands to be consumed and possibly producing result operands.
Each Cluster has a StoreBuffer (SB) that is responsible for the memory ordering mechanism of some Waves. The Wave Map holds the information about which StoreBuffer has custody of which Wave. This Map is stored in main memory; a unique view of the lines used by each SB is guaranteed by the cache coherence protocol. Memory requests are sent from the PEs to their local StoreBuffers and routed, if needed, to the remote SB that is responsible for the requests' Wave. The SB then inserts the request in a list for that wave. The request will be executed as soon as all the previous waves have executed all their memory operations and a chain has been established between the previously executed memory operation and the present one.
The Transactional WaveCache
By default, WaveScalar follows a strict ordering mechanism for memory accesses. Partial Stores can add parallelism to the execution of memory operations within a wave, but parallel execution between waves is only possible if later waves do not need memory. The motivation for our work is that more parallelism becomes available if we allow disjoint memory operations to execute in parallel.
This work presents an alternative memory ordering mechanism that maintains the execution order of memory operations within a wave, but adds the ability to speculatively execute, out of order, operations from different waves. This ordering mechanism is inspired by the way Transactional Memories work. Waves are considered atomic regions and are executed as nested transactions. In a nutshell, if a wave has finished the execution of all its memory operations, it can commit as soon as the previous waves have committed. If a hazard is detected in a speculative wave, all the following Waves (children) are aborted and re-executed.
Transactions in WaveScalar
The large body of work in TM suggests a number of alternatives to implement our approach. Next, we present our algorithm. The first idea is that each Wave has an F bit that indicates whether it has finished all its memory operations. This bit is stored in the Wave Map. In contrast to standard TM, transaction size is not a problem here: transactions that need more resources than are available should just stall until resources become available or until they become non-speculative.
The Transactional context
In Transactional Memory for von Neumann machines, before a transaction starts it is necessary to save the content of all registers used by the program. A history of all changes in memory must also be kept. If a hazard is detected, this information is used to restore the memory and the register file. In WaveScalar there is no register file, but all operands used by a transaction that were produced outside the transaction need to be re-sent. Those operands form the read set of the transaction, or Wave.
We observed that the Wave-Advance instructions control how operands reach a wave. In order to keep the read set for each Wave, a natural step is to modify those instructions so that they send a copy of their operand to the StoreBuffer, which in turn inserts it into a novel structure, referred to below as the WCT. We further need to keep a memory log: this is implemented through a table called MemOp-History (MOH). For each operation, the MOH stores the Wave number, the Current number of the operation's memory annotation, the operation type and, for Store operations, a data backup. Each StoreBuffer has an associated MOH. One problem is that waves belonging to different SBs can interfere. All the involved SBs would need a unique view of the MOH so that hazards between those waves could be detected: the MOH would need to reside in memory, bringing high communication costs. We avoid this by ensuring that a speculative wave must use the same SB as its closest non-speculative wave. Arguably, this solution could limit parallelism, but in fact waves in the same thread are usually under the responsibility of the same SB [16].
Speculation and Re-execution
We used the two previous data-structures so that the WaveScalar's original memory ordering mechanism would allow memory requests from different waves to be executed concurrently. In order to study how speculation can affect performance, we can impose a limit on the number of speculative waves that may execute. We call this limit the Speculation Window.
Before we go on, we observe that problems arise because an instruction may receive operands from old and recent executions. In the worst case, operands might match in the Matching Table (MT) (see page 3) and be consumed, producing wrong results. We therefore add an execution number ExeN to the operands' tags and change the firing rule so that ExeN is also used to match operands. An instruction that executes with operands of ExeN = K produces operands that also have ExeN = K.
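The extended firing rule can be pictured with the following Python sketch of a matching table. It is a simplified model, not the hardware design: the class name, the `port` argument that identifies an input position, and the dictionary layout are all assumptions made for illustration.

```python
class MatchingTable:
    """Holds operands until every input of an instruction instance arrived."""

    def __init__(self, arity):
        self.arity = arity  # instruction id -> number of input operands
        self.slots = {}     # (instruction, wave, ExeN) -> {port: value}

    def receive(self, instr, wave, exe_n, port, value):
        """Store one operand; return the inputs in port order if it fires."""
        key = (instr, wave, exe_n)  # ExeN is part of the matching tag
        row = self.slots.setdefault(key, {})
        row[port] = value
        if len(row) == self.arity[instr]:
            del self.slots[key]  # operands are consumed when firing
            return [row[p] for p in sorted(row)]
        return None
```

Because ExeN is part of the key, an operand left over from execution 0 can never complete a match with an operand from execution 1 of the same instruction and wave. The stale operand simply stays in the table, which is the resource problem addressed later by the operand-erasing mechanisms.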
The StoreBuffer maintains a lastExeN attribute that holds the number of the last execution. Waves also have a currentExeN attribute: a memory request can only be accepted by Wave W if ExeN ≥ W.currentExeN.
We can now describe the rollback algorithm. If a hazard is found in Wave W, the WCTs for all waves Y such that Y.Id > W.Id are discarded; all operands in the WCT for W are then re-sent, causing the re-execution of W and of all following waves. Notice that at the moment of a re-execution, some operands that belong to the read set of W might not be present in the WCT yet. This is not a problem: they simply have not reached the Wave-Advance instruction. When they do reach it, the operands (i) will be inserted in the WCT, (ii) will have their ExeN updated, and (iii) will be sent to the new instance of wave W.
Notice that when operands are re-sent by the StoreBuffers they have their ExeN changed to a unique number that identifies the new execution. More precisely, when Wave W causes a re-execution, all waves ≥ W have their currentExeN set to StoreBuffer.lastExeN . Thus, memory requests from old executions will not be accepted by the memory system.
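The bookkeeping of execution numbers can be sketched in Python as follows. The sketch assumes that a request is stale when its ExeN is smaller than the currentExeN of its wave, which is the reading consistent with old executions being rejected; the class and method names are hypothetical.

```python
class ExecutionNumbers:
    """Tracks lastExeN for a StoreBuffer and currentExeN for its waves."""

    def __init__(self):
        self.last_exe_n = 0
        self.current_exe_n = {}  # wave id -> currentExeN (0 if absent)

    def start_reexecution(self, wave, known_waves):
        """Wave `wave` caused a re-execution: give it a fresh ExeN."""
        self.last_exe_n += 1
        for w in known_waves:
            if w >= wave:
                self.current_exe_n[w] = self.last_exe_n
        return self.last_exe_n

    def accepts(self, wave, exe_n):
        """Reject requests that belong to an execution older than current."""
        return exe_n >= self.current_exe_n.get(wave, 0)
```

After `start_reexecution(2, range(5))`, a request for wave 3 tagged with ExeN 0 is rejected, while a request for wave 1 with the same tag is still accepted, since wave 1 was not re-executed.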
Hazard Detection
A hazard may exist between two memory operations if they access the same memory address. As the Transactional WaveCache maintains WaveScalar memory ordering within waves, operations in the same Wave will never cause a hazard. When a memory operation belonging to Wave X is executed, a hazard can only occur between Wave X and some other wave Y such that X.Id > Y.Id.
In our architecture, the MOH structure is the key for recognizing hazards. Consider two memory operations A and B, and their respective waves X and Y , such that X.Id > Y.Id. We assume that there were no previous speculative stores between them and that A has executed. When B finally executes, the possible hazards and respective solutions are as follows:
RAW (Read After Write): Occurs when A is a Load and B is a Store. All waves ≥ X must be aborted and re-executed. If B is speculative, it is inserted in the MOH and its backup field is either (i) copied from A's value field in the MOH or (ii) obtained from a Load.
WAW (Write After Write): Occurs when both A and B are Stores. Wave X does not need to be re-executed. If B is speculative, it is inserted in the MOH and its backup field receives A's backup. Either way, Store B need not go to main memory, and A's backup field in the MOH receives the value of Store B.
WAR (Write After Read): Occurs when A is a Store and B is a Load. Wave X need not be re-executed. Instead, the backup field of A can be the return value for the Load. If B is speculative, it should be inserted in the MOH.
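The three cases can be summarised in a short Python sketch. It is only a classification aid under the same assumption as the text (no intervening speculative stores between A and B). The `addr` field and the function names are assumptions introduced for illustration, and the updates to the backup fields are not modelled.

```python
def classify_hazard(a_op, b_op):
    """Classify the pair (A, B) on the same address.

    a_op: type of A, which already executed and belongs to the later wave X.
    b_op: type of B, which executes now and belongs to the earlier wave Y.
    Returns (hazard name or None, True if waves >= X must be re-executed).
    """
    if a_op == "load" and b_op == "store":
        return "RAW", True   # A read a value that B should have produced
    if a_op == "store" and b_op == "store":
        return "WAW", False  # A's value must remain the final one
    if a_op == "store" and b_op == "load":
        return "WAR", False  # B is served from A's backup field
    return None, False       # two Loads never conflict


def check(moh, wave, addr, op):
    """Compare a new operation against MOH entries of later waves.

    Returns (list of hazard names, first wave to abort or None).
    """
    hazards = []
    abort_from = None
    for entry in moh:
        if entry["addr"] == addr and entry["wave"] > wave:
            name, reexecute = classify_hazard(entry["op"], op)
            if name is not None:
                hazards.append(name)
            if reexecute and (abort_from is None or entry["wave"] < abort_from):
                abort_from = entry["wave"]
    return hazards, abort_from
```

Only the RAW case yields a wave to abort; the other two cases are resolved through the MOH backup fields, as described above.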
Commit and RollBack
Commits execute whenever a non-speculative wave finishes its execution in the memory system. A commit in Wave X makes wave X + 1 non-speculative. Therefore, the transactional context of X + 1 can be erased; if X + 1 completed (i.e., F = TRUE), it will commit and we can proceed to the next wave. Notice that when a Wave commits the next Wave may be under the responsibility of a different StoreBuffer. If so, a message is sent to that SB so that it knows that now it has the non-speculative wave and that it can start executing memory operations. The lastExeN attribute is also sent and updated in the destination SB, to be used in case of future re-executions.
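The way one commit can trigger the next can be sketched in Python as follows. The sketch assumes consecutive wave numbers and a single StoreBuffer, so the hand-over message between StoreBuffers is not modelled; the function name is hypothetical.

```python
def commit_ready_waves(non_speculative, finished):
    """Commit waves in order for as long as their F bit is set.

    non_speculative: id of the wave that is currently non-speculative.
    finished: set of wave ids whose F bit is TRUE.
    Returns (committed wave ids, id of the new non-speculative wave).
    """
    committed = []
    while non_speculative in finished:
        committed.append(non_speculative)
        non_speculative += 1  # the next wave becomes non-speculative
    return committed, non_speculative
```

For example, if wave 3 is non-speculative and waves 3, 4 and 6 have finished, waves 3 and 4 commit and wave 5 becomes non-speculative. Wave 6 has to wait for wave 5, even though its own memory operations are complete.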
Committing requires cleaning the MOH. One possibility would be to scan the whole structure. Our prototype uses a Search Catalog to speed up this operation. For each wave, the Search Catalog points to the list that includes all operations for the wave.
We are now in a position to present the necessary steps to re-execute a program if wave X was aborted:
1. Erase the lines L in the Search Catalog where L.Wave > X;
2. Restore the memory to its previous state (using the backup field of Stores in the MOH) before the execution of all operations of waves ≥ X. During this process the MOH entries for those waves must also be erased;
3. Clean the WCTs I where I.Wave > X;
4. Clean all requests for Waves ≥ X in the local StoreBuffer. Requests in remote SBs will be cleaned lazily when they receive requests for the new executions;
5. Increment the lastExeN attribute in the local StoreBuffer and copy that value to the currentExeN of all Waves ≥ X;
6. Re-send all the operands in X's read set.
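The six steps can be followed in the Python sketch below, which models a single StoreBuffer as a dictionary. Several points are assumptions made for illustration: the field names, the `addr` field of MOH entries, the MOH being kept in execution order, each backup holding the value that was in memory just before its Store executed, and the Search Catalog line of wave X being emptied rather than erased.

```python
def rollback(sb, x):
    """Re-execute from wave x; return the operands to re-send with their ExeN."""
    # 1. Erase Search Catalog lines of later waves.
    for w in [w for w in sb["catalog"] if w > x]:
        del sb["catalog"][w]
    if x in sb["catalog"]:
        sb["catalog"][x] = []  # assumption: the line of x is kept but emptied
    # 2. Undo the Stores of waves >= x, newest first, then drop their entries.
    for entry in reversed(sb["moh"]):
        if entry["wave"] >= x and entry["op"] == "store":
            sb["memory"][entry["addr"]] = entry["backup"]
    sb["moh"] = [e for e in sb["moh"] if e["wave"] < x]
    # 3. Clean the WCTs of later waves (the read set of x itself is kept).
    for w in [w for w in sb["wcts"] if w > x]:
        del sb["wcts"][w]
    # 4. Clean pending requests of waves >= x in the local StoreBuffer.
    sb["requests"] = [r for r in sb["requests"] if r["wave"] < x]
    # 5. Start a new execution number for waves >= x.
    sb["last_exe_n"] += 1
    for w in sb["current_exe_n"]:
        if w >= x:
            sb["current_exe_n"][w] = sb["last_exe_n"]
    # 6. Re-send the read set of x, tagged with the new ExeN.
    return [(operand, sb["last_exe_n"]) for operand in sb["wcts"].get(x, [])]
```

The lazy cleaning of requests in remote StoreBuffers is not modelled here.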
Erasing Old Operands
One problem with speculative execution is that the operands from old executions compete for resources with operands from the current execution. In the worst case, old operands may find their way to memory access instructions and generate memory requests that will not be answered by the memory system. To make things worse, those operands will not be consumed from the PE queues and may eventually spill into memory, further degrading system performance. We propose two mechanisms to address this problem by removing older operands from processing elements: the MT Checkups and the Execution Maps.
MT Checkups scan received operands in the matching tables in order to find all operands to the same instruction, wave, thread and application, but with different ExeN . Operands from older executions should be erased, whereas the newer operand is inserted in the matching table. If the received operand is older, it is ignored. If their execution is the same, the operand is just inserted in the matching table.
For operands that are the first to arrive at an instruction, MT Checkups are not useful, since there are no other operands to compare against. Also, old executions are usually ahead of new ones, and the former will only be reached if they get stuck in memory access instructions. Execution Maps give the PEs the ability to eliminate operands based on their local view of the execution. A table containing pairs of < Wave, ExeN > is kept in each PE to hold information on allowed operands. If a PE has the pairs < 0, 0 > and < 5, 1 > in its Execution Map, it means that for Waves 0 to 4 operands with ExeN ≥ 0 can be accepted and, for waves ≥ 5, only operands with ExeN ≥ 1 can be accepted. If an operand with Wave number 3 and ExeN = 1 arrives, it will be accepted and the pair < 5, 1 > will be replaced by the pair < 3, 1 > in the Execution Map.
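The behaviour in this example can be reproduced with the following Python sketch of an Execution Map. The update rule is a generalisation assumed for illustration: a newer ExeN seen for wave W supersedes every pair of a wave ≥ W whose ExeN is not newer. The class and method names are hypothetical.

```python
import bisect


class ExecutionMap:
    """Per-PE table of < Wave, ExeN > pairs, kept sorted by wave."""

    def __init__(self):
        self.pairs = [(0, 0)]  # from wave 0 on, ExeN >= 0 is allowed

    def min_exe_n(self, wave):
        """ExeN required for `wave`: taken from the last pair at or before it."""
        i = bisect.bisect_right(self.pairs, (wave, float("inf"))) - 1
        return self.pairs[i][1]

    def accept(self, wave, exe_n):
        """Return False for operands of old executions; learn from newer ones."""
        required = self.min_exe_n(wave)
        if exe_n < required:
            return False
        if exe_n > required:
            # Pairs of waves >= wave that are not newer become redundant.
            kept = [(w, e) for (w, e) in self.pairs if w < wave or e > exe_n]
            kept.append((wave, exe_n))
            self.pairs = sorted(kept)
        return True
```

Starting from an empty map, `accept(5, 1)` yields the pairs < 0, 0 > and < 5, 1 >. A later `accept(3, 1)` returns True and leaves < 0, 0 > and < 3, 1 >, after which an operand of wave 4 with ExeN = 0 is rejected.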
It is important to observe that those mechanisms do not guarantee that all old operands are going to be erased from the PEs, but they can minimize the explosion of parallelism caused by the Transactional WaveCache.
Experiments and Results

Methodology
Currently, WaveScalar programs must be compiled using the Alpha Tru64 cc compiler and then translated to WaveScalar assembly using a binary translator. The binary translator does not support full Alpha assembly yet, hence most programs are run partially as WaveScalar and partially as Alpha programs. In this work, we extended the Kahuna WaveScalar architectural simulator [16] to support the Transactional WaveCache. The Transactional WaveCache does not yet support multi-threaded execution or Decoupled Stores. Alpha emulation support is available, but before control is transferred to the Alpha simulator, we need to stall speculation and wait for all waves to commit.
A set of benchmarks from Spec 2000, Mediabench and Mibench was used to evaluate performance. The benchmarks were not executed to completion and, since the number of instructions executed can vary according to the Speculation Window, we changed the simulator to allow a semantic cut that is done upon commit. Doing so, we can compare the memory state for every scenario with the original WaveScalar. For the Spec 2000 benchmarks (mcf, art and equake) we executed the waves corresponding to the first 2 million instructions. They were compiled using the O3 optimization. The Mediabench (G721 and EPIC) and the Mibench (CRC) benchmarks were compiled using O2 and the first 500 thousand instructions were executed. Table 1 shows the architectural parameters used in the simulations.
All the applications were executed in the original WaveScalar, without any memory disambiguation mechanisms, and also with Decoupled Stores. They were also executed using the Transactional WaveCache with a Speculation Window size of 2. In order to have a more detailed understanding of the effect of the Speculation Window size on performance, for EQUAKE and EPIC we also executed with Speculation Window sizes of 3, 5, 10, 20 and larger, up to Infinite.
Figure 3 shows the speedups of the Transactional WaveCache (TWC) and Decoupled Stores for all applications compared to the original WaveScalar without memory disambiguation. Floating point applications (ART and EQUAKE) show a significant speedup, even for a small Speculation Window (1.35 and 1.31, respectively). MCF and EPIC perform many function calls that are emulated in the Alpha machine. Since we had to disable speculation to do that, we only get the overhead of the Transactional WaveCache when that happens. A speedup of 1.02 for MCF and a slowdown to 0.96 for EPIC were observed. This is caused by emulated function calls, which would not be necessary if we had a WaveScalar compiler that could translate all instructions.
If memory accesses happen very often, memory concurrency will be high and the Transactional WaveCache will tend not to provide speedups. Moreover, since we execute only the first instructions of each benchmark, a long variable initialization phase means a long period in which our mechanism does not help the final performance much.
Results
Decoupled Stores performs better than the Transactional WaveCache for most applications. Since both techniques are orthogonal, a combination of the two is possible and is the subject of ongoing work. Figure 4 shows the speedups achieved for Speculation Window sizes varying from 2 to Infinite for the EQUAKE benchmark. The speedup grows with the Speculation Window, although the maximum speedup (2.24) is reached for a Speculation Window of less than 20. This shows that the Transactional WaveCache structures can be small and still provide speedups. The same experiment was performed for EPIC. The slowdown is constant (0.96) for all Speculation Window sizes, showing that the window size has no influence on the speedup, since speculation is disabled very often due to emulated function calls.
Figure 4. Equake - Speedup x Speculation
Among all the benchmarks, ART was the only application where RAW hazards, and therefore re-executions, happened. This was expected for small Speculation Windows, but for EQUAKE and EPIC, RAW hazards were not found even for large Speculation Windows. EPIC is limited by emulated function calls, and EQUAKE does not access the same memory address very often, which explains the high speedups achieved with no hazards found.
Conclusions and Future Work
The advent of multi-cores has kindled interest in novel architectures such as WaveScalar, a DataFlow-style architecture that can run imperative programs. We present the Transactional WaveCache, a memory disambiguation technique for WaveScalar that allows memory operations from different waves to execute out of order in a speculative way. Initial results show that very significant speedups can be achieved through this mechanism, even in the presence of hazards. We believe that our results show that aggressive speculation on memory accesses can improve performance for WaveScalar-style architectures.
Our results confirm previous work indicating that the default WaveScalar memory architecture could be a limiting factor in performance. We obtain results similar to those of the Decoupled Stores mechanism, which performs memory disambiguation within a wave. Since the Transactional WaveCache acts across different waves, both contributions are orthogonal, suggesting that a combination of the two techniques should be a subject of further study.
Our work motivates a large number of future research directions. First, our implementation relies on prior work: the Kahuna simulator and the WaveScalar binary translator. As a next step, we plan to improve both systems so that the Transactional WaveCache can run more applications. This will allow a more complete understanding of the benefits and limitations of our technique.
We should also note that our design is one point in the very large design space that has been explored by the Transactional Memory community. In the future, and benefitting from our experience with the Transactional WaveCache, we plan to study alternative designs, namely towards reducing design complexity. We also plan to study other user and/or low-level mechanisms to throttle parallelism, and to study whether these data structures could also be used to achieve synchronization.
