High-end embedded systems such as smart phones, game consoles, GPS-enabled automotive systems, and home entertainment centers are becoming ubiquitous. Like their general-purpose counterparts, and for many of the same energy-related reasons, embedded systems are turning to multicore architectures. Moreover, as the demand for more compute-intensive capabilities in embedded systems increases, these multicore architectures will evolve into manycore systems for improved performance or performance/area/Watt. These systems are often organized as cluster-based Non-Uniform Memory Access (NUMA) architectures that provide the programmer with a shared-memory abstraction, with the cost of sharing memory (in terms of performance, energy, and complexity) varying substantially depending on the locations of the communicating processes. This paper investigates one of the principal challenges presented by these emerging NUMA architectures for embedded systems: providing efficient, energy-effective, and convenient mechanisms for synchronization and communication. We propose an initial solution based on hardware support for speculative synchronization.
INTRODUCTION
A speculative execution in a multiprocessor system is one in which a processor executes a block of code in the presence of potential data conflicts with concurrent computations. Speculation is attractive in situations where conflict avoidance is expensive, but actual conflicts are rare. Speculative mechanisms have been the subject of prior work; however, they tend to focus on throughput and ease of use. While we share these goals, we feel that embedded systems demand an equal focus on energy efficiency. Locks, which are often used to guarantee memory consistency in shared-memory systems, typically require energy-expensive read-modify-write operations that traverse the memory hierarchy. In contrast, speculative computations are typically restricted to the more energy-efficient L1 (or sometimes L2) caches, relying on native cache-coherence mechanisms to detect conflicts. Support for short speculative executions can be provided via a hardware mechanism similar to existing cache-coherence mechanisms. However, without careful consideration for energy efficiency, the extra hardware required to manage data versioning, conflict detection, and recovery policies may not provide enough of a performance/Watt improvement to justify its adoption, especially in embedded systems. Moreover, while existing designs for speculative mechanisms are a natural fit for single-cluster architectures where the cores communicate through a shared cache, they require careful rethinking for NUMA systems where inter-cluster communication is more restricted.
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. MES '13
The principal objective of our work is to investigate how transparent application of speculative hardware can improve performance and energy consumption for multi-threaded applications in embedded many-core, cluster-based NUMA architectures. Any practical implementation of transparent speculation for NUMA-based embedded systems must emphasize simplicity; complex hardware designs that require extensive changes to established protocols are unlikely to influence current practice. Similarly, transparent speculation must be integrated into practical and familiar programming environments to be influential. This is a work in progress, so our emphasis in this paper is on describing implementation details and challenges.
BACKGROUND
Various approaches to speculative execution have been proposed. Transactional memory, Lock Elision, and Transactional Lock Removal (TLR) are hardware speculation techniques that allow critical sections without run-time conflicts to execute in parallel, without acquiring locks [4], [12], [13], [9]. If a data conflict does take place, it is detected, and one or more of the conflicting threads is rolled back and restarted, either speculatively or by acquiring a lock. Speculative Lock Elision (SLE) [12] is (almost) transparent: there is no need to restructure applications. Instead, it relies on annotations to delimit speculative critical sections. TLR uses SLE to transparently convert lock-based critical sections into lock-free optimistic transactions. While SLE and TLR both focus on transparent speculation, they do not examine the energy implications of their proposed strategies.
Speculation for embedded devices raises two challenges: how to keep the underlying hardware simple, and how to keep the software interface simple. While the aforementioned speculative techniques provide performance advantages on general-purpose platforms, their implementations are not focused on embedded platforms, which are more constrained in terms of both hardware and software resources. In addition, the proposed solutions do not consider cluster-based NUMA architectures, which may limit their scalability. The principal challenge addressed by this project is how to rethink the design of transparent hardware speculation mechanisms for embedded platforms in the context of cluster-based NUMA architectures.
Recently, Intel [5] and IBM [1] announced new processors with direct hardware support for speculative transactions, and it seems likely that others will follow suit. The IBM transactional memory mechanism, like ours, is intended for a clustered architecture, although IBM's mechanism works only within a single cluster.
Several researchers have investigated hardware transactional memory designs that exploit network-on-chip (NoC) communication. Kunz et al. [6] evaluated a LogTM [9] implementation on a NoC architecture and found that it provides better performance and energy savings than locking for certain configurations. Meunier and Petrot [8] evaluated the performance of a novel embedded HTM implementation based on a write-through cache coherence policy. Our proposed research also focuses on both performance and energy, and on how they are affected by the choice of conflict resolution policy, but our emphasis will be on scalability, transparency, and simplicity of design.
TARGET ARCHITECTURE
The proposed work is developed within a virtual platform environment called MICSIM, which is capable of simulating a cluster-based many-core architecture at a cycle-accurate level and is especially designed for embedded MPSoC design space explorations [2]. Inspired by recent many-core chips such as the Plurality Hypercore Architecture Line (HAL) [11], the STMicroelectronics p2012/STHORM [7], or even GPGPUs such as NVIDIA Fermi [10], the platform is composed of multiple computing clusters and is highly modular. To integrate hundreds of cores on the same chip in a scalable manner, these products adopt a hierarchical design, where simple processing units (PUs) are grouped into small to medium-sized subsystems (the clusters) that share a high-performance local interconnect and memory. Scaling to larger system sizes is enabled by replicating clusters and interconnecting them with a scalable network-on-chip (NoC) medium. The basic cluster architecture, shown in Figure 1, consists of a configurable number of 32-bit ARMv6 processors, private L1 instruction caches (one per processor), and a shared multi-ported and multi-banked L1 tightly coupled data memory (TCDM). A logarithmic interconnect supports communication between the processors and the L1 memory banks. The cluster can handle N simultaneous memory requests to the TCDM, where N is the number of processors in the cluster. If all N requests are to different banks, they are all serviced simultaneously; otherwise, requests to the same bank are serialized. The logarithmic interconnect behaves as a Mesh-of-Trees and provides fine-grained address interleaving across the memory banks to reduce banking conflicts. Accesses from a processor to memory outside its cluster go through a peripheral interconnect.
The off-cluster (L2/L3) bus coordinates the accesses and services the requests in round-robin fashion.
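The bank serialization rule described above for the TCDM can be illustrated with a small sketch. It assumes word-level interleaving (consecutive 32-bit words map to consecutive banks) and one serviced request per bank per cycle; the exact interleaving granularity of the platform is not stated here, and the names `bank_of` and `service_cycles` are hypothetical.

```python
from collections import Counter

WORD_BYTES = 4  # assumption: interleaving at 32-bit word granularity


def bank_of(address: int, num_banks: int = 16) -> int:
    """Map a byte address to a bank under word-level interleaving."""
    return (address // WORD_BYTES) % num_banks


def service_cycles(addresses, num_banks: int = 16) -> int:
    """Cycles needed to service one batch of simultaneous requests.

    Requests to distinct banks proceed in parallel; requests that hit
    the same bank are serialized, so the busiest bank sets the latency.
    """
    per_bank = Counter(bank_of(a, num_banks) for a in addresses)
    return max(per_bank.values(), default=0)
```

Under these assumptions, four requests to consecutive words complete in one cycle, while three requests whose addresses differ by a multiple of 64 bytes all land on the same bank and take three cycles.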
IMPLEMENTATION
Since embedded systems are resource-constrained, specialized hardware plays an important (but not exclusive) role in the synchronization mechanisms we develop. Next we describe our design choices and present the implementation of our algorithm.
Data Versioning
One very important aspect of the transactional support design is data versioning, i.e., the process used to maintain the speculative and non-speculative versions of data in the system. Like conflict detection and resolution, data versioning can be implemented in an eager or a lazy manner. Lazy data versioning leaves the old copies of transactional data in place and stores the new speculative data in other memory locations or transactional buffers. Keeping the original data in their initial location makes aborts very fast, but it lengthens transactions because extra time is needed on commit to write the speculative data back to memory. Eager data versioning, on the other hand, stores the speculative data in place and keeps the original values elsewhere. This makes commits faster but increases the abort recovery time, so it can be problematic for applications with high contention. Since we expect commits to be frequent in a transactional system, however, eager data versioning should yield significant performance benefits.
For our design, we decided to adopt an eager data versioning scheme. In particular, we borrow the idea of [9], in which the original values are stored in a software log with a stack structure. In [9], a per-thread transaction log is created in cacheable virtual memory; it holds the original values, together with their virtual addresses, of all data blocks that have been modified during a transaction carried out by the thread. In our case, though, instead of keeping a log per transaction, we use a full mirroring technique: for every address in the memory space, we create a mirror address that holds a copy of the original data, to be restored on a potential abort. Although this choice is more wasteful in terms of memory resources than keeping a dedicated log for each transaction, it simplifies the design for now and gives us an initial point of reference for the potential performance benefit. In future work, we plan to implement a log-based scheme with some similarity to [9] that uses the off-chip L2 memory to allocate part of the logs.
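As a rough software illustration of eager versioning with full mirroring, the sketch below keeps one mirror slot per address and restores from it on abort. It is a toy model, not the hardware design: the class name `MirroredMemory`, the per-core `dirty` sets, and the rule that only the first transactional write to an address saves the original value are assumptions made for the example.

```python
class MirroredMemory:
    """Toy model of eager data versioning with a full mirror."""

    def __init__(self, size: int):
        self.data = [0] * size     # speculative values are written in place
        self.mirror = [0] * size   # one mirror slot per address
        self.dirty = {}            # hypothetical: core id -> addresses written

    def tx_write(self, core: int, addr: int, value: int) -> None:
        written = self.dirty.setdefault(core, set())
        if addr not in written:
            # Save the original value only once, before the first write.
            self.mirror[addr] = self.data[addr]
            written.add(addr)
        self.data[addr] = value

    def commit(self, core: int) -> None:
        # Eager versioning: data is already in place, so commit only
        # discards the bookkeeping.
        self.dirty.pop(core, None)

    def abort(self, core: int) -> None:
        # Abort is the slow path: copy every original value back.
        for addr in self.dirty.pop(core, set()):
            self.data[addr] = self.mirror[addr]
```

The asymmetry discussed above is visible in the model: `commit` does constant work per transaction, whereas `abort` performs one copy per written address.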
Transactional Support Modules
One of the main challenges imposed by the cluster-based MICSIM platform is that no data caching system is available, so we cannot use part of the data caches to hold transactional data, as was done in previous works. With no hardware caching support to handle transactions, we have to rely on the TCDM without adding extra dedicated hardware modules to hold transactional data. This means that the TCDM must hold both speculative and non-speculative data.
Another challenge that we face when migrating to a NUMA architecture is that the lack of a shared bus complicates conflict detection and management. In previous works, such as [3], the bus data traffic was monitored by a centralized module (called the Bloom Module) that was responsible for detecting conflicts and deciding which cores should be aborted upon a conflict. In our NUMA system, monitoring all ongoing traffic through the logarithmic interconnect is more complicated, since up to N transactions (where N is the number of cores), rather than just one, could be using the interconnect in the same cycle. To adopt a centralized solution on our platform, we would need to redirect all ongoing transactional traffic to a central module, which would create significant overhead. All ongoing transactions would need to stall until copies of their data had reached and been served by the central module and the conflict management decisions had been completed. This could become a major bottleneck for system performance. It could also lead to a scalability problem: as we scale up to more processors and clusters, handling all transactional management in a centralized manner would become very complicated.
For these reasons, we concluded that the transactional management needs to be distributed. We divide the conflict detection and resolution responsibilities between the TCDM memory banks. Since the banks already receive all ongoing memory requests in the system, either transactional or not, they are automatically aware of the ongoing system traffic. By adding Transaction Support Modules (TSM) at the banks, we can decentralize the conflict detection and the resolution decision process. In this way we resolve the scalability problem, since the transactional management bandwidth will scale naturally with the number of banks.
The TSM of each bank receives all transactions coming to the bank and is aware of which cores are executing in speculative (i.e., transactional memory) mode. The decision is made locally at each bank's TSM, and the corresponding processor is notified and either aborts or commits according to the decision. For each accessed data line of a bank, the bank must know all the processors that have transactionally read or written that location, so that it can decide whether each transactional access to that data line triggers a conflict. The extra hardware required for this bookkeeping is an array of k r-bit vectors at each bank, where k is the number of address locations in the bank, r = N + log2(N), and N is the number of cores. The first log2(N) bits of each vector store the ID of the owner of the memory location (i.e., the core that is writing the location transactionally), and the following N bits indicate, for each core in the system, whether it is reading the memory location. In our experiments we use N = 16 cores, M = 16 banks, and 20-bit flag vectors.
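The vector width follows directly from r = N + log2(N): for N = 16 this gives 16 + 4 = 20 bits, matching the experimental configuration. The sketch below computes the width and packs one vector, assuming the owner ID occupies the high-order bits and the reader flags the low-order N bits. That bit order is an assumption, as is the absence of a "no owner" encoding, which the description does not specify; the function names are hypothetical.

```python
def flag_vector_width(num_cores: int) -> int:
    """r = N + log2(N), for N a power of two."""
    owner_bits = num_cores.bit_length() - 1
    if num_cores < 1 or (1 << owner_bits) != num_cores:
        raise ValueError("num_cores must be a power of two")
    return num_cores + owner_bits


def pack_flags(owner_id: int, reader_mask: int, num_cores: int) -> int:
    """Pack owner ID (high bits) and per-core reader flags (low N bits)."""
    if not 0 <= owner_id < num_cores:
        raise ValueError("owner_id out of range")
    if reader_mask >> num_cores:
        raise ValueError("reader_mask wider than num_cores bits")
    return (owner_id << num_cores) | reader_mask


def unpack_flags(vector: int, num_cores: int):
    """Return (owner_id, reader_mask) from a packed flag vector."""
    return vector >> num_cores, vector & ((1 << num_cores) - 1)


def bank_overhead_bits(locations_per_bank: int, num_cores: int) -> int:
    """Bookkeeping storage per bank: k vectors of r bits each."""
    return locations_per_bank * flag_vector_width(num_cores)
```

For a hypothetical bank with k = 1024 locations and N = 16, the bookkeeping array would occupy 1024 x 20 = 20480 bits.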
During any transactional access to a bank, its TSM checks the corresponding vector to determine whether a conflict is triggered. A transactional write to a memory location that is currently being read by another core triggers a conflict (and vice versa). A transactional read of a memory location that is also being read by other cores does not trigger a conflict. In this way we address both transaction bookkeeping and conflict detection.
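The detection rule can be written as a small predicate over the owner field and reader flags. This is a sketch under two assumptions not stated in the text: the absence of an owner is represented as `None`, and a core never conflicts with its own earlier accesses. The function name `triggers_conflict` is hypothetical.

```python
from typing import Optional


def triggers_conflict(is_write: bool, requester: int,
                      owner: Optional[int], reader_mask: int) -> bool:
    """Decide whether a transactional access conflicts with other cores.

    owner: core currently writing the location transactionally, or None.
    reader_mask: bit i is set if core i has transactionally read it.
    """
    other_readers = reader_mask & ~(1 << requester)
    owned_by_other = owner is not None and owner != requester
    if is_write:
        # A write conflicts with any other reader or writer.
        return owned_by_other or other_readers != 0
    # A read conflicts only with another core's transactional write.
    return owned_by_other
```

For example, a read by core 0 of a location read by cores 1 and 2 returns `False`, while a write by core 0 to the same location returns `True`.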
As mentioned before, we need to integrate a transactional support mechanism at each bank of the TCDM to manage bookkeeping and logging. This control logic must be able to send a request to the log space in order to save a log, so one extra master port is added at each bank. Figure 2 shows the modified design, with the necessary additional logic marked in red. The red dashed line shows an example of a path that could be followed when a log is saved. Suppose that an address in Bank 0 is written transactionally. Before the write, the extra control logic of Bank 0 sends a request along the red path to the demux, indicating that it has data to be saved in the log, namely the original data at the address being written. The demux decides internally whether the log can be saved in the L1 space or in the L2 memory and directs the data along the appropriate path. If the log is saved in L1 memory, it is sent through the logarithmic interconnect to the bank that holds the log. Following the opposite direction, the data can be retrieved in case of an abort.
Transaction Control Flow
This section describes the algorithm implemented by our transactional support framework. When a transaction starts, the core issues a memory read to a special memory-mapped register that we call the transactional register. When this request reaches the memory, a bit corresponding to the requesting core is set to signal that the core is executing a transaction. At the start of a transaction, the core also saves its internal state (program counter, stack pointer, and other internal registers) so that it can revert if an abort occurs during the transaction. All transactional memory accesses are marked with a special transactional bit, which is set by the core wrapper when the accesses are issued to the system. Reaching the end of a transaction triggers another special memory access, this time to a memory-mapped register called the commit request register. This activates a process at the memory-bank level that clears all the transactional flags and saved logs associated with that core's transactional accesses.
Figure 3 depicts the control flow for the Transactional Support Module (TSM) at each memory bank. In each cycle, each TSM checks whether the data received at the slave port of its bank are transactional by reading the transactional bit. If they are, it determines whether the request is a read or a write. A read request falls into one of two cases: either it is a normal read from a core that is performing a transactional read, or it is a request from another bank to read a saved log value. In the latter case, the data saved in the log are read. In the former case, a different process is followed.
First, the check_conflict() function is called to determine whether the read operation triggers a conflict, by checking the read/write flag vector of the corresponding address. If the same address is being read by other cores in a transaction, without being written by any core, no conflict is triggered. In that case, the update_flags() function is called to add the new read flag to the accessed memory location, and the read is then performed normally. If the address is owned by a core, i.e., a core has written the address during a transaction, a conflict is triggered. The resolve_conflict() function is then called to decide which of the cores currently accessing the address transactionally have to abort. This decision depends on the chosen conflict resolution policy. As a starting point, we follow a simple rule: the requestor, i.e., the core that issued the access that triggered the conflict, is aborted. On completion, resolve_conflict() calls the abort_transaction() function, which is responsible for notifying the cores that need to abort. Upon receiving the notification, these cores restore their saved internal state and respond with an acknowledgment. Control then passes back to abort_transaction(), which calls the restore_data() and clean_flags() functions. The first restores the originally saved data from the logs to their original address locations, and the second clears the read/write flags of the aborted core.
For a write request, the control flow is very similar to that of a read request. One important difference is that the decision on whether a conflict is triggered is based on different criteria: if the memory location is currently being either read or written by any other core, a conflict is triggered. If no conflict is triggered, the update_flags() function is called to record the ID of the new owner of the address location. If the address is owned by another core, that should automatically trigger an abort for that core, since we follow an eager conflict detection policy. Another important difference from the read request is that, before the write is performed, the original data of the memory location must be saved into the log space. To accelerate this process, some of the aforementioned operations happen simultaneously: while update_flags() is being called, the original value at the address being written can be read through the master port of the bank to which the address belongs. Next, that original value is saved into the log space through an extra write_log() function.
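The read and write paths described above can be summarized in a toy software model of a single bank's TSM. It reuses the function names from the text and applies the requestor-aborts rule to both reads and writes, which is one reading of the policy. Everything else is an assumption made for illustration: the Python data structures, the `access` and `commit` entry points, and the omission of the core-side acknowledgment and state restore.

```python
class TransactionSupportModule:
    """Toy model of one bank's TSM with a requestor-aborts policy."""

    def __init__(self, num_words: int):
        self.data = [0] * num_words
        self.owner = [None] * num_words                   # transactional writer
        self.readers = [set() for _ in range(num_words)]  # transactional readers
        self.log = {}                                     # (core, addr) -> original

    def check_conflict(self, core, addr, is_write):
        owner = self.owner[addr]
        if owner is not None and owner != core:
            return True  # location is being written by another core
        return is_write and bool(self.readers[addr] - {core})

    def update_flags(self, core, addr, is_write):
        if is_write:
            self.owner[addr] = core
        else:
            self.readers[addr].add(core)

    def resolve_conflict(self, requester, addr):
        return [requester]  # simple rule: the requestor aborts

    def write_log(self, core, addr):
        # Keep only the value seen before the core's first write.
        self.log.setdefault((core, addr), self.data[addr])

    def restore_data(self, core):
        for key in [k for k in self.log if k[0] == core]:
            self.data[key[1]] = self.log.pop(key)

    def clean_flags(self, core):
        for addr in range(len(self.data)):
            self.readers[addr].discard(core)
            if self.owner[addr] == core:
                self.owner[addr] = None

    def abort_transaction(self, cores):
        for core in cores:
            self.restore_data(core)
            self.clean_flags(core)

    def commit(self, core):
        for key in [k for k in self.log if k[0] == core]:
            del self.log[key]
        self.clean_flags(core)

    def access(self, core, addr, is_write, value=None):
        """Return ('ok', data) or ('abort', list_of_aborted_cores)."""
        if self.check_conflict(core, addr, is_write):
            victims = self.resolve_conflict(core, addr)
            self.abort_transaction(victims)
            return ("abort", victims)
        self.update_flags(core, addr, is_write)
        if is_write:
            self.write_log(core, addr)  # save the original before overwriting
            self.data[addr] = value
            return ("ok", value)
        return ("ok", self.data[addr])
```

In this model, swapping the conflict resolution policy only requires changing `resolve_conflict`, which mirrors the separation between detection and resolution in the control flow.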
CONCLUSIONS AND FUTURE WORK
In this paper we propose a transactional support mechanism for handling transactions in a NUMA-based architecture. We introduced the idea of distributing conflict detection and resolution across multiple transaction support modules. As a first step, we have restricted transactions to a single cluster, and we are currently focusing on the simplest transactional handling schemes, such as eager conflict detection and resolution. We expect that our preliminary experimentation will give us a good understanding of how
