As multicore and manycore processor architectures emerge and core counts per chip continue to increase, it is important to evaluate and understand the performance and scalability of Parallel Discrete Event Simulation (PDES) on these platforms. Most existing architectures are still limited to a modest number of cores, feature simple designs and do not exhibit heterogeneity, making it impossible to perform a comprehensive analysis and evaluation of PDES on these platforms. Instead, in this paper we evaluate PDES using a full-system cycle-accurate simulator of a multicore processor and memory subsystem. With this approach, it is possible to flexibly configure the simulator and explore the impact of architecture design choices on the performance of PDES. In particular, we answer the following four questions with respect to PDES performance and scalability: (1) For the same total chip area, what is the best design point in terms of the number of cores and the size of the on-chip cache? (2) What is the impact of using in-order vs. out-of-order cores? (3) What is the impact of a heterogeneous system with a mix of in-order and out-of-order cores? (4) What is the impact of object partitioning on PDES performance in heterogeneous systems? To answer these questions, we use the MARSSx86 simulator for evaluating performance, and rely on the CACTI and McPAT tools to derive the area and latency estimates for cores and caches.
INTRODUCTION
Parallel Discrete Event Simulation (PDES) is a fine-grained application with irregular communication patterns and frequent synchronization. For these reasons, its performance and scalability have been severely constrained on computing platforms with high communication delays. The continuing emergence of multi-core and many-core architectures offers significant promise with respect to PDES performance. Several recent studies evaluated the performance impact of multi-core chips on PDES [2, 14, 15, 7, 31]. However, these studies are limited to existing hardware platforms with a modest number of cores per chip [2, 14], or to specialized chips with larger core counts, such as the Tilera Tile64 architecture [15]. Therefore, while these prior works are still useful and more such studies are likely to emerge as new multi-core processors are developed and brought to market, they can only answer a limited set of questions, and it is unclear whether their conclusions can be easily generalized beyond the specific platforms and organizations being investigated.
In this paper, we expand the scope of PDES investigations on multi-core and many-core systems by employing a simulation-based approach. Specifically, instead of executing PDES on real platforms, we study its behavior using a cycle-accurate full-system simulator of a multi-core system. Simulation-based studies are extensively used by the computer architecture community, where standard and emerging applications are evaluated against future hypothetical designs of single-core or multi-core processors and memory hierarchies. A major advantage of such an approach is that a simulator can be flexibly configured and the impact of a wide range of design options (most of which are not currently available in real designs, but could appear in some form in future processors) can be evaluated. In this framework, PDES serves as a benchmark of interest which we evaluate using the MARSSx86 simulator [27]. MARSS is a recently developed cycle-accurate simulator of x86-based multi-core processors and memory hierarchies.
To the best of our knowledge, this is the first study that evaluates PDES on emerging architectures (rather than on existing architectures). Such a simulation-based approach allows us to evaluate a much wider range of systems and to explore issues related to the impact of architecture on PDES performance and scalability that are otherwise impossible to evaluate using traditional measurements on existing systems. Furthermore, using circuit, timing and area analysis tools, we study and compare PDES performance for various chip designs with equivalent area, taking into account the area used by the cores and the on-chip cache hierarchy. Alternatively, we can also express possible performance and scalability improvements in PDES as a function of the additional chip area needed to achieve these improvements.
Specifically, this paper focuses on the following explorations and addresses the following questions:
First, we evaluate the trade-off between increasing the core count on a chip and using a larger on-chip last-level cache. The optimal breakdown between these two resources depends on the applications running on the system. While placing more cores on a chip would allow for a higher degree of parallelism without having to cross the chip boundary, larger caches reduce expensive off-chip memory accesses. To understand this tradeoff in the context of PDES, we study the performance of PHOLD executing under the ROSS simulator [6] on simulated hardware systems with several core/cache configurations, ranging from a design where most of the chip area is dedicated to processing cores to a design where most of the area is dedicated to the on-chip caches.
Second, we evaluate the impact of the individual core designs on the performance of PDES. Specifically, we consider the choice between large out-of-order superscalar cores, smaller out-of-order cores, and simple in-order cores. In scenarios where fast event processing is on the critical path, more complex cores can offer a significant performance boost. However, since PDES is often dominated by communication, synchronization and rollback overhead, it is also conceivable that in-order cores will not result in a significant overall slowdown for many models. At the same time, simpler and smaller cores allow more of them to be placed on a chip of the same total area, thus increasing the low-latency parallelization opportunities; our study considers these trade-offs.
Third, we investigate PDES performance issues on a heterogeneous many-core system that is composed of a number of high performance out-of-order cores and a number of small in-order cores. A particular question that we investigate on these systems with respect to PDES is whether or not a reconsideration of PDES object-to-core partitioning algorithms is required to adjust to the system heterogeneity. While previous work only considered evenly-sized partitions, we evaluate scenarios where a larger number of objects (and therefore, more work) is assigned to more powerful cores.
All simulation-based studies presented in this paper have been conducted using MARSS [27], a cycle-accurate full-system x86-64 architecture simulator. To obtain area estimates for various core designs with private caches, we used the McPAT tool [21], so that all our designs are compared for the same total area. We obtained the latencies for differently sized last-level caches using the CACTI 5.3 tool [29]. The PDES workloads are based on classical PHOLD [10] models.
The key results and conclusions of our experiments are the following:
• For an area-unconstrained chip where the Level 3 (L3) cache size is fixed and core types are compared at the same core count, the best core choice is the modestly aggressive 2-way superscalar out-of-order core. For the simple PHOLD models, these cores perform nearly identically to much more aggressive 4-way out-of-order superscalar cores, while significantly outperforming simple in-order cores.
• For an area-constrained chip, a larger number of simple cores results in higher performance than a design with a smaller number of more powerful cores for PHOLD models. That is, the advantages of thread-level parallelism that can be extracted from multiple cores on the same chip outweigh the advantages of instruction-level parallelism available within each core. Specifically, we show that the best performance is obtained when the largest possible number of small in-order cores is used.
• The size of the shared L3 cache has a limited impact on simulation performance (for PHOLD models), and it is more advantageous for performance to increase the number of cores at the expense of a smaller last-level cache. This is because significant locality in the private L1 and L2 caches makes the L3 cache less critical for performance. This conclusion holds even for the variation of PHOLD that intentionally increases the memory pressure by introducing an array manipulation loop inside each event.
• Heterogeneous multi-cores are detrimental to PDES performance, as the synchrony of the simulation progress on multiple (heterogeneous) cores is naturally distorted, resulting in the loss of efficiency. However, the degree to which performance is impacted depends on the composition of the system (e.g. the nature and number of individual cores), and some designs have only modest performance degradation. In general, the performance is limited by the slowest cores in the system, thus wasting the extra processing capabilities of more powerful cores.
• A heterogeneity-aware PDES object partitioning may somewhat alleviate the performance challenges of heterogeneous designs, but our initial results show that naive partitioning modifications (such as doubling the number of objects placed on the more powerful cores) only lead to further performance problems. Future research is needed to determine the proper partitioning for such systems.
The rest of the paper is organized as follows. Section 2 provides background on the key PDES features relevant to this study. Section 3 describes our evaluation methodology and the tools used for evaluating performance, area and latency. Section 4 presents the results of our experiments and discusses them. Section 5 describes the related work, and we conclude in Section 6.
PDES BACKGROUND
In Discrete Event Simulation (DES), a model consists of a set of simulation objects with associated state [8, 20, 34] . For example, in a logic simulation, the objects may be the different gates in the model. The simulation begins with a number of scheduled events. For example, the initial list of scheduled events may be a set of test vectors presented at specified times to the inputs of the circuit. Events (containing timestamps that denote when the event is to take effect) are ordered by simulation time in a pending event queue. Simulation proceeds by processing the event with the earliest timestamp, which can cause changes in the simulation state (e.g., by changing the state of a gate) and schedule one or more future events (e.g., changing the value at the input of gates connected to the gates whose output logic level changed). Simulation time then advances to the timestamp of the next pending event. Simulation terminates when there are no more events or when a predetermined simulation state is reached. Parallel Discrete Event Simulation (PDES) leverages parallel processing to accelerate the performance and capacity of DES [11] . The simulation model is partitioned across multiple simulation processes (called Logical Processes, or LPs). Each LP maintains a local pending event queue and carries out simulation as described above, repeatedly processing the earliest time stamp event. A locally processed event may generate events to remote LPs (that is, affecting state changes of an object managed by a remote LP). Thus, LPs communicate by exchanging time-stamped event messages [3, 9, 23, 28] . Correct simulation requires that all events be processed in their causal order such that parallel execution produces simulation results that are consistent with serial execution. Therefore, a synchronization model is needed to ensure that remote events are processed in their proper causal order.
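The sequential event loop described above can be sketched as follows. This is a minimal illustration, not the ROSS implementation: the names `run_des` and `toggle_gate`, the tuple layout of an event, and the 1.5 time-unit gate delay are all assumptions made for the example.

```python
import heapq
from itertools import count


def run_des(initial_events, handler, end_time):
    """Process events in timestamp order until none remain or end_time is passed.

    initial_events: iterable of (timestamp, target, payload) tuples.
    handler(now, target, payload, state) returns a list of newly scheduled
    (timestamp, target, payload) tuples.
    """
    seq = count()  # tie-breaker so equal timestamps never compare payloads
    queue = [(ts, next(seq), tgt, data) for ts, tgt, data in initial_events]
    heapq.heapify(queue)  # the pending event queue, ordered by simulation time
    state = {}
    processed = []
    while queue:
        now, _, target, payload = heapq.heappop(queue)  # earliest timestamp first
        if now > end_time:
            break
        processed.append((now, target))
        for ts, tgt, data in handler(now, target, payload, state):
            if ts < now:
                raise ValueError("events may only be scheduled in the future")
            heapq.heappush(queue, (ts, next(seq), tgt, data))
    return processed, state


def toggle_gate(now, target, payload, state):
    """Hypothetical handler: flip a gate and notify the next gate 1.5 units later."""
    state[target] = not state.get(target, False)
    return [(now + 1.5, (target + 1) % 3, None)]


processed, state = run_des([(0.0, 0, None)], toggle_gate, end_time=4.0)
```

In a PDES setting, each LP would run a loop of this shape over its own local queue, with some of the events returned by the handler delivered to queues owned by other LPs.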
Optimistically synchronized PDES simulators (that we consider in this study) do not enforce causality during event processing. When an event is received with an earlier processing time than the current simulation time, a causality error is detected [18] . The error is recovered from by rolling back the simulation to an earlier state and canceling event messages that were sent prematurely. Optimistic synchronization potentially hides the latency of communication by allowing LPs to process speculatively instead of waiting for events. However, optimism can lead to a number of problems as the speculative computation can generate premature events that later prove to be erroneous, leading to cascading (propagating) rollbacks [11] .
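A minimal sketch of straggler detection and rollback in a single LP is shown below, assuming state saving on every event. The class `OptimisticLP` and its toy state update are hypothetical; the sketch deliberately omits the cancellation of prematurely sent messages and any global virtual time computation, both of which a real optimistic simulator needs.

```python
class OptimisticLP:
    """Toy logical process that processes events speculatively and rolls back."""

    def __init__(self):
        self.lvt = 0.0       # local virtual time: timestamp of the last processed event
        self.state = 0       # toy state whose final value depends on event order
        self.history = []    # (timestamp, payload, state_before) per processed event
        self.rollbacks = 0

    def _process(self, ts, payload):
        self.history.append((ts, payload, self.state))  # save state before the update
        self.state = self.state * 2 + payload           # order-sensitive update
        self.lvt = ts

    def receive(self, ts, payload):
        replay = []
        if ts < self.lvt:  # straggler: a causality error has been detected
            self.rollbacks += 1
            # Undo every event that was processed with a later timestamp.
            while self.history and self.history[-1][0] > ts:
                old_ts, old_payload, saved_state = self.history.pop()
                self.state = saved_state
                replay.append((old_ts, old_payload))
            replay.reverse()  # events were popped newest first
        self._process(ts, payload)
        for old_ts, old_payload in replay:  # re-execute the undone events in order
            self._process(old_ts, old_payload)


lp = OptimisticLP()
for ts, payload in [(1.0, 1), (3.0, 3), (2.0, 2)]:  # the last event arrives late
    lp.receive(ts, payload)
```

After the late event arrives, the LP ends in the same state it would have reached by processing the three events in timestamp order, at the cost of one rollback and one re-executed event.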
PDES is historically difficult to parallelize: it is a fine grained application with frequent communication and complex dynamic dependency patterns. These characteristics can limit the ability of a parallel simulator to exploit the naturally available parallelism in models. Event processing is typically of low computational complexity, as each event updates the state of an object and possibly schedules future events. Thus, significant communication occurs as events are generated to other objects. The amount of communication across cores depends on the locality of event messages. If event messages are frequently exchanged with remote LPs, significant communication occurs, placing substantial pressure on the memory subsystem of a multi-core processor. In the PHOLD benchmark that we use for this study, the amount of remote communication can be explicitly controlled. Additionally, as the degree of parallelism increases, contention for the use of the shared queues starts to play an important role.
EVALUATION METHODOLOGY
A number of tools and simulators were used in this study. Our modelling framework is summarized in Figure 1 .
In order to perform cycle-accurate simulation of PDES, we used MARSSx86 [27], a full-system simulator for x86 multi-core architectures. MARSS models different types of processing cores, including both in-order and out-of-order designs, and also provides flexible configurations for the on-chip cache hierarchy.
In order to estimate the area requirements of the individual cores, we used the McPAT tool [21]. The three types of cores that we considered in this study are: a) a large out-of-order core (Large OoO); b) a small out-of-order core (Small OoO); and c) a small in-order core. The configurations of each of these cores are listed in Table 1.
For the cache area, we used a cache byte equivalent area (CBE) model to estimate the size of the shared cache from the available area [13]. Under this model, the cache size $S$ is calculated from the available area $A_{avail}$ and the area $A_{unit}$ of a fixed unit size of cache, as shown in Equation 1:
$$S = \frac{A_{avail}}{A_{unit}} \qquad (1)$$
We used 4.75 sq. mm per MB as the unit area for the shared L3 cache. For different cache sizes, we derived cache access time estimates from the CACTI 5.3 tool [29]. We used the 32 nm technology node for both CACTI and McPAT. As a PDES simulation engine, we used ROSS-MT [14], a multi-threaded optimistic simulator specifically designed for multi-core environments. ROSS-MT is based on ROSS (Rensselaer's Optimistic Simulation System) [6], a PDES simulation engine with support for both conservative and optimistic Time Warp simulations.
Typically, PDES performance studies on real systems are performed using the PHOLD benchmark [10]. PHOLD is a simple but effective synthetic model for testing the performance of a simulation system. In basic PHOLD, the simulation model is a collection of Logical Processes (LPs). Each Processing Element (PE) is allocated an equal number of LPs. Each LP is initialized with the same number of initial events. The simulation progresses by each LP randomly selecting a target and scheduling an event to that target. When the target receives an event, it randomly selects another LP and schedules an event. At any time during the simulation, the total event population is kept the same. For this study, we utilized PHOLD, but extended it with two additional parameters. The first parameter controls the percentage of events that are generated remotely (i.e., sent to a simulation object running on a different core). The second parameter controls the event processing computational granularity (EPC), which reflects the amount of time that the CPU spends processing each event.
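The remote-event parameter can be illustrated with a short sketch of how a PHOLD handler might choose the destination of each new event. The function `pick_target`, the block-wise mapping of LPs to PEs, and the uniform choice among remote PEs are assumptions made for this example and are not taken from ROSS-MT.

```python
import random


def pick_target(src_lp, lps_per_pe, num_pes, remote_fraction, rng):
    """Choose the destination LP for a new PHOLD event.

    Assumes LPs are numbered so that PE k owns LPs
    [k * lps_per_pe, (k + 1) * lps_per_pe).
    """
    src_pe = src_lp // lps_per_pe
    if num_pes > 1 and rng.random() < remote_fraction:
        # Remote event: pick one of the other PEs uniformly at random.
        dst_pe = rng.randrange(num_pes - 1)
        if dst_pe >= src_pe:
            dst_pe += 1  # skip over the sender's own PE
    else:
        # Local event: stay on the sender's PE.
        dst_pe = src_pe
    return dst_pe * lps_per_pe + rng.randrange(lps_per_pe)


rng = random.Random(2013)  # fixed seed so the sketch is repeatable
targets = [pick_target(5, lps_per_pe=4, num_pes=3, remote_fraction=0.1, rng=rng)
           for _ in range(1000)]
remote_share = sum(t // 4 != 1 for t in targets) / len(targets)
```

With `remote_fraction` set to 0 every event stays on the sender's PE, and with 1 every event crosses to another PE, which is the knob the text describes for controlling inter-core communication.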
Using the modelling components described above, we performed evaluations in the following directions:
• A study of area-unconstrained homogeneous systems with fixed size of shared cache and variable number of cores.
• A study of area-constrained homogeneous systems, with a trade-off between the size of the on-chip cache and the number of cores. This includes the evaluation of PHOLD models with high memory pressure.
• A study of area-constrained heterogeneous systems composed of multiple types of cores.
For all area-constrained experiments, we assumed a chip area of 220 sq. mm for cores and caches, similar to Intel's Ivy Bridge design [17]. The clock rate is modeled at 3.2 GHz for all of the experiments. Within that constraint, we vary the number and type of processing cores and the amount of shared on-chip L3 cache to determine the design point that provides the best performance for PDES applications.
Our area-unconstrained model assumes an L3 shared cache of 16 MB, with the cache hit time of 17 cycles. Such a cache occupies 76 sq. mm of area under the CBE model. The configuration of ROSS that we used for most of our experiments is shown in Table 2 . All alterations of this configuration are explicitly described whenever they are used. 
Cache/Core Area Tradeoff
For evaluating the area tradeoffs, we conduct the experiments with increasing core count. Each core has private L1 and L2 caches, which are kept fixed and included in the core area. The shared L3 cache size is calculated from the area remaining for it, using the CBE model mentioned above. CACTI 5.3 is used to derive the latency estimates for the considered L3 cache sizes. Table 3 and Table 4 provide details of the cache sizes used in our experiments and their corresponding latencies (in cycles), which were derived from CACTI.
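The area bookkeeping behind this tradeoff can be sketched as below, assuming Equation 1 takes the form of available area divided by unit area (which is consistent with a 16 MB cache occupying 76 sq. mm at 4.75 sq. mm per MB). The function names are hypothetical, and the per-core area of 36 sq. mm used in the usage lines is a made-up placeholder, not a McPAT result from this study.

```python
UNIT_AREA_MM2_PER_MB = 4.75  # unit area of the shared L3 cache, from the text
CHIP_AREA_MM2 = 220.0        # area budget for cores and caches, from the text


def l3_area_mm2(size_mb, unit_area=UNIT_AREA_MM2_PER_MB):
    """Area occupied by a shared L3 cache of the given size under the CBE model."""
    return size_mb * unit_area


def l3_size_mb(num_cores, core_area_mm2,
               chip_area_mm2=CHIP_AREA_MM2, unit_area=UNIT_AREA_MM2_PER_MB):
    """L3 size that fits in the area left over after placing the cores.

    core_area_mm2 includes the private L1 and L2 caches of each core.
    """
    available = chip_area_mm2 - num_cores * core_area_mm2
    if available < 0:
        raise ValueError("the cores alone exceed the chip area budget")
    return available / unit_area


# Area left for cores once a 16 MB L3 cache has been placed.
core_budget = CHIP_AREA_MM2 - l3_area_mm2(16)

# Hypothetical core area, chosen only to exercise the function.
example_l3 = l3_size_mb(num_cores=4, core_area_mm2=36.0)
```

Sweeping `num_cores` upward with a real per-core area would reproduce the shape of the experiment: each added core removes a fixed slice of area from the L3 cache until no cache area remains.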
Modelling High Memory Pressure
In an attempt to stress-test the core-cache trade-offs, we designed a modified PHOLD model that increases the memory pressure. This was achieved by adding several memory operations to the processing of each event. Specifically, we used event-private circular buffers, such that every circular buffer contains an array of pointers to a number of discrete blocks in memory. In the event handler, the contents of each block are copied to another block (not contiguous with the source block) within the same circular buffer. This involves a series of memory operations that affect a memory area of BlockSize × BlocksPerCircularBuffer bytes for each event. The configuration of ROSS used for this experiment is shown in Table 5.
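The block-copy pattern can be illustrated with the following sketch. It only mirrors the access pattern in Python and says nothing about real cache behavior; the function names and the fixed stride used to pick a non-adjacent destination block are assumptions, since the text does not specify how the destination is chosen.

```python
def make_circular_buffer(block_size, blocks_per_buffer):
    """Build a buffer as a list of references to separately allocated blocks."""
    return [bytearray([i % 256]) * block_size for i in range(blocks_per_buffer)]


def touch_memory(blocks, stride=2):
    """Copy every block into another, non-adjacent block of the same buffer.

    Returns the number of bytes in the buffer, that is,
    BlockSize x BlocksPerCircularBuffer.
    """
    n = len(blocks)
    for src in range(n):
        dst = (src + stride) % n       # wrap around: the buffer is circular
        blocks[dst][:] = blocks[src]   # in-place copy of the whole block
    return n * len(blocks[0])


buffer_blocks = make_circular_buffer(block_size=64, blocks_per_buffer=8)
footprint = touch_memory(buffer_blocks)  # bytes touched by one event
```

Calling `touch_memory` once per event gives each event a memory footprint that scales with both parameters, which is the lever the modified PHOLD model uses to raise memory pressure.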
Evaluating Heterogeneous Processors
For this evaluation, we only consider systems with two different types of cores. Using the three types of core models described above, we formed three possible combinations of cores:
• Large Out-of-order + Small Out-of-order
• Large Out-of-order + Inorder
• Small Out-of-order + Inorder
As these experiments focus on the tradeoff between different types of cores, we fixed the shared L3 cache size at 16 MB with a latency of 17 cycles. Under the CBE model this cache occupies 76 sq. mm of the 220 sq. mm chip, leaving 144 sq. mm usable for the cores. The core combinations (with the number of cores of each type) used in our experiments are shown in Table 6. As the cores in a heterogeneous processor have different capabilities, we also implemented workload partitioning for heterogeneous systems in an attempt to bridge the gap between each core's processing capability and its actual load. The partitioning is performed by mapping different numbers of Logical Processes (LPs) to different cores using a specific ratio. With such a partitioning strategy, cores with more computing power receive more LPs, while the overall number of LPs stays the same.
For example, for a heterogeneous processor with two small out-of-order cores and one in-order core, if the overall number of LPs is 2500 and the partitioning ratio is 2:1, then the two out-of-order cores will receive 1000 LPs each, and the in-order core will receive 500 LPs.
In this experiment, we tested 3 different partitioning ratios, as shown in Table 7 . 
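The ratio-based mapping can be written down as a short calculation. This is a sketch of the arithmetic in the example above, with a hypothetical function name; how a real implementation handles LP counts that do not divide evenly is not specified in the text, so the sketch simply rejects them.

```python
def partition_lps(total_lps, num_big, num_small, ratio):
    """Split LPs so each bigger core gets `ratio` times the LPs of a smaller core.

    Returns (lps_per_big_core, lps_per_small_core).
    """
    shares = num_big * ratio + num_small  # total number of equal-sized shares
    per_small, remainder = divmod(total_lps, shares)
    if remainder:
        raise ValueError("total_lps does not divide evenly for this ratio")
    return per_small * ratio, per_small


# Two small out-of-order cores and one in-order core, 2500 LPs, ratio 2:1.
per_ooo, per_inorder = partition_lps(2500, num_big=2, num_small=1, ratio=2)
```

A ratio of 1 reduces to the evenly sized partitions used in prior work, while larger ratios shift load toward the more powerful cores without changing the total number of LPs.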
PERFORMANCE EVALUATION
In this section we present a performance evaluation of ROSS-MT using the MARSSx86 simulator. In particular, we first evaluate the tradeoff between the L3 cache size and the core count, and its impact on the performance of ROSS-MT. We follow this by evaluating the performance of ROSS-MT on a system with heterogeneous cores and examining how the partitioning of objects among the cores affects performance.
The Impact of L3 Cache Size and Core Counts
In our first experiment, we evaluated the performance of ROSS-MT using large out-of-order cores (large OoO), small out-of-order cores (small OoO), and small in-order cores (in-order), as shown in Figure 2. In this experiment, we fixed the size of the L3 cache at 16 MB. The performance of ROSS-MT scales as more cores are used. In addition, the performance of the simulation on large out-of-order cores is close to that on small out-of-order cores, but better than the performance obtained with in-order cores. This indicates that for the PHOLD model, a large out-of-order core has a processing speed similar to that of a small out-of-order core, but is significantly faster than an in-order core. Thus, if the number of cores on the chip is fixed, the best performing design is the one that uses modestly aggressive out-of-order cores.
In the next experiment, we evaluate the performance of ROSS-MT when the L3 cache size varies. Here the chip area was fixed, meaning that the available size of the L3 cache becomes smaller as more cores are placed on the chip. The L3 cache size corresponding to each core count is calculated using Equation 1. Figure 3(a) and Figure 3(b) show the total cycles and the L3 hit rate, respectively, as the core count increases. At the point where 41 in-order cores are used (the maximum number of in-order cores that could fit on the chip according to our area model), PDES performance is better than with either 11 large out-of-order cores or 16 small out-of-order ones (the maximum number of out-of-order cores fitting in the same area).
In addition, the available size of the L3 cache decreases as the core count increases, thus reducing the L3 hit rate, as shown in Figure 3(b). As the penalty of an L3 cache miss cannot be ignored, one would expect L3 cache misses to dominate PDES performance at some point when the L3 cache becomes small. However, the experimental results in Figure 3 indicate that the performance of ROSS-MT is always better with more cores, even if the memory pressure exerted by the processing of each event is increased, as shown in Figure 4. The reason is that, due to the high locality in the private L1 and L2 caches, the absolute number of accesses to L3 is fairly small, so the L3 performance has a limited impact in the simulation models that we consider for this study.
PDES on Heterogeneous Multi-Cores
In this subsection, we evaluate the impact of heterogeneous core combinations on the performance of ROSS-MT. In particular, we consider three different core compositions: large OoO + small OoO (Figure 5), large OoO + in-order (Figure 6), and small OoO + in-order (Figure 7). In Figure 5, Figure 6, and Figure 7, the x-axis label N x T indicates that the combination consists of N larger cores and T smaller ones. For example, the case marked "3x6" in Figure 5 refers to the design with 3 large out-of-order cores and 6 small out-of-order cores.
To study the performance of ROSS-MT on heterogeneous cores, a partitioning ratio was introduced in the previous section, with the purpose of distributing objects between larger cores and smaller ones. The motivation is to partially hide the impact of heterogeneity by assigning a larger fraction of objects to the more powerful cores. For example, when the partitioning ratio is set to four, a larger core is assigned four times as many objects as a smaller core.
We used the simulation efficiency to evaluate the quality of partitioning. In PDES, efficiency is calculated by dividing the number of committed events by the total number of processed events. If objects are not properly partitioned among the cores, the progress of simulation on an overloaded core will lag behind, thus causing a large number of rollbacks (assuming an optimistic simulation). Figure 5(b), Figure 6(b) and Figure 7(b) show the simulation efficiency for each group of heterogeneous cores. We selected three values of the partitioning ratio: 1, 2, and 4. We observed that the power of the in-order cores (in terms of instruction throughput) is roughly half that of the out-of-order cores, and therefore the partitioning ratio of 2 represents an attempt to directly compensate for this disparity in order to achieve a more synchronous execution. A partitioning ratio of 4 overloads the fast cores even more.
When the partitioning ratio is set to either 2 or 4, the efficiency increases with the number of larger cores. This indicates that the larger cores become overloaded and are more likely to cause the simulation on the smaller ones to be rolled back. When the number of larger cores increases, the likelihood of communication between larger cores and smaller ones decreases, thus increasing the efficiency. However, of these three partitioning ratios, the best efficiency is achieved for the ratio of 1, indicating that objects in the PHOLD model should be equally distributed among cores even though the cores are heterogeneous. Therefore, a naive partitioning that matches the number of objects to each core's throughput capability does not work. It is likely that some fractional partitioning (with a ratio between 1 and 2) would lead to better performance than equal partitioning, but additional experiments are needed to validate that claim.
We also evaluated the number of cycles for the PDES simulation in each group of heterogeneous cores, as shown in Figure 5(a), Figure 6(a), and Figure 7(a). With a partitioning ratio of 4, the performance of the simulation in each experiment improves as more of the larger cores are used. This improvement comes from the higher simulation efficiency. On the other hand, the trends are different when the partitioning ratio is set to either 1 or 2. For example, Figure 5(a) shows the number of cycles for the simulation when both large out-of-order and small out-of-order cores are used. As both types of cores have a similar processing speed, the performance of the simulation depends mainly on the total number of cores being used when the partitioning ratio is set to 1. In other words, the more cores are used, the better the performance. With a partitioning ratio of 2, the performance of the simulation is dominated by the rollbacks. Finally, Figure 6(a) and Figure 7(a) show the number of cycles for ROSS-MT when the system is composed of large out-of-order cores and in-order cores, and of small out-of-order cores and in-order cores, respectively. In both experiments, when the partitioning ratio is set to either 1 or 2, the performance of the simulation becomes worse as the number of in-order cores in the composition decreases. As we described in the previous subsection, it is advantageous to use more in-order cores when the chip area is limited. Again, notice that a partitioning ratio of 1 provides the best performance of these three points, and additional experiments are needed to determine the best-performing partitioning ratio.
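The efficiency metric can be computed directly from two event counters, as in the sketch below. The function name and the counter names are assumptions for illustration and do not refer to fields of ROSS-MT's statistics output; the counts in the usage line are made up.

```python
def simulation_efficiency(committed_events, processed_events):
    """Fraction of processed events that were eventually committed.

    The events that were processed but not committed are those undone by rollbacks.
    """
    if processed_events <= 0:
        raise ValueError("processed_events must be positive")
    if not 0 <= committed_events <= processed_events:
        raise ValueError("committed_events must lie between 0 and processed_events")
    return committed_events / processed_events


# Hypothetical counts: 1000 events processed, 250 of them later rolled back.
eff = simulation_efficiency(committed_events=750, processed_events=1000)
```

An efficiency of 1.0 means no speculative work was wasted, so lower values in the heterogeneous configurations correspond directly to cycles spent on events that were rolled back.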
RELATED WORK
The multi-core and many-core processor architectures can substantially reduce the communication cost of PDES. Several prior works analyzed PDES performance on these emerging systems and also explored performance optimization opportunities. Jagtap et al. [14, 15] designed a multi-threaded version of ROSS simulator, called ROSS-MT, to explicitly exploit shared memory hierarchy available on multi-core systems. This is in contrast to the process-based communication model used in baseline ROSS simulator, and in many other PDES engines. The performance of ROSS-MT was evaluated on two emerging many-core platforms: 48-core AMD Opteron Magny-Cours [14] and Tilera Tile64 [15] . The experimental results showed that ROSS-MT can scale well on both platforms, especially when a number of performance optimizations (specific to each platform) are applied.
Vitali et al. [32, 31] developed a different multi-threaded PDES simulator on multi-core platforms. A load-sharing scheme was implemented, allowing each simulation kernel instance to be executed by multiple threads. Wilsey et al. [7] Several studies focused on investigating the design tradeoffs in multi-core architectures in order to improve the area efficiency of these systems. A study of [13] is an example of CMP design space exploration in terms of selecting different core designs (in-order and out-of-order) and also investigating the trade-offs between the core count and the cache capacity. In this paper, we perform a similar study in the context of multi-threaded PDES. Several studies focused on developing analytical models for the core, cache and on-chip interconnect area [1, 26, 33, 19] . The cache subsystem design has also been studied in great detail. For example, last level cache designs for data mining applications were explored in [16] . The work of [35] evaluated the impact of CMP cache sharing on multi-threaded applications and demonstrated benefits of cache-sharing aware program transformations. Oh et al. [26] and Wentzlaff [33] proposed simple analytical models to study trade-offs between the core count and the cache capacity under finite die area. Analytical models for thermal implications of CMP designs [24] and its impact on network scalability [4] have also been developed.
The performance of parallel systems, however, depends not just on the CMP design but also on the characteristics of the workload [5] . Some studies have explored such interplay. Configuration workload characterization [25] demonstrated that architectural configuration alone or workload characterization alone were not sufficient to predict the performance of the system. The work of [12] developed analytical models for both architecture and workloads and studied the applications performance and power implications on a range of architectures. Lotfi et al. proposed Scale-out processor design [22] to address the inefficiency of current architectures (which are designed for high single thread performance) to achieve high server throughput. They proposed a chip architecture based on pods -an optimal performance-density building block specifically for scale-out workloads. The work by Vitali et al. [30] considered the memory access patterns 
CONCLUSIONS
We presented a simulation methodology for evaluating performance of PDES applications on emerging hardware many-core architectures. Our approach allows for the exploration of the design space, including the evaluation of the designs with large number of cores as well as the designs with heterogeneous cores. To the best of our knowledge, ours is the first work that uses PDES as an application for such simulation-based studies.
The conclusions of our initial experiments with this simulation framework are that the shared L3 cache has a minimal performance impact for PHOLD-style PDES applications, and it is more advantageous for performance to utilize the available chip area for additional cores rather than caches. While modestly aggressive out-of-order cores provide noticeable performance benefits compared to the in-order cores for area-unconstrained designs, a better performance is achieved with a larger number of simple in-order cores when the chip area is constrained. We also demonstrated that core heterogeneity negatively impacts PDES performance as it naturally distorts the synchrony, increases the number of rollbacks and degrades simulation efficiency. We also showed that specific performance impact depends on the composition of cores.
The simulation framework described in this paper can also be applied to a number of more detailed studies, such as examining and understanding the causes and the impact of rollbacks in PDES in a repeatable and controlled fashion. The low-level details of other PDES subsystems can also be studied using this methodology; this is left for future work. Future work will also examine further modifications to PDES object partitioning algorithms to tailor them to heterogeneous platforms and mitigate the negative impact of core heterogeneity on PDES performance.
