We present Raexplore, a performance modeling framework for architecture exploration. Raexplore enables rapid, automated, and systematic search of the architecture design space by combining hardware counter-based performance characterization and analytical performance modeling. We demonstrate Raexplore for two recent manycore processors, the IBM Blue Gene/Q compute chip and the Intel Xeon Phi, targeting a set of scientific applications. Our framework is able to capture complex interactions between architectural components including instruction pipeline, cache, and memory, and achieves a 3-22% error for same-architecture and cross-architecture performance predictions. Furthermore, we apply our framework to assess the two processors, and to discover and evaluate a list of architectural scaling options for future processor designs.
Introduction
Over 20 years ago, supercomputer pioneer Seymour Cray famously made an analogy on computer design: "If you were plowing a field, which would you rather use: two strong oxen or 1024 chickens?" Today, we see both types of computers in the marketplace, and computer architects face an even wider spectrum of design choices on core complexity, memory hierarchies, parallelism, and special-purpose accelerators. Furthermore, the processor design landscape is becoming increasingly dynamic, as we see mainstream processors both trickling up (e.g., ARM, DSP, GPU) and trickling down (e.g., Atom, Xeon Phi) in their design space to meet the demands of a range of emerging applications in scientific computing, data analytics, gaming, wearable devices, computer vision, etc.
For a big-picture view of this background, Figure 1 sketches today's processor landscape and macro trends in terms of single-thread performance and throughput performance. In Figure 2 , we select eight representative processors and position them in a multi-dimensional design space in terms of their architectural features. The main observation is that the design space is vast in terms of both high dimensionality and large dynamic range for each dimension. We list eight major architectural features (dimensions) ranging from core complexity to memory hierarchies, not to mention other relatively minor features such as branch prediction, prefetching, and memory management. We use the ratio between the highest and lowest value to measure the span of the dynamic range of each feature (dimension). The observed span ranges from 4× up to 78×.
[Figure labels: GT200, GF110, GK110, Xeon Phi (Knights Corner), Cortex A8, PowerPC 450 (BG/P)]
Fundamentally, architecture design is driven by applications. Architectural evaluation and comparison for a diverse set of current processors are challenging because they often require significant human efforts to port applications to different architectures [8, 31]. Studying the performance of future processors is even more challenging, as the hardware is not yet available. While commonly used simulation-based techniques could provide highly accurate results, they are prohibitively slow to handle the combinatorial explosion of design choices in a multi-dimensional space and thus often limited to studying kernels and benchmarks rather than larger programs, miniapps, and even full applications [3, 39, 41, 42].
Figure 2 caption: The numbers in parentheses below the horizontal axis are the ratios between the highest and lowest value, which measure the span of the dynamic range of each feature (dimension) for the eight processors.
In this work, we aim to address this architecture design challenge by developing Raexplore (Rapid architecture explore, pronounced as ray-xplore), a performance modeling framework to reduce the need for application porting in architectural comparison, and to serve as a fast, first-order architecture explorer to complement slower but more accurate simulation-based techniques. In particular, we make the following contributions in this paper. First, we develop a methodology that combines experimental performance characterization and analytical performance modeling to enable rapid and systematic architecture exploration. Second, we develop analytical models for two recent manycore processors, the IBM Blue Gene/Q compute chip and the Intel Xeon Phi. We show that our models could capture complex interactions between architectural components including instruction pipeline, cache, and memory, and achieve a 3-22% error for same-architecture and cross-architecture performance predictions. Third, using our framework, we analyze processor performance for a set of scientific applications, and suggest and evaluate a list of architectural design choices.
The rest of the paper is organized as follows. Section 2 and Section 3 describe our performance modeling methodology and the developed analytical models. Section 4 presents the experiments to validate our performance models. Section 5 and Section 6 apply our framework to analyze processor performance for a set of scientific applications and explore the design space for future architectures. Section 7 and Section 8 respectively discuss related work and the software release status of Raexplore. Section 9 concludes and describes future research directions.
Methodology
We have three goals in mind in developing a performance modeling methodology: (1) handle real-world large programs/applications, instead of kernels or benchmarks, as it is not common in practice that a single kernel dominates the application runtime; (2) explore the architecture design space rapidly; and (3) capture complex effects and interactions of major architectural features and predict performance accurately. To this end, we develop a methodology that combines experimental hardware counter-based performance characterization and analytical performance modeling. The hardware counter-based approach provides a fast way to characterize performance for large applications on an existing baseline architecture. To project performance for future or different architectures, we develop analytical models that take in baseline performance characteristics and produce performance predictions. Analytical modeling reduces the process of architecture exploration to evaluating a set of mathematical formulas, enabling fast search of the vast design space.
Figure 3 shows our performance modeling framework. It first takes in a set of targeted applications and characterizes their performance on a baseline architecture. The performance characteristics are a set of performance events measured by hardware counter-based tools such as PAPI [38], Intel VTune [25], and IBM HPM [21]. The performance characteristics are then calibrated for a target architecture configuration using analytical performance models. The target could be either a future design of the baseline architecture, or a different current architecture (e.g., Blue Gene/Q as baseline and a hypothetical next Blue Gene processor as target, or Blue Gene/Q as baseline and Xeon Phi as target). Finally, the analytical models produce performance analysis for the baseline architecture and performance predictions for the future/target architecture.
Note that the analytical models include both the models for performance analysis (e.g., to derive the time of instruction execution, memory access, and their overlap) and the models for performance characteristics calibration to account for differences in architecture features such as cache size and instruction latency.
In order to project from a baseline to a target architecture, analytical models should be able to capture the architectural changes in instruction pipeline and memory hierarchy. The wider the architectural difference, the more challenging it is to model. We list three types of baseline/target scenarios in order of increasing difficulty: (1) project within the same processor line (e.g., from BGP to BGQ, from NVIDIA Fermi to Kepler GPU), (2) project across similar architectures (e.g., from BGQ to Xeon Phi), and (3) project across different architectures (e.g., from BGQ to GPU, from BGQ to x86 multicores). For example, to project from BGQ to GPU, we need to model the GPU's hardware mechanism to coordinate massively parallel threads and their collective memory access. Projection across different architectures is further complicated by ISA and compiler differences (Section 4.4). In this work, we select two relatively similar architectures for our study: BGQ and Xeon Phi. We model their major architecture features, but currently leave out the software aspects such as compiler differences and thread management overhead.
In comparison with a pure analytical modeling approach [40], which relies on static program analysis for model inputs (e.g., operation count) and thus has difficulties in handling [...], and with simulation-based techniques [41], which are more accurate but several orders of magnitude slower than hardware execution, our approach could serve as a fast, first-order architecture explorer. Statistical techniques and machine learning [26, 32] have been shown to be effective in handling the complexity of the design space. However, because these techniques are driven by experimental/simulation data instead of the inner working mechanisms of the hardware, they are less explanatory and insightful than our mechanism-driven analytical modeling approach. Furthermore, these techniques cannot predict performance for future processors with new features (e.g., a GPU memory coalescing feature that the processor training set does not cover), while the analytical approach could in principle model and incorporate such new features. Table 1 compares our methodology with existing approaches in terms of their capabilities.
Analytical performance models
We develop performance models for BGQ and Xeon Phi. We use a set of performance events (Table 2) monitored by hardware counters to characterize application performance on a baseline processor (Figure 3). To describe a baseline/target architecture configuration, we choose a set of hardware parameters that reflect major architectural features and that, by general practice, have a significant performance impact (listed in Table 3). The architecture parameters are taken from the references [19, 23, 24, 37]. Note that the bandwidth numbers are not their theoretical design values, but peak values measured using synthetic stream benchmarks [22, 37]. The inputs to our analytical models are the counts of performance events measured by hardware counters and the baseline/target architecture configurations; the outputs are the analyzed and predicted runtime and its breakdown in terms of instruction execution and memory access.
Accurate performance models should be able to capture the effects and interactions between architectural components, and are the key to the success of our methodology. There are two major challenges. The first one is to deal with the complexity of the studied architecture, which has multiple cores, multiple pipelines, and multiple levels of memory hierarchy. We need deep understanding and analysis of performance to develop accurate models at the right level of abstraction. The second challenge is to deal with uncertainties that arise when hardware performance counters do not provide sufficient information required by our models (e.g., instruction- and memory-level parallelism, integer and FP instruction execution overlap). In such cases, we develop upper and lower bounds for the quantities of interest and approximate each unknown quantity using the average of its lower and upper bound. To improve the accuracy of this approximation, we divide the application into code blocks (individual functions or loops): if the code blocks are fine-grained enough, the performance of each code block tends to be dominated by a single performance factor, which could be instruction execution, memory latency, or memory bandwidth. All code blocks together cover the whole application. In general, the finer the granularity at which they are specified, the more accurate the performance prediction is, at the cost of annotation effort (inserting hardware counter monitors) and monitoring overhead. We will present both the top time-consuming individual code blocks (which together cover more than 90% of the application) and the aggregated performance breakdown of all code blocks (covering 100% of the application) to ensure representativeness; users can optionally examine the remaining code blocks.
We now describe our performance models. As a reference, Table 4 lists the short names for performance metrics used in our models.
For a given application, we divide it into code blocks and model the total application execution time as the sum of the execution times of its code blocks, where the time of each code block is

$$\mathit{timeCodeBlock} = \mathit{timeInst} + \mathit{timeMem} - \mathit{timeOverlap},$$

where timeInst is the instruction execution time, timeMem is the memory access time, and timeOverlap is the overlap time between instruction execution and memory access. Ideally, we want memory access time to be completely hidden (overlapped by instruction execution) using hardware features such as caching and simultaneous multithreading. However, this is often not the case in reality due to cache misses and the lack of instruction parallelism. While timeOverlap is not directly measurable, it could be estimated on the baseline architecture as

$$\mathit{timeOverlap}_{base} = \mathit{timeInst}_{base} + \mathit{timeMem}_{base} - \mathit{timeCodeBlock}_{base},$$

where $\mathit{timeCodeBlock}_{base}$ is the measured time, and $\mathit{timeInst}_{base}$ and $\mathit{timeMem}_{base}$ are modeled times on the baseline architecture. For a target architecture, we assume timeOverlap scales along with timeInst and timeMem: $\mathit{timeOverlap} = \lambda \times \mathit{timeOverlap}_{base}$, where the scaling factor $\lambda$ is estimated as the average of the time scaling ratios of instruction execution time and memory access time,

$$\lambda = \frac{1}{2}\left(\frac{\mathit{timeInst}}{\mathit{timeInst}_{base}} + \frac{\mathit{timeMem}}{\mathit{timeMem}_{base}}\right).$$

The following subsections describe the models of the instruction pipeline and memory subsystem that are respectively used to estimate timeInst and timeMem.
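The overlap model described above can be illustrated with a minimal Python sketch. The class and function names (`BlockTimes`, `baseline_overlap`, `overlap_scale`, `predict_block_time`) are hypothetical and chosen for this illustration only; they are not taken from the Raexplore software. The sketch assumes the modeled baseline times are nonzero.

```python
from dataclasses import dataclass


@dataclass
class BlockTimes:
    """Modeled times for one code block (any consistent time unit)."""
    inst: float  # instruction execution time (timeInst)
    mem: float   # memory access time (timeMem)


def baseline_overlap(base: BlockTimes, measured_base: float) -> float:
    # timeOverlap_base = timeInst_base + timeMem_base - timeCodeBlock_base
    return base.inst + base.mem - measured_base


def overlap_scale(base: BlockTimes, target: BlockTimes) -> float:
    # lambda: average of the instruction-time and memory-time scaling ratios.
    # Assumes base.inst and base.mem are nonzero.
    return 0.5 * (target.inst / base.inst + target.mem / base.mem)


def predict_block_time(base: BlockTimes, measured_base: float,
                       target: BlockTimes) -> float:
    # timeCodeBlock = timeInst + timeMem - lambda * timeOverlap_base
    lam = overlap_scale(base, target)
    return target.inst + target.mem - lam * baseline_overlap(base, measured_base)


def predict_app_time(block_predictions) -> float:
    # Total application time is the sum over all code blocks.
    return sum(block_predictions)
```

For example, with illustrative (not measured) values `base = BlockTimes(inst=6.0, mem=8.0)`, a measured baseline time of 10.0, and `target = BlockTimes(inst=3.0, mem=8.0)`, the baseline overlap is 4.0, the scaling factor is 0.75, and the predicted block time is 8.0.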
Instruction pipeline
We model the instruction execution time, which includes the pipeline stalls due to instruction dependencies and structural hazards, but excludes the stalls due to dependencies on memory access (assuming zero memory latency). Section 3.2 will separately model the memory access time.
Both BGQ and Xeon Phi feature simultaneous multithreading (SMT), where multiple threads could be executed in an interleaved fashion to increase the instruction-level parallelism (ILP) and to hide memory latency. Both the BGQ and Xeon Phi core have two instruction pipelines: one supports (vector) floating-point (FP) instructions, and the other does not. We will refer to the two pipelines in loosely defined terms as the integer pipeline and the FP pipeline; we will also refer to all general-purpose instructions (including control flow and load/store) as integer instructions. The integer pipeline on both BGQ and Xeon Phi supports vector load/store instructions. The FP pipeline of Xeon Phi is actually versatile, as it can execute all other general-purpose instructions as well.
In terms of utilizing the two pipelines, BGQ could simultaneously issue one integer instruction and one FP instruction to the two pipelines, but these two instructions have to be from two different threads. On Xeon Phi, a thread could issue two instructions in one cycle to both the integer and FP pipelines subject to certain instruction pairing rules, but a thread cannot issue instructions in back-to-back cycles (in the next cycle, a different thread has to take its turn to issue instructions). Effectively, both BGQ and Xeon Phi require at least two threads to fully utilize both pipelines. One advantage of Xeon Phi is that, when using a single thread per core, the Xeon Phi pipeline observes more ILP than BGQ, because each Xeon Phi thread has two instruction streams, while a BGQ thread has only one; however, we have not observed this ILP advantage of Xeon Phi in the case of multiple threads per core.
We model the instruction execution time as $\mathit{timeInst} = \frac{\mathit{instC}}{\mathit{IPC}}$, where IPC is instructions per cycle (assuming zero memory latency as discussed earlier), and instC is the effective number of instructions taking into account the overlap between the execution of integer and FP instructions (effectively we treat two pipelines as a single pipeline in our abstract processor model). In an ideal situation of sufficient instruction-level parallelism (ILP), instruction execution is fully pipelined and IPC = 1. In reality, ILP is limited by instruction dependency and contention for functional units.
The effective number of instructions instC depends on the degree of execution overlap of integer and FP instructions. A complete overlap means a better utilization of both the integer and FP pipelines. However, this is not possible in reality due to limited instruction and thread parallelism. As discussed earlier, both BGQ and Xeon Phi require at least two threads to fully utilize the two pipelines. If the baseline and target architecture use the same number of threads per core, $\mathit{instC} = \alpha \times \mathit{instC}_{base}$, where $\alpha$ is the factor that takes into account ISA and compiler differences; if the target architecture uses a different number of threads per core (TPC), we estimate instC using a conditional function, which essentially assumes maximum integer and FP instruction overlap if using more than two threads per core, minimum overlap if using one thread per core, and an average overlap if using two threads per core [...]. ILP is derived from $\mathit{ILP}_{base}$ taking into account the differences in threads per core (TPC) and instruction streams per thread (SPT) [...].
We estimate timeInst base as the average of its lower and upper bound respectively using a maximum and minimum ILP base because ILP base is not directly measurable by hardware counters. Note that timeInst base(max) cannot exceed timeCodeBlock base , the measured execution time for this code block. 
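As a rough illustration of the two estimation steps described above, the following Python sketch encodes the conditional overlap rule for instC and the averaging of the lower and upper bound of the baseline instruction time. Several details are assumptions made for this sketch rather than the paper's equations: maximum overlap is taken to mean `max(inst_int, inst_fp)`, minimum overlap is taken to mean `inst_int + inst_fp`, and the ILP bounds are represented directly as hypothetical `ipc_min` and `ipc_max` inputs. All function and parameter names are hypothetical.

```python
def effective_inst_count(inst_int: float, inst_fp: float,
                         threads_per_core: int, alpha: float = 1.0) -> float:
    """Effective instruction count (instC) under an assumed overlap rule."""
    if threads_per_core < 1:
        raise ValueError("threads_per_core must be at least 1")
    full_overlap = max(inst_int, inst_fp)   # assumed meaning of maximum overlap
    no_overlap = inst_int + inst_fp         # assumed meaning of minimum overlap
    if threads_per_core == 1:
        inst_c = no_overlap
    elif threads_per_core == 2:
        inst_c = 0.5 * (full_overlap + no_overlap)  # average overlap
    else:
        inst_c = full_overlap
    # alpha accounts for ISA and compiler differences (1.0 means none).
    return alpha * inst_c


def baseline_inst_time(inst_c: float, ipc_min: float, ipc_max: float,
                       time_code_block: float) -> float:
    """Average of the lower and upper bound of timeInst_base (in cycles)."""
    lower = min(inst_c / ipc_max, time_code_block)
    # The upper bound cannot exceed the measured time of the code block.
    upper = min(inst_c / ipc_min, time_code_block)
    return 0.5 * (lower + upper)
```

With illustrative inputs of 100 integer and 60 FP instructions, this rule yields 160 effective instructions at one thread per core, 130 at two, and 100 at four.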
Memory subsystem
The memory performance could be either latency bound or bandwidth bound. Therefore, we model memory access time as timeMem = max(timeLat, timeBW), where timeLat is the sum of memory access latency for all memory references and timeBW is the time to transfer all memory traffic (including prefetch traffic) over the memory bus. We estimate timeLat base as the average of its lower and upper bound (note that timeLat max cannot exceed timeCodeBlock base). At the lower bound of timeLat base, we use the minimum memory latency (MPC = 1); at the upper bound, the maximum memory latency is bounded by the average access latency to all memory hierarchies. [...] [20]. To account for the effects of cache contention among multiple threads, we allocate an evenly divided portion of cache to each thread. Although our framework allows more sophisticated cache and contention models [2, 9, 10, 18] to be incorporated, we observe that the power law approximation and uniform allocation provide sufficiently accurate results, as we will show in Section 4. Regarding cache coherency, our baseline measurements do include the effect of coherence traffic and we assume a linear scaling of this effect in performance prediction; our framework is extensible for advanced coherency models to predict the non-linear effect and the impact of changed coherence protocols.
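The relation timeMem = max(timeLat, timeBW) can be sketched in Python as follows. This is a simplified illustration under stated assumptions: the latency term sums per-level access counts times per-level latencies and divides by a memory-level parallelism factor `mlp`, which is an assumption of this sketch (the paper's exact bounds on timeLat are not reproduced here). The function names and the dictionary-based inputs are hypothetical.

```python
def latency_time_ns(accesses: dict, latency_ns: dict, mlp: float = 1.0) -> float:
    """Total latency time: per-level access counts times per-level latency."""
    if mlp <= 0:
        raise ValueError("mlp must be positive")
    total = sum(count * latency_ns[level] for level, count in accesses.items())
    # Dividing by mlp models overlapping outstanding accesses (assumption).
    return total / mlp


def bandwidth_time_ns(bytes_moved: float, bytes_per_ns: float) -> float:
    """Time to move all memory traffic over the memory bus."""
    return bytes_moved / bytes_per_ns


def memory_time(time_lat: float, time_bw: float):
    """timeMem = max(timeLat, timeBW), plus which term is the bottleneck."""
    if time_bw > time_lat:
        return time_bw, "bandwidth"
    return time_lat, "latency"
```

A code block is then classified as latency bound or bandwidth bound by whichever term dominates, which is the distinction used in the validation and analysis sections.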
Model validation
We validate our models for (1) same-architecture performance prediction for threads scaling, cache contention, and simultaneous multithreading on BGQ, and (2) cross-architecture performance prediction from BGQ to Xeon Phi.
Threads scaling
We use our models to predict the runtime of executions on BGQ that use up to 16 threads based on the performance characterization for an execution that uses a single thread. We select several code blocks from the fluid dynamics code Nekbone [14], which solves a Poisson equation using a conjugate gradient method. The code blocks are representative of different scaling behaviors. Figures 4, 5, and 6 show the measured and predicted runtime respectively for code blocks named add2s, dp, and grad.
The prediction errors at 16 threads range from 3% to 22%. For example, in Figure 4a, the predicted time closely tracks the measured time, and the performance stops scaling linearly at 4 threads. This is because the memory latency time becomes smaller than the memory bandwidth time at 4 threads (Figure 4b), which changes the performance from latency-bound to bandwidth-bound. Similarly, the linear performance scaling stops at 8 threads for dp (Figure 5); and for grad, the performance continues to scale linearly up to 16 threads because memory latency has always been the bottleneck, and it scales linearly with the number of threads (Figure 6).
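The transition from latency-bound to bandwidth-bound scaling described above can be illustrated with a small Python sketch. It assumes, as a simplification, that the latency time shrinks in proportion to the thread count while the bandwidth time stays constant because the memory bus is shared. The function names and the example numbers are hypothetical and are not measurements from the paper.

```python
def memory_time_at(threads: int, lat_time_1t: float, bw_time: float) -> float:
    """Memory time at a given thread count under the stated assumptions."""
    return max(lat_time_1t / threads, bw_time)


def first_bandwidth_bound(thread_counts, lat_time_1t: float, bw_time: float):
    """Smallest thread count at which latency time drops below bandwidth time."""
    for n in sorted(thread_counts):
        if lat_time_1t / n < bw_time:
            return n
    return None  # latency bound at every listed thread count
```

With a single-thread latency time of 8.0 and a bandwidth time of 2.5 (illustrative units), scaling stays linear at 1 and 2 threads and flattens from 4 threads onward, which mirrors the qualitative behavior reported for add2s.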
Cache contention
Figure 5: Threads scaling performance prediction for dp.
We use our cache models based on power law approximation and uniform allocation to predict the cache hit rate for the 2-threads-per-core and 4-threads-per-core cases based on the performance characterization for the 1-thread-per-core case. Both the baseline and target processor are BGQ. Table 5 lists the prediction results for four code blocks. Our simple cache model achieves a cache hit rate prediction error between 0.3% and 3.48%. For example, for the code block dp, we accurately predict the cache hit rate drop due to the contention of multiple threads.
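A minimal Python sketch of a power law cache model with uniform allocation is given below. The paper's exact formulation is not available in this text, so the form used here is an assumption: the miss rate scales as the cache size ratio raised to the power `-alpha`, and each thread receives an equal share of the cache. The default exponent of 0.5 is a commonly quoted rule-of-thumb value, not a value taken from the paper, and all names are hypothetical.

```python
def per_thread_cache_kb(cache_kb: float, threads_per_core: int) -> float:
    """Uniform allocation: each thread gets an equal share of the cache."""
    return cache_kb / threads_per_core


def scaled_miss_rate(miss_base: float, size_base: float, size_target: float,
                     alpha: float = 0.5) -> float:
    """Assumed power law: miss rate scales as (size ratio) ** (-alpha)."""
    miss = miss_base * (size_target / size_base) ** (-alpha)
    return min(1.0, miss)  # a miss rate cannot exceed 100%


def predicted_hit_rate(hit_base: float, cache_kb: float,
                       tpc_base: int, tpc_target: int,
                       alpha: float = 0.5) -> float:
    """Predict the hit rate when the number of threads per core changes."""
    size_base = per_thread_cache_kb(cache_kb, tpc_base)
    size_target = per_thread_cache_kb(cache_kb, tpc_target)
    return 1.0 - scaled_miss_rate(1.0 - hit_base, size_base, size_target, alpha)
```

For instance, going from 1 to 4 threads per core on a 16 KB cache quarters the per-thread share, which under `alpha = 0.5` doubles the miss rate.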
Simultaneous multithreading (SMT)
We use our models to predict SMT performance for the 2-threads-per-core and 4-threads-per-core cases based on the performance characterization for the 1-thread-per-core case. Both the baseline and target processor are BGQ. Figures 7, 8, and 9 show the measured and predicted runtime respectively for code blocks grad, add2s, and dp. The prediction errors for 4 threads per core range from 13.2% to 19.2%. Take code block grad for example (Figure 7a). As we use more threads per core, both instruction execution time (red column) and memory access time (yellow column) decrease. The reduction of the instruction time is due to the increased ILP and the overlap of integer and FP instructions. The reduction of the memory access time is due to the increased MLP, despite the slight increase of average memory access latency aML due to the cache contention of simultaneous threads (Figure 7b).
In contrast, the total runtime for code block add2s does not reduce much at more threads per core (Figure 8a ). This is because the majority of runtime is taken by the memory access time, which is bound by bandwidth and does not change with the number of threads per core (Figure 8b) . The runtime at 4 threads per core actually increases rather than decreases. Further examination reveals that more dynamic instructions are executed, most likely due to the extra OpenMP thread management overhead. For dp (Figure 9a) , from 2 to 4 threads per core, the runtime does not decrease as much as for grad because the memory access time becomes bandwidth-bound from latency-bound at 2 threads per core (Figure 9b ).
Cross-architecture performance prediction
Figure 9: Predict SMT performance for dp.
We use the performance characterization on the baseline processor BGQ to predict the performance on a target processor Xeon Phi. Figure 10 shows measured and predicted performance for three code blocks. We compare the prediction results for three models: "naive", "model", and "with inst diff". The "naive" model simply scales the runtime according to the difference in dynamic instruction count caused by the compiler and ISA differences. Both "model" and "with inst diff" use our performance models; "model" does not take into account the dynamic instruction count difference, while "with inst diff" does. We examined the sources of the instruction count difference and discovered they are mostly from compiler-generated instructions for prefetching and vector load/store packing/unpacking. For different code blocks, the observed instruction count for Xeon Phi could be up to 20% less and up to 40% more than that for BGQ. Overall, "model" produces significantly more accurate predictions than "naive"; "with inst diff" further reduces the errors to 5-16%.
One difficulty in cross-architecture performance prediction lies in the different dynamic instruction counts resulting from ISA and compiler differences. We currently measure the dynamic instruction count on Xeon Phi to gauge the impact of this factor. Nevertheless, we do not require users to measure the instruction count on target platforms; this is just an optional extra step to improve accuracy, especially for projections across very different architectures. Without it, we can still achieve reasonable accuracy and provide performance insights. Future work could use compiler techniques to estimate the instruction count change. Furthermore, we expect much smaller ISA and compiler differences in projecting performance within a processor line (e.g., BGP to BGQ, Xeon Phi to its next generation).
Summary
We have shown that our models are able to capture complex effects and interactions of architectural components and performance factors including instruction pipeline, cache, memory bandwidth, and number of threads. Our models achieve good accuracy across a variety of validation experiments including threads scaling (3-22% error), cache contention (0.3-3.48% error), SMT performance (13-19% error), and cross-architecture prediction (5-16% error).
Performance Analysis
We demonstrate our models in analyzing processor performance for a set of applications. In our analysis, we provide instruction and memory time breakdown. For instruction time, we further break it down to integer and FP instruction time; for memory time, we further break it down to latency and bandwidth time (the latency time could be further divided to time to different levels of memory hierarchy). This type of analysis is different from performance characterization using raw hardware counter data [16] in that it processes the raw hardware counter-based performance characteristics with our models and provides important information on performance bottlenecks of the studied processor.
Application performance analysis
For our study, we select five scientific codes from the CORAL benchmarks [1]: Nekbone, Qbox, LULESH, AMG, and UMT. The CORAL benchmarks are formed according to the mission needs of the U.S. Department of Energy and are currently used by three national labs (Oak Ridge, Argonne, and Livermore) to evaluate and design future architectures. All application codes have been parallelized using both MPI and OpenMP. In our experiments, we always select a combination of MPI tasks and OpenMP threads to fully utilize all cores of a processor and to minimize the time to solution. Note that this is a node-level study on processor and memory architecture, and we run MPI tasks on cores within a processor (all MPI communication traffic is included in the memory traffic).
Figure 11 shows the performance analysis for major code blocks in Nekbone and Qbox on BGQ. Two major observations are: (1) there is no single code block that takes more than 50% of the total application time, and (2) different code blocks observe different performance bottlenecks in integer/FP pipeline, memory latency, and memory bandwidth. Note that the memory performance for a code block could be either latency-bound or bandwidth-bound, so the memory time is completely taken by either latency time or bandwidth time.
Figure 12 shows the performance analysis for all five applications on BGQ. For each application, we aggregate the instruction and memory timings of all code blocks to derive the application-level performance analysis. The major observations are: (1) with the exception of Nekbone, most applications are bound more by memory latency time than by memory bandwidth time, which suggests cache improvements will benefit the performance; (2) most applications spend the majority of instruction execution time in processing integer instructions, which suggests adding more integer pipelines in a core would benefit the performance.
Architecture comparison
To compare the architecture features between BGQ and Xeon Phi, we analyze the performance of two code blocks that are respectively representative for two scenarios: (1) instruction execution (and memory latency) bound, and (2) memory bandwidth bound. Figure 13 shows the performance analysis results. To compare on a per-core basis, we have scaled the timings according to the core count difference between BGQ and Xeon Phi. For code block grad, BGQ and Xeon Phi have comparable performance (Figure 13a ). The performance of grad is mostly instruction execution bound and memory latency bound, and most of the memory latency is hidden (overlapped with instruction execution). The slightly longer instruction time on Xeon Phi is mostly due to the increased dynamic instruction count as a result of compiler-inserted memory prefetch instructions. The memory latency time on Xeon Phi is slightly higher, mostly due to its longer DDR access latency ( Figure 13d and Table 3 ). Xeon Phi has a larger L1 cache (32 KB) than BGQ (16 KB), which results in a higher L1 cache hit rate (Figure 13c ) and more time spent in accessing L1 and less time in LLC (Figure 13d ). For code block add2s, Xeon Phi performs about 2× better than BGQ (Figure 13b ). This is mainly because add2s is mostly memory bandwidth bound, and Xeon Phi has 1.7× more per-core bandwidth than BGQ (Table 3) .
In summary, BGQ and Xeon Phi have comparable instruction pipeline and cache performance. Although Xeon Phi has a larger L1 cache, this advantage is offset by its longer DDR latency. Xeon Phi has 1.7× more bandwidth per core than BGQ, which translates to almost the same ratio of real performance benefit for bandwidth-bound programs.
Architecture Exploration
We demonstrate Raexplore in exploring architecture scaling options for BGQ. The studied architectural features include core count, L1 and LLC size, and memory bandwidth. By scaling these features in both directions (up and down), this type of study has a two-fold purpose: (1) evaluate the design balance of a current/baseline processor, and (2) explore scaling opportunities for its potential future design.
Core scaling
We scale the BGQ core count (16) to see its performance impact. As shown in Figure 14, if we halve the number of cores, the runtimes of all applications almost double. If we double the number of cores, the runtime reduction is between 27% (AMG) and 50% (LULESH). A further increase of core count will have diminishing performance return because the performance becomes more and more memory bound (by either memory latency or bandwidth), as shown in Figure 14. This means the memory resources including cache and bandwidth need to keep up to accommodate more cores. Overall, the core count is in a good balance with the rest of the architecture and leans towards the underdesign side. However, since the design is also constrained by chip area and power, increasing the core count may not be possible. Note that an overdesign would mean changing (either increasing or decreasing) the core count does not affect performance much, and an underdesign would mean changing it would affect performance near linearly.
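The notion of overdesign and underdesign described above can be made concrete with a small Python sketch that compares an observed runtime reduction against the ideal (linear) reduction for a given scaling factor. The efficiency metric, the function names, and the thresholds 0.2 and 0.8 are hypothetical choices for this illustration; the paper does not define numeric thresholds.

```python
def scaling_efficiency(runtime_reduction: float, scale_factor: float = 2.0) -> float:
    """Observed runtime reduction relative to ideal linear scaling.

    Doubling a resource ideally halves the runtime, i.e. a 50% reduction.
    """
    ideal_reduction = 1.0 - 1.0 / scale_factor
    return runtime_reduction / ideal_reduction


def classify_design(efficiency: float, low: float = 0.2, high: float = 0.8) -> str:
    """Label a resource using hypothetical efficiency thresholds."""
    if efficiency >= high:
        return "underdesigned"  # performance follows the resource near linearly
    if efficiency <= low:
        return "overdesigned"   # changing the resource barely matters
    return "balanced"
```

Using the reported reductions for doubling the core count, 50% (LULESH) maps to an efficiency of 1.0 and 27% (AMG) to 0.54, so under these illustrative thresholds the core count appears balanced to underdesigned depending on the application.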
L1 cache scaling
We scale the baseline L1 cache size (16 KB) to see its performance impact. As shown in Figure 15b, the memory latency time is very sensitive to the L1 size: the larger the cache, the more L1 hits, so L1's contribution to the total latency time grows while LLC's contribution shrinks. We see diminishing returns from increasing the L1 size, as LLC's contribution becomes smaller and smaller. The improved latency time translates to an overall runtime reduction of up to 12% for LULESH and 6% on average for all applications if we increase the L1 size by 4× (Figure 15a). AMG shows minimal performance improvement with increased L1 size. Our investigation reveals that if we double the L1 size, the memory time of AMG becomes primarily bandwidth bound (Figure 16) and is thus no longer affected by the L1 size. Overall, the designed L1 size is in good balance with the rest of the architecture, and increasing it will also give considerable performance benefit.
Last level cache (LLC) scaling
Reducing the LLC size by half gives only a small performance penalty of up to 7% for Qbox and 3.8% on average for all applications; doubling it gives only a slight performance increase of 1-5%.
Bandwidth scaling
Figure 18 shows how memory bandwidth scaling impacts BGQ's performance. If we decrease the bandwidth by half, the total runtime increases significantly for all applications except LULESH, because the memory time changes from latency bound to bandwidth bound for all applications except LULESH, as shown in Figure 19. On the other hand, doubling the bandwidth does not result in much performance benefit except for Nekbone, because the memory time of all applications except Nekbone is primarily bound by memory latency (Figure 12). The designed bandwidth is in just the right balance with the rest of the architecture.
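The switch between latency-bound and bandwidth-bound memory time under bandwidth scaling can be illustrated with a toy model. This is a sketch under a simplifying assumption, namely that memory time is the larger of a latency-bound time and a bandwidth-bound time; it is not the model used in this work. The function name and all numbers (1.0 s latency time, 24 GB of traffic, 30 GB/s baseline bandwidth) are hypothetical.

```python
def memory_time(latency_time, bytes_moved, bandwidth):
    """Return (memory_time_seconds, bound) for one application.

    Assumes memory time is the maximum of the latency-bound time and
    the bandwidth-bound time (bytes_moved / bandwidth). This max-of-two
    form is an illustrative assumption.
    """
    bandwidth_time = bytes_moved / bandwidth
    if bandwidth_time > latency_time:
        return bandwidth_time, "bandwidth"
    return latency_time, "latency"


# Hypothetical application profile and baseline bandwidth.
LATENCY_TIME_S = 1.0
BYTES_MOVED = 24e9
BASELINE_BW = 30e9  # bytes per second

for factor in (0.5, 1.0, 2.0):
    t, bound = memory_time(LATENCY_TIME_S, BYTES_MOVED, BASELINE_BW * factor)
    print(f"bandwidth x{factor}: memory time {t:.2f} s ({bound} bound)")
```

For this hypothetical profile, halving the bandwidth makes the application bandwidth bound and increases its memory time, while doubling the bandwidth changes nothing because the application stays latency bound.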
Constraint-based exploration
We demonstrate our modeling framework in exploring architectural tradeoffs under a chip area constraint. The hypothetical scenario for our study is to develop a future processor based on BGQ with a 2× chip area (transistor) budget. How should we optimally allocate the chip area to cores and LLC? While a straightforward scaling option is to double both the cores and LLC, we also include other options that all keep the total chip area the same (based on our measurement of the BGQ die photo, one BGQ core takes roughly the same area as 1 MB of LLC). For the future processor, we hypothetically increase the L1 cache size by 4× to 64 KB and increase the memory bandwidth by 3× (anticipating new technologies such as stacked memory). We assume the larger L1 size and higher memory bandwidth do not significantly affect the chip area. Figure 20 shows the predicted time for the different core count and LLC size options. The (56 cores, 8 MB LLC) option turns out to be the sweet spot for most applications; it is on average 14% better than the straightforward scaling option (32 cores, 32 MB LLC). This suggests that adding cores is more beneficial than adding LLC. However, we should note that power is the main design constraint in today's processors, and having too many cores may exceed the design power envelope.
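The candidate configurations in such an area-constrained study can be enumerated mechanically. The sketch below assumes, as stated above, that one core occupies the same area as 1 MB of LLC. The total of 64 area units is inferred from the two configurations named in the text, (32 cores, 32 MB) and (56 cores, 8 MB); the function name, the step of 8 cores, and the 8 MB minimum LLC are assumptions made here for illustration.

```python
def area_options(total_units, core_step=8, min_llc_mb=8):
    """List (cores, llc_mb) pairs that exactly fill total_units of area.

    Assumes one core costs one area unit and 1 MB of LLC costs one
    area unit. core_step and min_llc_mb are hypothetical parameters.
    """
    options = []
    cores = core_step
    while total_units - cores >= min_llc_mb:
        options.append((cores, total_units - cores))
        cores += core_step
    return options


# 64 units is inferred from the (32, 32) and (56, 8) options in the text.
options = area_options(64)
for cores, llc_mb in options:
    print(f"{cores:3d} cores, {llc_mb:3d} MB LLC")
```

Each pair would then be passed to the performance model to obtain a predicted runtime, and the configuration with the lowest predicted time would be selected.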
We have demonstrated our performance modeling framework for evaluating the design balance and exploring future scaling options for BGQ. In summary, the designed core count and L1 cache size are in good balance with the rest of the architecture and lean towards the underdesign side. The LLC size seems overdesigned. The bandwidth is well balanced with the rest of the architecture. For a future processor based on BGQ, more cores, together with a correspondingly larger L1 cache and higher bandwidth, will continue to scale the performance, while the LLC size could be kept the same or even shrunk. Note that these recommendations should be taken with two caveats in mind: (1) they are specific to the selected applications of interest to us, and (2) they should be considered along with the design constraints on chip area and power.
Related Work
In this work, we develop a performance modeling framework. By combining hardware counter-based performance measurements and analytical processor models, we reduce the process of architecture exploration to a matter of evaluating a set of mathematical formulas, which enables rapid and systematic search of the processor design space for large programs and even full applications. Several works are similar in spirit to ours. Luo et al. [5, 33] use hardware counter data combined with analytical models to estimate the overlap of instruction execution and memory access for out-of-order processors. However, the focus of their work is to analyze the performance of existing processors rather than to explore architectures for future processors. Saavedra and Smith [40] build machine and program execution models to estimate execution time for arbitrary machine/program combinations. Their technique relies heavily on static program analysis (assisted by runtime profiling) and thus has difficulty accounting for compiler optimizations and dynamic program behaviors; in contrast, we use hardware counter data as inputs to our models so that these issues are handled easily and automatically. Carrington et al. [7, 43] build a modeling framework to predict application performance on future systems by combining simulation traces and machine profiles (collected by micro-benchmarks). In comparison, our approach relies on fast hardware counter-based profiling instead of program simulation, which is several orders of magnitude slower. Because hardware counter data do not provide complete performance information, our models must do more reasoning to derive lower and upper bounds for unknown factors. Krishna et al. [30] estimate upper performance bounds of applications through static program analysis. The advantages of their technique are ease of use and no need for runtime profiling. However, this comes at the cost of not considering dynamic program behaviors; their technique also uses relatively simple hardware models.
There have been studies on analytical models of specific architecture features such as cache [2, 18, 45] and superscalar pipeline [13, 28]. Their focus is on the performance prediction of individual architecture features, while ours is on a methodology and an implemented framework to enable rapid, automated architecture search, as well as a demonstration for a set of applications on two recent processors.
Orthogonal and complementary to analytical modeling, simulation-based techniques have seen a great deal of recent development. To speed up simulation (or reduce the number of simulations in architecture exploration), a variety of methods have been proposed, including efficient parallelization [41], combined analytical modeling [6, 39], application abstraction [17], statistical sampling [46], and machine learning [26, 27, 29].
Domain-specific languages for modeling program behaviors have also been developed [34, 35, 44]. Such languages require user-written code annotations or skeletons and need to be combined with hardware performance models to make a performance prediction. Other related works include very high-level bound-based Roofline performance modeling [47] and model-guided compiler and program optimizations [11, 12, 15, 36]. These studies require detailed, often manual analysis to model algorithm/program behaviors; they also have an application focus and use relatively simple processor models.
Software Release
Raexplore is implemented in Python (with plotting features using matplotlib). We will release Raexplore as an open-source tool. We have also developed a web application for it with features for users to manage and share architecture configurations and application profiles. Raexplore currently accepts hardware counter data from Intel VTune and IBM HPM, but is modular and extensible so that it can support other profiling tools (e.g., PAPI [38]) or architecture simulators (e.g., M5 [4]) and integrate hardware models for other processors (e.g., GPU, DSP, ARM, and x86 multicores).
Conclusion
We have developed a novel performance modeling methodology that combines experimental and analytical approaches to overcome their shortcomings and gain the merits of both: speed, accuracy, and insight. As a result, this hybrid approach is able to model, predict, and analyze the cross-architecture performance of full applications in little time. To our knowledge, such a capability cannot be achieved by current technologies that are based purely on measurement, analytical modeling, or simulation, yet it is highly desirable in co-designing next-generation manycore processors driven by a set of applications. This paper describes our first step in proposing this methodology and focuses on the trending power-efficient BGQ/Xeon Phi-style architectures for a set of HPC applications.
The great promise of analytical modeling is that it expresses the relation between performance and architectural components in mathematical formulas and thus provides a formal basis for computer architects to reason about and optimize architecture design. By combining it with experimental evaluation, we have made this approach practical for large applications. We envision that our performance modeling framework could serve as a platform for future researchers to build a model library for a variety of both conventional and novel processors and to analytically and systematically compare their strengths and weaknesses for various applications. In addition to performance models, it would also be highly desirable to combine them with chip area and power models and thus mathematically formulate the architecture design problem as a performance optimization problem with resource constraints.
