RISC vs. CISC wars raged in the 1980s when chip area and processor design complexity were the primary constraints and desktops and servers exclusively dominated the computing landscape. Today, energy and power are the primary design constraints and the computing landscape is significantly different: growth in tablets and smartphones running ARM (a RISC ISA) is surpassing that of desktops and laptops running x86 (a CISC ISA). Further, the traditionally low-power ARM ISA is entering the high-performance server market, while the traditionally high-performance x86 ISA is entering the mobile low-power device market. Thus, the question of whether ISA plays an intrinsic role in performance or energy efficiency is becoming important, and we seek to answer this question through a detailed measurement-based study on real hardware running real applications. We analyze measurements on the ARM Cortex-A8 and Cortex-A9 and Intel Atom and Sandybridge i7 microprocessors over workloads spanning mobile, desktop, and server computing. Our methodical investigation demonstrates the role of ISA in modern microprocessors' performance and energy efficiency. We find that ARM and x86 processors are simply engineering design points optimized for different levels of performance, and there is nothing fundamentally more energy efficient in one ISA class or the other. The ISA being RISC or CISC seems irrelevant.
Introduction
The question of ISA design and specifically RISC vs. CISC ISA was an important concern in the 1980s and 1990s when chip area and processor design complexity were the primary constraints [24, 12, 17, 7]. It is questionable if the debate was settled in terms of technical issues. Regardless, both flourished commercially through the 1980s and 1990s. In the past decade, the ARM ISA (a RISC ISA) has dominated mobile and low-power embedded computing domains and the x86 ISA (a CISC ISA) has dominated desktops and servers.
Recent trends raise the question of the role of the ISA and make a case for revisiting the RISC vs. CISC question. First, the computing landscape has quite radically changed from when the previous studies were done. Rather than being exclusively desktops and servers, today's computing landscape is significantly shaped by smartphones and tablets. Second, while area and chip design complexity were previously the primary constraints, energy and power constraints now dominate. Third, from a commercial standpoint, both ISAs are appearing in new markets: ARM-based servers for energy efficiency and x86-based mobile and low-power devices for higher performance. Thus, the question of whether ISA plays a role in performance, power, or energy efficiency is once again important.
978-1-4673-5587-2/13/$31.00 ©2013 IEEE
Related Work:
Early ISA studies are instructive, but miss key changes in today's microprocessors and design constraints that have shifted the ISA's effect. We review previous comparisons in chronological order, and observe that all prior comprehensive ISA studies considering commercially implemented processors focused exclusively on performance.
Bhandarkar and Clark compared the MIPS and VAX ISA by comparing the M/2000 to the Digital VAX 8700 implementations [7] and concluded: "RISC as exemplified by MIPS provides a significant processor performance advantage." In another study in 1995, Bhandarkar compared the Pentium-Pro to the Alpha 21164 [6], again focused exclusively on performance, and concluded: "... the Pentium Pro processor achieves 80% to 90% of the performance of the Alpha 21164 ... It uses an aggressive out-of-order design to overcome the instruction set level limitations of a CISC architecture. On floating-point intensive benchmarks, the Alpha 21164 does achieve over twice the performance of the Pentium Pro processor." Consensus had grown that RISC and CISC ISAs had fundamental differences that led to performance gaps that required aggressive microarchitecture optimization for CISC, which only partially bridged the gap.
Isen et al. [22] compared the performance of Power5+ to Intel Woodcrest considering SPEC benchmarks and concluded x86 matches the POWER ISA. The consensus was that "with aggressive microarchitectural techniques for ILP, CISC and RISC ISAs can be implemented to yield very similar performance."
Many informal studies in recent years claim the x86's "crufty" CISC ISA incurs many power overheads and attribute the ARM processor's power efficiency to the ISA [1, 2]. These studies suggest that the microarchitecture optimizations from the past decades have led to RISC and CISC cores with similar performance, but the power overheads of CISC are intractable.
In light of the prior ISA studies from decades past, the significantly modified computing landscape, and the seemingly vastly different power consumption of ARM implementations (1-2 W) and x86 implementations (5-36 W), we feel there is a need to revisit this debate with a rigorous methodology. Specifically, considering the dominance of ARM and x86 and the multi-pronged importance of the metrics of power, energy, and performance, we need to compare ARM to x86 on those three metrics. Macro-op cracking and decades of research in high-performance microarchitecture techniques and compiler optimizations seemingly help overcome x86's performance and code-effectiveness bottlenecks, but these approaches are not free. The crux of our analysis is the following: After decades of research to mitigate CISC performance overheads, do the new approaches introduce fundamental energy inefficiencies?
Challenges:
Any ISA study faces challenges in separating out the multiple implementation factors that are orthogonal to the ISA from the factors that are influenced or driven by the ISA. ISA-independent factors include chip process technology node, device optimization (high-performance, low-power, or low-standby power transistors), memory bandwidth, I/O device effects, operating system, compiler, and workloads executed. These issues are exacerbated when considering energy measurements/analysis, since chips implementing an ISA sit on boards and separating out chip energy from board energy presents additional challenges. Further, some microarchitecture features may be required by the ISA, while others may be dictated by performance and application domain targets that are ISA-independent.
To separate out the implementation and ISA effects, we consider multiple chips for each ISA with similar microarchitectures, use established technology models to separate out the technology impact, use the same operating system and compiler front-end on all chips, and construct workloads that do not rely significantly on the operating system. Figure 1 presents an overview of our approach: the four platforms, 26 workloads, and set of measures collected for each workload on each platform. We use multiple implementations of the ISAs and specifically consider the ARM and x86 ISAs representing RISC against CISC. We present an exhaustive and rigorous analysis using workloads that span smartphone, desktop, and server applications. In our study, we are primarily interested in whether and, if so, how the ISA impacts performance and power. We also discuss infrastructure and system challenges, missteps, and software/hardware bugs we encountered. Limitations are addressed in Section 3. Since there are many ways to analyze the raw data, this paper is accompanied by a public release of all data at www.cs.wisc.edu/vertical/isa-power-struggles.
Key Findings:
The main findings from our study are:
• Large performance gaps exist across the implementations, although average cycle count gaps are ≤ 2.5x.
• Instruction count and mix are ISA-independent to first order.
• Performance differences are generated by ISA-independent microarchitecture differences.
• The energy consumption is again ISA-independent.
• ISA differences have implementation implications, but modern microarchitecture techniques render them moot; one ISA is not fundamentally more efficient.
• ARM and x86 implementations are simply design points optimized for different performance levels.
Implications:
Our findings confirm known conventional (or suspected) wisdom, and add value by quantification. Our results imply that microarchitectural effects dominate performance, power, and energy impacts. The overall implication of this work is that the ISA being RISC or CISC is largely irrelevant for today's mature microprocessor design world.
Paper organization:
Section 2 describes a framework we develop to understand the ISA's impacts on performance, power, and energy. Section 3 describes our overall infrastructure, the rationale for the platforms used in this study, and our limitations. Section 4 discusses our methodology, and Section 5 presents the analysis of our data. Section 6 concludes.
Framing Key Impacts of the ISA
In this section, we present an intellectual framework in which to examine the impact of the ISA (assuming a von Neumann model) on performance, power, and energy. We consider the three key textbook ISA features that are central to the RISC/CISC debate: format, operations, and operands. We do not consider other textbook features, data types and control, as they are orthogonal to RISC/CISC design issues and RISC/CISC approaches are similar. Table 1 presents the three key ISA features in three columns and their general RISC and CISC characteristics in the first two rows. We then discuss contrasts for each feature and how the choice of RISC or CISC potentially and historically introduced significant trade-offs in performance and power. In the fourth row, we discuss how modern refinements have led to similarities, marginalizing the choice of RISC or CISC on performance and power. Finally, the last row raises empirical questions focused on each feature to quantify or validate this convergence. Overall, our approach is to understand all performance and power differences by using measured metrics to quantify the root cause of differences and whether or not ISA differences contribute. The remainder of this paper is centered around these empirical questions framed by the intuition presented as the convergence trends.
Although whether an ISA is RISC or CISC seems irrelevant, ISAs are evolving; expressing more semantic information has led to improved performance (x86 SSE, larger address space), better security (ARM TrustZone), better virtualization, etc. Examples in current research include extensions to allow the hardware to balance accuracy with energy efficiency [15, 13] and extensions to use specialized hardware for energy efficiency [18]. We revisit this issue in our conclusions.
Infrastructure
We now describe our infrastructure and tools. The key takeaway is that we pick four platforms, doing our best to keep them on equal footing, pick representative workloads, and use rigorous methodology and tools for measurement. Readers can skip ahead to Section 4 if uninterested in the details.
Implementation Rationale and Challenges
Choosing implementations presents multiple challenges due to differences in technology (technology node, frequency, high-performance/low-power transistors, etc.); ISA-independent microarchitecture (L2-cache, memory controller, memory size, etc.); and system effects (operating system, compiler, etc.). Finally, platforms must be commercially relevant and it is unfair to compare platforms from vastly different time-frames.
We investigated a wide spectrum of platforms spanning Intel Nehalem, Sandybridge, AMD Bobcat, NVIDIA Tegra-2, NVIDIA Tegra-3, and Qualcomm Snapdragon. However, we did not find implementations that met all of our criteria: same technology node across the different ISAs, identical or similar microarchitecture, development board that supported necessary measurements, a well-supported operating system, and similar I/O and memory subsystems. We ultimately picked the Beagleboard (Cortex-A8), Pandaboard (Cortex-A9), and Atom board, as they include processors with similar microarchitectural features like issue-width, caches, and main-memory and are from similar technology nodes, as described in Tables 2 and 7. They are all relevant commercially as shown by the last row in Table 2. For a high-performance x86 processor, we use an Intel i7 Sandybridge processor; it is significantly more power-efficient than any 45nm offering, including Nehalem. Importantly, these choices provided usable software platforms in terms of operating system, cross-compilation, and driver support. Overall, our choice of platforms provides a reasonably equal footing, and we perform detailed analysis to isolate microarchitecture and technology effects. We present system details of our platforms for context, although the focus of our work is the processor core.
A key challenge in running real workloads was the relatively small memory (512MB) on the Cortex-A8 Beagleboard. While representative of the typical target (e.g., iPhone 4 has 512MB RAM), it presents a challenge for workloads like SPEC CPU2006; execution times are dominated by swapping and OS overheads, making the core irrelevant. Section 3.3 describes how we handled this. In the remainder of this section, we discuss the platforms, applications, and tools for this study in detail.
Implementation Platforms
Hardware platform:
We consider two chip implementations each for the ARM and x86 ISAs as described in Table 2. Intent: Keep non-processor features as similar as possible. Across all platforms, we run the same stable Linux 2.6 LTS kernel with some minor board-specific patches to obtain accurate results when using the performance counter subsystem. We use perf's¹ program sampling to find the fraction of time spent in the kernel while executing the SPEC benchmarks on all four boards; overheads were less than 5% for all but GemsFDTD and perlbench (both less than 10%) and the fraction of time spent in the operating system was virtually identical across platforms spanning ISAs. Intent: Keep OS effects as similar as possible across platforms.
Compiler:
Our toolchain is based on a validated gcc 4.4 based cross-compiler configuration. We intentionally chose gcc so that we can use the same front-end to generate all binaries. All target-independent optimizations are enabled (O3); machine-specific tuning is disabled so there is a single set of ARM binaries and a single set of x86 binaries. For x86 we target 32-bit since 64-bit ARM platforms are still under development. For ARM, we disable THUMB instructions for a more RISC-like ISA. We ran experiments to determine the impact of machine-specific optimizations and found that these impacts were less than 5% for over half of the SPEC suite, and caused performance variations of ±20% on the remaining benchmarks, with speed-ups and slow-downs equally likely. None of the benchmarks include SIMD code, and although we allow auto-vectorization, very few SIMD instructions are generated for either architecture. Floating point is done natively on the SSE (x86) and NEON (ARM) units. Vendor compilers may produce better code for a platform, but we use gcc to eliminate compiler influence. As seen in Table 12 in Appendix I of an accompanying technical report [10], static code size is within 8% and average instruction lengths are within 4% using gcc and icc for SPEC INT, so we expect that the compiler does not make a significant difference.
Intent: Hold compiler effects constant across platforms.
¹ perf is a Linux utility to access performance counters.
Since both ISAs are touted as candidates for mobile clients, desktops, and servers, we consider a suite of workloads that span these. We use prior workload studies to guide our choice, and where appropriate we pick equivalent workloads that can run on our evaluation platforms. A detailed description follows and is summarized in Table 3. All workloads are single-threaded to ensure our single-core focus.
Mobile client:
This category presented challenges as mobile client chipsets typically include several accelerators and careful analysis is required to determine the typical workload executed on the programmable general-purpose core. We used CoreMark (www.coremark.org), widely used in industry white-papers, and two WebKit regression tests informed by the BBench study [19]. BBench, a recently proposed smartphone benchmark suite, is "a web-page rendering benchmark comprising 11 of the most popular sites on the internet today" [19]. To avoid web-browser differences across the platforms, we use the cross-platform WebKit with two of its built-in tests that mimic real-world HTML layout and performance scenarios for our study².
Desktop:
We use the SPEC CPU2006 suite (www.spec.org) as representative of desktop workloads. SPEC CPU2006 is a well-understood standard desktop benchmark, providing insights into core behavior. Due to the large memory footprint of the train and reference inputs, we found that for many benchmarks the memory-constrained Cortex-A8, in particular, ran out of memory and execution was dominated by system effects. Instead, we report results using the test inputs, which fit in the Cortex-A8's memory footprint for 10 of 12 INT and 10 of 17 FP benchmarks.
Server:
We chose server workloads informed by the CloudSuite workloads recently proposed by Ferdman et al. [16]. Their study characterizes server/cloud workloads into data analytics, data streaming, media streaming, software testing, web search, and web serving. The actual software implementations they provide are targeted for large memory-footprint machines and their intent is to benchmark the entire system and server cluster. This is unsuitable for our study since we want to isolate processor effects. Hence, we pick implementations with small memory footprints and single-node behavior. To represent data-streaming and data-analytics, we use three database kernels commonly used in database evaluation work [26, 23] that capture the core computation in Bayes classification and data- 
Tools
The four main tools we use in our work are described below and Table 5 in Section 4 describes how we use them.
Native execution time and microarchitectural events:
We use wall-clock time and performance-counter-based clock-cycle measurements to determine execution time of programs. We also use performance counters to understand microarchitecture influences on the execution time. Each of the processors has different counters available, and we examined them to find comparable measures. Ultimately three counters explain much of the program behavior: branch misprediction rate, Level-1 data-cache miss rate, and Level-1 instruction-cache miss rate (all measured as misses per kilo-instructions). We use the perf tool for performance counter measurement.
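The misses-per-kilo-instruction normalization mentioned above can be sketched as follows. This is a minimal illustration, not the authors' tooling: the function name `mpki` and the counter readings are hypothetical, and the counts are assumed to have already been collected (for example with perf).

```python
def mpki(event_count: int, instructions: int) -> float:
    """Return events (e.g. cache misses) per 1000 retired instructions."""
    if instructions <= 0:
        raise ValueError("instructions must be positive")
    return 1000.0 * event_count / instructions


# Hypothetical counter readings for one benchmark run (not measured data).
counters = {
    "branch_misses": 1_200_000,
    "l1d_misses": 4_500_000,
    "l1i_misses": 300_000,
}
instructions = 600_000_000

# One normalized rate per counter, comparable across runs of different length.
rates = {name: mpki(count, instructions) for name, count in counters.items()}
```

Normalizing by instruction count rather than by time keeps the rates comparable across cores that run at different frequencies.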
Power:
For power measurements, we connect a WattsUp (www.wattsupmeters.com) meter to the board (or desktop) power supply. This gives us system power. We run the benchmark repeatedly to find consistent average power as explained in Table 5. We use a control run to determine the board power alone when the processor is halted and subtract away this board power to determine chip power. Some recent power studies [14, 21, 9] accurately isolate the processor power alone by measuring the current supply line of the processor. This is not possible for the SoC-based ARM development boards, and hence we determine and then subtract out the board power. This methodology allows us to eliminate the main memory and I/O power and examine only processor power. We validated our strategy for the i7 system using the exposed energy counters (the only platform we consider that includes isolated power measures). Across all three benchmark suites, our WattsUp methodology compared to the processor energy counter reports ranged from 4% to 17% less, averaging 12%. Our approach tends to under-estimate core power, so our results for power and energy are optimistic. We saw average power of 800mW, 1.2W, 5.5W, and 24W for A8, A9, Atom, and i7 (respectively) and these fall within the typical vendor-reported power numbers.
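The subtraction described above can be illustrated with a short sketch. The function names and all sample values are hypothetical; the code only assumes that system power samples and a halted-processor board reading are available in watts.

```python
from statistics import mean


def chip_power_watts(system_samples, board_idle_watts):
    """Estimate chip power as average system power minus halted-board power."""
    return mean(system_samples) - board_idle_watts


def relative_shortfall(estimate, reference):
    """Fraction by which an estimate falls below a reference measurement."""
    return (reference - estimate) / reference


# Hypothetical meter readings in watts (not measured data).
estimate = chip_power_watts([4.0, 4.2, 4.1], board_idle_watts=3.0)

# Hypothetical comparison against an on-chip energy counter reading.
shortfall = relative_shortfall(22.0, 25.0)
```

A positive shortfall corresponds to the under-estimation the text reports when the meter-based method is checked against the i7 energy counters.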
Technology scaling and projections:
Since the i7 processor is 32nm and the Cortex-A8 is 65nm, we use technology node characteristics from the 2007 ITRS tables to normalize to the 45nm technology node in two results where we factor out technology; we do not account for device type (LOP, HP, LSTP). For our 45nm projections, the A8's power is scaled by 0.8x and the i7's power by 1.3x. In some results, we scale frequency to 1 GHz, accounting for DVFS impact on voltage using the mappings disclosed for Intel SCC [5]. When frequency scaling, we assume that 20% of the i7's power is static and does not scale with frequency; all other cores are assumed to have negligible static power. When frequency scaling, A8's power is scaled by 1.2x, Atom's power by 0.8x, and i7's power by 0.6x. We acknowledge that this scaling introduces some error to our technology-scaled power comparison, but feel it is a reasonable strategy and doesn't affect our primary findings (see Table 4).
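As a rough sketch of how the scaling factors quoted above might be applied, the snippet below multiplies a raw power value by a technology factor and a frequency factor. Two points are assumptions rather than statements from the text: that the two factors compose by multiplication, and that a core with no stated factor is left unscaled (factor 1.0). The dictionary and function names are hypothetical.

```python
# Factors quoted in the text; cores not listed are ASSUMED to need no scaling.
TECH_FACTOR_45NM = {"A8": 0.8, "i7": 1.3}
FREQ_FACTOR_1GHZ = {"A8": 1.2, "Atom": 0.8, "i7": 0.6}


def scaled_power(core: str, raw_watts: float) -> float:
    """Scale raw power to 45nm and 1 GHz (multiplicative composition assumed)."""
    tech = TECH_FACTOR_45NM.get(core, 1.0)
    freq = FREQ_FACTOR_1GHZ.get(core, 1.0)
    return raw_watts * tech * freq


# Composite factor per core, obtained by scaling a nominal 1.0 W input.
composite = {core: scaled_power(core, 1.0) for core in ("A8", "A9", "Atom", "i7")}
```

Feeding a nominal 1.0 W input exposes the combined factor per core without implying anything about the measured power values.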
Emulated instruction mix measurement:
For the x86 ISA, we use DynamoRIO [11] to measure instruction mix. For the ARM ISA, we leverage the gem5 [8] simulator's functional emulator to derive instruction mixes (no ARM binary emulation available). Our server and mobile-client benchmarks use many system calls that do not work in the gem5 functional mode. We do not present detailed instruction-mix analysis for these, but instead present high-level mix determined from performance counters. We use the MICA tool to find the available ILP [20].
Limitations or Concerns
Our study's limitations are classified into core diversity, domain, tool, and scaling effects. The full list appears in Table 4. Throughout our work, we focus on what we believe to be the first-order effects for performance, power, and energy and feel our analysis and methodology is rigorous. Other more detailed methods may exist, and we have made the data publicly available at www.cs.wisc.edu/vertical/isa-power-struggles to allow interested readers to pursue their own detailed analysis.
Methodology
In this section, we describe how we use our tools and the overall flow of our analysis. Section 5 presents our data and analysis. Table 5 describes how we employ the aforementioned tools and obtain the measures we are interested in, namely, execution time, execution cycles, instruction-mix, microarchitecture events, power, and energy.
Our overall approach is to understand all performance and power differences and use the measured metrics to quantify the root cause of differences and whether or not ISA differences contribute, answering empirical questions from Section 2. Unless otherwise explicitly stated, all data is measured on real hardware. The flow of the next section is outlined below.
Performance Analysis Flow
Step 1: Present execution time for each benchmark.
Step 2: Normalize frequency's impact using cycle counts.
Step 3: To understand differences in cycle count and the influence of the ISA, present the dynamic instruction count measures, measured in both macro-ops and micro-ops.
Step 4: Use instruction mix, code binary size, and average dynamic instruction length to understand ISA's influence.
Step 5: To understand performance differences not attributable to ISA, look at detailed microarchitecture events.
Step 6: Attribute performance gaps to frequency, ISA, or ISA-independent microarchitecture features. Qualitatively reason about whether the ISA forces microarchitecture features.
Power and Energy Analysis Flow
Step 1: Present per benchmark raw power measurements.
Step 2: To factor out the impact of technology, present technology-independent power by scaling all processors to 45nm and normalizing the frequency to 1 GHz.
Step 3: To understand the interplay between power and performance, examine raw energy.
Step 4: Qualitatively reason about the ISA influence on microarchitecture in terms of energy.
Trade-off Analysis Flow
Step 1: Combining the performance and power measures, compare the processor implementations using Pareto-frontiers.
Step 2: Compare measured and synthetic processor implementations using Energy-Performance Pareto-frontiers.
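To make the Pareto-frontier comparison in the trade-off flow concrete, here is a minimal sketch that picks the non-dominated design points from a set of (performance, energy) pairs. The function name, the tuple layout, and the four design points are hypothetical; higher performance and lower energy are both treated as better.

```python
def pareto_frontier(points):
    """Return names of design points that no other point dominates.

    points: iterable of (name, performance, energy) tuples, where higher
    performance is better and lower energy is better.
    """
    points = list(points)
    frontier = []
    for name, perf, energy in points:
        dominated = any(
            other_perf >= perf
            and other_energy <= energy
            and (other_perf > perf or other_energy < energy)
            for _, other_perf, other_energy in points
        )
        if not dominated:
            frontier.append(name)
    return frontier


# Hypothetical design points (not measured data).
designs = [("a", 1.0, 1.0), ("b", 2.0, 1.5), ("c", 1.5, 2.0), ("d", 3.0, 4.0)]
optimal = pareto_frontier(designs)
```

In this example, point "c" is dominated by "b", which is both faster and lower energy, so only "a", "b", and "d" remain on the frontier.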
Measured Data Analysis and Findings
We now present our measurements and analysis of performance, power, energy, and the trade-offs between them. We conclude the section with sensitivity studies projecting performance of additional implementations of the ARM and x86 ISA using a simple performance and power model. We present our data for all four platforms, often comparing A8 to Atom (both dual-issue in-order) and A9 to i7 (both OoO) since their implementations are pair-wise similar. For each step, we present the average measured data, average in-order and OoO ratios if applicable, and then our main findings. When our analysis suggests that some benchmarks are outliers, we give averages with the outliers included in parentheses.
Performance Analysis
Step 1: Execution Time Comparison
Data: Figure 2 shows execution time normalized to i7; averages including outliers are given using parentheses. Average ratios are in the table below. Per benchmark data is in Figure 16 of Appendix I in an accompanying technical report [10].
Outliers: A8 performs particularly poorly on WebKit tests and lighttpd, skewing A8/Atom differences in the mobile and server data, respectively; see details in Step 2. Five SPEC FP benchmarks are also considered outliers; see Table 8. Where outliers are listed, they are in this set.
Finding P1: Large performance gaps are platform and benchmark dependent: A9 to i7 performance gaps range from 5x to 102x and A8 to Atom gaps range from 2x to 997x.
Key Finding 1: Large performance gaps exist across the four platforms studied, as expected, since frequency ranges from 600 MHz to 3.4 GHz and microarchitectures are very different.
Step 2: Cycle-Count Comparison
Data: Figure 3 shows cycle counts normalized to i7. Per benchmark data is in Figure 7.
Finding P2: Per suite cycle count gaps between out-of-order implementations A9 and i7 are less than 2.5x (no outliers).
Finding P3: Per suite cycle count gaps between in-order implementations A8 and Atom are less than 1.5x (no outliers).
Key Finding 2: Performance gaps, when normalized to cycle counts, are less than 2.5x when comparing in-order cores to each other and out-of-order cores to each other.
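The effect of normalizing execution time to cycle counts can be sketched as below. This is an approximation for illustration only: it derives cycles as time multiplied by a fixed clock frequency, whereas the study also reads cycle counts from performance counters. The function names and the execution times are hypothetical; the two frequencies are the endpoints of the range quoted in Key Finding 1.

```python
def cycle_count(exec_seconds: float, freq_hz: float) -> float:
    """Approximate cycles as execution time times a constant clock frequency."""
    return exec_seconds * freq_hz


def normalized_cycles(exec_seconds, freq_hz, ref_seconds, ref_freq_hz):
    """Cycle count of one core relative to a reference core."""
    return cycle_count(exec_seconds, freq_hz) / cycle_count(ref_seconds, ref_freq_hz)


# Hypothetical run: a 600 MHz core takes 10 s, a 3.4 GHz reference takes 1 s.
time_gap = 10.0 / 1.0
cycle_gap = normalized_cycles(10.0, 600e6, 1.0, 3.4e9)
```

In this made-up case a 10x execution-time gap shrinks to a cycle-count gap below 2x, which is the kind of reduction Step 2 is designed to expose.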
Step 3: Instruction Count Comparison
Data: Figure 4a shows dynamic instruction (macro) counts on A8 and Atom normalized to Atom x86 macro-instructions. Per benchmark data is in Figure 17a and derived CPIs are in Table 11 in Appendix I of [10].
Data: Figure 4b shows dynamic micro-op counts for Atom and i7 normalized to Atom macro-instructions⁵. Per benchmark data is in Figure 17b in Appendix I of [10].
Outliers: For wkperf and lighttpd, A8 executes more than twice as many instructions as A9⁶. We report A9 instruction counts for these two benchmarks. For CLucene, x86 machines execute 1.7x more instructions than ARM machines; this appears to be a pathological case of x86 code generation inefficiencies. For cactusADM, Atom executes 2.7x more micro-ops than macro-ops; this extreme is not seen for other benchmarks.
Finding P4: Instruction count similar across ISAs. Implies gcc picks the RISC-like instructions from the x86 ISA.
Finding P5: All ARM outliers in SPEC FP due to transcendental FP operations supported only by x86.
Finding P6: x86 micro-op to macro-op ratio is often less than 1.3x, again suggesting gcc picks the RISC-like instructions.
Key Finding 3: Instruction and cycle counts imply CPI is less on x86 implementations: geometric mean CPI is 3.4 for A8, 2.2 for A9, 2.1 for Atom, and 0.7 for i7 across all suites. x86 ISA overheads, if any, are overcome by microarchitecture.
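The CPI figures in Key Finding 3 combine cycle and instruction counts and are summarized with a geometric mean. A minimal sketch of both calculations follows; the function names and the per-benchmark counts are hypothetical and do not reproduce the paper's data.

```python
import math


def cpi(cycles: float, instructions: float) -> float:
    """Cycles per instruction for one benchmark run."""
    return cycles / instructions


def geometric_mean(values):
    """Geometric mean of positive values, computed in log space."""
    values = list(values)
    if not values or any(v <= 0 for v in values):
        raise ValueError("values must be non-empty and positive")
    return math.exp(sum(math.log(v) for v in values) / len(values))


# Hypothetical (cycles, instructions) pairs for three benchmarks on one core.
runs = [(3.0e9, 1.0e9), (4.0e9, 2.0e9), (1.5e9, 1.0e9)]
per_benchmark_cpi = [cpi(c, i) for c, i in runs]
suite_cpi = geometric_mean(per_benchmark_cpi)
```

The geometric mean is a common choice for averaging ratios such as CPI because a single slow benchmark skews it less than it would skew an arithmetic mean.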
Step 4: Instruction Format and Mix
Data: Table 6a shows average ARM and x86 static binary sizes, measuring only the binary's code sections. Per benchmark data is in Table 12a in Appendix I of [10].
Data: Table 6b shows average dynamic ARM and x86 instruction lengths. Per benchmark data is in Table 12b in Appendix I of [10].
Outliers: CLucene binary (from server suite) is almost 2x larger for x86 than ARM; the server suite thus has the largest span in binary sizes. ARM executes correspondingly few instructions; see outliers discussion in Step 3.
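An average dynamic instruction length of the kind reported in Table 6b can be computed as an execution-weighted mean of encoding lengths. The sketch below is illustrative only: the function name and the length histogram are hypothetical, and it assumes a per-length execution count has already been gathered by an instrumentation tool.

```python
def average_dynamic_length(executed):
    """Execution-weighted mean instruction length in bytes.

    executed: mapping of encoding length in bytes -> dynamic execution count.
    """
    total = sum(executed.values())
    if total == 0:
        raise ValueError("no executed instructions")
    return sum(length * count for length, count in executed.items()) / total


# Hypothetical histograms (not measured data).
variable_length = average_dynamic_length({2: 50, 3: 30, 6: 20})
fixed_length = average_dynamic_length({4: 100})  # every instruction is 4 bytes
```

With ARM's fixed 4-byte encoding (THUMB disabled) the average is always 4 bytes, while a variable-length ISA can land above or below that depending on which instructions are executed.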
Finding P7: Average ARM and x86 binary sizes are similar for SPEC INT, SPEC FP, and Mobile workloads, suggesting similar code densities.
Finding P8: Executed x86 instructions are on average up to 25% shorter than ARM instructions: short, simple x86 instructions are typical.
Finding P9: x86 FP benchmarks, which tend to have more complex instructions, have instructions with longer encodings (e.g., cactusADM with 6.4 Bytes/inst on average).
Data: Figure 5 shows average coarse-grained ARM and x86 instruction mixes for each benchmark suite⁷.
Data: Figure 6 shows fine-grained ARM and x86 instruction mixes normalized to x86 for a subset of SPEC benchmarks⁷.
⁷ x86 instructions with memory operands are cracked into a memory operation and the original operation.
Finding P10: Fraction of loads and stores similar across ISA for all suites, suggesting that the ISA does not lead to significant differences in data accesses.
Finding P11: Large instruction counts for ARM are due to absence of FP instructions like fsincos and fyl2xp1 (e.g., for tonto in Figure 6, the many special x86 instructions correspond to ALU/logical/multiply ARM instructions).
Key Finding 4: Combining the instruction-count and mix findings, we conclude that ISA effects are indistinguishable between x86 and ARM implementations.
Step 5: Microarchitecture
Data: Figure 7 shows the per-benchmark cycle counts for more detailed analysis where performance gaps are large.
Data: Table 7 compares the A8 microarchitecture to Atom, and A9 to i7, focusing on the primary structures. These details are from five Microprocessor Report articles⁸ and the A9 numbers are estimates derived from publicly disclosed information on A15 and A9/A15 comparisons.
Finding P12: A9 and i7's different issue widths (2 versus 4, respectively)¹⁰ explain performance differences up to 2x, assuming sufficient ILP, a sufficient instruction window, and a well-balanced processor pipeline. We use MICA to confirm that our benchmarks all have limit ILP greater than 4 [20].
Finding P13: Even with different ISAs and significant differences in microarchitecture, for 12 benchmarks, the A9 is within 2x the cycle count of i7 and can be explained by the difference in issue width.
Data: Figures 8, 9, and 10 show branch mispredictions and L1 data and instruction cache misses per 1000 ARM instructions.
Finding P14: Observe large microarchitectural event count differences (e.g., A9 branch misses are more common than i7 branch misses). These differences are not because of the ISA, but rather due to microarchitectural design choices (e.g., A9's BTB has 512 entries versus i7's 16K entries).
Benchmark abbreviations: wk_lay: webkit_layout, wk_perf: webkit_perf, libq: libquantum, perl: perlbench, omnt: omnetpp, Gems: GemsFDTD, cactus: cactusADM, db: database, light: lighttpd.
Finding P15: Per benchmark, we can attribute the largest gaps in i7 to A9 performance (and in Atom to A8 performance) to specific microarchitectural events. In the interest of space, we present example analyses for benchmarks with gaps greater than 3x in Table 8; bwaves details are in Appendix II of [10].
Key Finding 5:
The microarchitecture has significant impact on performance. The ARM and x86 architectures have similar instruction counts. The highly accurate branch predictor and large caches, in particular, effectively allow x86 architectures to sustain high performance. x86 performance inefficiencies, if any, are not observed. The microarchitecture, not the ISA, is responsible for performance differences.
Step 6: ISA influence on microarchitecture
Key Finding 6: As shown in Table 7, there are significant differences in microarchitectures. Drawing upon instruction mix and instruction count analysis, we feel that the only case where the ISA forces larger structures is on the ROB size, physical rename file size, and scheduler size since there are almost the same number of x86 micro-ops in flight compared to ARM instructions. The difference is small enough that we argue it is not necessary to quantify further. Beyond the translation to micro-ops, pipelined implementation of an x86 ISA introduces no additional overheads over an ARM ISA for these performance levels.
Power and Energy Analysis
In this section, we normalize to A8 as it uses the least power. Per-benchmark data corresponding to Figures 11, 12, and 13 are in Figures 18, 19, and 20 in Appendix I of [10].
Step 1: Average Power
Data: Figure 11 shows average power normalized to the A8. Overall, x86 implementations consume significantly more power than ARM implementations.
Step 2: Average Technology Independent Power
Data: Figure 12 shows technology-independent average power; cores are scaled to 1 GHz at 45nm (normalized to A8).
Finding E1: With frequency and technology scaling, ISA appears irrelevant for power-optimized cores: A8, A9, and Atom are all within 0.6x of each other (A8 consumes 29% more power than A9). Atom is actually lower power than A8 and A9.
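As a rough illustration of the normalization in Step 2, the sketch below scales a measured average power to 1 GHz at 45nm and then normalizes to a baseline core. It assumes that power is linear in clock frequency and that a single per-node factor converts power to the 45nm node; both modeling choices, the function names, and every number are assumptions made for illustration, not measured values or the exact procedure used for Figure 12.

```python
def scale_power(measured_watts, freq_ghz, tech_factor, target_ghz=1.0):
    """Scale measured power to a target frequency and technology node.

    Assumes power is linear in frequency (an assumption of this sketch) and
    that tech_factor converts power at the core's node to 45nm.
    """
    if freq_ghz <= 0:
        raise ValueError("freq_ghz must be positive")
    return measured_watts * (target_ghz / freq_ghz) * tech_factor


def normalize(powers, baseline):
    """Express each core's power relative to the baseline core."""
    base = powers[baseline]
    return {core: p / base for core, p in powers.items()}


# Hypothetical inputs: (measured W, clock GHz, node-to-45nm factor).
cores = {
    "core_a": (0.5, 0.5, 0.8),
    "core_b": (2.0, 2.0, 1.0),
}
scaled = {name: scale_power(*vals) for name, vals in cores.items()}
relative = normalize(scaled, "core_a")
```

With these placeholder inputs, a 4x gap in measured power shrinks to 1.25x once frequency and node are factored out, which is the kind of effect the technology-independent comparison is meant to expose.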
Finding E2: i7 is performance, not power, optimized. Per-suite power costs are 6.1x to 7.6x higher for i7 than A9 with 1.7x to 7.0x higher frequency-independent performance (Figure 3 cycle count performance).
Key Finding 8: The choice of power- or performance-optimized core designs impacts core power use more than ISA.
Step 3: Average Energy
Data: Figure 13 shows energy (product of power and time).
Finding E3: Despite power differences, Atom consumes less energy than A8, and i7 uses only slightly more energy than A9, due primarily to faster execution times, not ISA.
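Because energy is the product of power and time, a core that draws more power can still use less energy if it finishes sooner. A minimal sketch of that arithmetic follows; the function name and the example ratios are hypothetical and are not taken from Figure 13.

```python
def relative_energy(power_ratio, speedup):
    """Energy relative to a baseline core for the same work.

    energy = power * time, and execution time shrinks by `speedup`,
    so the energy ratio is power_ratio / speedup.
    """
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return power_ratio / speedup


# Hypothetical core: draws 2x the baseline power but finishes 3x sooner.
ratio = relative_energy(power_ratio=2.0, speedup=3.0)
print(f"{ratio:.2f}x baseline energy")
```

A ratio below 1 means the faster core uses less energy for the same work despite its higher power draw.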
Finding E4: For "hard" benchmarks with high cache miss rates that leave the core poorly utilized (e.g., many in SPEC FP), fixed energy costs from structures provided for high performance make i7's energy 2x to 3x worse than A9.
Step 4: ISA impact on microarchitecture
Data: Table 7 outlined microarchitecture features.
Finding E5: The energy impact of the ISA is that it requires micro-ops translation and an additional micro-ops cache. Further, since the number of micro-ops is not significantly higher, the energy impact of x86 support is small.
Finding E6: Other power-hungry structures like a large L2 cache, highly associative TLB, aggressive prefetcher, and large branch predictor seem dictated primarily by the performance level and application domain targeted by the Atom and i7 processors and are not necessitated by x86 ISA features.
Trade-off Analysis
Step 1: Power-Performance Trade-offs
Data: Figure 14 shows the geometric mean power-performance trade-off for all benchmarks using technology-node-scaled power. We generate a cubic curve for the power-performance trade-off. Given our small sample set, a core's location on the frontier does not imply that it is optimal.
Figure 14. Power Performance Trade-offs (x-axis: Performance in BIPS).
Finding T1: A9 provides 3.5x better performance using 1.8x the power of A8.
Finding T2: i7 provides 6.2x better performance using 10.9x the power of Atom.
Finding T3: i7's microarchitecture has high energy cost when performance is low: benchmarks with the smallest performance gap between i7 and A9 (star in Figure 14; see footnote 11) have only 6x better performance than A9 but use more than 10x more power.
Key Finding 10: Regardless of ISA or energy efficiency, high-performance processors require more power than lower-performance processors. They follow well-established cubic power/performance trade-offs.
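One simple way to draw a cubic trade-off curve through a handful of cores is a least-squares fit of power = a * perf^3 through the origin. The sketch below assumes that single-coefficient form, which may differ from the curve actually fitted for Figure 14; the function name and the sample points are hypothetical.

```python
def fit_power_law_coefficient(points, exponent=3):
    """Least-squares fit of power = a * perf**exponent through the origin.

    points: iterable of (perf_bips, power_watts) pairs.
    Closed form: a = sum(power * perf^k) / sum(perf^(2k)).
    """
    points = list(points)
    num = sum(power * perf ** exponent for perf, power in points)
    den = sum(perf ** (2 * exponent) for perf, _ in points)
    if den == 0:
        raise ValueError("need at least one point with nonzero performance")
    return num / den


# Hypothetical (BIPS, watts) samples lying exactly on power = 0.5 * perf^3.
samples = [(1.0, 0.5), (2.0, 4.0), (3.0, 13.5)]
a = fit_power_law_coefficient(samples)
predicted = a * 4.0 ** 3  # extrapolated power at 4 BIPS under the fitted curve
```

Passing `exponent=2` to the same function fits a quadratic curve of the kind used for the energy-performance trade-off.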
Step 2: Energy-Performance Trade-offs
Data: Figure 15 shows the geometric mean energy-performance trade-off using technology-node-scaled energy. We generate a quadratic energy-performance trade-off curve. Again, a core's location on the frontier does not imply optimality. Synthetic processor points beyond the four processors studied are shown using hollow points; we consider a performance-targeted ARM core (A15) and frequency-scaled A9, Atom, and i7 cores. A15 BIPS are from reported CoreMark scores; details on synthetic points are in Appendix III of [10].
Figure 15. Energy Performance Trade-offs (x-axis: Performance in BIPS).
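The quadratic shape of the energy curve is consistent with the cubic power curve of Step 1. As a sketch, assume power follows a pure cubic in performance with coefficient $a$, and that a fixed workload of $N$ billion instructions runs at a steady rate; both $a$ and $N$ are symbols introduced here for illustration.

```latex
\begin{align*}
P &\approx a \cdot \mathrm{BIPS}^{3}, \qquad t = \frac{N}{\mathrm{BIPS}} \\
E &= P \cdot t \approx a N \cdot \mathrm{BIPS}^{2}
\end{align*}
```

Under these assumptions, energy for a fixed amount of work grows with the square of performance.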
Finding T4: Regardless of ISA, power-only or performance-only optimized cores have high energy overheads (see A8 and i7).
Finding T5: Balancing power and performance leads to energy-efficient cores, regardless of the ISA: A9 and Atom processor energy requirements are within 24% of each other and use up to 50% less energy than other cores.
Finding T6: DVFS and microarchitectural techniques can provide high energy-efficiency to performance-optimized cores, regardless of the ISA: i7 at 2 GHz provides 6x performance at the same energy level as an A9.
Finding T7: We consider the energy-delay metric (ED) to capture both performance and power. Cores designed balancing power and performance constraints show the best energy-delay product: A15 is 46% lower than any other design we considered.
Finding T8: When weighting the importance of performance only slightly more than power, high-performance cores seem best suited. Considering ED^1.4, i7, a performance-optimized core, is best (lowest product, and 6x higher performance). Considering ED^2, i7 is more than 2x better than the next best design. See Appendix IV in [10] for more details.
Key Finding 11: It is the microarchitecture and design methodologies that really matter.
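The effect of the exponent in the energy-delay family of metrics can be seen with a small sketch that computes E * D^n and picks the core with the lowest product. The two cores and their (energy, delay) values are hypothetical, chosen only to show how the preferred design can flip as delay is weighted more heavily; they are not our measured data.

```python
def energy_delay(energy, delay, n=1.0):
    """Generalized energy-delay product E * D**n (lower is better)."""
    return energy * delay ** n


def best_core(cores, n):
    """Return the name of the core with the lowest E * D**n."""
    return min(cores, key=lambda name: energy_delay(*cores[name], n))


# Hypothetical (energy in joules, delay in seconds) for the same workload.
cores = {
    "balanced": (1.0, 2.0),  # low energy, slower
    "fast": (2.5, 1.0),      # higher energy, faster
}

for n in (1.0, 1.4, 2.0):
    print(f"n={n}: best is {best_core(cores, n)}")
```

In this toy data the balanced core wins on plain ED, while the faster core wins once the delay exponent reaches 1.4.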
Footnote 11: Seven SPEC, all mobile, and the non-database server benchmarks.
Table 10. Summary of findings. [...] between x86 and ARM (Figures 5 and 6). 5: Microarchitecture, not the ISA, responsible for performance differences (Table 8). 6: Beyond micro-op translation, x86 ISA [...]. It is the microarchitecture and design methodology that really matters.
Conclusions
In this work, we revisit the RISC vs. CISC debate considering contemporary ARM and x86 processors running modern workloads to understand the role of ISA on performance, power, and energy. During this study, we encountered infrastructure and system challenges, missteps, and software/hardware bugs. Table 9 outlines these issues as a potentially useful guide for similar studies. Our study suggests that whether the ISA is RISC or CISC is irrelevant, as summarized in Table 10, which includes a key representative quantitative measure for each analysis step. We reflect on whether there are certain metrics for which RISC or CISC matters, and place our findings in the context of past ISA evolution and future ISA and microarchitecture evolution.
Considering area normalized to the 45nm technology node, we observe that A8's area is 4.3 mm^2, AMD's Bobcat's area is 5.8 mm^2, A9's area is 8.5 mm^2, and Intel's Atom is 9.7 mm^2 [4, 25, 27]. The smallest, the A8, is smaller than Bobcat by 25%. We feel much of this is explained by simpler core design (in-order vs. OoO) and smaller caches, predictors, and TLBs. We also observe that the A9's area is in between Bobcat and Atom and is close to Atom's. Further detailed analysis is required to determine how much the ISA and the microarchitecture structures for performance contribute to these differences.
A related issue is the performance level for which our results hold. Considering very low performance processors, like the RISC ATmega324PA microcontroller with operating frequencies from 1 to 20 MHz and power consumption between 2 and 50 mW [3], the overheads of a CISC ISA (specifically the complete x86 ISA) are clearly untenable. In similar domains, even ARM's full ISA is too rich; the Cortex-M0, meant for low-power embedded markets, includes only a 56-instruction subset of Thumb-2. Our study suggests that at performance levels in the range of A8 and higher, RISC/CISC is irrelevant for performance, power, and energy. Determining the lowest performance level at which the RISC/CISC ISA effects are irrelevant for all metrics is interesting future work.
While our study shows that RISC and CISC ISA traits are irrelevant to power and performance characteristics of modern cores, ISAs continue to evolve to better support exposing workload-specific semantic information to the execution substrate. On x86, such changes include the transition to Intel64 (larger word sizes, optimized calling conventions, and shared code support), wider vector extensions like AVX, integer crypto and security extensions (NX), hardware virtualization extensions and, more recently, architectural support for transactions in the form of HLE. Similarly, the ARM ISA has introduced shorter fixed-length instructions for low-power targets (Thumb), vector extensions (NEON), DSP and bytecode execution extensions (Jazelle DBX), TrustZone security, and hardware virtualization support. Thus, while ISA evolution has been continuous, it has focused on enabling specialization and has been largely agnostic of RISC or CISC. Other examples from recent research include extensions to allow the hardware to balance accuracy and reliability with energy efficiency [15, 13] and extensions to use specialized hardware for energy efficiency [18].
It appears that decades of hardware and compiler research have enabled efficient handling of both RISC and CISC ISAs, and both are equally positioned for the coming years of energy-constrained innovation.
