411 research outputs found
A Detailed Analysis of Contemporary ARM and x86 Architectures
RISC vs. CISC wars raged in the 1980s when chip area and processor design complexity were the primary constraints and desktops and servers exclusively dominated the computing landscape. Today, energy and power are the primary design constraints and the computing landscape is significantly different: growth in tablets and smartphones running ARM (a RISC ISA) is surpassing that of desktops and laptops running x86 (a CISC ISA). Further, the traditionally low-power ARM ISA is entering the high-performance server market, while the traditionally high-performance x86 ISA is entering the mobile low-power device market. Thus, the question of whether ISA plays an intrinsic role in performance or energy efficiency is becoming important, and we seek to answer this question through a detailed measurement based study on real hardware running real applications. We analyze measurements on the ARM Cortex-A8 and Cortex-A9 and Intel Atom and Sandybridge i7 microprocessors over workloads spanning mobile, desktop, and server computing. Our methodical investigation demonstrates the role of ISA in modern microprocessors? performance and energy efficiency. We find that ARM and x86 processors are simply engineering design points optimized for different levels of performance, and there is nothing fundamentally more energy efficient in one ISA class or the other. The ISA being RISC or CISC seems irrelevant
Fluid Stochastic Petri Nets: From Fluid Atoms in ILP Processor Pipelines to Fluid Atoms in P2P Streaming Networks
© 2012 Mitrevski and Kotevski, licensee InTech. This is an open access chapter distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. Fluid Stochastic Petri Nets: From Fluid Atoms in ILP Processor Pipelines to Fluid Atoms in P2P Streaming Networ
Reducing the complexity of the register file in dynamic superscalar processors
Journal ArticleDynamic superscalar processors execute multiple instructions out-of-order by looking for independent operations within a large window. The number of physical registers within the processor has a direct impact on the size of this window as most in-flight instructions require a new physical register at dispatch. A large multi-ported register file helps improve the instruction-level parallelism (ILP), but may have a detrimental effect on clock speed, especially in future wire-limited technologies. In this paper, we propose a register file organization that reduces register file size and port requirements for a given amount of ILP. We use a two-level register file organization to reduce register file size requirements, and a banked organization to reduce port requirements. We demonstrate empirically that the resulting register file organizations have reduced latency and (in the case of the banked organization) energy requirements for similar instructions per cycle (IPC) performance and improved instructions per second (IPS) performance in comparison to a conventional monolithic register file. The choice of organization is dependent on design goals
An evaluation of the TRIPS computer system
The TRIPS system employs a new instruction set architecture (ISA) called Explicit Data Graph Execution (EDGE) that renegotiates the boundary between hardware and software to expose and exploit concurrency. EDGE ISAs use a block-atomic execution model in which blocks are composed of dataflow instructions. The goal of the TRIPS design is to mine concurrency for high performance while tolerating emerging technology scaling challenges, such as increasing wire delays and power consumption. This paper evaluates how well TRIPS meets this goal through a detailed ISA and performance analysis. We compare performance, using cycles counts, to commercial processors. On SPEC CPU2000, the Intel Core 2 outperforms compiled TRIPS code in most cases, although TRIPS matches a Pentium 4. On simple benchmarks, compiled TRIPS code outperforms the Core 2 by 10% and hand-optimized TRIPS code outperforms it by factor of 3. Compared to conventional ISAs, the block-atomic model provides a larger instruction window, increases concurrency at a cost of more instructions executed, and replaces register and memory accesses with more efficient direct instruction-to-instruction communication. Our analysis suggests ISA, microarchitecture, and compiler enhancements for addressing weaknesses in TRIPS and indicates that EDGE architectures have the potential to exploit greater concurrency in future technologies
Branch Prediction For Network Processors
Originally designed to favour flexibility over packet processing performance, the future of the programmable network processor is challenged by the need to meet both increasing line rate as well as providing additional processing capabilities. To meet these requirements, trends within networking research has tended to focus on techniques such as offloading computation intensive tasks to dedicated hardware logic or through increased parallelism. While parallelism retains flexibility, challenges such as load-balancing limit its scope. On the other hand, hardware offloading allows complex algorithms to be implemented at high speed but sacrifice flexibility. To this end, the work in this thesis is focused on a more fundamental aspect of a network processor, the data-plane processing engine.
Performing both system modelling and analysis of packet processing functions; the goal of this thesis is to identify and extract salient information regarding the performance of multi-processor workloads. Following on from a traditional software based analysis of programme workloads, we develop a method of modelling and analysing hardware accelerators when applied to network processors. Using this quantitative information, this thesis proposes an architecture which allows deeply pipelined micro-architectures to be implemented on the data-plane while reducing the branch penalty associated with these architectures
Exploring the design space for 3D clustered architectures
Journal Article3D die-stacked chips are emerging as intriguing prospects for the future because of their ability to reduce on-chip wire delays and power consumption. However, they will likely cause an increase in chip operating temperature, which is already a major bottleneck in modern microprocessor design. We believe that 3D will provide the highest performance benefit for high-ILP cores, where wire delays for 2D designs can be substantial. A clustered microarchitecture is an example of a complexity-effective implementation of a high-ILP core. In this paper, we consider 3D organizations of a single-threaded clustered microarchitecture to understand how floorplanning impacts performance and temperature. We first show that delays between the data cache and ALUs are most critical to performance. We then present a novel 3D layout that provides the best balance between temperature and performance. The best-performing 3D layout has 12% higher performance than the best-performing 2D layout
Improving processor efficiency by exploiting common-case behaviors of memory instructions
Processor efficiency can be described with the help of a number of desirable
effects or metrics, for example, performance, power, area, design
complexity and access latency.
These metrics serve as valuable tools used in designing new processors
and they also act as effective standards for comparing current processors.
Various factors impact the efficiency of modern out-of-order processors
and one important factor is the manner in which instructions are processed
through the processor pipeline.
In this dissertation research, we study the impact of load and store
instructions
(collectively known as memory instructions) on processor efficiency,
and show how to improve efficiency by exploiting common-case or
predictable patterns in the behavior of memory instructions.
The memory behavior patterns that we focus on in our research are
the predictability of memory dependences, the predictability in
data forwarding patterns,
predictability in instruction criticality and conservativeness
in resource allocation and
deallocation policies.
We first design a scalable and high-performance memory dependence
predictor and then apply
accurate memory dependence prediction to improve the efficiency of
the fetch engine of a simultaneous multi-threaded processor.
We then use predictable data forwarding patterns to eliminate power-hungry
hardware in the processor with no loss in performance. We then move to
studying instruction criticality to improve
processor efficiency. We study the behavior of critical load instructions
and propose applications that can be optimized using predictable,
load-criticality
information. Finally, we explore conventional techniques for
allocation and deallocation
of critical structures that process memory instructions and propose new
techniques to optimize the same. Our new designs have the potential to reduce
the power and the area required by processors significantly without losing
performance, which lead to efficient designs of processors.Ph.D.Committee Chair: Loh, Gabriel H.; Committee Member: Clark, Nathan; Committee Member: Jaleel, Aamer; Committee Member: Kim, Hyesoon; Committee Member: Lee, Hsien-Hsin S.; Committee Member: Prvulovic, Milo
- …