
    Statistical Machine Learning Based Modeling Framework for Design Space Exploration and Run-Time Cross-Stack Energy Optimization for Many-Core Processors

    The complexity of many-core processors continues to grow as a larger number of heterogeneous cores are integrated on a single chip. Such systems-on-chip contain computing structures ranging from complex out-of-order cores, simple in-order cores, digital signal processors (DSPs), graphics processing units (GPUs), application-specific processors, hardware accelerators, and I/O subsystems to network-on-chip interconnects and large caches arranged in complex hierarchies. While the industry focus is on putting a higher number of cores on a single chip, the key challenge is to optimally architect these many-core processors such that performance, energy, and area constraints are satisfied. The traditional approach to processor design through extensive cycle-accurate simulation is ill-suited for designing many-core processors because of the large microarchitectural design space that must be explored. Additionally, it is hard to optimize such complex processors, and the applications that run on them, statically at design time such that performance and energy constraints are met under dynamically changing operating conditions. This dissertation establishes a statistical machine learning based modeling framework that enables the efficient design and operation of many-core processors that meet performance, energy, and area constraints. We apply the proposed framework to rapidly design the microarchitecture of a many-core processor for multimedia, computer graphics rendering, finance, and data mining applications derived from the PARSEC benchmark suite. We further demonstrate the application of the framework in the joint run-time adaptation of both the application and the microarchitecture such that energy availability constraints are met.
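
    As a rough illustration of the surrogate-modeling idea behind this framework, the sketch below trains a cheap regression model on a few sampled design points and uses it to rank the rest of a toy design space. The parameter names, the synthetic stand-in for a cycle-accurate simulator, and the quadratic response surface are assumptions for illustration, not the dissertation's actual models.

```python
# Surrogate-assisted design space exploration: simulate a small sample of
# configurations, fit a response surface, then rank the full space cheaply.
import itertools
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical microarchitectural design space (assumed parameters).
cores    = [2, 4, 8, 16, 32]        # number of cores
l2_kb    = [256, 512, 1024, 2048]   # per-core L2 size (KB)
freq_ghz = [1.0, 1.5, 2.0, 2.5]     # clock frequency (GHz)
space = np.array(list(itertools.product(cores, l2_kb, freq_ghz)), dtype=float)

def simulate(cfg):
    """Stand-in for an expensive cycle-accurate simulation: returns energy-delay."""
    c, l2, f = cfg
    delay = 1.0 / (f * np.sqrt(c)) + 50.0 / l2
    energy = 0.4 * c * f**2 + 0.0005 * l2
    return energy * delay + rng.normal(0, 0.01)

# "Simulate" only a small random sample of the space.
train_idx = rng.choice(len(space), size=20, replace=False)
X_train = space[train_idx]
y_train = np.array([simulate(cfg) for cfg in X_train])

# Fit a simple quadratic response surface with least squares.
def features(X):
    return np.column_stack([np.ones(len(X)), X, X**2])

coef, *_ = np.linalg.lstsq(features(X_train), y_train, rcond=None)

# Use the surrogate to rank all configurations and pick the predicted best.
pred = features(space) @ coef
best = space[np.argmin(pred)]
print("predicted best (cores, L2 KB, GHz):", best)
```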

    Performance and power comparisons between Fermi and Cypress GPUs

    In recent years, modern graphics processing units (GPUs) have been widely adopted in high performance computing to solve large-scale computation problems. The leading GPU manufacturers Nvidia and AMD have introduced series of products to the market. While sharing many similar design concepts, GPUs from these two manufacturers differ in several aspects of their processor cores and memory subsystems. In this work, we conduct a comprehensive study to characterize and compare the architectural features of Nvidia's Fermi and AMD's Cypress GPUs. We first investigate the performance and power consumption of an AMD Cypress GPU. By employing a rigorous statistical model to analyze the execution behaviors of representative general-purpose GPU (GPGPU) applications, we gain insight into the target GPU architecture. Our results demonstrate that GPU execution throughput and power dissipation depend on different architectural variables. Furthermore, we design a set of micro-benchmarks to study the power consumption characteristics of different functional units on the GPU. Based on those results, we derive instructive principles that can guide the design of power-efficient high performance computing systems. We then shift our focus to the Nvidia Fermi GPU and compare it with the AMD product. Our results indicate that the two products have distinct advantages that are reflected in their performance on different sets of applications. In addition, we compare the energy efficiencies of the two platforms, since power/energy consumption is a major concern in high performance computing systems.
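
    The following sketch illustrates the general flavor of counter-based power modeling described here: a linear model is fitted that relates activity counters to measured power, exposing which architectural variables dominate. The counter names, synthetic data, and plain least-squares fit are assumptions, not the paper's statistical model.

```python
# Fit a linear power model from (synthetic) GPU activity counters.
import numpy as np

rng = np.random.default_rng(1)
n = 200  # number of kernel runs

# Hypothetical per-run activity rates (normalized counter values).
alu_util   = rng.uniform(0.1, 1.0, n)   # ALU utilization
mem_bw     = rng.uniform(0.0, 1.0, n)   # memory bandwidth utilization
branch_div = rng.uniform(0.0, 0.5, n)   # branch divergence fraction

# Synthetic "measured" board power in watts.
power = 60 + 90 * alu_util + 55 * mem_bw + 20 * branch_div + rng.normal(0, 3, n)

X = np.column_stack([np.ones(n), alu_util, mem_bw, branch_div])
coef, *_ = np.linalg.lstsq(X, power, rcond=None)
idle, w_alu, w_mem, w_br = coef
print(f"idle={idle:.1f}W  ALU={w_alu:.1f}W  mem={w_mem:.1f}W  branch={w_br:.1f}W")
```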

    Accelerating the Performance of a Novel Meshless Method Based on Collocation With Radial Basis Functions By Employing a Graphical Processing Unit as a Parallel Coprocessor

    In recent times, a variety of industries, applications, and numerical methods, including the meshless method, have enjoyed a great deal of success by utilizing the graphics processing unit (GPU) as a parallel coprocessor. These benefits often include performance improvements over previous implementations. Furthermore, applications running on graphics processors enjoy superior performance per dollar and performance per watt compared with implementations built exclusively on traditional central processing technologies. The GPU was originally designed for graphics acceleration, but the modern GPU, used as a general-purpose graphics processing unit (GPGPU), can be applied to scientific and engineering calculations. The GPGPU consists of a massively parallel array of integer and floating-point processors; there are typically hundreds of processors per graphics card, with dedicated high-speed memory. This work describes an application written by the author, titled GaussianRBF, that shows the implementation and results of a novel meshless method incorporating collocation with the Gaussian radial basis function while utilizing the GPU as a parallel coprocessor. Key phases of the proposed meshless method have been executed on the GPU using the NVIDIA CUDA software development kit. In particular, the matrix fill and solution phases have been carried out on the GPU, along with some post-processing. This approach resulted in decreased processing time compared to a similar algorithm implemented on the CPU, while maintaining the same accuracy.
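
    For readers unfamiliar with RBF collocation, the sketch below reproduces the two phases the abstract highlights, the matrix fill and the linear solve, using NumPy on the CPU to interpolate scattered 2-D data with a Gaussian RBF; in the described work these dense kernels are offloaded to the GPU with CUDA. The shape parameter, point set, test function, and ridge regularization are illustrative assumptions.

```python
# Gaussian RBF collocation on the CPU: fill the kernel matrix, solve for the
# expansion coefficients, and evaluate the interpolant at a new point.
import numpy as np

rng = np.random.default_rng(2)
eps = 6.0                                        # Gaussian shape parameter (assumed)
centers = rng.uniform(0.0, 1.0, size=(200, 2))   # scattered collocation points
f = np.sin(np.pi * centers[:, 0]) * np.cos(np.pi * centers[:, 1])  # sampled field

def gaussian_rbf(r):
    return np.exp(-(eps * r) ** 2)

# Matrix fill phase: A[i, j] = phi(||x_i - x_j||).
pairwise = np.linalg.norm(centers[:, None, :] - centers[None, :, :], axis=2)
# Tiny ridge term: Gaussian RBF systems are notoriously ill-conditioned.
A = gaussian_rbf(pairwise) + 1e-8 * np.eye(len(centers))

# Solution phase: expansion coefficients of the interpolant.
coeffs = np.linalg.solve(A, f)

# Evaluate the interpolant at an off-center point and compare with the truth.
x_new = np.array([0.3, 0.7])
phi_new = gaussian_rbf(np.linalg.norm(centers - x_new, axis=1))
print("interpolated:", phi_new @ coeffs,
      " exact:", np.sin(np.pi * 0.3) * np.cos(np.pi * 0.7))
```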

    Dynamic cache reconfiguration based techniques for improving cache energy efficiency

    Modern multicore processors employ large last-level caches (LLCs); for example, Intel's E7-8800 processor uses a 24 MB L3 cache. Further, with each CMOS technology generation, leakage energy has been increasing dramatically, and it is expected to become a major source of energy dissipation, especially in LLCs. Conventional cache energy saving schemes either aim at saving dynamic energy or are based on properties specific to first-level caches, and thus have limited utility for last-level caches. Several other techniques require offline profiling or per-application tuning and hence are not suitable for production systems. In this research, we propose novel cache leakage energy saving schemes for single-core and multicore systems, covering desktop, QoS, real-time, and server systems. We propose software-controlled, hardware-assisted techniques that use dynamic cache reconfiguration to configure the cache to the most energy-efficient configuration while keeping the performance loss bounded. To profile and test a large number of potential configurations, we utilize low-overhead microarchitectural components that can be easily integrated into modern processor chips. We adopt a system-wide approach to saving energy to ensure that cache reconfiguration does not increase the energy consumption of other components of the processor. We have compared our techniques with state-of-the-art techniques and found that ours outperform them in energy efficiency. This research has important applications in improving the energy efficiency of higher-end embedded, desktop, and server processors and of multitasking systems. We have also proposed a performance estimation approach for efficient design space exploration and have implemented a time-sampling based simulation acceleration approach for full-system architectural simulators.
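
    A minimal sketch of the selection step in software-controlled dynamic cache reconfiguration appears below: given profiled estimates for a handful of candidate configurations, pick the lowest-energy one whose slowdown stays within a bound. The candidate way/set counts, the timing and energy numbers, and the 5% loss bound are illustrative assumptions, not the thesis's profiling data.

```python
# Choose the most energy-efficient cache configuration with bounded slowdown.

# (ways, sets) -> (estimated execution time in ms, estimated energy in mJ)
profiles = {
    (16, 4096): (100.0, 220.0),   # full LLC: fastest, highest leakage
    (8,  4096): (103.0, 150.0),
    (4,  4096): (109.0, 115.0),
    (2,  4096): (131.0, 105.0),
}

MAX_PERF_LOSS = 0.05  # keep slowdown within 5% of the full configuration
baseline_time = profiles[(16, 4096)][0]

def pick_configuration(profiles, baseline_time, max_loss):
    """Return the lowest-energy configuration whose slowdown is bounded."""
    feasible = [
        (energy, cfg)
        for cfg, (time, energy) in profiles.items()
        if (time - baseline_time) / baseline_time <= max_loss
    ]
    return min(feasible)[1]

best = pick_configuration(profiles, baseline_time, MAX_PERF_LOSS)
print("selected (ways, sets):", best)
```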

    The 1992 4th NASA SERC Symposium on VLSI Design

    Papers from the fourth annual NASA Symposium on VLSI Design, co-sponsored by the IEEE, are presented. Each year this symposium is organized by the NASA Space Engineering Research Center (SERC) at the University of Idaho and is held in conjunction with a quarterly meeting of the NASA Data System Technology Working Group (DSTWG). One task of the DSTWG is to develop new electronic technologies that will meet next-generation electronic data system needs. The symposium provides insights into developments in VLSI and digital systems that can be used to increase data system performance. The NASA SERC is proud to offer, at its fourth symposium on VLSI design, presentations by an outstanding set of individuals from national laboratories, the electronics industry, and universities. These speakers share insights into next-generation advances that will serve as a basis for future VLSI design.

    Design Space Exploration and Resource Management of Multi/Many-Core Systems

    The increasing demand for processing a higher number of applications and related data on computing platforms has resulted in reliance on multi-/many-core chips, as they facilitate parallel processing. However, these platforms need to be energy-efficient and reliable, and they must perform secure computations in the interest of the whole community. This book provides perspectives on the aforementioned aspects from leading researchers, in terms of state-of-the-art contributions and upcoming trends.

    Modular Architectures And Optimization Techniques For Power And Reliability In Future Many Core Microprocessors

    Power and reliability issues are expected to increase in future multicore systems with a higher degree of component integration. As the feature sizes of transistors continue to shrink, more resources can be incorporated in microprocessors to address a broader spectrum of application requirements. However, power constraints will limit the amount of resources that can be powered on at any given time. Recent studies have shown that multicore systems will be able to power on less than 80% of their transistors in the near future, and less than 50% in the long term. The most difficult challenge is deciding which transistors should be powered on at any given time to deliver high performance under strict power constraints. At the same time, device reliability issues - the proliferation of devices that are either defective at manufacturing time or fail in the field with usage - are projected to be exacerbated by the continued scaling of device sizes. We present a modular, dynamically reconfigurable architecture as a promising unified solution to the problems of dark silicon (the inability to power all available computing resources) and reliability. Our modular architecture implements deconfigurable lanes within the decoupled sections of a superscalar pipeline that can be easily powered on or off to isolate faults or to create an energy-efficient hardware configuration tailored to the needs of the running software. At the system level, we propose a novel framework that uses surrogate response surfaces and heuristic global optimization algorithms to characterize the behavior of applications at runtime and dynamically redistribute the available chip-wide power, obtaining hardware configurations customized to the software diversity and system goals. Our reconfigurable architecture is able to provide high performance under a strict power budget, maintain a given performance level at a reduced power cost, and, in the case of hard faults, restore the system's performance to pre-fault levels.
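
    The sketch below illustrates the run-time idea in the final part of this abstract: per-core surrogate response surfaces (here simple quadratics relating power to performance) combined with a heuristic global search that redistributes a fixed chip-wide power budget. The surface coefficients, the 60 W budget, and the random-restart search are assumptions for illustration, not the paper's framework.

```python
# Redistribute a chip-wide power budget across cores using per-core
# surrogate response surfaces and a random-restart heuristic search.
import numpy as np

rng = np.random.default_rng(3)

# Per-core surrogate: performance(p) = a*p - b*p^2 (diminishing returns).
a = np.array([4.0, 3.0, 5.0, 2.5])
b = np.array([0.05, 0.03, 0.08, 0.02])
BUDGET = 60.0  # total watts available to the four cores (assumed)

def throughput(p):
    return np.sum(a * p - b * p**2)

def random_allocation():
    w = rng.random(4)
    return BUDGET * w / w.sum()

# Heuristic global search: random restarts plus local perturbation.
best_p, best_t = None, -np.inf
for _ in range(200):
    p = random_allocation()
    for _ in range(100):
        q = np.clip(p + rng.normal(0, 1.0, 4), 0, None)
        q = BUDGET * q / q.sum()      # project back onto the power budget
        if throughput(q) > throughput(p):
            p = q
    if throughput(p) > best_t:
        best_p, best_t = p, throughput(p)

print("per-core watts:", np.round(best_p, 1), " throughput:", round(best_t, 1))
```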

    Performance Models for Electronic Structure Methods on Modern Computer Architectures

    Electronic structure codes are computationally intensive scientific applications used to probe and elucidate chemical processes at the atomic level. Maximizing the performance of these applications on any given hardware platform is vital in order to facilitate larger and more accurate computations. An important part of this endeavor is the development of protocols for measuring performance, and of models that describe that performance as a function of system architecture. This thesis makes contributions in both areas, with a focus on shared-memory parallel computer architectures and the Gaussian electronic structure code. Shared-memory parallel computer systems are increasingly important as hardware manufacturers are unable to extract performance improvements by increasing clock frequencies; instead the emphasis is on using multi-core processors to provide higher performance. These processor chips generally have complex cache hierarchies and may be coupled together in multi-socket systems that exhibit highly non-uniform memory access (NUMA) characteristics. This work seeks to understand how cache characteristics and memory/thread placement affect the performance of electronic structure codes, and to develop performance models that can be used to describe and predict code performance by accounting for these effects. A protocol for performing memory and thread placement experiments on NUMA systems is presented, and its implementation under both the Solaris and Linux operating systems is discussed. A placement distribution model is proposed and subsequently used both to guide memory/thread placement experiments and to aid in the analysis of the results obtained from them. To describe single-threaded performance as a function of cache blocking, a simple linear performance model is investigated for use when computing the electron repulsion integrals that lie at the heart of virtually all electronic structure methods. A parametric cache variation study is performed by combining parameters obtained for the linear performance model on existing hardware with instruction and cache miss counts obtained by simulation, and predictions are made of performance as a function of cache architecture. Extension of the linear performance model to describe multi-threaded performance on complex NUMA architectures is discussed and investigated experimentally, and the use of dynamic page migration to improve locality is also considered. Finally, the use of large-scale electronic structure calculations is demonstrated in a series of calculations aiming to study the charge distribution of a single positive ion solvated within a shell of water molecules of increasing size.
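
    As a toy version of the linear performance model discussed here, the sketch below models execution time as a fixed cost per instruction plus a fixed penalty per cache miss and recovers the coefficients by least squares from synthetic runs; the counts, timings, and two-term model are assumptions, not measurements of the Gaussian code.

```python
# Fit a simple linear performance model: runtime ~ c1*instructions + c2*misses.
import numpy as np

rng = np.random.default_rng(4)
n_runs = 12

instructions = rng.uniform(1e9, 5e9, n_runs)   # retired instructions
cache_misses = rng.uniform(1e6, 8e7, n_runs)   # last-level cache misses

# Synthetic "measured" runtimes: 0.4 ns/instruction + 120 ns/miss + noise.
runtime = (0.4e-9 * instructions + 120e-9 * cache_misses
           + rng.normal(0, 0.01, n_runs))

X = np.column_stack([instructions, cache_misses])
coef, *_ = np.linalg.lstsq(X, runtime, rcond=None)
print(f"ns/instruction={coef[0]*1e9:.2f}  ns/miss={coef[1]*1e9:.1f}")
```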

    Improving the Accuracy of Cost Calculation for Database Query Optimization by Introducing a Performance Model That Accounts for CPU Architecture

    Non-volatile memory is being applied not only to storage subsystems but also to the main memory of computers to improve performance and increase capacity. In the near future, some in-memory database systems will use non-volatile main memory as a durable medium instead of existing storage devices such as hard disk drives or solid-state drives. In addition, cloud computing is gaining more attention, and users increasingly demand performance improvements; in particular, the Database-as-a-Service (DBaaS) market is rapidly expanding. Attempts to improve database performance have led to the development of in-memory databases that use non-volatile memory as the durable database medium rather than existing storage devices. For such in-memory database systems, the cost of Input/Output (I/O) processing is replaced by the lower cost of memory access, so the Central Processing Unit (CPU) cost becomes relatively more important when selecting the most suitable access path for a database query. Therefore, a high-precision cost calculation method for query execution is required; in particular, when the database system cannot select the most appropriate join method, the query execution time increases. Moreover, in a cloud computing environment the CPU architectures of different physical servers may belong to different generations, so the cost model must also be applicable to CPUs of different generations with only minor modification, in order not to increase the database administrator's workload. To improve the accuracy of the cost calculation, a cost calculation method based on the CPU architecture that uses statistical information measured by the performance monitor embedded in the CPU (hereinafter called the measurement-based cost calculation method) is proposed, and the accuracy of estimating the intersection (hereinafter called the cross point) of the cost calculation formulas for different join methods is evaluated. The method concentrates on the instruction-issue part of the instruction pipeline inside the CPU. The cost of database search processing is classified into three types (data cache access, instruction cache miss penalty, and branch misprediction penalty), and a cost calculation formula is constructed for each. Each formula models the relationship between the statistics measured by the CPU's performance monitor, such as the number of executed instructions and the number of cache hits, and the selectivity of the table while executing joins. In addition, the cost calculation formulas are decomposed into parts corresponding to elements that appear repeatedly in the access path of the join, and the cost for an arbitrary number of joined tables is calculated by combining these parts. First, to investigate the feasibility of the proposed method, a cost formula for a two-table join was constructed using a large database, the 100 GB TPC Benchmark(TM) H database. The accuracy of the cost calculation was evaluated by comparing the measured cross point with the estimated cross point. The results indicated that the difference between the predicted and measured cross points was less than 0.1% selectivity and was reduced by 71% to 94% compared with the difference between the cross point obtained by the conventional method and the measured cross point. The proposed cost calculation method can therefore improve the accuracy of join cost calculation.
    Then, to reduce the database administrator's operating time, the cost calculation formula was constructed under the condition that the database used for measuring the statistical values was reduced to a small scale (5 GB), and the accuracy of the cost calculation was also evaluated when joining three or more tables. As a result, the difference between the predicted and measured cross points was reduced by 74% to 95% compared with the difference between the cross point obtained by the conventional method and the measured cross point, showing that the proposed method improves the accuracy of the cost calculation. Finally, a method is also proposed for updating the cost calculation formulas using the measurement-based cost calculation method to support a CPU of another architectural generation without re-measuring that CPU's statistical information. The approach focuses on reflecting architectural changes, such as cache size and associativity, memory latency, and branch misprediction penalty, in the components of the cost calculation formulas. The updated cost calculation formulas estimated join costs accurately on CPUs of a different generation in 66% of the test cases. In conclusion, an in-memory database system using the proposed cost calculation method can select the best join method and can be applied to database systems with CPUs from different generations. (Tokyo Metropolitan University, 2019-03-25, Doctor of Engineering)
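
    The sketch below illustrates the cross-point idea at the center of this thesis: model the estimated cost of two join methods as functions of selectivity and locate the selectivity where the formulas intersect, which is where the optimizer should switch methods. The linear cost formulas and their coefficients are illustrative assumptions, not the measurement-based formulas described above (which combine data-cache, instruction-cache-miss, and branch-misprediction terms).

```python
# Find the selectivity at which two join-cost formulas cross (the cross point).

def cost_index_nested_loop(selectivity):
    # Cheap at low selectivity, grows quickly with the number of matches.
    return 5.0 + 4000.0 * selectivity

def cost_hash_join(selectivity):
    # Pays a fixed build cost, then grows slowly.
    return 120.0 + 300.0 * selectivity

def cross_point(cost_a, cost_b, lo=0.0, hi=1.0, iters=60):
    """Bisection on the cost difference to find where the two formulas meet."""
    f = lambda s: cost_a(s) - cost_b(s)
    if f(lo) * f(hi) > 0:
        return None  # no crossing in range
    for _ in range(iters):
        mid = (lo + hi) / 2
        if f(lo) * f(mid) <= 0:
            hi = mid
        else:
            lo = mid
    return (lo + hi) / 2

s = cross_point(cost_index_nested_loop, cost_hash_join)
print(f"switch join method near selectivity {s:.4%}")
```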