
    A Power Cap Oriented Time Warp Architecture

    Controlling power usage has become a core objective in modern computing platforms. In this article we present an innovative Time Warp architecture designed to run parallel simulations efficiently under a power cap. Our architectural organization treats power usage as a foundational design principle, as opposed to classical power-unaware Time Warp designs. We provide early experimental results showing the potential of our proposal.

    Matching non-uniformity for program optimizations on heterogeneous many-core systems

    As computing enters an era of heterogeneity and massive parallelism, it exhibits a distinct feature: deepening non-uniform relations among the computing elements in both hardware and software. Beyond traditional non-uniform memory accesses, much deeper non-uniformity appears within processors, runtimes, and applications, exemplified by asymmetric cache sharing, memory coalescing, and thread divergence on multicore and many-core processors. Oblivious to this non-uniformity, current applications fail to tap the full potential of modern computing devices. My research presents a systematic exploration of this emerging property. It examines the existence of the property in modern computing, its influence on computing efficiency, and the challenges of establishing a non-uniformity-aware paradigm. I propose several techniques to translate the property into efficiency, including data reorganization to eliminate non-coalesced accesses, asynchronous data transformations for locality enhancement, and controllable scheduling to exploit non-uniformity among thread blocks. The experiments show the promise of these techniques in maximizing computing throughput, especially for programs with complex data access patterns.
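
    To make the coalescing point concrete, here is a minimal CUDA sketch, not taken from the dissertation, contrasting an array-of-structures layout, whose strided accesses defeat coalescing, with a structure-of-arrays layout, whose unit-stride accesses let each warp load collapse into a few wide memory transactions. Type and kernel names are illustrative.

    // Hypothetical illustration: the same per-element update under two layouts.
    // AoS: consecutive threads touch addresses sizeof(ParticleAoS) apart, so a
    // warp's loads cannot be coalesced into a few wide memory transactions.
    struct ParticleAoS { float x, y, z; };

    __global__ void scale_aos(ParticleAoS *p, float s, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) { p[i].x *= s; p[i].y *= s; p[i].z *= s; }  // strided
    }

    // SoA: consecutive threads read consecutive addresses, so each warp access
    // coalesces. Reorganizing AoS into SoA is one instance of the data
    // reorganization technique named in the abstract.
    struct ParticlesSoA { float *x, *y, *z; };

    __global__ void scale_soa(ParticlesSoA p, float s, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) { p.x[i] *= s; p.y[i] *= s; p.z[i] *= s; }  // unit stride
    }
    // Launch either as: kernel<<<(n + 255) / 256, 256>>>(...);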

    Time-Sharing Time Warp via Lightweight Operating System Support

    The order in which the different tasks are carried out within a Time Warp platform has a direct impact on performance, given that event processing is speculative and thus subject to roll-back. It is typically recognized that not-yet-executed events with lower timestamps should be given higher CPU-scheduling priority, since this helps keep the number of rollbacks low. However, common Time Warp platforms usually execute events as atomic actions, so control returns to the underlying simulation platform only at the end of the current event-processing routine. In other words, CPU-scheduling of events resembles classical batch-multitasking scheduling, which is known not to react promptly to variations in the priority of pending tasks (e.g. those associated with the injection of new events into the system). In this article we present the design and implementation of a time-sharing Time Warp platform, to be run on multi-core machines, in which the platform-level software takes back control on a periodic basis (with a fine-grained period) and can preempt any ongoing event-processing activity in favor of dispatching (along the same thread) any other event found to have higher priority. Our proposal is based on an ad-hoc kernel module for Linux, which implements a fine-grained timer-interrupt mechanism with lightweight management, is fully integrated with the modern top/bottom-half timer-interrupt Linux architecture, and induces no bias in relative CPU-usage planning across Time Warp vs non-Time Warp threads running on the machine. Our time-sharing architecture has been integrated within the open-source ROOT-Sim optimistic simulation package, and we report experimental data assessing our proposal.
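
    The mechanism above is a Linux kernel module providing lightweight, fine-grained timer interrupts with true preemption. As a rough user-space approximation of the idea only, the sketch below instead uses a POSIX interval timer whose handler sets a flag that the event loop polls between small slices of a handler; the 1 ms period and all names are illustrative, not the paper's implementation.

    #include <signal.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/time.h>

    static volatile sig_atomic_t quantum_expired = 0;

    static void on_tick(int sig) { (void)sig; quantum_expired = 1; }

    int main(void) {
        struct sigaction sa;
        memset(&sa, 0, sizeof sa);
        sa.sa_handler = on_tick;
        sigaction(SIGALRM, &sa, NULL);

        struct itimerval tv;
        tv.it_interval.tv_sec = 0;
        tv.it_interval.tv_usec = 1000;          /* 1 ms "time-sharing" period */
        tv.it_value = tv.it_interval;
        setitimer(ITIMER_REAL, &tv, NULL);

        long work_left = 50L * 1000 * 1000;     /* stands in for one event */
        while (--work_left > 0) {               /* small slices of the handler */
            if (quantum_expired) {
                quantum_expired = 0;
                /* A real platform would compare the in-flight event's
                   timestamp against the lowest pending one and preempt. */
                puts("tick: check for a lower-timestamp pending event");
            }
        }
        return 0;
    }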

    Dynamic Hardware Resource Management for Efficient Throughput Processing.

    High performance computing is evolving at a rapid pace, with throughput-oriented processors such as graphics processing units (GPUs) substituting for traditional processors as the computational workhorse. Their adoption has seen a tremendous increase as they provide high peak performance and energy efficiency while maintaining a friendly programming interface. Furthermore, many existing desktop, laptop, tablet, and smartphone systems support accelerating non-graphics, data-parallel workloads on their GPUs. However, the multitude of systems that use GPUs as accelerators run different genres of data-parallel applications with significantly contrasting runtime characteristics. GPUs use thousands of identical threads to efficiently exploit the on-chip hardware resources. Therefore, if one thread uses a resource (compute, bandwidth, data cache) more heavily, there will be significant contention for that resource. This contention eventually saturates the performance of the GPU, leaving other resources underutilized at the same time. Traditional policies for managing the massive hardware resources work adequately on well-designed, traditional scientific-style applications. However, these static policies, oblivious to the application's resource requirements, are not efficient for the large spectrum of data-parallel workloads with varying resource requirements. Therefore, several standard hardware policies, such as maximum concurrency, fixed operational frequency, and round-robin scheduling, are not efficient for modern GPU applications. This thesis defines dynamic hardware resource management mechanisms that improve the efficiency of the GPU by regulating the hardware resources at runtime. The first step in achieving this goal is to make the hardware aware of the application's characteristics at runtime through novel counters and indicators. After this detection, dynamic hardware modulation provides opportunities for increased performance, improved energy consumption, or both, leading to efficient execution. The key mechanisms for modulating the hardware at runtime are dynamic frequency regulation, managing the amount of concurrency, managing the order of execution among different threads, and increasing cache utilization. The resulting increased efficiency leads to improved energy consumption of systems that utilize GPUs while maintaining or improving their performance.
    PhD. Computer Science and Engineering. University of Michigan, Horace H. Rackham School of Graduate Studies. http://deepblue.lib.umich.edu/bitstream/2027.42/113356/1/asethia_1.pd
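
    One of the levers listed above, managing the amount of concurrency, can be sketched with the CUDA occupancy API. This is a hedged illustration rather than the thesis mechanism: the grid-stride kernel and the fixed 50% throttle stand in for a policy that would pick the concurrency level from runtime counters.

    #include <cstdio>

    // Grid-stride loop, so a deliberately smaller grid still covers all data.
    __global__ void saxpy(float a, const float *x, float *y, int n) {
        for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
             i += gridDim.x * blockDim.x)
            y[i] = a * x[i] + y[i];
    }

    int main() {
        int device = 0, numSMs = 0, maxBlocksPerSM = 0;
        cudaGetDevice(&device);
        cudaDeviceGetAttribute(&numSMs, cudaDevAttrMultiProcessorCount, device);

        const int threads = 256;
        cudaOccupancyMaxActiveBlocksPerMultiprocessor(&maxBlocksPerSM, saxpy,
                                                      threads, 0 /* dyn smem */);

        // Throttle to half the maximum concurrency to ease cache/bandwidth
        // contention; a dynamic policy would tune this fraction at runtime.
        int blocks = numSMs * maxBlocksPerSM / 2;
        std::printf("launching %d of %d concurrently schedulable blocks\n",
                    blocks, numSMs * maxBlocksPerSM);
        // ... cudaMalloc x and y, then: saxpy<<<blocks, threads>>>(2.0f, x, y, n);
        return 0;
    }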

    Parallelization and Optimization of Iterative Solvers on High Performance Architectures

    The main objective of this thesis is to develop an optimal sparse matrix storage format and implement efficient computing kernels that accelerate the execution of the sparse matrix-vector (SpMV) product on modern computer architectures. The SpMV product is an essential building block for a myriad of numerical application codes, especially iterative solvers and numerical simulators. Improving the performance of the SpMV product is of special interest to researchers because it is the major bottleneck in the codes that require it. Optimizing this product on modern computer architectures requires knowledge of parallel programming paradigms, efficient parallel algorithms, and a basic idea of the targeted device architecture.
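
    For reference, a minimal CUDA version of the computation at stake: SpMV over the common CSR (compressed sparse row) storage format, with the simplest one-thread-per-row mapping that optimized formats and kernels improve upon. This is a generic textbook kernel, not the thesis's format; array names follow the usual CSR convention.

    // y = A * x, with A stored in CSR as (row_ptr, col_idx, vals).
    __global__ void spmv_csr(int nrows, const int *row_ptr, const int *col_idx,
                             const double *vals, const double *x, double *y) {
        int row = blockIdx.x * blockDim.x + threadIdx.x;
        if (row < nrows) {
            double sum = 0.0;
            for (int j = row_ptr[row]; j < row_ptr[row + 1]; ++j)
                sum += vals[j] * x[col_idx[j]];   // irregular gather from x is
            y[row] = sum;                         // the usual bottleneck
        }
    }
    // Launch as: spmv_csr<<<(nrows + 255) / 256, 256>>>(...);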

    Datacenter Design for Future Cloud Radio Access Network.

    Cloud radio access network (C-RAN), an emerging cloud service that combines the traditional radio access network (RAN) with cloud computing technology, has been proposed as a solution to the growing energy consumption and cost of the traditional RAN. By aggregating baseband units (BBUs) in a centralized cloud datacenter, C-RAN reduces energy and cost, and improves wireless throughput and quality of service. However, designing a datacenter for C-RAN has not yet been studied. In this dissertation, I investigate how a datacenter for C-RAN BBUs should be built on commodity servers. I first design WiBench, an open-source benchmark suite containing the key signal-processing kernels of many mainstream wireless protocols, and study its characteristics. The characterization study shows that there is abundant data-level parallelism (DLP) and thread-level parallelism (TLP). Based on this result, I then develop high-performance software implementations of C-RAN BBU kernels in C++ and CUDA for both CPUs and GPUs. In addition, I generalize the GPU parallelization techniques of the Turbo decoder to the trellis algorithms, an important family of algorithms widely used in data compression and channel coding. I then evaluate the performance of commodity CPU servers and GPU servers. The study shows that a datacenter with GPU servers can meet the LTE standard throughput with 4× to 16× fewer machines than with CPU servers. A further energy and cost analysis shows that GPU servers save on average 13× more energy and 6× more cost. Thus, I propose that the C-RAN datacenter be built using GPUs as the server platform. Next I study resource management techniques to handle the temporal and spatial traffic imbalance in a C-RAN datacenter. I propose a “hill-climbing” power management scheme that combines powering off GPUs with DVFS to match the temporal C-RAN traffic pattern. Under a practical traffic model, this technique saves 40% of the BBU energy in a GPU-based C-RAN datacenter. For spatial traffic imbalance, I propose three workload distribution techniques to improve load balance and throughput. Among the three, pipelining packets yields the largest throughput improvement: 10% and 16% for balanced and unbalanced loads, respectively.
    PhD. Computer Science and Engineering. University of Michigan, Horace H. Rackham School of Graduate Studies. http://deepblue.lib.umich.edu/bitstream/2027.42/120825/1/qizheng_1.pd
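
    The shape of the hill-climbing policy can be shown with a toy host-side sketch. This is not the dissertation's implementation: the two-knob configuration (active GPUs, DVFS level), the linear capacity model, and all constants are invented for illustration; a real controller would measure throughput over traffic intervals.

    #include <stdbool.h>
    #include <stdio.h>

    typedef struct { int active_gpus; int freq_level; } Config;

    /* Toy stand-in for one measured interval: capacity must cover the load. */
    static bool meets_throughput(Config c, double load) {
        return c.active_gpus * (0.5 + 0.5 * c.freq_level) >= load;
    }

    static Config hill_climb(Config c, double load) {
        for (;;) {
            Config fewer  = { c.active_gpus - 1, c.freq_level };
            Config slower = { c.active_gpus, c.freq_level - 1 };
            if (fewer.active_gpus >= 1 && meets_throughput(fewer, load))
                c = fewer;                 /* powering off a GPU saves the most */
            else if (slower.freq_level >= 1 && meets_throughput(slower, load))
                c = slower;                /* then step down the DVFS ladder */
            else
                return c;                  /* local optimum for current load */
        }
    }

    int main(void) {
        Config peak = { 8, 4 };                 /* provisioned for peak traffic */
        Config night = hill_climb(peak, 2.0);   /* low overnight load */
        printf("night: %d GPU(s) at DVFS level %d\n",
               night.active_gpus, night.freq_level);
        return 0;
    }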

    Architecture-Performance Interrelationship Analysis in Single/Multiple CPU/GPU Computing Systems: Application to Composite Process Flow Modeling

    Current developments in computing have shown the advantage of using one or more Graphics Processing Units (GPUs) to boost the performance of many computationally intensive applications, but there are still limits to these GPU-enhanced systems. The major factors that limit GPUs for High Performance Computing (HPC) can be categorized as hardware- or software-oriented in nature. Understanding how these factors affect performance is essential to developing efficient and robust application codes that employ one or more GPU devices as powerful co-processors for HPC computational modeling. The present work analyzes the intrinsic interrelationship of both hardware and software categories and their effect on computational performance for single and multiple GPU-enhanced systems, using a computationally intensive application representative of a large portion of the challenges confronting modern HPC. The representative application uses unstructured finite element computations for transient composite resin infusion process flow modeling as its computational core; its characteristics and results reflect many other HPC applications via the sparse matrix system used for the solution of the linear system of equations. This work describes these various software and hardware factors and how they interact to affect the performance of computationally intensive applications, enabling more efficient development and porting of High Performance Computing applications, including current, legacy, and future large-scale computational modeling applications in various engineering and scientific disciplines.

    An architecture and technology for an Ambient Intelligence Node

    The era of separate networks is over. The existing technology leaders are preparing a big change in how the environment around us is recreated. This change has several faces: names like Ambient Intelligence, Ambient Network, IP Multimedia Subsystem, and others have been coined all over the globe. Regardless of which name is used, the new network will combine three main functional principles: it will be a context-aware, ubiquitous-access, intelligent-interface unified network. Within this thesis two major aspects are addressed. First, the Ambient Intelligence Environment concept is defined. Second, the architectural vectors for the technology are identified. A short overview of existing technology is followed by details of the chosen technology, FPGA. The overall specifications are incorporated in the design and demonstration of a basic Ambient Intelligence Node created in System-on-Chip (SoC) FPGA technology.