66 research outputs found

A Power Cap Oriented Time Warp Architecture

Author: Ciciani Bruno
Cingolani Davide
Conoci Stefano
Di Sanzo Pierangelo
Pellegrini Alessandro
Quaglia Francesco
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 01/01/2018

Controlling power usage has become a core objective in modern computing platforms. In this article we present an innovative Time Warp architecture oriented to efficiently run parallel simulations under a power cap. Our architectural organization considers power usage as a foundational design principle, as opposed to classical power-unaware Time Warp design. We provide early experimental results showing the potential of our proposal.

Crossref

Archivio della Ricerca - Università di Roma 3

ART

Archivio della ricerca - Università di Roma La Sapienza

Toward adaptive radiotherapy for head and neck patients: Uncertainties in dose warping due to the choice of deformable registration algorithm.

Author: Lourenço Ana Mónica
McClelland Jamie R
Modat Marc
Mouinuddin Syed
Ourselin Sébastien
Royle Gary
Van Herk Marcel
Veiga Catarina
Publication date: 01/01/2015

The aims of this work were to evaluate the performance of several deformable image registration (DIR) algorithms implemented in our in-house software (NiftyReg) and the uncertainties inherent to using different algorithms for dose warping.

UCL Discovery

The University of Manchester - Institutional Repository

King's Research Portal

Reducing off-chip memory accesses of wavefront parallel programs in Graphics Processing Units

Author: Ranasinghe Waruna
Publication venue: Colorado State University. Libraries
Publication date: 01/01/2014

2014 Fall. Includes bibliographical references. The power wall is one of the major barriers that stand in the way of exascale computing. To break the power wall, overall system power/energy must be reduced without affecting performance. We can decrease energy consumption by designing power-efficient hardware and/or software. In this thesis, we present a software approach to lower the energy consumption of programs targeted at Graphics Processing Units (GPUs). The main idea is to reduce energy consumption by minimizing the number of off-chip (global) memory accesses. Off-chip memory accesses can be minimized by increasing last-level (L2) cache hits. A wavefront is a set of data/tiles that can be processed concurrently. A kernel is a function that is executed on the GPU. We propose a novel approach to implementing wavefront parallel programs on GPUs. Instead of using one kernel call per wavefront, as in the traditional implementation, we use one kernel call for the whole program and organize the order of computations in such a way that L2 cache reuse is achieved. A strip of wavefronts (or a pass) is a collection of partial wavefronts. We exploit the non-preemptive behavior of the thread block scheduler to process a strip of wavefronts (i.e., a pass) instead of processing a complete wavefront at a time. The data transferred by a partial wavefront in a pass is small enough to fit in the L2 cache, so that successive partial wavefronts in the pass reuse the data in the L2 cache. Hence the number of off-chip memory accesses is significantly reduced. We also introduce a technique to communicate and synchronize between two thread blocks without limiting the number of thread blocks per kernel or SM. This technique is used to maintain the order of wavefronts. We have analytically shown and experimentally validated the reduction in off-chip memory accesses achieved by our approach. The off-chip memory reads and writes are decreased by factors of 45 and 3, respectively. 
We have shown that if GPUs incorporate an L2 cache with a write-back policy, then off-chip memory writes are also reduced by a factor of 45. Our approach provides 98% L2 cache read hits and 74% total cache hits, whereas the traditional approach reports only 2% and 1%, respectively.
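The two definitions in this abstract (a wavefront as a set of tiles that can be processed concurrently, and a pass as a strip of partial wavefronts) can be illustrated with a small host-side sketch. This is not the thesis's GPU implementation. It assumes, purely for illustration, a 2D tile grid in which each tile depends on its upper and left neighbours, so tiles on one anti-diagonal are independent. The function names `wavefronts` and `passes` and the parameter `strip_height` are hypothetical.

```python
def wavefronts(rows, cols):
    """Group tiles (i, j) by anti-diagonal i + j.

    Assumption: a tile depends only on its upper and left neighbours,
    so all tiles in one group can be processed concurrently.
    """
    return [
        [(i, d - i) for i in range(max(0, d - cols + 1), min(rows, d + 1))]
        for d in range(rows + cols - 1)
    ]


def passes(rows, cols, strip_height):
    """Split the grid into horizontal strips (passes).

    Each pass is a list of partial wavefronts that cover only
    `strip_height` rows, so each partial wavefront touches less data
    than a complete wavefront would.
    """
    result = []
    for top in range(0, rows, strip_height):
        height = min(strip_height, rows - top)
        # Reuse the wavefront enumeration on the strip, then shift rows.
        strip = [[(top + i, j) for (i, j) in wf] for wf in wavefronts(height, cols)]
        result.append(strip)
    return result


if __name__ == "__main__":
    for p, strip in enumerate(passes(4, 3, 2)):
        for wf in strip:
            print(f"pass {p}: {wf}")
```

Processing the passes in order, and the partial wavefronts within each pass in order, respects the assumed dependencies, because every tile's upper and left neighbours appear earlier in the sequence.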

Mountain Scholar (Digital Collections of Colorado and Wyoming)

Acceleration of Large-Scale Electronic Structure Simulations with Heterogeneous Parallel Computing

Author: Kwon Oh-Kyoung
Ryu Hoon
Publication venue: 'IntechOpen'
Publication date: 05/11/2018

Large-scale electronic structure simulations coupled to an empirical modeling approach are critical as they present a robust way to predict various quantum phenomena in realistically sized nanoscale structures that are hard to handle with density functional theory. For tight-binding (TB) simulations of electronic structures that normally involve multimillion atomic systems for a direct comparison to experimentally realizable nanoscale materials and devices, we show that graphical processing unit (GPU) devices help in saving computing costs in terms of time and energy consumption. With a short introduction of the major numerical method adopted for TB simulations of electronic structures, this work presents a detailed description of the strategies to drive performance enhancement with GPU devices against traditional clusters of multicore processors. While this work only uses TB electronic structure simulations for benchmark tests, it can also be utilized as a practical guideline to enhance performance of numerical operations that involve large-scale sparse matrices.

IntechOpen

An FPGA-based infrastructure for fine-grained DVFS analysis in high-performance embedded systems

Author: Carloni Luca P.
Cota Emilio G.
Di Guglielmo Giuseppe
Mantovani Paolo
Pilato Christian
Shepard Ken
Tien Kevin
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 01/01/2016

Emerging technologies provide SoCs with fine-grained DVFS capabilities both in space (number of domains) and time (transients in the order of tens of nanoseconds). Analyzing these systems requires cycle-accurate accounting of rapidly-changing dynamics and complex interactions among accelerators, interconnect, memory, and OS. We present an FPGA-based infrastructure that facilitates such analyses for high-performance embedded systems. We show how our infrastructure can be used to first generate SoCs with loosely-coupled accelerators, and then perform design-space exploration considering several DVFS policies under full-system workload scenarios, sweeping spatial and temporal domain granularity.

Archivio istituzionale della ricerca - Politecnico di Milano

Efficient and Scalable Computing for Resource-Constrained Cyber-Physical Systems: A Layered Approach

Author: Zou An
Publication venue: Washington University Open Scholarship
Publication date: 15/05/2021

With the evolution of computing and communication technology, cyber-physical systems such as self-driving cars, unmanned aerial vehicles, and mobile cognitive robots are achieving increasing levels of multifunctionality and miniaturization, enabling them to execute versatile tasks in a resource-constrained environment. Therefore, the computing systems that power these resource-constrained cyber-physical systems (RCCPSs) have to achieve high efficiency and scalability. First of all, given a fixed amount of onboard energy, these computing systems should not only be power-efficient but also exhibit sufficiently high performance to gracefully handle complex algorithms for learning-based perception and AI-driven decision-making. Meanwhile, scalability requires that the current computing system and its components can be extended both horizontally, with more resources, and vertically, with emerging advanced technology. To achieve efficient and scalable computing systems in RCCPSs, my research broadly investigates a set of techniques and solutions via a bottom-up layered approach. This layered approach leverages the characteristics of each system layer (e.g., the circuit, architecture, and operating system layers) and their interactions to discover and explore the optimal system tradeoffs among performance, efficiency, and scalability. At the circuit layer, we investigate the benefits of novel power delivery and management schemes enabled by integrated voltage regulators (IVRs). Then, between the circuit and microarchitecture/architecture layers, we present a voltage-stacked power delivery system that offers best-in-class power delivery efficiency for many-core systems. After this, using Graphics Processing Units (GPUs) as a case study, we develop a real-time resource scheduling framework at the architecture and operating system layers for heterogeneous computing platforms with guaranteed task deadlines. 
Finally, fast dynamic voltage and frequency scaling (DVFS) based power management across the circuit, architecture, and operating system layers is studied through a learning-based hierarchical power management strategy for multi-/many-core systems.

Washington University St. Louis: Open Scholarship

Datacenter Design for Future Cloud Radio Access Network.

Author: Zheng Qi
Publication date: 01/01/2015

Cloud radio access network (C-RAN), an emerging cloud service that combines the traditional radio access network (RAN) with cloud computing technology, has been proposed as a solution to handle the growing energy consumption and cost of the traditional RAN. Through aggregating baseband units (BBUs) in a centralized cloud datacenter, C-RAN reduces energy and cost, and improves wireless throughput and quality of service. However, designing a datacenter for C-RAN has not yet been studied. In this dissertation, I investigate how a datacenter for C-RAN BBUs should be built on commodity servers. I first design WiBench, an open-source benchmark suite containing the key signal processing kernels of many mainstream wireless protocols, and study its characteristics. The characterization study shows that there is abundant data level parallelism (DLP) and thread level parallelism (TLP). Based on this result, I then develop high performance software implementations of C-RAN BBU kernels in C++ and CUDA for both CPUs and GPUs. In addition, I generalize the GPU parallelization techniques of the Turbo decoder to the trellis algorithms, an important family of algorithms that are widely used in data compression and channel coding. Then I evaluate the performance of commodity CPU servers and GPU servers. The study shows that the datacenter with GPU servers can meet the LTE standard throughput with 4× to 16× fewer machines than with CPU servers. A further energy and cost analysis shows that GPU servers can save on average 13× more energy and 6× more cost. Thus, I propose the C-RAN datacenter be built using GPUs as a server platform. Next I study resource management techniques to handle the temporal and spatial traffic imbalance in a C-RAN datacenter. I propose a “hill-climbing” power management that combines powering-off GPUs and DVFS to match the temporal C-RAN traffic pattern. Under a practical traffic model, this technique saves 40% of the BBU energy in a GPU-based C-RAN datacenter. 
For spatial traffic imbalance, I propose three workload distribution techniques to improve load balance and throughput. Among the three techniques, pipelining packets yields the largest throughput improvement, at 10% and 16% for balanced and unbalanced loads, respectively. PhD, Computer Science and Engineering, University of Michigan, Horace H. Rackham School of Graduate Studies. http://deepblue.lib.umich.edu/bitstream/2027.42/120825/1/qizheng_1.pd

Deep Blue Documents at the University of Michigan

Adaptive resource provisioning mechanism in VEEs for improving performance of HLA-based simulations

Author: CAI Wentong
LI Xiaorong
LI Zengxiang
TA Nguyen Binh Duong
TURNER Stephen John
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 01/06/2015

Institutional Knowledge at Singapore Management University

Parallel Programming with Migratable Objects: Charm++ in Practice

Author: Acun Bilge
Gupta Abhishek
Jain Nikhil
Kale Laxmikant
Langer Akhil
Menon Harshitha
Mikida Eric
Ni Xiang
Robson Michael
Sun Yanhua
Totoni Ehsan
Wesolowski Lukasz
Publication venue: Smith ScholarWorks
Publication date: 16/01/2014

The advent of petascale computing has introduced new challenges (e.g., heterogeneity, system failure) for programming scalable parallel applications. Increased complexity and dynamism in science and engineering applications of today have further exacerbated the situation. Addressing these challenges requires more emphasis on concepts that were previously of secondary importance, including migratability, adaptivity, and runtime system introspection. In this paper, we leverage our experience with these concepts to demonstrate their applicability and efficacy for real world applications. Using the CHARM++ parallel programming framework, we present details on how these concepts can lead to development of applications that scale irrespective of the rough landscape of supercomputing technology. Empirical evaluation presented in this paper spans many miniapplications and real applications executed on modern supercomputers including Blue Gene/Q, Cray XE6, and Stampede.

Smith College: Smith ScholarWorks

Optimizing performance/watt of embedded SIMD multiprocessors through a priori application guided power scheduling

Author: Albright Ryan K.
Publication venue: 'Oregon State University'

A method for improving performance/watt of an embedded single-instruction multiple-data (SIMD) architecture using application-guided a priori scheduling of hardware resources is presented. A multi-core architectural simulator is adopted that accurately estimates power, performance, and utilization of various processor components (logic, interconnect and memory). A greedy search is then performed on each algorithm block of a signal processing chain in order to schedule each component's throughput and power. The proposed software-directed hardware rebalancing, applied to a typical electroencephalography (EEG) filtering chain, is analyzed for two different SIMD architectures. The first, representing a super-V_th processor, demonstrates a 51%-86% improvement in performance/watt at 1%-10% throughput reduction using block-level or algorithm-level a priori scheduling. The second architecture is Synctium, a near-V_th processor, which demonstrates a 50%-99% performance/watt improvement across the same throughput reduction range and optimization techniques.

ScholarsArchive@OSU