SysScale: Exploiting Multi-domain Dynamic Voltage and Frequency Scaling for Energy Efficient Mobile Processors
There are three domains in a modern thermally-constrained mobile
system-on-chip (SoC): compute, IO, and memory. We observe that a modern SoC
typically allocates a fixed power budget, corresponding to worst-case
performance demands, to the IO and memory domains even if they are
underutilized. The resulting unfair allocation of the power budget across
domains can cause two major issues: 1) the IO and memory domains can operate at
a higher frequency and voltage than necessary, increasing power consumption, and
2) the unused power budget of the IO and memory domains cannot be used to
increase the throughput of the compute domain, hampering performance. To avoid
these issues, it is crucial to dynamically orchestrate the distribution of the
SoC power budget across the three domains based on their actual performance
demands.
We propose SysScale, a new multi-domain power management technique to improve
the energy efficiency of mobile SoCs. SysScale is based on three key ideas.
First, SysScale introduces an accurate algorithm to predict the performance
(e.g., bandwidth and latency) demands of the three SoC domains. Second,
SysScale uses a new DVFS (dynamic voltage and frequency scaling) mechanism to
distribute the SoC power to each domain according to the predicted performance
demands. Third, in addition to using a global DVFS mechanism, SysScale uses
domain-specialized techniques to optimize the energy efficiency of each domain
at different operating points.
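The abstract does not spell out SysScale's predictor or its DVFS interface, but the budget-redistribution idea can be sketched abstractly. Below is a minimal, hypothetical Python sketch (the domain names, demand scores, and proportional policy are assumptions for illustration, not the paper's algorithm) of granting each domain a functional floor and splitting the remaining SoC budget by predicted demand:

```python
# Hypothetical sketch of multi-domain power-budget redistribution.
# All names and the proportional policy are assumptions; the abstract
# does not specify SysScale's actual predictor or DVFS mechanism.
from dataclasses import dataclass

@dataclass
class Domain:
    name: str
    min_power_w: float   # floor the domain needs to stay functional
    demand: float        # predicted utilization in [0, 1]

def redistribute(domains, total_budget_w):
    """Grant each domain its floor, then split the remaining budget in
    proportion to predicted demand, so idle IO/memory domains give power
    back and a busy compute domain absorbs it."""
    floor = sum(d.min_power_w for d in domains)
    slack = max(total_budget_w - floor, 0.0)
    total_demand = sum(d.demand for d in domains) or 1.0
    return {d.name: d.min_power_w + slack * d.demand / total_demand
            for d in domains}

budget = redistribute(
    [Domain("compute", 3.0, 0.9),
     Domain("io",      1.0, 0.1),
     Domain("memory",  2.0, 0.3)],
    total_budget_w=15.0)
print(budget)   # compute receives most of the slack when IO/memory are idle
```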
We implement SysScale on an Intel Skylake microprocessor for mobile devices
and evaluate it using a wide variety of SPEC CPU2006, graphics (3DMark), and
battery life workloads (e.g., video playback). On a 2-core Skylake, SysScale
improves the performance of SPEC CPU2006 and 3DMark workloads by up to 16% and
8.9% (9.2% and 7.9% on average), respectively.
Comment: To appear at ISCA 2020.
A Framework for Accelerating Bottlenecks in GPU Execution with Assist Warps
Modern Graphics Processing Units (GPUs) are well provisioned to support the
concurrent execution of thousands of threads. Unfortunately, different
bottlenecks during execution and heterogeneous application requirements create
imbalances in utilization of resources in the cores. For example, when a GPU is
bottlenecked by the available off-chip memory bandwidth, its computational
resources are often overwhelmingly idle, waiting for data from memory to
arrive.
This work describes the Core-Assisted Bottleneck Acceleration (CABA)
framework that employs idle on-chip resources to alleviate different
bottlenecks in GPU execution. CABA provides flexible mechanisms to
automatically generate "assist warps" that execute on GPU cores to perform
specific tasks that can improve GPU performance and efficiency.
CABA enables the use of idle computational units and pipelines to alleviate
the memory bandwidth bottleneck, e.g., by using assist warps to perform data
compression to transfer less data from memory. Conversely, the same framework
can be employed to handle cases where the GPU is bottlenecked by the available
computational units, in which case the memory pipelines are idle and can be
used by CABA to speed up computation, e.g., by performing memoization using
assist warps.
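The abstract names data compression as one assist-warp task but not the specific algorithm. As a hedged illustration, the sketch below implements a simple base+delta encoding of a cache line, the flavor of lightweight compression such a warp could run before a memory transfer (the function name and parameters are invented for the example):

```python
# Sketch of base+delta cache-line compression of the kind an assist
# warp might perform before a transfer. Purely illustrative; the
# abstract does not name CABA's exact compression algorithm.

def compress_base_delta(words, delta_bytes=1):
    """Try to encode a cache line of 4-byte words as one base word plus
    narrow signed deltas. Returns None when the line is incompressible."""
    base = words[0]
    limit = 1 << (8 * delta_bytes - 1)
    deltas = [w - base for w in words]
    if all(-limit <= d < limit for d in deltas):
        return base, deltas          # 4 + len(words) * delta_bytes bytes
    return None                      # fall back to the uncompressed line

line = [0x1000, 0x1004, 0x1008, 0x100C]   # e.g. an array of nearby pointers
packed = compress_base_delta(line)
if packed:
    base, deltas = packed
    print(f"base={base:#x} deltas={deltas}")  # 8 bytes instead of 16
```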
We provide a comprehensive design and evaluation of CABA to perform effective
and flexible data compression in the GPU memory hierarchy to alleviate the
memory bandwidth bottleneck. Our extensive evaluations show that CABA, when
used to implement data compression, provides an average performance improvement
of 41.7% (as high as 2.6X) across a variety of memory-bandwidth-sensitive GPGPU
applications.
Scalable Interactive Volume Rendering Using Off-the-shelf Components
This paper describes an application of a second-generation implementation of the Sepia architecture (Sepia-2) to interactive volumetric visualization of large rectilinear scalar fields. By employing pipelined associative blending operators in a sort-last configuration, a demonstration system with 8 rendering computers sustains 24 to 28 frames per second while interactively rendering large data volumes (1024x256x256 voxels, and 512x512x512 voxels). We believe interactive performance at these frame rates and data sizes is unprecedented. We also believe these results can be extended to other types of structured and unstructured grids and a variety of GL rendering techniques including surface rendering and shadow mapping. We show how to extend our single-stage crossbar demonstration system to multi-stage networks in order to support much larger data sizes and higher image resolutions. This requires solving a dynamic mapping problem for a class of blending operators that includes Porter-Duff compositing operators.
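The key property the paper relies on, that Porter-Duff blending is associative so partial images can be composited in any pipelined grouping, is easy to demonstrate. The following Python sketch (the tuple layout and names are illustrative, not Sepia-2's interface) checks associativity of the "over" operator on premultiplied RGBA:

```python
# The Porter-Duff "over" operator on premultiplied RGBA tuples.
# Associativity is what lets a sort-last pipeline blend partial images
# stage by stage, in any grouping. Names here are illustrative only.

def over(front, back):
    """front OVER back, both premultiplied (r, g, b, a)."""
    fr, fg, fb, fa = front
    br, bg, bb, ba = back
    k = 1.0 - fa
    return (fr + k * br, fg + k * bg, fb + k * bb, fa + k * ba)

a, b, c = (0.5, 0, 0, 0.5), (0, 0.3, 0, 0.3), (0, 0, 0.8, 0.8)
left  = over(over(a, b), c)   # ((a over b) over c)
right = over(a, over(b, c))   # (a over (b over c))
assert all(abs(x - y) < 1e-9 for x, y in zip(left, right))
```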
Polystore++: Accelerated Polystore System for Heterogeneous Workloads
Modern real-time business analytics consist of heterogeneous workloads (e.g.,
database queries, graph processing, and machine learning). These analytic
applications need programming environments that can capture all aspects of the
constituent workloads (including data models they work on and movement of data
across processing engines). Polystore systems suit such applications; however,
these systems currently execute on CPUs and the slowdown of Moore's Law means
they cannot meet the performance and efficiency requirements of modern
workloads. We envision Polystore++, an architecture to accelerate existing
polystore systems using hardware accelerators (e.g., FPGAs, CGRAs, and GPUs).
Polystore++ systems can achieve high performance at low power by identifying
and offloading components of a polystore system that are amenable to
acceleration using specialized hardware. Building a Polystore++ system is
challenging and introduces new research problems motivated by the use of
hardware accelerators (e.g., optimizing and mapping query plans across
heterogeneous computing units and exploiting hardware pipelining and
parallelism to improve performance). In this paper, we discuss these challenges
in detail and list possible approaches to address these problems.
Comment: 11 pages, Accepted in ICDCS 2019.
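To make the offloading challenge concrete, here is a toy Python sketch of cost-based operator placement across heterogeneous engines. Every cost number and operator name is invented for the example, and a real Polystore++ planner would also have to model data movement and pipelining between engines:

```python
# Illustrative cost-based placement of query-plan operators onto
# heterogeneous engines, the kind of decision a Polystore++ planner
# would face. Costs and operator names are made up for this sketch.

# hypothetical per-operator cost (ms) on each engine, assumed to
# already include any data-movement penalty to reach that engine
COST = {
    "scan":  {"cpu": 40,  "gpu": 25, "fpga": 15},
    "join":  {"cpu": 90,  "gpu": 30, "fpga": 70},
    "infer": {"cpu": 300, "gpu": 20, "fpga": 60},
}

def place(plan):
    """Greedy placement: each operator goes to its cheapest engine.
    A real planner must also account for transfers between consecutive
    operators placed on different engines, and for pipelining."""
    return {op: min(COST[op], key=COST[op].get) for op in plan}

print(place(["scan", "join", "infer"]))
# {'scan': 'fpga', 'join': 'gpu', 'infer': 'gpu'}
```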
Reconfigurable Hardware Accelerators: Opportunities, Trends, and Challenges
With the emergence of big data applications in machine learning, speech
recognition, artificial intelligence, and DNA sequencing in recent years,
computer architecture research communities are facing explosive growth in data
scale. To achieve high efficiency for data-intensive computing, research on
heterogeneous accelerators targeting these emerging applications has become a
central topic in computer architecture. At present, heterogeneous accelerators
are mainly implemented with heterogeneous computing units such as
Application-Specific Integrated Circuits (ASICs), Graphics Processing Units
(GPUs), and Field Programmable Gate Arrays (FPGAs). Among these typical
heterogeneous architectures, FPGA-based reconfigurable accelerators have two
merits. First, an FPGA contains a large number of reconfigurable circuits,
which can satisfy the high-performance and low-power requirements of specific
applications. Second, FPGA-based reconfigurable architectures enable rapid
prototyping and offer excellent customizability and reconfigurability.
A growing body of acceleration work based on FPGAs and other reconfigurable
architectures now appears in top-tier computer architecture conferences. To
review this recent work, this survey takes the latest research on
reconfigurable accelerator architectures and their algorithmic applications as
its basis. We compare active research issues and application domains, and we
analyze the advantages, disadvantages, and challenges of reconfigurable
accelerators. Finally, we project the future development of accelerator
architectures, hoping to provide a reference for computer architecture
researchers.
PyCUDA and PyOpenCL: A Scripting-Based Approach to GPU Run-Time Code Generation
High-performance computing has recently seen a surge of interest in
heterogeneous systems, with an emphasis on modern Graphics Processing Units
(GPUs). These devices offer tremendous potential for performance and efficiency
in important large-scale applications of computational science. However,
exploiting this potential can be challenging, as one must adapt to the
specialized and rapidly evolving computing environment currently exhibited by
GPUs. One way of addressing this challenge is to embrace better techniques and
develop tools tailored to their needs. This article presents one simple
technique, GPU run-time code generation (RTCG), along with PyCUDA and PyOpenCL,
two open-source toolkits that support this technique.
In introducing PyCUDA and PyOpenCL, this article proposes the combination of
a dynamic, high-level scripting language with the massive performance of a GPU
as a compelling two-tiered computing platform, potentially offering significant
performance and productivity advantages over conventional single-tier, static
systems. The concept of RTCG is simple and easily implemented using existing,
robust infrastructure. Nonetheless it is powerful enough to support (and
encourage) the creation of custom application-specific tools by its users. The
premise of the paper is illustrated by a wide range of examples where the
technique has been applied with considerable success.
Comment: Submitted to Parallel Computing, Elsevier.
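The RTCG idea is easy to see in PyCUDA itself: the CUDA C source is an ordinary Python string, so it can be specialized at run time before compilation. Below is a small, self-contained example using PyCUDA's documented API (the kernel and the baked-in scale constant are illustrative, not from the paper):

```python
# Run-time code generation with PyCUDA: the kernel source is built as
# a Python string, so parameters can be specialized before compilation.
import numpy as np
import pycuda.autoinit                  # create a context on the default GPU
import pycuda.driver as drv
from pycuda.compiler import SourceModule

scale = 2.5                             # value baked into the kernel at run time
mod = SourceModule(f"""
__global__ void scale_vec(float *dst, const float *src)
{{
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    dst[i] = {scale}f * src[i];
}}
""")
scale_vec = mod.get_function("scale_vec")

src = np.random.randn(256).astype(np.float32)
dst = np.empty_like(src)
scale_vec(drv.Out(dst), drv.In(src), block=(256, 1, 1), grid=(1, 1))
assert np.allclose(dst, scale * src)
```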
Radiation therapy calculations using an on-demand virtual cluster via cloud computing
Computer hardware costs are the limiting factor in producing highly accurate
radiation dose calculations on convenient time scales. Because of this,
large-scale, full Monte Carlo simulations and other resource intensive
algorithms are often considered infeasible for clinical settings. The emerging
cloud computing paradigm promises to fundamentally alter the economics of such
calculations by providing relatively cheap, on-demand, pay-as-you-go computing
resources over the Internet. We believe that cloud computing will usher in a
new era, in which very large scale calculations will be routinely performed by
clinics and researchers using cloud-based resources. In this research, several
proof-of-concept radiation therapy calculations were successfully performed on
a cloud-based virtual Monte Carlo cluster. Performance evaluations were made of
a distributed processing framework developed specifically for this project. The
expected 1/n performance was observed with some caveats. The economics of
cloud-based virtual computing clusters versus traditional in-house hardware is
also discussed. For most situations, cloud computing can provide a substantial
cost savings for distributed calculations.
Comment: 12 pages, 4 figures.
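The expected 1/n scaling follows because Monte Carlo histories are independent: a fixed workload can be split evenly across n workers and the partial tallies summed, so wall-clock time drops roughly as 1/n until per-node overheads (the paper's "caveats") dominate. A toy Python sketch of this decomposition, where the "dose" kernel is a stand-in rather than the project's actual framework:

```python
# Toy 1/n decomposition of independent Monte Carlo histories across
# worker processes. The tally below is a stand-in for a real dose
# calculation; only the splitting-and-summing structure is the point.
import random
from multiprocessing import Pool

def simulate_histories(n_histories):
    """Stand-in 'dose' tally: sum of random deposition events."""
    rng = random.Random()
    return sum(rng.random() for _ in range(n_histories))

def run_cluster(total_histories, n_workers):
    per_worker = total_histories // n_workers
    with Pool(n_workers) as pool:
        partial = pool.map(simulate_histories, [per_worker] * n_workers)
    return sum(partial)   # tallies from independent histories just add

if __name__ == "__main__":
    dose = run_cluster(total_histories=1_000_000, n_workers=4)
    print(f"combined tally: {dose:.1f}")
```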
An Intermediate Language and Estimator for Automated Design Space Exploration on FPGAs
We present the TyTra-IR, a new intermediate language intended as a
compilation target for high-level language compilers and a front-end for HDL
code generators. We develop the requirements of this new language based on the
design-space of FPGAs that it should be able to express and the
estimation-space in which each configuration from the design-space should be
mappable in an automated design flow. We use a simple kernel to illustrate
multiple configurations using the semantics of TyTra-IR. The key novelty of
this work is the cost model for resource-costs and throughput for different
configurations of interest for a particular kernel. Through the realistic
example of a Successive Over-Relaxation kernel implemented both in TyTra-IR and
HDL, we demonstrate both the expressiveness of the IR and the accuracy of our
cost model.
Comment: Pre-print and extended version of a poster paper accepted at the International Symposium on Highly Efficient Accelerators and Reconfigurable Technologies (HEART2015), Boston, MA, USA, June 1-2, 2015.
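While the actual TyTra-IR cost model is defined in the paper, its general shape can be sketched: sum per-operation resource costs for a candidate configuration and bound throughput by the slowest pipeline stage. A toy Python illustration with invented cost numbers and operator names:

```python
# Toy resource-cost and throughput estimate in the spirit of the
# paper's cost model. All numbers and names are invented; the actual
# TyTra-IR model is defined in the paper itself.

# hypothetical FPGA costs per operator instance:
COSTS = {            # (LUTs, DSPs, cycles per result)
    "mul_f32": (150, 2, 1),
    "add_f32": (250, 0, 1),
    "mem_rd":  (80,  0, 2),
}

def estimate(config, clock_mhz=200):
    """config maps operator name -> instance count."""
    luts = sum(COSTS[op][0] * n for op, n in config.items())
    dsps = sum(COSTS[op][1] * n for op, n in config.items())
    # a pipelined kernel is limited by its slowest stage's initiation interval
    ii = max(COSTS[op][2] for op in config)
    results_per_sec = clock_mhz * 1e6 / ii
    return luts, dsps, results_per_sec

# e.g. one SOR-like stencil lane: 4 multiplies, 4 adds, 5 memory reads
print(estimate({"mul_f32": 4, "add_f32": 4, "mem_rd": 5}))
```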
Integration of advanced teleoperation technologies for control of space robots
Teleoperated robots require one or more humans to control actuators, mechanisms, and other robot equipment given feedback from onboard sensors. To accomplish this task, the human or humans require some form of control station. Desirable features of such a control station include operation by a single human, comfort, and natural human interfaces (visual, audio, motion, tactile, etc.). These interfaces should work to maximize performance of the human/robot system by streamlining the link between the human brain and the robot equipment. This paper describes the development of a control station testbed with the characteristics described above. Initially, this testbed will be used to control two teleoperated robots. Features of the robots include anthropomorphic mechanisms, slaving to the testbed, and delivery of sensory feedback to the testbed. The testbed will make use of technologies such as helmet-mounted displays, voice recognition, and exoskeleton masters. It will allow for integration and testing of emerging telepresence technologies along with techniques for coping with control-link time delays. Systems developed from this testbed could be applied to ground control of space-based robots. During man-tended operations, the Space Station Freedom may benefit from ground control of IVA or EVA robots for science or maintenance tasks. Planetary exploration may also find advanced teleoperation systems to be very useful.