22,806 research outputs found
Mechanistic modeling of architectural vulnerability factor
Reliability to soft errors is a significant design challenge in modern microprocessors owing to an exponential increase in the number of transistors on chip and the reduction in operating voltages with each process generation. Architectural Vulnerability Factor (AVF) modeling using microarchitectural simulators enables architects to make informed performance, power, and reliability tradeoffs. However, such simulators are time-consuming and do not reveal the microarchitectural mechanisms that influence AVF. In this article, we present an accurate first-order mechanistic analytical model to compute AVF, developed using the first principles of an out-of-order superscalar execution. This model provides insight into the fundamental interactions between the workload and microarchitecture that together influence AVF. We use the model to perform design space exploration, parametric sweeps, and workload characterization for AVF
Predicting Scheduling Failures in the Cloud
Cloud Computing has emerged as a key technology to deliver and manage
computing, platform, and software services over the Internet. Task scheduling
algorithms play an important role in the efficiency of cloud computing services
as they aim to reduce the turnaround time of tasks and improve resource
utilization. Several task scheduling algorithms have been proposed in the
literature for cloud computing systems, the majority relying on the
computational complexity of tasks and the distribution of resources. However,
several tasks scheduled following these algorithms still fail because of
unforeseen changes in the cloud environments. In this paper, using tasks
execution and resource utilization data extracted from the execution traces of
real world applications at Google, we explore the possibility of predicting the
scheduling outcome of a task using statistical models. If we can successfully
predict tasks failures, we may be able to reduce the execution time of jobs by
rescheduling failed tasks earlier (i.e., before their actual failing time). Our
results show that statistical models can predict task failures with a precision
up to 97.4%, and a recall up to 96.2%. We simulate the potential benefits of
such predictions using the tool kit GloudSim and found that they can improve
the number of finished tasks by up to 40%. We also perform a case study using
the Hadoop framework of Amazon Elastic MapReduce (EMR) and the jobs of a gene
expression correlations analysis study from breast cancer research. We find
that when extending the scheduler of Hadoop with our predictive models, the
percentage of failed jobs can be reduced by up to 45%, with an overhead of less
than 5 minutes
Using Pilot Systems to Execute Many Task Workloads on Supercomputers
High performance computing systems have historically been designed to support
applications comprised of mostly monolithic, single-job workloads. Pilot
systems decouple workload specification, resource selection, and task execution
via job placeholders and late-binding. Pilot systems help to satisfy the
resource requirements of workloads comprised of multiple tasks. RADICAL-Pilot
(RP) is a modular and extensible Python-based pilot system. In this paper we
describe RP's design, architecture and implementation, and characterize its
performance. RP is capable of spawning more than 100 tasks/second and supports
the steady-state execution of up to 16K concurrent tasks. RP can be used
stand-alone, as well as integrated with other application-level tools as a
runtime system
DReAM: An approach to estimate per-Task DRAM energy in multicore systems
Accurate per-task energy estimation in multicore systems would allow performing per-task energy-aware task scheduling and energy-aware billing in data centers, among other applications. Per-task energy estimation is challenged by the interaction between tasks in shared resources, which impacts tasks’ energy consumption in uncontrolled ways. Some accurate mechanisms have been devised recently to estimate per-task energy consumed on-chip in multicores, but there is a lack of such mechanisms for DRAM memories. This article makes the case for accurate per-task DRAM energy metering in multicores, which opens new paths to energy/performance optimizations. In particular, the contributions of this article are (i) an ideal per-task energy metering model for DRAM memories; (ii) DReAM, an accurate yet low cost implementation of the ideal model (less than 5% accuracy error when 16 tasks share memory); and (iii) a comparison with standard methods (even distribution and access-count based) proving that DReAM is much more accurate than these other methods.Peer ReviewedPostprint (author's final draft
An Experimental Study of Reduced-Voltage Operation in Modern FPGAs for Neural Network Acceleration
We empirically evaluate an undervolting technique, i.e., underscaling the
circuit supply voltage below the nominal level, to improve the power-efficiency
of Convolutional Neural Network (CNN) accelerators mapped to Field Programmable
Gate Arrays (FPGAs). Undervolting below a safe voltage level can lead to timing
faults due to excessive circuit latency increase. We evaluate the
reliability-power trade-off for such accelerators. Specifically, we
experimentally study the reduced-voltage operation of multiple components of
real FPGAs, characterize the corresponding reliability behavior of CNN
accelerators, propose techniques to minimize the drawbacks of reduced-voltage
operation, and combine undervolting with architectural CNN optimization
techniques, i.e., quantization and pruning. We investigate the effect of
environmental temperature on the reliability-power trade-off of such
accelerators. We perform experiments on three identical samples of modern
Xilinx ZCU102 FPGA platforms with five state-of-the-art image classification
CNN benchmarks. This approach allows us to study the effects of our
undervolting technique for both software and hardware variability. We achieve
more than 3X power-efficiency (GOPs/W) gain via undervolting. 2.6X of this gain
is the result of eliminating the voltage guardband region, i.e., the safe
voltage region below the nominal level that is set by FPGA vendor to ensure
correct functionality in worst-case environmental and circuit conditions. 43%
of the power-efficiency gain is due to further undervolting below the
guardband, which comes at the cost of accuracy loss in the CNN accelerator. We
evaluate an effective frequency underscaling technique that prevents this
accuracy loss, and find that it reduces the power-efficiency gain from 43% to
25%.Comment: To appear at the DSN 2020 conferenc
Performance Characterization of Multi-threaded Graph Processing Applications on Intel Many-Integrated-Core Architecture
Intel Xeon Phi many-integrated-core (MIC) architectures usher in a new era of
terascale integration. Among emerging killer applications, parallel graph
processing has been a critical technique to analyze connected data. In this
paper, we empirically evaluate various computing platforms including an Intel
Xeon E5 CPU, a Nvidia Geforce GTX1070 GPU and an Xeon Phi 7210 processor
codenamed Knights Landing (KNL) in the domain of parallel graph processing. We
show that the KNL gains encouraging performance when processing graphs, so that
it can become a promising solution to accelerating multi-threaded graph
applications. We further characterize the impact of KNL architectural
enhancements on the performance of a state-of-the art graph framework.We have
four key observations: 1 Different graph applications require distinctive
numbers of threads to reach the peak performance. For the same application,
various datasets need even different numbers of threads to achieve the best
performance. 2 Only a few graph applications benefit from the high bandwidth
MCDRAM, while others favor the low latency DDR4 DRAM. 3 Vector processing units
executing AVX512 SIMD instructions on KNLs are underutilized when running the
state-of-the-art graph framework. 4 The sub-NUMA cache clustering mode offering
the lowest local memory access latency hurts the performance of graph
benchmarks that are lack of NUMA awareness. At last, We suggest future works
including system auto-tuning tools and graph framework optimizations to fully
exploit the potential of KNL for parallel graph processing.Comment: published as L. Jiang, L. Chen and J. Qiu, "Performance
Characterization of Multi-threaded Graph Processing Applications on
Many-Integrated-Core Architecture," 2018 IEEE International Symposium on
Performance Analysis of Systems and Software (ISPASS), Belfast, United
Kingdom, 2018, pp. 199-20
- …