2,316 research outputs found
Performance Characterization of Multi-threaded Graph Processing Applications on Intel Many-Integrated-Core Architecture
Intel Xeon Phi many-integrated-core (MIC) architectures usher in a new era of
terascale integration. Among emerging killer applications, parallel graph
processing has been a critical technique to analyze connected data. In this
paper, we empirically evaluate various computing platforms including an Intel
Xeon E5 CPU, a Nvidia Geforce GTX1070 GPU and an Xeon Phi 7210 processor
codenamed Knights Landing (KNL) in the domain of parallel graph processing. We
show that the KNL gains encouraging performance when processing graphs, so that
it can become a promising solution to accelerating multi-threaded graph
applications. We further characterize the impact of KNL architectural
enhancements on the performance of a state-of-the art graph framework.We have
four key observations: 1 Different graph applications require distinctive
numbers of threads to reach the peak performance. For the same application,
various datasets need even different numbers of threads to achieve the best
performance. 2 Only a few graph applications benefit from the high bandwidth
MCDRAM, while others favor the low latency DDR4 DRAM. 3 Vector processing units
executing AVX512 SIMD instructions on KNLs are underutilized when running the
state-of-the-art graph framework. 4 The sub-NUMA cache clustering mode offering
the lowest local memory access latency hurts the performance of graph
benchmarks that are lack of NUMA awareness. At last, We suggest future works
including system auto-tuning tools and graph framework optimizations to fully
exploit the potential of KNL for parallel graph processing.Comment: published as L. Jiang, L. Chen and J. Qiu, "Performance
Characterization of Multi-threaded Graph Processing Applications on
Many-Integrated-Core Architecture," 2018 IEEE International Symposium on
Performance Analysis of Systems and Software (ISPASS), Belfast, United
Kingdom, 2018, pp. 199-20
DReAM: Dynamic Re-arrangement of Address Mapping to Improve the Performance of DRAMs
The initial location of data in DRAMs is determined and controlled by the
'address-mapping' and even modern memory controllers use a fixed and
run-time-agnostic address mapping. On the other hand, the memory access pattern
seen at the memory interface level will dynamically change at run-time. This
dynamic nature of memory access pattern and the fixed behavior of address
mapping process in DRAM controllers, implied by using a fixed address mapping
scheme, means that DRAM performance cannot be exploited efficiently. DReAM is a
novel hardware technique that can detect a workload-specific address mapping at
run-time based on the application access pattern which improves the performance
of DRAMs. The experimental results show that DReAM outperforms the best
evaluated address mapping on average by 9%, for mapping-sensitive workloads, by
2% for mapping-insensitive workloads, and up to 28% across all the workloads.
DReAM can be seen as an insurance policy capable of detecting which scenarios
are not well served by the predefined address mapping
An Experimental Study of Reduced-Voltage Operation in Modern FPGAs for Neural Network Acceleration
We empirically evaluate an undervolting technique, i.e., underscaling the
circuit supply voltage below the nominal level, to improve the power-efficiency
of Convolutional Neural Network (CNN) accelerators mapped to Field Programmable
Gate Arrays (FPGAs). Undervolting below a safe voltage level can lead to timing
faults due to excessive circuit latency increase. We evaluate the
reliability-power trade-off for such accelerators. Specifically, we
experimentally study the reduced-voltage operation of multiple components of
real FPGAs, characterize the corresponding reliability behavior of CNN
accelerators, propose techniques to minimize the drawbacks of reduced-voltage
operation, and combine undervolting with architectural CNN optimization
techniques, i.e., quantization and pruning. We investigate the effect of
environmental temperature on the reliability-power trade-off of such
accelerators. We perform experiments on three identical samples of modern
Xilinx ZCU102 FPGA platforms with five state-of-the-art image classification
CNN benchmarks. This approach allows us to study the effects of our
undervolting technique for both software and hardware variability. We achieve
more than 3X power-efficiency (GOPs/W) gain via undervolting. 2.6X of this gain
is the result of eliminating the voltage guardband region, i.e., the safe
voltage region below the nominal level that is set by FPGA vendor to ensure
correct functionality in worst-case environmental and circuit conditions. 43%
of the power-efficiency gain is due to further undervolting below the
guardband, which comes at the cost of accuracy loss in the CNN accelerator. We
evaluate an effective frequency underscaling technique that prevents this
accuracy loss, and find that it reduces the power-efficiency gain from 43% to
25%.Comment: To appear at the DSN 2020 conferenc
Improving DRAM Performance by Parallelizing Refreshes with Accesses
Modern DRAM cells are periodically refreshed to prevent data loss due to
leakage. Commodity DDR DRAM refreshes cells at the rank level. This degrades
performance significantly because it prevents an entire rank from serving
memory requests while being refreshed. DRAM designed for mobile platforms,
LPDDR DRAM, supports an enhanced mode, called per-bank refresh, that refreshes
cells at the bank level. This enables a bank to be accessed while another in
the same rank is being refreshed, alleviating part of the negative performance
impact of refreshes. However, there are two shortcomings of per-bank refresh.
First, the per-bank refresh scheduling scheme does not exploit the full
potential of overlapping refreshes with accesses across banks because it
restricts the banks to be refreshed in a sequential round-robin order. Second,
accesses to a bank that is being refreshed have to wait.
To mitigate the negative performance impact of DRAM refresh, we propose two
complementary mechanisms, DARP (Dynamic Access Refresh Parallelization) and
SARP (Subarray Access Refresh Parallelization). The goal is to address the
drawbacks of per-bank refresh by building more efficient techniques to
parallelize refreshes and accesses within DRAM. First, instead of issuing
per-bank refreshes in a round-robin order, DARP issues per-bank refreshes to
idle banks in an out-of-order manner. Furthermore, DARP schedules refreshes
during intervals when a batch of writes are draining to DRAM. Second, SARP
exploits the existence of mostly-independent subarrays within a bank. With
minor modifications to DRAM organization, it allows a bank to serve memory
accesses to an idle subarray while another subarray is being refreshed.
Extensive evaluations show that our mechanisms improve system performance and
energy efficiency compared to state-of-the-art refresh policies and the benefit
increases as DRAM density increases.Comment: The original paper published in the International Symposium on
High-Performance Computer Architecture (HPCA) contains an error. The arxiv
version has an erratum that describes the error and the fix for i
Doctor of Philosophy
dissertationThe computing landscape is undergoing a major change, primarily enabled by ubiquitous wireless networks and the rapid increase in the use of mobile devices which access a web-based information infrastructure. It is expected that most intensive computing may either happen in servers housed in large datacenters (warehouse- scale computers), e.g., cloud computing and other web services, or in many-core high-performance computing (HPC) platforms in scientific labs. It is clear that the primary challenge to scaling such computing systems into the exascale realm is the efficient supply of large amounts of data to hundreds or thousands of compute cores, i.e., building an efficient memory system. Main memory systems are at an inflection point, due to the convergence of several major application and technology trends. Examples include the increasing importance of energy consumption, reduced access stream locality, increasing failure rates, limited pin counts, increasing heterogeneity and complexity, and the diminished importance of cost-per-bit. In light of these trends, the memory system requires a major overhaul. The key to architecting the next generation of memory systems is a combination of the prudent incorporation of novel technologies, and a fundamental rethinking of certain conventional design decisions. In this dissertation, we study every major element of the memory system - the memory chip, the processor-memory channel, the memory access mechanism, and memory reliability, and identify the key bottlenecks to efficiency. Based on this, we propose a novel main memory system with the following innovative features: (i) overfetch-aware re-organized chips, (ii) low-cost silicon photonic memory channels, (iii) largely autonomous memory modules with a packet-based interface to the proces- sor, and (iv) a RAID-based reliability mechanism. Such a system is energy-efficient, high-performance, low-complexity, reliable, and cost-effective, making it ideally suited to meet the requirements of future large-scale computing systems
- …