2 research outputs found
Storage and Memory Characterization of Data Intensive Workloads for Bare Metal Cloud
As the cost-per-byte of storage systems dramatically decreases, SSDs are
finding their way into emerging cloud infrastructure. A similar trend is
underway for the main memory subsystem, as advanced DRAM technologies with
higher capacity, frequency, and channel counts are being deployed for
cloud-scale solutions, especially in non-virtualized environments where cloud
subscribers can precisely specify the configuration of the underlying
hardware. Given the performance sensitivity of standard workloads to memory
hierarchy parameters, it is important to understand the role of memory and
storage for data-intensive workloads. In this paper, we investigate how the
choice of DRAM (high-end vs. low-end) impacts the performance of Hadoop,
Spark, and MPI-based Big Data workloads in the presence of different storage
types on a bare metal cloud. Through a methodical experimental setup, we
analyze the impact of DRAM capacity, operating frequency, the number of
channels, storage type, and scale-out factors on the performance of these
popular frameworks. Based on micro-architectural analysis, we classify the
data-intensive workloads into three groups: I/O bound, compute bound, and
memory bound. The characterization results show that DRAM capacity, frequency,
and the number of channels play no significant role in the performance of the
studied Hadoop workloads, as they are mostly I/O bound. On the other hand, our
results reveal that iterative tasks (e.g., machine learning) in Spark and MPI
benefit from high-end DRAM, in particular high frequency and a large number of
channels, as they are memory or compute bound. Our results show that using a
PCIe SSD cannot shift the bottleneck from storage to memory, although it can
change the workload behavior from I/O bound to compute bound.
Comment: 8 pages, research draft
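As an illustration of the kind of micro-architectural classification the abstract describes, the following minimal Python sketch labels a workload run as I/O, memory, or compute bound. The counter names (iowait fraction, cycles per instruction, DRAM bandwidth utilization) and the thresholds are illustrative assumptions, not values taken from the paper.

# Hypothetical sketch: coarse bottleneck classification from
# micro-architectural counters. Thresholds are illustrative only.
def classify_workload(iowait_frac, cpi, mem_bw_util):
    """Return a coarse bottleneck label for one workload run.

    iowait_frac : fraction of CPU time spent waiting on I/O (0..1)
    cpi         : measured cycles per instruction
    mem_bw_util : fraction of peak DRAM bandwidth consumed (0..1)
    """
    if iowait_frac > 0.30:               # storage dominates execution time
        return "I/O bound"
    if mem_bw_util > 0.70 or cpi > 1.5:  # stalls in the memory hierarchy
        return "memory bound"
    return "compute bound"

# Example: an iterative Spark ML task that saturates DRAM bandwidth
print(classify_workload(iowait_frac=0.05, cpi=1.8, mem_bw_util=0.82))
# -> "memory bound"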
Pyramid: Machine Learning Framework to Estimate the Optimal Timing and Resource Usage of a High-Level Synthesis Design
The emergence of High-Level Synthesis (HLS) tools shifted the paradigm of
hardware design by making it feasible to map high-level programming languages
to hardware designs, e.g., C to VHDL/Verilog. HLS tools offer a plethora of
techniques to optimize designs for both area and performance, but the resource
usage and timing reports of HLS tools mostly deviate from the
post-implementation results. In addition, to evaluate the performance of a
hardware design, it is critical to determine the maximum achievable clock
frequency. Obtaining such information using the static timing analysis
provided by CAD tools is difficult due to the multitude of tool options.
Moreover, a binary search to find the maximum frequency is tedious,
time-consuming, and often does not obtain the optimal result. To address these
challenges, we propose a framework, called Pyramid, that uses machine learning
to accurately estimate the optimal performance and resource utilization of an
HLS design. For this purpose, we first create a database of C-to-FPGA results
from a diverse set of benchmarks. To find the achievable maximum clock
frequency, we use Minerva, an automated hardware optimization tool. Minerva
determines close-to-optimal tool settings using static timing analysis and a
heuristic algorithm, targeting either optimal throughput or throughput-to-area.
Pyramid uses the database to train an ensemble machine learning model that
maps the HLS-reported features to the results of Minerva. In this way, Pyramid
re-calibrates the results of HLS to bridge the accuracy gap, enabling
developers to estimate the throughput or throughput-to-area of a hardware
design with more than 95% accuracy without performing the actual
implementation.
Comment: This paper has been accepted at the International Conference on
Field-Programmable Logic and Applications 2019 (FPL'19).
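To make the "train an ensemble model on HLS-reported features" idea concrete, here is a minimal Python sketch using scikit-learn. The CSV file name, feature columns, and target column are hypothetical placeholders; Pyramid's actual feature set, ensemble composition, and training data are described in the paper and are not reproduced here.

# Minimal sketch, assuming scikit-learn and a CSV of C-to-FPGA results.
# Column names below are hypothetical, not Pyramid's real schema.
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_percentage_error

# Each row: HLS-reported features for one design point, plus the maximum
# clock frequency found after implementation (the training target).
data = pd.read_csv("c_to_fpga_results.csv")          # hypothetical file
features = ["hls_luts", "hls_ffs", "hls_dsps", "hls_brams", "hls_est_clock"]
X, y = data[features], data["post_impl_fmax_mhz"]

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# One member of an ensemble; an ensemble framework would combine several
# such learners rather than rely on a single regressor.
model = GradientBoostingRegressor(n_estimators=300, max_depth=4)
model.fit(X_tr, y_tr)

err = mean_absolute_percentage_error(y_te, model.predict(X_te))
print(f"held-out Fmax estimation error: {err:.1%}")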