64 research outputs found
Transformations of High-Level Synthesis Codes for High-Performance Computing
Specialized hardware architectures promise a major step in performance and
energy efficiency over the traditional load/store devices currently employed in
large scale computing systems. The adoption of high-level synthesis (HLS) from
languages such as C/C++ and OpenCL has greatly increased programmer
productivity when designing for such platforms. While this has enabled a wider
audience to target specialized hardware, the optimization principles known from
traditional software design are no longer sufficient to implement
high-performance codes. Fast and efficient codes for reconfigurable platforms
are thus still challenging to design. To alleviate this, we present a set of
optimizing transformations for HLS, targeting scalable and efficient
architectures for high-performance computing (HPC) applications. Our work
provides a toolbox for developers, where we systematically identify classes of
transformations, the characteristics of their effect on the HLS code and the
resulting hardware (e.g., increases data reuse or resource consumption), and
the objectives that each transformation can target (e.g., resolve interface
contention, or increase parallelism). We show how these can be used to
efficiently exploit pipelining, on-chip distributed fast memory, and on-chip
streaming dataflow, allowing for massively parallel architectures. To quantify
the effect of our transformations, we use them to optimize a set of
throughput-oriented FPGA kernels, demonstrating that our enhancements are
sufficient to scale up parallelism within the hardware constraints. With the
transformations covered, we hope to establish a common framework for
performance engineers, compiler developers, and hardware developers, to tap
into the performance potential offered by specialized hardware architectures
using HLS.
Slim Fly: A Cost Effective Low-Diameter Network Topology
We introduce a high-performance cost-effective network topology called Slim Fly that approaches the theoretically optimal network diameter. Slim Fly is based on graphs that approximate the solution to the degree-diameter problem. We analyze Slim Fly and compare it to both traditional and state-of-the-art networks. Our analysis shows that Slim Fly has significant advantages over other topologies in latency, bandwidth, resiliency, cost, and power consumption. Finally, we propose deadlock-free routing schemes and physical layouts for large computing centers as well as a detailed cost and power model. Slim Fly enables constructing cost-effective and highly resilient datacenter and HPC networks that offer low latency and high bandwidth under different HPC workloads such as stencil or graph computations.
RapidChiplet: A Toolchain for Rapid Design Space Exploration of Chiplet Architectures
Chiplet architectures are a promising paradigm to overcome the scaling
challenges of monolithic chips. Chiplets offer heterogeneity, modularity, and
cost-effectiveness. The design space of chiplet architectures is huge as there
are many degrees of freedom such as the number, size and placement of chiplets,
the topology of the inter-chiplet interconnect and many more. Existing tools
for cost and performance prediction are often too slow to explore this design
space. We present RapidChiplet, a fast, open-source toolchain to predict
latency and throughput of the inter-chiplet interconnect, as well as a chip's
manufacturing cost and thermal stability.
HexaMesh: Scaling to Hundreds of Chiplets with an Optimized Chiplet Arrangement
2.5D integration is an important technique to tackle the growing cost of
manufacturing chips in advanced technology nodes. This poses the challenge of
providing high-performance inter-chiplet interconnects (ICIs). As the number of
chiplets grows to tens or hundreds, it becomes infeasible to hand-optimize
their arrangement in a way that maximizes the ICI performance. In this paper,
we propose HexaMesh, an arrangement of chiplets that outperforms a grid
arrangement both in theory (network diameter reduced by 42%; bisection
bandwidth improved by 130%) and in practice (latency reduced by 19%; throughput
improved by 34%). HexaMesh enables large-scale chiplet designs with
high-performance ICIs.
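
The network diameter cited in the HexaMesh abstract is the longest shortest path between any two chiplets. As a minimal sketch of how that metric can be computed for the grid baseline, the Python below builds a grid where each chiplet links to its four direct neighbors and measures the diameter with breadth-first search. The function names `grid_adjacency` and `diameter` are hypothetical, the 4-neighbor model is an assumption, and the HexaMesh arrangement itself is not reproduced here.

```python
from collections import deque


def grid_adjacency(rows, cols):
    """Adjacency lists for a rows x cols grid with 4-neighbor links (assumed model)."""
    adj = {}
    for r in range(rows):
        for c in range(cols):
            neighbors = []
            for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                rr, cc = r + dr, c + dc
                if 0 <= rr < rows and 0 <= cc < cols:
                    neighbors.append((rr, cc))
            adj[(r, c)] = neighbors
    return adj


def diameter(adj):
    """Longest shortest-path length (in hops) of a connected, unweighted graph."""
    best = 0
    for src in adj:
        # Breadth-first search from src gives hop counts to every reachable node.
        dist = {src: 0}
        queue = deque([src])
        while queue:
            v = queue.popleft()
            for u in adj[v]:
                if u not in dist:
                    dist[u] = dist[v] + 1
                    queue.append(u)
        best = max(best, max(dist.values()))
    return best


if __name__ == "__main__":
    # Hop distance between opposite corners of a 4 x 4 grid.
    print(diameter(grid_adjacency(4, 4)))
```

The same `diameter` function accepts any adjacency dictionary, so a different arrangement could be compared by supplying its own neighbor lists.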
Sparse Hamming Graph: A Customizable Network-on-Chip Topology
Chips with hundreds to thousands of cores require scalable networks-on-chip
(NoCs). Customization of the NoC topology is necessary to reach the diverse
design goals of different chips. We introduce sparse Hamming graph, a novel NoC
topology with an adjustable cost-performance trade-off that is based on four NoC
topology design principles we identified. To efficiently customize this
topology, we develop a toolchain that leverages approximate floorplanning and
link routing to deliver fast and accurate cost and performance predictions. We
demonstrate how to use our methodology to achieve desired cost-performance
trade-offs while outperforming established topologies in cost, performance, or
both.
Learning Combinatorial Node Labeling Algorithms
We present a graph neural network to learn graph coloring heuristics using
reinforcement learning. Our learned deterministic heuristics give better
solutions than classical degree-based greedy heuristics and only take seconds
to evaluate on graphs with tens of thousands of vertices. As our approach is
based on policy gradients, it also learns a probabilistic policy. These
probabilistic policies outperform all greedy coloring baselines and a machine
learning baseline. Our approach generalizes several previous machine-learning
frameworks, which were applied to problems like minimum vertex cover. We also
demonstrate that our approach outperforms two greedy heuristics on minimum
vertex cover.
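
For context on the classical degree-based greedy heuristics used as baselines in the abstract above, the following is a minimal Python sketch of one common variant, largest-degree-first coloring. It is an illustration under stated assumptions, not the paper's baseline implementation: the function name `greedy_coloring` and the adjacency-dictionary input format are hypothetical.

```python
def greedy_coloring(adj):
    """Color an undirected graph greedily, visiting vertices by decreasing degree.

    adj maps each vertex to an iterable of its neighbors (assumed symmetric).
    Returns a dict mapping each vertex to a non-negative integer color.
    """
    # Largest-degree-first ordering; ties keep the dictionary's insertion order.
    order = sorted(adj, key=lambda v: len(adj[v]), reverse=True)
    color = {}
    for v in order:
        # Colors already taken by neighbors that have been colored.
        used = {color[u] for u in adj[v] if u in color}
        c = 0
        while c in used:
            c += 1
        color[v] = c
    return color


if __name__ == "__main__":
    # A 5-cycle: an odd cycle cannot be colored with fewer than 3 colors.
    cycle = {i: [(i - 1) % 5, (i + 1) % 5] for i in range(5)}
    coloring = greedy_coloring(cycle)
    print(coloring, "colors used:", max(coloring.values()) + 1)
```

A heuristic like this fixes the vertex order up front; the learned approach described in the abstract instead chooses the order with a trained policy.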
- …