High-Level Synthesis (HLS) enables the rapid prototyping of hardware
accelerators by combining a high-level description of the functional behavior
of a kernel with a set of micro-architecture optimizations as inputs. Such
optimizations can be described by inserting pragmas, e.g., for pipelining and
replication of units, or even higher-level transformations for HLS such as
automatic data caching using the AMD/Xilinx Merlin compiler. Selecting the best
combination of pragmas, even within a restricted set, remains particularly
challenging, and the typical state of practice uses design-space exploration
(DSE) to navigate this space.
However, because the performance distribution of pragma configurations is
highly irregular, typical DSE approaches are either extremely time-consuming or
operate on a severely restricted search space. This work proposes a framework
to automatically insert HLS pragmas in regular loop-based programs, supporting
pipelining, unit replication, and data caching. We develop an analytical
performance and resource model as a function of the input program properties
and pragmas inserted, using non-linear constraints and objectives. We prove
this model provides a lower bound on the actual performance after HLS. We then
encode this model as a Non-Linear Program (NLP) in which the pragma
configuration is the unknown of the system; solving this NLP computes that
configuration optimally. This approach can also be used during DSE to quickly
prune points with a (possibly partial) pragma configuration, using lower bounds
on achievable latency. We
extensively evaluate our end-to-end, fully implemented system, showing it can
effectively manipulate spaces of billions of designs in seconds to minutes for
the kernels evaluated.
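
As a rough illustration of how a latency lower bound can both select a pragma
configuration and prune a design space, the Python sketch below uses a
deliberately simplified, hypothetical model of a single loop. The constants
(TRIP, ITER_LATENCY, DSP_PER_UNIT, DSP_BUDGET), the function names, and the
latency formula are assumptions made for this example and are not the model
developed in this work. Exhaustive enumeration stands in for the NLP solver,
which is only practical here because the toy space is tiny.

```python
from itertools import product
from math import ceil

# Hypothetical single-loop kernel; every constant below is an assumption.
TRIP = 1024          # loop trip count
ITER_LATENCY = 8     # cycles for one iteration of the loop body
DSP_PER_UNIT = 5     # DSPs consumed by one replicated unit
DSP_BUDGET = 200     # DSPs available on the target device
UNROLL_FACTORS = [1, 2, 4, 8, 16, 32, 64]


def latency_lower_bound(unroll, pipeline):
    """Optimistic cycle count: perfect replication speedup and, when
    pipelined, an initiation interval of 1."""
    iterations = ceil(TRIP / unroll)
    if pipeline:
        return ITER_LATENCY + (iterations - 1)
    return iterations * ITER_LATENCY


def dsp_usage(unroll):
    """Resource model: each replicated unit needs its own DSPs."""
    return unroll * DSP_PER_UNIT


def feasible_configs():
    """Yield (unroll, pipeline) pairs that satisfy the resource constraint."""
    for unroll, pipeline in product(UNROLL_FACTORS, [False, True]):
        if TRIP % unroll == 0 and dsp_usage(unroll) <= DSP_BUDGET:
            yield unroll, pipeline


def best_config():
    """Pick the feasible configuration minimizing the modeled latency."""
    return min(feasible_configs(), key=lambda c: latency_lower_bound(*c))


def prune(best_known_latency):
    """Keep only configurations whose lower bound beats a known design."""
    return [c for c in feasible_configs()
            if latency_lower_bound(*c) < best_known_latency]


if __name__ == "__main__":
    print("selected configuration:", best_config())
    print("survivors when a 100-cycle design is known:", prune(100))
```

Because the bound is optimistic, any configuration whose bound is no better
than an already-synthesized design can be discarded without running HLS on it;
this is the pruning idea described above, shown on a toy space of 14 points.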