This is the final version of the article. Available from Elsevier via the DOI in this record.As computing hardware evolves, increasing core counts mean that memory bandwidth is becoming the deciding factor in attaining peak performance of numerical methods. High-order finite element methods, such as those implemented in the spectral/hp framework Nektar++, are particularly well-suited to this environment. Unlike low-order methods that typically utilise sparse storage, matrices representing high-order operators have greater density and richer structure. In this paper, we show how these qualities can be exploited to increase runtime performance on nodes that comprise a typical high-performance computing system, by amalgamating the action of key operators on multiple elements into a single, memory-efficient block. We investigate different strategies for achieving optimal performance across a range of polynomial orders and element types. As these strategies all depend on external factors such as BLAS implementation and the geometry of interest, we present a technique for automatically selecting the most efficient strategy at runtime.We thank D. Ekelschot and M. Turner for their assistance in generating the mesh and parameters for the simulation of Section 6. We also thank F. Witherden for initial discussions motivating this study. This work was funded in part by support from the libHPC II EPSRC project under grant EP/K038788/1. DM additionally acknowledges support under the Laminar Flow Control Centre funded by Airbus/EADS and EPSRC under grant EP/I037946. SJS acknowledges Royal Academy of Engineering support under their research chair scheme. We thank the Imperial College High Performance Computing Service for computing time used to calculate the results seen in Section 6. We additionally acknowledge access to ARCHER with support from the UK Turbulence Consortium under EPSRC grant EP/L000261/1