Vector processing is highly effective in boosting processor performance and
efficiency for data-parallel workloads. In this paper, we present Ara2, the
first fully open-source vector processor to support the RISC-V V 1.0 frozen
ISA. We evaluate Ara2's performance on a diverse set of data-parallel kernels
for various problem sizes and vector-unit configurations, achieving an average
functional-unit utilization of 95% on the most computationally intensive
kernels. We pinpoint performance boosters and bottlenecks, including the scalar
core, memories, and vector architecture, providing insights into the vector
architecture's main performance drivers. Leveraging the openness of the
design, we implement Ara2 in a 22 nm technology, characterize its power,
performance, and area (PPA) metrics across several configurations (2 to 16
lanes), and analyze its microarchitectural and implementation bottlenecks.
Ara2 achieves a state-of-the-art energy efficiency of 37.8 DP-GFLOPS/W (at
0.8 V) and a clock frequency of 1.35 GHz (critical path: ~40 FO4 gates).
Finally, we explore the performance and energy-efficiency
trade-offs of multi-core vector processors: we find that multiple vector cores
help overcome the scalar-core issue-rate bound that limits short-vector
performance. For example, a cluster of eight 2-lane Ara2 cores (16 FPUs)
achieves more than 3x higher performance than a single 16-lane Ara2 core (16
FPUs) when executing a 32x32x32 matrix multiplication, with 1.5x better energy
efficiency.
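
The issue-rate bound on short vectors can be illustrated with a deliberately
simplified toy model. It is a sketch, not the evaluation methodology used in
this work. It assumes that each lane's FPU processes one double-precision
element per cycle, so a vector instruction of VL elements keeps an L-lane unit
busy for VL/L cycles, and that the scalar core can issue one vector
instruction every `issue_interval_cycles` cycles. The function name and the
issue interval of 4 cycles are hypothetical and chosen only for illustration.

```python
def fpu_utilization_bound(vl_elems, lanes, issue_interval_cycles):
    """Upper bound on FPU utilization in a toy issue-rate model.

    Assumption: one element per lane per cycle, so one vector instruction
    occupies the FPUs for vl_elems / lanes cycles. If the scalar core needs
    issue_interval_cycles to deliver the next instruction, the FPUs idle
    for the remainder of that interval.
    """
    busy_cycles = vl_elems / lanes
    return min(1.0, busy_cycles / issue_interval_cycles)


# Hypothetical issue interval; not a measured Ara2 figure.
ISSUE_INTERVAL = 4

# 32-element vectors, as in the rows of a 32x32x32 matrix multiplication.
for lanes in (2, 16):
    bound = fpu_utilization_bound(32, lanes, ISSUE_INTERVAL)
    print(f"{lanes:2d} lanes: utilization bound = {bound:.2f}")
```

Under these assumptions, a 16-lane unit finishes each 32-element instruction
in 2 cycles and then waits on the scalar core, whereas a 2-lane unit stays
busy for 16 cycles per instruction and is not issue-bound. Spreading the same
16 FPUs over eight 2-lane cores, each with its own scalar core, therefore
removes the bound in this model; the actual speedup depends on effects the
model ignores, such as memory bandwidth and synchronization.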