Modern scientific workflows require hybrid infrastructures combining numerous
decentralized resources on the IoT/Edge interconnected to Cloud/HPC systems
(aka the Computing Continuum) to enable their optimized execution.
Understanding and optimizing the performance of such complex Edge-to-Cloud
workflows is challenging. Capturing the provenance of key performance
indicators, with their related data and processes, may assist in understanding
and optimizing workflow executions. However, the capture overhead can be
prohibitive, particularly on resource-constrained devices such as those on
the IoT/Edge. To address this challenge, based on a performance analysis of
existing systems, we propose ProvLight, a tool to enable efficient provenance
capture on the IoT/Edge. We leverage simplified data models, data compression
and grouping, and lightweight transmission protocols to reduce overheads. We
further integrate ProvLight into the E2Clab framework to enable workflow
provenance capture across the Edge-to-Cloud Continuum. This integration makes
E2Clab a promising platform for the performance optimization of applications
through reproducible experiments. We validate ProvLight at a large scale with
synthetic workloads on 64 real-life IoT/Edge devices in the FIT IoT LAB
testbed. Evaluations show that ProvLight outperforms state-of-the-art systems
like ProvLake and DfAnalyzer on resource-constrained devices. ProvLight is
26-37x faster at capturing and transmitting provenance data, uses 5-7x less
CPU and 2x less memory, transmits 2x less data, and consumes 2-2.5x less energy.
ProvLight and E2Clab are available as open-source tools.
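
As a rough illustration of why grouping and compressing provenance records can reduce the amount of data sent from a constrained device, the sketch below batches several small records into one compact JSON payload and compresses it with zlib. This is not ProvLight's implementation: the record fields and the helper names (make_records, pack_batch, unpack_batch) are hypothetical, and the choice of JSON and zlib is an assumption made only for this example.

```python
import json
import zlib


def make_records(n):
    """Build n small, hypothetical provenance records (illustrative fields only)."""
    return [
        {"task": f"t{i}", "workflow": "demo", "status": "finished", "cpu_s": 0.25}
        for i in range(n)
    ]


def pack_batch(records):
    """Group records into a single payload, then compress it.

    Compact separators drop the whitespace json.dumps adds by default,
    and compressing the whole batch lets zlib exploit the redundancy
    between records (repeated keys and values).
    """
    raw = json.dumps(records, separators=(",", ":")).encode("utf-8")
    return zlib.compress(raw, 9)


def unpack_batch(payload):
    """Inverse of pack_batch: decompress and decode the grouped records."""
    return json.loads(zlib.decompress(payload).decode("utf-8"))


if __name__ == "__main__":
    records = make_records(50)
    grouped = pack_batch(records)
    # Baseline: compress and send each record on its own.
    one_by_one = sum(len(pack_batch([r])) for r in records)
    print("grouped bytes:   ", len(grouped))
    print("one-by-one bytes:", one_by_one)
```

Running the script prints both payload sizes, so the effect of grouping can be checked on any set of records. The gain depends on how repetitive the records are, so the result on this synthetic data says nothing about the measurements reported above.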