Data-intensive end-user analyses in high energy physics require high data throughput to reach short turnaround cycles. This leads to enormous challenges for storage and network infrastructure, especially when facing the tremendously increasing amount of data to be processed during High-Luminosity LHC runs. Including opportunistic resources with volatile storage systems into the traditional HEP computing facilities makes this situation more complex.
Bringing data close to the computing units is a promising approach to solve throughput limitations and improve the overall performance. We focus on coordinated distributed caching by coordinating workflows to the most suitable hosts in terms of cached files. This allows optimizing the overall processing efficiency of data-intensive workflows and using the limited cache volume efficiently by reducing replication of data on distributed caches.
We developed the NaviX coordination service at KIT that realizes coordinated distributed caching using an XRootD cache proxy server infrastructure and the HTCondor batch system. In this paper, we present the experience gained in operating coordinated distributed caches on cloud and HPC resources. Furthermore, we show benchmarks of a dedicated high throughput cluster, the Throughput-Optimized Analysis-System (TOpAS), which is based on the above-mentioned concept.

Giffels, M.

Heidecker, C.

Quast, G.

Sauter, M.

Schnepf, M. J.

Von Cube, R. F.


Journal of Physics: Conference Series
PAPER • OPEN ACCESS
To cite this article: C Heidecker et al 2020 J. Phys.: Conf. Ser. 1525 012065

Content from this work may be used under the terms of the Creative Commons Attribution 3.0 licence. Any further distribution of this work must maintain attribution to the author(s) and the title of the work, journal citation and DOI. Published under licence by IOP Publishing Ltd.

ACAT 2019
Journal of Physics: Conference Series 1525 (2020) 012065
IOP Publishing
doi:10.1088/1742-6596/1525/1/012065

Boosting Performance of Data-intensive Analysis Workflows with Distributed Coordinated Caching

C Heidecker, R F von Cube, M Giffels, G Quast, M Sauter and M J Schnepf
KIT - Karlsruhe Institute of Technology, Germany
E-mail: christoph.heidecker@kit.edu

1. Introduction

The performance of data-intensive workflows is limited by the data transfer rate [1]. This is especially significant within a distributed computing infrastructure, where workflows must access remote data when it is not available locally. We can avoid bottlenecks regarding data transfer rates by bringing data processing and storage resources close enough together [2]. Applying this data locality on various scales, e.g. within regional computing clusters or on a per-host basis, allows established computing infrastructures to be tuned individually for data-intensive workflows.

Caching data to realize data locality fits especially well for HEP workflows that repeatedly access large amounts of data as input. If the throughput for accessing remote data is limited, repeatedly accessing cached input data optimizes the overall data throughput and, thus, the CPU efficiency of workflows waiting for data. Besides increasing the efficiency of data processing and thus reducing the processing time, caching reduces the load on shared storage and network resources.

Conventional caches deployed in a distributed infrastructure are not necessarily used efficiently.
Repeated processing of workflows that might run on different hosts can cause duplication of data in multiple caches. We want to avoid duplication, since it wastes limited cache space, decreases the cache hit rate, and, thus, reduces the total data throughput rate [2]. Therefore, it is essential to coordinate data placement and schedule workflows to the most suitable host in terms of data locality.

2. Coordinating distributed caches

We focus on increasing the efficiency of the WLCG Tier 3 computing infrastructure for HEP workflows, including institute and opportunistic resources. There are two main challenges when coordinating caches in a distributed computing infrastructure: data selection and job-to-cache coordination. We need to select data relevant for caching, since we only need to speed up jobs that suffer from an insufficient data transfer rate. Other jobs will indirectly profit from resources that are freed faster. Furthermore, we need to coordinate jobs to the most suitable host in terms of data locality. To foster data locality, we need to influence the batch system, especially the scheduling of jobs to resources. By scheduling jobs to resources, we directly influence the placement of data in caches and thus also coordinate data in distributed caches.

3. NaviX: Implementation of job and data coordination

We realized the coordinated distributed caching concept by implementing the coordination service NaviX [3]. It adds job and data coordination to an HTCondor batch system [4] using an XRootD [5] caching infrastructure. This section gives a short overview of the working principle of NaviX. A more detailed description can be found in [6].
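
A small toy simulation in Python illustrates the duplication problem that motivates this coordination. It is a sketch under simplifying assumptions, not NaviX code. Each host's XRootD caching proxy is modelled as an unbounded read-through cache. The class `ReadThroughCache` and all other names are hypothetical. Uncoordinated scheduling is modelled as a job landing on a different host each time it is reprocessed. Coordination simply sends a job to the host that already caches the largest fraction of its input files.

```python
class ReadThroughCache:
    """Toy stand-in for a caching proxy: serve local copies, cache misses on the fly."""

    def __init__(self):
        self.files = set()
        self.hits = 0
        self.misses = 0

    def read(self, name):
        if name in self.files:
            self.hits += 1          # local copy is read
        else:
            self.misses += 1        # file is streamed from remote storage ...
            self.files.add(name)    # ... and cached for repeated access


def locality_score(files, cache):
    """Fraction of a job's input files already present in a cache."""
    return sum(f in cache.files for f in files) / len(files)


def simulate(jobs, n_hosts, rounds, coordinated):
    """Process every job `rounds` times; return (cached copies, cache hit rate)."""
    caches = [ReadThroughCache() for _ in range(n_hosts)]
    for r in range(rounds):
        for i, files in enumerate(jobs):
            host = (i + r) % n_hosts      # uncoordinated: wherever a slot frees up
            if coordinated:
                scores = [locality_score(files, c) for c in caches]
                if max(scores) > 0:       # prefer the host with the best data locality
                    host = scores.index(max(scores))
            for f in files:
                caches[host].read(f)
    copies = sum(len(c.files) for c in caches)
    hits = sum(c.hits for c in caches)
    reads = hits + sum(c.misses for c in caches)
    return copies, hits / reads


# Four jobs with three distinct input files each, processed twice on four hosts.
jobs = [[f"job{i}/file{k}.root" for k in range(3)] for i in range(4)]
print(simulate(jobs, n_hosts=4, rounds=2, coordinated=False))  # (24, 0.0)
print(simulate(jobs, n_hosts=4, rounds=2, coordinated=True))   # (12, 0.5)
```

In this toy setting, the uncoordinated run stores every file twice and never hits the cache. The coordinated run keeps a single copy per file and serves the second pass entirely from cache. Real scheduling also has to respect free slots and cache capacity, which this sketch ignores.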

[Figure 1 (schematic): HEP user, job submission management, NaviX coordination service, resource pool of worker nodes with XRootD cache proxy, Grid storage element; arrows indicate job flow, data flow and cache information.]
Figure 1. NaviX orchestrates an XRootD caching proxy server infrastructure and an HTCondor batch system to coordinate jobs to the most suitable computing resource in terms of data locality.

We require an HTCondor batch system and concentrate on data being transferred via the XRootD protocol. Both are widely used within the HEP community. HTCondor does not natively consider data locality when scheduling jobs to resources. We therefore need to include this information by manipulating submitted jobs before they are scheduled. XRootD already provides caching via so-called caching proxy servers [8] and allows clients accessing files to be transparently redirected to such a caching proxy server¹. A caching proxy server reads the local copy if the file is already cached. Otherwise, it streams the accessed file while caching it on the fly for repeated access.

¹ The transparent redirection of data transfers to the caching proxy server requires XRootD version 4.7 or higher.

Such XRootD caching proxy servers can be placed within a distributed infrastructure. As shown in Figure 1, each submitted job triggers the NaviX coordination service to coordinate the job based on data locality. NaviX uses information about the input data of the job as well as the files already cached within XRootD proxy servers to calculate how well each cache fits the job. Based on the calculated data locality score, it updates the job description to influence the scheduling of HTCondor. The decision is updated periodically while jobs are waiting for resources to become available.
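
The scoring and job-update step described above can be sketched as follows. This is a hypothetical illustration, not the NaviX implementation, and it rests on two assumptions. First, the score is taken to be the fraction of a job's declared input files that each cache already holds. Second, the preference is assumed to be expressed through the job's HTCondor `Rank` expression, which HTCondor uses to prefer matching machines with higher values. The paper does not state which job attribute NaviX actually modifies. The host names and file paths are placeholders.

```python
def locality_scores(input_files, cache_contents):
    """Score each host by the fraction of the job's input files its cache holds.

    cache_contents maps a (hypothetical) host name to the set of cached files.
    """
    wanted = set(input_files)
    if not wanted:                      # no declared inputs: no locality preference
        return {host: 0.0 for host in cache_contents}
    return {host: len(wanted & cached) / len(wanted)
            for host, cached in cache_contents.items()}


def rank_expression(scores):
    """Build a ClassAd expression that evaluates to the score of the matched machine."""
    terms = [f'ifThenElse(TARGET.Machine == "{host}", {score:.2f}, 0)'
             for host, score in sorted(scores.items()) if score > 0]
    return " + ".join(terms) if terms else "0"


job_inputs = ["/store/a.root", "/store/b.root", "/store/c.root", "/store/d.root"]
caches = {
    "wn01.example.org": {"/store/a.root", "/store/b.root", "/store/c.root"},
    "wn02.example.org": {"/store/d.root"},
    "wn03.example.org": set(),
}
scores = locality_scores(job_inputs, caches)
print(scores)  # {'wn01.example.org': 0.75, 'wn02.example.org': 0.25, 'wn03.example.org': 0.0}
print("Rank =", rank_expression(scores))
```

Each machine matches at most one term, so the expression evaluates to that machine's score, and HTCondor would prefer wn01 over wn02. Recomputing both values from fresh cache information corresponds to the periodic update of the decision while a job is still idle.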

We require the user to specify the list of input files of each job, since this list cannot be reliably extracted from the job itself.

4. Benchmarking prototype setups for distributed caching

Benchmark results of a first prototype system for coordinated distributed caching were presented at the CHEP conference in 2018 [6]. We showed that NaviX successfully coordinates jobs to the most suitable hosts in terms of data locality. It reduced the duplication of data in distributed caches and improved the overall data throughput by coordinating jobs to the cached data. Operating this prototype system allowed us to test the suitability of coordinated distributed caching on different kinds of resources. In this section, we present benchmark results of a dedicated high throughput cluster, an HPC cluster, and a cloud resource when using distributed caches coordinated by NaviX.

4.1. A dedicated high throughput cluster

[Figure 2 (plot): CPU efficiency (%) versus fraction of cached files (%), one test run.]
Figure 2. HEP analysis benchmark for the TOpAS cluster using 9 coordinated NVMe SSD caches and accessing data stored at a remote Tier 2 center. The dots show the mean value of 420 jobs processed in parallel; the error bars represent the variance. [7]

We installed the Throughput-Optimized Analysis-System (TOpAS), a dedicated high throughput cluster at the WLCG Tier 1 center GridKa. This cluster consists of 11 worker nodes. It is designed for data-intensive end-user analysis workflows that benefit from a fast network connection to WLCG storage resources. Each worker node has a single 1 TB NVMe SSD, and all worker nodes share a 1 PB distributed file system, both intended for caching.

First benchmarks of the TOpAS setup use the NVMe SSDs as coordinated caches to test their usability as a fast level-one cache. We measure the benefit for a data-intensive CMS jet energy calibration workflow, which serves as an estimate of the benefit for HEP analysis workflows.
The test jobs accesseddata from remote Tier 2 Grid storage elementsto evaluate the benefits of caching withinthe WLCG structure. We repeatedly processthe same jobs while increasing the fractionof cached files in steps of 10%. This allowsscanning for an optimal working point, where jobs profit from optimally combining the datatransfer rate from remote Grid storage elements via network with direct access to local cache.As Figure 2 shows, processing cached data directly from built-in SSD increased the CPUefficiency of the workflow by about 20% compared to accessing remote Tier 2 Grid storageelements. When data transfer from remote Grid storage elements is limited, the high datatransfer rate of built-in caches accelerate the jobs. Speeding up the workflows by 20% CPUefficiency means, that we can save the same fraction of costs for CPU cores. A realistic speed-upfactor fspeed−up of the cluster depends on fraction(naccess−1naccess)of cache read accesses comparedto the total number of file accesses naccess. Comparing the costs for additional SSD cachesACAT 2019Journal of Physics: Conference Series 1525 (2020) 012065IOP Publishingdoi:10.1088/1742-6596/1525/1/0120654(costSSD ≈ 200e) with additional worker nodes (costwn ≈ 5000e) required for compensatingthe lost CPU hours, caching helps to save about(1 − 1−fspeed−upfspeed−up ·costSSDcostwn)percent of money.Even if we assume an average cache hit rate of 80-100% and only 2 file accesses in total, weget a realistic speed-up factor of fspeed−up = 0.1, which means that we save about 64% money.Additionally, caching allows for reducing the load for the network and Grid storage infrastructureby accessing data internally within the cluster. This allows us to reduce costs while freeingresources for other WLCG users.4.2. Boosting shared computing infrastructuresWe, additionally, investigated the benefits of coordinated distributed caching for sharedcomputing infrastructures such as cloud and HPC infrastructures. 
In contrast to TOpAS, we have to face fluctuating performance of shared storage, computing and network resources. Congestion of these resources complicates optimizing the data throughput.

We successfully used resources at the NEMO HPC cluster in Freiburg [9, 10] and the OpenTelekom cloud [1]. When processing data-intensive workflows on these resources, we observed inefficient CPU utilization caused by limited data transfer rates between the processing hosts and the Grid storage elements [11]. We expect an improvement of the data transfer rate when placing caching proxy servers directly inside the HPC cluster and the OpenTelekom cloud resources.

[Figure 3 (two plots): processing time (min) versus fraction of cached files (%), Test-run 1 and Test-run 2; left: NEMO, right: OpenTelekom.]
Figure 3. Maximum achievable performance using a cache within the shared NEMO HPC center (left) and the shared OpenTelekom cloud (right). The dots show the mean value of 420 jobs processed in parallel; the error bars represent the variance. [7]

To measure the performance gain caused by caching, we used test jobs that only read data without further processing. Again, we repeatedly processed the same jobs while increasing the fraction of cached files in steps of 10% to scan for an optimal working point. We repeated the measurement after one week and observed different behavior for both NEMO and OpenTelekom. The results of the test runs are shown in Figure 3. Since both are shared resources, the performance of the setup varies over time according to the load caused by other users. At NEMO, the achievable data throughput is influenced by the load on the network and on the distributed file system used as cache volume. At OpenTelekom in particular, the limiting factor changed between the two runs. The I/O rate of the cache volume limited the first test run, whereas limited network transfers slowed down the second test run.
Since we have no detailed insight into how OpenTelekom provides storage and network in virtual machines, we cannot determine the exact reasons. We will have to balance network performance against the performance of the cache volume. This decision must also be adapted to the conditions at the different systems, which evolve due to congestion. Whether caching increases performance therefore depends strongly on the resource provider and the utilization of resources. Caching makes resources that provide storage volume suitable for caching available for processing data-intensive HEP analysis workflows. In particular, HPC centers, in which distributed file systems are usually quickly accessible, can easily be adopted for HEP use.

5. Conclusion

The coordinated distributed caching concept utilizes caches in a distributed computing infrastructure and coordinates workflows to the most suitable host in terms of data locality. We have shown that coordinated distributed caches can improve the overall processing efficiency for data-intensive jobs on specific computing infrastructures.

This concept was realized on the dedicated high throughput system TOpAS. We observed a performance improvement for a data-intensive HEP workflow when accessing data from built-in caches instead of remote Tier 2 WLCG Grid storage resources. Here, the high data transfer rate improves the data throughput of jobs and reduces the processing time. By increasing the CPU efficiency of jobs, caching optimizes the overall data throughput of the cluster and reduces costs. Furthermore, caching as much data as possible should also be considered, since it frees resources for other WLCG users.

Additionally, we tested caching at resources not dedicated to HEP usage, such as the NEMO HPC center and the OpenTelekom cloud.
Making these resources available for processing HEP workflows adds CPU resources and thus helps to meet the increasing need for computing resources. Here, we observed highly fluctuating performance caused by resources being shared among different users. We can adapt both the selection of data for caching and the scheduling of jobs to cached data to the computing infrastructure and resource provider. Doing so enables efficient processing of data-intensive HEP workflows.

References

[1] Schnepf M J et al. 2019 Dynamic Integration and Management of Opportunistic Resources for HEP, Proceedings of the 23rd International Conference on Computing in High Energy and Nuclear Physics (to be published)
[2] Fischer M et al. 2016 Data Locality via Coordinated Caching for Distributed Processing, Journal of Physics: Conference Series 62 012011
[3] Sauter M and Heidecker C 2017 NaviX [software], version 1.0. Available from https://gitlab.ekp.kit.edu/ETP-HTC/NaviX [accessed 2019-05-13]
[4] Thain D et al. 2005 Distributed computing in practice: the Condor experience, Concurrency and Computation: Practice and Experience 17 (2-4) 323-356
[5] Dorigo A et al. 2005 XROOTD/TXNetFile: A Highly Scalable Architecture for Data Access in the ROOT Environment, Proceedings of TELE-INFO'05 46:1-46:6
[6] Heidecker C et al. 2019 Advancing throughput of HEP analysis workflows using caching concepts, Proceedings of the 23rd International Conference on Computing in High Energy and Nuclear Physics (to be published)
[7] Sauter M 2019 Increasing efficiency of HEP workflows by coordinated distributed caching, Karlsruhe Institute of Technology: Institute of Experimental Particle Physics (to be published)
[8] Bauerdick L A T et al. 2014 XRootd, disk-based, caching proxy for optimization of data access, data placement and data replication, Journal of Physics: Conference Series 513 (4) 042044
[9] Meier K et al. 2016 Dynamic provisioning of a HEP computing infrastructure on a shared hybrid HPC system, Journal of Physics: Conference Series 762 012012
[10] Heidecker C et al. 2019 Dynamic Resource Extension for Data Intensive Computing with Specialized Software Environments on HPC systems, Proceedings of the 5th bwHPC-Symposium (to be published)
[11] Schnepf M J et al. 2018 Mastering Opportunistic Computing Resources for HEP, Journal of Physics: Conference Series 1085 032056

Boosting Performance of Data-intensive Analysis Workflows with Distributed Coordinated Caching

https://publikationen.bibliothek.kit.edu/1000123017/87185326