To enable data locality, we have developed an approach of adding coordinated caches to existing compute clusters. Since the data stored locally is volatile and selected dynamically, only a fraction of local storage space is required. Our approach allows the degree to which data locality is provided to be freely selected. It may be used to work in conjunction with large network bandwidths, providing only highly used data to reduce peak loads. Alternatively, local storage may be scaled up to perform data analysis even with low network bandwidth. To prove the applicability of our approach, we have developed a prototype implementing all required functionality. It integrates seamlessly into batch systems, requiring practically no adjustments by users. We have now been actively using this prototype on a test cluster for HEP analyses. Specifically, it has been integral to our jet energy calibration analyses for CMS during run 2. The system has proven to be easily usable, while providing substantial performance improvements.
Since confirming the applicability for our use case, we have investigated the design in a more general way. Simulations show that many infrastructure setups can benefit from our approach. For example, it may enable us to dynamically provide data locality in opportunistic cloud resources. The experience we have gained from our prototype enables us to realistically assess the feasibility for general production use.

Fischer, M.

Giffels, M.

Jung, C.

Kuehn, E.

Journal of Physics Conference Series

English

KITopen

Data Locality via Coordinated Caching for Distributed Processing

M Fischer, E Kuehn, M Giffels, C Jung

Karlsruhe Institute of Technology, Steinbuch Centre for Computing, Hermann-von-Helmholtz-Platz 1, 76344 Eggenstein-Leopoldshafen, Germany

E-mail: {max.fischer, eileen.kuehn, manuel.giffels, christopher.jung}@kit.edu

2016 J. Phys.: Conf. Ser. 762 012011 (http://iopscience.iop.org/1742-6596/762/1/012011)

Abstract. To enable data locality, we have developed an approach of adding coordinated caches to existing compute clusters.
Since the data stored locally is volatile and selected dynamically, only a fraction of local storage space is required. Our approach allows the degree to which data locality is provided to be freely selected. It may be used to work in conjunction with large network bandwidths, providing only highly used data to reduce peak loads. Alternatively, local storage may be scaled up to perform data analysis even with low network bandwidth.

To prove the applicability of our approach, we have developed a prototype implementing all required functionality. It integrates seamlessly into batch systems, requiring practically no adjustments by users. We have now been actively using this prototype on a test cluster for HEP analyses. Specifically, it has been integral to our jet energy calibration analyses for CMS during run 2. The system has proven to be easily usable, while providing substantial performance improvements.

Since confirming the applicability for our use case, we have investigated the design in a more general way. Simulations show that many infrastructure setups can benefit from our approach. For example, it may enable us to dynamically provide data locality in opportunistic cloud resources. The experience we have gained from our prototype enables us to realistically assess the feasibility for general production use.

1. Introduction

End user data analysis tasks in HEP are commonly processed by hundreds of jobs on a batch cluster, reading data over the network from file servers. As we have shown in earlier work [1, 2], an analysis on a modern institute cluster easily saturates network capacity. As simulation jobs move to opportunistic resources [3], we expect saturation from analysis jobs to become more frequent. To enable efficient analyses in the future, we therefore investigated data locality as a means to eliminate dependency on network resources.

Data locality approaches reduce overall remote I/O by executing jobs as close to their input data as possible.
Ideally, the machine executing a job and the machine hosting its data are the same. Several frameworks such as Hadoop [4] already provide data locality based processing, and have proven the feasibility of this approach.

However, we have found such frameworks to be inadequate for end user analyses. For example, the extent of software modifications required would effectively eliminate portability to and from other infrastructure. Thus, we have developed an alternate approach to data locality that integrates into regular batch processing.

ACAT2016, IOP Publishing. Journal of Physics: Conference Series 762 (2016) 012011, doi:10.1088/1742-6596/762/1/012011

Content from this work may be used under the terms of the Creative Commons Attribution 3.0 licence. Any further distribution of this work must maintain attribution to the author(s) and the title of the work, journal citation and DOI. Published under licence by IOP Publishing Ltd.

Our approach uses coordinated caches to provide data locality for a fraction of data. This exploits the fact that, at any time, only a few sets of data contribute to overall throughput (see Figure 1). By eliminating remote accesses to them, network capacity remains available for less frequently used data. This is a similar strategy to the current mixture of simulation and analysis tasks.

We have implemented a prototypical middleware [5], targeting the HTCondor batch system [6]. This prototype is deployed on a portion of the local KIT CMS analysis groups' batch system. It has since been successfully used for the CMS Jet Energy Scale calibration analyses performed at KIT.

In this paper, we focus on discussing advantages and disadvantages of our approach and prototype. Section 2 details the features inherent to our approach in general. In Section 3, we discuss our current prototype implementation. Finally, Section 4 provides a short conclusion.

Figure 1: Read accesses from jobs to skim versions: Over time, users create new skims for their analyses. Only some of these are used frequently, however.
Often, it takes several intermediate iterations before a skim is replaced.

2. Coordinated Caching in Batch Systems

Individual features of our approach have been the subject of past research. Coordinated caching has been shown to be effective in distributed systems, e.g. web services [7]. Applicability of data locality frameworks for HEP has been investigated in abundance [8, 9]. Limited data locality via a middleware has been attempted using cache servers [10]. Other caches for batch processing provide applications used across several jobs [11, 12]. Our work is set apart mainly by the scope in terms of size and subjects of caching.

2.1. Scope and Granularity

Our approach is to have a single cache target the batch system as a data consumer. This sets us apart from coordinated caches that target data providers such as web services. The system itself can be compared to a scaled up operating system page cache. A page cache targets applications accessing blocks via read system calls to process files. Our cache targets workflows accessing files via jobs to process datasets. Under the hood, the system is composed of several caches, one on each worker node. These are joined together by a coordination service.

The biggest advantage is the scaled up decision layer, which selects files for caching. Even in a small cluster, using a few cores for managing the caches is negligible overhead. Likewise, storing file meta-data on the scale of MB is negligible compared to file sizes on the scale of GB. Since jobs operate on the scale of minutes to hours, the system does not need to respond any faster. This allows for sophisticated caching logic.

The biggest challenge originates from our cache volume actually being several distinct volumes. Coordinating these is not a technical challenge, but a scheduling problem. Since we want to avoid remote accesses, jobs and data must be closely aligned.

2.2. Data Selection and Hit Rates

Using coordinated caches for data locality adds another dimension to data handling.
It is not sufficient to have a file anywhere in the cache. Instead, it must be available on the host from which jobs are trying to access it. This is the key reason why coordination is required for distributed caching of unique input to be effective. If files and jobs were placed randomly, the chance of a file and its corresponding job being on the same worker node would be inversely proportional to the number of nodes. When a job requires several files, the expected fraction of files locally available converges to the inverse of the number of nodes. Even for small clusters, this makes the impact of uncoordinated caching negligible.

In most setups, network throughput is still substantial. To maximize overall throughput, it is best not to read everything from cache, but instead to read some data over the network. This can easily be achieved by caching only a portion of the data deemed relevant. There are two extremes to this: On one end, caches provide just enough data to not overburden the network. This maximizes cache volume, as only a fraction of each dataset must be provided. On the other end, caches provide as much data as possible, freeing the maximum of network resources. This is optimal if there are many workflows unsuitable for caching, as these can use the network fully.

2.3. User Workflow Integration

Being designed as a cache, our system relies on intercepting access requests. This contrasts with dedicated data locality solutions, which require explicit requests to the middleware.

As we treat the entire job as an access, requests are intercepted at different points. On the one hand, jobs are intercepted as meta-data in the batch system. This provides extensive information, e.g. estimated runtime. On the other hand, the file accesses of jobs executing on worker nodes are intercepted.
This implements the actual rerouting to local data.

Intercepting requests has the advantage of being transparent to end users. It does not make a technical difference to jobs whether our system is present. The only notable difference is an increase in performance if files are provided from our cache. This ensures optimal portability.

The downside of a transparent system is that it cannot directly interact with user workflows. For example, data locality frameworks actively set job input to match the distribution of datasets. Our system instead has to optimize data placement to match the splitting already used by jobs.

3. HTDA Prototype

Our approach is prototypically implemented as the High Throughput Data Analysis middleware [5]. At its core is a generic node application, which is deployed on worker nodes and service machines. The HTDA nodes implement all facilities to join together into a single pool. Each node runs one or several modules which implement the actual services:

• A Provider on each worker node, which adds, maintains and removes local copies of files.
• A Locator per submission node, which tracks the files available on worker nodes.
• A Coordinator per pool, which decides what files to cache and where to do so.

We have deployed our middleware on a portion of the local KIT CMS HTCondor cluster. The HTDA section is composed of 4 worker nodes (see Table 1) running Provider nodes. A total of 7 file servers mounted via NFS are used.

Table 1: Test Cluster Worker Node
OS       Scientific Linux 6 (Kernel 2.6.32) or CentOS 7 (Kernel 4.4.2)
CPU      2x Intel Xeon E5-2650v2 @ 2.66GHz (8 cores, 16 threads each)
Memory   8x 8GB RAM
SSD      1x Samsung SSD 840 PRO 512GB or 2x Samsung SSD 840 EVO 256GB
Network  1x Intel X540-T1 (10GigE/RJ45)

3.1. Middleware Performance

As it is a prototype, we have implemented the middleware in Python.
This is motivated by the need for rapid development and ease of maintenance. The prototype makes heavy use of abstraction, both between node and module as well as between the modules themselves. The implementation is mature enough for stable deployment and operation.

Experiences in terms of system requirements have been unexpectedly good. The only performance critical component, the Provider node, has negligible overhead (see Table 2). Its CPU consumption is linear in the frequency of validating files. For our tests, this was once every 5 minutes. Since files are rarely deleted by users before being discarded by our cache, this could be reduced by at least one order of magnitude.

Table 2: Module Resource Consumption, according to the ps utility
Module       CPU    RSS
Provider     3.5%   120MB
Locator      1.0%   60MB
Coordinator  14.1%  1GB

3.2. Request Interception

We have implemented the interception of requests via two means: Job meta-data is intercepted via hooks in the HTCondor job router daemon. Application access to data is intercepted on the worker node by a union file system.

Using hooks in the job router has several advantages over the common method of parsing the job queue. Most importantly, we do not have to repeatedly scan the queue for jobs. The selection and tracking is efficiently handled by HTCondor itself. The hooks are automatically called on job submission and removal, as well as regularly while the job is running. This allows all our services to be event driven.

Since the hooks connect to the pool, communication can easily be optimized. Our hooks are executed often, but skip several updates after updating a job successfully. This naturally leads to spreading out requests if our system cannot service them fast enough. Additionally, hooks can address any end-point of the pool for load balancing. Finally, we can limit how many jobs are connected to our system by HTCondor at any time. We can thus handle an arbitrary number of queued jobs.

The downside is that only one type of hook may be active per job.
Using hooks, it is not possible to track one job by multiple systems, e.g. our cache and an opportunistic resource provider. This would require creating an intermediate hook calling the services' hooks.

Intercepting read requests via a union file system has proven ideal for performance. We have used Another Union File System (AUFS) in all our setups. It performs the redirection to cache or storage inside the VFS layer of the kernel. Any overhead from this is too small to measure. It is worth noting that this technique is not production ready on Scientific Linux 6. The combination of its kernel and the available AUFS 2 may deadlock. However, we have since switched to CentOS 7 and AUFS 4, which works flawlessly.

3.3. Compatibility with Volatile Resources

The HTDA middleware is robust against nodes unexpectedly entering and leaving the pool. Any node may keep on functioning on its own. Provider nodes maintain existing files, allowing Locator and Coordinator nodes to work with their last known state for some time. This makes the system intrinsically suitable for deployment on opportunistic resources.

For better adaptability to such resources, the handling of file meta-data and ownership may be improved in the future. At the moment, we assume reading from remote caches has no benefit over reading from the original source. Thus files and their meta-data are owned by the Provider maintaining them. To allow shared caches accessed from multiple hosts, it would be better to have both owned by the cache device. An attached Provider would take ownership only temporarily. If the opportunistic worker node hosting it shuts down, the data may persist on the shared device and a new Provider may take ownership.

4. Conclusion

Data locality is an important approach for scalable data analysis.
To integrate data locality into HEP workflows, we have created a new approach to transparently enhance batch systems. This is based on a pool of coordinated caches, providing files used by jobs locally on worker nodes.

There are several advantages intrinsic to our approach, which make it suitable for use in HEP. Since we target the batch system as a consumer, our system only needs to provide frequently used data. An arbitrary number and volume of data servers may provide infrequently used data. The system is by design transparent to users and existing workflows. It can thus be added seamlessly to existing infrastructure without negative side effects.

We have implemented our system as a prototype and successfully used it for CMS run 2 analyses. Being the first of its kind, it has several features that may be improved or expanded in the future. These mainly concern the applicability to other setups, such as opportunistic resources or shared cache volumes. The system itself is mature enough for active use in dedicated batch systems.

Acknowledgments

The authors would like to thank all people and institutions involved in the project Large Scale Data Management and Analysis (LSDMA), as well as the German Helmholtz Association, and the Karlsruhe School of Elementary Particle and Astroparticle Physics (KSETA) for supporting and funding this work.

References

[1] Fischer M, Metzlaff C, Kühn E, Giffels M, Quast G, Jung C and Hauth T 2015 J. Phys.: Conf. Ser. 664 092008
[2] Fischer M, Giffels M, Jung C, Kühn E and Quast G 2015 J. Phys.: Conf. Ser. 608 012018
[3] Giffels M, Hauth T, Polgart F and Quast G 2015 J. Phys.: Conf. Ser. 664 022022
[4] The Apache Software Foundation 2015 Apache hadoop URL https://hadoop.apache.org
[5] Fischer M, Metzlaff C and Giffels M 2015 HPDA middleware repository URL https://bitbucket.org/kitcmscomputing/hpda
[6] Thain D, Tannenbaum T and Livny M 2005 Concurr. Comput.: Pract. Exper. 17 2–4
[7] Paul S and Fei Z 2001 Comput. Commun. 24 256–268
[8] Russo S A, Pinamonti M and Cobal M 2014 J. Phys.: Conf. Ser. 513 032080
[9] Lehrack S, Duckeck G and Ebke J 2014 J. Phys.: Conf. Ser. 513 032054
[10] Yang W, Hanushevsky A B, Mount R P and the Atlas Collaboration 2014 J. Phys.: Conf. Ser. 513 042035
[11] Blomer J and Fuhrmann T 2010 A fully decentralized file system cache for the cernvm-fs 2010 Proc. of 19th Int. Conf. Comput. Commun. and Netw. (ICCCN) pp 1–6
[12] Weitzel D, Bockelman B and Swanson D 2015 Distributed caching using the htcondor cached Proc. for Conf. Parallel and Distrib. Process. Techn. and Appl.


https://publikationen.bibliothek.kit.edu/1000063493/4281592
