
    Issues in petabyte data indexing, retrieval and analysis

    We propose several methods for speeding up the processing of particle physics data on clusters of PCs. We present a new way of indexing and retrieving data in a high-dimensional space by making use of two levels of catalogues, enabling an efficient data preselection. We propose several scheduling policies for parallelizing data-intensive particle physics applications on clusters of PCs. We show that making use of intra-job parallelization, caching data on the cluster node disks and reordering incoming jobs drastically improves the performance of a simple batch-oriented scheduling policy. In addition, we propose the concepts of delayed scheduling and adaptive delayed scheduling, where the deliberate inclusion of a delay improves the disk cache access rate and enables a better utilisation of the cluster. We build theoretical models for the different scheduling policies and present a detailed comparison between the theoretical models and the results of the cluster simulations. We study the improvements obtained by pipelining data I/O operations and data processing operations, with respect to both tertiary storage I/O and disk I/O; pipelining improves performance by approximately 30%. Using the parallelization framework developed at EPFL, we describe a possible implementation of the proposed access policies within the context of the LHCb experiment at CERN. A first prototype is implemented, into which the proposed scheduling policies can easily be plugged.
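    The delayed-scheduling idea lends itself to a short illustration. The sketch below (in Python, with hypothetical Job and Node types and a simple cache-affinity rule) accumulates jobs over a delay period, splits them into subjobs and places each subjob on the node whose disk cache already holds most of its events. It is a minimal reading of the policy described in the abstract, not the EPFL framework itself.

        # Minimal sketch of the delayed-scheduling idea (illustrative only).
        # Job, Node and the cache-affinity rule are hypothetical simplifications.

        from dataclasses import dataclass, field

        @dataclass
        class Job:
            job_id: int
            event_ids: list            # collision events the job must process

        @dataclass
        class Node:
            node_id: int
            cached_events: set = field(default_factory=set)   # events in the node's disk cache

        def split_into_subjobs(job, subjob_size):
            # Divide a job into subjobs of at most subjob_size events.
            return [Job(job.job_id, job.event_ids[i:i + subjob_size])
                    for i in range(0, len(job.event_ids), subjob_size)]

        def delayed_schedule(accumulated_jobs, nodes, subjob_size):
            # Jobs accumulated during one delay period are split and placed on the
            # node whose disk cache already holds the largest share of their events,
            # raising the cache hit rate at the cost of the added delay.
            assignments = []
            for job in accumulated_jobs:
                for subjob in split_into_subjobs(job, subjob_size):
                    wanted = set(subjob.event_ids)
                    best = max(nodes, key=lambda n: len(n.cached_events & wanted))
                    best.cached_events |= wanted   # events are cached once processed
                    assignments.append((best.node_id, subjob))
            return assignments

    A call such as delayed_schedule(jobs, nodes, subjob_size=250) at the end of each delay period would return (node, subjob) pairs; the real policies additionally model tertiary storage I/O and queue reordering, which this sketch omits.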


    Parallelization and scheduling of data intensive particle physics analysis jobs on clusters of PCs

    Summary form only given. Scheduling policies are proposed for parallelizing data-intensive particle physics analysis applications on computer clusters. Particle physics analysis jobs require the analysis of tens of thousands of particle collision events, each event typically requiring 200 ms of processing time and 600 KB of data. Many jobs are launched concurrently by a large number of physicists. At first view, particle physics jobs seem easy to parallelize, since particle collision events can be processed independently of one another. However, since large amounts of data need to be accessed, the real challenge resides in making efficient use of the underlying computing resources. We propose several job parallelization and scheduling policies aimed at reducing job processing times and at increasing the sustainable load of a cluster server. Since particle collision events are usually reused by several jobs, cache-based job splitting strategies considerably increase cluster utilization and reduce job processing times. Compared with straightforward job scheduling on a processing farm, cache-based first-in first-out job splitting speeds up average response times by an order of magnitude and reduces job waiting times in the system's queues from hours to minutes. By scheduling the jobs out of order, according to the availability of their collision events in the node disk caches, response times are further reduced, especially at high loads. In the delayed scheduling policy, job requests are accumulated during a time period, divided into subjob requests according to a parameterizable subjob size, and scheduled at the beginning of the next time period according to the availability of their data segments within the node disk caches. Delayed scheduling sustains a load close to the maximal theoretically sustainable load of a cluster, but at the cost of longer average response times. Finally, we propose an adaptive delayed scheduling approach, where the scheduling delay is adapted to the current load. This last scheduling approach sustains very high loads and offers low response times at normal loads.
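    As a rough illustration of the adaptive delayed scheduling idea, the sketch below adjusts the scheduling delay with the current load. The load metric, the linear scaling and the 60-second ceiling are assumptions chosen for illustration, not parameters taken from the paper.

        # Illustrative sketch of adaptive delayed scheduling: the scheduling delay
        # grows with the current load. Metric and constants are assumptions.

        def adaptive_delay(queued_subjobs, cluster_capacity, max_delay_s=60.0):
            # At light load the delay stays near zero, keeping response times low;
            # as the queue approaches the cluster's capacity the delay grows towards
            # max_delay_s, giving the scheduler more requests to group by cached data.
            load = min(queued_subjobs / float(cluster_capacity), 1.0)
            return load * max_delay_s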