13 research outputs found

    Computing in High Energy and Nuclear Physics (CHEP) 2012

    PROOF on Demand (PoD) is a tool-set that dynamically sets up a PROOF cluster at a user's request on any resource management system (RMS). It provides a plug-in based system for using different job submission front-ends; PoD currently ships with gLite, LSF, PBS (PBSPro/OpenPBS/Torque), Grid Engine (OGE/SGE), Condor, LoadLeveler, and SSH plug-ins. It makes it possible to get a private PROOF cluster on any RMS within just a few seconds. If there is no RMS, the SSH plug-in can be used, which dynamically turns a bunch of machines into PROOF workers. In this presentation new developments and use cases will be covered. Recently a new major step in PoD development has been made: PoD can now work not only with local PoD servers but also with remote ones. PoD's newly developed "pod-remote" command makes it possible for users to follow a thin-client concept. To create dynamic PROOF clusters, users can now select a remote computer, even one behind a firewall, control a PoD server on it, and submit PoD jobs. In this case the user interface machine is just a lightweight control center and can run on different OS types or mobile devices. All communications are secured and carried over SSH channels. Additionally, PoD automatically creates and maintains SSH tunnels for PROOF connections between the user interface and the PROOF master. PoD creates and manages remote and local PROOF clusters for you: just two PoD commands provide a fully functional PROOF cluster and real computing on demand. The talk will also include several live demos of real-life use cases.
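    Once the PoD server and workers are up, the last step is to point a ROOT session at the dynamically created cluster. The macro below is a minimal, hedged sketch of that step: the ROOT calls (TProof::Open, gSystem->GetFromPipe) are standard, but the assumption that "pod-info -c" prints the PROOF connection string of the running PoD server should be checked against pod-info --help for the installed PoD version.

```cpp
// connect_pod.C: hedged sketch of connecting ROOT/PROOF to a PoD-created cluster.
// Assumes PoD is on PATH and that "pod-info -c" prints the connection string
// (e.g. user@host:port); the flag name is an assumption, not taken from the text.
#include <TProof.h>
#include <TSystem.h>
#include <TString.h>
#include <TError.h>

void connect_pod()
{
    // Ask PoD for the connection string of the running PoD server.
    TString connection = gSystem->GetFromPipe("pod-info -c");
    if (connection.IsNull()) {
        Error("connect_pod", "no connection string returned; is pod-server running?");
        return;
    }
    // Open a PROOF session on the dynamically created cluster and show its status.
    TProof* proof = TProof::Open(connection);
    if (proof) {
        proof->Print();
    }
}
```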

    PROOF on Demand


    DDS: The Dynamic Deployment System

    The Dynamic Deployment System (DDS) [1, 2] is a tool-set that automates and significantly simplifies the deployment of user-defined processes and their dependencies on any resource management system (RMS) using a given topology. DDS is part of the ALFA framework [3]. DDS implements a command line tool-set and API following the single-responsibility principle. The system treats a user's task as a black box: it can be an executable or a script. It also provides watchdogging and rule-based execution of tasks. DDS implements a plug-in system to abstract the execution of the topology from the RMS. Additionally, it ships with SSH and localhost plug-ins, which can be used when no RMS is available. DDS doesn't require pre-installation or pre-configuration on the worker nodes; it deploys private facilities on demand with isolated sandboxes. The system provides a key-value property propagation engine, which can be used to configure tasks at runtime. DDS also provides a lightweight API for tasks to exchange messages, so-called custom commands. In this report a detailed description, the current status, and the future development plans of DDS will be highlighted.
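    The key-value propagation and custom-command messaging mentioned above are exposed to tasks through dds_intercom_lib. The snippet below is a hedged sketch based on the DDS intercom tutorials; the class and callback names (CIntercomService, CKeyValue, putValue, subscribe) follow those tutorials, and the exact signatures may differ between DDS versions.

```cpp
// Hedged sketch: a task publishing and receiving key-value properties via
// dds_intercom_lib. Header path and signatures follow the DDS tutorials and
// are assumptions; check the installed DDS headers.
#include "dds_intercom.h"

#include <cstdint>
#include <iostream>
#include <string>

int main()
{
    using namespace dds::intercom_api;

    CIntercomService service;
    CKeyValue keyValue(service);

    // React to property updates propagated from other tasks of the topology.
    keyValue.subscribe([](const std::string& propertyName,
                          const std::string& value,
                          uint64_t senderTaskID) {
        std::cout << propertyName << " = " << value
                  << " (from task " << senderTaskID << ")" << std::endl;
    });

    service.start();

    // Publish a property; DDS propagates it to the tasks that declare it.
    keyValue.putValue("TaskEndpoint", "tcp://localhost:5555");

    // A real task would keep running here for as long as intercom is needed.
    return 0;
}
```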

    Modular toolsets for integrating HPC clusters in experiment control systems

    New particle and nuclear physics experiments require a massive amount of computing power that can only be achieved by using high performance clusters directly connected to the data acquisition systems and integrated into the online systems of the experiments. Integrating an HPC cluster into the online system of an experiment, however, means managing and synchronizing thousands of processes that handle the huge throughput. In this work, modular components that can be used to build and integrate such an HPC cluster into an experiment control system (ECS) will be introduced. The Online Device Control library (ODC) [1], in combination with the Dynamic Deployment System (DDS) [2, 3] and the FairMQ [4] message queuing library, offers a sustainable solution for integrating HPC cluster controls into an ECS. DDS, as part of the ALFA framework [5], is a toolset that automates and significantly simplifies the dynamic deployment of user-defined processes and their dependencies on any resource management system (RMS) using a given process graph (topology). ODC, in this architecture, is the tool to control and communicate with a topology of FairMQ processes using DDS. ODC is designed to act as a broker between a high-level experiment control system and a low-level task management system such as DDS. In this work the architecture of both DDS and ODC will be discussed, as well as the design decisions taken based on the experience gained from using these tools in production by the ALICE experiment at CERN to deploy and control thousands of processes (tasks) on the Event Processing Nodes (EPN) cluster during Run 3, as part of the ALICE O2 software ecosystem [6].
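    The process graph that ODC asks DDS to deploy is an ordinary DDS topology. The snippet below is a hedged sketch of building such a topology programmatically with dds::topology_api::CTopoCreator (the class is also mentioned in the DDS release notes further down); the header path, the getMainGroup/addElement/setExe/save calls, and the executable name are assumptions based on the DDS documentation rather than details from this abstract.

```cpp
// Hedged sketch: create a small DDS topology in C++ and save it to XML.
// API names follow the DDS topology_api documentation; exact signatures may
// differ per DDS version, and the executable name below is hypothetical.
#include <dds/dds.h>   // umbrella header; install path is an assumption

#include <iostream>
#include <stdexcept>

int main()
{
    using namespace dds::topology_api;
    try {
        CTopoCreator creator;

        // One processing task added to the main group of the topology.
        auto processor = creator.getMainGroup()->addElement<CTopoTask>("Processor");
        processor->setExe("my-fairmq-processor");   // hypothetical executable

        // Persist the topology; DDS/ODC would then activate it on the cluster.
        creator.save("epn_topology.xml");
    } catch (const std::exception& e) {
        std::cerr << "Failed to create topology: " << e.what() << std::endl;
        return 1;
    }
    return 0;
}
```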

    FairRootGroup/ODC: 0.80.2

    - Add more details in the log on failed tasks/collections: host & working directory.

    FairRootGroup/ODC: 0.80.1

    - CustomCommands: Adapt to the changes in https://github.com/google/flatbuffers/releases/tag/v23.5.8 by dropping the (unused) JSON commands format.
    - Rename GrpcController -> GrpcServer.
    - Additional debug info for request timeouts.
    - gRPC controller: log request before lock to provide better feedback if the lock can't be acquired.

    The removals and name changes are not user-facing and thus non-breaking.

    FairRootGroup/DDS: 3.8

    Release Notes: v3.8 (2024-01-19)

    DDS general
    - Fixed: On task done, remove agents from the agent-to-tasks mapping.
    - Fixed: Replace std::iterator as it is deprecated (C++17).
    - Fixed: Tasks' working directory is set to their slot directory instead of DDS_LOCATION.
    - Fixed: Multiple stability issues.
    - Modified: Support the C++20 standard. (GH-477)
    - Modified: Bump minimum version requirements for cmake (from 3.11.0 to 3.19) and boost (from 1.67 to 1.75). (GH-428)
    - Modified: C++17 modernization of EnvProp.h/env_prop. (GH-368)
    - Added: Third-party dependency on Protobuf (min v3.15).
    - Added: Every DDS module now logs its pid, group id and parent pid. (GH-403)
    - Added: Support for Task Assets. (GH-406)
    - Added: Cancel running and pending SLURM jobs on DDS shutdown. (GH-429)
    - Added: Support for Apple's arm64 architecture. (GH-393)
    - Added: DDS_CONFIG and /etc/dds/DDS.cfg are added to the DDS config search paths. (GH-458)
    - Added: DDS libraries are now decorated with an ABI version. (GH-410)

    dds-agent
    - Fixed: Address a potential crash in the external process termination routines.
    - Fixed: Revised handling of the slots container.
    - Fixed: Ignore SIGTERM while performing cleaning procedures. (GH-459)

    dds_intercom_lib
    - Fixed: Stability improvements.
    - Modified: Temporarily increase the intercom message size to 2048. (GH-440)
    - Modified: Set debug log severity on custom command events. (GH-424)

    dds-session
    - Fixed: Skip bad or non-session directories/files when performing clean and list operations.
    - Added: Data retention sanitization. Sessions that are not running and are older than the specified number of days ("server.data_retention") are auto-deleted. (GH-435)

    dds-submit
    - Added: Users can specify a GroupName tag for each submission. This tag is assigned to agents and can be used as a requirement in topologies. (GH-407)
    - Added: Users can provide a Submission Tag (--submission-tag). DDS RMS plug-ins use this tag to name RMS jobs and directories. (GH-426)
    - Added: The command learned a new argument --env-config/-e. It can be used to define a custom environment script for each agent. (GH-430)
    - Added: The command learned a new argument --min-instances. It can be used to provide the minimum number of agents to spawn. (GH-434)
    - Added: The command learned a new argument --enable-overbooking. The flag instructs DDS RMS plug-ins not to specify any CPU requirement for RMS jobs. (GH-442)
    - Added: The command learned a new argument --inline-config. The content of this string is added to the RMS job configuration file as is. It can be specified multiple times to add multiline options. (GH-449)
    - Modified: The WN package builder timeout interval was increased from 15 to 30 seconds. (GH-468)
    - Modified: Improve validation of the WN package builder. (GH-468)

    dds-topology
    - Fixed: Stability improvements.
    - Fixed: A bug which caused dds::topology_api::CTopoCreator to ignore task assets. (GH-452)
    - Fixed: Activating a topology takes too long when task assets are used. (GH-454)
    - Fixed: A bug which could cause a segfault when updating variables in a topology.
    - Added: A new groupName requirement. It can be used on tasks and collections. (GH-407)
    - Added: An open API to read/update/add topology variables: the CTopoVars class.
    - Added: Support for Task Assets. (GH-406)
    - Added: Custom types of Task and Collection requirements. (GH-445)

    dds-ssh-plugin
    - Fixed: The SSH cfg parser was parsing cfg files of all plug-ins. (GH-413)
    - Added: Support for SubmissionID. (GH-411)

    dds-slurm-plugin
    - Fixed: Make sure that scancel's SIGTERM is properly handled by all job steps and their scripts. (GH-459)
    - Added: Support for SubmissionID. (GH-411)
    - Added: Support for a minimum number of agents to spawn. (GH-434)
    - Modified: Replace array job submission with a nodes requirement. (GH-430)
    - Modified: Remove "#SBATCH --ntasks-per-node=1". (GH-444)
    - Modified: The "#SBATCH --cpus-per-task=%DDS_NSLOTS%" requirement can now be disabled by providing the "enable-overbooking" flag (Tools API or dds-submit). (GH-442)
    - Modified: Prevent job termination when downing a single node of the job allocation. (GH-450)

    dds-localhost-plugin
    - Added: Support for SubmissionID. (GH-411)

    dds-tools-api
    - Modified: Logs of user processes which use the Tools API are now written to the DDS root log directory instead of the sessions directory.
    - Modified: CSession::waitForNumAgents is renamed to CSession::waitForNumSlots. (GH-439)
    - Added: The ability to unsubscribe from either individual events or all events of requests. (GH-382)
    - Added: SAgentInfoResponseData provides the agent group name. (GH-415)
    - Added: SSubmitRequestData supports flags. See SSubmitRequestData::setFlag and SSubmitRequestData::ESubmitRequestFlags. (GH-442)
    - Added: Users can define additional job RMS configuration via SSubmitRequestData::m_inlineConfig. It is inlined as is into the final job script. (GH-449)

    dds-user-defaults
    - Fixed: A dangling reference to a temporary in the User Defaults class.
    - Modified: Bump the version to 0.5.
    - Added: A server.data_retention configuration key. (GH-435)

    dds-info
    - Fixed: Wrong exit code when called with --help/--version. (GH-470)

    dds-agent-cmd
    - Modified: getlog: logs are now tar'ed without their source directory structure, as a flat stack of files. (GH-369)
    - Modified: getlog: the command now outputs the destination directory where downloaded archives are stored. Also fixed the command's description. (GH-369)
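    The dds-tools-api additions above, in particular SSubmitRequestData::m_inlineConfig, appear in user code roughly as follows. This is a hedged sketch based on the DDS Tools API tutorials: the header path, the m_rms/m_instances/m_slots fields and the request/callback pattern are taken from those tutorials and are assumptions, and since the exact flag enumerator names are not spelled out in the notes, no submit flag is set here.

```cpp
// Hedged sketch: submit DDS agents via the Tools API and inline extra RMS
// configuration (GH-449). Names not shown in the release notes (header path,
// m_rms, m_instances, m_slots, makeRequest, sendRequest, blockCurrentThread,
// unblockCurrentThread) follow the DDS tutorials and are assumptions.
#include <dds/dds.h>

#include <iostream>

int main()
{
    using namespace dds::tools_api;

    CSession session;
    session.create();                        // start a new DDS session (commander)

    SSubmitRequestData submitInfo;
    submitInfo.m_rms = "slurm";              // RMS plug-in to use
    submitInfo.m_instances = 4;              // number of agents to spawn
    submitInfo.m_slots = 32;                 // task slots per agent
    // Inlined as is into the generated RMS job script (GH-449):
    submitInfo.m_inlineConfig = "#SBATCH --time=01:00:00";

    auto request = SSubmitRequest::makeRequest(submitInfo);
    request->setMessageCallback([](const SMessageResponseData& msg) {
        std::cout << msg.m_msg << std::endl;
    });
    request->setDoneCallback([&session]() {
        std::cout << "Submission finished" << std::endl;
        session.unblockCurrentThread();      // assumption: releases the wait below
    });

    session.sendRequest<SSubmitRequest>(request);
    session.blockCurrentThread();            // wait for the done callback
    return 0;
}
```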

    ALFA: A framework for building distributed applications

    The ALFA framework is a joint development between the ALICE Online-Offline and FairRoot teams. ALFA has a distributed architecture, i.e. a collection of highly maintainable, testable, loosely coupled, independently deployable processes. ALFA allows the developer to focus on building single-function modules with well-defined interfaces and operations. The communication between the independent processes is handled by the FairMQ transport layer. FairMQ offers multiple implementations of its abstract data transport interface; it integrates popular data transport technologies like ZeroMQ and nanomsg, and it also provides shared memory and RDMA transports (based on libfabric) for high-throughput, low-latency applications. Moreover, FairMQ allows a single process to use multiple, different transports at the same time. FairMQ-based processes can be controlled and orchestrated via different systems by implementing the corresponding plugin. In addition, ALFA delivers the Dynamic Deployment System (DDS) as an independent set of utilities and interfaces, providing dynamic distribution of different user processes on any Resource Management System (RMS) or on a laptop. ALFA is already being tested and used by different experiments at different stages of data processing, as it offers easy integration of heterogeneous hardware and software.
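    To illustrate what a FairMQ-based process looks like, here is a minimal, hedged sketch of a device written against the FairMQ device API, following the patterns of the FairMQ examples. The channel name "data" is an assumption; channels and transports are normally supplied at launch time through the channel configuration, which is what lets DDS or another control system orchestrate such processes.

```cpp
// Hedged sketch of a minimal FairMQ device; header paths and helper names
// follow the FairMQ examples and may vary between FairMQ versions.
#include <fairmq/Device.h>
#include <fairmq/runDevice.h>   // provides main(); expects getDevice/addCustomOptions

#include <boost/program_options.hpp>
#include <memory>
#include <string>

namespace bpo = boost::program_options;

struct Sampler : fair::mq::Device
{
    // Called repeatedly while the device is in the Running state.
    bool ConditionalRun() override
    {
        std::string text("hello");
        fair::mq::MessagePtr msg(NewSimpleMessage(text));
        // "data" is an assumed channel name; it must exist in the channel
        // configuration passed at launch (e.g. by a DDS/ODC deployment).
        if (Send(msg, "data") < 0) {
            return false;   // leave the Running state on send failure
        }
        return true;
    }
};

void addCustomOptions(bpo::options_description& /*options*/)
{
    // No custom command line options in this sketch.
}

std::unique_ptr<fair::mq::Device> getDevice(fair::mq::ProgOptions& /*config*/)
{
    return std::make_unique<Sampler>();
}
```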
