3D analytical modelling and iterative solution for high performance computing clusters
Mobile Cloud Computing enables the migration of services to the edge of the Internet, and high-performance computing clusters are widely deployed to improve the computational capabilities of such environments. However, these clusters are prone to failures and need analytical models to predict their behaviour in order to deliver the desired quality-of-service and quality-of-experience to mobile users. This paper proposes a 3D analytical model and a problem-solving approach for the sustainability evaluation of high-performance computing clusters. The proposed solution uses an iterative approach to obtain performance measures and overcome the state space explosion problem. The availability modelling and evaluation of master and computing nodes are performed using a multi-repairman approach. The optimum number of repairmen is also obtained to get realistic results and reduce the overall cost. The proposed model is validated using discrete event simulation; the analytical approach is much faster and in good agreement with the simulations. The analysis focuses on mean queue length, throughput, and mean response time. The maximum differences between analytical and simulation results in the considered scenarios of up to a billion states are less than 1.149%, 3.82%, and 3.76%, respectively. These differences are well within the 5% confidence interval of the simulation and the proposed model.
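The three performance measures named in the abstract (mean queue length, throughput, mean response time) are standard queueing-theoretic quantities. As a minimal illustration only, assuming a plain M/M/c queue rather than the paper's 3D failure/repair model, they can be computed with the Erlang C formula and Little's law:

```python
from math import factorial

def mmc_metrics(lam, mu, c):
    """Mean queue length, throughput, and mean response time of an M/M/c queue.
    Illustrative sketch only: the paper's 3D model additionally captures
    failures, repairs, and multiple repairmen, which this does not."""
    rho = lam / (c * mu)          # server utilisation
    assert rho < 1, "queue must be stable"
    a = lam / mu                  # offered load in Erlangs
    # Erlang C: probability that an arriving job has to wait
    p0 = 1.0 / (sum(a**k / factorial(k) for k in range(c))
                + a**c / (factorial(c) * (1 - rho)))
    erlang_c = (a**c / (factorial(c) * (1 - rho))) * p0
    lq = erlang_c * rho / (1 - rho)   # mean queue length (jobs waiting)
    wq = lq / lam                     # mean waiting time (Little's law)
    w = wq + 1.0 / mu                 # mean response time (wait + service)
    throughput = lam                  # stable queue: all arrivals are served
    return lq, throughput, w
```

For c = 1 this reduces to the familiar M/M/1 results, e.g. a mean response time of 1/(mu - lam).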
Architecture of Job scheduling simulator for demand response based resource provisioning
We study a new service model based on Demand Response (DR) resource provisioning at High Performance Computing (HPC) centers. This DR-based resource provisioning model allows administrators of HPC centers to provide computing services with incentives that compensate users for the performance loss due to power-saving operations. In a power conservation mode, a job’s performance may degrade, both in terms of waiting time and execution time. With DR-based resource provisioning, submitted jobs are divided into two categories, allowed jobs and disallowed jobs, depending on the user’s tolerance for performance degradation. The allowed jobs, if indeed affected by the power-saving operations, receive compensation in accordance with an incentive system that determines the reward to the user. To design an appropriate demand response model, we need to focus on the increase in a job’s execution time and waiting time, and the corresponding decrease in power consumption; these are the key factors in deriving an incentive system. Currently, no existing approaches can reliably quantify the effectiveness and contribution of these factors in HPC job scheduling and resource provisioning. In this paper, we propose a newly developed job scheduling simulator that can evaluate the DR-based resource provisioning approach under various operating conditions. We designed and implemented the job scheduling simulator for HPC demand-response resource provisioning using a general-purpose discrete-event simulator. Our experiments show that the job scheduling simulator can properly represent demand-response resource provisioning under different job scheduling scenarios.
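The core mechanics described above, allowed vs. disallowed jobs, a power-saving slowdown, and compensation proportional to the induced delay, can be sketched in a few lines. This is a toy single-node FCFS model, not the paper's simulator; the `SLOWDOWN` and `REWARD_RATE` constants are hypothetical parameters:

```python
SLOWDOWN = 1.5      # assumed execution-time stretch under power saving
REWARD_RATE = 0.1   # assumed incentive per second of extra runtime (hypothetical)

def simulate(jobs, power_saving=True):
    """jobs: iterable of (submit_time, runtime, allowed) tuples.
    Single-server FCFS toy model: allowed jobs may be slowed down under
    power saving and are compensated for the extra runtime; disallowed
    jobs always run at full speed. Returns (finish_time, compensation)
    per job in dispatch order."""
    free_at = 0.0
    results = []
    for submit, runtime, allowed in sorted(jobs):
        stretch = SLOWDOWN if (power_saving and allowed) else 1.0
        start = max(free_at, submit)
        finish = start + runtime * stretch
        extra = runtime * (stretch - 1.0)     # delay caused by power saving
        comp = REWARD_RATE * extra if allowed else 0.0
        results.append((finish, comp))
        free_at = finish
    return results
```

A real DR simulator would add backfilling, multiple nodes, and a measured power model; the point here is only the incentive bookkeeping tied to the two job categories.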
Optimised access to user analysis data using the gLite DPM
The ScotGrid distributed Tier-2 now provides more than 4 MSI2K and 500 TB for LHC computing, spread across three sites at Durham, Edinburgh and Glasgow. Tier-2 sites have a dual role to play in the computing models of the LHC VOs. Firstly, their CPU resources are used for the generation of Monte Carlo event data. Secondly, end-user analysis data is distributed across the grid to the sites' storage systems and held on disk, ready for processing by physicists' analysis jobs. In this paper we show how we have designed the ScotGrid storage and data management resources in order to optimise access by physicists to LHC data. Within ScotGrid, all sites use the gLite DPM storage manager middleware. Using the EGEE grid to submit real ATLAS analysis code to process VO data stored on the ScotGrid sites, we present an analysis of the performance of the architecture at one site, and procedures that may be undertaken to improve it. The results are presented from the point of view of the end user (in terms of number of events processed per second) and from the point of view of the site, which wishes to minimise load and the impact that analysis activity has on other users of the system.
CMS Monte Carlo production in the WLCG computing Grid
Monte Carlo production in CMS has received a major boost in performance and scale since the past CHEP06 conference. The production system has been re-engineered in order to incorporate the experience gained in running the previous system and to integrate production with the new CMS event data model, data management system and data processing framework. The system is interfaced to the two major computing Grids used by CMS, the LHC Computing Grid (LCG) and the Open Science Grid (OSG). Operational experience and integration aspects of the new CMS Monte Carlo production system are presented together with an analysis of production statistics. The new system automatically handles job submission, resource monitoring, job queuing, job distribution according to the available resources, data merging, and registration of data into the data bookkeeping, data location, data transfer and placement systems. Compared to the previous production system, automation, reliability and performance have been considerably improved. A more efficient use of computing resources and a better handling of the inherent Grid unreliability have resulted in an increase of production scale by about an order of magnitude, capable of running on the order of ten thousand jobs in parallel and yielding more than two million events per day.
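The abstract describes an automated pipeline carrying each job from submission through merging and registration, with resubmission absorbing Grid unreliability. One common way to make such automation robust is an explicit job-lifecycle state machine; the sketch below is illustrative, with hypothetical stage names that are not CMS's actual internal states:

```python
from enum import Enum, auto

class Stage(Enum):
    """Simplified job lifecycle (illustrative; not the real CMS state set)."""
    SUBMITTED = auto()
    QUEUED = auto()
    RUNNING = auto()
    MERGING = auto()     # output merging before catalogue registration
    REGISTERED = auto()  # terminal: data registered in bookkeeping/placement
    FAILED = auto()

# Allowed transitions; FAILED -> QUEUED models automatic resubmission,
# which is how the system copes with inherent Grid unreliability.
TRANSITIONS = {
    Stage.SUBMITTED:  {Stage.QUEUED, Stage.FAILED},
    Stage.QUEUED:     {Stage.RUNNING, Stage.FAILED},
    Stage.RUNNING:    {Stage.MERGING, Stage.FAILED},
    Stage.MERGING:    {Stage.REGISTERED, Stage.FAILED},
    Stage.FAILED:     {Stage.QUEUED},
    Stage.REGISTERED: set(),
}

def advance(state, new_state):
    """Move a job to new_state, rejecting transitions the pipeline forbids."""
    if new_state not in TRANSITIONS[state]:
        raise ValueError(f"illegal transition {state} -> {new_state}")
    return new_state
```

Constraining transitions this way lets the automation retry failed jobs without ever, say, registering data for a job that never finished merging.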
Computing infrastructure issues in distributed communications systems : a survey of operating system transport system architectures
The performance of distributed applications (such as file transfer, remote login, tele-conferencing, full-motion video, and scientific visualization) is influenced by several factors that interact in complex ways. In particular, application performance is significantly affected both by communication infrastructure factors and by computing infrastructure factors. Communication infrastructure factors include channel speed, bit-error rate, and congestion at intermediate switching nodes. Computing infrastructure factors include (among other things) both protocol processing activities (such as connection management, flow control, error detection, and retransmission) and general operating system factors (such as memory latency, CPU speed, interrupt and context switching overhead, process architecture, and message buffering). Due to increases of several orders of magnitude in network channel speeds and growing application diversity, performance bottlenecks are shifting from the network factors to the transport system factors. This paper defines an abstraction called an "Operating System Transport System Architecture" (OSTSA) that is used to classify the major components and services in the computing infrastructure. End-to-end network protocols such as TCP, TP4, VMTP, XTP, and Delta-t typically run on general-purpose computers, where they utilize various operating system resources such as processors, virtual memory, and network controllers. The OSTSA provides services that integrate these resources to support distributed applications running on local and wide area networks. A taxonomy is presented to evaluate OSTSAs in terms of their support for protocol processing activities. We use this taxonomy to compare and contrast five general-purpose commercial and experimental operating systems: System V UNIX, BSD UNIX, the x-kernel, Choices, and Xinu.
Mobile Computing in Physics Analysis - An Indicator for eScience
This paper presents the design and implementation of a Grid-enabled physics
analysis environment for handheld and other resource-limited computing devices
as one example of the use of mobile devices in eScience. Handheld devices offer
great potential because they provide ubiquitous access to data and
round-the-clock connectivity over wireless links. Our solution aims to provide
users of handheld devices the capability to launch heavy computational tasks on
computational and data Grids, monitor the status of their jobs during execution, and
retrieve results after job completion. Users carry their jobs on their handheld
devices in the form of executables (and associated libraries). Users can
transparently view the status of their jobs and get back their outputs without
having to know where they are being executed. In this way, our system is able
to act as a high-throughput computing environment where devices ranging from
powerful desktop machines to small handhelds can employ the power of the Grid.
The results shown in this paper are readily applicable to the wider eScience community.
Comment: 8 pages, 7 figures. Presented at the 3rd Int. Conf. on Mobile Computing & Ubiquitous Networking (ICMU06), London, October 200
A parallel grid-based implementation for real time processing of event log data in collaborative applications
Collaborative applications usually register user interaction in the form of semi-structured plain-text event log data. Extracting and structuring this data is a prerequisite for later key processes such as the analysis of interactions, assessment of group activity, or the provision of awareness and feedback. Yet, in real situations of online collaborative activity, the processing of log data is usually done offline, since structuring event log data is, in general, a computationally costly process and the amount of log data tends to be very large. Techniques to speed and scale up the structuring and processing of log data with minimal impact on the performance of the collaborative application are thus needed to process log data in real time. In this paper, we present a parallel grid-based implementation for processing in real time the event log data generated in collaborative applications. Our results show the feasibility of using grid middleware to speed and scale up the process of structuring and processing semi-structured event log data. The Grid prototype follows the Master-Worker (MW) paradigm. It is implemented using the Globus Toolkit (GT) and is tested on the PlanetLab platform.
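The Master-Worker structuring step described above, a master splitting semi-structured log lines among workers that each parse their share, can be sketched locally. The log-line regex is an assumed format for illustration, and a thread pool stands in for the Globus-managed grid workers of the actual prototype:

```python
import re
from multiprocessing.pool import ThreadPool

# Assumed log-line shape "<timestamp> <user> <action>"; real collaborative
# applications each have their own semi-structured event format.
EVENT_RE = re.compile(r"(?P<ts>\d+) (?P<user>\w+) (?P<action>\w+)")

def structure(line):
    """Worker task: parse one log line into a structured record, or None."""
    m = EVENT_RE.match(line)
    return m.groupdict() if m else None

def master(lines, workers=4):
    """Master: distribute lines to workers and collect structured events.
    A thread pool stands in for grid workers here; the MW prototype in the
    paper dispatches these tasks via Globus Toolkit middleware instead."""
    with ThreadPool(workers) as pool:
        return [event for event in pool.map(structure, lines) if event]
```

Because each line is parsed independently, the task is embarrassingly parallel, which is what makes the MW decomposition over grid nodes effective for large logs.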