High-Throughput Computing on High-Performance Platforms: A Case Study
The computing systems used by LHC experiments have historically consisted of
the federation of hundreds to thousands of distributed resources, ranging from
small to mid-size. In spite of the impressive scale of the existing
distributed computing solutions, the federation of small to mid-size resources
will be insufficient to meet projected future demands. This paper is a case
study of how the ATLAS experiment has embraced Titan, a DOE leadership
facility, in conjunction with traditional distributed high-throughput computing
to reach sustained production scales of approximately 52M core-hours a year.
The three main contributions of this paper are: (i) a critical evaluation of
design and operational considerations to support the sustained, scalable and
production usage of Titan; (ii) a preliminary characterization of a
next-generation executor for PanDA to support new workloads and advanced
execution modes; and (iii) early lessons for how current and future
experimental and observational systems can be integrated with production
supercomputers and other platforms in a general and extensible manner.
Condor services for the Global Grid: interoperability between Condor and OGSA
In order for existing grid middleware to remain viable, it is important to investigate its potential for integration with emerging grid standards and architectural schemes. The Open Grid Services Architecture (OGSA), developed by the Globus Alliance and based on standard XML-based web services technology, was the first attempt to identify the architectural components required to migrate towards standardized global grid service delivery. This paper presents an investigation into the integration of Condor, a widely adopted and sophisticated high-throughput computing software package, and OGSA, with the aim of bringing Condor in line with advances in Grid computing and providing the Grid community with a mature suite of high-throughput computing job and resource management services. This report identifies mappings between elements of the OGSA and Condor infrastructures and potential areas of conflict, and defines a set of complementary architectural options by which individual Condor services can be exposed as OGSA Grid services, in order to achieve a seamless integration of Condor resources in a standardized grid environment.
Initial explorations of ARM processors for scientific computing
Power efficiency is becoming an ever more important metric for both
high-performance and high-throughput computing. Over the course of the next decade it is
expected that flops/watt will be a major driver for the evolution of computer
architecture. Servers with large numbers of ARM processors, already ubiquitous
in mobile computing, are a promising alternative to traditional x86-64
computing. We present the results of our initial investigations into the use of
ARM processors for scientific computing applications. In particular we report
the results from our work with a current generation ARMv7 development board to
explore ARM-specific issues regarding the software development environment,
operating system, performance benchmarks and issues for porting High Energy
Physics software.
Comment: Submitted to the proceedings of the 15th International Workshop on
Advanced Computing and Analysis Techniques in Physics Research (ACAT2013),
Beijing.
Timely-Throughput Optimal Coded Computing over Cloud Networks
In modern distributed computing systems, unpredictable and unreliable
infrastructures result in high variability of computing resources. Meanwhile,
there is significantly increasing demand for timely and event-driven services
with deadline constraints. Motivated by measurements over Amazon EC2 clusters,
we consider a two-state Markov model for variability of computing speed in
cloud networks. In this model, each worker can be either in a good state or a
bad state in terms of the computation speed, and the transition between these
states is modeled as a Markov chain which is unknown to the scheduler. We then
consider a Coded Computing framework, in which the data is possibly encoded and
stored at the worker nodes in order to provide robustness against nodes that
may be in a bad state. With timely computation requests submitted to the system
with computation deadlines, our goal is to design the optimal computation-load
allocation scheme and the optimal data encoding scheme that maximize the timely
computation throughput (i.e., the average number of computation tasks that are
accomplished before their deadline). Our main result is the development of a
dynamic computation strategy called Lagrange Estimate-and-Allocate (LEA)
strategy, which achieves the optimal timely computation throughput. It is shown
that compared to the static allocation strategy, LEA increases the timely
computation throughput by 1.4X - 17.5X in various scenarios via simulations and
by 1.27X - 6.5X in experiments over Amazon EC2 clusters.
Comment: to appear in MobiHoc 201
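
A minimal illustrative sketch (Python) of the two-state Markov worker model and a naive estimate-then-allocate loop in the spirit of the abstract above. The transition probabilities, speeds, recovery threshold and proportional allocation rule below are assumptions for illustration only; this is not the paper's LEA strategy or its Lagrange coding scheme.

import random

# Hypothetical parameters, not taken from the paper.
P_GOOD_TO_BAD = 0.1                 # assumed transition probability
P_BAD_TO_GOOD = 0.3                 # assumed transition probability
SPEED = {"good": 4.0, "bad": 1.0}   # assumed work units processed per time unit

def step(state):
    """Advance one worker's state by one Markov transition."""
    if state == "good":
        return "bad" if random.random() < P_GOOD_TO_BAD else "good"
    return "good" if random.random() < P_BAD_TO_GOOD else "bad"

def simulate(num_workers=10, recovery_threshold=7, num_requests=1000,
             deadline=5.0, total_load=60.0):
    states = ["good"] * num_workers
    good_counts = [1] * num_workers   # crude prior: one "good" observation each
    seen = [2] * num_workers
    met = 0
    for _ in range(num_requests):
        states = [step(s) for s in states]
        for i, s in enumerate(states):
            seen[i] += 1
            good_counts[i] += (s == "good")
        # Estimate each worker's expected speed from its observed history, then
        # split the request's load proportionally (a naive stand-in for an
        # estimate-and-allocate policy).
        est = [good_counts[i] / seen[i] * SPEED["good"]
               + (1 - good_counts[i] / seen[i]) * SPEED["bad"]
               for i in range(num_workers)]
        total_est = sum(est)
        alloc = [total_load * e / total_est for e in est]
        # With redundantly coded data, a request counts as timely once any
        # `recovery_threshold` workers finish their share before the deadline.
        finished = sum(alloc[i] / SPEED[states[i]] <= deadline
                       for i in range(num_workers))
        met += finished >= recovery_threshold
    return met / num_requests   # empirical fraction of requests met on time

if __name__ == "__main__":
    random.seed(0)
    print(f"fraction of requests met on time: {simulate():.2f}")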
