36,949 research outputs found
Inter-organizational fault management: Functional and organizational core aspects of management architectures
Outsourcing -- successful, and sometimes painful -- has become one of the
hottest topics in IT service management discussions over the past decade. IT
services are outsourced to external service provider in order to reduce the
effort required for and overhead of delivering these services within the own
organization. More recently also IT services providers themselves started to
either outsource service parts or to deliver those services in a
non-hierarchical cooperation with other providers. Splitting a service into
several service parts is a non-trivial task as they have to be implemented,
operated, and maintained by different providers. One key aspect of such
inter-organizational cooperation is fault management, because it is crucial to
locate and solve problems, which reduce the quality of service, quickly and
reliably. In this article we present the results of a thorough use case based
requirements analysis for an architecture for inter-organizational fault
management (ioFMA). Furthermore, a concept of the organizational respective
functional model of the ioFMA is given.Comment: International Journal of Computer Networks & Communications (IJCNC
Hyperswitch communication network
The Hyperswitch Communication Network (HCN) is a large scale parallel computer prototype being developed at JPL. Commercial versions of the HCN computer are planned. The HCN computer being designed is a message passing multiple instruction multiple data (MIMD) computer, and offers many advantages in price-performance ratio, reliability and availability, and manufacturing over traditional uniprocessors and bus based multiprocessors. The design of the HCN operating system is a uniquely flexible environment that combines both parallel processing and distributed processing. This programming paradigm can achieve a balance among the following competing factors: performance in processing and communications, user friendliness, and fault tolerance. The prototype is being designed to accommodate a maximum of 64 state of the art microprocessors. The HCN is classified as a distributed supercomputer. The HCN system is described, and the performance/cost analysis and other competing factors within the system design are reviewed
A Pattern Language for High-Performance Computing Resilience
High-performance computing systems (HPC) provide powerful capabilities for
modeling, simulation, and data analytics for a broad class of computational
problems. They enable extreme performance of the order of quadrillion
floating-point arithmetic calculations per second by aggregating the power of
millions of compute, memory, networking and storage components. With the
rapidly growing scale and complexity of HPC systems for achieving even greater
performance, ensuring their reliable operation in the face of system
degradations and failures is a critical challenge. System fault events often
lead the scientific applications to produce incorrect results, or may even
cause their untimely termination. The sheer number of components in modern
extreme-scale HPC systems and the complex interactions and dependencies among
the hardware and software components, the applications, and the physical
environment makes the design of practical solutions that support fault
resilience a complex undertaking. To manage this complexity, we developed a
methodology for designing HPC resilience solutions using design patterns. We
codified the well-known techniques for handling faults, errors and failures
that have been devised, applied and improved upon over the past three decades
in the form of design patterns. In this paper, we present a pattern language to
enable a structured approach to the development of HPC resilience solutions.
The pattern language reveals the relations among the resilience patterns and
provides the means to explore alternative techniques for handling a specific
fault model that may have different efficiency and complexity characteristics.
Using the pattern language enables the design and implementation of
comprehensive resilience solutions as a set of interconnected resilience
patterns that can be instantiated across layers of the system stack.Comment: Proceedings of the 22nd European Conference on Pattern Languages of
Program
- …