107 research outputs found
Advanced information processing system: Local system services
The Advanced Information Processing System (AIPS) is a multi-computer architecture composed of hardware and software building blocks that can be configured to meet a broad range of application requirements. The hardware building blocks are fault-tolerant, general-purpose computers, fault-and damage-tolerant networks (both computer and input/output), and interfaces between the networks and the computers. The software building blocks are the major software functions: local system services, input/output, system services, inter-computer system services, and the system manager. The foundation of the local system services is an operating system with the functions required for a traditional real-time multi-tasking computer, such as task scheduling, inter-task communication, memory management, interrupt handling, and time maintenance. Resting on this foundation are the redundancy management functions necessary in a redundant computer and the status reporting functions required for an operator interface. The functional requirements, functional design and detailed specifications for all the local system services are documented
Topology-Aware and Dependence-Aware Scheduling and Memory Allocation for Task-Parallel Languages
International audienceWe present a joint scheduling and memory allocation algorithm for efficient execution of task-parallel programs on non-uniform memory architecture (NUMA) systems. Task and data placement decisions are based on a static description of the memory hierarchy and on runtime information about intertask communication. Existing locality-aware scheduling strategies for fine-grained tasks have strong limitations: they are specific to some class of machines or applications, they do not handle task dependences, they require manual program annotations, or they rely on fragile profiling schemes. By contrast, our solution makes no assumption on the structure of programs or on the layout of data in memory. Experimental results, based on the OpenStream language, show that locality of accesses to main memory of scientific applications can be increased significantly on a 64-core machine, resulting in a speedup of up to 1.63Ă— compared to a state-of-the-art work-stealing scheduler
Advanced information processing system: The Army fault tolerant architecture conceptual study. Volume 1: Army fault tolerant architecture overview
Digital computing systems needed for Army programs such as the Computer-Aided Low Altitude Helicopter Flight Program and the Armored Systems Modernization (ASM) vehicles may be characterized by high computational throughput and input/output bandwidth, hard real-time response, high reliability and availability, and maintainability, testability, and producibility requirements. In addition, such a system should be affordable to produce, procure, maintain, and upgrade. To address these needs, the Army Fault Tolerant Architecture (AFTA) is being designed and constructed under a three-year program comprised of a conceptual study, detailed design and fabrication, and demonstration and validation phases. Described here are the results of the conceptual study phase of the AFTA development. Given here is an introduction to the AFTA program, its objectives, and key elements of its technical approach. A format is designed for representing mission requirements in a manner suitable for first order AFTA sizing and analysis, followed by a discussion of the current state of mission requirements acquisition for the targeted Army missions. An overview is given of AFTA's architectural theory of operation
Parallel Architectures and Parallel Algorithms for Integrated Vision Systems
Computer vision is regarded as one of the most complex and computationally intensive problems. An integrated vision system (IVS) is a system that uses vision algorithms from all levels of processing to perform for a high level application (e.g., object recognition). An IVS normally involves algorithms from low level, intermediate level, and high level vision. Designing parallel architectures for vision systems is of tremendous interest to researchers. Several issues are addressed in parallel architectures and parallel algorithms for integrated vision systems
Recommended from our members
Cherub: A hardware distributed single shared address space memory architecture
Increased computer throughput can be achieved through the use of parallel processing. The granularity of a parallel program is the average number of instructions performed by the tasks constituting it. Coarse-grained programs typically execute huge numbers of instructions per task (w 105). The tasks in fine-grained programs are typically short (æ 103). In general, the finer the program grain, the greater the potential for exploiting parallelism. Amdahl’s Law shows that in the absence of overheads, the more potential parallelism that is realised in an algorithm, the faster it will be. The economical granularity of tasks is determined by the intertask communications overhead. Break-even occurs when processing is approximately equally divided between useful work and overhead.
The two common parallel programming paradigms are shared variable and message passing. Shared variable is, in general, the more natural of the two as it allows implicit communication between tasks. This encourages the programmer to make use of fine-grained tasks. The message passing paradigm requires explicit communication between tasks. This encourages the programmer to use coarser-grained tasks.
Two kinds of parallel architecture have become established. The first is the multiprocessor, which is built around a shared bus giving broadcast communications and a shared memory. This is characterised by low communications overhead, but limited scalability. The second is the multicomputer, which is based on point-to-point communications with larger communications overhead, but good scalability. Quantitatively, the low overhead of the multiprocessor is well matched to fine-grain tasks and, hence, to supporting the shared variable paradigm, while the high overhead of the multicomputer matches it to coarse-grain parallelism and, hence, to the message passing paradigm.
Currently, there appears to be no middle ground in parallel computing; an architecture which can support both several hundred medium-grained (« 104 instructions) parallel tasks and the shared variable programming paradigm would be advantageous in many applications.
This thesis asserts that it is possible to implement a new computer architecture, Cherub, which has at least 200 processors and is able to support shared variable programming with an optimal task granularity of around 104 instructions. This can be achieved through the combination of a hardware-based distributed shared single address space and a wafer-scale communications network.
To support the thesis, the dissertation first specifies a programmer’s interface to Cherub which is simple enough to implement in hardware. It then designs algorithms which provide this interface, allowing the requirements of the underlying network to be estimated. Finally, a wafer scale communications network is outlined, and simulations are used to demonstrate that it can provide the performance required to successfully implement Cherub
NETRA - A Parallel Architecture for Integrated Vision Systems I: Architecture and Organization
Coordinated Science Laboratory was formerly known as Control Systems LaboratoryNational Aeronautics and Space Administration / NASA-NAG-1-61
Advanced information processing system: The Army fault tolerant architecture conceptual study. Volume 2: Army fault tolerant architecture design and analysis
Described here is the Army Fault Tolerant Architecture (AFTA) hardware architecture and components and the operating system. The architectural and operational theory of the AFTA Fault Tolerant Data Bus is discussed. The test and maintenance strategy developed for use in fielded AFTA installations is presented. An approach to be used in reducing the probability of AFTA failure due to common mode faults is described. Analytical models for AFTA performance, reliability, availability, life cycle cost, weight, power, and volume are developed. An approach is presented for using VHSIC Hardware Description Language (VHDL) to describe and design AFTA's developmental hardware. A plan is described for verifying and validating key AFTA concepts during the Dem/Val phase. Analytical models and partial mission requirements are used to generate AFTA configurations for the TF/TA/NOE and Ground Vehicle missions
Porting and optimizing BWA-MEM2 using the Fujitsu A64FX processor
Sequence alignment pipelines for human genomes are an emerging workload that will dominate in the precision medicine field. BWA-MEM2 is a tool widely used in the scientific community to perform read mapping studies. In this paper, we port BWA-MEM2 to the AArch64 architecture using the ARMv8-A specification, and we compare the resulting version against an Intel Skylake system both in performance and in energy-to-solution. The porting effort entails numerous code modifications, since BWA-MEM2 implements certain kernels using x86_64 specific intrinsics, e.g., AVX-512. To adapt this code we use the recently introduced Arm's Scalable Vector Extensions (SVE). More specifically, we use Fujitsu's A64FX processor, the first to implement SVE. The A64FX powers the Fugaku Supercomputer that led the Top500 ranking from June 2020 to November 2021. After porting BWA-MEM2 we define and implement a number of optimizations to improve performance in the A64FX target architecture. We show that while the A64FX performance is lower than that of the Skylake system, A64FX delivers 11.6% better energy-to-solution on average. All the code used for this article is available at https://gitlab.bsc.es/rlangari/bwa-a64fxThis work has been partially supported by the Spanish Ministry of Economy and Competitiveness (contracts PID2019-107255GB-C21 / AEI /10.13039/501100011033 and PID2019-105660RB-C21 / AEI / 10.13039/501100011033), Gobierno de Aragon (T5820R research group), the Generalitat de Catalunya (contracts 2017-SGR-1328 and 2017-SGR1414), and the European Union’s Horizon 2020 research and innovation program (Mont-Blanc 2020 project, grant agreement 779877). Finally, A. Armejach and M. Moreto have been partially supported by the Spanish Ministry of Economy, Industry and Competitiveness under Juan de la Cierva fellowship no. IJCI-2017-33945 and Ramon y Cajal fellowship no. RYC-2016-21104, respectively.Peer ReviewedPostprint (author's final draft
Porting and optimizing BWA-MEM2 using the Fujitsu A64FX processor
Sequence alignment pipelines for human genomes are an emerging workload that will dominate in the precision medicine field. BWA-MEM2 is a tool widely used in the scientific community to perform read mapping studies. In this paper, we port BWA-MEM2 to the AArch64 architecture using the ARMv8-A specification, and we compare the resulting version against an Intel Skylake system both in performance and in energy-to-solution. The porting effort entails numerous code modifications, since BWA-MEM2 implements certain kernels using x86 64 specific intrinsics, e.g., AVX-512. To adapt this code we use the recently introduced Arm’s Scalable Vector Extensions (SVE). More specifically, we use Fujitsu’s A64FX processor, the first to implement SVE. The A64FX powers the Fugaku Supercomputer that led the Top500 ranking from June 2020 to November 2021. After porting BWA-MEM2 we define and implement a number of optimizations to improve performance in the A64FX target architecture. We show that while the A64FX performance is lower than that of the Skylake system, A64FX delivers 11.6% better energy-to-solution on average. All the code used for this article is available at https://gitlab.bsc.es/rlangari/bwa-a64fx
- …