1,897 research outputs found

    HPC Accelerators with 3D Memory

    Get PDF
    Artículo invitado, publicado en las actas del congreso por IEEE Society Press. Páginas 320 a 328. ISBN: 978-1-5090-3593-9.DOI 10.1109/CSE-EUC-DCABES-2016.203After a decade evolving in the High Performance Computing arena, GPU-equipped supercomputers have con- quered the top500 and green500 lists, providing us unprecedented levels of computational power and memory bandwidth. This year, major vendors have introduced new accelerators based on 3D memory, like Xeon Phi Knights Landing by Intel and Pascal architecture by Nvidia. This paper reviews hardware features of those new HPC accelerators and unveils potential performance for scientific applications, with an emphasis on Hybrid Memory Cube (HMC) and High Bandwidth Memory (HBM) used by commercial products according to roadmaps already announced.Universidad de Málaga. Campus de Excelencia Internacional Andalucia Tec

    Introduction to Multiprocessor I/O Architecture

    Get PDF
    The computational performance of multiprocessors continues to improve by leaps and bounds, fueled in part by rapid improvements in processor and interconnection technology. I/O performance thus becomes ever more critical, to avoid becoming the bottleneck of system performance. In this paper we provide an introduction to I/O architectural issues in multiprocessors, with a focus on disk subsystems. While we discuss examples from actual architectures and provide pointers to interesting research in the literature, we do not attempt to provide a comprehensive survey. We concentrate on a study of the architectural design issues, and the effects of different design alternatives

    A low-cost high-speed twin-prefetching DSP-based shared-memory system for real-time image processing applications

    Get PDF
    This dissertation introduces, investigates, and evaluates a low-cost high-speed twin-prefetching DSP-based bus-interconnected shared-memory system for real-time image processing applications. The proposed architecture can effectively support 32 DSPs in contrast to a maximum of 4 DSPs supported by existing DSP-based bus- interconnected systems. This significant enhancement is achieved by introducing two small programmable fast memories (Twins) between the processor and the shared bus interconnect. While one memory is transferring data from/to the shared memory, the other is supplying the core processor with data. The elimination of the traditional direct linkage of the shared bus and processor data bus makes feasible the utilization of a wider shared bus i.e., shared bus width becomes independent of the data bus width of the processors. The fast prefetching memories and the wider shared bus provide additional bus bandwidth into the system, which eliminates large memory latencies; such memory latencies constitute the major drawback for the performance of shared-memory multiprocessors. Furthermore, in contrast to existing DSP-based uniprocessor or multiprocessor systems the proposed architecture does not require all data to be placed on on-chip or off-chip expensive fast memory in order to reach or maintain peak performance. Further, it can maintain peak performance regardless of whether the processed image is small or large. The performance of the proposed architecture has been extensively investigated executing computationally intensive applications such as real-time high-resolution image processing. The effect of a wide variety of hardware design parameters on performance has been examined. More specifically tables and graphs comprehensively analyze the performance of 1, 2, 4, 8, 16, 32 and 64 DSP-based systems, for a wide variety of shared data interconnect widths such as 32, 64, 128, 256 and 512. In addition, the effect of the wide variance of temporal and spatial locality (present in different applications) on the multiprocessor\u27s execution time is investigated and analyzed. Finally, the prefetching cache-size was varied from a few kilobytes to 4 Mbytes and the corresponding effect on the execution time was investigated. Our performance analysis has clearly showed that the execution time converges to a shallow minimum i.e., it is not sensitive to the size of the prefetching cache. The significance of this observation is that near optimum performance can be achieved with a small (16 to 300 Kbytes) amount of prefetching cache

    Strengthening measurements from the edges: application-level packet loss rate estimation

    Get PDF
    Network users know much less than ISPs, Internet exchanges and content providers about what happens inside the network. Consequently users cannot either easily detect network neutrality violations or readily exercise their market power by knowledgeably switching ISPs. This paper contributes to the ongoing efforts to empower users by proposing two models to estimate -- via application-level measurements -- a key network indicator, i.e., the packet loss rate (PLR) experienced by FTP-like TCP downloads. Controlled, testbed, and large-scale experiments show that the Inverse Mathis model is simpler and more consistent across the whole PLR range, but less accurate than the more advanced Likely Rexmit model for landline connections and moderate PL

    PPMB: A Partial-Multiple-Bus Multiprocessor Architecture with Improved Cost-Effectiveness

    Get PDF
    This paper addresses the design and performance analysis of partial-multiple-bus interconnection networks. They are bus architectures that have evolved from multiple-bus structure by dividing buses into groups and reducing bus connections. Their effect is to reduce cost and alleviate arbitration and drive requirements without degrading performance significantly. One such structure, called processor-oriented partial-multiple-bus (or PPMB), is proposed. It serves as an alternative to the conventional structure called memory-oriented partial-multiple-bus (or MPMB) and is aimed at higher system performance at less or equal system cost. It has been shown, both analytically and by simulation, that a substantial increase in system bandwidth (up to 20%) is achieved by the PPMB structure over the MPMB structure. With very large systems, the results also imply a significantly improved cost-effectiveness over the conventional multiple-bus architecture

    Distributed modular RT-systems for detector DAQ, trigger and control applications

    Get PDF
    A modular approach to development of distributed modular system architecture for detector control, data acquisition and trigger data processing is proposed. A multilevel parallel-pipeline model of data acquisition, processing and control is proposed and discussed. Multiprocessor architecture with SCI-based interconnections is proposed as good high-performance system for parallel-pipeline data processing. A network (Ethernet -100) can be used for loading, monitoring and diagnostic purposes independent of basic interconnections. The modular cPCI-based structures with high speed modular interconnections are proposed for DAQ and control applications. For distributed control RT-systems, to construct the effective (cost-performance) systems the same platform of an Intel compatible processor board should be used. The basic computer multiprocessor nodes consist of high-power PC MB (Industrial Computer Systems), which are interconnected by SCI modules and link to embedded microprocessor-based sub-systems for control applications. The required number of multiprocessor nodes should be interconnected by SCI for parallel-pipeline data processing in real time (according to the multilevel model) and link to RT-systems for embedded control. (19 refs)

    Hyperswitch communication network

    Get PDF
    The Hyperswitch Communication Network (HCN) is a large scale parallel computer prototype being developed at JPL. Commercial versions of the HCN computer are planned. The HCN computer being designed is a message passing multiple instruction multiple data (MIMD) computer, and offers many advantages in price-performance ratio, reliability and availability, and manufacturing over traditional uniprocessors and bus based multiprocessors. The design of the HCN operating system is a uniquely flexible environment that combines both parallel processing and distributed processing. This programming paradigm can achieve a balance among the following competing factors: performance in processing and communications, user friendliness, and fault tolerance. The prototype is being designed to accommodate a maximum of 64 state of the art microprocessors. The HCN is classified as a distributed supercomputer. The HCN system is described, and the performance/cost analysis and other competing factors within the system design are reviewed

    Higher-order CFD and Interface Tracking Methods on Highly-Parallel MPI and GPU systems

    Get PDF
    A computational investigation of the effects on parallel performance of higher-order accurate schemes was carried out on two different computational systems: a traditional CPU based MPI cluster and a system of four Graphics Processing Units (GPUs) controlled by a single quad-core CPU. The investigation was based on the solution of the level set equations for interface tracking using a High-Order Upstream Central (HOUC) scheme. Different variants of the HOUC scheme were employed together with a 3rd-order TVD Runge-Kutta time integration. An increase in performance of two orders of magnitude was seen when comparing a single CPU core to a single GPU with a greater increase at higher orders of accuracy and at lower precision

    CROSS-LAYER CUSTOMIZATION FOR LOW POWER AND HIGH PERFORMANCE EMBEDDED MULTI-CORE PROCESSORS

    Get PDF
    Due to physical limitations and design difficulties, computer processor architecture has shifted to multi-core and even many-core based approaches in recent years. Such architectures provide potentials for sustainable performance scaling into future peta-scale/exa-scale computing platforms, at affordable power budget, design complexity, and verification efforts. To date, multi-core processor products have been replacing uni-core processors in almost every market segment, including embedded systems, general-purpose desktops and laptops, and super computers. However, many issues still remain with multi-core processor architectures that need to be addressed before their potentials could be fully realized. People in both academia and industry research community are still seeking proper ways to make efficient and effective use of these processors. The issues involve hardware architecture trade-offs, the system software service, the run-time management, and user application design, which demand more research effort into this field. Due to the architectural specialties with multi-core based computers, a Cross-Layer Customization framework is proposed in this work, which combines application specific information and system platform features, along with necessary operating system service support, to achieve exceptional power and performance efficiency for targeted multi-core platforms. Several topics are covered with specific optimization goals, including snoop cache coherence protocol, inter-core communication for producer-consumer applications, synchronization mechanisms, and off-chip memory bandwidth limitations. Analysis of benchmark program execution with conventional mechanisms is made to reveal the overheads in terms of power and performance. Specific customizations are proposed to eliminate such overheads with support from hardware, system software, compiler, and user applications. Experiments show significant improvement on system performance and power efficiency
    corecore