117 research outputs found

    Proceedings of the First International Workshop on HyperTransport Research and Applications (WHTRA2009)

    Get PDF
    Proceedings of the First International Workshop on HyperTransport Research and Applications (WHTRA2009) which was held Feb. 12th 2009 in Mannheim, Germany. The 1st International Workshop for Research on HyperTransport is an international high quality forum for scientists, researches and developers working in the area of HyperTransport. This includes not only developments and research in HyperTransport itself, but also work which is based on or enabled by HyperTransport. HyperTransport (HT) is an interconnection technology which is typically used as system interconnect in modern computer systems, connecting the CPUs among each other and with the I/O bridges. Primarily designed as interconnect between high performance CPUs it provides an extremely low latency, high bandwidth and excellent scalability. The definition of the HTX connector allows the use of HT even for add-in cards. In opposition to other peripheral interconnect technologies like PCI-Express no protocol conversion or intermediate bridging is necessary. HT is a direct connection between device and CPU with minimal latency. Another advantage is the possibility of cache coherent devices. Because of these properties HT is of high interest for high performance I/O like networking and storage, but also for co-processing and acceleration based on ASIC or FPGA technologies. In particular acceleration sees a resurgence of interest today. One reason is the possibility to reduce power consumption by the use of accelerators. In the area of parallel computing the low latency communication allows for fine grain communication schemes and is perfectly suited for scalable systems. Summing up, HT technology offers key advantages and great performance to any research aspect related to or based on interconnects. For more information please consult the workshop website (http://whtra.uni-hd.de)

    Proceedings of the Second International Workshop on HyperTransport Research and Applications (WHTRA2011)

    Get PDF
    Proceedings of the Second International Workshop on HyperTransport Research and Applications (WHTRA2011) which was held Feb. 9th 2011 in Mannheim, Germany. The Second International Workshop for Research on HyperTransport is an international high quality forum for scientists, researches and developers working in the area of HyperTransport. This includes not only developments and research in HyperTransport itself, but also work which is based on or enabled by HyperTransport. HyperTransport (HT) is an interconnection technology which is typically used as system interconnect in modern computer systems, connecting the CPUs among each other and with the I/O bridges. Primarily designed as interconnect between high performance CPUs it provides an extremely low latency, high bandwidth and excellent scalability. The definition of the HTX connector allows the use of HT even for add-in cards. In opposition to other peripheral interconnect technologies like PCI-Express no protocol conversion or intermediate bridging is necessary. HT is a direct connection between device and CPU with minimal latency. Another advantage is the possibility of cache coherent devices. Because of these properties HT is of high interest for high performance I/O like networking and storage, but also for co-processing and acceleration based on ASIC or FPGA technologies. In particular acceleration sees a resurgence of interest today. One reason is the possibility to reduce power consumption by the use of accelerators. In the area of parallel computing the low latency communication allows for fine grain communication schemes and is perfectly suited for scalable systems. Summing up, HT technology offers key advantages and great performance to any research aspect related to or based on interconnects. For more information please consult the workshop website (http://whtra.uni-hd.de)

    Proceedings of the First International Workshop on HyperTransport Research and Applications (WHTRA2009)(revised 08/2009)

    Get PDF
    Proceedings of the First International Workshop on HyperTransport Research and Applications (WHTRA2009) which was held Feb. 12th 2009 in Mannheim, Germany. The 1st International Workshop for Research on HyperTransport is an international high quality forum for scientists, researches and developers working in the area of HyperTransport. This includes not only developments and research in HyperTransport itself, but also work which is based on or enabled by HyperTransport. HyperTransport (HT) is an interconnection technology which is typically used as system interconnect in modern computer systems, connecting the CPUs among each other and with the I/O bridges. Primarily designed as interconnect between high performance CPUs it provides an extremely low latency, high bandwidth and excellent scalability. The definition of the HTX connector allows the use of HT even for add-in cards. In opposition to other peripheral interconnect technologies like PCI-Express no protocol conversion or intermediate bridging is necessary. HT is a direct connection between device and CPU with minimal latency. Another advantage is the possibility of cache coherent devices. Because of these properties HT is of high interest for high performance I/O like networking and storage, but also for co-processing and acceleration based on ASIC or FPGA technologies. In particular acceleration sees a resurgence of interest today. One reason is the possibility to reduce power consumption by the use of accelerators. In the area of parallel computing the low latency communication allows for fine grain communication schemes and is perfectly suited for scalable systems. Summing up, HT technology offers key advantages and great performance to any research aspect related to or based on interconnects. For more information please consult the workshop website (http://whtra.uni-hd.de)

    A new degree of freedom for memory allocation in clusters

    Full text link
    Improvements in parallel computing hardware usually involve increments in the number of available resources for a given application such as the number of computing cores and the amount of memory. In the case of shared-memory computers, the increase in computing resources and available memory is usually constrained by the coherency protocol, whose overhead rises with system size, limiting the scalability of the final system. In this paper we propose an efficient and cost-effective way to increase the memory available for a given application by leveraging free memory in other computers in the cluster. Our proposal is based on the observation that many applications benefit from having more memory resources but do not require more computing cores, thus reducing the requirements for cache coherency and allowing a simpler implementation and better scalability. Simulation results show that, when additional mechanisms intended to hide remote memory latency are used, execution time of applications that use our proposal is similar to the time required to execute them in a computer populated with enough local memory, thus validating the feasibility of our proposal. We are currently building a prototype that implements our ideas. The first results from real executions in this prototype demonstrate not only that our proposal works but also that it can efficiently execute applications that make use of remote memory resources. © 2011 Springer Science+Business Media, LLC.This work has been supported by PROMETEO from Generalitat Valenciana (GVA) under Grant PROMETEO/2008/060.Montaner Mas, H.; Silla Jiménez, F.; Fröning, H.; Duato Marín, JF. (2012). A new degree of freedom for memory allocation in clusters. Cluster Computing. 15(2):101-123. https://doi.org/10.1007/s10586-010-0150-7S1011231523leaf Systems: http://www.3leafsystems.comAcharya, A., Setia, S.: Availability and utility of idle memory in workstation clusters. ACM SIGMETRICS Perform. Eval. Rev. 27(1), 35–46 (1999). doi: 10.1145/301464.301478Anderson, T., Culler, D., Patterson, D.: A case for NOW (Networks of Workstations). IEEE MICRO 15(1), 54–64 (1995). doi: 10.1109/40.342018HyperTransport Technology Consortium. HyperTransport I/O Link Specification Revision 3.10 (2008). Available at http://www.hypertransport.orgBienia, C., Kumar, S., et al.: The parsec benchmark suite: Characterization and architectural implications. In: Proceedings of the 17th PACT (2008)Chapman, M., Heiser, G.: vNUMA: A virtual shared-memory multiprocessor. In: Proceedings of the 2009 USENIX Annual Technical Conference, San Diego, USA, 2000, pp. 349–362. (2009)Charles, P., Grothoff, C., Saraswat, V., et al.: X10: an object-oriented approach to non-uniform cluster computing. ACM SIGPLAN Not. 40(10), 519–538 (2005)Consortium, H.: HyperTransport High Node Count, Slides. http://www.hypertransport.org/default.cfm?page=HighNodeCountSpecificationConway, P., Hughes, B.: The AMD opteron northbridge architecture. IEEE MICRO 27(2), 10–21 (2007). doi: 10.1109/MM.2007.43Conway, P., Kalyanasundharam, N., Donley, G., et al.: Blade computing with the AMD Opteron processor (Magny-Cours). Hot chips 21 (2009)Duato, J., Silla, F., Yalamanchili, S., et al.: Extending HyperTransport protocol for improved scalability. First International Workshop on HyperTransport Research and Applications (2009)Feeley, M.J., Morgan, W.E., Pighin, E.P., Karlin, A.R., Levy, H.M., Thekkath, C.A.: Implementing global memory management in a workstation cluster. In: SOSP ’95: Proceedings of the Fifteenth ACM Symposium on Operating Systems Principles, pp. 201–212. ACM, New York (1995). doi: 10.1145/224056.224072Fröning, H., Litz, H.: Efficient hardware support for the partitioned global address space. In: 10th Workshop on Communication Architecture for Clusters (2010)Fröning, H., Nuessle, M., Slogsnat, D., Litz, H., Brüening, U.: The HTX-board: a rapid prototyping station. In: 3rd annual FPGAworld Conference (2006)Garcia-Molina, H., Salem, K.: Main memory database systems: an overview. IEEE Trans. Knowl. Data Eng. 4(6), 509–516 (1992). doi: 10.1109/69.180602Gaussian 03: http://www.gaussian.comGray, J., Liu, D.T., Nieto-Santisteban, M., et al.: Scientific data management in the coming decade. SIGMOD Rec. 34(4), 34–41 (2005). doi: 10.1145/1107499.1107503IBM journal of Research and Development staff: Overview of the IBM Blue Gene/P project. IBM J. Res. Dev. 52(1/2), 199–220 (2008)IBM z Series: http://www.ibm.com/systems/zIn-Memory Database Systems (IMDSs) Beyond the Terabyte Size Boudary: http://www.mcobject.com/130/EmbeddedDatabaseWhitePapers.htmKeltcher, C., McGrath, K., Ahmed, A., Conway, P.: The AMD opteron processor for multiprocessor servers. Micro IEEE 23(2), 66–76 (2003). doi: 10.1109/MM.2003.1196116Kottapalli, S., Baxter, J.: Nehalem-EX CPU architecture. Hot chips 21 (2009)Liang, S., Noronha, R., Panda, D.: Swapping to remote memory over infiniband: an approach using a high performance network block device. In: Cluster Computing, 2005. IEEE International, pp. 1–10. (2005) doi: 10.1109/CLUSTR.2005.347050Litz, H., Fröning, H., Nuessle, M., Brüening, U.: A hypertransport network interface controller for ultra-low latency message transfers. HyperTransport Consortium White Paper (2007)Litz, H., Fröning, H., Nuessle, M., Brüening, U.: VELO: A novel communication engine for ultra-low latency message transfers. In: 37th International Conference on Parallel Processing, 2008. ICPP ’08, pp. 238–245 (2008). doi: 10.1109/ICPP.2008.85Magnusson, P., Christensson, M., Eskilson, J., et al.: Simics: a full system simulation platform. Computer 35(2), 50–58 (2002). doi: 10.1109/2.982916Martin, M., Sorin, D., Beckmann, B., et al.: Multifacet’s general execution-driven multiprocessor simulator (GEMS) toolset. ACM SIGARCH Comput. Archit. News 33(4), 92–99 (2005) doi: 10.1145/1105734.1105747MBA3 NC Series Catalog: http://www.fujitsu.com/global/services/computing/storage/hdd/ehdd/mba3073nc-mba3300nc.htmlMcCalpin, J.D.: Memory bandwidth and machine balance in current high performance computers. In: IEEE Computer Society Technical Committee on Computer Architecture (TCCA) Newsletter, pp. 19–25 (1995)NUMAChip: http://www.numachip.com/Oguchi, M., Kitsuregawa, M.: Using available remote memory dynamically for parallel data mining application on ATM-connected PC cluster. In: IPDPS 2000. Proceedings, 14th International, pp. 411–420 (2000). doi: 10.1109/IPDPS.2000.846014Oleszkiewicz, J., Xiao, L., Liu, Y.: Parallel network RAM: effectively utilizing global cluster memory for large data-intensive parallel programs. In: International Conference on Parallel Processing, 2004. ICPP 2004, vol. 1, pp. 353–360 (2004). doi: 10.1109/ICPP.2004.1327942Ronstrom, M., Thalmann, L.: MySQL cluster architecture overview. Technical White Paper. MySQL (2004)ScaleMP: http://www.scalemp.comSGI: Technical advances in the SGI Altix UV architecture, White Paper. http://www.sgi.com/products/servers/altix/uv/Slogsnat, D., Giese, A., Nüssle, M., Brüning, U.: An open-source HyperTransport core. ACM Trans. Reconfigurable Technol. Syst. 1(3), 1–21 (2008). doi: 10.1007/s10586-010-0150-7Szalay, A.S., Gray, J., vandenBerg, J.: Petabyte Scale Data Mining: Dream or Reality? CoRR cs.DB/0208013 (2002)Tuck, J., Ceze, L., Torrellas, J.: Scalable cache miss handling for high memory-level parallelism. In: Microarchitecture, 2006. MICRO-39. 39th Annual IEEE/ACM International Symposium on (2006)Violin Memory: http://violin-memory.comDynamic Logical Partitioning. White Paper: http://www.ibm.com/systems/p/hardware/whitepapers/dlpar.htmlYelick, K.: Computer architecture: Opportunities and challenges for scalable applications. Sandia CSRI Workshop on Next-generation scalable applications: When MPI-only is not enough (2008)Yelick, K.: Programming models: Opportunities and challenges for scalable applications. Sandia CSRI Workshop on Next-generation scalable applications: When MPI-only is not enough (2008

    Achieving High Speed CFD simulations: Optimization, Parallelization, and FPGA Acceleration for the unstructured DLR TAU Code

    Get PDF
    Today, large scale parallel simulations are fundamental tools to handle complex problems. The number of processors in current computation platforms has been recently increased and therefore it is necessary to optimize the application performance and to enhance the scalability of massively-parallel systems. In addition, new heterogeneous architectures, combining conventional processors with specific hardware, like FPGAs, to accelerate the most time consuming functions are considered as a strong alternative to boost the performance. In this paper, the performance of the DLR TAU code is analyzed and optimized. The improvement of the code efficiency is addressed through three key activities: Optimization, parallelization and hardware acceleration. At first, a profiling analysis of the most time-consuming processes of the Reynolds Averaged Navier Stokes flow solver on a three-dimensional unstructured mesh is performed. Then, a study of the code scalability with new partitioning algorithms are tested to show the most suitable partitioning algorithms for the selected applications. Finally, a feasibility study on the application of FPGAs and GPUs for the hardware acceleration of CFD simulations is presented

    HyperTransport 3 Core: A Next Generation Host Interface with Extremely High Bandwidth

    Get PDF
    As the amount of computing power keeps increasing, host interface bandwidth to memory and input-output devices (I/O) becomes a more and more limiting factor. High speed serial host interface protocols like PCI-Express and HyperTransport (HT) have been introduced to satisfy the applications’ ever increasing demands for more bandwidth. Recent applications in the field of General Purpose Graphic Processing Units (GPGPUs) and Field Programmable Gate Array (FPGA) based coprocessors are an example. In this Paper we present a novel implementation of an FPGA based HyperTransport 3 (HT3) host interface. To the best of our knowledge it represents the very first implementation of this type. The design offers an extremely high unidirectional bandwidth of up to 2.3 GByte/s. It can be employed in arbitrary FPGA applications and then offers direct access to an AMD Opteron processor via the HT interface. To allow the development of an optimal design, we perform a complexity and requirements analysis. The result is our proposed solution which has been implemented in synthesizable Hardware Description Language (HDL) code. Microbenchmarks are presented to show the feasibility and high performance of the design

    Firmware and gateway for the ACE1 reconfigurable accelerator card

    Get PDF
    This thesis describes the continued work on the in-house designed FPGA based co-processor daughtercard referred to as ACE1. The aim: to create an ecosystem incorporating firmware, bootstrapping code, drivers and a development environment to create a seamless environment. Challenges in setting up and debugging the interface that connects the coprocessor daughtercard to the host server include: problems with the power network, the edge connectors and timing problems with the primary protocol which prevented host-based communications. The options include allowing the daughtercard to function in a stand-alone fashion and we present a gateware solution that allows users to select from a number of alternatives for each of the layers in the Open Systems Interconnect networking model

    Acceleration Methodology for the Implementation of Scientific Applications on Reconfigurable Hardware

    Get PDF
    The role of heterogeneous multi-core architectures in the industrial and scientific computing community is expanding. For researchers to increase the performance of complex applications, a multifaceted approach is needed to utilize emerging reconfigurable computing (RC) architectures. First, the method for accelerating applications must provide flexible solutions for fully utilizing key architecture traits across platforms. Secondly, the approach needs to be readily accessible to application scientists. A recent trend toward emerging disruptive architectures is an important signal that fundamental limitations in traditional high performance computing (HPC) are limiting break through research. To respond to these challenges, scientists are under pressure to identify new programming methodologies and elements in platform architectures that will translate into enhanced program efficacy. Reconfigurable computing (RC) allows the implementation of almost any computer architecture trait, but identifying which traits work best for numerous scientific problem domains is difficult. However, by leveraging the existing underlying framework available in field programmable gate arrays (FPGAs), it is possible to build a method for utilizing RC traits for accelerating scientific applications. By contrasting both hardware and software changes, RC platforms afford developers the ability to examine various architecture characteristics to find those best suited for production-level scientific applications. The flexibility afforded by FPGAs allow these characteristics to then be extrapolated to heterogeneous, multi-core and general-purpose computing on graphics processing units (GP-GPU) HPC platforms. Additionally by coupling high-level languages (HLL) with reconfigurable hardware, relevance to a wider industrial and scientific population is achieved. To provide these advancements to the scientific community we examine the acceleration of a scientific application on a RC platform. By leveraging the flexibility provided by FPGAs we develop a methodology that removes computational loads from host systems and internalizes portions of communication with the aim of reducing fiscal costs through the reduction of physical compute nodes required to achieve the same runtime performance. Using this methodology an improvement in application performance is shown to be possible without requiring hand implementation of HLL code in a hardware description language (HDL) A review of recent literature demonstrates the challenge of developing a platform-independent flexible solution that allows access to cutting edge RC hardware for application scientists. To address this challenge we propose a structured methodology that begins with examination of the application\u27s profile, computations, and communications and utilizes tools to assist the developer in making partitioning and optimization decisions. Through experimental results, we will analyze the computational requirements, describe the simulated and actual accelerated application implementation, and finally describe problems encountered during development. Using this proposed method, a 3x speedup is possible over the entire accelerated target application. Lastly we discuss possible future work including further potential optimizations of the application to improve this process and project the anticipated benefits

    Accelerated Financial Applications through Specialized Hardware, FPGA

    Get PDF
    This project will investigate Field Programmable Gate Array (FPGA) technology in financial applications. FPGA implementation in high performance computing is still in its infancy. Certain companies like XtremeData inc. advertized speed improvements of 50 to 1000 times for DNA sequencing using FPGAs, while using an FPGA as a coprocessor to handle specific tasks provides two to three times more processing power. FPGA technology increases performance by parallelizing calculations. This project will specifically address speed and accuracy improvements of both fundamental and transcendental functions when implemented using FPGA technology. The results of this project will lead to a series of recommendations for effective utilization of FPGA technology in financial applications

    Improving the Scalability of High Performance Computer Systems

    Full text link
    Improving the performance of future computing systems will be based upon the ability of increasing the scalability of current technology. New paths need to be explored, as operating principles that were applied up to now are becoming irrelevant for upcoming computer architectures. It appears that scaling the number of cores, processors and nodes within an system represents the only feasible alternative to achieve Exascale performance. To accomplish this goal, we propose three novel techniques addressing different layers of computer systems. The Tightly Coupled Cluster technique significantly improves the communication for inter node communication within compute clusters. By improving the latency by an order of magnitude over existing solutions the cost of communication is considerably reduced. This enables to exploit fine grain parallelism within applications, thereby, extending the scalability considerably. The mechanism virtually moves the network interconnect into the processor, bypassing the latency of the I/O interface and rendering protocol conversions unnecessary. The technique is implemented entirely through firmware and kernel layer software utilizing off-the-shelf AMD processors. We present a proof-of-concept implementation and real world benchmarks to demonstrate the superior performance of our technique. In particular, our approach achieves a software-to-software communication latency of 240 ns between two remote compute nodes. The second part of the dissertation introduces a new framework for scalable Networks-on-Chip. A novel rapid prototyping methodology is proposed, that accelerates the design and implementation substantially. Due to its flexibility and modularity a large application space is covered ranging from Systems-on-chip, to high performance many-core processors. The Network-on-Chip compiler enables to generate complex networks in the form of synthesizable register transfer level code from an abstract design description. Our engine supports different target technologies including Field Programmable Gate Arrays and Application Specific Integrated Circuits. The framework enables to build large designs while minimizing development and verification efforts. Many topologies and routing algorithms are supported by partitioning the tasks into several layers and by the introduction of a protocol agnostic architecture. We provide a thorough evaluation of the design that shows excellent results regarding performance and scalability. The third part of the dissertation addresses the Processor-Memory Interface within computer architectures. The increasing compute power of many-core processors, leads to an equally growing demand for more memory bandwidth and capacity. Current processor designs exhibit physical limitations that restrict the scalability of main memory. To address this issue we propose a memory extension technique that attaches large amounts of DRAM memory to the processor via a low pin count interface using high speed serial transceivers. Our technique transparently integrates the extension memory into the system architecture by providing full cache coherency. Therefore, applications can utilize the memory extension by applying regular shared memory programming techniques. By supporting daisy chained memory extension devices and by introducing the asymmetric probing approach, the proposed mechanism ensures high scalability. We furthermore propose a DMA offloading technique to improve the performance of the processor memory interface. The design has been implemented in a Field Programmable Gate Array based prototype. Driver software and firmware modifications have been developed to bring up the prototype in a Linux based system. We show microbenchmarks that prove the feasibility of our design
    corecore