1,544 research outputs found

    A new degree of freedom for memory allocation in clusters

    Improvements in parallel computing hardware usually involve increments in the resources available to a given application, such as the number of computing cores and the amount of memory. In shared-memory computers, the increase in computing resources and available memory is usually constrained by the coherency protocol, whose overhead rises with system size and limits the scalability of the final system. In this paper we propose an efficient and cost-effective way to increase the memory available to a given application by leveraging free memory in other computers in the cluster. Our proposal is based on the observation that many applications benefit from having more memory resources but do not require more computing cores, which reduces the requirements for cache coherency and allows a simpler implementation and better scalability. Simulation results show that, when additional mechanisms intended to hide remote memory latency are used, the execution time of applications that use our proposal is similar to the time required to execute them on a computer populated with enough local memory, validating the feasibility of our proposal. We are currently building a prototype that implements our ideas. The first results from real executions on this prototype demonstrate not only that our proposal works but also that it can efficiently execute applications that make use of remote memory resources. © 2011 Springer Science+Business Media, LLC. This work has been supported by PROMETEO from Generalitat Valenciana (GVA) under Grant PROMETEO/2008/060. Montaner Mas, H.; Silla Jiménez, F.; Fröning, H.; Duato Marín, JF. (2012). A new degree of freedom for memory allocation in clusters. Cluster Computing 15(2):101-123. https://doi.org/10.1007/s10586-010-0150-7
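    A minimal sketch of the allocation policy the abstract describes, assuming hypothetical names (Node, allocate): when local memory runs out, the application borrows free memory from whichever cluster node has the most of it, without claiming that node's cores. The paper's actual mechanism works at the hardware level, so this only illustrates the idea:

        # Hypothetical illustration: borrow capacity, not compute, from the cluster.
        class Node:
            def __init__(self, name, free_mb):
                self.name, self.free_mb = name, free_mb

            def alloc(self, mb):
                assert self.free_mb >= mb, "out of memory"
                self.free_mb -= mb
                return (self.name, mb)             # stands in for a mapped region

        def allocate(mb, local, cluster):
            if local.free_mb >= mb:
                return local.alloc(mb)             # fast path: local DRAM
            donor = max(cluster, key=lambda n: n.free_mb)
            return donor.alloc(mb)                 # remote region; the proposal maps
                                                   # it into the local address space

        local = Node("n0", 2048)
        cluster = [Node("n1", 8192), Node("n2", 512)]
        print(allocate(4096, local, cluster))      # ('n1', 4096): memory comes from n1

    Only capacity crosses node boundaries here; the donor's cores keep running their own work, which is why the coherency requirements stay small.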

    The walking robot project

    A walking robot was designed, analyzed, and tested as an intelligent, mobile, terrain-adaptive system. The robot's design was an application of existing technologies. The design of the six legs modified and combined well-understood mechanisms and was optimized for performance, flexibility, and simplicity. The body design incorporated two tripods for walking stability and ease of turning. The electrical hardware design used modularity and distributed processing to drive the motors. The software design used feedback to coordinate the system and simple keystrokes to issue commands. The walking machine can be easily adapted to hostile environments such as high-radiation zones and alien terrain. The primary goal of the leg design was to create a leg capable of supporting the robot's body and electrical hardware while walking or performing desired tasks, namely those required for planetary exploration. The leg designers' intent was to study the maximum flexibility and maneuverability achievable with the simplest and lightest leg design. The main constraints on the leg design were leg kinematics, ease of assembly, degrees of freedom, number of motors, overall size, and weight.
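    The two-tripod body design implies the classic alternating-tripod gait. A minimal sketch, assuming a hypothetical leg numbering (0-5) and print statements in place of real motor commands; this is illustrative, not the project's control software:

        # Alternating-tripod gait: legs {0, 2, 4} and {1, 3, 5} take turns as the
        # stance set, so three feet always support the body (hypothetical sketch).
        TRIPOD_A = (0, 2, 4)   # e.g. front-left, middle-right, rear-left
        TRIPOD_B = (1, 3, 5)   # e.g. front-right, middle-left, rear-right

        def step_cycle(cycles):
            stance, swing = TRIPOD_A, TRIPOD_B
            for _ in range(cycles):
                print(f"support on legs {stance}; swing legs {swing} forward")
                stance, swing = swing, stance   # swap roles: statically stable

        step_cycle(4)   # four half-steps of statically stable walking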

    A file server for the DistriX prototype : a multitransputer UNIX system

    The DISTRIX operating system is a multiprocessor distributed operating system based on UNIX. It consists of a number of satellite processors connected to central servers. The system is derived from the MINIX operating system and is compatible with UNIX Version 7. A remote procedure call interface is used in conjunction with a system-wide, end-to-end communication protocol that connects satellite processors to the central servers. A cached file server provides access to all files and devices at the UNIX system-call level. The design of the file server is discussed in depth and its performance evaluated. Additional information is given about the software and hardware used during the development of the project. The MINIX operating system proved to be a good choice as the software base, although certain of its features proved less suitable. The Inmos transputer emerges as a processor with many useful features that eased the implementation.
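    A minimal sketch of the cached-file-server idea described above, assuming illustrative names (FileServer, rpc_read) and a dictionary standing in for the disk; the real system runs on transputers behind an end-to-end RPC protocol, so this shows only how a block cache short-circuits repeated reads:

        # A satellite processor's read() arrives as an RPC; the server answers
        # from a block cache before touching the disk (hypothetical sketch).
        BLOCK = 1024                              # MINIX-era block size

        class FileServer:
            def __init__(self, disk):
                self.disk, self.cache = disk, {}  # (path, block#) -> bytes

            def rpc_read(self, path, offset, length):
                blkno = offset // BLOCK
                key = (path, blkno)
                if key not in self.cache:         # miss: fetch the whole block
                    data = self.disk[path]
                    self.cache[key] = data[blkno * BLOCK:(blkno + 1) * BLOCK]
                start = offset % BLOCK
                return self.cache[key][start:start + length]

        server = FileServer({"/etc/motd": b"welcome to DistriX\n" + bytes(BLOCK)})
        print(server.rpc_read("/etc/motd", 0, 18))   # first call caches the block
        print(server.rpc_read("/etc/motd", 8, 2))    # second call is a cache hit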

    Proceedings of the First International Workshop on HyperTransport Research and Applications (WHTRA2009) (revised 08/2009)

    Proceedings of the First International Workshop on HyperTransport Research and Applications (WHTRA2009), which was held on February 12, 2009 in Mannheim, Germany. The First International Workshop for Research on HyperTransport is an international, high-quality forum for scientists, researchers, and developers working in the area of HyperTransport. This includes not only developments and research on HyperTransport itself, but also work that is based on or enabled by HyperTransport. HyperTransport (HT) is an interconnection technology typically used as the system interconnect in modern computer systems, connecting the CPUs to each other and to the I/O bridges. Primarily designed as an interconnect between high-performance CPUs, it provides extremely low latency, high bandwidth, and excellent scalability. The definition of the HTX connector allows HT to be used even for add-in cards. In contrast to other peripheral interconnect technologies such as PCI-Express, no protocol conversion or intermediate bridging is necessary: HT is a direct connection between device and CPU with minimal latency. Another advantage is the possibility of cache-coherent devices. Because of these properties, HT is of high interest for high-performance I/O such as networking and storage, but also for co-processing and acceleration based on ASIC or FPGA technologies. Acceleration in particular is seeing a resurgence of interest today, one reason being the possibility of reducing power consumption through the use of accelerators. In the area of parallel computing, the low-latency communication allows for fine-grained communication schemes and is perfectly suited to scalable systems. In summary, HT technology offers key advantages and great performance for any research related to or based on interconnects. For more information please consult the workshop website (http://whtra.uni-hd.de).

    Proceedings of the Second International Workshop on HyperTransport Research and Applications (WHTRA2011)

    Proceedings of the Second International Workshop on HyperTransport Research and Applications (WHTRA2011), which was held on February 9, 2011 in Mannheim, Germany. The Second International Workshop for Research on HyperTransport is an international, high-quality forum for scientists, researchers, and developers working in the area of HyperTransport. This includes not only developments and research on HyperTransport itself, but also work that is based on or enabled by HyperTransport. HyperTransport (HT) is an interconnection technology typically used as the system interconnect in modern computer systems, connecting the CPUs to each other and to the I/O bridges. Primarily designed as an interconnect between high-performance CPUs, it provides extremely low latency, high bandwidth, and excellent scalability. The definition of the HTX connector allows HT to be used even for add-in cards. In contrast to other peripheral interconnect technologies such as PCI-Express, no protocol conversion or intermediate bridging is necessary: HT is a direct connection between device and CPU with minimal latency. Another advantage is the possibility of cache-coherent devices. Because of these properties, HT is of high interest for high-performance I/O such as networking and storage, but also for co-processing and acceleration based on ASIC or FPGA technologies. Acceleration in particular is seeing a resurgence of interest today, one reason being the possibility of reducing power consumption through the use of accelerators. In the area of parallel computing, the low-latency communication allows for fine-grained communication schemes and is perfectly suited to scalable systems. In summary, HT technology offers key advantages and great performance for any research related to or based on interconnects. For more information please consult the workshop website (http://whtra.uni-hd.de).

    GRAPE-6: The massively-parallel special-purpose computer for astrophysical particle simulation

    In this paper, we describe the architecture and performance of the GRAPE-6 system, a massively parallel special-purpose computer for astrophysical N-body simulations. GRAPE-6 is the successor of GRAPE-4, which was completed in 1995 and achieved a theoretical peak speed of 1.08 Tflops. As was the case with GRAPE-4, the primary application of GRAPE-6 is the simulation of collisional systems, though it can also be used for collisionless systems. The main differences between GRAPE-4 and GRAPE-6 are that (a) the processor chip of GRAPE-6 integrates six force-calculation pipelines, compared to the single pipeline of GRAPE-4 (which needed three clock cycles to calculate one interaction), (b) the clock speed is increased from 32 to 90 MHz, and (c) the total number of processor chips is increased from 1728 to 2048. These improvements resulted in a peak speed of 64 Tflops. We also discuss the design of the successor of GRAPE-6. Comment: Accepted for publication in PASJ, scheduled to appear in Vol. 55, No.
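    The quoted peak follows from the three stated differences. A back-of-the-envelope check (the 1.08 Tflops figure is from the abstract; the arithmetic is ours):

        # Speedup factors implied by the abstract's numbers.
        grape4_peak_tflops = 1.08

        pipelines = 6 / 1        # six pipelines per chip vs one
        cycles    = 3 / 1        # GRAPE-4 spent 3 clock cycles per interaction
        clock     = 90 / 32      # clock speed in MHz
        chips     = 2048 / 1728  # processor-chip count

        speedup = pipelines * cycles * clock * chips
        print(round(speedup, 1))                          # 60.0
        print(round(grape4_peak_tflops * speedup, 1))     # 64.8, i.e. ~64 Tflops

    The product of the four factors is almost exactly 60, which scales the GRAPE-4 peak to the 64 Tflops quoted above.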