32 research outputs found

    A cost-effective heuristic to schedule local and remote memory in cluster computers

    Cluster computers are a cost-effective alternative to supercomputers. In these systems, it is common to constrain the memory address space of a given processor to the local motherboard. Constraining the system in this way is much cheaper than using a full-fledged shared memory implementation among motherboards. However, memory usage among motherboards can be unevenly balanced. Remote memory access (RMA) hardware, on the other hand, provides fast interconnects among the motherboards of a cluster, and RMA devices can be used to access remote RAM from a local motherboard. This work focuses on this capability in order to achieve a better global use of the total RAM in the system. More precisely, the address space of local applications is extended to remote motherboards and is used to access remote RAM. This paper presents an ideal memory scheduling algorithm and proposes a cost-effective heuristic to allocate local and remote memory among local applications. Compared to the devised ideal algorithm, the heuristic obtains the same (or closely resembling) results while largely reducing the computational cost. In addition, we analyze the impact on the performance of standalone applications of varying the memory distribution among regions (local, local to board, and remote). This study is then extended to any number of concurrent applications. Experimental results show that a QoS parameter is needed in order to avoid unacceptable performance degradation.
    © 2011 Springer Science+Business Media, LLC.
    This work was supported by Spanish CICYT under Grant TIN2009-14475-C04-01 and by Consolider-Ingenio under Grant CSD2006-00046.
    Serrano Gómez, M.; Sahuquillo Borrás, J.; Petit Martí, SV.; Hassan Mohamed, H.; Duato Marín, JF. (2012). A cost-effective heuristic to schedule local and remote memory in cluster computers. Journal of Supercomputing. 59(3):1533-1551.
    https://doi.org/10.1007/s11227-011-0566-8
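
    The abstract above does not spell out the heuristic itself, so the following is only a minimal sketch of one way local and remote memory could be divided among applications under a QoS cap. It is not the authors' algorithm. The function name `allocate`, the `Allocation` record, the smallest-demand-first ordering, and the `max_remote_fraction` parameter are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Allocation:
    local_mb: int   # memory served from the local motherboard
    remote_mb: int  # memory borrowed from remote motherboards


def allocate(demands_mb, local_free_mb, remote_free_mb, max_remote_fraction):
    """Hypothetical greedy pass: serve each application from local memory
    first, then borrow remote memory, never letting the remote share exceed
    max_remote_fraction of the application's demand (the assumed QoS cap)."""
    result = {}
    # Assumption: smaller demands are placed first.
    for app, demand in sorted(demands_mb.items(), key=lambda kv: kv[1]):
        local = min(demand, local_free_mb)
        needed_remote = demand - local
        remote_cap = int(demand * max_remote_fraction)
        if needed_remote > min(remote_cap, remote_free_mb):
            # Placing this application would violate the cap or exhaust
            # remote memory, so it is left unscheduled.
            result[app] = None
            continue
        local_free_mb -= local
        remote_free_mb -= needed_remote
        result[app] = Allocation(local, needed_remote)
    return result


if __name__ == "__main__":
    print(allocate({"a": 100, "b": 300}, 250, 500, 0.5))
```

    In this toy setup, lowering `max_remote_fraction` from 0.5 to 0.25 leaves application `b` unscheduled, which loosely mirrors the abstract's point that some QoS limit on remote memory use is needed.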

    Active memory controller

    Inability to hide main memory latency has been increasingly limiting the performance of modern processors. The problem is worse in large-scale shared memory systems, where remote memory latencies are hundreds, and soon thousands, of processor cycles. To mitigate this problem, we propose an intelligent memory and cache coherence controller (AMC) that can execute Active Memory Operations (AMOs). AMOs are select operations sent to and executed on the home memory controller of data. AMOs can eliminate a significant number of coherence messages, minimize intranode and internode memory traffic, and create opportunities for parallelism. Our implementation of AMOs is cache-coherent and requires no changes to the processor core or DRAM chips. In this paper, we present the microarchitecture design of AMC, and the programming model of AMOs. We compare AMOs' performance to that of several other memory architectures on a variety of scientific and commercial benchmarks. Through simulation, we show that AMOs offer dramatic performance improvements for an important set of data-intensive operations, e.g., up to 50x faster barriers, 12x faster spinlocks, 8.5x-15x faster stream/array operations, and 3x faster database queries. We also present an analytical model that can predict the performance benefits of using AMOs with decent accuracy. The silicon cost required to support AMOs is less than 1% of the die area of a typical high performance processor, based on a standard cell implementation.

    Host-Assisted Zero-Copy Remote Memory Access Communication on InfiniBand

    This paper describes how RMA can be implemented efficiently over InfiniBand. The capabilities not offered directly by the InfiniBand verb layer can be implemented efficiently using the novel host-assisted approach while achieving zero-copy communication and supporting an excellent overlap of computation with communication. For contiguous data we are able to achieve a small message latency of 6 µs and a peak bandwidth of 830 MB/s for 'put' and a small message latency of 12 µs and a peak bandwidth of 765 MB/s for 'get'. These numbers are almost as good as the performance of the native VAPI layer. For noncontiguous data, the host-assisted approach can deliver bandwidth close to that for contiguous data. We also demonstrate the superior tolerance of host-assisted data-transfer operations to CPU-intensive tasks, due to the minimal host involvement in our approach compared to the traditional host-based approach. Our implementation also supports a very high degree of overlap of computation and communication: 99% overlap for contiguous and up to 95% for noncontiguous data was achieved for large message sizes. The NAS MG and matrix multiplication benchmarks were used to validate the effectiveness of our approach, and demonstrated excellent overall performance.

    High Performance Remote Memory Access Communications: The ARMCI Approach

    This paper describes the Aggregate Remote Memory Copy Interface (ARMCI), a portable high performance remote memory access communication interface, developed originally under the U.S. Department of Energy (DOE) Advanced Computational Testing and Simulation Toolkit project and currently used and advanced as a part of the run-time layer of the DOE project, Programming Models for Scalable Parallel Computing. The paper discusses the model, addresses challenges of portable implementations, and demonstrates that ARMCI delivers high performance on a variety of platforms. Special emphasis is placed on the latency hiding mechanisms and the ability to optimize noncontiguous data transfers.

    Host-Assisted Zero-Copy Remote Memory Access Communication on InfiniBand

    Remote memory access (RMA) is becoming an increasingly important communication model due to its excellent potential for overlapping communication and computation and for achieving high performance on modern networks with RDMA hardware such as InfiniBand. RMA plays a vital role in supporting the emerging global address space languages and the management of advanced distributed data structures. This paper describes how RMA communication can be implemented efficiently over InfiniBand based on the 'zero-copy' approach. The capabilities not offered directly by the InfiniBand verb layer can be implemented efficiently using the novel host-assisted approach while achieving zero-copy communication and supporting a high degree of overlap between computation and communication. For the contiguous case we are able to achieve a small message latency of 7.44 µs and a peak bandwidth of 730 MB/s for 'put' and a small message latency of 15 µs and a peak bandwidth of 689 MB/s for 'get'. These numbers are almost as good as the performance of the native VAPI layer. For the noncontiguous case, our host-assisted approach can support close to the peak bandwidth that was achieved for contiguous data. We also demonstrate the superior tolerance of host-assisted data-transfer operations to CPU-intensive tasks, due to the minimal host involvement in our approach compared to the traditional host-based approach. Our implementation also supports a very high degree of overlap of computation and communication: 99% overlap for contiguous and up to 95% for noncontiguous data was achieved for large message sizes. Finally, the NAS MG and parallel matrix multiplication benchmarks were used to validate the effectiveness of our approach, and demonstrated excellent overall performance.

    Symmetric Data Objects and Remote Memory Access Communication for Fortran 95 Applications

    Symmetric data objects were introduced by Cray Inc. in the context of SHMEM remote memory access communication on Cray T3D/E systems and were later adopted by SGI for their Origin servers. Symmetric data objects greatly simplify parallel programming by allowing programmers to reference a remote instance of a data structure by specifying the address of the local counterpart. The current paper describes how symmetric data objects and remote memory access communication could be implemented in Fortran 95 without requiring specialized hardware or compiler support. The NAS Multi-Grid parallel benchmark was used as an application example and demonstrated performance competitive with the standard MPI implementation.
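
    To make the addressing idea in this abstract concrete, here is a toy, single-process model of symmetric data objects. It is a sketch only, not the paper's Fortran 95 implementation and not the SHMEM API: the class `SymmetricHeap` and its methods `alloc`, `put`, and `get` are hypothetical names. The one assumption it encodes is that every processing element (PE) allocates the same objects in the same order, so the local offset of an object also identifies its remote counterpart.

```python
class SymmetricHeap:
    """Toy model of symmetric allocation across several PEs in one process."""

    def __init__(self, num_pes, size):
        self.size = size
        # One private memory region per PE, all with an identical layout.
        self.memory = [[0] * size for _ in range(num_pes)]
        self.next_offset = 0

    def alloc(self, length):
        """Reserve `length` cells at the same offset on every PE."""
        offset = self.next_offset
        if offset + length > self.size:
            raise MemoryError("symmetric heap exhausted")
        self.next_offset += length
        return offset

    def put(self, target_pe, offset, values):
        """One-sided write: the local offset names the remote instance."""
        if offset + len(values) > self.size:
            raise IndexError("write past end of symmetric heap")
        self.memory[target_pe][offset:offset + len(values)] = values

    def get(self, source_pe, offset, length):
        """One-sided read of the remote instance at the same offset."""
        return list(self.memory[source_pe][offset:offset + length])


if __name__ == "__main__":
    heap = SymmetricHeap(num_pes=2, size=8)
    a = heap.alloc(4)              # same offset on PE 0 and PE 1
    heap.put(1, a, [1, 2, 3, 4])   # write into PE 1's copy of the object
    print(heap.get(1, a, 4), heap.get(0, a, 4))
```

    The caller never looks up a remote address: the offset returned by `alloc` is reused unchanged for any target PE, which is the simplification the abstract attributes to symmetric data objects.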

    Combining Distributed and Shared Memory Models: Approach and Evolution of the Global Arrays Toolkit

    This paper describes the characteristics of the Global Arrays programming model and the capabilities of the toolkit, and discusses its evolution.

    Nano-drug delivery platform for glucocorticoid use in skeletal muscle injury.

    Glucocorticoids are used for their anti-inflammatory properties in skeletal muscle and arthritis. However, the major drawback of glucocorticoids is that they lead to senescence and toxicity. Therefore, based on the idea that decreasing particle size allows for increased surface area and bioavailability of the drug, in the present study we hypothesized that nano-delivery of dexamethasone will offer increased efficacy and decreased toxicity. The dexamethasone-loaded PLGA (poly lactic-co-glycolic acid) nanoparticles were prepared using the nanoprecipitation method. The morphological characteristics of the nanoparticles were studied under a scanning electron microscope. The particle size of the nanoparticles was 217.5±19.99 nm with a polydispersity index (PDI) of 0.14±0.07. The nanoparticle encapsulation efficiency was 34.57±1.99%, with the in vitro drug release profile exhibiting a sustained release pattern over 10 days. We identified improved skeletal muscle myoblast performance, with improved closure of the wound along with increased cell viability, at 10 nM nano-Dexamethasone-PLGA; however, dexamethasone solution (1 µM) was injurious to cells, since the migration efficiency was decreased. In addition, the use of NP-Dexamethasone decreased LPS-induced LDH release compared with dexamethasone solution. Taken together, the present study clearly demonstrates that delivery of PLGA-dexamethasone nanoparticles to skeletal muscle cells is beneficial for treating inflammation and for skeletal muscle function.

    Characterisation of radicals formed by the triazine 1,4-dioxide hypoxia-activated prodrug, SN30000

    The radical species underlying the activity of the bioreductive anticancer prodrug, SN30000, have been identified by electron paramagnetic resonance and pulse radiolysis techniques. Spin-trapping experiments indicate both an aryl-type radical and an oxidising radical, trapped as a carbon-centred radical, are formed from the protonated radical anion of SN30000. The carbon-centred radical, produced upon the one-electron oxidation of the 2-electron reduced metabolite of SN30000, oxidises 2-deoxyribose, a model for the site of damage on DNA which leads to double strand breaks. Calculations using density functional theory support the assignments made