Abstract

In our research project named "Mega-Scale Computing [...] processors achieves 32 GFlops of peak performance, which is 2.2-fold greater than that of the original one. The cluster unit is equipped with an independent dual network of Gigabit Ethernet, including dual 24-port switches. The maximum power consumption of the cluster unit is 320 W, which is comparable with that of today's high-end PC servers for high performance clusters.

Performance evaluation using NPB kernels and HPL shows that the performance of MegaProto/E exceeds that of a dual-Xeon server in all the benchmarks, and its performance ratio ranges from 1.3 to 3.7. These results reveal that our solution of implementing a number of ultra low-power processors in compact packaging is an excellent way to achieve extremely high performance in applications with a certain degree of parallelism. We are now building a multi-unit cluster with 128 CPUs (8 units) to prove that this advantage still holds with higher scalability.

Introduction

For the feasibility study, we especially focused on the performance/power/space ratio problem. To this end, we have developed the prototype system based on commodity-only technology, that is, we only used commodity processors and network elements. In order to achieve a high density implementation we developed cluster chassis units which can house a number of processors in a small space.

The recent trend of dual-core CPUs such as Pentium-D or Opteron-D clearly demonstrates that the best way to improve total performance is to introduce multi-processors having relatively low-performances, instead of increasing the CPU clock frequency on a single processor. Based on this concept, an ideal platform is a cluster of ultra low-power processors implemented in a small space with high density.

Green Destiny [1] is a successful example of the above concept. While it consists of a commercial blade-style processor card, we have designed and implemented a higher density collection of processors in a 1U height chassis containing 17 Transmeta Crusoe processors. This prototype unit was named "MegaProto" [2, 6] which provides 14.9 Gflops of peak performance. When we designed the first prototype of MegaProto, we also intended to build an enhanced version incorporating more powerful processor and I/O bus. This enhanced version has now been completed with the name of "MegaProto/E" ('E' stands for Efficeon, the name of the new processor). In this paper, we describe the design, implementation and performance evaluation of the MegaProto/E cluster unit.

The PC cluster solution continues to be a highly at- [...] high performance/cost ratio supported by a non-HPC market.

In our Mega-Scale Computing Project [5], we investigated (i) hardware/software cooperative low-power technology and (ii) workload modeling and model-based management of large scale parallel tasks together with the faults occurring in such a system. Our study covers the processor architecture, compiler, networking, cluster management and
programming for a system based on the above concept.

The rest of the paper consists of the following. Section 2 gives an overview of our Mega-Scale computing project along with the conceptual design of the MegaProto series cluster unit. In Section 3, we describe in detail the design and implementation of MegaProto/E. After describing [...]

[...] [3], the state-of-the-art MPP today, one promising way to achieve Peta-Flops computing is to build an MPP having very low-power processors and a simple switchless network to reduce both space and power consumption. However, such a system requires a dedicated hardware platform including specially designed processor chips and network and system racks, which are expensive to implement and require a long time to design and implement. On the other hand, the [...]

2.2. Conceptual Design of MegaProto

Hereafter, the word "MegaProto" is used to refer to our overall prototyping system used for the feasibility study of [...]ing block as the basic unit. Neither the performance nor the power consumption criteria can be achieved with today's high-end CPUs such as Intel Xeon, Intel Itanium2 or AMD Opteron, even with multi-way SMP configuration. However, it is possible to satisfy both criteria using state-of-the-art, low power CPUs with DVS and very low voltage drive if we can aggregate 10 to 20 CPUs in a single 1U chassis. Even using blade-style cards, it is impossible to achieve [...]

[...] node processor, the network bandwidth can be reduced to several hundred MByte/sec. This range can be covered by trunking several channels of a Gigabit Ethernet, and we decided to use dual channels per processing node. It is quite important from the viewpoint of achieving a high performance/cost ratio to introduce a commodity Gigabit Ethernet as the interconnection. Since recent Gigabit Ethernet switching fabric is quite inexpensive and small for 10 to 20 ports of connection, it is possible to implement the basic switch itself on the mother board and to connect all processing nodes which are mounted on the same mother board. Hereafter, we refer to this 1U chassis of the building unit containing multiple CPUs and intra-connection network switches as the "cluster unit".

Figure 1 shows the conceptual view of the 1U cluster unit and the overall system. In the figure, 16 CPUs are equipped with two channels of Gigabit Ethernet NICs and individual two switches are mounted on a cluster unit. A single system rack contains 32 cluster units and interconnection switches, and finally hundreds of system racks make up a Peta-Flops system.

Figure 1. System configuration

3. Design and Implementation of MegaProto/E

In this section, we describe the detailed design and implementation of MegaProto/E compared with the previous version, MegaProto/C [2, 6].

3.1. Implementation of MegaProto/C

Before introducing the detailed design and implementation of MegaProto/E, we first describe the implementation of the previous model. We planned to develop the MegaProto as the first and second versions of the prototype according to the availability of parts and modules. When we started the design and implementation of MegaProto, Transmeta Crusoe was the best candidate for the CPU to be used, and IBM Japan had already provided a processor card with CPU, memory and I/O bus extension as a commercial product for embedded controlling systems. In MegaProto/C, Transmeta Crusoe (TM5800) with 933 MHz clock frequency was employed. It has the peak performance of 933 MFlops because it can only issue a single floating point operation per clock. Thus, the peak performance with 16 CPUs on a cluster unit is limited to 14.9 GFlops, and it could not achieve the goal described in the previous section.

However, we considered it as a good start for the development because we just had to develop a mother board to contain 17 of these processor cards as daughter cards. Therefore, we decided to develop the mother board at the first stage of the plan and to develop a new daughter card later when the Efficeon processor became available. Thus the production schedule of the Efficeon processor suited our two-staged plan. Actually, MegaProto/C (with Crusoe) was a kind of "prototype of a prototype" for software development, including Linux kernel tuning, drivers for NICs and switches, compiler and MPI library settings, as well as determining the environment of power consumption measurements [7].

On a cluster unit, there are two categories of interconnection networks, the data network and the management network. Hereafter, the 16 processor cards for computation are called "computation nodes" while the processor card for system management is called a "management node".

Data Network: It consists of two individual Gigabit Ethernet with switches. Each computation node is equipped with dual Gigabit Ethernet ports, and each port is connected to a 24-port Gigabit Ethernet switch (Broadcom BCM5692). Since only the computation nodes are connected to this network, there are 8 unconnected ports on each switch. These 8 links are connected to external RJ-45 ports for inter-unit connection outside the cluster unit (See Section 5). A computation node can drive [...]

Figure 2 shows the block diagram of MegaProto/C cluster unit. Only the management node is equipped with a 2.5 inch hard disk drive with 60 GByte capacity to contain all system files for the 17 nodes in the cluster unit. At the system boot time, all disk-less computation nodes are booted via the Management Network sharing the binary images on the HDD of the management node. User's home directories are built on the outside file server to be shared by multiple cluster units through the external links of the Management Network.

3.2. Design of MegaProto/E

As mentioned in the previous subsection, we designed and implemented the second version of MegaProto while simultaneously constructing the software environment of MegaProto/C and evaluating it. The new version of cluster unit was named "MegaProto/E" (with Efficeon). On this version, we developed a new processor card (daughter card) equipped with an enhanced processor, memory and I/O bus compared with those on MegaProto/C. Several minor changes were also made to the mother board to improve the system stability.

The processor card was designed to fit the connection socket of the mother board of MegaProto/C, however, the [...]

[...] peak bandwidth of 500 MByte/sec. As a result of using a commercial embedded controller module as the computation node on MegaProto/C, there was a severe bandwidth bottleneck on it. This was improved by using an upgraded I/O bus as reflected by several benchmark performances (See Section 4). The memory throughput was also improved from SDR-133 to DDR-266 in addition to the doubled capacity on MegaProto/E.

Since the computation performance is not directly limited by the performance of the management node, the processor card for the management node was kept the same as that used in MegaProto/C, that is, we did not use an Efficeon processor here. Therefore, the management node and computation nodes had a heterogeneous CPU configuration on MegaProto/E. There was no actual problem encountered at this point.

3.3. Implementation of MegaProto/E

As described above, the main work done in implementing MegaProto/E was on the processor card for computation nodes. The improvements to the processor card from that of MegaProto/C are summarized in Table 1. In particular, the enhancements to memory throughput and PCI bus are expected to be reflected in performance improvements of both the single CPU and parallel processing derived by [...]

[...] that of Crusoe CPUs, the power consumption on the memory module and the bus and the PCI-X bridge are greater. As a result, the power consumption of each processor card is slightly increased, and the maximum total power consumption of the MegaProto/E cluster unit is 320 W while that of MegaProto/C is 300 W. However, this small power increase is acceptable considering the greatly enhanced memory and I/O bus performance as well as more than twice the floating point performance of the CPU. A photograph of the cluster unit is shown in Figure 3.

The most important performance improvement on MegaProto/E is that we are able to obtain an excellent performance/power/space ratio of 1.024 TFlops/10.24 kW/rack, which satisfies our goal.

3.4. Air Cooling

MegaProto/E is equipped with ordinary small sized multi-fan to fit to its 1U cluster unit chassis. We have taken much care of the air flow issue because 17 node processors are implemented only with small heat sinks without individual cooling fans. We designed the mother board of the unit to make the air flow over all heat sinks. However, it is impossible to control the surface temperature of all processors evenly since the air flows from the bottom to the top in Figure 3 where four processors on the top of three horizontal blocks.

To confirm the actual temperature on the different locations, we measured the approximate temperature on the heat sinks of several points by a thermography camera. Figure 4 shows the result image of thermograph and the thermal transition when running NAS Parallel Benchmark kernel CG (class-B) with 16 computation nodes. On the thermograph, each small rectangle displays a node processor. As shown here, even the highest temperature on the surface is lower than 40°C and the difference between points A and C is within 8°C. As a result, the air cooling on MegaProto/E works well and actually we have no problem on thermal condition so far.

Figure 4. Thermograph and temperature transition on NPB kernel CG

4. Performance Evaluation

In this section, we evaluate the basic performance of a single cluster unit of MegaProto/E compared with that of MegaProto/C and of an ordinary high-end PC server with dual Xeon in SMP configuration. To assist in comparing network performance, we also refer to the performance of a two-node system with the same configuration of dual-Xeon servers.

The benchmark programs referred to in this evaluation are the commonly used HPL (High Performance Linpack) [8] and NPB (NAS Parallel Benchmarks) [9] kernels. For NPB kernels, the problem size is class-A. For HPL, the performance with N = 10,000 is shown. All sources were compiled with gcc/g77 version 3.2.2, linked with LAM-MPI version 7.1.1, and executed using a Linux kernel version 2.4.22mmpu. The environment of the dual-Xeon 1U server had similar software environments but slightly different versions; gcc/g77 version 3.4.3, LAM-MPI version 6.5.6 and Linux kernel 2.4.20-20.7smp. For all the benchmarks, only a single channel Gigabit Ethernet was used in order to avoid software overhead of trunking and to keep the fairness of the two-node dual-Xeon servers.

4.1. Performance Improvements from MegaProto/C

Figure 5 shows the overall performance comparison between MegaProto/C and MegaProto/E. In all graphs, the speed-up ratios compared with the performance of MegaProto/C with 4 CPUs are shown. As shown in these results, the performance of MegaProto/E is between 1.06 to 2.38 times that of MegaProto/C.
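The peak-performance and rack-level figures quoted in Sections 3.1 and 3.3 follow from simple arithmetic. The short Python sketch below recomputes them from the numbers given in the text (933 MFlops per Crusoe node, 16 computation nodes, 32 GFlops and 320 W per MegaProto/E unit, 32 units per rack). The constant and function names are illustrative only and are not part of any MegaProto software; the rack totals also assume that the power of the interconnection switches in the rack is not counted.

```python
# Figures taken from the text; all names here are illustrative assumptions.
CRUSOE_PEAK_MFLOPS = 933       # one floating point operation per clock at 933 MHz
COMPUTE_NODES_PER_UNIT = 16    # computation nodes (the management node is excluded)
UNIT_PEAK_GFLOPS_E = 32.0      # MegaProto/E cluster unit peak, as quoted
UNIT_MAX_POWER_W_E = 320.0     # MegaProto/E cluster unit maximum power, as quoted
UNITS_PER_RACK = 32            # cluster units per system rack


def unit_peak_gflops(nodes: int, per_node_mflops: float) -> float:
    """Peak GFlops of one cluster unit from its per-node peak in MFlops."""
    return nodes * per_node_mflops / 1000.0


def rack_totals(units: int, unit_gflops: float, unit_watts: float) -> tuple[float, float]:
    """Return (peak TFlops, maximum kW) for a rack of identical cluster units."""
    return units * unit_gflops / 1000.0, units * unit_watts / 1000.0


megaproto_c_peak = unit_peak_gflops(COMPUTE_NODES_PER_UNIT, CRUSOE_PEAK_MFLOPS)
rack_tflops, rack_kw = rack_totals(UNITS_PER_RACK, UNIT_PEAK_GFLOPS_E, UNIT_MAX_POWER_W_E)

print(f"MegaProto/C unit peak: {megaproto_c_peak:.1f} GFlops")
print(f"MegaProto/E rack: {rack_tflops:.3f} TFlops at {rack_kw:.2f} kW")
```

Under these inputs the sketch gives about 14.9 GFlops for a MegaProto/C unit and 1.024 TFlops at 10.24 kW for a rack of MegaProto/E units, which agrees with the values stated in the text.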
We analyzed these results as follows:

[...] This performance result seems to be anomalous because EP is basically a floating point bound benchmark.

FT, MG: 1.37 to 1.65 times performance gain is due basically to the improvement in floating point operation speed, which is more than twice that of MegaProto/C.

CG: The communication data amount in CG is larger than other benchmarks. The performance improvement on PCI-X bus gives rise to a performance gain of 2.38 times.

HPL: 1.74 times performance gain is obtained as a result of the upgraded floating point performance similar to that of FT and MG. However, the efficiency of Linpack performance to the theoretical peak is only 30.1% due to the small capacity of the memory. MegaProto/C achieved 38% of peak performance with 16 processors, and the memory capacity problem is more serious on MegaProto/E in reference to its powerful floating point performance.

IS: Almost no performance gain because of the lack of floating point operations. The gain of CPU clock frequency is only 7%, which is reasonable.

(Figure 7 annotation: each colored route is mapped on different VLAN-id.)

[...] Xeon, two-node dual-Xeon, MegaProto/C and MegaProto/E. A Dual-Xeon system is a single node PC server in SMP configuration, and two-node dual-Xeon (labeled as "dual-Xeon x 2") is a small cluster to connect two dual-Xeon nodes with a single Gigabit Ethernet. All performances are given relative to that of a dual-Xeon. The PC server used here was Appro 1124Xi with 3.06 GHz Intel Xeon and 1 GByte of DDR memory. The total TDP and peak performance of processors were 170 W and 12.2 Gflops, respectively [10], and the maximum AC power rating of the entire 1U system was 400 W. Since these specifications are all comparable to our cluster unit with 16 processors, the dual-Xeon server is a good reference for performance comparison.

First, we can see the performance of MegaProto/E always exceeds that of dual-Xeon, ranging from 1.30 to 3.69 times improvement. In MG, EP and FT, it achieves a remarkable score. Especially for MG and FT, which are CPU performance and memory throughput bound benchmarks, MegaProto/E achieves excellent performance. Although MegaProto/C showed a markedly inferior performance to that of the dual-Xeon for CG and HPL, MegaProto/E overcomes this with its improved network bandwidth and increased memory capacity.
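As a rough cross-check of the statement that the dual-Xeon server is a comparable reference, the sketch below puts the quoted specifications side by side: 12.2 GFlops peak and a 400 W maximum AC rating for the dual-Xeon 1U server, against 32 GFlops peak and 320 W maximum for one MegaProto/E cluster unit. Peak performance per maximum watt is used here only as a coarse illustrative metric, not as a measured quantity from the evaluation, and all names in the code are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SystemSpec:
    """Quoted peak performance and maximum power of one 1U system."""
    name: str
    peak_gflops: float
    max_power_w: float

    @property
    def peak_gflops_per_watt(self) -> float:
        # Coarse metric: theoretical peak divided by maximum rated power.
        return self.peak_gflops / self.max_power_w


# Values as quoted in the text for the two 1U systems.
dual_xeon = SystemSpec("dual-Xeon 1U server", 12.2, 400.0)
megaproto_e = SystemSpec("MegaProto/E cluster unit", 32.0, 320.0)

peak_ratio = megaproto_e.peak_gflops / dual_xeon.peak_gflops
power_ratio = megaproto_e.max_power_w / dual_xeon.max_power_w

for spec in (dual_xeon, megaproto_e):
    print(f"{spec.name}: {spec.peak_gflops_per_watt:.3f} peak GFlops per W")
print(f"peak ratio: {peak_ratio:.2f}, power ratio: {power_ratio:.2f}")
```

The comparison uses theoretical peaks only; the measured benchmark ratios reported in this section depend on memory capacity and network bandwidth as well, so they need not track the peak ratio.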
With the exception of CG and HPL, MegaProto/E achieved higher scores than the two-node dual-Xeon system, even though it runs with less than half the power consumption of the two-node dual-Xeon. Since both systems are based on a single channel Gigabit Ethernet, this means [...]

Figure 7. [...] (figure labels: upper-level switches, lower-level switches)

[...] Gigabit Ethernet switches on MegaProto/E is in Layer-2, a so-called "broadcast storm" occurs if we simply connect two or more uplink ports of multiple cluster units. The broadcast storm occurs when multiple links make one or more loops including the node PC and any of the intermediate Layer-2 switches.

There are two ways to solve this problem:

1. Using Layer-3 switches which have the IP-base rout[...]

[...] which is the physical limit of the IEEE802.1q protocol. The on-board Gigabit Ethernet switch on MegaProto can handle this mechanism, and we can scale up the entire system to utilize all the allowed tags. It seems expensive to introduce eight 8-port switches, but actually a switch can be virtualized into multiple logical switches with VLAN. If we carefully select the assigned VLAN-tag without any conflict in the system, it is possible to configure the system network with a minimum set of
