ABSTRACT In this paper, we propose the architecture of the ARM-based server system and physically build the ARM-based server cluster board (SCB). Based on a real implementation, the efficiency of the microserver is evaluated to verify the benefits of the proposed scheme. In experiments, every SCB is equipped with four server-grade ARM quad-core processors for enterprise-class applications. The major difference between our work and other studies is that this paper tests the performance of server-grade processors, whereas previous works use cluster systems formed by embedded processors. The experimental results show that the SCB-based systems consume less power than the x86 systems. Since various types of cluster systems can be built with the SCB through either the PCI Express (PCIe) interface or the gigabit Ethernet interface, the proposed design of the ARM server board and the systems can be widely applied. The physical implementation and measurements show that the proposed architecture works well. The proposed energy-saving (i.e., green) server designs with low-power processors are suitable for related applications in the coming era of Industry 4.0 and the Internet of Things.
I. INTRODUCTION
In the era of Industry 4.0 and the Internet of Things (IoT) [1], key issues concerning performance and energy efficiency have to be taken into account in system design, for example, in applications of embedded systems and sensor networks [2] - [5]. For multicore [6] - [11] chip vendors, increasing processor core density is a typical method used to address these issues. Well-known examples of multicore processors are the Snapdragon processor from Qualcomm, the ARMADA XP processor from Marvell, the Cell B/E processor from IBM, and the Xeon Phi coprocessor from Intel. To further improve the density within the server case, these multicore processors are often packaged as a separate device, external to the host system, e.g., the Cell Accelerator Board from Mercury Computer Systems, the Xeon Phi coprocessor board from Intel, and the Quad-Node EnergyCard from Calxeda.
Recently, microservers have been developed to address the needs of high server density and low power consumption; the ARM-based system [12], [13] is one of the mainstream solutions adopted by the big server vendors. For example, Dell and Calxeda both announced low-power server solutions based on quad-core ARM processors [12], [13]. Unfortunately, while ARM server products are available on the market and a few performance studies have been performed for ARM-based systems [14] - [16], to the best of our knowledge, no other work studies the performance and energy consumption of parallel and server workloads on server-grade ARM processors.
In this paper, we present the architecture design of a custom-built ARM server board, called the ARM-based Server Cluster Board (SCB), and evaluate the efficiency of the SCB with various types of workloads. We adopt 32-bit ARM multicore processors to develop the system. The developed system can be operated in two modes: i) connecting several ARM boards together to form a cluster system, and ii) plugging the card into an existing computer system as an accelerator card. The latter mode represents a heterogeneous computing system. The SCB consists of four ARM quad-core processors and can support up to four ARM server nodes. The SCB is packaged as a PCI Express (PCIe) interface card, which communicates with the host system or other devices on the system via the host PCIe bus. Additionally, the on-board gigabit Ethernet ports allow the SCB to connect to external systems, which improves the flexibility of networking operations. One can build ARM server clusters easily with the SCB, as standard interconnects are adopted by the board.
We evaluated the performance of the SCB using the parallel benchmarks and web server software. To further analyze the energy efficiency, the experimental results are compared with those for the x86 [11] machine. Our results show that the SCB requires 39% less energy than the x86 systems for parallel workloads due to its low power design. On the other hand, our SCB is up to 6x more energy efficient than the x86 system for the server workload.
The rest of the paper is organized as follows. Section II compares the SCB with existing works. The internal architecture, the usage models, and the programming models for the SCB are discussed further in Section III. We measure the performance and consumed energy of the SCB and x86 system in Section IV. Finally, the conclusion is presented in Section V.
II. RELATED WORK
As this paper covers the design of the ARM server and the performance/power evaluation of such a system, we compare our work with existing ARM server designs in Subsection A of Section II and the studies evaluating the performance/energy trade-offs of ARM-based systems in Subsection B of Section II.
A. ARM SERVER DESIGNS
It is interesting that the latest ARM servers are both built with ARM quad-core processors. Copper from Dell is a system that supports up to 48 independent servers in 12 sleds (server cards) within a 3U chassis; each sled has four ARMADA XP processors [12] . Based on the specification on its website, it is similar to our SCB, except for the frequency of the processor and the maximum supported DRAM size. Another difference is that the SCB design allows inter-processor communication through the PCIe interface, which is not indicated in their specification.
Calxeda has developed the Quad-Node EnergyCard, which features the company's quad-core ARM processor SoCs (Server-on-Chips), for building ARM-based cluster systems. The primary difference between the EnergyCore processor and the ARMADA XP processor is that the former is equipped with faster networking interfaces, eight 10 gigabit fabric links, to facilitate networking among processor chips. To support high-speed networking, a crossbar switch is built inside the chip for configuring the network topology and routing network traffic. While the high-speed interconnect is a plus for the system, further studies are required to determine if the high-speed fabric is power-efficient for current ARM processors.
In summary, the major difference between the above systems and the SCB is the networking interface for inter-processor communication. The Copper system uses a gigabit Ethernet interface, whereas the EnergyCore Server adopts 10 gigabit network links. Compared with the two systems, in addition to the gigabit Ethernet interface, PCIe is an alternative mechanism for processors in the SCB system to talk with each other. Each processor theoretically has 2GB/s of bandwidth delivered by the PCIe Gen2 interface (in 4 lanes), which is comparable to the 10 gigabit network link offered by the EnergyCore Server. Thus, we believe that the SCB is representative for evaluating the efficiency of ARM servers.
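The bandwidth comparison above can be checked with a short calculation. The sketch below assumes the standard PCIe Gen2 signalling parameters (5 GT/s per lane with 8b/10b encoding) and compares the per-direction figure with the raw line rate of a single 10 gigabit link; the function name and default arguments are illustrative and do not come from the SCB design documents.

```python
def pcie_bandwidth_gbytes(lanes, gigatransfers_per_s=5.0, payload_bits=8, line_bits=10):
    """Per-direction PCIe bandwidth in GB/s.

    Assumes PCIe Gen2 defaults: 5 GT/s per lane and 8b/10b encoding,
    i.e. 8 payload bits for every 10 bits on the wire.
    """
    gbits_per_s = lanes * gigatransfers_per_s * payload_bits / line_bits
    return gbits_per_s / 8  # convert gigabits to gigabytes


# Four Gen2 lanes, as stated for each processor on the SCB.
pcie_gen2_x4 = pcie_bandwidth_gbytes(lanes=4)

# Raw line rate of one 10 gigabit link, ignoring protocol overhead.
ten_gigabit_link = 10 / 8

print(f"PCIe Gen2 x4: {pcie_gen2_x4:.2f} GB/s per direction")
print(f"10 gigabit link: {ten_gigabit_link:.2f} GB/s raw")
```

Both figures are theoretical peaks; the bandwidth actually delivered depends on protocol overhead and driver behavior, which this estimate does not model.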
B. PERFORMANCE AND ENERGY EFFICIENCY FOR ARM SYSTEMS
Recently, as the concept of microservers has become popular, one of the emerging research trends is to compare the efficiency of the ARM and x86 systems and to evaluate the tradeoff between the performance and power consumption of ARM systems. An informal study [17] performed experiments on the ARM processor and low-end x86 processors and claimed that the ARM processors are more power efficient than the x86 processors.
Blem et al. examined the performance, power, and energy trade-offs for multiple types of benchmarks on the x86 and ARM platforms [15]. In particular, they compared the microarchitectural differences among the Intel Atom, Intel i7, ARM Cortex-A8, and ARM Cortex-A9 processors for workloads including mobile, desktop, and server applications. The challenges, missteps, and hardware/software bugs are outlined in the paper to facilitate similar studies in the future. They concluded that the x86 implementations generally consume more power than the ARM implementations. In comparison with our work, while Blem et al. provide extensive experiments for several processors with different Instruction Set Architectures (ISAs) [11], [16] and microarchitectures, they did not present results for parallel workloads.
Padoin et al. [18] measured the time-to-solution, peak power, and energy-to-solution of the six NPB (NAS Parallel Benchmarks 1 ) programs [19] on the x86 systems with 16 and 64 hardware threads and on the ARM system with four hardware threads. They concluded that the x86 system is more efficient than the ARM system. While the ARM system has less peak power consumption, the energy spent on the ARM system to obtain the results is almost always greater than that of the 16-thread x86 system due to the much longer execution time to execute the programs.
Ou et al. examined the energy- and cost-efficiency of server-related workloads, including Web server throughput, in-memory database, and video transcoding, on an Intel quad-core workstation and on an ARM cluster system with four dual-core ARM Cortex-A9 processors [20]. They used energy efficiency [21], i.e., the ratio of the work conducted to the energy consumed, to assess the systems. Their experimental results show that these workloads are more energy-efficient on the ARM cluster than on the x86 workstation. In particular, the energy efficiency ratio of the dual-core system to the quad-core system is about 9.5 for the in-memory database application.
Göddeke et al. evaluated the performance and energy consumption of partial differential equation applications on an ARM-based cluster with 96 ARM Cortex-A9 dual-core processors and on an x86-based cluster with 32 dual-socket quad-core Intel Xeon processors [14]. They quantitatively evaluated the trade-off between the total execution time and the energy consumption of the applications under different configurations on both platforms. They found that for some workloads, reducing total energy is possible on the ARM-based system with moderate slowdowns compared to the x86-based cluster, but for compute-intensive applications, the gap in peak performance between the two architectures is too large to save energy with ARM systems.
The major difference between our work and the above studies is that our work tests the performance of server-grade processors, whereas previous works use cluster systems formed by embedded processors, i.e., OMAP 4330 from TI and Tegra 2 from NVIDIA. While the ARMADA XP processor and the embedded processors all share a common ISA (ARM v7), the configurations of the server-grade processor are different from the others. For example, the ARM system built by Padoin et al. has four hardware threads and a 100 Mbps Ethernet, which is not as powerful as the ARMADA XP processor that we used in this work. 2 Table 1 gives a detailed comparison of studies on ARM systems.
III. ARM SERVER CLUSTER BOARD (SCB)
In this section, we describe the SCB from different perspectives, from hardware architecture to the software environment. We first address the architecture of the ARM SCB and the systems that can be built with it. From a software perspective, we introduce the computing paradigms for the SCB-related systems and the programming methodologies of these systems.
A. SYSTEM OVERVIEW
Figure 1 depicts the architecture of the ARM SCB. The major components of the SCB are four ARMADA XP processors, namely, MV78460 processors from Marvell. The configuration of the processors on the SCB is listed in Table 2. Each processor has a dedicated link to the PCIe switch, meaning that a processor has a maximum bandwidth of 2GB/s to communicate with other nodes. In addition, the on-board gigabit Ethernet ports can also be used for inter-processor communication.
The SCB can be viewed as a set of complete computer systems. From the user's point of view, the SCB is a distributed memory cluster system. Each system node runs a Linux operating system on a quad-core processor with its own memory address space. The terminal connection is available through the USB port. Similar to standard Linux-based systems, the remote login service is also available through the Ethernet port if the service is installed in the system.
Communication among server nodes either inside or outside the SCB can be done through the provided network interconnects, i.e., gigabit Ethernet 3 and PCIe. 4 Existing 32-bit parallel programs run on the SCB system with little modification (and in some cases with no modification), as the data can be passed to the general network interface card emulated by our software driver, where the data traffic is actually routed to one of two physical interfaces. It is worth noting that to avoid walking through heavy TCP/IP protocol layers via a conventional Ethernet network, a customized library could be used to take advantage of the native performance provided by the PCIe network.
B. COMPUTING PARADIGMS AND PROGRAMMING METHODOLOGIES
The computing paradigm of the SCB-based system depends on the ISAs [11] , [16] of processors on the system. If all of the processors have the same ISA, the system is referred to as a homogeneous computing system. Otherwise, it is referred to as a heterogeneous computing system. 
1) HOMOGENEOUS COMPUTING SYSTEM
Building a homogeneous computing system with SCBs can be achieved by plugging these boards into a PCIe baseboard that connects to a PCIe switch device. The baseboard provides both the electric power required by the SCB and the communication interfaces, as shown in Fig. 2 . Additionally, the system can be built and linked with an Ethernet network, as shown in Fig. 3 . Each SCB has an on-board power socket to connect with the extra power source. Note that while the SATA interface is provided on the processor chips for mounting hard disks to the system, in our current system deployment, the kernel image and programs are stored in the SD card. If hard drives are needed, an extra power source should be plugged into the on-board power socket.
A standard software stack can be installed on the built system for parallel computing, for example, the Linux operating system, compiler, debugger, and concurrent programming libraries. The message-passing model is a typical method of parallel programming on such a distributed memory system. As the message-passing implementations, such as Open MPI [22] and MVAPICH [23] for the Message-Passing Interface (MPI) standard, are available for ARM systems, the SCB systems can be seen as ordinary cluster systems, and existing message-passing programs should be ported to the system with little modification. In our experiments, Open MPI is installed in our system, and the parallel benchmarks run on the SCB with no modification. Please refer to Section IV for more details.
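As an illustration of how such a cluster might be described to Open MPI, the sketch below generates a hostfile with one entry per server node and four slots per node (one per core), together with the matching `mpirun` argument list. The hostnames, the helper names, and the benchmark path in the usage note are hypothetical; only the `hostname slots=N` hostfile syntax and the `--hostfile` and `-np` options are standard Open MPI usage.

```python
def build_hostfile(num_boards, nodes_per_board=4, slots_per_node=4, prefix="scb"):
    """Return Open MPI hostfile text for a cluster of SCBs.

    Each SCB carries four server nodes and each node has a quad-core
    processor, hence four slots per host. Hostnames such as
    "scb0-node1" are hypothetical placeholders.
    """
    lines = []
    for board in range(num_boards):
        for node in range(nodes_per_board):
            lines.append(f"{prefix}{board}-node{node} slots={slots_per_node}")
    return "\n".join(lines) + "\n"


def mpirun_command(hostfile_path, num_procs, program):
    """Build the argument list for launching an MPI program with Open MPI."""
    return ["mpirun", "--hostfile", hostfile_path, "-np", str(num_procs), program]


if __name__ == "__main__":
    # One SCB: 4 nodes x 4 cores = 16 MPI processes.
    print(build_hostfile(num_boards=1), end="")
    print(" ".join(mpirun_command("hosts.txt", 16, "./benchmark")))
```

With a single board this yields four host entries and a 16-process launch, which corresponds to the largest core count used in the scalability experiment.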
Furthermore, OpenMP [24], one of the popular shared memory programming libraries, is well supported on ARM systems. OpenCL [25] - [35] is a promising alternative to OpenMP. To connect all the multicore cluster nodes, these shared memory programming libraries can further work with MPI, forming a hybrid execution environment on the cluster system, e.g., OpenMP/MPI and OpenCL/MPI.
2) HETEROGENEOUS COMPUTING SYSTEM
A typical way to establish a heterogeneous computing system is to plug the SCB into an existing computer (either a PC or a server with different ISA). In such a closely-coupled system, where the data can be transferred among processors via the PCIe bus on the motherboard, the SCB can work under one of three execution scenarios: offload, hybrid, and standalone.
To offload jobs from the host system to a specialized hardware device, a special library from the vendor is often required for user-space programs to transfer the data from the host memory to the device memory and command the device to perform the tasks. This is similar to programming for special-purpose devices, e.g., FPGA (Field-Programmable Gate Array) and GPU (Graphics Processing Unit) [26], [31], [33], [36] cards. For instance, graphics cards from NVIDIA can be programmed via the CUDA (Compute Unified Device Architecture) [32] programming library.
While a specialized library can deliver the native performance offered by the hardware, the programs are highly platform dependent. OpenCL has been released as a standard framework for writing portable programs to execute on heterogeneous systems that are comprised of central processing units (CPUs), GPUs, and other types of processors/accelerators. The cross-platform standard has become popular in the heterogeneous computing community. The Xeon Phi coprocessor is a more recent example that supports OpenCL. Unfortunately, since OpenCL does not currently support the ARM-based compute devices that are adopted in the SCB, we are not able to perform an experiment in this context. We plan to perform these experiments in the near future after this feature is implemented.
For hybrid execution, the host system and the compute devices, i.e., SCBs, work together to execute programs concurrently. Again, MPI can be used in this system configuration since it provides a high-level abstraction for underlying architectures and is widely supported on different platforms.
In this sense, OpenMP/MPI and OpenCL/MPI can be adopted as well. Finally, for standalone execution, where SCBs work independently from the host system, they act exactly as the aforementioned homogeneous computing system.
IV. EXPERIMENT RESULT
In this section, we demonstrate the ARM SCB systems built in Section IV-A. Before presenting the experimental results, we describe the system setups and performance metrics used in the experiments in Section IV-B. Next, the scalability of the SCB system under various workloads and hardware configurations is illustrated in Section IV-C to examine both the functional correctness and delivered performance for the SCB system. Finally, we performed three case studies to evaluate the efficiency of the proposed design with parallel and server workloads against the x86 system in Section IV-D, IV-E, and IV-F. 
A. THE ARM SCB PROTOTYPE
The prototype of the ARM SCB was built based on the design elaborated in Section III-A. Figure 4 and Figure 5 show the front and back sides of the ARM SCB. The front side contains major system components, such as the Marvell ARMADA XP processors and networking modules for inter-node communications via PCIe and Ethernet interfaces, whereas the back side contains the system memory and USB slots. Figure 6 and Figure 7 illustrate the physical systems that were built based on the developed ARM SCBs.
It is worth noting that a two-level PCIe switching mechanism was adopted on the SCB; i.e., each of the quad-core processors is linked to a level-1 PCIe switch, which bridges the ARM processor and the level-2 switch (shown in Fig. 4). Because the level-1 PCIe switches are equipped with DMA engines, our hierarchical design offers direct data transfer between server nodes on the SCB. Compared with the single PCIe switch design adopted by Marvell, our design offers a way to move data between server nodes without the intervention of the host system. While our design avoids the redundant data copy, the power consumption and the latency of data accesses on the SCB system are increased. Further experiments are required to analyze the trade-offs between conventional designs and our design.
B. EXPERIMENTAL SETUP
As described in Section III, the SCB system consists of four nodes, each of which is hosted by an ARM quad-core processor. For the system scalability experiment, all four system nodes in the SCB system are used to test the scale-up and scale-out performance. For the case studies, only a single ARM node is activated to match the configuration of the x86 system. The detailed experimental setup is listed in Table 3. Note that gigabit Ethernet is used throughout the experiments in this paper for exchanging data among computer nodes. While we would like to test the capability of the PCIe interface, its driver was still under development at the time of writing. Hence, the built cluster system is of the form shown in Fig. 6.
We used the ratio of saved energy to indicate how much energy can be saved when adopting the ARM system for running the given workload instead of using the x86 system. Equation (1) defines the ratio of saved energy formally, in which $E_{\mathrm{x86}}$ and $E_{\mathrm{ARM}}$ represent the total energy consumed by the x86 and ARM systems, respectively. Note that as the Ethernet network is adopted for communication among server nodes, the measured power numbers exclude the power consumed by the PCIe switches.
The slowdown ratio (SR) is used to measure the performance difference between the two systems. Equation (2) shows how the SR is computed, in which $T_{\mathrm{x86}}$ and $T_{\mathrm{ARM}}$ are the times required by the systems to finish the same given workload. Finally, as web servicing is used in the case study, the energy efficiency [21] of the systems is defined as the number of web requests that can be handled per unit of energy. The larger the number, the better the efficiency of the system.
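$$E_{\mathrm{saved}} = \frac{E_{\mathrm{x86}} - E_{\mathrm{ARM}}}{E_{\mathrm{x86}}} \times 100\% \qquad (1)$$
$$SR = \frac{T_{\mathrm{ARM}}}{T_{\mathrm{x86}}} \qquad (2)$$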
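The three metrics defined in this subsection can be expressed in a few lines of code. The following is a minimal sketch, assuming that the saved energy is taken relative to the x86 energy and that the slowdown ratio is the ARM time divided by the x86 time; the function names and the sample measurements are hypothetical and are not taken from the experiments.

```python
def saved_energy_ratio(e_x86, e_arm):
    """Fraction of the x86 energy saved by running the workload on ARM."""
    return (e_x86 - e_arm) / e_x86


def slowdown_ratio(t_x86, t_arm):
    """How many times longer the ARM system takes than the x86 system."""
    return t_arm / t_x86


def energy_efficiency(requests, energy_joules):
    """Web requests handled per joule of consumed energy."""
    return requests / energy_joules


if __name__ == "__main__":
    # Hypothetical measurements for one workload (not from the paper).
    e_x86, t_x86 = 1000.0, 10.0   # joules, seconds
    e_arm, t_arm = 400.0, 50.0

    print(f"saved energy: {saved_energy_ratio(e_x86, e_arm):.0%}")
    print(f"slowdown:     {slowdown_ratio(t_x86, t_arm):.1f}x")
    print(f"x86 req/J:    {energy_efficiency(100_000, e_x86):.0f}")
    print(f"ARM req/J:    {energy_efficiency(100_000, e_arm):.0f}")
```

In this made-up case the ARM system is five times slower yet still uses 60% less energy, which is the kind of trade-off the case studies examine.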
C. SCALABILITY OF THE SCB SYSTEM
We evaluated the scalability of the SCB system with the entire NPB benchmark suite. 5 Problem classes A, B, and C were used as the input data sets for the benchmarks. 6 Each benchmark was run by varying the number of hardware cores from 1 to 16 on the SCB in multiples of 2. The runtime measurements presented in this study are an average of five runs, excluding outliers. Figures 8-10 depict the speedups of the programs for different problem sets and hardware threads. The linear speedups obtained by the embarrassingly parallel program, EP, under different workloads demonstrate that the SCB system is scalable. That is, its performance scales well both within the quad-core processor and across the processors. On the other hand, the IS program scales poorly beyond eight cores. The poor scalability is primarily due to extensive off-chip communications, where the communication time dominates the total execution time.
5 The NPB benchmark programs are derived from computational fluid dynamics applications and comprise five kernels, IS for Integer Sort, EP for Embarrassingly Parallel, CG for Conjugate Gradient, MG for Multi-Grid on a sequence of meshes, and FT for discrete 3D fast Fourier Transform, and three pseudo-applications, BT for Block Tridiagonal solver, SP for Scalar Pentadiagonal solver, and LU for Lower-Upper Gauss-Seidel solver.
6 Problem classes A, B, and C are standard test problems for the NPB benchmark programs, where the data size is four times larger from one class to the next.
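The speedup figures can be derived from raw timings along the following lines. This is a sketch under stated assumptions: the text does not say how outliers were identified, so the code simply drops the fastest and slowest of the five runs before averaging, and all timings shown are hypothetical.

```python
from statistics import mean


def trimmed_mean(runs):
    """Average the runs after dropping the minimum and maximum.

    Assumption: this stands in for the unspecified outlier rule.
    """
    if len(runs) < 3:
        return mean(runs)
    return mean(sorted(runs)[1:-1])


def speedups(times_by_cores):
    """Map core count -> speedup relative to the single-core runtime."""
    baseline = trimmed_mean(times_by_cores[1])
    return {n: baseline / trimmed_mean(t) for n, t in sorted(times_by_cores.items())}


if __name__ == "__main__":
    # Hypothetical runtimes in seconds, five runs per core count.
    timings = {
        1: [100.0, 100.0, 100.0, 90.0, 140.0],
        4: [25.0, 25.0, 25.0, 20.0, 40.0],
        16: [10.0, 10.0, 10.0, 9.0, 12.0],
    }
    for cores, s in speedups(timings).items():
        # Parallel efficiency is the speedup divided by the core count.
        print(f"{cores:2d} cores: speedup {s:.2f}, efficiency {s / cores:.2f}")
```

A program such as EP would show an efficiency close to 1 at every core count, whereas a communication-bound program such as IS would show the efficiency falling as nodes are added.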
D. PARALLEL WORKLOADS -NPB
We measured the time and energy required to execute the benchmarks in NPB on the ARM and x86 systems. A power meter was used to obtain the total power consumed by both systems. Note that the hyperthreading technology of the Intel Core i7 processor was disabled in this experiment to match the number of cores used in the ARM system. Figures 11-12 illustrate the energy saved, $E_{\mathrm{saved}}$, by the ARM system, where the benchmarks were executed with one and four processor cores, using one and four software threads, respectively. For the single-processor-core experiment (shown in Fig. 11), the power-efficient system consumed at least 39% less energy than the x86 system across all workloads. This was achieved at the cost of 8.6x slower execution speed on average. On the other hand, for the four-processor-core experiment, since the brawny processor operated at 2x higher frequency with a 4x larger last-level cache, it was 11.8x faster than the wimpy processor in terms of program execution on average. Hence, the average energy saving decreased to 26%, as shown in Fig. 12.
It is important to note that since a large portion of the benchmarks involves floating point operations, the hardware support for floating point operations in the ARM processor should be enabled; otherwise, the ARM system takes too much time to deliver the results; it is about 10x slower using software emulation for the floating point operations.
E. PARALLEL WORKLOADS -Hadoop BENCHMARKS
To establish the multi-node Hadoop cluster environment [37], we used another x86 system as the master node for receiving the user commands and distributing the jobs to the worker node, which could be either the ARM or the x86 server. The three programs, Grep, WordCount, and Sort, which come with the Hadoop source package, were tested in this experiment with data sizes that varied from 128 to 1,024 MB. We adjusted the heap memory size of the Java virtual machine on both the x86 and ARM worker nodes to be larger than 1GB to avoid performance degradation. Figure 13 depicts the saved energy of the ARM system against the x86 system for the single-processor-core experiment. Overall, the amount of saved energy for the small data size is encouraging, ranging from 44% to 75% with a 4x to 9x performance slowdown. The numbers decrease as the data size grows, which is different from the data shown in Fig. 11, where the saved energy is relatively steady across various data sizes. The reason for the degradation in energy saving is that we used a USB drive (USB 2.0 standard) as the primary means of data storage, and extensive disk I/O operations on the ARM system limit the overall performance. In our experiments, when the data size is larger than 256MB, the I/O performance delivered by the ARM system is considerably lower than that achieved by the x86 system. For example, the ARM server requires 6.1x longer than the x86 system to transfer 1GB of data. This problem can be solved by using a faster storage medium, such as a SATA II disk drive, which is supported by the Marvell processor.
The same trend is found in the four-processor-core experiment. However, as the processor cores on the same system node contend for the input data in data storage, the performance speedup against the single-core experiment is small. Hence, the energy saving of the ARM system against the x86 system drops. For the WordCount program, 1.8x and 1.6x speedups were achieved by the quad-core x86 and ARM systems, respectively, for handling the 128MB input data. The amount of energy saved decreased from 44% (single core) to 28% (four cores). Further optimization techniques would need to be developed to take advantage of the computing power of the multicore processor systems.
F. SERVER WORKLOADS
The efficiency of the ARM and x86 systems was evaluated using the web server benchmark lighttpd [38] . There was another x86 system included in the experiment that served as the client system to issue the web requests to the ARM and x86 servers. We made sure that the client machine was more powerful than the servers, so that it was not the source of a system bottleneck. The client system and the servers were all connected to the gigabit network switch.
The servers were tested with the same configurations, 50 concurrent connections and a total of 100 000 web requests, under different data sizes, i.e., web file sizes. The total power consumed by the server systems was measured by the power meter, omitting the power contributed by the client system and the network switch. In the following subsections, the efficiency of both systems is evaluated from two perspectives, i.e., the performance and energy consumed to complete the designated job (Section IV-F1) and the delivered performance per unit of consumed energy (Section IV-F2).
1) PERFORMANCE AND ENERGY TRADEOFF
We plot the saved energy and slowdown ratios for the ARM web server against the x86 server, as shown in Fig. 14. For small web files (SWF), i.e., 512B, both processors were busy servicing the web requests. In such cases, the energy consumed by the ARM system was only half of that consumed by the x86 counterpart, but 7.2x more time was required to handle the web requests. On the other hand, as the data size became larger, the load gradually shifted to the network subsystem for transmitting the web files, and hence, the performance gap between the two systems narrowed as the impact of CPU computing power on the performance decreased. More specifically, when the data size was larger than 2KB, the energy benefit of the ARM processor outweighed its performance loss. For example, up to 85% of the energy was saved by the ARM system at the cost of a 1.8x slowdown when the 8KB web files were tested.
2) ENERGY EFFICIENCY
Figure 15 illustrates the number of requests handled by both systems under the same energy budget, where the number of requests that can be served per unit of consumed energy was computed by dividing the total serviced requests by the consumed energy. Overall, the ARM system is more efficient for web servicing since the network performance dominated the file-serving workload. Under the given energy budget of one joule, the ARM server serviced at least 2x more requests than the x86 counterpart. In particular, up to 6x more requests were handled by the ARM system when the 8KB web files were serviced.
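As a rough consistency check, the requests-per-joule view of the web server results can be related to the saved-energy view. The derivation below assumes that both servers handle the same number of requests $N$ and that the saved-energy ratio is the fraction of the x86 energy avoided by the ARM system; the symbols $\eta$ and $N$ are introduced here for illustration only.

```latex
\eta = \frac{N}{E}, \qquad
\frac{\eta_{\mathrm{ARM}}}{\eta_{\mathrm{x86}}}
  = \frac{E_{\mathrm{x86}}}{E_{\mathrm{ARM}}}
  = \frac{1}{1 - E_{\mathrm{saved}}}
% E_saved = 0.50 (512B files): ratio = 1 / 0.50 = 2
% E_saved = 0.85 (8KB files):  ratio = 1 / 0.15 \approx 6.7
```

Under these assumptions, halving the energy corresponds to twice as many requests per joule, and an 85% saving corresponds to a ratio of about 6.7. Both values are roughly in line with the 2x to 6x range reported for the web server workload, bearing in mind that the quoted percentages are rounded.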

V. CONCLUSION
In this paper, we present the architecture design for the ARM-based Server Cluster Board (SCB). The internal architecture of the SCB, the usage models of the SCB systems, and the software programming methodologies are introduced. The performance and energy characteristics of ARM microservers are evaluated and discussed. Since the ARM processors have a strong software ecosystem, parallel programming on the SCB systems is similar to that on conventional servers. By plugging our ARM SCB into an x86 multicore PC, an x86/ARM hybrid cluster system is set up. In addition, we implement an ARM-based server cluster with the SCB and test the scalability of the cluster system. To compare their performance and energy efficiency, we run parallel and server workloads on the x86 and ARM systems. The results show that a significant amount of energy was saved (up to 85%) for the server workload in comparison with the x86 system with only a moderate slowdown, i.e., a 1.8x longer execution time. The physical implementation and analysis demonstrate the advantages of the proposed architecture. Since the processor frequency is an important factor to be considered when integrating multiple CPUs into a small PCB (Printed Circuit Board), the 64-bit ARM processor may be an alternative for building server systems. Future work could build 64-bit servers, compare the efficiency of the 32-bit and 64-bit systems, and evaluate an open-source implementation of OpenCL for ARM, such as Pocl, on a 64-bit SCB. With ARM applications becoming popular, the proposed ARM-based microserver architecture will be helpful for system design in the coming era of Industry 4.0 and the Internet of Things (IoT).
