In this paper, we extend an information-theoretic approach to computer performance evaluation to supercomputers. This approach is based on the notion of Computer Capacity, which can be estimated relying solely on the description of the computer architecture. We describe a method of calculating the Computer Capacity of supercomputers that includes the influence of the architecture of the communication network. The suggested approach is applied to estimate the performance of three of the top 10 supercomputers (according to the TOP500 June 2016 list) which are based on Haswell processors. For greater objectivity, we compare the results relative to the values of another supercomputer, which is based on Ivy Bridge processors (this microarchitecture differs from Haswell). The obtained results are compared with the values of the TOP500 LINPACK benchmark and the theoretical peak, and we arrive at conclusions about the applicability of the presented theoretical (nonexperimental) approach to the performance evaluation of real supercomputers. In particular, it means that estimates of the Computer Capacity can be used at the design stage of the development of supercomputers.
Introduction
The performance evaluation of computers and supercomputers is very important to developers and users. There are many benchmarks which estimate computer performance for specific programs, but applying a benchmark requires a working model of the computer or supercomputer, so benchmarks cannot be used at the design stage. There is also a theoretical method of performance evaluation called the theoretical peak, but it uses the IPC (instructions per cycle) value, which is given by the manufacturer and likewise cannot be estimated theoretically (without a working model).
In this paper, we present a theoretical method (i.e., one that works without any experiments on a working model of the examined computer) of estimating the performance of supercomputers that could be a good alternative to existing methods. This method was first presented in Ref. 1, where its theoretical basis was given. Later, it was applied to real computers with Intel processors. 2 The method is close in spirit to the main concept of Shannon's information theory 3 and uses the characteristic called Computer Capacity as a measure of performance. In this paper, we describe the application of the suggested method to the performance evaluation of Haswell processors and of three supercomputers based on them, all of which are among the top 10 supercomputers of the TOP500 June 2016 list. The examined supercomputers are Trinity (7th place), Hazel Hen (9th place) and Shaheen II (10th place); 4-6 each of them is a Cray XC40 model. In order to show the effectiveness of the suggested method, we compare the results obtained for those three supercomputers relative to the values of the supercomputer iDataPlex DX360M4 (66th place). 7 The architecture of iDataPlex DX360M4 is based on processors with the Ivy Bridge microarchitecture and differs from the architecture of the three supercomputers above, which are based on Haswell processors and have a different network structure and manufacturer.
Computer Capacity of Computing Systems
The concept of Computer Capacity (CC) was suggested in Ref. 1. This notion makes it possible to estimate the performance of computers based on the description of their architecture, without any experiments on them. The full detailed theory was presented in Ref. 1, so in Appendix A we give just a brief description, focusing on the main ideas.
Supercomputers and similar computer systems can be considered as a set of computing nodes connected via a network. It is expected that all calculations are performed on this system in parallel, so each node (furthermore, each core on each node) can operate independently except for communication via shared memory. A computing node is a set of processors, RAM and a network processor (NP), also called a network interface controller (NIC), placed on the same board. In Ref. 1 it was shown that if there is a computing system with $N \geq 1$ computing nodes $I_1, \ldots, I_N$, where any node can run individually and independently of the other nodes, then the capacity of this system is the sum of the capacities of the individual nodes $I_1, \ldots, I_N$. Thus the Computer Capacity (CC) of the computing system is

$$C_{cs} = \sum_{i=1}^{N} C_i, \tag{1}$$
where $N$ is the number of nodes, $C_i$ is the CC of the $i$th computing node and $C_{cs}$ is the CC of the computing system. Here, the CC of a computing node is considered as

$$C_i = \sum_{j=1}^{N_i} C^{core}_j + C^{np}_i, \tag{2}$$
where $N_i$ is the number of cores at the $i$th computing node, $C^{core}_j$ is the Computer Capacity of the $j$th computing core and $C^{np}_i$ is the CC of the network processor placed at the $i$th node.
Computer Capacity of Modern Processors
First, we consider some basic features of modern computers, which must be taken into account in the construction of Eq. (A.2).
Cache memory
A present-day processor includes fast-access memory in the core. This memory (called cache memory) is used to reduce the average time to access data from the main memory. The mechanism works as follows: when the CPU tries to access some data from memory, it looks in the cache memory first; if the data is not found there, it then looks in the main memory. Requested data is stored in the cache memory in accordance with the caching algorithm (which may vary for different processors). We need to take this feature into account when building Eq. (A.2). Let us consider a computer with instruction set $I$, main memory $M$ and cache memory $L$. If there is an instruction $s \in I$ that has a single operand which is a memory cell, $\tau(s)$ is the basic execution time of this instruction (excluding the memory access), $\tau(M)$ is the main memory access time and $\tau(L)$ is the cache memory access time, then it is represented in Eq. (A.2) as follows:

$$\frac{|L|}{X^{\tau(L) + \tau(s)}} + \frac{|M|}{X^{\tau(L) + \tau(M) + \tau(s)}},$$
where $|L|$ is the size of the cache memory and $|M|$ is the size of the main memory, both measured in units of the instruction operand size. It is necessary to explain the terms of the presented part of Eq. (A.2). The first term describes the situation when the memory operand is located in the cache, so the execution time of the instruction is the sum of the cache memory access time and the basic execution time. The number of such instruction types is equal to the size of the cache memory in units of the operand size, so after collecting similar terms we get a single term with the numerator equal to $|L|$. The second term describes the situation when the memory operand is located in the main memory. The processor first sends a request to the cache memory and, after receiving the answer, sends a second request to the main memory, so the execution time of this type of instruction is the sum of the cache access time, the main memory access time and the basic execution time. The numerator is obtained in the same way as for the first term. It is important to note that the latest processors can have up to three levels of cache memory, but the described case is easily extended to multiple cache levels.
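The extension to several cache levels can be sketched as follows. This is a minimal illustration under our own assumptions, not the program used in this paper: the function name `memory_operand_terms`, the rule that access times simply accumulate level by level, and the toy sizes and latencies are all hypothetical.

```python
def memory_operand_terms(x, tau_s, levels):
    """Sum of characteristic-equation terms for one single-operand instruction.

    levels: list of (size_in_operands, access_time) pairs, ordered from the
    fastest cache level down to main memory. Assumption: a miss at one level
    costs that level's access time before the next level is queried, so the
    access times accumulate.
    """
    total = 0.0
    cumulative_access = 0
    for size, access_time in levels:
        cumulative_access += access_time
        total += size / x ** (tau_s + cumulative_access)
    return total


# Toy numbers (hypothetical): one cache level with 4 operand slots and access
# time 1, main memory with 16 operand slots and access time 3, tau(s) = 1.
two_level = memory_operand_terms(2.0, 1, [(4, 1), (16, 3)])
```

With a single cache level followed by main memory, the function reduces to the two-term expression described above: the cache size over X to the power of cache access time plus basic time, and the main memory size over X to the power of both access times plus basic time.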
Pipeline
The next feature which needs to be described is the pipeline. All present-day processors have a pipeline whose task is to reduce the execution time of instructions. This is achieved by executing instructions in parts. A pipeline is a sequence of data processing elements where the output of one element is the input of the next one. Each element works independently, so, for example, while one instruction is at the decoding stage, another one can be at the computing stage. The execution of an instruction at one stage of the pipeline takes much less time than its complete execution, so for a single instruction the execution time grows, but for a sequence of instructions (a processor task) the execution time is significantly reduced. Our method has to take this feature into account in the definition of execution time. In view of the described feature, we define the execution time of an instruction as the maximum latency in the pipeline caused by this instruction. For example, let us consider a pipeline with four stages: the instruction decoder (stage 1), the register renaming stage (stage 2), the execution unit (stage 3) and the retirement stage (stage 4). Let there be two instructions, MOV r1,r2 and ADD r1,r2 ('r1,r2' means that both operands are registers). Instruction MOV executes for 1 clock cycle at the first stage, 1 cycle at the second, 2 cycles at the third and 1 cycle at the fourth. Instruction ADD has the following execution times: 1, 1, 3 and 1 cycles. Then the sequence of instructions MOV AX, CX (1); MOV BX, DX (2); ADD AX, BX (3); MOV CX, AX (4), for example, executes in the following way (each step is one clock cycle):
(1) Instruction (1) is decoded at stage 1.
[...] Instruction (1) has already left the pipeline, instruction (2) is still at stage 3, so instructions (3) and (4) are waiting.

As we can see, when a sequence of instructions executes in the pipeline, the execution time of each instruction is equal to the number of clock cycles that pass between the completion of the previous instruction and the completion of this one. So, in this case, the execution time of instruction MOV equals 2 clock cycles and, analogously, the execution time of instruction ADD equals 3 clock cycles.
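The four-instruction example can be reproduced with a small simulation. The sketch below rests on simplifying assumptions of our own: instructions flow through the stages in order, each stage holds one instruction at a time, an instruction that has finished a stage can wait for the next stage without blocking the one behind it, and data dependencies between registers are ignored. The function name `retirement_gaps` is hypothetical.

```python
def retirement_gaps(program, latencies):
    """Clock cycles between the completions of consecutive instructions."""
    n_stages = len(next(iter(latencies.values())))
    stage_free = [0] * n_stages  # cycle at which each stage becomes free
    finish_times = []
    for name in program:
        ready = 0  # cycle at which this instruction can enter the next stage
        for stage, cycles in enumerate(latencies[name]):
            start = max(ready, stage_free[stage])
            ready = start + cycles
            stage_free[stage] = ready
        finish_times.append(ready)
    return [later - earlier
            for earlier, later in zip(finish_times, finish_times[1:])]


# Stage latencies from the example: decode, rename, execute, retire.
LATENCIES = {"MOV": (1, 1, 2, 1), "ADD": (1, 1, 3, 1)}
PROGRAM = ["MOV", "MOV", "ADD", "MOV"]
gaps = retirement_gaps(PROGRAM, LATENCIES)  # gaps after instructions (2), (3), (4)
```

Under these assumptions each gap equals the latency of the slowest stage for that instruction (2 cycles for MOV, 3 for ADD), which is the execution time used in the text. The first instruction is excluded because its completion time also contains the time needed to fill the pipeline.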
Haswell Processors
In this section we describe how to find the Computer Capacity of Haswell processors. First of all, we need to describe the inherent features of these processors. Haswell processors have a register file with 168 integer registers and 168 vector registers. It means, for example, that an instruction can use 168 integer registers as an operand instead of 16 registers (this is done automatically at the register renaming stage). The next important feature is the translation of instructions into microoperations (so-called μops). Due to this mechanism, each instruction is translated into several μops, which are executed significantly faster and can often be executed independently. In most cases a μop is executed in 1 clock cycle, except for special cases (for example, loading data from memory). It is important to note that in most cases the execution time of an instruction equals the number of μops generated by this instruction. The number of dependency chains which can be executed in the pipeline simultaneously must also be taken into account. In Haswell processors this number is 4. In our method it means that we have 4 independent pipelines, so after calculating the Computer Capacity of a single pipeline of the Haswell processor, we need to multiply the obtained value by 4. Now we can describe the process of building the characteristic equation (A.2) for Haswell processors. First, we need to make a file with the list of all instructions in the primary format. The primary format means that each instruction is represented by the instruction name, the list of types of operands and the basic execution time. The obtained file contains 681 entries. The next step is to transform the obtained file into a file with the list of instructions in the final form. This form is obtained by substituting real numbers in place of operands according to their types. For example, the instruction "MOV r,r 1" is transformed to "MOV 28224 1".
The number 28224 is the number of different instructions "MOV r,r 1" in $I$, because each of the "r" operands can take one of 168 different values, so the number of different combinations is $168 \times 168 = 28224$. In order to transform the file with the list in the primary format automatically, we use a program based on the recursive descent parsing algorithm. This algorithm is used for the correct parsing of complex instructions (for example, the instruction MOV r8/r16,r8/r16 is represented as the set of instructions {MOV r8,r8; MOV r8,r16; MOV r16,r8; MOV r16,r16}). The file with the final form of the instruction list can be found in Ref. 8. It is also necessary to use a file with technical characteristics (the sizes of all cache levels, the size of RAM, the access times for all types of memory and the number of registers). When all these files are obtained, the program for calculating the Computer Capacity of the pipeline of a Haswell core can be used, which gives $C_p(I) \approx \log_2 524014684.672 \approx 28.965$ bits per clock cycle. As discussed above, each core has 4 pipelines, so $C = C_p \times 4 \approx 28.965 \times 4 \approx 115.86$ bits per clock cycle. To estimate the Computer Capacity of real processors, we need to multiply the obtained value by the clock rate in order to convert the measurement units to bits per second. Next, we need to multiply the result by the number of computing cores (this value may differ for different processors), as explained in Ref. 1 for multi-core processors. Equation (A.2) for Haswell processors after collecting similar terms is presented in the Appendix.
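The counting and the final arithmetic of this section can be checked with a short script. It is only an illustration: `expand_operands` is a hypothetical helper that uses a Cartesian product as a simplified stand-in for the recursive descent parser mentioned above, and the value of $X_0$ is copied from the text rather than recomputed from the instruction list.

```python
import math
from itertools import product

REGISTER_COUNT = 168  # integer registers in the Haswell register file (from the text)


def expand_operands(template):
    """Expand e.g. 'MOV r8/r16,r8/r16' into one entry per operand-type combination."""
    name, operands = template.split(" ", 1)
    choices = [operand.split("/") for operand in operands.split(",")]
    return [f"{name} {','.join(combo)}" for combo in product(*choices)]


def register_instruction_count(n_register_operands, registers=REGISTER_COUNT):
    """Number of distinct instructions when each register operand is free."""
    return registers ** n_register_operands


X0 = 524014684.672                    # largest root reported in the text
pipeline_capacity = math.log2(X0)     # bits per clock cycle, one pipeline
core_capacity = 4 * pipeline_capacity # four independent pipelines per core

expanded = expand_operands("MOV r8/r16,r8/r16")
mov_rr_count = register_instruction_count(2)
```

The expansion yields the four variants listed in the text, the count for two register operands is 28224, and the two capacities agree with the rounded values 28.965 and 115.86 bits per clock cycle.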
Network Processor
The main task of the network processor is to form a packet with data and send it through the network to another network processor, or to receive a packet from a different network processor. So the set of instructions for this processor consists of instructions which send a packet from one node to another (in our model each node contains a single network processor and each network processor corresponds to its node) or receive it. Taking into account the described features of network processors, we consider the characteristic equation of the network processor of the node with number $k$ as follows:
where minSize and maxSize are the minimum and the maximum possible sizes of a packet in the network, $N$ is the number of nodes in the supercomputer, $M_{i,j}$ is the number of different possible packets of size $i$ which can be formed in node $j$, and $T_j$ is the transmission time of a packet of size $i$ between node $k$ and node $j$. Here, the numerator $M_{i,k} + M_{i,j}$ is built in such a way because it includes both sending instructions ($M_{i,k}$) and receiving instructions ($M_{i,j}$).
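The displayed equation itself is not reproduced above. One form that is consistent with the definitions just given is shown below. It is an assumption and not a quotation of Eq. (3): in particular, the exclusion of $j = k$ from the outer sum is our guess, and $T_j$ should be read as depending on the packet size $i$ as well.

```latex
\sum_{\substack{j = 1 \\ j \neq k}}^{N} \;
\sum_{i = \mathrm{minSize}}^{\mathrm{maxSize}}
\frac{M_{i,k} + M_{i,j}}{X^{T_j}} = 1
```

Each term has the same shape as the terms of Eq. (A.2): the numerator counts the instructions (packets sent plus packets received) that share one execution time, and the exponent is that execution time.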
Aries interconnect
Three of the examined supercomputers are Cray XC40 models and use the Aries interconnect network. Let us consider the network structure. Cray XC40 systems are composed of blades with four nodes (each node includes two processors and memory).
. The NIC packetizes all requests and sends packets to the network.
. Each packet contains up to 64 bytes of data.
. Operating speed of an electrical link is 14 Gbps.
. Operating speed of an optical link is 12.5 Gbps.
. Communication between processors is performed through NICs (except the processors which are located at the same node).
Now our aim is to build Eq. (A.2) which is the same for each NIC in a supercomputer. The equation for NIC is as follows:
where $i$ is the size of the packet in bytes, $M_i$ is the number of different packets of size $i$ that can be formed in one node, $T^{el}_i$ is the transmission time of an $i$-byte packet through the electrical link, $T^{op}_i$ is the transmission time of an $i$-byte packet through the optical link, $N_{gr}$ is the number of nodes in a group and $N_{nd}$ is the number of nodes in the supercomputer. Let us take a closer look at Eq. (4). We need to consider three cases of packet transmission through the network:
(1) The data is transmitted between two nodes on the same blade. There are two transmissions: from the source node to the router and from the router to the destination node.
(2) The data is transmitted between two nodes from the same group, but on different blades. There are three transmissions here: from the source node to its router, from the source router to the router of the destination node (by electrical link) and from the router to the node.
(3) The data is transmitted between two nodes from different groups. There are also three transmissions: from the source node to its router, from the source router to the router of the destination node (by optical link) and from the router to the node.
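To give a feeling for the magnitudes involved, the sketch below computes the time needed to push a packet through a single electrical or optical link at the quoted speeds. It is a rough illustration under our own assumptions: only the payload bits are counted, 1 Gbps is taken as $10^9$ bits per second, and packet headers, router latency and the node-to-router hops are ignored, so these values are not necessarily the $T^{el}_i$ and $T^{op}_i$ used in Eq. (4). The function name `link_time_ns` is hypothetical.

```python
ELECTRICAL_GBPS = 14.0  # electrical link speed quoted in the text
OPTICAL_GBPS = 12.5     # optical link speed quoted in the text
MAX_PAYLOAD_BYTES = 64  # maximum data per packet quoted in the text


def link_time_ns(size_bytes, gbps):
    """Serialization time in nanoseconds: bits divided by Gbit/s gives ns."""
    return size_bytes * 8 / gbps


electrical_ns = link_time_ns(MAX_PAYLOAD_BYTES, ELECTRICAL_GBPS)
optical_ns = link_time_ns(MAX_PAYLOAD_BYTES, OPTICAL_GBPS)
```

For a full 64-byte packet this gives roughly 36.6 ns over the electrical link and 40.96 ns over the optical link, so under this simple model an inter-group transfer is slightly slower per link than an intra-group one.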
In this way, case (1) 
InfiniBand
Supercomputer iDataPlex DX360M4 uses InfiniBand FDR 10 as its interconnect. The method of calculating the Computer Capacity of the NP for InfiniBand is similar to the one previously discussed. The technical characteristics which are important for Eq. (3) are as follows:
. The minimum possible size of the transport packet in InfiniBand is 32 bytes (according to Ref. 10).
. The maximum possible size of the transmitted data is 4096 bytes.
. Theoretical effective throughput for a 1x link is 13.64 Gbit/s.
The CC of the network processor estimated for the InfiniBand FDR interconnect of the examined supercomputer iDataPlex is $\approx 9.24$ Gbits per second.
iDataPlex DX360M4
Now we can present the results for the supercomputer iDataPlex DX360M4. This supercomputer is built from Intel Xeon E5-2680 v2 processors, which have the Ivy Bridge microarchitecture. The Computer Capacity of Ivy Bridge is estimated in the same way as for Haswell, and its value is $C(I) \approx 108.582$ bits per clock cycle. We estimate the upper bound of the Computer Capacity, so the maximum value of the clock rate is used for the processors. For a single core of the E5-2680 v2 processor this gives $C(I) \approx 390.9$ Gbits per second. Each node of this supercomputer contains two processors (each processor has 10 cores). The number of cores of iDataPlex is 65320, so the number of nodes is 3266. With all the presented values we can calculate the total Computer Capacity of the supercomputer, which is given by Eq. (1); this value is $\approx 25563.52$ Tbits per second.
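The arithmetic behind this total can be retraced with the rounded values quoted in this section. The sketch below assumes that all nodes and all cores are identical, so that the node and system sums reduce to simple products; the function name `system_capacity_tbits` is hypothetical.

```python
CORES_TOTAL = 65320     # cores of iDataPlex DX360M4 (from the text)
CORES_PER_NODE = 2 * 10 # two 10-core processors per node
CC_CORE_GBITS = 390.9   # CC of one core, Gbit/s (rounded value from the text)
CC_NP_GBITS = 9.24      # CC of one network processor, Gbit/s (from the text)


def system_capacity_tbits(nodes, cores_per_node, cc_core, cc_np):
    """Total CC in Tbit/s for identical nodes with identical cores."""
    node_capacity = cores_per_node * cc_core + cc_np  # cores plus NP of a node
    return nodes * node_capacity / 1000.0             # sum over nodes, Gbit/s to Tbit/s


nodes = CORES_TOTAL // CORES_PER_NODE
total_tbits = system_capacity_tbits(nodes, CORES_PER_NODE, CC_CORE_GBITS, CC_NP_GBITS)
```

With these rounded inputs the script gives 3266 nodes and a total of about 25563.8 Tbits per second, which agrees with the reported 25563.52 Tbits per second to within the rounding of the per-core value. The contribution of the network processors is small here: about 9.24 Gbit/s per node against roughly 7818 Gbit/s for the 20 cores.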
Analysis of Results
In order to show the effectiveness of Computer Capacity, we examined and compared four supercomputers: iDataPlex, Shaheen II, Hazel Hen and Trinity. Let us emphasize that we consider supercomputers of two different types: iDataPlex corresponds to the first type (InfiniBand FDR interconnect, Ivy Bridge processors) and the other supercomputers correspond to the second one (Aries interconnect, Haswell processors). All characteristics and results of the examined supercomputers are presented in Table 1. Following are some clarifications for Table 1:
. CC of NIC is Computer Capacity of a single network processor.
. CC of core is Computer Capacity of a single computing core.
. CC of supercomputer is the total value of supercomputer's Computer Capacity set by Eq. (1).
. LINPACK is the name of benchmark presented in TOP500 list as Rmax.
. Theoretical peak is the name of characteristic called Rpeak in TOP500 list.
The measurement units of the examined characteristics are different, so it is impossible to compare Computer Capacity with LINPACK and Theoretical peak directly. First, we sort the supercomputers in ascending order of Computer Capacity. Next, we divide each value by the corresponding value of the iDataPlex supercomputer. In this way we get a new Table 2, where the values of the characteristics corresponding to the iDataPlex supercomputer are equal to 1. The values in Table 2 are relative and have no measurement units, so we can compare them with each other directly. In order to present the results more visually, we build the graph presented in Fig. 1. This graph shows that the results of the suggested method are closer to the experimental results of LINPACK than the results of Theoretical peak are. It is important to note that the value of the LINPACK benchmark is the major characteristic in the TOP500 rating and the places are assigned to supercomputers according to this value (Theoretical peak is used only in the case of equality). Let us take a closer look at Fig. 1. For example, at the position corresponding to the Shaheen II supercomputer, we can see that the points representing the values of LINPACK and Computer Capacity almost merge into one point, unlike the value of the theoretical
Appendix A
Fig. 1. The graph of Table 2.

Let us consider a computer with instruction set $I$ and memory $M$. Each instruction $s \in I$ is described by the instruction name and the values of its operands. It means that the same instructions with different values of operands are included in $I$ independently. For example, we consider MOV R0 R1 and MOV R0 R2 (where R0, R1 and R2 are names of registers) to be different instructions, both included in $I$. We denote by $\tau(s)$, $s \in I$, the execution time of instruction $s$ and by $S = s_1, s_2, \ldots, s_n$, $s_i \in I$, a computer task. Then the execution time of task $S$ is $\tau(S) = \sum_{i=1}^{n} \tau(s_i)$. We also suppose that all execution times $\tau(s)$ are integers and that the greatest common divisor of $\tau(s)$, $s \in I$, equals 1 (this assumption is valid for most processors because there are instructions with execution time equal to one time unit, i.e., $\tau(s) = 1$). We take the number of different problems with execution time $T$ to be the size of the set of all sequences of instructions with execution time $T$, i.e., $N(T) = |\{S = s_1, s_2, \ldots, s_n : s_i \in I, \tau(S) = T\}|$. Let there be a processor which can execute $N$ different sequences of instructions during 1 min. We can say that this processor can execute $N^2$ sequences of instructions during 2 min, because if $S_1$ and $S_2$ are 1-minute sequences, the combined sequence $S_1 S_2$ is a 2-minute one (we do not take into account a few extra 2-minute sequences with an instruction which starts at the end of the first minute and finishes at the beginning of the second one). Analogously, $\approx N^k$ sequences can be executed during $k$ minutes. So the number of possible sequences grows exponentially as a function of the time $T$ ($N(T) \approx 2^{CT}$); thus $\log_2 N(T)/T$ (or the limit of this value) is an adequate measure of processor capacity, and CC is defined as follows:

$$C(I) = \lim_{T \to \infty} \frac{\log_2 N(T)}{T}. \tag{A.1}$$
It is worth noting that this situation is typical for Information Theory, where, for example, the capacity of a lossless channel is defined by the rate of asymptotic growth of the number of allowed sequences of basic symbols (letters), whose lengths can be different. The important question is how to calculate (or at least estimate) $C(I)$ in (A.1). Obviously, the direct calculation of the limit is impossible, but in combinatorial analysis there exists a method of calculating the capacity $C(I)$. This method was used by Shannon 3 when he estimated channel capacity. In this case we consider the instruction set $I$ as an alphabet and assume that all words (sequences of symbols) over that alphabet can be executed. This assumption allows us to estimate the upper bound of the processor capacity, because for any processor the set of admissible sequences of instructions is a subset of all possible sequences. Considering the above, $C(I)$ is equal to the logarithm of the largest real solution $X_0$ of the following characteristic equation:

$$\sum_{s \in I} \frac{1}{X^{\tau(s)}} = 1. \tag{A.2}$$

It is shown in Ref. 1 that the CC of multi-core processors is defined as the sum of the CCs of the cores.
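The two definitions of this appendix can be compared numerically on a toy instruction set. The sketch below is an illustration under our own assumptions, not the program used in the paper: instructions are grouped by execution time into a dictionary mapping a time $t$ to the number of instructions with that time, the characteristic equation (the sum of $X^{-\tau(s)}$ over all instructions equals 1) is solved by bisection, and $N(T)$ is counted by dynamic programming. All function names are hypothetical.

```python
import math


def characteristic_sum(x, time_counts):
    """Left-hand side of the characteristic equation: sum of n_t / x**t."""
    return sum(count / x ** t for t, count in time_counts.items())


def capacity_bits_per_unit(time_counts, tol=1e-12):
    """log2 of the largest real root X0 of sum n_t / X**t = 1, by bisection."""
    lo, hi = 1.0, 2.0
    while characteristic_sum(hi, time_counts) > 1.0:
        hi *= 2.0
    # For X > 0 the sum is strictly decreasing, so the positive root is unique
    # and is therefore also the largest real root.
    while hi - lo > tol * hi:
        mid = (lo + hi) / 2.0
        if characteristic_sum(mid, time_counts) > 1.0:
            lo = mid
        else:
            hi = mid
    return math.log2((lo + hi) / 2.0)


def count_sequences(total_time, time_counts):
    """N(T): number of instruction sequences whose execution times sum to T."""
    n = [0] * (total_time + 1)
    n[0] = 1  # the empty sequence
    for t in range(1, total_time + 1):
        n[t] = sum(c * n[t - d] for d, c in time_counts.items() if d <= t)
    return n[total_time]


# Toy instruction set (hypothetical): one instruction taking 1 time unit and
# one taking 2 time units.
TOY = {1: 1, 2: 1}
toy_capacity = capacity_bits_per_unit(TOY)
toy_growth = math.log2(count_sequences(200, TOY)) / 200
```

For this toy set the characteristic equation is $X^2 = X + 1$, so $X_0$ is the golden ratio and the capacity is about 0.694 bits per time unit; $N(T)$ follows the Fibonacci numbers, and $\log_2 N(T)/T$ approaches the same value as $T$ grows.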
