Using Information Theory to Study the Efficiency and Capacity of
  Computers and Similar Devices by Ryabko, Boris
arXiv:1003.3619v1 [cs.IT] 18 Mar 2010
Using Information Theory to Study the Efficiency
and Capacity of Computers and Similar Devices
Boris Ryabko∗
∗ Siberian State University of Telecommunications and Informatics,
Institute of Computational Technologies of Siberian Branch of Russian Academy of Science,
Novosibirsk, Russia; boris@ryabko.net
Abstract
We address the problems of estimating the computer efficiency and the
computer capacity. We define the computer efficiency and capacity and
suggest a method for their estimation, based on the analysis of processor
instructions and kinds of accessible memory. It is shown how the suggested
method can be applied to estimate the computer capacity. In particular,
this consideration gives a new look at the organization of the memory
of a computer. Obtained results can be of some interest for practical
applications.
Keywords: computer capacity, computer efficiency, Shannon theory,
cache memory, Information Theory
1 Introduction
We address the problem of what the efficiency (or performance) and the capacity
of a computer are and how they can be estimated. More precisely, we consider a
computer with a certain set of instructions and several kinds of memory. What is
the computer capacity, if we know the execution time of each instruction and the
speed of each kind of memory? What is the computer efficiency if the computer
is used for solving problems of a certain kind (say, matrix multiplications)? On
the one hand, questions about computer efficiency and capacity are quite
natural; on the other hand, to the best of our knowledge, computer
science does not yet answer them.
The first goal of this paper is to suggest a reasonable definition of computer
efficiency and capacity, together with methods for estimating them. We mainly
consider computers, but the approach applies to any device that contains
processors, memories and instructions (for example, mobile telephones and
routers). The second goal is to describe a method for estimating the computer
capacity and to apply it to several examples of theoretical and practical
interest. In particular, this consideration gives a new look at the
organization of computer memory.
The suggested approach is based on the concept of Shannon entropy, the
capacity of a discrete noiseless channel and some other ideas of C. Shannon [12]
that underlie Information Theory.
2 The computer efficiency and capacity
2.1 The basic concepts and definitions
Let us first briefly describe the main point of the suggested approach and the
definitions. To start, we consider a simplified variant of a computer, which
consists of a set of instructions I and an accessible memory M.
We suppose that at the initial moment there is a program and data, which
can be considered as binary words P and D, located in the memory M of the
computer. In what follows we call the pair P and D a computer task. A
computer task <P, D> determines a certain sequence of instructions
X(P, D) = x1 x2 x3 ..., xi ∈ I. (It is supposed that an instruction may contain
the address of a memory location in M, the index of a register, etc.) For example,
if the program P contains a loop which will be executed ten times, then the
sequence X will contain the body of this loop repeated ten times. We say that
two computer tasks <P1, D1> and <P2, D2> are different if the sequences
X(P1, D1) and X(P2, D2) are different.
Let us denote the execution time of an instruction x by τ(x). Then the
execution time τ(X) of a sequence of instructions X = x1 x2 x3 ... xt is given by
$$\tau(X) = \sum_{i=1}^{t} \tau(x_i).$$
The key observation is as follows: the number of different computer tasks whose
execution time equals T is upper bounded by the size of the set of all sequences
of instructions whose execution time equals T, i.e.
$$\nu(T) \le N(T), \tag{1}$$
where ν(T) is the number of different computer tasks whose execution time equals T,
and
$$N(T) = |\{X : \tau(X) = T\}|. \tag{2}$$
Hence,
$$\log \nu(T) \le \log N(T). \tag{3}$$
(Here and below log x ≡ log2 x, and |Y| is the number of elements of Y if Y is
a set, and the length of Y if Y is a word.) In other words, the total number of
computer tasks executed in time T is upper bounded by (2).
Based on this consideration, we give the following definition.
Definition 1. Let there be a computer with a set of instructions I and let τ(x)
be the execution time of an instruction x ∈ I. The computer capacity C(I) is
defined as follows:
$$C(I) = \lim_{T \to \infty} \frac{\log N(T)}{T}, \tag{4}$$
where N(T ) is defined in (2).
(The existence of this limit can be proven using Fekete's lemma; see [9].)
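To make Definition 1 concrete, the following minimal sketch counts N(T) from (2) by dynamic programming and evaluates log N(T)/T for growing T. It assumes integer execution times and that every sequence of instructions is admissible; the instruction set with times 1 and 2 and the function names are illustrative assumptions, not taken from the paper.

```python
import math

def count_sequences(times, T):
    """Return N(T): the number of instruction sequences with total time exactly T.

    `times` lists the positive integer execution time of every instruction;
    repeated values stand for distinct instructions of equal duration.
    """
    n = [0] * (T + 1)
    n[0] = 1  # the empty sequence
    for t in range(1, T + 1):
        # The last instruction takes tau units; the prefix must fill t - tau.
        n[t] = sum(n[t - tau] for tau in times if tau <= t)
    return n[T]

def capacity_estimate(times, T):
    """Finite-T approximation log2(N(T)) / T of the limit in (4)."""
    return math.log2(count_sequences(times, T)) / T

toy_times = [1, 2]  # hypothetical two-instruction machine
estimates = {T: capacity_estimate(toy_times, T) for T in (10, 100, 1000)}
```

For this toy set N(T) follows the Fibonacci recurrence, so the estimates approach the base-2 logarithm of the golden ratio as T grows.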
The next question is how to define the computer efficiency (or performance)
when a computer is used for solving problems of a certain kind. For example,
one computer can be a Web server, another can be used for solving differential
equations, etc. Certainly, the computer efficiency depends on the problems the
computer has to solve. To model this situation we suggest the following
approach: there is an information source which generates a sequence of computer
tasks in such a way that the computer begins to solve each next task as soon as
the previous task is finished. We will not deal with a probability distribution
on the sequences of computer tasks, but will consider the sequence of computer
instructions determined by the sequence of computer tasks as a stochastic
process. In what follows we consider the model in which this stochastic process
is stationary and ergodic, and we define the computer efficiency for this case.
The definition of efficiency is based on results and ideas of information
theory, which we introduce now. Let there be a stationary and ergodic
process z = z1, z2, ... generating letters from a finite alphabet A (the definition of
a stationary ergodic process can be found, for example, in [3]). The n-th order Shannon
entropy and the limit Shannon entropy are defined as follows:
$$h_n(z) = -\frac{1}{n+1} \sum_{u \in A^{n+1}} P_z(u) \log P_z(u), \qquad h_\infty(z) = \lim_{n \to \infty} h_n(z), \tag{5}$$
where n ≥ 0 , Pz(u) is the probability that z1z2...z|u| = u (this limit always
exists, see [3, 12]). We will consider so-called i.i.d. sources. By definition, they
generate independent and identically distributed random variables from some
set A. Now we can define the computer efficiency.
Definition 2. Let there be a computer with a set of instructions I and let τ(x) be
the execution time of an instruction x ∈ I. Let this computer be used for solving
such a randomly generated sequence of computer tasks, that the corresponding
sequence of the instructions z = z1z2..., zi ∈ I, is a stationary ergodic stochastic
process. Then the efficiency is defined as follows:
$$c(I, z) = h_\infty(z) \Big/ \sum_{x \in I} P_z(x)\, \tau(x), \tag{6}$$
where Pz(x) is the probability that z1 = x, x ∈ I.
Informally, the Shannon entropy is a quantity of information (per letter),
which can be transmitted and the denominator in (6) is the average execution
time of an instruction.
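As an illustration of (6), the sketch below evaluates the efficiency for an i.i.d. source, for which the limit entropy coincides with the zero-order entropy. The instruction names, execution times and probabilities are hypothetical values chosen only for the example.

```python
import math

def efficiency_iid(probs, times):
    """Efficiency (6) for an i.i.d. instruction source, in bits per time unit.

    probs: instruction -> probability; times: instruction -> execution time.
    """
    entropy = -sum(p * math.log2(p) for p in probs.values() if p > 0)
    mean_time = sum(p * times[x] for x, p in probs.items())
    return entropy / mean_time

times = {"a": 1.0, "b": 2.0}      # hypothetical execution times
uniform = {"a": 0.5, "b": 0.5}    # both instructions equally likely
skewed = {"a": 0.9, "b": 0.1}     # mostly the fast instruction

c_uniform = efficiency_iid(uniform, times)
c_skewed = efficiency_iid(skewed, times)
```

In this toy case the skewed source has a shorter average instruction time but carries less entropy per instruction, and its efficiency comes out lower than that of the uniform source.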
More formally, if we take a large integer T and consider all T-letter
sequences z1 ... zT, then, for large T, the number of “typical” sequences will be
approximately $2^{T h_\infty(z)}$, whereas the total execution time of such a sequence will
be approximately $T \sum_{x \in I} P_z(x)\tau(x)$. (By definition of a typical sequence, the
frequency of any word u in it is close to the probability Pz(u). The total
probability of the set of all typical sequences is close to 1.) So the ratio of
$\log 2^{T h_\infty(z)}$ to this execution time is asymptotically equal to (6) as T → ∞.
A rigorous proof can be obtained using methods of information theory; see
[3]. We do not give it, because definitions do not need to be proven, but mention
that there are many results about channels which transmit letters of unequal
duration [2, 8, 10].
2.2 Methods for estimating the computer capacity
Now we consider the question of estimating the computer capacity and efficiency
defined above. The efficiency can, in principle, be estimated from statistical
data obtained by observing a computer which solves tasks of a
certain kind.
The computer capacity C(I) can be estimated in different situations by
different methods. In particular, a stream of instructions generated by different
computer tasks can be described as a sequence of words generated by a formal
language, or the dependence between sequentially executed instructions can be
modeled by Markov chains, etc. Seemingly the most general approach is to
define the set of admissible sequences of instructions as a certain subset of all
possible sequences. More precisely, the set of admissible sequences G is defined
as a subset G ⊂ A∞, where A∞ is the set of one-sided infinite words over the
alphabet A: A∞ = {x : x = x1 x2 ...}, xi ∈ A, i = 1, 2, .... In this case the
capacity of G is deeply connected with the topological entropy and the Hausdorff
dimension; for definitions and examples see [1, 4, 5, 11] and references therein.
We do not consider this approach in detail, because it seems difficult
to use it for solving applied problems, which require a finite description of the
channels.
The simplest estimate of the computer capacity is obtained if we suppose
that all sequences of instructions are admissible. In other words, we consider
the set of instructions I as an alphabet and suppose that all sequences of
letters (instructions) can be executed. In this case the method for calculating
the capacity of a noiseless channel, given by C. Shannon in [12], can be used. It is
important to note that this method yields an upper bound on the computer
capacity for all other models, because for any computer the set of admissible
sequences of instructions is a subset of all words over the "alphabet" I.
Let, as before, there be a computer with a set of instructions I whose execution
times are τ(x), x ∈ I, and let all sequences of instructions be allowed. In other
words, if we consider the set I as an alphabet, then all possible words over
this alphabet can be considered as admissible sequences of instructions for the
computer. The question we consider now is how one can calculate (or estimate)
the capacity (4) for this case. The solution is suggested by C. Shannon [12],
who showed that the capacity C(I) is equal to the logarithm of the largest real
solution X0 of the following equation:
$$X^{-\tau(x_1)} + X^{-\tau(x_2)} + \dots + X^{-\tau(x_s)} = 1, \tag{7}$$
where I = {x1, ..., xs}. In other words, C(I) = log X0.
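Equation (7) rarely has a closed-form solution, but its root is easy to find numerically. The sketch below is one possible way to do it, using bisection on y = log2 X; the function name and the two toy instruction sets are assumptions made for illustration.

```python
def capacity(times, iters=200):
    """Return log2 of the positive root X0 of sum_x X**(-tau(x)) = 1.

    Assumes at least two instructions and strictly positive execution times,
    so the left-hand side falls from len(times) > 1 towards 0 as X grows
    and crosses 1 exactly once.
    """
    def lhs(y):  # left-hand side of (7) with X = 2**y
        return sum(2.0 ** (-tau * y) for tau in times)

    lo, hi = 0.0, 1.0
    while lhs(hi) > 1.0:  # widen the bracket until it contains the root
        hi *= 2.0
    for _ in range(iters):
        mid = (lo + hi) / 2.0
        if lhs(mid) > 1.0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

toy = capacity([1, 2])        # X0 is the golden ratio
uniform = capacity([1] * 8)   # eight unit-time instructions: 8 / X = 1
```

The second call is a quick sanity check: with eight instructions of unit duration the equation reduces to 8/X = 1, so the capacity is 3 bits per time unit.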
It is easy to see that the efficiency (6) is maximal if the sequence of
instructions x1 x2 ..., xi ∈ I, is generated by an i.i.d. source with probabilities
$p^*(x) = X_0^{-\tau(x)}$, x ∈ I, where X0 is the largest real solution of equation (7).
Indeed, taking into account that h∞(z) = h0(z) for an i.i.d. source [3] and
the definition of entropy (5), a direct calculation of c(I, p*) in (6) shows that
c(I, p*) = log X0 and, hence, c(I, p*) = C(I).
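The direct calculation referred to here can be written out as follows. It uses only (5), (6) and the fact that, by (7), the values p*(x) sum to 1; it checks that p* attains the capacity and is not a full proof of maximality.

```latex
\begin{align*}
h_0(p^*) &= -\sum_{x \in I} p^*(x) \log p^*(x)
          = -\sum_{x \in I} p^*(x) \log X_0^{-\tau(x)}
          = \log X_0 \sum_{x \in I} p^*(x)\,\tau(x), \\
c(I, p^*) &= \frac{h_0(p^*)}{\sum_{x \in I} p^*(x)\,\tau(x)}
           = \log X_0 = C(I).
\end{align*}
```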
It will be convenient to combine all the results about computer capacity and
efficiency in the following statement:
Theorem 1. Let there be a computer with a set of instructions I and let τ(x)
be the execution time of x ∈ I. Suppose that all sequences of instructions are
admissible computer programs. Then the following statements are valid:
i) The capacity C(I) defined in (4) equals log X0, where X0 is the largest real
solution of equation (7).
ii) The efficiency (6) is maximal if the sequences of instructions are generated
by an i.i.d. source with probabilities $p^*(x) = X_0^{-\tau(x)}$, x ∈ I.
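Both parts of Theorem 1 can be checked numerically on a small example. The sketch below assumes a hypothetical machine with three instructions of durations 1, 2 and 5; it solves (7) by bisection, builds p*, and compares the resulting efficiency (6) with the capacity and with the efficiency of a uniform source.

```python
import math

def capacity(times, iters=200):
    """log2 of the positive root of sum_x X**(-tau(x)) = 1 (bisection on log2 X)."""
    def lhs(y):
        return sum(2.0 ** (-tau * y) for tau in times)
    lo, hi = 0.0, 1.0
    while lhs(hi) > 1.0:
        hi *= 2.0
    for _ in range(iters):
        mid = (lo + hi) / 2.0
        if lhs(mid) > 1.0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

def efficiency(probs, times):
    """Efficiency (6) of an i.i.d. source with the given probabilities."""
    entropy = -sum(p * math.log2(p) for p in probs if p > 0)
    mean_time = sum(p * tau for p, tau in zip(probs, times))
    return entropy / mean_time

times = [1.0, 2.0, 5.0]                           # hypothetical durations
cap = capacity(times)
p_star = [2.0 ** (-tau * cap) for tau in times]   # p*(x) = X0 ** (-tau(x))
uniform = [1.0 / len(times)] * len(times)

eff_star = efficiency(p_star, times)
eff_uniform = efficiency(uniform, times)
```

Here `p_star` sums to 1 up to rounding, `eff_star` agrees with `cap`, and the uniform source stays below the capacity, as the theorem predicts.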
3 MIX and MMIX
As an example we briefly consider the MIX and MMIX computers suggested by
D. Knuth [6, 7]. Both computers are described in detail;
MMIX can be considered as a model of a modern computer, whereas MIX can
be considered as a model of computers produced in the 1970s. The purpose
of this consideration is to investigate the given definitions and to look at how
various characteristics of a computer influence its capacity; therefore we give
some details of the description of MIX and MMIX.
We consider a binary version of MIX [6], whose instructions are represented
by 31-bit words. Each machine instruction occupies one word in the memory
and consists of 4 parts: the address (12 bits and the sign of the word) in memory
to read or write; an index specification (1 byte, describing which register to use)
to add to the address; a modification (1 byte) that specifies which parts of the
register or memory location will be read or altered; and the operation code (1
byte). So almost all 31-bit words can be considered as possible instructions,
and an upper bound on the size of the set of instructions I (the number of letters
of the "computer alphabet") is $2^{31}$. Each MIX instruction has an associated
execution time, given in arbitrary units. For example, the instruction JMP
(jump) has an execution time of 1 unit, and the execution times of MUL and DIV
(multiplication and division) are 10 units and 12 units, respectively. There
are special instructions whose execution time is not constant. For example, the
instruction MOVE is intended to copy information from several cells of memory,
and its execution time equals 1 + 2F, where F is the number of cells.
From the description of MIX instructions and Theorem 1 we obtain the
following equation for calculating the upper bound of the capacity of MIX:
$$\frac{2^{28}}{X} + \frac{2^{26}}{X^{2}} + \frac{2^{26}}{X^{10}} + \frac{2^{25}}{X^{12}} + \sum_{F=0}^{2^{25}} \frac{2^{25}}{X^{1+2F}} = 1. \tag{8}$$
Here the first summand corresponds to operations with execution time 1, etc.
It is easy to see that the last sum can be estimated as follows:
$$\sum_{F=0}^{2^{25}} \frac{2^{25}}{X^{1+2F}} < \frac{2^{25}}{X} \cdot \frac{X^{2}}{X^{2}-1}.$$
Taking into account this inequality and (8), we obtain by
direct calculation that the MIX capacity is approximately 28 bits per time unit.
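The direct calculation can be reproduced with a few lines of code. The sketch below replaces the MOVE sum in (8) by the geometric-series bound given above and solves the resulting equation by bisection; because the bound overestimates the left-hand side, the root found this way is an upper estimate. The function names and the bracket [1, 64] for log2 X are choices made for this illustration.

```python
def mix_lhs(y):
    """Left-hand side of (8) at X = 2**y, with the MOVE sum replaced by its bound."""
    x = 2.0 ** y
    fixed = 2.0**28 / x + 2.0**26 / x**2 + 2.0**26 / x**10 + 2.0**25 / x**12
    move_bound = (2.0**25 / x) * x**2 / (x**2 - 1.0)
    return fixed + move_bound

def mix_capacity(lo=1.0, hi=64.0, iters=200):
    """Bisection on y = log2 X; the left-hand side decreases for X > 1."""
    for _ in range(iters):
        mid = (lo + hi) / 2.0
        if mix_lhs(mid) > 1.0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

mix_bits = mix_capacity()
```

The terms with time 1 dominate, so the root lies close to X = 2^28 + 2^25, which is consistent with the value of about 28 bits per time unit quoted above.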
The MMIX computer has 256 general-purpose registers, 32 special-purpose
ones and $2^{64}$ bytes of virtual memory [7]. The MMIX instructions are represented
as 32-bit words, and in this case the "computer alphabet" consists of almost $2^{32}$
words (almost, because some combinations of bits do not make sense). In [7] the
execution (or running) time is assigned to each instruction in such a way that
each instruction takes an integer number of υ, where υ is a unit that represents
the clock cycle time. Besides, it is assumed that the running time depends on
the number of memory references (mems) that a program uses [7]. For example,
it is assumed that the execution time of each of the LOAD instructions is υ + µ,
where µ is the average time of a memory reference [7]. If we consider υ as the
time unit and define $\hat{\mu} = \mu/\upsilon$, we obtain from the description of MMIX [7] and
Theorem 1 the following equation for finding an upper bound on the MMIX
capacity:
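$$2^{24} \left( \frac{139}{X} + \frac{32}{X^{2}} + \frac{5}{X^{3}} + \frac{17}{X^{4}} + \frac{3}{X^{5}} + \frac{4}{X^{10}} + \frac{2}{X^{40}} + \frac{4}{X^{60}} + \frac{46}{X^{1+\hat{\mu}}} + \frac{2}{X^{1+20\hat{\mu}}} + \frac{46}{X^{2+2\hat{\mu}}} \right) = 1. \tag{9}$$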
The value $\hat{\mu}$ depends on the realization of MMIX and is larger than 1 for modern
computers [7]. So, as in the previous example, the first term has the most
influence, and the MMIX capacity is approximately 31.5 bits per time unit.
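Since the root of (9) depends on the memory parameter, it can be helpful to solve the equation for several values of it. The sketch below does this by bisection on y = log2 X; the value 1.5 is only an assumed example of a memory time larger than one clock cycle, and 0 is included as the limiting case in which a memory reference costs nothing extra.

```python
def mmix_lhs(y, mu_hat):
    """Left-hand side of (9) at X = 2**y; mu_hat is the memory time in clock cycles."""
    terms = [
        (139, 1), (32, 2), (5, 3), (17, 4), (3, 5), (4, 10), (2, 40), (4, 60),
        (46, 1 + mu_hat), (2, 1 + 20 * mu_hat), (46, 2 + 2 * mu_hat),
    ]
    return 2.0**24 * sum(c * 2.0 ** (-t * y) for c, t in terms)

def mmix_capacity(mu_hat, lo=0.0, hi=64.0, iters=200):
    """Bisection on y = log2 X; the left-hand side decreases as y grows."""
    for _ in range(iters):
        mid = (lo + hi) / 2.0
        if mmix_lhs(mid, mu_hat) > 1.0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

cap_slow_memory = mmix_capacity(mu_hat=1.5)  # assumed value, larger than 1
cap_free_memory = mmix_capacity(mu_hat=0.0)  # limiting case only
```

The gap between the two results indicates how much the memory instructions could contribute at most; for larger values of the memory time their terms become small next to the first term of (9).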
These examples show that the capacity of both computers is mainly determined
by the subsets of instructions whose execution time is minimal.
Theorem 1 makes it possible to estimate the frequencies of the instructions
when the computer efficiency is maximal (and equals its capacity).
First, the frequencies of instructions with equal running time have to be equal.
In turn, this means that all memory cells should be used equally often. Second, the
frequency of instructions decreases exponentially as their running time increases.
It is interesting that in the modern computer MMIX the share of fast instructions
is larger than in the old computer MIX and, hence, the efficiency of MMIX is
larger. This is achieved by using registers instead of the (slow) memory.
4 Possible applications
It is natural to use estimates of the computer capacity at the design stage.
We consider examples of such estimates, intended to illustrate some
possibilities of the suggested approach.
First we consider a computer whose design is close to that of the MMIX computer.
Suppose a designer has decided to use the MMIX set of registers. Suppose
further that he/she can choose between two different kinds of memory, such
that the time of one memory reference and the cost of one cell are τ1,
c1 and τ2, c2, respectively. It is natural to suppose that the total price of
the memory is required not to exceed a certain bound C. As in the example
with MMIX we define $\hat{\mu}_1 = \tau_1/\upsilon$, $\hat{\mu}_2 = \tau_2/\upsilon$, where, as before, υ is a unit that
represents the clock cycle time.
As in the case of MMIX, we suppose that there are instructions for writing
information from a register to a memory cell and for reading it back. The set of
these instructions coincides with the corresponding set of the MMIX computer. If
we denote the number of memory cells by S, then the number of instructions which
can be used for reading and writing is proportional to S. Taking into
account that MMIX has $2^{8}$ registers and equation (9), we can see that the
designer should consider the following two equations:
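$$2^{24} \left( \frac{139}{X} + \frac{32}{X^{2}} + \frac{5}{X^{3}} + \frac{17}{X^{4}} + \frac{3}{X^{5}} + \frac{4}{X^{10}} + \frac{2}{X^{40}} + \frac{4}{X^{60}} \right) + 2^{8} S_i \left( \frac{46}{X^{1+\hat{\mu}_i}} + \frac{2}{X^{1+20\hat{\mu}_i}} + \frac{46}{X^{2+2\hat{\mu}_i}} \right) = 1 \tag{10}$$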
for i = 1, 2, where Si = C/ci, i.e. Si is the number of cells of the i-th kind
of memory. The designer can calculate the maximal root of each
equation (i = 1, 2) and then choose the kind of memory for which
the root is larger; the computer capacity will be larger for
the chosen kind of memory. For example, suppose that the total price should
not exceed 1 (C = 1), the prices of one cell of memory are $c_1 = 2^{-30}$ and
$c_2 = 2^{-34}$, whereas $\hat{\mu}_1 = 1.2$, $\hat{\mu}_2 = 1.4$. A direct calculation with equation
(10) for $S_1 = 2^{30}$ and $S_2 = 2^{34}$ shows that the former is preferable, because the
computer capacity is larger for the first kind of memory.
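The comparison in this example can be sketched in code as follows. The coefficients are those of equation (10) and the prices and memory times are the ones given in the example; the function names and the bisection bracket are assumptions of this illustration.

```python
BASE = [(139, 1), (32, 2), (5, 3), (17, 4), (3, 5), (4, 10), (2, 40), (4, 60)]

def design_lhs(y, cells, mu_hat):
    """Left-hand side of (10) at X = 2**y for one kind of memory."""
    base = 2.0**24 * sum(c * 2.0 ** (-t * y) for c, t in BASE)
    mem_terms = [(46, 1 + mu_hat), (2, 1 + 20 * mu_hat), (46, 2 + 2 * mu_hat)]
    memory = 2.0**8 * cells * sum(c * 2.0 ** (-t * y) for c, t in mem_terms)
    return base + memory

def design_capacity(cells, mu_hat, lo=0.0, hi=128.0, iters=200):
    """Bisection on y = log2 X; the left-hand side decreases as y grows."""
    for _ in range(iters):
        mid = (lo + hi) / 2.0
        if design_lhs(mid, cells, mu_hat) > 1.0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

budget = 1.0
cap_1 = design_capacity(cells=budget / 2.0**-30, mu_hat=1.2)  # S1 = 2**30
cap_2 = design_capacity(cells=budget / 2.0**-34, mu_hat=1.4)  # S2 = 2**34
preferred = 1 if cap_1 > cap_2 else 2
```

With these inputs the first kind of memory gives the larger root, although the two capacities differ only marginally, because the memory terms are small next to the first term of (10).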
Obviously, this model can be generalized to different sets of instructions
and different kinds of memory. In such a case the considered problem can be
described as follows. We suppose that there are instructions $\mu_i^w(n)$ for writing
information from a special register to the n-th cell of the i-th kind of memory
(n = 0, ..., ni − 1, i = 1, ..., k), and similar instructions $\mu_i^r(n)$ for reading. Moreover,
it is supposed that all other instructions cannot directly read from or write to the
memory of those kinds, i.e. they can write to and read from the registers only.
(It is worth noting that this model is quite close to some real computers.) Denote
the execution time of the instructions $\mu_i^w(n)$ and $\mu_i^r(n)$ by $\dot{\tau}_i$, i = 1, ..., k.
In order to get an upper bound of the computer capacity for the described
model we, as before, consider the set of instructions as an alphabet and estimate
its capacity applying Theorem 1. From (7) we obtain that the capacity is logX0,
where X0 is the largest real solution of the following equation:
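$$\sum_{x \in I^*} X^{-\tau(x)} + R \left( \frac{2 n_1}{X^{\dot{\tau}_1}} + \frac{2 n_2}{X^{\dot{\tau}_2}} + \dots + \frac{2 n_k}{X^{\dot{\tau}_k}} \right) = 1, \tag{11}$$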
where I* contains all instructions except $\mu_i^r(n)$ and $\mu_i^w(n)$, i = 1, ..., k, and R is the
number of registers. (The summand $2 n_i / X^{\dot{\tau}_i}$ corresponds to the instructions $\mu_i^w(n)$
and $\mu_i^r(n)$.)
Let us suppose that the price of one cell of the i-th kind of memory is ci, whereas
the total cost of memory is limited by C. Then, from the previous equation we
obtain the following optimization problem:
$$\log X_0 \longrightarrow \text{maximum},$$
where X0 is the maximal real solution of equation (11) and
$$c_1 n_1 + c_2 n_2 + \dots + c_k n_k \le C; \qquad n_i \ge 0, \; i = 1, \dots, k.$$
The solution of this problem can be found using standard methods and used by
computer designers.
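For small instances the optimization problem can simply be solved by enumeration. The sketch below does this for k = 2 kinds of memory on a deliberately tiny, hypothetical machine (all numbers are made up for illustration); it is a brute-force search, not one of the standard methods the text alludes to.

```python
def capacity_11(base, registers, memory, lo=0.0, hi=64.0, iters=200):
    """log2 of the root of (11), found by bisection on y = log2 X.

    base: (count, time) pairs for the instructions in I*;
    memory: (n_i, access_time_i) pairs, one per kind of memory.
    """
    def lhs(y):
        s = sum(c * 2.0 ** (-t * y) for c, t in base)
        s += registers * sum(2 * n * 2.0 ** (-t * y) for n, t in memory)
        return s
    for _ in range(iters):
        mid = (lo + hi) / 2.0
        if lhs(mid) > 1.0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

# Hypothetical toy machine: register-only instructions, two kinds of memory.
base = [(4, 1), (2, 2)]      # four 1-unit and two 2-unit instructions in I*
registers = 2                # R
cell_cost = [1, 2]           # c1, c2
access_time = [3, 2]         # execution times of the memory instructions
budget = 8                   # C

best = None
for n1 in range(budget // cell_cost[0] + 1):
    n2 = (budget - cell_cost[0] * n1) // cell_cost[1]  # spend the rest on kind 2
    cap = capacity_11(base, registers,
                      [(n1, access_time[0]), (n2, access_time[1])])
    if best is None or cap > best[0]:
        best = (cap, n1, n2)
```

In this toy instance the search spends the whole budget on the second kind of memory. This is not surprising: for a fixed X the left-hand side of (11) is linear in the numbers of cells, so mixing two kinds cannot beat the better of the two pure allocations.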
The suggested approach can be applied to the optimization of different
parameters of computers, including the structure of the set of instructions, etc.
5 Conclusion
We have suggested a definition of the computer capacity and efficiency, as well
as a method for their estimation. This approach may
be useful at the design stage when developing computers and similar devices.
It would be interesting to analyze the “evolution” of computers from the
point of view of their capacity. A preliminary analysis shows that the
development of RISC processors, the increase in the number of registers and
some other innovations lead to an increase in the capacity of computers. Moreover,
such methods as using cache memory can be interpreted as an attempt to
increase the efficiency of a computer.
It is worth noting that the suggested approach can, in general, be extended
to multi-core processors and special kinds of cache memory.
References
[1] D.V. Anosov. "Topological entropy," In: Encyclopaedia
of Mathematics, edited by Michiel Hazewinkel, Springer,
http://eom.springer.de/T/t093040.htm
[2] I. Csiszar. "Simple proofs of some theorems on noiseless channels," Inform.
Contr., vol. 14, pp. 285–298, 1969.
[3] T.M. Cover, J.A. Thomas. Elements of Information Theory. Wiley, 2006.
[4] D. Doty. "Dimension Extractors and Optimal Decompression," Theory of
Computing Systems, vol. 43, no. 3-4, pp. 432-4350
[5] L. Fortnow, J.H. Lutz. "Prediction and dimension," Journal of Computer
and System Sciences, vol. 70, no. 4, pp. 570–589, 2005.
[6] D.E. Knuth. The Art of Computer Programming, Volume 1: Fundamental
Algorithms, 1968.
[7] D.E. Knuth. The Art of Computer Programming, Volume 1, Fascicle 1,
MMIX: A RISC Computer for the New Millennium, 2005.
[8] R.M. Krause. "Channels which transmit letters of unequal duration,"
Inform. Contr., vol. 5, pp. 13–24, 1962.
[9] R. Krichevsky. Universal Compression and Retrieval, Kluwer Academic
Publishers, 1993.
[10] K. Mehlhorn. "An efficient algorithm for constructing nearly optimal prefix
codes," IEEE Trans. Inform. Theory, vol. 26, pp. 513–517, 1980.
[11] B.Ya. Ryabko. "Noiseless coding of combinatorial sources, Hausdorff
dimension, and Kolmogorov complexity," Problems of Information
Transmission, vol. 22, pp. 170–179, 1986.
[12] C.E. Shannon. "A mathematical theory of communication," Bell Sys.
Tech. J., vol. 27, pp. 379–423, pp. 623–656, 1948.
