Cite this article: Serb A, Kobyzev I, Wang J, Prodromakis T. 2019 A semi-holographic hyperdimensional representation system for hardware-friendly cognitive computing. Phil. Trans. R. Soc. A 378: 20190162. http://dx.
One contribution of 13 to a theme issue 'Harmonizing energy-autonomous computing and intelligence'.
Subject Areas: artificial intelligence, electrical engineering, computational mathematics
One of the main, long-term objectives of artificial intelligence is the creation of thinking machines. To that end, substantial effort has been placed into designing cognitive systems; i.e. systems that can manipulate semantic-level information. A substantial part of that effort is oriented towards designing the mathematical machinery underlying cognition in a way that is very efficiently implementable in hardware. In this work, we propose a 'semi-holographic' representation system that can be implemented in hardware using only multiplexing and addition operations, thus avoiding the need for expensive multiplication. The resulting architecture can be readily constructed by recycling standard microprocessor elements and is capable of performing two key mathematical operations frequently used in cognition, superposition and binding, within a budget of below 6 pJ for 64-bit operands. Our proposed 'cognitive processing unit' is intended as just one (albeit crucial) part of much larger cognitive systems where artificial neural networks of all kinds and associative memories work in concord to give rise to intelligence.
This article is part of the theme issue 'Harmonizing energy-autonomous computing and intelligence'.
Introduction
The explosive scale of research output and investment in the field of artificial intelligence (AI) and machine learning (ML) testifies to the tremendous impact of the field on the world. Thus far this has manifested itself as a mass-scale proliferation of artificial neural network-based (ANN) algorithms for data classification. This covers multiple data modalities, most prominently images [1] and speech/sound [2], and relies on a number of standard, popular ANN architectures, most notably multi-layer perceptrons [3], recurrent NNs (in particular, LSTM [4] and GRU [5]) and convolutional NNs [6], among many others [7, 8].
Thus far the vast majority of market-relevant ANN-based systems belong to the domain of statistical learning, i.e. perform tasks which can be generally reduced to some sort of pattern recognition and interpolation (in time, space, etc.). This, though demonstrably useful, is akin to memorizing every answer to every question plus some ability to cope with uncertainty. By contrast, higher-level intelligence must be able to support fluid reasoning and syntactic generalization, i.e. applying previous knowledge/experience to solve novel problems. This requires the packaging of classified information generated by traditional ANNs into higher-level variables (which we may call 'semantic objects'), which can then be fluently manipulated at that higher level of abstraction. A number of cognitive architectures have been proposed to perform such post-processing, most notably the ACT-R architecture [9] and the semantic pointer architecture (SPA) [10], which is an effort to manipulate symbols using neuron-based implementations.
Handling the complex interactions/operations between semantic objects requires both orderly semantic object representations and machinery to carry out useful object manipulation operations. Hyperdimensional vector-based representation systems [11] have emerged as the de facto standard approach and are employed in both the SPA and ACT-R. Their mathematical machinery typically includes generalized vector addition (combine two vectors in such a way that the result is as similar to both operands as possible), vector binding (combine two vectors in such a way that the result is as dissimilar to both operands as possible) and normalization (scale vector elements so that overall vector magnitude remains constant). These operations may be instantiated in holographic (all operands and results have fixed, common length) or non-holographic manners. Non-holographic systems have employed convolution [12] or tensor products [13] as binding. Holographic approaches have used circular convolution [11] and element-wise XOR [14]. Meanwhile, element-wise addition tends to remain the vector addition operation of choice across the board.
Finally, whichever computational methodology is adopted for cognitive computing must be implementable in hardware with extremely high power efficiency in order to realize its full potential for practical impact. This is the objective pursued by a number of accelerator architectures spanning from limited precision analogue neuron-based circuits [15] , through analogue/digital mixtures [16] to fully analogue chips seeking to emulate the diffusive kinetics of real synapses [17] . More recently, memristor-based architectures have also emerged [18] .
In this work, we summarize an existing, abstract mathematical structure for carrying out semantic object manipulation computations and propose an alternative, hardware-friendly instantiation. Our approach uses vector concatenation and modular addition as its fundamental operations (in contrast with the more typical element-wise vector addition and matrix-vector multiplication, respectively). Crucially, the chosen set of operations no longer forms a holographic representation system. This trades away some 'expressivity' (ability to form semantic object expressions within limited resources) in exchange for compression: unlike in holographic representations, the length of a semantic object vector depends on its information content. Furthermore, the proposed system avoids the use of multiplication completely, thus allowing for both fast and efficient processing in hardware (avoiding both expensive multipliers and relatively slow spiking systems). Finally, we illustrate how the proposed system can be easily mapped onto a simple vector processing unit and provide some preliminary, expected performance metrics based on a commercially available 65 nm technology.
Mathematical foundations and motivation
Generalizing the series of work on models of associative memory, many of them inspired by the world of optics [11, 13, 14, 19-24], one may inspect the most abstract algebraic formulation of it. All we need is a commutative ring R with a distance metric dist.
In order to give this mathematical machinery sufficient power to describe cognitive tasks, one must initially specify the ring operations and impose some restrictions on them. The primary operation (ring addition, denoted by +) is used to carry out the semantic object-level operation of superposition; that is, the combination of two elements in such a way that the result is equidistant from its operands under the metric dist (i.e. for a, b ∈ R, one has dist(a + b, a) ≈ dist(a + b, b)). The secondary operation (ring multiplication, denoted by *) similarly enables the semantic object operation of binding; that is, the combination of two elements in such a manner that the result is ideally completely different from both operands. Next, one needs to store a (finite) set of elements of R including both invertible elements, which we call 'pointers' (or roles), and not necessarily invertible ones, which we call 'fillers'.
As a toy example to help with visualization let us construct a simple ring (the notation used in this paragraph refers to this paragraph only and is separate from all the remaining text). Let our semantic objects be all vectors with three real number elements $A = [a_0, a_1, a_2]$. Let our superposition be the element-wise addition of two semantic object vectors: $A + B = [a_0 + b_0,\ a_1 + b_1,\ a_2 + b_2]$. Let our binding be the following operation: multiply the two semantic objects element-wise and then permute the result by one position to the right so that: $A * B = [a_2 b_2,\ a_0 b_0,\ a_1 b_1]$. We let the distance metric remain the Euclidean: $\mathrm{dist}(A, B) = \sqrt{\sum_i (a_i - b_i)^2}$. This toy example illustrates the basic principles of constructing such a system and alludes to the almost complete freedom that one has in building it.

Let us now give an example of how such mathematical machinery may give rise to simple cognition. Assume that we have a ring R with the distance dist and the operations satisfying the desired properties. Also assume that we have fixed five elements of R: obj and col are invertible, while red, green and car are arbitrary elements. We can now construct a new element: $s = obj * car + col * red$, which can be interpreted as the semantic object 'red car'. Now one can ask: what colour is this car? The answer can be accessed by performing an algebraic operation: $col^{-1} * s = red + col^{-1} * obj * car$. Then if the term $col^{-1} * obj * car$ is either close to zero or in some other way does not interfere with the computation of dist, the stored memory element closest to the result of the query is red. Mathematically, the query is $\arg\min_{r \in R} \mathrm{dist}(col^{-1} * s, r) = red$. Thus, we observe that mathematically, AI is underpinned by a solid computational/information processing foundation whose functionality must be preserved in any proposed alternative representation system, even if not necessarily via a distance-equipped commutative ring.
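The three-element toy ring described above can be sketched in a few lines of Python. This is only an illustration: the function names `superpose`, `bind` and `dist`, and the sample vectors, are hypothetical choices made here.

```python
import math

def superpose(a, b):
    """Toy superposition: element-wise addition."""
    return [x + y for x, y in zip(a, b)]

def bind(a, b):
    """Toy binding: element-wise product, rotated one position to the right."""
    prod = [x * y for x, y in zip(a, b)]
    return prod[-1:] + prod[:-1]

def dist(a, b):
    """Euclidean distance between two semantic objects."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

A = [1.0, 2.0, 3.0]   # arbitrary example vectors
B = [4.0, 5.0, 6.0]

S = superpose(A, B)   # [5.0, 7.0, 9.0]
P = bind(A, B)        # [18.0, 4.0, 10.0]
```

Here `dist(S, A)` equals the length of `B` and `dist(S, B)` the length of `A`, so the superposition is roughly equidistant from its operands whenever the operands have similar magnitudes.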
The classical realization of the commutative ring-based cognition principle is the holographic-like memory [11]. In this case, R is defined as follows: the set is a collection of n-dimensional real vectors (n-vectors) $\mathbb{R}^n$. Superposition is element-wise addition and binding is circular convolution. The distance metric is the simple Euclidean. To define a pointer or a filler one just needs to independently sample each entry of the vector from the normal distribution $N(0, 1/n)$.

Figure 1. Summary of key terms used throughout this text exemplified by a full-length chain. (Online version in colour.)
Finally, the operations of the system must be ideally implementable in hardware in a way that minimizes power and area requirements. In practice, this means that the fundamental superposition and binding operations must rely on energetically cheap building block operations such as thresholding (an inverter), shifts (flip-flop chain), addition (sum of currents on a wire or digital adder) or possibly analogue multiplication (memristor + switch) [25]. Implementation details will ultimately determine the actual cost of each operation. The main approaches so far either use too many multiply-accumulate (MAC) operations (circular convolution-based binding from [11] requires ≈ $n^2$ MACs/binding), or are applicable only to binary vectors (radix k = 2) [14].
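To make the cost of the classical holographic approach concrete, the following minimal sketch runs the 'red car' query with circular convolution as binding and counts the MACs of a naive implementation. Two things are assumptions not taken from the text: unbinding uses the involution (index reversal) as an approximate inverse, as is common for holographic reduced representations, and the vector length n = 512 and the random seed are arbitrary.

```python
import random

def random_vector(n, rng):
    """Sample each entry independently from N(0, 1/n), i.e. std = sqrt(1/n)."""
    sigma = (1.0 / n) ** 0.5
    return [rng.gauss(0.0, sigma) for _ in range(n)]

def circular_convolve(a, b):
    """Naive circular convolution; returns (result, number of MACs used)."""
    n = len(a)
    out, macs = [0.0] * n, 0
    for k in range(n):
        acc = 0.0
        for j in range(n):
            acc += a[j] * b[(k - j) % n]
            macs += 1
        out[k] = acc
    return out, macs

def involution(a):
    """Approximate inverse under circular convolution (assumption, see lead-in)."""
    n = len(a)
    return [a[(-i) % n] for i in range(n)]

def sq_dist(a, b):
    """Squared Euclidean distance (same ordering as the Euclidean distance)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

rng = random.Random(0)
n = 512
vocab = {name: random_vector(n, rng)
         for name in ("obj", "col", "red", "green", "car")}

obj_car, macs = circular_convolve(vocab["obj"], vocab["car"])
col_red, _ = circular_convolve(vocab["col"], vocab["red"])
s = [x + y for x, y in zip(obj_car, col_red)]            # 'red car'

query, _ = circular_convolve(involution(vocab["col"]), s)  # col^-1 * s
answer = min(vocab, key=lambda name: sq_dist(query, vocab[name]))
```

A single binding costs `macs == n * n` multiply-accumulates in this naive form, and the recalled `answer` is only the nearest vocabulary entry to a noisy result, in contrast to the lossless recovery offered by modular addition.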
Proposed semi-holographic representation system
In this section, we provide an intuitive overview followed by a rigorous mathematical explanation of the proposed architecture interwoven with pointers on how our design decisions aim towards hardware efficiency. Overall, in order to achieve a more hardware-friendly cognitive algebra realization, we trade away some of the mathematical simplicity from the previous section for implementability. The algebraic structure we are using for cognition is no longer a ring, but a rather exotic construction. It consists of an underlying set and two binary operations (superposition, binding).
(a) Building a set of semantic objects
In our proposed system, the set of semantic objects is perhaps best understood in terms of two subsets: (i) fixed-length 'base items', each consisting of y integer elements in the range [0, p − 1]. The choices of p and y link to desired memory capacity, i.e. the number of semantic objects the system is capable of representing reliably (see §4). (ii) Variable-length 'item chains' consisting of multiple concatenated base items. The maximum length for chains is d base items for a total of n numerical elements, where d is determined by the hardware design and affects the capacity of the system to hold/express multiple basic items at the same time. The number of base items in a chain is defined as the rank of the chain. The terminology is summarized in figure 1.
Some observations about our implementation: (i) base items are generally intended for encoding the fundamental vocabulary items of the system (e.g. 'red', 'apple', 'colour') and possible bindings, including the classical 'pointer-filler' pairings (e.g. 'colour' * 'red': the value of the colour 'attribute' is 'red'). By contrast, chains are intended for simultaneously holding (superpositions of) multiple base items in memory (e.g. composite descriptions of objects such as: colour * red + object * apple (a red apple), or collections of unrelated items such as: shape * circle + shape * square (a circle and a square)). The order in which the superposed items are kept in memory does not bear any functional significance; for the purposes of our system items are either present or absent from a chain. Cognitive systems that are order- or even position-dependent can be, of course, conceived; all that is necessary is for each item to have some mechanism (e.g. a position indicator) for marking its location within a chain. (ii) Setting p, y, d, n as powers of 2 offers the attribute of naturally advantageous implementation in digital hardware. This is the approach we choose in this work, as shown in table 1. The choice of p is not necessarily obvious as what constitutes a 'good' choice of p will depend on the specific implementations of superposition and binding. (iii) Any chain can be zero-padded until it forms a maximum-length chain.

Table 1. Summary of relations between basic mathematical objects used in this work. As an example of how to read the table, the top left entry states that: 'In every element there are $p = 2^l$ states'. The term 'chain' refers to a maximum size chain. The term 'item' refers to base items. All parameters are integers.

Mathematically, the above can be described as follows: fix natural numbers p and y as above. Then the set of base items is a group $B = (\mathbb{Z}/p)^y$ (under element-wise mod p summation). The way to form item chains is by executing a direct product of copies of B. We say that any element of $B^r = \prod_{i=1}^{r} B$ has rank r. A chain of maximal length is an element of $B^d$ and comprises $n = d \cdot y$ numerical elements.
(b) Superposition and binding
Next, we define our set of basic operations. The superposition operation '+' is defined as follows: if a and b are semantic objects, then:
$$a + b = (a, b),$$
which is a standard direct sum. The result contains both a and b operands preserved completely intact. This can be contrasted with superposition implemented as regular element-wise summation, where each operand is 'blurred' and merged into the result. Superpositions of semantic objects whose combined ranks exceed d are not allowed. Formally speaking, given $a \in B^{d_1}$, $b \in B^{d_2}$, the superposition a + b is just the element (a, b) of the direct product $B^{d_1 + d_2}$. If $d_1 + d_2 > d$, the operation is not defined.
Next, the binding operation '*' is defined as a variant of a tensor product between semantic objects where the individual pairings are subsequently subjected to element-wise addition modulo p. Mathematically, for given natural numbers $d_1$ and $d_2$ such that $d_1 \cdot d_2 \le d$, one can define the binding operation $* : B^{d_1} \times B^{d_2} \to B^{d_1 d_2}$ as:
$$(a_1, \dots, a_{d_1}) * (b_1, \dots, b_{d_2}) = (a_1 + b_1,\ a_1 + b_2,\ \dots,\ a_{d_1} + b_{d_2}),$$
where + is the group operation in $B = (\mathbb{Z}/p)^y$. One can see that any element from B (base item) is invertible under this binding. One should note that modular addition is losslessly reversible: we may indefinitely add and subtract n-vectors, and therefore can perfectly extract any individual term from any multi-term binding combination if we bind with the modulo p summation inverses of all other terms. We also remark that within the context of the order-independence property any binding of chains with length greater than 1 item is effectively a convenient shorthand for describing multiple base item bindings and adds no further computational (or indeed semantic) value.
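The two operations can be illustrated with a short Python sketch in which a chain is a list of base items and a base item is a tuple of y integers modulo p. The parameter values, the item vectors and the row-by-row ordering of the pairwise sums are assumptions made for illustration only.

```python
P, Y, D = 16, 4, 8   # illustrative: states per element, elements per item, max rank

def superpose(a, b):
    """Superposition: concatenate two chains; undefined beyond rank D."""
    if len(a) + len(b) > D:
        raise ValueError("combined rank exceeds d")
    return a + b

def bind(a, b):
    """Binding: all pairwise element-wise sums modulo P (ordering is an assumption)."""
    if len(a) * len(b) > D:
        raise ValueError("product of ranks exceeds d")
    return [tuple((u + v) % P for u, v in zip(x, y)) for x in a for y in b]

def inverse(item):
    """Inverse of a base item under modulo-P addition."""
    return tuple((-u) % P for u in item)

# Arbitrary example vocabulary (hypothetical values).
obj, col = (3, 14, 7, 1), (9, 2, 11, 6)
car, red = (5, 5, 12, 8), (0, 13, 4, 10)

# s = obj * car + col * red  ('red car'), a chain of rank 2
s = superpose(bind([obj], [car]), bind([col], [red]))

# col^-1 * s: the second item of the result is exactly 'red'
query = bind([inverse(col)], s)
```

Because modular addition is reversible, `red` reappears bit-for-bit in `query`; the other item, `(15, 1, 8, 3)`, is the residue col⁻¹ * obj * car, which must not coincide with a vocabulary entry for the query to be unambiguous.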
We conclude this section by highlighting that our superposition operation is not length-preserving but our binding is when one of the operands consists of 1 basic item. Thus we describe our system as semi-holographic. Interestingly, this is the opposite of the classical convolution-based system from [12], where the binding operation is not length-preserving but superposition (element-wise average) is.
(c) Similarity metric
Let us define a distance. First, we use a 'circular distance' on Z/p: for a ∈ Z/p, one has $\mathrm{dist}_\circ(a, 0) = \min(|a|, |p - a|)$, where a also denotes the corresponding representative a = a + 0 · p in Z. For example, for 4 ∈ Z/5, $\mathrm{dist}_\circ(4, 0) = 1$. Analogously, one defines a distance $\mathrm{dist}_\circ(a, b)$ for any a, b ∈ Z/p as $\min(|b - a|, p - |b - a|)$. For two vectors a, b ∈ B, one defines the distance as:
One can note that dist(a + b, a) = 0 for any a ∈ B.
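A minimal sketch of the circular distance follows. Only `dist_circ` is taken directly from the definition above. The item-level distance (a sum of element-wise circular distances) and the chain-to-item distance (distance to the closest item in the chain) are assumptions made here for illustration, the latter chosen so that a chain containing a is at distance 0 from a.

```python
def dist_circ(a, b, p):
    """Circular distance on Z/p: min(|b - a|, p - |b - a|)."""
    diff = abs(a - b) % p
    return min(diff, p - diff)

def dist_item(a, b, p):
    """Assumed item distance: sum of element-wise circular distances."""
    return sum(dist_circ(x, y, p) for x, y in zip(a, b))

def dist_chain_item(chain, item, p):
    """Assumed chain-to-item distance: distance to the closest item in the chain."""
    return min(dist_item(c, item, p) for c in chain)

example = dist_circ(4, 0, 5)                             # 1, as in the text
wrapped = dist_circ(1, 15, 16)                           # 2: the short way round passes 0
member = dist_chain_item([(1, 2), (3, 4)], (1, 2), 16)   # 0: item is in the chain
```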
(d) Basic properties
In terms of fundamental mathematical properties: the superposition operation is not closed in general, but it acts as closed when our restriction on the sum of the ranks of the operands is met. It is associative but not commutative. It has an identity element (the empty string), but no inverse operation as such. The binding operation is not closed, but acts as closed when the restriction on the product of the ranks of the operands is met. This is always the case when one of the operands is a basic item, i.e. a ∈ B. If a is a basic item, then for any $b \in B^d$, we have commutativity: $a * b = b * a$. If $a \in B^{d_1}$, $b \in B^{d_2}$, $c \in B^{d_3}$, and at least one of $d_i = 1$, then we have associativity: $(a * b) * c = a * (b * c)$. In general, binding is neither associative nor commutative; modulo the action of the permutation group on basic item components, however, it has both properties.
Finally, one has distributivity in case of a basic item: for a ∈ B and chains b, c,
$$a * (b + c) = a * b + a * c.$$
In general, as above, this property no longer holds (unless we do not care about the order of terms and factorize by the action of permutation group).
The identity element is the zero element of B. All basic elements are invertible under binding. These properties form a good start for building a cognitive system.
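These properties can be checked numerically on small examples. The sketch below is self-contained and uses the same illustrative conventions as before (chains as lists of tuples, pairwise sums generated row by row, which is an assumed ordering); the specific vectors are arbitrary.

```python
P, Y = 16, 4   # illustrative parameters

def superpose(a, b):
    """Superposition as chain concatenation (rank limit omitted for brevity)."""
    return a + b

def bind(a, b):
    """Binding as pairwise element-wise sums modulo P."""
    return [tuple((u + v) % P for u, v in zip(x, y)) for x in a for y in b]

a = [(3, 14, 7, 1)]                                    # basic item (rank 1)
b = [(9, 2, 11, 6), (5, 5, 12, 8), (0, 13, 4, 10)]     # rank 3 chain
c = [(1, 1, 1, 1), (2, 0, 0, 0)]                       # rank 2 chain
zero = [(0,) * Y]                                      # identity for binding
a_inv = [tuple((-u) % P for u in a[0])]                # inverse of the basic item

commutes = bind(a, b) == bind(b, a)
associates = bind(bind(a, b), c) == bind(a, bind(b, c))
distributes = bind(a, superpose(b, c)) == superpose(bind(a, b), bind(a, c))
identity_ok = bind(zero, b) == b
inverse_ok = bind(a_inv, bind(a, b)) == b

# Two chains of rank > 1: same items, but in a different order.
same_order = bind(b, c) == bind(c, b)
same_items = sorted(bind(b, c)) == sorted(bind(c, b))
```

For these inputs every check involving the basic item `a` holds, while `bind(b, c)` and `bind(c, b)` agree only up to a permutation of their items.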
Capacity
In terms of higher-level properties, a key metric is memory capacity: the maximum number of basic elements storable given some minimum acceptable memory recall reliability. Each rank 1 semantic object (base item), the smallest type of independent semantic object, must be uniquely identifiable. As a result, there can be no more than $Q = p^y$ basic memories in total without guaranteeing at least one ambiguous recall, i.e. Q is the maximum memory capacity. However, an additional sparsity requirement is necessary in order to guarantee that the system is capable of unambiguously answering queries. Returning to the example from §2, in order for the term $col^{-1} * obj * car$ to be culled from any semantic pointer or filler from our vocabulary it should not coincide with a valid object from the fixed fundamental vocabulary. In order to achieve that, we may impose that our memory safely stores only up to $Q_s$ vocabulary objects, where $s \in \mathbb{R}$ is the desired sparsity factor, and the following formula holds:
A lower bound for s is given by calculating the number of basic items J that the system can generate given a set of $Q_s$ vocabulary items and allowed complexity. These will all need to be accommodated unambiguously for guaranteeing reliable recall. In our proposed system, the only operation that can generate basic items from combinations of vocabulary items is the binding operation. Therefore, for $Q_s$ vocabulary items, we obtain $Q_s^2/2$ derived items arising from all the possible unordered (to account for the commutativity) pairwise bindings. This rises to $Q_s^{\gamma}/\gamma!$ for exactly γ allowed bindings, and in general the system can generate:
$$J = \sum_{\gamma=0}^{\Gamma} \frac{Q_s^{\gamma}}{\gamma!} \tag{4.2}$$
basic items, if we allow anything between 0 and Γ bindings in total. Ideally, we want to account for all possible basic items from the fundamental vocabulary via bindings, so $J = Q\ (= p^y)$, and therefore we can transform equation (4.2) into:
revealing how expressivity is traded against capacity, at least in the absence of any further allowances to combat possible uncertainty in the encoding, decoding or recall of semantic objects. Whether this boundary can be reached in practice requires further study as the particular encodings of each basic item will determine whether specific bindings coincide with pre-learnt vocabulary or other bindings. Let us observe that the more binding is allowed in the system, the less fundamental vocabulary it can memorize (hint: lim x→∞
). This is an example of a trade-off between capacity and complexity.
Example: if we choose p = 16, y = 128 and we allow the system to have at most Γ = 20 bindings, then the upper bound on the length of the core dictionary we can encode is 422 million items.
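The figure quoted in this example can be reproduced with exact integer arithmetic. The sketch below assumes, following the counting argument above, that the number of generated basic items is the sum of $Q_s^{\gamma}/\gamma!$ for γ from 0 to Γ and that this sum may not exceed $p^y$; the function name `max_vocabulary` is a hypothetical helper.

```python
from math import factorial

def max_vocabulary(p, y, big_gamma):
    """Largest Q_s such that sum_{g=0..Gamma} Q_s**g / g! <= p**y."""
    q = p ** y
    scale = factorial(big_gamma)

    def generated_scaled(qs):
        # Gamma! times the number of generated items, kept as an exact integer
        return sum(qs ** g * (scale // factorial(g)) for g in range(big_gamma + 1))

    # Bracket the answer, then bisect: lo always satisfies the bound, hi never does.
    lo, hi = 0, 1
    while generated_scaled(hi) <= q * scale:
        hi *= 2
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if generated_scaled(mid) <= q * scale:
            lo = mid
        else:
            hi = mid
    return lo

q_s = max_vocabulary(p=16, y=128, big_gamma=20)   # roughly 4.22e8
```

The sum is dominated by its last term, so the result is close to $(\Gamma!\, p^y)^{1/\Gamma}$, which for these parameters is approximately 422 million.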
Additional semantic object manipulations
In order to complete the description of the proposed system, we need to cover two further issues: (i) How does the system cope with uncertainty? (ii) Since the system is semi-holographic, how does it map multi-item chains to single base items when necessary? In this work, we provide some cursory answers as these questions merit substantially deeper study in their own right.
Dealing with uncertainty: the implementation of de-noising will strongly depend on the form of the uncertainty present in the system. We may define uncertainty as a probability distribution that encodes how likely it is to obtain semantic object $\tilde{x}$ when, in fact, the ground truth is x. For example, if the probability density only depends on the 'circular distance' (equation (3.3)) between the x and $\tilde{x}$ objects, we may use an adaptation of element-wise average for de-noising. The average is computed as the mid-point along the geodesic. In particular, for a ∈ Z/p let us also denote by the same symbol a = a + 0 · p its representative in Z, and denote $\Delta = \mathrm{ceil}(\mathrm{dist}_\circ(a, b)/2)$. If the geodesic between a and b does not wrap around zero, then $\mathrm{avg}(a, b) = (a + \Delta) \bmod p$, where a is the smaller representative. If it does wrap around zero, the geodesic starts from the greater representative (say, it is b), and $\mathrm{avg}(a, b) = (b + \Delta) \bmod p$. In general, for items a, b ∈ B, we define the average as the element-wise average.
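One possible reading of this geodesic mid-point rule is sketched below. It is an interpretation rather than a definitive implementation: the case split (start from the smaller representative when the shortest arc does not pass through zero, from the greater one when it does) and the function names are assumptions.

```python
def dist_circ(a, b, p):
    """Circular distance on Z/p."""
    diff = abs(a - b) % p
    return min(diff, p - diff)

def avg_circ(a, b, p):
    """Mid-point of the shortest arc between a and b on Z/p (rounded up)."""
    lo, hi = min(a, b), max(a, b)
    delta = (dist_circ(a, b, p) + 1) // 2      # ceil(dist / 2) for integers
    if hi - lo <= p - (hi - lo):
        return (lo + delta) % p                # arc does not wrap around zero
    return (hi + delta) % p                    # arc wraps: start from the greater value

def avg_item(a, b, p):
    """Element-wise circular average of two base items."""
    return tuple(avg_circ(x, y, p) for x, y in zip(a, b))

direct = avg_circ(2, 6, 16)                    # 4
wrapped = avg_circ(1, 15, 16)                  # 0: mid-point lies across the wrap
item = avg_item((2, 1), (6, 15), 16)           # (4, 0)
```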
To this, we add the following observations: (i) the purpose of the de-noising average is to reconcile multiple, corrupted versions of a single semantic object vector, not to combine different vectors into new semantic objects (i.e. $a_i$ is expected to be reasonably close to $b_i$ most of the time). Nevertheless, when used with radically different semantic objects as inputs, the operation acts very similarly to binding. The effects of using a binding-like operation for de-noising (a task usually handled by superposition) are an interesting subject for further study. (ii) Different uncertainty descriptors (probability distribution functions) may lend themselves to different de-noising strategies. So will different metrics. (iii) Even with fixed underlying probability distribution assumptions, de-noising may be carried out using multiple alternative strategies. Examples applicable to our assumptions would be majority voting (select the element-wise mode instead of the mean, which works best for a large number of input sample terms) or median selection.
Compressing long chains into basic items: ideally any cognitive system should be able to take any expression and collapse it into a new memory that can be stored, recalled and used with the same ease as basic items. In our case, this requires compressing chains into the size of a basic item. In principle, any compression algorithm will suffice. Examples could be applying genetic algorithm-like methods [27] on the items of a chain or combining said items using any multiplication (e.g. circular convolution).
We conclude by remarking that the operation of creating a new semantic object can be reasonably expected to be executed orders of magnitude less frequently than any of the other operations. As such, it is possible to dedicate hardware that is both more complex (luxury of using relatively heavy computation) and more remotely located from the core of the semantic object processor (luxury of preventing the layout footprint of the semantic object generator from impacting the layout efficiency of the processor core).
Hardware implementation
In this section, we examine how the mathematical machinery can be mapped onto a hardware module which we call the 'cognitive processing unit' (CoPU). The system receives chains as input operands and generates new chains at its output after executing the requested superposition and/or binding operations. The CoPU is based on a common block-level design blueprint which can then be instantiated as specific CoPU designs. It is at the point of instantiating a particular CoPU design that the values of key parameters p, y, d are decided upon.
(a) Hardware system design
The proposed semi-holographic representation mathematical machinery can be implemented as a fully digital system in a very straightforward manner as shown in the block diagram of figure 2. The underlying set will be implicitly determined by the bit-width used. The inverses of each n-vector element under element-wise modular addition are simply their 2's complements. Full representation of any semantic object can, therefore, consist of d words, each $\log_2 p$ bits wide, plus x flag bits for tracking the number of items in any given chain.
The superposition operation can be handled by the hardware as 'APPEND' operations (akin to linked lists); the system need only know the operands and the state of their flag bits. In practice, this would be implemented as d 'SELECT' operations, which directly map onto a simple (l · n)-width multiplexer/demultiplexer (MUX/DEMUX) pair. A small digital controller circuit determines the appropriate, successive configurations of the MUX/DEMUX structure depending on the flag bits of the operands (see below). The same circuit also computes and sets the flag bits of the resulting chain. The hardware-level complexity of our proposed system can be contrasted with the standard element-wise addition approach, which requires n 'ADD' operations, each z = ceil(l) bits wide (cost: n z-bit adders, or one time-shared z-bit adder, or valid trade-off solutions in between). The binding operation can be carried out by n element-wise additions/subtractions (ADD/SUB), implementable as n z-bit ADD/SUB modules. Because of the modular arithmetic rules, overflow bits are simply ignored. The ADDSUB terminal of each module can directly convert one of the operands into its 2's complement inverse, as is standard. This is illustrated in figure 2b. The complexity of (a maximum of) n z-bit additions can be contrasted with the computational cost of circular convolution, which would involve $n^2$ multiplications and n · (n − 1) additions (= n · (n − 1) MACs + n multiplications). On top of this, the additional hardware cost of shifting a chosen operand of the circular convolution n times in its entirety must also be considered.
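The arithmetic that each ADD/SUB module performs can be mimicked in software. The sketch below assumes $p = 2^l$ with l = 8, so that reduction modulo p amounts to discarding the overflow bit; the helper names and the operation-count functions are illustrative and simply restate the counts given in the text.

```python
L_BITS = 8
MASK = (1 << L_BITS) - 1      # p = 2**l, so "mod p" is a bit mask

def add_mod(a, b):
    """l-bit addition; the overflow (carry-out) bit is simply dropped."""
    return (a + b) & MASK

def twos_complement(a):
    """Inverse of a under l-bit modular addition."""
    return (~a + 1) & MASK

def bind_cost_modular(n):
    """Operation count for binding n elements by modular addition."""
    return {"additions": n, "multiplications": 0}

def bind_cost_circular_convolution(n):
    """Operation count for a direct n-point circular convolution."""
    return {"additions": n * (n - 1), "multiplications": n * n}

total = add_mod(200, 100)               # 300 mod 256 = 44
undone = add_mod(total, twos_complement(100))   # back to 200
```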
Finally, the design is completed by a controller unit that orchestrates the operation of the entire system. The unit: (i) instructs the arithmetic-logic unit (ALU) what operation to execute (ADD/SUB signal) and when (EN signal), at the behest of a request signal (RQ), (ii) is informed by the ALU when the input operands are equal (EQ); useful for e.g. branch-equal-type assembly-level operations, (iii) controls all multiplexers, (iv) internally executes the flag arithmetic and (v) outputs an operation termination flag (done). Shift register buffers capture the output of the CoPU and latch it for further use.
Naturally, alternative hardware implementations are also possible. This might include fully analogue ones, e.g. using analogue multiplexers for superposition and current-steering-based binding [28]. Alternatively, it might include 'packet'-based ones where chains are packaged into e.g. TCP-like (transmission control protocol) packets and communicated across an internet-like router structure. Each packet could contain a header detailing the number of items within the packet and a payload, a technique similar to the protocol used in neuromorphic systems communications over the internet [29]. The proposed implementation is chosen because it naturally maps onto easily synthesisable digital hardware. The most efficient implementation technique in any given system, however, will naturally depend on the rest of the system, e.g. on whether the broader environment operates mainly in analogue or digital.
(b) CoPU: further details and performance evaluation
The CoPU from figure 2 has been designed in Cadence using TSMC's 65 nm technology for the purposes of performance evaluation. The CoPU used: l = 8, y = 1, d = 8 (table 1) . Performance was assessed in terms of power efficiency and transistor-count (proxy for area footprint).
(i) Power performance
The CoPU was assessed for power dissipation when: (i) executing a 4-item × 2-item binding operation, (ii) executing an 8-item superposition and (iii) in the idle state. In all cases, total system power dissipation figures include: (a) the internal power consumption of the system proper, (b) the energy spent by minimum-size inverters in order to drive the signal (semantic object) inputs and (c) the consumption of the output register buffers. For both superposition and binding, estimated worst-case figures are given. For superposition, the worst case is expected to be obtained when transferring the 'all elements = 1' (all-1) item into locations where the 'all-0' item was previously stored. This is because all bits in both input drivers and output buffers will be flipped by the new input. Furthermore, for our tests, the entire system was initialized so that every node started at voltage 0 (GND), which means that the parasitic capacitances from input MUX to output register buffers also needed to be charged to logic 1. In binding, as for superposition, the system is initialized with all inputs (and also outputs) at logic 0. The worst case is expected to be given when adding two all-1 items. This is because all inputs and all outputs bar one need to be changed to logic 1. For example, going from the state 0000 + 0000 = 0000 to 1111 + 1111 = 1110 requires us to flip all 8 input bits and 3/4 output bits. Additionally, we opted for a 4 × 2-item binding in order to capture the worst case in handling the flag bits as well (for a binding operation performing a total of eight 1 × 1-item suboperations). In both cases, a 20 ns clock period (50 MHz) was used and each operation lasted 9 clock cycles.
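The bit-flip counts behind the worst-case binding argument can be verified with a few lines of Python. This is only a counting aid, not a power model; the helper names are hypothetical.

```python
def bit_flips(before, after, width):
    """Number of bit positions that differ between two width-bit words."""
    mask = (1 << width) - 1
    return bin((before ^ after) & mask).count("1")

def worst_case_binding_flips(width):
    """Flips when going from 0 + 0 = 0 to all-1 + all-1 (overflow dropped)."""
    mask = (1 << width) - 1
    all_ones = mask
    result = (all_ones + all_ones) & mask          # e.g. 1111 + 1111 = 1110
    input_flips = 2 * bit_flips(0, all_ones, width)
    output_flips = bit_flips(0, result, width)
    return input_flips, output_flips

four_bit = worst_case_binding_flips(4)    # (8, 3): all inputs, 3 of 4 outputs
eight_bit = worst_case_binding_flips(8)   # (16, 7): 7 of 8 outputs for l = 8
```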
The performance figures indicate a power breakdown as summarized in table 2. Internal dissipation refers to the power consumed by the system shown in figure 2a, excluding the shift register buffers. Driver dissipation is the consumption of the inverters driving the inputs to the system (not shown in figure 2a ). Register dissipation refers to the buffer registers. Cycles/operation refers to how many clock cycles it takes to conclude the corresponding operation for each full item.
The figures in table 2 indicate that most of the power is dissipated in registering the outputs (greater than 50%). Next is the internal power dissipation, most of which occurs in the control module (≈1.6-1.7 pJ). We further note that superposition and binding cost similar amounts of energy though their internal breakdown is slightly different. The lower buffer register dissipation in binding (we only flip 7/8 bits at the output in our estimated worst case) is counterbalanced by an increase in energy expenditure for computing the sum of the operands (added internal dissipation). Finally, static power dissipation was calculated at ≈82.5 nW.
(ii) Transistor count
The transistor count for the overall system and its sub-components is summarized in table 3. We note that the data-path part of the system, which includes the MUX/DEMUX trees and ALU only requires 880 transistors. This means 110 transistors/bit of bit-width, of which 42 in the ALU and 68 in the MUX/DEMUX trees. In larger designs supporting longer item chains the multiplexer tree becomes deeper and adds extra transistors. We conclude with some observations: the CoPU can be constructed using relatively few, simple and standard electronic modules that are all very familiar to the digital designer. The relative costs of both basic operations of superposition and binding are also very similar, in contrast to the large energy imbalance between multiplication and addition carried out using conventional digital arithmetic circuits. Next, we note that the proposed architecture lends itself naturally to speed/complexity trade-offs. First, 2 · d DEMUX trees could be implemented in order to allow up to d items to be transferred simultaneously to any location of the output chain. Second, d ALUs could be arrayed in order to perform up to d × 1-item bindings in a single clock cycle. Naturally, the increased parallelism would result in bulkier, more power-hungry system versions. Finally, we remark that systems using smaller l in exchange for larger y will in principle be implemented by larger numbers of lower bit-width ALUs operating in parallel. This may simplify the handling of the carry and improve speed (certainly in ripple carry-based designs).
Simple example of operation
In order to help visualize the proposed processor in operation, let us explain how it would fit within a cohesive system and then present a brief, but representative example. A possible instantiation of a broader system is sketched in figure 3 . The processing core is surrounded by additional buffers, inputs from ANNs (preprocessors), a large associative memory storing the knowledge base of the system and an 'orchestrator' centre that is responsible for the control flow. These components can be understood as equivalents of registers, input devices, a hard disk, the instruction register and other hardware systems that typically surround the traditional central processing unit (CPU) found in every computer. Just as an isolated CPU is not much use, so does the CoPU require a full-fledged environment in order to perform meaningful computation. The details of all complementary components that surround the CoPU lie outside the scope of this work, but we shall quickly examine their intended operation as used in our toy example.
For the purposes of this example: the buffers store semantic objects on a temporary basis. Much like data registers, we can load semantic objects into them from memory, sensory inputs or from the CoPU and likewise direct their contents back to memory or to the inputs of the CoPU. The associative memory can: (a) receive a distorted version of a semantic object and restore it to a 'clean' version; (b) return the relation(s) between two semantic objects, e.g. if queried with the cues 'elephant' and 'size' it will return 'large' as a response. The orchestrator executes the control flow for the programme. Finally, the system is assumed to 'be aware' of an important principle of logical thinking: objects inherit properties from their super-set classes, e.g. if queried with 'What is the colour of a cumulus?' it should be able to replace 'cumulus' with 'cloud' and try answering again. Overall, we observe that the system is reminiscent of a massively parallel (hypervector-based) but otherwise broadly standard SIMD (Single Instruction, Multiple Data) computer whose operation can be naturally described by an assembly-like dialect.

Now let us set up the example task. The system is asked: 'What is the dollar of Japan?'. The system's knowledge base is assumed to include all necessary items, for example: 'currency = superset*dollar' and 'Yen = Japan*currency', i.e. it knows that the dollar is a subset of the concept of a currency, that the currency of Japan is, in fact, the Yen and all other relevant facts. Under these conditions a possible succession of steps used by the overall 'cognitive' system to solve the query is shown in table 4.
The system begins by loading the keywords of the query into its buffers INA and INB (step 1). Then it uses the CoPU to bind them together into a query (s2) which is subsequently sent to the memory for association extraction (s3). The memory returns a miss indicating no direct association between the semantic objects 'Dollar' and 'Japan', so the orchestrator uses prior knowledge to attempt a new solution. It asks the memory to fetch the concept of a super-set (s4), uses the CoPU to create a new query (s5) and then sends that into memory (s6). The memory returns the association 'Currency' and the orchestrator reformulates the question as 'What is the currency of Japan?' (s7). The CoPU forms the query (s8) and sends it to the memory (s9), which, in turn, returns the answer 'Yen'.
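The succession of steps above can be mimicked in a few lines of Python. This is only a minimal sketch under stated assumptions: items are random vectors of Y elements with P = 2^L_BITS states each (the values 8 and 8 are illustrative), binding is element-wise modular addition, and the associative memory is replaced by an exact-match dictionary that cannot clean up distorted inputs. The names `bind`, `recall` and `answer` are hypothetical and do not correspond to the instructions listed in table 4.

```python
import random

L_BITS = 8          # bits per element (assumed for illustration)
Y = 8               # elements per item (assumed for illustration)
P = 2 ** L_BITS     # states per element


def random_item(rng):
    """Draw a random base item: Y elements, each in [0, P)."""
    return tuple(rng.randrange(P) for _ in range(Y))


def bind(a, b):
    """Bind two items by element-wise modular addition (no multiplication)."""
    return tuple((x + y) % P for x, y in zip(a, b))


rng = random.Random(0)
names = ["dollar", "japan", "superset", "currency", "yen"]
item = {name: random_item(rng) for name in names}
name_of = {vec: name for name, vec in item.items()}

# Toy knowledge base: bound cue -> stored answer.
# 'currency = superset*dollar' and 'Yen = Japan*currency'.
memory = {
    bind(item["superset"], item["dollar"]): item["currency"],
    bind(item["japan"], item["currency"]): item["yen"],
}


def recall(query):
    """Exact-match stand-in for the associative memory; None marks a miss."""
    vec = memory.get(query)
    return name_of[vec] if vec is not None else None


def answer(subject, context):
    """Try the direct query; on a miss, retry with the super-set of `subject`."""
    trace = []
    direct = recall(bind(item[subject], item[context]))        # steps 1-3
    trace.append(direct)
    if direct is not None:
        return direct, trace
    parent = recall(bind(item["superset"], item[subject]))     # steps 4-6
    trace.append(parent)
    if parent is None:
        return None, trace
    result = recall(bind(item[parent], item[context]))         # steps 7-9
    trace.append(result)
    return result, trace


result, trace = answer("dollar", "japan")
print(result, trace)
```

The trace records one entry per memory access: a miss for the direct query, then the super-set lookup, then the reformulated query. Because modular addition commutes, the order of the two cues in each call to `bind` does not matter.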
At this point, it is absolutely imperative to note the following: (1) The sequence of events in table 4 is only one of many possible solutions that a cognitive system may attempt, just as humans try different options when solving, e.g., an algebraic manipulation problem. (2) Furthermore, depending on the organization of the surrounding system the succession of events may be entirely different and even use entirely different basic instructions. 10 This means that the CoPU will 'speak' a different dialect of assembly; however, its basic function remains unchanged: execute superposition and binding operations.

Table 4. Example of how a functionally complete cognitive system may use the CoPU to successfully complete a simple task. BUFF1, BUFF2, ..., RET are all semantic object buffers (figure 3). Rows where the CoPU is called into action are highlighted in pale red. Blank spaces indicate 'do not care' or purged buffers. 'X' indicates a memory 'miss', in other words, no match was found. Descriptions in quotation marks are intended to render the operation of the machine relatable to everyday human thinking processes. This is just one possible solution (succession of steps).
(a) Performance benchmarking and comparison
In order to benchmark the performance of the system (and understand the role of the CoPU in improving it), we need to examine the system at three different levels of abstraction:

Level 1: symbol manipulation level. At this level, we examine the sequence of steps in table 4 and compare it with other solutions. This is highly dependent on how the entire system is set up, including the numbers and connectivities of buffers, the capabilities of the orchestrator (if indeed there is such a component) and the structure (and consequently capabilities) of the associative memory. While study at this level is considered outside the scope of this paper, it is worth noting that an efficient CoPU will lead to more significant performance improvements in system architectures designed to make frequent use of it.

Level 2: mathematical computation level. At this level, we examine the arithmetic operations taking place within the CoPU during execution and compare their numbers and types. In this particular example, we have three single item-single item bindings (at steps 2, 5 and 8 of table 4). Our CoPU achieves this using a total of 3y modular additions (please refer to table 1). By contrast, a traditional holographic representation system would require a total of $3(yd)^2$ multiplications and $3(yd)(yd-1)$ additions (recall §6). Thus, from a purely mathematical perspective, the proposed CoPU requires far fewer operations of either type. This is attributable partly to the compression achieved by opting for a semi-holographic representation and partly to the replacement of mathematical operations relative to the standard holographic representation system: multiplication was replaced with modular addition and addition was replaced with simple routing of results (affecting the third level; see below).
Note: in hardware, results from any operation need to be routed from an input buffer to an output buffer, hence we do not consider the MUX/DEMUX operation used for superposition as a mathematical operation within the context of this particular level.

figure 3 can in principle be the same with any other CoPU design. We note, however, that while we have made a conscious effort to render our CoPU as generally applicable as possible, full system design will require co-optimization between all components of the system; then the system can be benchmarked as a fully functioning whole.
In our example, the energy dissipation is expected to be ≈18 pJ (3× single-item bindings). For comparison, let us investigate the energetic cost of an 8-bit multiplication. This can be carried out by asking our system to act as a Wallace tree 8 × 8 multiplier [30]. For that, we would need approximately 64 single-bit full adder activations. Considering that the cost of a single-item binding (equivalent to 8× 1-bit full adder activations) is given in table 2 as ≈5.8 pJ (note: this is a worst-case result), the cost of the 8-bit multiplication would be in excess of 46 pJ. This excludes the AND gates required by the Wallace tree input and any overheads in moving data in between steps. Thus, in the standard holographic approach our energy requirements for executing the example problem would be:

$$E = 3\left((yd)^2 \cdot 46\,\mathrm{pJ} + yd(yd-1) \cdot 5.8\,\mathrm{pJ}\right) \approx 635.4\,\mathrm{nJ} \qquad (7.1)$$
We note that this improvement in performance is almost exclusively a result of the choice of arithmetic operations used to execute binding (simple modular addition instead of circular convolution) and the fact that we operate on compressed representations (we operate on length y items, as opposed to full length yd chains).
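The arithmetic behind these figures can be checked with a short Python sketch. The per-operation energies are the ones quoted in the text; the chain length yd = 64 is an assumption, chosen because it reproduces the quoted total of ≈635.4 nJ. All function names are hypothetical.

```python
E_BIND_PJ = 5.8            # worst-case single-item binding energy, as quoted from table 2
E_MULT_PJ = 46.0           # lower bound used in the text for one 8-bit multiplication
FULL_ADDERS_PER_MULT = 64  # approx. full-adder activations in an 8 x 8 Wallace tree
FULL_ADDERS_PER_BIND = 8   # full-adder activations in one single-item binding
N = 64                     # assumed full chain length y*d (inferred, not stated here)


def multiplication_energy_pj():
    """Energy of one 8-bit multiplication, scaled from the binding energy."""
    return FULL_ADDERS_PER_MULT / FULL_ADDERS_PER_BIND * E_BIND_PJ


def copu_energy_pj(bindings=3):
    """Energy of the example on the CoPU: three single-item bindings."""
    return bindings * E_BIND_PJ


def holographic_energy_pj(n=N, bindings=3):
    """Energy of the example with circular convolution on length-n vectors."""
    multiplications = n * n
    additions = n * (n - 1)
    return bindings * (multiplications * E_MULT_PJ + additions * E_BIND_PJ)


print(f"one multiplication : {multiplication_energy_pj():.1f} pJ")
print(f"CoPU, 3 bindings   : {copu_energy_pj():.1f} pJ")
print(f"holographic system : {holographic_energy_pj() / 1000:.1f} nJ")
print(f"ratio              : {holographic_energy_pj() / copu_energy_pj():.0f}x")
```

Under these assumptions the three bindings cost about 17.4 pJ, consistent with the ≈18 pJ quoted above, and the holographic estimate comes out at about 635.4 nJ.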
In terms of actual hardware design, we have been conservative and designed the CoPU in 65 nm technology. The best state-of-the-art computing system we could find is the Google TPU [31], which is quoted at 40 W system power dissipation while running at 92 TOPS at 8-bit precision, thus yielding an energy figure of just 0.43 pJ/FLOP, which would be equivalent to 0.43 pJ/binding in our system. The TPU is designed in 28 nm CMOS. With appropriate hardware/layout optimizations and downscaling, the CoPU might achieve similar energy performance at the hardware implementation level as well, thus compounding the mathematical advantage. For completeness, NVidia's Xavier SoC quotes 1 pJ/FLOP at 8-bits 11 and Graphcore's IPU quotes ≈1.3 pJ/FLOP at unknown precision (presumed 8-bit). 12

In terms of transistor count, the system will need to use 1/8 of its output registers (since it only handles single-item operations in this example), which corresponds to 288 transistors. Everything else remains the same, yielding a total of 2366 transistors actively used by the CoPU during the execution of this task. Other systems will either require more transistors in order to accommodate the multipliers (1952 transistors for the 8-bit multiplier demonstrated in [32]) or slow down execution by reusing the same circuitry to break down the multiplication operations into additions (as the Wallace tree architecture does).
Discussion
The starting point of this work is the observation that any system consisting of a length n vector with p states per element (corresponding to some fixed number of digital signal lines) can only represent $p^n$ uniquely identifiable vectors. This is effectively a hardware resource constraint and imposes a number of trade-offs warranting design decisions.
Trade-off 1-expressivity versus capacity: in the classical holographic representation systems, all semantic object vectors are of equal length no matter how many times semantic objects are combined together through superposition or binding. By contrast, in our proposed system, some objects will be base items and others will be chains of various lengths. This introduces some constraints on which combinations of semantic objects are allowable, yet the system retains the capability of representing $p^n$ states overall. This seems to be a manifestation of a general trade-off in which a system can either:
(i) operate on relatively few basic semantic objects (objects stored in memory as meaningful/significant) but allow many possible combinations between them, i.e. be expressive but low capacity; or
(ii) operate on relatively many basic semantic objects but only accommodate certain possible combinations between them. This is the regime in which our proposed system operates.
We note that the question of the optimum balance between expressivity and capacity is highly complex and requires further study in its own right. In our proposed system capacity and expressivity are to some extent decoupled: p, y affect capacity and expressivity in a trade-off manner while d affects only capacity.
Trade-off 2-'holographicity' versus compression: cognitive systems can be conceived at different levels of 'holographicity' as determined by the percentage of operations that are operand length-preserving. For fixed maximum semantic object length, the choice lies between the extreme of always using the full length of n elements in order to represent every possible semantic object (full-holographic), or allowing some semantic objects to be shorter (non-holographic). This significantly impacts the amount of information each numerical element carries. In a fully holographic representation transmitting or processing even a single-item-equivalent semantic object requires handling of n elements; the same as transmitting/processing the equivalent of a long chain. The semantic information per element may dramatically differ in each situation. In our proposed system, however, superpositions of fewer items are represented by shorter chains. This illustrates how less holographic systems generally offer the option of operating on more compressed information, i.e. closer to the signal-to-noise ratio (SNR) limit. As a result, we may speculate that using a CoPU as proposed would allow the associative memory shown in figure 3 to be designed with a memory capacity smaller by a factor of d than in the case of a fully holographic system. However, we note that in principle the required memory capacity should correspond only to the size of the vocabulary to be stored. Any differentiation would arise from different sparsity requirements. The comparison between the required sparsity of a fully holographic system and our proposed compressed representations is not yet entirely clear and requires dedicated study.
Naturally, there is a price to pay for compression: when creating new semantic objects for storage it is extremely useful if these new objects can be mapped onto minimum-length units (the semantic object basis of any cognitive system). Mechanisms for mapping any arbitrary chain onto such units need to be supported, adding to system complexity. Furthermore, in a non-holographic system, any circuitry designed to support the last items of a chain may be used only infrequently. This is expected to strongly affect hardware design decisions.
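The contrast between fully holographic and compressed representations discussed under trade-off 2 can be made concrete by counting the elements that must be handled per semantic object. The sketch below is a simplification under stated assumptions: superposition is modelled as appending an item to a chain, the values Y = 8 and D = 8 are illustrative, and the function names are hypothetical.

```python
Y = 8   # elements per item (assumed for illustration)
D = 8   # maximum items per chain (assumed for illustration)


def superpose(chain, item, d=D):
    """Append one item to a chain (a tuple of items), up to d items."""
    if len(chain) >= d:
        raise ValueError("chain is already at its maximum length d")
    return chain + (item,)


def elements_semi_holographic(chain):
    """Elements handled when a chain is only as long as its contents."""
    return sum(len(item) for item in chain)


def elements_fully_holographic(y=Y, d=D):
    """Elements handled when every object occupies the full n = y*d elements."""
    return y * d


item = tuple(range(Y))
chain = ()
sizes = []
for _ in range(D):
    chain = superpose(chain, item)
    sizes.append(elements_semi_holographic(chain))

print("semi-holographic :", sizes)
print("fully holographic:", [elements_fully_holographic()] * D)
```

Only a full chain of d items reaches the element count that a fully holographic system handles for every object, including a single item.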
Trade-off 3-long vectors with few states per element versus short vectors with many states per element: if we have a fixed number of binary lines (i.e. l · y = C), we have a choice of treating C as either: (i) one single, large identifier number, (ii) a collection of binary bits independent of one another or (iii) certain possibilities in between. For example, for C = 16 we can have {l, y} ∈ {(1, 16), (2, 8), (4, 4), (8, 2), (16, 1)}. The number of states we can represent remains fixed at $2^{ly}$, but:
-The distance relationships between semantic objects will be different in each case. In the case (1, 16), our item consists of a vector of 16× 1-bit elements, and therefore there are 16 nearest neighbours for each item (all items that differ from the base object at exactly one position). In the case (16, 1), our item is a single 16-bit number which has exactly two nearest neighbours (the elements/items different from the base object by one unit of distance). Note that the case (1, 16) corresponds tightly to the spatter code system proposed by Kanerva [14] since modular addition now reduces to a simple XOR.
-The degree of modularity achievable in hardware may be impacted in each case. The (1, 16) case requires 16× XOR gates in order to perform one item-item binding while the (16, 1) case requires a single 16-bit adder. In the case of large values of C there may be an additional impact on speed (how viable is it to make a 512-bit adder that computes an answer in one clock cycle/step? 512× XOR gates, on the other hand, will compute 512 outputs in one step). This subject requires further, dedicated study.
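The two observations above can be checked numerically. The following Python sketch enumerates the (l, y) configurations for a given C, counts nearest neighbours and confirms that modular addition with l = 1 coincides with XOR. It assumes that 'one unit of distance' means a change of ±1 in a single element with wrap-around modulo 2^l; that choice of metric, like the function names, is an assumption made for illustration.

```python
import random


def configurations(c):
    """All (l, y) pairs of positive integers with l * y == c."""
    return [(l, c // l) for l in range(1, c + 1) if c % l == 0]


def bind(a, b, l):
    """Element-wise addition modulo 2**l."""
    p = 1 << l
    return tuple((x + y) % p for x, y in zip(a, b))


def nearest_neighbours(item, l):
    """Items differing from `item` by one unit (mod 2**l) at exactly one position."""
    p = 1 << l
    result = set()
    for i, x in enumerate(item):
        for step in (1, -1):
            result.add(item[:i] + ((x + step) % p,) + item[i + 1:])
    return result


for l, y in configurations(16):
    base = (0,) * y
    print(f"(l, y) = ({l}, {y}): {len(nearest_neighbours(base, l))} nearest neighbours")

# With l = 1, binding by modular addition is the same as bitwise XOR.
rng = random.Random(1)
a = tuple(rng.randrange(2) for _ in range(16))
b = tuple(rng.randrange(2) for _ in range(16))
print(bind(a, b, 1) == tuple(x ^ y for x, y in zip(a, b)))
```

For l = 1 the steps +1 and -1 lead to the same neighbour, which is why the (1, 16) case yields 16 neighbours rather than 32.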
Trade-off 4-operation complexity versus property attractiveness: as a rule of thumb operations with more attractive mathematical properties tend to introduce computational and implementational difficulties. This is perhaps well exemplified by examining different binding operations:
-convolution commutes, 'scrambles' the information well 13 and preserves information. However, it lengthens the vectors that it processes and it is computationally heavy (many MACs);
-circular convolution commutes and scrambles. Lengthening no longer occurs, but information is lost and the operation is still heavy on MACs;
-modular arithmetic commutes. Lengthening does not occur and the operation is MAC-lightweight, but information is lost and the scrambling properties are similar to those of superposition by element-wise addition, so the similarity requirements for defining two semantic objects as corrupted versions of each other have to be substantially tightened.
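The length and cost properties of the three binding operations listed above can be illustrated with a small pure-Python sketch. The convolutions are written in their direct (non-FFT) form, which is an illustrative choice rather than the implementation used in any particular system, and the input vectors are arbitrary.

```python
def convolve(a, b):
    """Linear convolution: output length is len(a) + len(b) - 1."""
    out = [0] * (len(a) + len(b) - 1)
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            out[i + j] += x * y          # one multiply-accumulate (MAC)
    return out


def circular_convolve(a, b):
    """Circular convolution of equal-length vectors: length is preserved."""
    n = len(a)
    out = [0] * n
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            out[(i + j) % n] += x * y    # still n * n MACs
    return out


def modular_add(a, b, p):
    """Element-wise addition modulo p: length preserved, no multiplications."""
    return [(x + y) % p for x, y in zip(a, b)]


a, b = [1, 2, 3], [4, 5, 6]
print(convolve(a, b))            # [4, 13, 28, 27, 18]
print(circular_convolve(a, b))   # [31, 31, 28]
print(modular_add(a, b, 8))      # [5, 7, 1]
```

All three operations commute, but only linear convolution lengthens its output. In this direct form both convolutions need n · n multiplications for length-n operands, whereas modular addition needs none.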
Ultimately, a complex mix of factors/specs in all trade-off directions will determine the best cognitive system implementation. This may depend on the overall cognitive capabilities required of the system. In this work, we have focussed on a partially holographic system based effectively on multiplexing and addition as the system operations. The advantage of this implementation versus the holographic approach that we have used as standard and inspiration is that both operations have been simplified in hardware: superposition became a multiplexing operation instead of addition while binding became element-wise addition instead of circular convolution. The balance of these advantages versus the attributes that had to be traded away (mathematical elegance, full holographicity, etc.) needs to be considered very carefully. In general, however, the system is designed for occasions where partially restricted expressivity is acceptable (notably a cap on chain length, i.e. on the effective number of successive superpositions allowed) in exchange for extreme implementational simplicity and high energy efficiency.
Finally, we envision that our proposed CoPU will form a core component of larger systems with cognitive capability. Much like in a traditional computer, our CPU-equivalent will need a memory to which it can communicate as well as peripheral structures. Work in that general direction has very recently begun to gain traction [18, 33] . Relating this back to biological brains we see the closest analogue of our CoPU in the putative attentional systems of the brain; the contents of the input buffers at any given time could be interpreted as the semantic objects in the machine's 'conscious attention'. Importantly, within this more complex architecture the CoPU remains an excellently modularized component allowing expansion of its raw computational power: because it processes hypervectors in a manner specifically designed to limit the interactions between neighbouring elements of each hypervector it is straightforward to expand it to longer hypervectors (vectors with larger y and d -more elements/item and items/chain respectively). In hardware parlance, it is an example of 'SIMD par excellence'. Similarly, extending the principle to larger numbers of p (higher vector element valences) is also quite straightforward, though the mathematical implications need to be considered carefully (notably how it may affect the distance relations between items). At a higher level, it is entirely possible that the CoPU will be eventually upgraded with additional functionality allowing it to support different operations that will turn out to be useful for cognition, e.g. hypervector barrel shifts. This would be akin to expanding the instruction set of our SIMD processor.
In conclusion, we envisage that future thinking machines will be complex systems consisting of multiple, heterogeneous modules including ANNs, memories (bioinspired or standard digital look-up tables), sensors, possibly even classical microprocessors and more; all working together
