Code placement techniques for instruction code have been shown to increase an SOC's performance, mostly due to increased cache hit ratios, and as such those techniques can be a major optimization strategy for embedded systems. Little has been investigated on the interdependencies between code placement techniques and interconnect traffic (e.g. bus traffic), or on optimization techniques combining both. In this paper we show, as the first approach of its kind, that a carefully designed known code placement strategy, combined with and adapted to a known interconnect encoding scheme, not only leads to a performance increase but also to a significant reduction of interconnect-related energy consumption. This becomes especially interesting since future SOC bus systems (or, more generally, "networks on a chip") are predicted to be a dominant energy consumer of an SOC. We show that a high-level optimization strategy like code placement and a lower-level optimization strategy like interconnect encoding are NOT orthogonal. Specifically, we report cache miss reduction ratios of 32% on average combined with bus-related energy savings of 50.4% on average (with a maximum of up to 95.7%) by means of our combined optimization strategy. The results have been verified by means of diverse real-world SOC applications.
Introduction
The advent of silicon technologies will lead to Systems on Chip (SOCs) that will reach one billion transistor designs within the next few years. A major reason preventing the integration of several hundred million transistors on a single chip (indeed, this would be possible through today's mainstream 0.13µ silicon technologies and wafer technologies) is the energy dissipation problem. The heat energy generated per square mm and per unit time can hardly be carried off-chip without substantial (i.e. costly) effort, which prevents those designs from finding their way into mainstream consumer products. An additional constraint is implied by mobile computing/communication/entertainment devices, which draw their current from capacity-limited batteries. The problem has been addressed at various levels of abstraction, starting from new silicon technologies, through gate-level, RT-level and architectural-level, and eventually to system-level approaches. However, it can be observed that many proposed approaches are orthogonal (at least as far as many methods within a certain abstraction level are concerned). In other words, power saving/reducing methods are often designed without having complementary methods in mind, thus complicating or even preventing the effective implementation of one or more other power saving methods. There is currently promising evidence from new research at the system level that the tuning of parameters of various system parts can lead to substantial power savings. Work in this direction is typically carried out at a high level of abstraction and thus cannot capture subtle architectural characteristics. (Table 1)
In this paper we present the first approach that combines,

. Thus a bus encoding mechanism is needed to amplify the power savings in the processor and memory. This encoding leads to ultra-low bus power consumption combined with an effectively increased bus bandwidth that yields a higher system performance as well (we will later on explain why bus power and interconnect power in general will be a major contributor to the system power consumption in future silicon technologies). We achieve interconnect energy savings of 50.4% compared to the case where a single method is applied and thus report energy savings that top all other known stand-alone energy saving techniques addressing SOC interconnect.
The paper is structured as follows: Section 1.1 gives an introduction to the related work in both code placement and bus encoding. Our techniques for code placement and bus encoding are then introduced in detail in Sections 2 and 3, respectively. The validation environment is given in Section 4, the experimental part follows in Section 5, and finally conclusions are drawn in Section 6.
Related work
In the following we report on the most relevant work on the two areas of code placement techniques and bus encoding techniques. Both areas are crucial to our approach which is the first one to combine, adapt and optimize two previously separately treated low energy methods to achieve an ultra-low energy dissipating interconnect for SOC designs. Several articles have appeared in recent literature about reducing cache misses by reorganizing data or instructions in the cache. The work on cache misses has predominantly concentrated on data cache optimisations [4] [13] [12] . Hwu et al, in [1] , McFarling in [2] and [3] , Chow in [7] , Tomiyama and Yasuura in [11] , Kirovski et al in [10] , Kirk in [8] and [9] , Li and Wolf in [5] , and Parameswaran in [15] have given various methods to reduce instruction cache misses in microprocessor based systems. All of these systems are at the function level.
In [27] , algorithms were presented in order to reduce the total cache misses at the assembler block level.
In recent work it has been recognized that inter-wire capacitances increasingly contribute to the power consumed on a bus system. Various approaches have tried to address this problem and bus-related power consumption in general. Initial work on bus encoding was conducted by Stan/Burleson [16]. The basic idea is to transfer an inverted word through the bus whenever doing so reduces the Hamming distance between a word and its previous word. Later, in [17], they introduced Limited-Weight Codes (LWC) for low power encoding and provided optimal statistical performance for random data. The above schemes belong to the class of space-time redundant encoding, where bus sizes are augmented. While the above encodings were developed for random input, researchers started to address data source properties. Panda/Dutt [20] developed a scheme to map arrays in memory for reducing energy on the address bus. Exploiting the characteristic that consecutive memory accesses tend to have consecutive addresses, Mehta et al. introduced Gray code for the address bus [21]. To further reduce the energy on an address bus, Benini et al. proposed a prediction scheme taking the high regularity of data on address buses into consideration [23]. E. Musoll et al. [18] proposed the WZE (Working-Zone Encoding) scheme to exploit locality of memory reference. Meanwhile, theoretical approaches for bus encoding were developed. In [25], Ramprasad et al. started to use a general communication model to analyze bus-encoding schemes, giving lower bounds on average signal transition activities. The work introduced so far is focussed on reducing transition activities on a bus based on the assumption that the inter-wire parasitic capacitances are negligible. However, with the advent of deep sub-micron technology, inter-wire parasitic capacitances become a major issue. Sotiriadis/Chandrakasan stated in [22] that simply minimizing transitions might not lead to optimal power reduction.
They developed a model to incorporate the inter-wire capacitances of a bus and search the code space to find the best codes for inputs based on their bus power model. In the work conducted by Kim et al. [24] , two new schemes have been introduced for low power buses. Schemes addressing data properties for deep sub-micron technologies are proposed by Henkel/Lekatsas [19] through re-arranging bus lines and then applying local bus invert. Macchiarulo et al. [26] have shown that the layout of an address bus can be arranged for low power consumption.
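As a point of reference for the encoding schemes surveyed above, the bus-invert idea attributed to [16] can be sketched in a few lines. This is a minimal illustration, not the cited authors' implementation: the function names, the 8-bit default width and the decision to ignore the extra invert line when counting transitions are all assumptions made for brevity.

```python
def hamming_distance(a, b):
    """Number of bit positions in which the two words differ."""
    return bin(a ^ b).count("1")


def bus_invert(prev_bus, word, width=8):
    """Return (bus_word, invert_flag) for a classic bus-invert step.

    Assumption: the word is sent inverted whenever more than half of the
    bus lines would otherwise toggle; the invert line itself is not counted.
    """
    mask = (1 << width) - 1
    if hamming_distance(prev_bus, word) > width / 2:
        return (~word) & mask, 1
    return word, 0


# Sending 0xFF after 0x00 would toggle all 8 lines, so the inverted word
# 0x00 is sent instead and only the invert line changes.
encoded, flag = bus_invert(0x00, 0xFF)
```

Such a scheme only counts transitions per line; it does not model the coupling between neighboring lines that the later work cited above addresses.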
Motivation and Focus
Deep sub-micron silicon designs of 0.10µ and beyond lead to a shift in optimization strategies for SOCs for several reasons:
1. due to the small feature sizes, inter-wire capacitances of, for instance, bus lines become dominating (compared to intrinsic bus line capacitances). Hence, the relative share of energy consumption of the SOC buses compared to all other components will increase by up to around 30% in future designs.
2. architectural optimizations like improved code placement techniques do not only increase the performance (as shown in [27]), they may also dramatically change the extent of the traffic on the buses involved. For example, an improved code placement technique might lead to a higher cache hit ratio and thus a) reduce the number of related bus transactions between the cache and the main memory, b) shift the bus traffic to the processor-to-cache bus instead and c) decrease the total number of related bus transactions.
This leads to interdependencies between techniques that were previously considered to be orthogonal. Now, with deep sub-micron designs emerging, code placement (and other, higher-level optimization techniques for SOCs) has a direct impact on the bus-related energy consumption and as such influences the energy consumption of the whole SOC. Previously, only the energy savings that come with the reduced execution time (improved performance) of an application with a more efficient code placement were considered relevant. Now, it does matter in which way a code placement technique implicitly shifts transactions from one bus to another, and it does matter how efficiently these buses make use of bus encoding schemes to reduce the energy consumption. In this context it also matters how long (physically) those buses are, which induces a direct relationship between a code placement technique and physical parameters. This work focuses on these relationships and presents, as the first approach, a quantification and optimization of interdependencies between:
a) code placement on the one side and processor-to-cache and cache-to-main-memory bus lengths on the other side
b) code placement and energy-saving bus encoding schemes
We will show that these interdependencies and their exploitation lead to a reduction of 50.4% on average (maximum of 95.7%) of the address bus energy consumption of a whole SOC.
Assumptions
The following assumptions hold for the approach introduced later in this section:
1) The systems considered are single-microprocessor systems with memory and an instruction cache, configured for a single application. This is quite common in embedded systems.
2) The size of a placed code block is no bigger than the size of the cache. This assumption is quite valid in embedded systems, where the basic blocks are usually small enough to fit into small cache sizes. If the task is too large for the cache, it is possible to break up the task into smaller granules such that each granule will fit into the cache.
3) Only Level-1 caches are available for use. Once again, in an embedded system, where frequently there is no cache at all, it is unlikely that more than a single level of cache is going to be available for use.
4) The caches are direct mapped. High-speed systems frequently use direct-mapped caches in order to speed up the system as much as possible. This assumption makes the analysis easier due to the deterministic mapping from memory to cache.
5) The problem is sufficiently large that the total size of the instructions (in bytes) is several times larger than the size of the cache. This is a reasonable assumption in a realistic system.
Allocation of Assembly Level Basic Blocks in Cache and Memory
This section details the code placement methodology. Note that the methodology looks at the code at the assembler level. The basic blocks here are blocks of assembler instructions which are executed together. The methodology contains an algorithm with two parts. The first part places basic blocks in the cache so that basic blocks with high frequency are swapped out as little as possible. The second part of the algorithm takes the placed basic blocks and maps them into main memory. The algorithm is performed as a preprocessing step, taking the application's original instructions in memory and re-mapping them to different locations. In order to re-map instructions, it was necessary to identify basic blocks. After the identification of basic blocks, we had to identify which blocks executed consecutively. We identified them by running the application through an instruction set simulator and finding blocks of instructions which were always executed together. The number of basic blocks within the applications under consideration varied from 100 to 900. A more comprehensive study of the approach is given in [27].
Part 1: Cache Allocation
The methodology used for ordering basic blocks in the cache is as follows. All the loops containing a particular basic block are grouped into a single super loop. Thus a loop will be a member of only one super loop. Each super loop's execution frequency ($f_{sl}$) is defined as the sum of the execution frequencies of its component loops. The super loops are ordered in descending order of frequency. The ordered super loop list is given as $sl_1, sl_2, \ldots, sl_p$.

For super loop $sl_1$, the basic blocks within it are taken in order (from highest to lowest frequency of execution of basic blocks, $f_b$) and these basic blocks (only whole basic blocks are allowed) are allocated to the cache from the lowest address to the highest until the cache is completely filled or there are no remaining basic blocks within that super loop.

Once this step is finished, and if there are any remaining basic blocks, we find the largest basic block among the remaining basic blocks of $sl_1$. This large basic block is allocated to the bottom of the cache, say with starting address $A_{ls}$, and ending at the end of the cache. After this we take the next largest basic block and allocate its starting address in the cache to $A_{ls}$. The ending address will be less than the final address of the cache. Then, if another unallocated basic block can be found which can go into the space (below the basic block we just allocated, and above the last cache address), we allocate that basic block into the available space. We keep doing this until we reach the end of the cache. We take the next largest unallocated basic block, start it at address $A_{ls}$, and repeat the process until all basic blocks are allocated. This is then repeated for all the other super loops in the ordered list.
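The first pass of the cache allocation described above (summing loop frequencies into a super loop frequency, then filling the cache with whole basic blocks in descending frequency order) can be sketched as follows. This is a minimal sketch under stated assumptions, not the algorithm of [27]: the `BasicBlock` type, the function names and the choice to stop at the first block that no longer fits are hypothetical, and the subsequent overflow placement starting at $A_{ls}$ is deliberately left out.

```python
from dataclasses import dataclass


@dataclass
class BasicBlock:
    name: str
    size: int       # size in bytes
    frequency: int  # execution frequency f_b


def super_loop_frequency(loop_frequencies):
    """f_sl: sum of the execution frequencies of the component loops."""
    return sum(loop_frequencies)


def first_pass_allocation(blocks, cache_size):
    """Fill the cache from address 0 upwards with whole basic blocks.

    Blocks are taken from highest to lowest frequency. Assumption: the
    pass stops at the first block that does not fit; every block from
    there on is returned as leftover, largest first, for the overflow
    step that this sketch does not implement.
    """
    ordered = sorted(blocks, key=lambda b: b.frequency, reverse=True)
    placed = {}
    leftover = []
    next_free = 0
    for index, block in enumerate(ordered):
        if next_free + block.size > cache_size:
            leftover = sorted(ordered[index:], key=lambda b: b.size, reverse=True)
            break
        placed[block.name] = (next_free, next_free + block.size - 1)
        next_free += block.size
    return placed, leftover


blocks = [BasicBlock("A", 40, 100), BasicBlock("B", 30, 50), BasicBlock("C", 50, 80)]
placed, leftover = first_pass_allocation(blocks, cache_size=100)
```

In this example the two most frequently executed blocks occupy cache addresses 0 to 89, and block B is handed to the overflow step.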
Part 2: Memory Allocation
The memory allocation part of the algorithm takes the basic blocks already placed in the cache and directly maps them to the memory. Figure 1 shows an example of how the basic blocks are mapped to memory from the cache. In this figure blocks 1 to 5 are mapped directly on to the memory, but block 6 is mapped to some memory locations further away, such that the mapped block will end up in the desired position in the cache. Thus, if a basic block is mapped to the location from $t_x$ to $e_x$ in the cache of the processor, then the basic block can be placed in memory in any one of the address ranges from $t_x + i \cdot S$ to $e_x + i \cdot S$, where $i$ is a positive integer and $S$ is the size of the cache. However, since basic blocks in the cache will wrap around the cache, an offset $Z_r$ can be added to each basic block allocated from super loop $sl_r$, and the basic block can be placed from memory location $t_x + Z_r + i \cdot S$ to memory location $e_x + Z_r + i \cdot S$. The introduction of this offset allows a reduction in the total memory size needed for the system.
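The relationship between a cache position and its admissible memory addresses can be illustrated with a short sketch. It assumes a direct-mapped cache in which a memory address maps to cache position `address % S`, allows $i = 0$ for blocks that are mapped directly, and leaves out the offset $Z_r$; the function names and the `first_free` parameter (the lowest memory address still unused) are hypothetical.

```python
def cache_index(address, cache_size):
    """Cache position of a memory address in a direct-mapped cache."""
    return address % cache_size


def memory_start(cache_start, cache_size, first_free):
    """Lowest address t_x + i * S that is not below first_free.

    cache_start is t_x, cache_size is S. Assumption: i may be 0.
    """
    if first_free <= cache_start:
        return cache_start
    # Smallest i with cache_start + i * cache_size >= first_free.
    i = (first_free - cache_start + cache_size - 1) // cache_size
    return cache_start + i * cache_size


# A block meant for cache position 100 in a 1024-byte cache, when memory
# is already occupied up to address 1124, is placed at 100 + 2 * 1024.
start = memory_start(cache_start=100, cache_size=1024, first_free=1125)
```

Whatever value of $i$ is chosen, the block lands at the same cache position, which is what allows a block to be moved further away in memory without disturbing the cache layout.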
Algorithm
Enhancing Code Placement's Efficiency through Adaption of Bus Coding
The code placement algorithm introduced in Section 2 reduces the number of cache misses. Hence, the traffic on the CPU-to-cache bus and the cache-to-main-memory bus is significantly reduced, leading to a higher performance of the whole system and a decrease of the interconnect energy dissipation. The aim of this section is to adapt a bus encoding scheme that amplifies these two effects even further and thus leads to an ultra-low bus power consumption combined with an effectively increased bus bandwidth that yields a higher system performance as well. Since we target sub-0.10µ technologies, it is also necessary to provide means for cross-talk reduction, since signal integrity is another major concern. In the following we introduce the bus encoding scheme to address these problems.
Buses in Deep-submicron Designs
The closer geometrical proximity of adjacent bus lines in sub-0.10µ technologies leads to effects that are almost negligible in technologies not as advanced as 0.10µ and below: two adjacent bus lines form a parasitic capacitance between them. This effect does not only lead to cross-talk and delay effects, it also leads to an increased power consumption, since the parasitic capacitance is charged and discharged when there is a voltage swing between two or more bus lines. Thus, each bus line's capacitance can be represented as

$$C_i = C_B + C_{C,left} + C_{C,right} \qquad (1)$$

where $C_B$ is the base (or intrinsic) capacitance (capacitance between bus line and metal layers) and $C_{C,left}$, $C_{C,right}$ are the left and right coupling capacitances between bus line $i$ and its left and right neighbor (if any), respectively. Table 2 shows the normalized coupling capacitance $C_C$ between a bus line $i$ and one of its neighbors $j$ according to the values the bus lines take at time $T_1 = t$ and $T_2 = t + 1$. Obviously, the coupling effect is highest when both lines are subject to a transition in the opposite direction. We have measured the absolute capacitances $C_B$ and $C_C$ for a 0.10µ technology: $C_B = 42.22$ pF/m and $C_C = 35.89$ pF/m. According to Equation 1 and the table above, the maximum capacitance for a bus line $i$ we can expect is:

$$C_{i,max} = C_B + 2 \cdot 2 \cdot C_C = 185.78\ \mathrm{pF/m} \qquad (2)$$
Compared to the case where the inter-wire capacitances are negligible (i.e. $C_{i,max} = C_B = 42.22$ pF/m), this is 4.4 times higher. This is why inter-wire effects have to be taken into consideration. There are several means to diminish or at least reduce the problem of inter-wire capacitances:
a) Widen the distance between bus lines: this is typically not preferred since the total area of the bus systems grows too large.
b) Use place-and-route (P&R) tools that avoid side-by-side routing of bus lines. This is what is actually done in the newest generation of P&R tools. However, the interconnect complexity of one-billion-transistor SOCs with multiple bus hierarchies and long buses with many cores connected to them will prevent a satisfying solution at a feasible routing time (complexity of the routing problem).
c) Change the geometrical shape of bus lines: the bus lines themselves can be re-shaped. For example, the cross-sectional shape can be made narrower such that the distance between two bus lines increases without sacrificing space for the whole bus. However, the main disadvantage of this approach is that the cross-sectional area of a bus line is fixed, since the current-per-area ratio is fixed for any given technology. That typically leads to solutions where the bus line is buried deeper into the substrate, with the height being larger than the width of a bus line. Even though the inter-wire capacitance decreases due to the increased distance between bus lines, it increases due to the increased flank area of two opposing bus lines. In conclusion: what is won through a wider distance has to be, at least partly, given up through the effect of larger flank areas.
d) Bus encoding techniques that take inter-wire capacitances into consideration when words are transmitted via a bus system.
Within this section, we focus on the latter technique, namely on finding a bus encoding technique that complements the code placement technique introduced earlier and thus leads to an ultra-low bus power consumption combined with an effectively increased bus bandwidth that yields a higher system performance.
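The worst-case figure quoted in this subsection can be checked with a few lines of arithmetic. The capacitance values are those measured above; the factor of 2 for a neighbor switching in the opposite direction is an assumption of this sketch (Table 2 is not reproduced here), chosen because it matches the stated ratio of 4.4. The names `line_capacitance` and `OPPOSITE_TRANSITION_FACTOR` are hypothetical.

```python
C_B = 42.22  # base capacitance in pF/m (measured value from the text)
C_C = 35.89  # coupling capacitance to one neighbor in pF/m (from the text)

# Assumed normalized coupling factor when a neighbor toggles in the
# opposite direction.
OPPOSITE_TRANSITION_FACTOR = 2


def line_capacitance(left_factor, right_factor):
    """Effective capacitance of one bus line, base plus both couplings."""
    return C_B + left_factor * C_C + right_factor * C_C


c_max = line_capacitance(OPPOSITE_TRANSITION_FACTOR, OPPOSITE_TRANSITION_FACTOR)
ratio = c_max / C_B
print(f"C_max = {c_max:.2f} pF/m, ratio to C_B = {ratio:.1f}")
```

Under this assumption a line whose two neighbors both switch against it presents roughly 4.4 times its intrinsic capacitance, which is the motivation for an encoding that looks at neighboring bits rather than at single-line transitions only.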
Reducing Power and Increasing Signal Integrity
In the following we introduce an encoding method that solves the problems discussed in sub-section 3.1 and that serves as an enhancement to further optimize the advantages achieved through the code placement techniques from Section 2. Let us first define what we call a window:
with $l$, $h$ being the lower and upper border bit positions of the window, respectively, $ww$ the window size in bits and $bw$ the bus size in bits. Now, let us define what we call the Transition Activity TA for a window $w_{l,h}(ww)$. In order to make the formula easier to read we simply use $w$ to denote the window. Furthermore, let us assume that $b_x$ is the $x$-th bit within a window, with $B_x$ being the value of that bit (i.e. $B_x \in \{0, 1\}$). Thus, we can define the TA measure as follows:

Thereby $B_i^{-1}$ gives the value of bit $b_i$ at time $t-1$, i.e. its temporally preceding value. Thus, $B_i \oplus B_i^{-1}$ determines whether bit $b_i$ has a high/low or low/high transition (=1) or not (=0). Accordingly, this specific bit will contribute to the TA measure or not. Figure 2 gives an idea of how TA is measured using an example. There, the portion of the TA measure contributed by $i = a+1$ is demonstrated. The dotted line shows the scope that is important for the calculation of the respective TA portion. It equals 2. It is very important to note that TA as shown does NOT violate the causality principle, as it might seem from Figure 2. This is because the bus word referring to time $T = t - 1$ is stored in a register. But even the bus word for time $t$ is stored in a register, since the word is not yet put on the bus (it is just in the I/O register of a device), and thus TA does work as intended by Equation 4. According to Equation 4, every bit other than the bit under review contributes 1 or 0 to the value of TA, depending on whether it is different in value or not. That each contribution is equally sized (1 or 0, with no other values allowed) is justified by our capacitance measurements, which give us values of the base capacitance and of the coupling capacitances of the closest neighbors (a maximum of three left or right neighbors in a 4-bit window) that are approximately the same and thus contribute the same to the power/energy consumption. Window sizes larger than 4 bits yielded lower energy savings, since such a model would assume that inter-wire capacitances reach far beyond the closest neighbor (which is actually not the case). Window sizes of less than 4 bits, on the other hand, might be more beneficial in terms of power savings (3 would be ideal since it exactly reflects the physical relationship of adjacent bus lines), but the additional hardware effort cannot be justified.
In the next step we use the TA as a measure to determine whether we should invert the information in the window or not. Please note that the TA scheme is able to measure the impact of coupling capacitances. A Hamming distance measure, as used for regular invert schemes, would not lead to a reasonable improvement in power/energy consumption. It would only reduce the number of transitions. But the number of transitions does not necessarily reflect the amount of power/energy that is consumed. Our whole scheme works according to the following procedure:

Strategy of the Scheme
1) For All windows wi
...
7) If hi_ta > (#windows)/2
8) Then
9) For All windows wi ∈ W
10) invert(wi)
11) done.

For all windows the TA measure is calculated (lines 1-2). If the measure exceeds half of the maximum value (dependent on the window size $ww$), then it is counted (lines 3-5). After all TA measures are calculated, it is determined whether more than half of the windows have a high TA value (Equation 4). If that is the case, the information in the windows is transmitted inverted. Please note that decoding can be done inversely. Only 1 extra bit line is used for that, since all windows will be inverted or not (majority vote). Also, note that this code explains only the strategy. It does not in any way reflect the implementation, which, of course, is in hardware.
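The majority-vote strategy described above can be made concrete with a small software sketch. It is meant only to illustrate the control flow: the `transition_activity` function below is a hypothetical stand-in and not the paper's Equation 4 (each toggling bit contributes 1 plus the number of other bits in its window that hold a different value), and all names, the 8-bit bus and the 4-bit window are assumptions.

```python
def window_bits(word, low, ww):
    """Bits of one window, least significant bit first."""
    return [(word >> (low + k)) & 1 for k in range(ww)]


def transition_activity(prev_bits, curr_bits):
    """Stand-in TA measure (assumption, not Equation 4)."""
    ta = 0
    for i, (p, c) in enumerate(zip(prev_bits, curr_bits)):
        if p ^ c:  # bit i toggles between time t-1 and t
            differing = sum(
                1 for j, other in enumerate(curr_bits) if j != i and other != c
            )
            ta += 1 + differing
    return ta


def max_transition_activity(ww):
    """Largest value the stand-in TA can take for a window of ww bits."""
    return ww + 2 * (ww // 2) * (ww - ww // 2)


def encode(prev_bus, word, bw=8, ww=4):
    """Return (bus_word, invert_line) following the majority-vote strategy."""
    n_windows = bw // ww
    ta_max = max_transition_activity(ww)
    hi_ta = 0
    for n in range(n_windows):
        prev_w = window_bits(prev_bus, n * ww, ww)
        curr_w = window_bits(word, n * ww, ww)
        if transition_activity(prev_w, curr_w) > ta_max / 2:
            hi_ta += 1  # this window has a high TA value
    if hi_ta > n_windows / 2:
        return (~word) & ((1 << bw) - 1), 1  # invert all windows
    return word, 0


def decode(bus_word, invert_line, bw=8):
    """Undo the inversion using the single extra invert line."""
    return (~bus_word) & ((1 << bw) - 1) if invert_line else bus_word


bus_word, invert_line = encode(prev_bus=0xAA, word=0x55)
```

In this example every line would toggle against its neighbors, both windows are counted as high-TA, and the word is therefore sent inverted, leaving the data lines unchanged. A different TA definition can be substituted without touching the voting logic.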
Hardware related issues
The designed bus encoding interface, including an encode/decode pair, uses approximately 400 gates; it does not incur an additional clock cycle; the critical path is between 2 and 3 ns. The whole encoding scheme has been designed with signal integrity in mind, since this is another major issue in sub-0.10µ designs. As explained earlier, the scheme aims to minimize the switching activities within a certain window, as the TA measure (Equation 4) shows. That means that the probability of switching within a window is reduced, which in turn reduces the probability of violating the signal integrity through, for example, crosstalk between two adjacent bus lines. Bus lines located at the border of a window may still interfere with bus lines located at the border of adjacent windows. But note that, due to the scheme, this effect is not any larger than in the non-encoded case. Minimizing the remaining border-to-border effects could be achieved by increasing the window size. However, this would decrease the efficiency of the encoding scheme and is thus contrary to the low power goal.

Figure 4: The whole set-up for power and performance estimation, including the focus of this work, i.e. code placement strategy adapted to bus encoding and bus length determination.
Validation Environment
We explain the experiments in Section 5, but in prelude to it we briefly introduce our validation environment. It is the main goal of this work to show the efficiency of combining code placement and bus coding for an ultra-low power bus/interconnect for an SOC. These two methodologies are highlighted in a dashed box in Fig. 4 . The bus lengths of the involved buses (i.e. buses between instruction cache and main memory and instruction cache and CPU) are crucial parameters for the power consumption. The lengths are determined by the results of the core placement (memory, cache and CPU). The results of the code placement and the bus coding scheme are fed into the power models of instruction cache and the bus system, respectively. Further power models in the environment are a CPU power model, a data cache power model and a main memory power model. All models plus the code placement and bus encoding mechanisms are fed by instruction traces through the "QPT"/"Dinero" tool set sequence [6] . The output of the environment is power and performance data. For more detailed information please refer to [14] .
Experiments and Results
The target system the experiments were conducted on is shown in Fig. 5: it shows a chip layout with the interesting parts magnified: the CPU, the instruction cache ("I$"), and the main memory banks. The buses that are affected by our code placement and bus encoding methodologies are buses "Bus1" and "Bus2". The length and/or the ratio of the lengths of these buses varies with the placement and relative size of all cores comprised within this SOC. Hence, the bus power consumption will not only depend on our methodologies (see Sections 2 and 3) but also on the geometrical characteristics of "Bus1" and "Bus2". This is one of the parameters that will be investigated in this section. The experiments were conducted with the evaluation environment shown in Fig. 4. Here are the main steps:
1) Placing the instruction code according to Section 2.
2) Generating traces for the new code allocation.
3) Running the traces through the bus encoding scheme (Section 3).
4) Measuring power and performance with the evaluation environment.
5) Varying instruction cache sizes.
6) Varying the ratios of bus lengths "Bus1" to "Bus2" (see Fig. 5) according to different placement scenarios of the affected cores.
7) Repeating steps 1)-4) for all applied combinations of parameters.

Figure 5: Chip layout and buses "Bus1" and "Bus2" that are subject to extension/contraction according to the placement of the involved cores.
We performed experiments on a set of five applications. The applications have been chosen with as much variety in characteristics as possible in order to show the wide application area of the methodology. Thus, the applications varied in size (8k to 200k), application area (video, animation, algorithmic etc.) and application domain (data dominated or control dominated). The applications used were: a complete MPEG-II video encoder mpeg, a video trick animation algorithm trick1, the Whetstone benchmark sequences whetston, the Unix command compress compress, and a chromakey video mixer that is part of digital video studio equipment.
The results achieved are shown in Table 5. The first column gives the application name and the number of instructions executed for that application. The second column gives the cache sizes. The third column gives the cache miss rates before code placement and the fourth column gives the miss rates after code placement. The next five columns (columns 5-9) are results which have been obtained by simulating with bus lengths of 0.2 mm for the CPU-to-cache bus and 3.8 mm for the cache-to-memory bus. The fifth column gives the energy expended for a system without optimization. Column six shows the energy expended with address coding only, the seventh column shows the energy consumption in the buses if only code placement is performed, and column eight shows the energy consumption when both the address coding and code placement methodologies are applied. Column nine shows the percentage improvement of column eight over column five.

Figure 6: The graph of Max Savings
As can be seen from the figures, the energy savings of the bus system are quite substantial. The maximum savings achieved for each of the applications with differing bus lengths are given in Figure 6. Even though the energy consumption of the buses using current mainstream silicon technologies (i.e. 0.18µ, 0.13µ) is only around 5% to 20% of the whole SOC, it is anticipated that in sub-0.10µ technologies this portion rises to 20% to 30% [28]. On average, for the applications investigated in our experiments, the energy consumption of the bus system is reduced by 54.6% for the 0.2/3.8 mm ("Bus1" / "Bus2") bus case, 50.4% for the 0.5/3.5 mm bus case and 46.6% for the 0.8/3.2 mm bus case.
Another factor is the relative length of the buses: at some point the relative lengths of the buses become important. If the two buses were of equal length, then the energy consumption might even have increased. At that point the designer will have to decide upon the relative merits of energy saving, crosstalk reduction and performance. Typically, "Bus1" will be much shorter than "Bus2", since this is the preferable outcome of place & route according to the sizes of the involved cores (instruction cache, main memory banks, CPU). Our methodology favors this tendency, as we can observe that results are best for smaller "Bus1" to "Bus2" ratios for reasonable cache miss rates. As the cache miss rate increases beyond 50%, the energy dissipation does not always favor smaller bus ratios. Note that the best results were achieved for "feasible" instruction cache sizes, i.e. cache sizes that are neither too small nor so large that the entire code fits in the instruction cache anyway. These are design points chosen by designers as the best compromise between effort (i.e. costs) and results (e.g. performance, power). In those cases (highlighted in bold) we can always achieve a large drop in cache misses. The implications are a higher performance and a lower energy consumption. Also, the performance of the entire system increases substantially because of the reduction in cache misses (due to the code placement strategy), as can be seen from columns 3 and 4. Thus we can state that the two methods should be used in conjunction to produce ultra-low power systems with increased performance and reduced crosstalk. For certain applications it can be seen that address coding alone will produce superior results to the combined scheme (see Compress with a cache size of 64 in Table 2).
This is due to the fact that the code-placed program has a number of jump instructions whose targets are further away than in the non-code-placed program, which increases the bus activity. Note that for this application there was little reduction in cache misses. We mentioned earlier that signal integrity has been a concern in this work, even though performance and power consumption were the main goals:

Table 3: Table of results
a) The code placement method leads to decreased traffic on the CPU-to-cache bus and the cache-to-main-memory bus. This reduced traffic is not traded against increased traffic elsewhere. Consequently, there is a lower vulnerability to crosstalk simply through the minimized traffic on the buses.
b) In a second step, the data on these buses is encoded to increase signal integrity (see Section 3).
Conclusions
In this work we have exploited the interdependencies between a high-level optimization technique, namely code placement, and a lower-level optimization technique, namely bus encoding. It could be shown that these techniques, previously handled as orthogonal, are in fact interdependent due to the increasing influence of deep sub-micron effects. As a result we have achieved much higher interconnect energy savings than either of these methods can achieve when applied alone (according to related research): the average SOC interconnect energy savings are 50% with a maximum of 95.7%. The performance improvements are shown by large reductions in cache misses. We have validated the results by means of real-world SOC applications that range in size between 8k and 200k lines of code. As an added benefit, the probability of crosstalk effects is reduced by both code placement and bus encoding techniques.
