We analyze an Alpha 21264-like Globally-Asynchronous, Locally-Synchronous (GALS) processor organized as a Multiple Clock Domain (MCD) microarchitecture and identify the architectural features of the processor that influence the limited performance degradation measured. We show that the out-of-order superscalar execution features of a processor, which allow traditional instruction execution latency to be hidden, are the same features that reduce the performance degradation impact of the synchronization costs of an MCD processor. In the case of our Alpha 21264-like processor, up to 94% of the MCD synchronization delays are hidden and do not impact overall performance. In addition, we show that by adding out-of-order superscalar execution capabilities to a simpler microarchitecture, such as an Intel StrongARM-like processor, as much as 62% of the performance degradation caused by synchronization delays can be eliminated.
Introduction
Globally Asynchronous, Locally Synchronous (GALS) designs are an intermediate approach between fully asynchronous and fully synchronous clocking styles. A GALS design has the advantage that it eliminates the timing and cost overhead of global clock distribution while maintaining a synchronous design style within each clock domain. One such GALS processor approach, which we call MCD (Multiple Clock Domain), provides the capability of independently configuring each domain to execute at frequency/voltage settings at or below the maximum values [20]. This allows domains that are not executing operations critical to performance to be configured at a lower frequency, and consequently, an MCD processor has the advantage that energy can be saved [14, 19]. However, an MCD processor has the disadvantage that inter-domain communication may incur a synchronization penalty resulting in performance degradation.
In this paper, we analyze the MCD microarchitecture and describe how the processor architecture influences the performance degradation due to synchronization. We describe the [...]ture that served to decouple different pipeline functions, or ii) there was relatively little inter-function communication.
The baseline microarchitecture upon which our MCD processor is built is a four-way dynamic superscalar processor with speculative execution similar in organization to the Compaq Alpha 21264 [11, 12, 13].
The MCD microarchitecture is characterized by a number of parameters, which define the unique features of the architecture. The values chosen represent present and near-term state-of-the-art circuit capabilities. These MCD-specific parameters are summarized in Table 1 and described in detail in the following sections.
Domain Clocking Style
We investigated two models for dynamic voltage and frequency scaling: an XScale model and a Transmeta model, both of which are based on published information from their respective companies [8, 9]. For both of these models, we assume that the frequency change can be initiated immediately when transitioning to a lower frequency and voltage, while the desired voltage must be reached first before increasing frequency. For the Transmeta model, we assume a total of 32 separate voltage steps, at 28.6mV intervals, with a voltage adjustment time of 20µs per step. Frequency changes require the PLL to re-lock. Until it does, the domain remains idle. The PLL locking circuit is assumed to require a lock time that is normally distributed with a mean time of 15µs and a range of 10-20µs. For the XScale model, we assume that frequency changes occur as soon as the voltage changes and circuits operate through the change.
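As a rough illustration of the Transmeta-style model, the sketch below estimates the full voltage range and the time needed to raise frequency across it. The helper name, the use of the mean PLL lock time, and the simple addition of ramp time and lock time are assumptions made for illustration; they are not taken from the simulator.

```python
# Parameters taken from the Transmeta-style model described in the text.
VOLTAGE_STEPS = 32        # number of discrete voltage steps
STEP_MV = 28.6            # voltage change per step, in mV
STEP_TIME_US = 20.0       # voltage adjustment time per step, in microseconds
PLL_LOCK_MEAN_US = 15.0   # mean PLL re-lock time, in microseconds


def transmeta_raise_time_us(steps: int, pll_lock_us: float = PLL_LOCK_MEAN_US) -> float:
    """Hypothetical helper: time to raise frequency by `steps` voltage steps.

    Assumption: the voltage ramp completes first, then the PLL re-locks
    while the domain sits idle, so the two times are simply added.
    """
    if not 0 <= steps <= VOLTAGE_STEPS:
        raise ValueError("steps out of range")
    return steps * STEP_TIME_US + pll_lock_us


full_swing_mv = VOLTAGE_STEPS * STEP_MV
full_raise_us = transmeta_raise_time_us(VOLTAGE_STEPS)
print(f"full voltage swing: {full_swing_mv:.1f} mV")
print(f"full-range raise:   {full_raise_us:.0f} us")
```

Under these assumptions, the PLL re-lock alone idles a domain for thousands of 1.0GHz cycles, which is consistent with the cascading stalls described for the Transmeta model.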
Although we investigated both models, it became clear early on in the investigation that the Transmeta model is not appropriate for an MCD microarchitecture. The reason for this is that the Transmeta model requires the PLL to re-lock after each frequency change, stopping the domain during that time. With the tight interaction between the domains, the suspension of one domain quickly cascades to other domains. The end result is that whenever any domain frequency is changed in the Transmeta model, all domains are stalled for nearly all of the time required to re-lock the PLL. As one can imagine, this has a profound impact on overall performance. For these reasons the Transmeta model was not investigated further and is not included in any of the analysis that follows.
Clock Design
In this work we assume a system clock of 1.0GHz, derived from an external 100MHz source using on-chip PLLs, where all domain frequencies are 1.0GHz but all are independent. These PLLs produce jitter. Moreover, since the external clock is derived from a lower frequency crystal using an external PLL, the external clock will also have jitter, and these jitters are additive, i.e., Jitter_Total = Jitter_Crystal + Jitter_External + Jitter_Internal. Jitter_Crystal can be expected to be extremely small and due entirely to thermal changes; we assume a value of zero.
Jitter_External is governed by the quality and design of the PLL chip used. A survey of available ICs reveals that most devices are specified as 100ps jitter; we use this number in our study. Jitter_Internal is also governed by the quality and design of the PLL. We assume a circuit of the same caliber as the external PLL, which implies a worst case jitter of 10ps (comparable error on a 10x shorter cycle). Since clock jitter varies over time with a long-term average of zero, |Jitter_Total| ≤ 110ps (≤ 11% of a 1.0GHz clock period).
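The jitter budget can be checked with a few lines of arithmetic. This is a sketch that only restates the values given in the text; the variable names are illustrative.

```python
CLOCK_HZ = 1.0e9             # 1.0GHz domain clock
JITTER_CRYSTAL_PS = 0.0      # crystal jitter, assumed zero in the text
JITTER_EXTERNAL_PS = 100.0   # external PLL jitter
JITTER_INTERNAL_PS = 10.0    # on-chip PLL worst-case jitter

period_ps = 1e12 / CLOCK_HZ
jitter_total_ps = JITTER_CRYSTAL_PS + JITTER_EXTERNAL_PS + JITTER_INTERNAL_PS
jitter_fraction = jitter_total_ps / period_ps

print(f"clock period: {period_ps:.0f} ps")
print(f"total jitter: {jitter_total_ps:.0f} ps ({jitter_fraction:.0%} of the period)")
```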
Starting with the clock design of the Alpha 21264 [1], we derive an MCD clock design by dividing the clock grid into regions corresponding to the MCD domain partitions. Each domain requires a separate PLL and clock grid driver circuit, all of which are fed by the 1.0GHz clock source. Although each of these domain PLLs derives its timing from a common source, we do not assume any phase relationship between the PLL outputs. We do this because we assume that the skew requirements for the 1.0GHz clock have been relaxed and we cannot guarantee that the clocks are in phase when they arrive at the domain PLLs.
Domain Interface Circuits
Fundamental to the operation of an MCD processor is the inter-domain communication. There are two types of communication that must be modeled: FIFO queue structures and issue queue structures. FIFO queues have the advantage that the inter-domain synchronization penalty can be hidden whenever the FIFO is neither full nor empty. The mechanism by which this is achieved and the precise definitions of Full and Empty are described in Section 4.1 and are similar to mixed clock FIFOs proposed by others [3, 5, 6, 7, 21]. The disadvantage of FIFO queue structures is that they can only be used where strict First-In-First-Out queue organization is applicable. Furthermore, if the communication is such that the FIFO is almost always empty (or full), the bulk of the synchronization penalty will not be hidden. The issue queue structure is very similar to the FIFO queue except that the inter-domain synchronization penalty must be assessed as each entry is put into the queue rather than only if the FIFO is empty or full. This is necessary since with an issue queue structure, the entries are consumed in an undefined order. Therefore, it is important to accurately model precisely when an entry becomes "visible" to the consumer.
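The difference between the two structures can be summarized as a rule for when the synchronization penalty is assessed. The following is a minimal sketch of that rule; the names `QueueKind` and `sync_penalty_assessed` are hypothetical, and the early Full/Empty indication needed for mismatched clock frequencies is ignored here.

```python
from enum import Enum


class QueueKind(Enum):
    FIFO = "fifo"
    ISSUE = "issue"


def sync_penalty_assessed(kind: QueueKind, occupancy: int, capacity: int) -> bool:
    """Return True if a queue access pays the synchronization penalty.

    Hypothetical helper: FIFO accesses pay only when the queue is empty or
    full; issue-queue writes always pay, because entries can be consumed in
    any order and each one must become visible to the consumer.
    """
    if not 0 <= occupancy <= capacity:
        raise ValueError("occupancy out of range")
    if kind is QueueKind.ISSUE:
        return True
    return occupancy == 0 or occupancy == capacity


print(sync_penalty_assessed(QueueKind.FIFO, 2, 4))   # partially full FIFO
print(sync_penalty_assessed(QueueKind.FIFO, 0, 4))   # empty FIFO
print(sync_penalty_assessed(QueueKind.ISSUE, 2, 4))  # issue queue
```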
FIFO Queue Structures
Figure 2. FIFO queue interfaces.
What follows is an analysis of the timing and characteristics of queues used to buffer control and data signals between different clock domains. Fundamental to this analysis are the following assumptions:
• Each interface is synchronous; i.e., information is always read and written on the rising edge of the respective clock.
• Status signals which emanate from the interface are generated synchronously to the respective interface clock.
• Restrictions regarding the ratio of the maximum to minimum frequency are necessary but no other assumptions are made regarding the relative frequency or phase of the interface clocks.
The general queue structure that we use for inter-domain communication is shown in Figure 2 and is the same as in [5], with the following minor extensions. First, the queue design is such that the Full and Empty flags always reflect what the state of the queue will be after subsequent read/write cycles. In other words, the Full and Empty flags need to be generated prior to the queue actually being full or empty, using conditions that take into account the differences in the clock speed and potential skew of the two domains. The worst-case situation occurs when the producer is operating at the maximum frequency and the consumer at the minimum. There are several possible approaches to handling this problem. In this paper, we assume additional queue entries to absorb writes from the producer in order to recognize the potential delay of actually determining that the queue is full. In other words, the Full signal is generated using the condition that the queue length is within F_max/F_min + 1 of the maximum queue size, where F_max/F_min is the ratio of the maximum to minimum frequency. Note that our results do not account for the potential performance advantage of these additional entries.
Second, the synchronization time of the clock arbitration circuit, Ts, represents the minimum time required between the source and destination clocks in order for the signal to be successfully latched and seen at the destination. It is this Ts which defines the conditions under which data which has been written into a queue is prevented from being read by the destination domain. Although the logical structure of our queues is similar to [5], we assume the arbitration and synchronization overhead described in [22] to define Ts, i.e., we assume a Ts of 30% of the period of the highest frequency.
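The early Full condition and the value of Ts can be sketched as follows. This is an illustration under stated assumptions: the margin is taken to be the ratio of maximum to minimum frequency plus one (rounded up to whole entries), the 250MHz minimum frequency and the 16-entry queue are hypothetical values chosen for the example, and the function names are not from the simulator.

```python
import math


def full_margin(f_max_hz: float, f_min_hz: float) -> int:
    """Extra entries reserved so that Full is raised early (assumed rule)."""
    return math.ceil(f_max_hz / f_min_hz) + 1


def full_flag(occupancy: int, capacity: int, f_max_hz: float, f_min_hz: float) -> bool:
    """Full is asserted once the queue length is within the margin of capacity."""
    return occupancy >= capacity - full_margin(f_max_hz, f_min_hz)


def sync_time_s(f_max_hz: float, fraction: float = 0.30) -> float:
    """Ts as a fraction (30% in the text) of the highest-frequency period."""
    return fraction / f_max_hz


F_MAX_HZ = 1.0e9     # maximum domain frequency from the text
F_MIN_HZ = 250.0e6   # hypothetical minimum frequency, for illustration only

print("margin entries:", full_margin(F_MAX_HZ, F_MIN_HZ))
print("Full at 11 of 16:", full_flag(11, 16, F_MAX_HZ, F_MIN_HZ))
print("Ts in ps:", sync_time_s(F_MAX_HZ) * 1e12)
```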
Even with completely independent clocks for each interface, the queue structure is able to operate at full speed for both reading and writing under certain conditions. A logic-level schematic of the FIFO queue structure is shown in Figure 3 for a 4-entry queue. Note that writing to and reading from the structure are independent operations, with synchronization occurring in the generation of the Full and Empty status signals, which are based on the Valid bits associated with each queue entry. Although the combinatorial logic associated with the generation of Full and Empty has been simplified in this figure for demonstration purposes (the logic shown does not take into account early status indication required for worst-case frequency differences), the synchronization circuits shown are as required in all cases. That is, the Valid bits for the queue entries must be synchronized to both the read and write clock domains.
[...] generated too closely to the falling edge of the read clock (cycle R0) for it to be synchronized during that cycle (i.e., T1 < Ts); therefore, Empty is generated one cycle later. The next critical timing occurs as the fourth entry is written and the queue becomes full. In this case, Valid[3] is generated on the rising edge of the write clock (cycle W4) and synchronization is complete before the corresponding falling edge of the write clock (i.e., T2 > Ts). Note that in this case, the synchronization will always be complete before the falling edge because both the originating event (Valid[3] [...] (i.e., T4 > Ts). This sample timing diagram illustrates how the synchronization parameter, Ts, manifests itself as read and/or write cycle penalties on respective clock domain interfaces.
Issue Queue Structures
Many of the queues that we use as synchronization points have a different interface than that described above. With an issue queue, each entry has Valid and Ready flags that the scheduler uses to determine if an entry should be read (issued). By design, the scheduler will never issue more than the number of valid and ready entries in the queue. Note, however, that due to synchronization, there may be a delay before the scheduler sees newly written queue data.
The issue queue structure design follows directly from the FIFO design but must be modified in an important way. The ability to hide the synchronization penalty when the FIFO is neither full nor empty does not exist for issue queue type structures. The reason for this is that the information put into the queue is needed on the output of the queue as soon as possible (because the order in which entries are put into the queue does not dictate the order in which they may be removed).
Synchronization With Queues
The delay associated with crossing a clock domain interface is a function of the following:
• The synchronization time of the clock arbitration circuit, Ts, which represents the minimum time required between the source and destination clocks in order for the signal to be successfully latched at the destination. We assume the arbitration and synchronization circuits developed by Sjogren and Myers [22] that detect whether the source and destination clock edges are sufficiently far apart (at minimum, Ts) such that a source-generated signal can be successfully clocked at the destination.
• The ratio of the frequencies of the interface clocks.
• The relative phases of the interface clocks.
This delay can best be understood by examining a timing diagram of the two clocks (Figure 5). Without loss of generality, the following discussion assumes F1 is the source and F2 is the destination domain. Consider the case when the queue is initially empty. Data is written into the queue on the rising edge of F1 (edge 1). Data can be read out of the queue as early as the second rising edge of F2 (edge 3), if and only if T > Ts, i.e., Empty has become false on the F2 interface before the next rising edge of F2 (edge 2). This two-cycle delay is necessary for the reading interface to recognize that the queue is non-empty and enter a state where data can be read from the queue. Therefore, the delay would be seen by F2, and would be two clock cycles when T > Ts (one rising edge to recognize Empty and another rising edge to begin reading) and three cycles when T ≤ Ts (the additional rising edge occurring during the interface metastable region). The value of T is determined by the relative frequency and phases of F1 and F2 and may change over time. The cost of synchronization is controlled by the relationship between T and Ts.
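The two-cycle versus three-cycle rule for an initially empty queue can be written as a small function of the clock edges. This is a sketch only: times are integers in picoseconds, the function name is hypothetical, and treating an F2 edge that coincides exactly with the write edge as not being the "next" edge is an assumption.

```python
def empty_queue_read_delay(write_edge_ps: int, f2_phase_ps: int,
                           f2_period_ps: int, ts_ps: int) -> int:
    """Delay, in F2 cycles, before data written to an empty queue can be read.

    T is the time from the F1 write edge (edge 1) to the next F2 rising
    edge (edge 2). The delay is two cycles when T > Ts, otherwise three.
    """
    # Index of the first F2 rising edge strictly after the write edge.
    k = (write_edge_ps - f2_phase_ps) // f2_period_ps + 1
    edge2_ps = f2_phase_ps + k * f2_period_ps
    t_ps = edge2_ps - write_edge_ps
    return 2 if t_ps > ts_ps else 3


# 1.0GHz destination clock (1000ps period) and Ts = 30% of that period.
for write_ps in (500, 700, 800):
    print(write_ps, empty_queue_read_delay(write_ps, 0, 1000, 300))
```

With a 1000ps period and Ts of 300ps, a write 500ps before the next F2 edge sees two cycles, while writes 300ps or 200ps before it see three.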
An optimal design would minimize Ts in order to allow wide frequency and/or phase variations between F1 and F2 and increase the probability of a two-cycle delay. Alternatively, controlling the relative frequencies and phases of F1 and F2 would allow a two-cycle delay to be guaranteed. Note that this analysis assumes Ts < 1/F1 and Ts < 1/F2. The analogous situation exists when the queue is Full, replacing Empty with Full, edge 1 with edge 2, and edge 3 with edge 4, in the above discussion.
Figure 6 shows a transistor-level schematic of a synchronization circuit adapted from [17]. The most salient characteristic of this circuit (and others like it [18]) is that it is guaranteed to be glitch-free. This is the case because the internal signals R0 and R1 are integral voltages (note that the voltage source for this stage is not Vdd but rather the output nodes of the previous stage). Because these signals are integrals, they are guaranteed to be monotonically changing signals. Therefore, R0 and R1 comprise the synchronized input in dual-rail logic. These signals are then used to drive the last stage of the synchronizer, which is a simple RS latch, to produce a single-rail logic output. The circuit synchronizes the DataIn signal on the falling edge of the Clock, producing a guaranteed glitch-free and synchronized signal on DataOut (recall that the Valid[n] signals in Figure 3 are synchronized on the falling edges of the clocks).
Figure 7 shows the results of SPICE simulation of the synchronization circuit of Figure 6 with a number of falling clock edges coinciding with transitions of the input data signal. Figure 8 shows one such transition at 4.8nsec in detail, where it can be seen that the output signal is glitch-free and monotonic. The circuit was simulated using level-8 0.25µm TSMC SCN025 transistor models. This synchronizer circuit provides the necessary component to the synchronizing FIFO of Figure 3.
With the data stored in the FIFO regardless of the synchronizer outcome, and the Valid signals in the FIFO properly synchronized to both domain clocks, the effect is that data is possibly delayed by one clock in the receiving domain (since the receiver cannot 'see' the data until the Valid signal has been synchronized).
Since the queue structure proposed does not represent a significant departure from queues already in use in modern microprocessors, we do not expect that the size of these queues would be appreciably impacted by the changes required for synchronization. For the circuit shown in Figure 6, 36 transistors are required per bit of synchronization. Although this is not inconsequential, it is also not a significant increase in the overall size requirements of the queue. Notice that the number of synchronization circuits required is a function of the depth of the queue and not the width (Figure 3, one Valid bit per data word). Others [5, 6] have demonstrated that the addition of clock synchronization/arbitration circuits to the FIFO structure results in a negligible impact on the total structure area.
MCD Synchronization Points
The manner in which the processor is partitioned as part of transforming it into an MCD processor is of critical importance since the performance degradation that will be imposed as a result of the inter-domain communication penalties directly follows from the processor partitions.
The following list identifies all modeled inter-domain communication channels. These communication channels were determined by analyzing the microarchitecture of the MCD partitioning and are a direct consequence of that partitioning. For each communication channel, we discuss the data being transferred between domains and classify the synchronization mechanism required to ensure reliable communication. In addition, each channel is identified as using an existing queue structure or as requiring the addition of the queue entirely due to inter-domain synchronization. Each communication channel is identified by number in Figure 1.
Figure 8. Synchronization timing SPICE results (4-7nsec).
Communication Channel 1
Information: L1 Cache Line
Transfer: L1 Instruction Cache ⇐ L2 Unified Cache
Domains: Fetch/Dispatch ⇐ Load/Store
Reason: L1 Instruction Cache Miss
Synchronization Type: FIFO Queue Structure
Discussion: New queue structure. When a level-1 instruction cache miss occurs, a cache line must be brought in from the level-2 cache. The fetch/dispatch domain only initiates an L2 cache request when there is an L1 I-cache miss, and since the front end must stall until that request is satisfied, we know that there can only be one such outstanding request at a time. Therefore, any inter-domain penalty that is to be assessed cannot be hidden behind a partially full FIFO. Hence, the synchronization penalty must be determined based solely on the relative frequency and phase of the domain clocks for each communication transaction. Although communication channels such as this one, which can only contain one element at a time, would be implemented as a single synchronization register, we prefer to maintain the generalized FIFO model since the performance characteristics of both are identical.
[...] bidirectional, is composed of two independent unidirectional channels. Since they are similar and related, they will be addressed together. Notice also that the reason for this communication can be either an L2 cache miss due to a data reference, or an L2 cache miss due to an instruction fetch. The characteristics of this interface are the same as the previous description with the exception that multiple outstanding requests to the L2 unified cache are supported. That fact alone makes it possible that some of the inter-domain synchronization penalties can be hidden. As it turns out, there are few times when there are enough outstanding requests such that the penalty would actually be hidden. For our MCD processor, there would have to be more than 5 outstanding requests (see Section 4.1) to ensure that the penalty did not impact memory access time. Fortunately, the inter-domain penalty is only one cycle. Given that the memory access time for the first L2 cache line is 80 cycles (subsequent, contiguous lines have an access time of 2 cycles), the inter-domain synchronization penalty alone is not likely to have an appreciable impact (the penalty represents an increase in memory access time of 1.06-1.25%, based on 8-word cache line fill transactions).
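The quoted 1.06-1.25% range can be reproduced with a short calculation. This sketch reflects one plausible reading of the text, namely that the one-cycle penalty is compared against a fill of between one and eight contiguous transfers (80 cycles for the first, 2 cycles for each subsequent one); the function name is hypothetical.

```python
FIRST_ACCESS_CYCLES = 80   # access time for the first transfer
NEXT_ACCESS_CYCLES = 2     # access time for each subsequent, contiguous transfer
SYNC_PENALTY_CYCLES = 1    # inter-domain synchronization penalty


def penalty_percent(transfers: int) -> float:
    """Synchronization penalty as a percentage of the fill's access time."""
    if transfers < 1:
        raise ValueError("at least one transfer is required")
    access_cycles = FIRST_ACCESS_CYCLES + (transfers - 1) * NEXT_ACCESS_CYCLES
    return 100.0 * SYNC_PENALTY_CYCLES / access_cycles


print(f"1 transfer:  {penalty_percent(1):.2f}%")
print(f"8 transfers: {penalty_percent(8):.2f}%")
```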
Communication Channel 3
Information: Branch Prediction Outcome
Transfer: Branch Predictor ⇐ Integer ALU
Domains: Fetch/Dispatch ⇐ Integer
Reason: Branch Instruction
Synchronization Type: FIFO Queue Structure
Discussion: New queue structure. When a branch instruction is committed, the machine state must be updated with the branch outcome (i.e., Taken/Not Taken). In addition, the fetch unit needs to know if the prediction was correct or not since it must change fetch paths if the prediction was not correct. This communication path is, by its very nature, sequential. That is, branch outcomes need to be processed by the fetch/dispatch domain in a strict first-in-first-out manner. This is also a communication path that is not likely to benefit from the penalty hiding property of the FIFO queue structure since doing so would require 5 or more branch outcomes to be buffered waiting to be processed by the fetch/dispatch domain.
There is another aspect of a branch mis-prediction that must be properly handled. That is, when a branch is mis-predicted, all speculative instructions must be squashed from the machine pipeline. This can happen as soon as the branch outcome is known and does not need to occur after the branch predictor is updated. The communication in this case occurs from the integer domain (where the branch outcome is determined) to all other domains. To handle this case, we chose to squash speculative instructions on the rising edge of each domain clock after the synchronization has occurred in the fetch/dispatch domain. Although this represents a conservative approach, it simplifies the design considerably since it eliminates the synchronization FIFO between the integer domain and the floating-point and load/store domains. The effect of implementing speculative instruction squashing in this manner is that resources used by those soon-to-be-squashed instructions are used longer than they would be otherwise. This may have a negative impact on performance if instructions cannot be issued because processor resources are in use by one of these instructions.
Existing queue structure. The load value may originate in the Load/Store Queue (LSQ), the L1 Data Cache, the L2 Cache or Main Memory; any synchronization that is required to get the value to the LSQ is assumed to have already been assessed. This communication channel handles only the synchronization between the LSQ and the integer register file. Since loads can complete out-of-order (a load that is found in the LSQ could have issued after a load that is waiting on a cache miss), the synchronization mechanism in this case must be of the issue queue structure type since the load result is useful to the integer domain as soon as possible after becoming available.
Discussion: Existing queue structure. Each load and store instruction is broken into two operations: i) the load/store operation, which accesses the L1 data cache, and ii) the addition operation, which computes the effective address for the memory reference. The load/store operation is issued to the load/store domain, and the effective address calculation is issued to the integer domain. The load/store operation is dependent on the address calculation since the memory reference cannot be performed until the address is known. Since both integer operations in the integer issue queue (IIQ) and load/store operations in the load/store queue (LSQ) are free to proceed out-of-order, the effective address calculation can potentially be used by the LSQ as soon as the calculation is complete, regardless of the program order in which the memory references were made. These characteristics require that the effective address result that is transferred from the integer domain to the load/store domain be stored in an issue queue structure.
New queue structure. Converting a floating-point value into an integer value is performed by the floating-point hardware, but the result is written to the integer register file. In order to ensure that the converted value is properly received in the integer domain, a simple first-in-first-out structure is necessary since there is no requirement that the operations occur out of order. That being the case, it is also unlikely that the synchronization penalty can be hidden by the occupancy of the FIFO unless a stream of conversion operations is performed. Although this situation is rare, it may occur for vector floating-point applications. In those cases, the existence of the FIFO is likely to reduce the effective synchronization penalty seen by the application.
Existing queue structure. When instructions complete execution, the machine state must be updated. Although the update of machine state occurs in program order, the completion of instructions is out-of-order. The Reorder Buffer (ROB) is the structure that commits instructions in program order. The ROB does this by examining all instructions that have completed execution. When an instruction has completed execution and it is the next instruction in program order, it is committed. If only one instruction could be committed in a given cycle, then instruction completion information could be transferred from the various domains in-order (i.e., in order of completion, not in program order). Committing only one instruction per cycle would significantly degrade the performance of a superscalar processor that can issue and execute more than one instruction in parallel. Therefore, the ROB must be capable of committing more than one instruction per cycle. The ability to commit more than one instruction per cycle requires that the synchronization mechanism used for the completion information be of the issue queue structure type since the ROB can use the completion information in any order. Note that the synchronization delay may result in increased pressure (i.e., higher average occupancy) on the ROB structure.
New queue structure. If the processor is configured without a level-2 cache, then L1 I-cache misses must be filled from main memory. This interface is a simple FIFO since the front end must wait for L1 I-cache misses to complete, so there is no benefit to allowing out-of-order transactions. Note that this channel only exists in processor configurations without an L2 cache.
Simulation Methodology
Our simulation environment is based on the SimpleScalar toolset [4] with the Wattch [2] power estimation extension and the MCD processor extensions [20]. The MCD extensions include modifications to more closely model the microarchitecture of the Alpha 21264 microprocessor [12]. A summary of our simulation parameters for the Alpha 21264-like processor appears in Table 2. For comparison to an in-order processor, we also simulate a StrongARM SA-1110-like processor. A summary of the simulation parameters for that processor is given in Table 3.
We selected a broad mix of compute bound, memory bound, and multimedia applications from the MediaBench, Olden, and Spec2000 benchmark suites (shown in Figure 9).
Our simulator tracks the relationships among the domain clocks on a cycle-by-cycle basis. Initially, all clock starting times are randomized. To determine the time of the next clock pulse in a domain, the domain cycle time is added to the starting time, and the jitter for that cycle (which may be positive or negative) is added to this sum. By performing this calculation for all domains on a cycle-by-cycle basis, the relationships among all clock edges are tracked. In this way, we can accurately account for synchronization costs due to violations of the clock edge relationship.
Synchronization performance degradation is measured by comparing the overall program execution time of the MCD processor with an identically configured, fully synchronous (i.e., single, global clock domain) processor. The fully synchronous processor is clocked at 1.0GHz. Each domain of the MCD processor is clocked by an independent 1.0GHz clock (i.e., independent jitter for each domain clock, no phase relationship between any domain clocks).
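The clock tracking described above can be sketched as a per-domain edge generator. This is an illustration, not the simulator's code: the uniform jitter distribution, the choice to add jitter to each nominal edge without accumulating it, the function name, and the short domain labels are all assumptions.

```python
import random


def domain_clock_edges(period_ps: float, n_edges: int, max_jitter_ps: float,
                       rng: random.Random) -> tuple[float, list[float]]:
    """Return a random start time and the following jittered rising edges.

    Each edge is the nominal time (start + k * period) plus that cycle's
    jitter, so jitter keeps a long-term average of zero and does not drift.
    """
    start_ps = rng.uniform(0.0, period_ps)  # randomized starting time
    edges = []
    for k in range(1, n_edges + 1):
        jitter_ps = rng.uniform(-max_jitter_ps, max_jitter_ps)
        edges.append(start_ps + k * period_ps + jitter_ps)
    return start_ps, edges


rng = random.Random(0)  # fixed seed so the sketch is repeatable
domains = ("fetch/dispatch", "integer", "floating-point", "load/store")
starts_and_edges = {name: domain_clock_edges(1000.0, 5, 110.0, rng)
                    for name in domains}
for name, (start_ps, edges) in starts_and_edges.items():
    print(f"{name:15s}", [round(e) for e in edges])
```

Comparing the edge lists of two domains gives the T values that determine whether a given crossing pays an extra cycle.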
Results
The average performance degradation of the Alpha 21264-like MCD processor over all 30 benchmarks is less than 2%. The individual results (Figure 9) show that there is variation in the performance degradation caused by the MCD microarchitecture, ranging from a maximum of 3.7% for swim to a performance improvement of 0.7% for power. Although this performance improvement is quite modest, it is surprising since we expect the MCD microarchitecture to impose a performance penalty. To understand what is happening, we have to look closely at what happens to instructions within an MCD processor. There are side effects to adding cycles to the execution of individual instructions, i.e., the instruction commit rate and cache access patterns are changed slightly. Although these changes typically do not have a significant impact on the application execution, there are cases where the impact is appreciable. This is precisely the case for power. By examining detailed per-interval processor statistics, we could determine that the change in instruction timing causes an increase in the branch mispredict penalty of 2.2% but a decrease in the average memory access time of approximately 1.2%. Since greater than one out of every four instructions in the simulated instruction window is a memory reference, and less than one in eight is a branch, the decreased memory access time translates into an overall performance improvement.
Intuitively, it makes sense that there would be higher performance degradation and greater variation for the in-order StrongARM MCD processor than for the out-of-order Alpha 21264. In fact, the results for the SA-1110-like MCD processor range from a maximum performance degradation of 7.0% for power to a 0.1% performance improvement for gsm. The larger range of performance degradation occurs because the internal operations of the StrongARM are inherently more serial than those of the Alpha 21264-like processor. At first glance it would seem that the performance degradation of the StrongARM-like processor should be significantly higher than that of the Alpha 21264-like processor given the vastly different architectures. The percent change in performance is somewhat misleading in this regard. The baseline StrongARM processor has a Cycles Per Instruction (CPI) of approximately 3.5, whereas the Alpha 21264 baseline CPI is approximately 1.0. Therefore, although the performance degradation percent values are not significantly different, the actual change in CPI is larger for the in-order processor. When comparing the changes in CPI directly, the increase for the StrongARM processor is nearly 5x that of the Alpha processor configuration. As such, a significant portion of the MCD performance degradation can be eliminated by the addition of out-of-order superscalar execution features, as demonstrated in Table 4, using the in-order StrongARM-like processor as a starting point. Here, Performance Degradation Eliminated is the percentage of the baseline performance degradation that was eliminated as a result of adding out-of-order superscalar execution features to the in-order processor architecture.
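To see why similar percentages hide very different CPI changes, the degradation can be converted into an absolute CPI increase. The sketch below assumes that execution time scales directly with CPI (same instruction count and clock frequency), and it uses a 2% degradation for both processors purely as an example value; it does not reproduce the measured results.

```python
ALPHA_BASELINE_CPI = 1.0       # approximate baseline CPI from the text
STRONGARM_BASELINE_CPI = 3.5   # approximate baseline CPI from the text


def cpi_increase(baseline_cpi: float, degradation_percent: float) -> float:
    """Absolute CPI increase implied by a percentage slowdown (assumed model)."""
    return baseline_cpi * degradation_percent / 100.0


EXAMPLE_DEGRADATION = 2.0  # hypothetical value, applied to both processors

alpha_delta = cpi_increase(ALPHA_BASELINE_CPI, EXAMPLE_DEGRADATION)
strongarm_delta = cpi_increase(STRONGARM_BASELINE_CPI, EXAMPLE_DEGRADATION)
print(f"Alpha-like CPI increase:     {alpha_delta:.3f}")
print(f"StrongARM-like CPI increase: {strongarm_delta:.3f}")
print(f"ratio: {strongarm_delta / alpha_delta:.1f}x")
```

Even at an identical percentage, the in-order processor's CPI increase is 3.5 times larger under this model; a somewhat higher percentage for the in-order processor widens the gap further.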
Inter-domain synchronization penalties naturally occur over the most highly utilized communication channels. However, the synchronization penalties do not necessarily result in performance degradation. This is because the most highly utilized channels (shown in Figure 10 for the Alpha 21264-like and StrongARM-like families of MCD processors) are the integer instruction commit (#12) and load/store instruction commit (#14) channels, and the inter-domain synchronization penalties on these channels are largely hidden by the Re-order Buffer (ROB). This means that an MCD processor is likely to put greater pressure on the ROB resources and could potentially benefit from a larger ROB. Close examination of Figure 1 would suggest that channel pairs #9 & #12 and #11 & #14 should incur the same fraction of synchronization penalties, since these pairs represent instruction processing through the integer and load/store domains, respectively. The reason for the somewhat counter-intuitive results of Figure 10 is that channels #9 and #11 (instruction dispatch) are significantly wider than channels #12 and #14 (instruction completion). Whereas instruction dispatch width is governed by the type of instructions present in the dynamic instruction stream, instruction completion is governed by the actual instruction-level parallelism available, which is likely to be considerably less. This means that for the same number of instructions passing through channel pairs #9 & #12 and #11 & #14, the instruction completion paths will carry information more frequently, but with each transaction being smaller than on the instruction dispatch paths. This results in more possible synchronization penalties being assessed over the instruction completion paths. Table 5 shows this characteristic for two Alpha 21264-like and two StrongARM SA-1110-like processor configurations.
In these figures, Synchronization Time is the percentage of the total execution time that events within the processor were delayed due to inter-domain synchronization overhead. Hidden Synchronization Cost is the percentage of synchronization time which did not result in actual performance degradation. For the aggressive out-of-order superscalar 21264-like processor, the figure shows that as the out-of-order and superscalar features of the processor are eliminated, more of the inter-domain synchronization penalty results in performance degradation. The opposite effect is seen when an in-order processor such as the StrongARM SA-1110 is augmented with out-of-order superscalar architectural features. Table 6 shows that although various out-of-order superscalar features were removed from the 21264-like processor, and various out-of-order superscalar features were added to the SA-1110-like processor, the effects of these changes on CPI were consistent. This underscores the idea that in-order versus out-of-order instruction issue and serial versus superscalar execution are the most dominant factors in determining the effects of inter-domain synchronization overhead.
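The two metrics defined above can be related with a short sketch. It is one plausible reading of the definitions, not the authors' measurement procedure: it assumes that Synchronization Time and the observed performance degradation are both expressed as percentages of the same execution time, and that any degradation is attributable to synchronization. The function name `hidden_sync_cost` and the input values are hypothetical.

```python
def hidden_sync_cost(sync_time_pct: float, degradation_pct: float) -> float:
    """Percentage of synchronization time that did not surface as slowdown.

    sync_time_pct:   share of execution time in which events were delayed
                     by inter-domain synchronization (must be positive).
    degradation_pct: observed slowdown; a negative value means a speedup.
    """
    if sync_time_pct <= 0:
        raise ValueError("synchronization time must be positive")
    # The exposed part cannot be negative or exceed the synchronization time.
    exposed = min(max(degradation_pct, 0.0), sync_time_pct)
    return 100.0 * (sync_time_pct - exposed) / sync_time_pct


# Hypothetical inputs, not values taken from the paper's tables.
print(hidden_sync_cost(10.0, 2.5))   # a quarter of the delay is exposed
print(hidden_sync_cost(10.0, -0.1))  # a net speedup: nothing is exposed
```

The clamping step is a design choice: it keeps the metric within 0 to 100% when a benchmark shows a slight speedup, as reported for gsm, or when the measured slowdown exceeds the synchronization time.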
Related Work
There has been significant research into the characteristics, reliability, and performance of low-level synchronization circuits. In [23], two asynchronous data communication mechanisms are presented, which utilize self-timed circuits to achieve reliable and low-latency data transfers between independently clocked circuits. In [16], a pausible clock is used to ensure reliable communication and an arbiter circuit is used to reduce how often the clock must be delayed. Other approaches to synchronization have been proposed, such as [22], which detect when a metastability problem may occur (i.e., data transitioning too close to a clock edge) and delay either the data or the clock until there is no possibility of metastability.
In addition to the evaluation of low-level synchronization circuits, there has been research into high-level evaluation of GALS systems. In [15], the potential power savings of a GALS system are quantified. In addition, an algorithmic technique for partitioning a synchronous system into a GALS system is presented with the goal of achieving low power. This work shows promising power savings results. Unfortunately, a corresponding performance evaluation was not presented. In [10], a GALS processor with five independent domains is evaluated with respect to power consumption and performance degradation. This work is most closely related to our analysis. A conclusion of this work is that their GALS processor incurred a performance degradation of 10% on average. The authors also conclude that studies of latency hiding techniques are needed to more fully understand the architectural reasons for performance degradation. We concur with this conclusion and have performed this study on our GALS processor. We attribute the disparity in performance degradation between these two studies to the domain partitioning differences (i.e., five versus four domains). We have determined that a clock domain partitioning that separates the fetch circuits (instruction cache and branch predictor) from the dispatch circuits can significantly degrade overall performance.
Conclusions
We have shown that the out-of-order superscalar execution features of modern, aggressive processors are the same features that minimize the impact of the MCD synchronization overhead on processor performance. With an aggressive, out-of-order superscalar processor such as the Alpha 21264, as much as 94% of the inter-domain synchronization penalties are effectively hidden and do not result in performance degradation. When out-of-order superscalar execution features are added to a simple in-order serial processor like the StrongARM SA-1110, as much as 62% of the MCD performance degradation can be eliminated. Our prior work studied the potential energy saving advantage of our GALS MCD processor. In that work we demonstrated an approximately 20% reduction in energy consumption. These results combine to demonstrate the overall benefit of our GALS MCD processor design.
The resulting modest performance impact due to synchronization can potentially be overcome by faster per-domain clocks permitted by the removal of a global clock skew requirement, and by the ability to independently tune each domain clock. These are areas proposed for future work.
