Abstract-With ever expanding design space and workload space in multicore era, it is a challenge to identify optimal design points quickly, desirable during the early stage of multicore processor design or programming phase. To meet this challenge, this paper proposes a theoretical framework that can capture the general performance properties for a class of multicore processors of interest over a large design space and workload space, free of scalability issues. The idea is to model multicore processors at the thread-level, overlooking instruction-level and microarchitectural details. In particular, queuing network models that model multicore processors at the thread level are developed and solved based on an iterative procedure over a large design space and workload space. This framework scales to virtually unlimited numbers of cores and threads. The testing of the procedure demonstrates that the throughput performance for many-core processors with 1000 cores can be evaluated within a few seconds on an Intel Pentium 4 computer and the results are within 5% of the simulation data obtained based on a thread-level simulator.
I. INTRODUCTION
As chip multiprocessors (CMPs) become the mainstream processor technology, challenges arise as to how to design and program CMPs to achieve desired performance for applications of diverse nature. There are two scalability barriers that the existing CMP analysis approaches (e.g., simulation and benchmark testing) find difficult to overcome. The first barrier is the difficulty for the existing approaches to effectively analyze CMP performance as the numbers of cores and threads of execution become large. The second barrier is the difficulty for the existing approaches to perform comprehensive comparative studies of different architectures as CMPs proliferate. In addition to these barriers, how to analyze the performance of various possible design/programming choices during the initial CMP design/programming phase is particularly challenging, when the actual instruction-level program is not available.
To overcome the above scalability barriers, approaches that work at much coarser granularities (e.g., overlooking microarchitectural details) than the existing approaches should be sought to keep up with the ever growing design space. Such an approach should be able to characterize the general performance properties for a wide variety of CMP architectures and a large workload space at coarse granularity. Moreover, such an approach should not require the availability of the instruction-level programs as input for performance analysis. The aim is to narrow down the design space of interest at coarse granularity, in which the existing approaches can work efficiently to further pin down the optimal points at finer granularities. To this end, we believe that an overarching theoretical approach, encompassing both existing and future design and workload spaces, must be sought. In this paper, we develop a theoretical framework of such kind.
Two unique features are employed in our theoretical framework to overcome the scalability barriers. First, the framework works at the thread level, overlooking instruction-level and microarchitectural details, except those having a significant impact on thread-level performance. A simulation tool developed at this granularity [15] was found to predict the system performance to within 6% of the cycle-accurate simulation results. Moreover, as we shall see in Section II, this granularity is particularly amenable to large design space exploration and theoretical analysis.
Second, the approach taken for the design space exploration in our theoretical framework is unconventional. Instead of exploring the design space based on sampled points in the space, the framework directly studies the general performance properties of system classes over the entire design space. Here a system class characterizes a class of multicore architectures, a workload space, and a set of performance measures. Understanding the general performance properties of a system class leads to an understanding of the properties of all individual points in the design space (i.e., specific multicore architectures, specific workloads, and the associated performance data). This approach is similar to functional analysis in mathematics, which analyzes general properties of functions over an entire vector space, as illustrated in Figure 1. The core of this approach is to derive the generation function G(x) for a system class of interest defined in a large design space, from which all the performance measures can be further derived. In our framework, the design space and system classes are expressed mathematically using the language of queuing network models.

This paper makes the following major contributions. First, it develops a thread-level modeling technique for multithreaded multicore processors, which is amenable to queuing analysis. Second, based on this modeling technique, it establishes a mapping between large classes of multicore processors and queuing network models defined in a large design and workload space. Third, it develops an iterative procedure that allows G(x) to be derived approximately for classes of multicore processors with virtually unlimited numbers of cores, overcoming the above-mentioned scalability barriers.
The rest of the paper is organized as follows. Section II describes the proposed theoretical framework. Section III tests the performance of the proposed framework. Section IV discusses the related work. Finally, section V concludes the paper.
II. THEORETICAL FRAMEWORK
In this section, we first describe the thread-level modeling concepts and how the thread-level modeling can be mapped to queuing network models. Then we introduce a large design space that can be represented in terms of queuing network models, whose generation functions have closed-form solutions. Finally, we design an iterative procedure that allows these generation functions to be quickly calculated with high precision, in the presence of virtually unlimited numbers of cores.
A. Thread-Level Modeling Concepts:
At the core of the proposed theoretical framework is the modeling of the workload, defined as a mapping of program tasks to threads in different cores and system components, known as code paths. As shown in Figure 2 , a code path handled by a given thread in a given core is a sequence of segments (measured in the unit of CPU cycles) representing the durations the thread is serviced by the CPU and other resources (not including queuing delays or other idle times) throughout the execution of the entire program or program task.
The code path is defined at the thread level, in the sense that it only captures the events that have a major impact on the thread-level performance. In other words, the instruction-level and microarchitectural details are overlooked, unless they trigger events that may have a significant effect at the thread level, such as a memory access instruction that causes the thread to stall, or instructions in a critical region that cause a serialization effect at the thread level. A code path defined at this level can be easily derived from pseudocode, rather than from an instruction-level program, which may not be available during the processor design or initial programming phase.
Correspondingly, all the components, including CPU, cache/memory, and interconnection network, are modeled at a highly abstract level, overlooking microarchitectural details and keeping just enough to capture the thread-level activities. For example, a CPU running a coarse-grained thread scheduling discipline and a FIFO memory are modeled simply as queuing servers running a coarse-grained thread scheduling algorithm and a FIFO discipline, respectively. As an example, we consider a single coarse-grained core with a FIFO memory. The core runs two active threads loaded with the same code path (on the left in Figure 3). The execution sequence is shown on the right in Figure 3. The black segments are thread idle times. Now consider a closed queuing network composed of two FIFO queuing servers, modeling the coarse-grained CPU and the FIFO memory, as shown in Figure 4(a). Assume that there are two jobs circulating in this network, modeling the two active threads. As one can see, without considering the queuing times or thread waiting times, a thread making a round trip, CPU-to-Memory-to-CPU, plus CPU-to-Memory, generates three segments corresponding to the code path in Figure 3. If the service times at the CPU and the memory exactly match the corresponding segment lengths of the code path, this queuing network model exactly emulates the execution sequence for that thread. Now, with two threads, it is not difficult to see that, due to the queuing effect, the two threads making such a trip will generate the same pattern of execution sequence as the one on the right in Figure 3. Again, if the service times exactly match the segment lengths, the thread circulation exactly recovers the execution sequence in Figure 3.
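The emulation argument above can be tried out with a small sketch. The following is a minimal illustration, not the simulator of [15]: it assumes every resource is a single FIFO server, and the segment lengths (3, 5, and 2 cycles) are hypothetical, since the values in Figure 3 are not reproduced in the text. The function name `simulate` and the resource labels are likewise illustrative.

```python
import heapq

def simulate(code_paths):
    """Replay thread code paths on single FIFO servers.

    code_paths[t] is a list of (resource, cycles) segments for thread t.
    Returns (finish time per thread, number of segments served).
    """
    free_at = {}  # resource name -> time at which it next becomes idle
    # Heap entries: (time the thread is ready, thread id, next segment index).
    heap = [(0, tid, 0) for tid in range(len(code_paths))]
    heapq.heapify(heap)
    finish = [0] * len(code_paths)
    served = 0
    while heap:
        ready, tid, idx = heapq.heappop(heap)
        path = code_paths[tid]
        if idx == len(path):          # thread has completed its code path
            finish[tid] = ready
            continue
        resource, cycles = path[idx]
        # FIFO service: wait until the resource is idle (thread idle time).
        start = max(ready, free_at.get(resource, 0))
        end = start + cycles
        free_at[resource] = end
        served += 1
        heapq.heappush(heap, (end, tid, idx + 1))
    return finish, served

# Hypothetical three-segment code path: CPU, memory, CPU.
path = [("cpu", 3), ("mem", 5), ("cpu", 2)]
print(simulate([path, path]))  # ([10, 15], 6)
```

Because arrivals are processed in time order, the second thread waits for the CPU and then for the memory, which reproduces the kind of interleaving with idle gaps described for Figure 3.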
So far we have been trying to emulate the actual execution process for the threads, which is no different from simulating the actual process at the thread level. Now we need to realize that queuing models are in essence stochastic models, which are meant to capture the long-run statistical effects of a real system (open queuing network models may need to be used if the workload is intermittent; these, however, can always be transformed into closed queuing network models [16]). In other words, the service time for a queuing server is in general a random number following a given distribution, denoted as $\mu_i$ for queuing server i. As a result, it is the distribution of the segment lengths, not the individual segment lengths, that needs to be used to characterize the service time. Moreover, for a code path that characterizes a workload for a processor with multiple parallel resources, such as the one in Figure 2, the corresponding closed queuing network, as depicted in Figure 4(b), also involves a routing probability $p_{0i}$ for a thread to go to the i-th resource upon exiting the CPU server. This parameter should also be evaluated statistically by counting the frequency of such occurrences in the long-run code paths handled by these threads.
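As a rough illustration of how these statistics might be extracted, the sketch below computes the mean segment length per resource and the empirical routing frequencies out of the CPU from a code path. The code path representation (a list of `(resource, cycles)` pairs), the function name `estimate_workload`, and the sample values are assumptions made for this example only.

```python
from collections import Counter, defaultdict

def estimate_workload(code_path, cpu="cpu"):
    """Estimate mean service times and CPU exit routing frequencies."""
    total_cycles = defaultdict(int)
    visits = Counter()
    exits = Counter()  # resource visited right after each CPU segment
    for i, (resource, cycles) in enumerate(code_path):
        total_cycles[resource] += cycles
        visits[resource] += 1
        if resource == cpu and i + 1 < len(code_path):
            exits[code_path[i + 1][0]] += 1
    mean_service = {r: total_cycles[r] / visits[r] for r in total_cycles}
    n_exits = sum(exits.values())
    routing = {r: c / n_exits for r, c in exits.items()} if n_exits else {}
    return mean_service, routing

# Hypothetical code path with two non-CPU resources.
path = [("cpu", 4), ("mem", 10), ("cpu", 6), ("cache", 2), ("cpu", 5), ("mem", 20)]
means, routes = estimate_workload(path)
print(means)   # {'cpu': 5.0, 'mem': 15.0, 'cache': 2.0}
print(routes)  # mem: 2/3, cache: 1/3
```

Only first-order statistics are collected here; a fuller treatment would keep the whole empirical distribution of segment lengths.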
From the above examples, we conclude that at the thread level, any type of CMP with N components and any long-run workload can be generally modeled as a closed queuing network with N queuing servers of various service types in terms of queue scheduling disciplines, together with a workload space $(\{\mu_i\}, \{p_{ij}\})$ spanned by the possible combinations of service time distributions and routing probabilities. The central task of this proposed framework is to develop mathematical techniques to analytically solve this closed queuing network model. The solution should be able to account for as many service types and as large a workload space as possible, aiming at covering a large design space.
Finally, with regard to the workload, there is a fundamental difference between analytical modeling and simulation/benchmark testing. For the latter, one does not know the actual code path until the testing is over (since there might be conditional branching and dynamic program generation at runtime), whereas for the former, one can assume that the actual code path is known in advance (since the aim of analytic modeling is to try to explain what has happened, i.
B. Design Space:
We want the design space to be as large as possible to encompass as many multicore architectures and workloads as possible. Figure 5 depicts such a design space. It is a five-dimensional space, including the resource-access dimension, thread-scheduling-discipline dimension, program dimension, number-of-threads-per-core dimension, and number-of-cores dimension.
The Thread-Scheduling-Discipline dimension determines what CPU or core type is in use. The existing commercial processors use fine-grained, coarse-grained, simultaneous multithreading (SMT), and hybrid coarse-and-fine-grained thread scheduling disciplines. Some systems may also allow a thread to be migrated from one core to another.
The Resource-Access dimension determines the thread access mechanisms to CMP resources other than the CPU. These resources may include memory, cache, interconnection network, and even a critical region. The typical resource access mechanisms include first-come-first-served (FCFS), processor sharing (parallel access), parallel resources (e.g., memory banks), and pipelined access. For cache access, a cache hit model may have to be incorporated, which may be load dependent.
The Program dimension includes all possible programs. This dimension is mapped to a workload space, involving all possible code paths, for a given type of processor organization (e.g., a code path for a single-core processor with three parallel resources as in Figure 2 ).
We expect that both the Number-of-Cores and Number-of-Threads-per-Core dimensions will reach thousands in the near future. Our theoretical framework needs to be able to deal with CMPs of such scale. Moreover, the theoretical framework needs to be able to account for dynamic multithreading, where the number of threads used for a program/program task may change over time.
The queuing network modeling techniques at our disposal restrict the size of the design space to one that must be mathematically tractable. This makes the coverage of the design space in Figure 5 a challenge. As shown in Figure 5 (i.e., the small cone on the left), the part of the design space that has been (incompletely) explored by the existing work using queuing network modeling techniques is only a tiny part of the entire space (see Section IV for more details).
[Figure 5: design space. Axes: Number-of-Threads-per-Core, Number-of-Cores, Program (workload), Thread-Scheduling-Discipline, Resource-Access; example labels: Memory Bank, Coarse-grained.]

In what follows, we discuss how our framework allows almost the entire design space in Figure 5 to be explored, except dynamic multithreading (see Section V for more details). We look at different dimensions of the design space separately.
Resource-Access and Thread-Scheduling-Discipline Dimensions: Without resorting to any approximation techniques, the existing queuing network modeling techniques allow both of these dimensions to be largely explored analytically. Any instance in either of these two dimensions can be approximately modeled using a queuing server model that has local balance equations (i.e., it leads to queuing network solutions of product form or closed form). More specifically, Table I lists the mappings from these instances to queuing server models. Note that the memory banks should be modeled as separate queuing servers and hence are not listed in this table. Also note that for all the multithread scheduling disciplines in Table I except the Hybrid-Fine-and-Coarse-Grained one (to be explained below), the service time distribution of a queuing model models the time distribution for a thread to be serviced at the corresponding queuing server. With these in mind, the following explains the rationales behind the mappings in Table I:
• SMT: It allows multiple issues in one clock cycle from independent threads, creating multiple virtual CPUs. If the number of threads in use is no greater than the number of issues in one clock cycle, the CPU can be approximately modeled as an M/G/$\infty$ queue, mimicking multiple CPUs handling all the threads in parallel. Otherwise, it can be modeled as an M/M/m queue, i.e., there are not enough virtual CPUs to handle all the threads and some may have to be queued.
• Fine-grained thread scheduling discipline: All the threads accessing the CPU share it at the finest granularity, i.e., one instruction per thread in a round-robin fashion. This discipline can be modeled as an M/G/1 PS queue, i.e., all the threads share an equal amount of the total CPU resource in parallel.
• Coarse-grained thread scheduling discipline: All the threads accessing the CPU are serviced in a round-robin fashion, and the context is switched only when the running thread is stalled, waiting for the return of other resource accesses. This can be approximately modeled as an FCFS queue, e.g., an M/M/1 queue.
• Hybrid-Fine-and-Coarse-Grained thread scheduling discipline: It allows up to a given number of threads, say m, to be processed in a fine-grained fashion and the rest to be queued in an FCFS queue. This can be modeled as an M/M/m FCFS queue. In this queuing model, the average service time for each thread being serviced is m times longer than the service time if only one thread were being serviced, mimicking the fine-grained processor sharing effect.
• Resources dedicated to individual threads: Such resources can be collectively modeled as a single M/G/$\infty$ queue, i.e., there is no contention among different threads accessing these resources.
We note that the resource-access dimension also includes load-dependent cache hit rate. The cache hit probability (i.e., the routing probability to move back to the CPU) is generally load-dependent in the sense that it may be either positively or negatively correlated with the number of threads in use, due to temporal locality or cache resource contention, respectively. These effects can be accounted for in our framework without approximation, by means of the existing load-dependent routing techniques (e.g., [26]).
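The mappings explained above can be collected into a small lookup, shown here as an illustrative sketch. The function name `cpu_queue_model`, the discipline keys, and the string labels are assumptions for this example; the mappings themselves are those stated in the text.

```python
def cpu_queue_model(discipline, n_threads=None, issue_width=None):
    """Return the queuing server model that the text associates with a discipline."""
    if discipline == "smt":
        # Enough virtual CPUs for every thread: infinite-server queue;
        # otherwise some threads must wait, so a multi-server queue is used.
        return "M/G/inf" if n_threads <= issue_width else "M/M/m"
    models = {
        "fine_grained": "M/G/1 PS",
        "coarse_grained": "M/M/1 FCFS",
        "hybrid": "M/M/m FCFS",
        "dedicated_resource": "M/G/inf",
    }
    return models[discipline]

print(cpu_queue_model("smt", n_threads=2, issue_width=4))  # M/G/inf
print(cpu_queue_model("smt", n_threads=8, issue_width=4))  # M/M/m
print(cpu_queue_model("coarse_grained"))                   # M/M/1 FCFS
```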
We also note that the thread-scheduling-discipline dimension includes thread migration. Thread migration allows a thread to be migrated from one core to another, e.g., for load balancing purposes. This effect can be accounted for without approximation by allowing jobs to have non-zero probabilities of switching from one class to another [14] [16].

Program Dimension: In principle, this dimension can be fully explored through a thorough study of the workload space, characterized by the service time distributions and routing probabilities, i.e., a collection of $(\{\mu_i\}, \{p_{ij}\})$'s. However, for the solvable queuing server models in Table I, such as M/M/m and M/M/1 queues, the service time distribution $\mu_i$ is a given, i.e., the exponential distribution. Since the exponential distribution is characterized by only a single parameter, i.e., the mean service time $t_i$, it can only capture the first order statistics of the code path segments corresponding to that server, hence providing a first order approximation of the program dimension or workload space. As part of our future work, we will consider more sophisticated queuing models in an attempt to overcome this limitation. Nevertheless, it is widely recognized that the queuing performance of closed queuing networks is insensitive to the service distributions of the queuing servers, generally known as the robustness property of closed queuing networks [14]. Hence, we expect our first order approximation to provide good coverage of the workload space.
Number-of-Cores and Number-of-Threads-per-Core Dimensions: First, we note that the number-of-threads dimension should allow dynamic multithreading, meaning that at different program execution stages, the number of active threads may vary. In Section V, we propose a possible solution, which will be further studied in the future. Second, we need to address the scalability issues in calculating the generation functions as the numbers of cores and threads increase. We consider a general closed queuing network modeling an N-core (or core cluster) system with K shared resources. We want to be able to get a closed-form generation function G for such closed queuing networks, from which any performance measures can be derived. As long as all the queuing servers in the system have local balance equations (e.g., following the queuing server models in Table I), the generation function (also known as the normalization function in queuing theory) can be generally written as:
where $f_i(m_{ik})$ is a function corresponding to the probability that there are $m_{ik}$ threads currently in core i (for i = 1, ..., N) for thread class k (for k = 1, ..., N, i.e., the threads from each core form a class), $f_j(m_{j1}, \ldots, m_{jN})$ is a function corresponding to the probability that there are $m_{j1}$ threads of class one, $m_{j2}$ threads of class two, and so on, in shared memory queuing server j (for j = N+1, …, N+K), and $M_i$ is the total number of threads belonging to core i (for i = 1, ..., N). The $f_i$'s take different forms for different core organizations, in terms of, e.g., CPU, cache, and local memory of different types from the resource-access dimension and thread-scheduling-discipline dimension of the design space.
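To give a feel for how performance measures follow from a normalization function, the sketch below handles a much simpler special case than the multi-class G discussed here: a single-class closed network of fixed-rate FCFS servers, for which the normalization constant can be computed with Buzen's convolution algorithm, a standard textbook technique used purely for illustration. The function names and the sample demands are assumptions; `demands[k]` is the mean service demand per cycle at server k.

```python
def normalization_constants(demands, n_jobs):
    """Buzen's convolution algorithm: returns [G(0), G(1), ..., G(n_jobs)]."""
    g = [1.0] + [0.0] * n_jobs
    for d in demands:
        # Fold one fixed-rate server into the network at a time.
        for n in range(1, n_jobs + 1):
            g[n] += d * g[n - 1]
    return g

def throughput(demands, n_jobs):
    """System throughput derived from the normalization constants."""
    g = normalization_constants(demands, n_jobs)
    return g[n_jobs - 1] / g[n_jobs]

# Two identical servers (e.g., a CPU and a memory) with two circulating threads.
print(normalization_constants([1.0, 1.0], 2))  # [1.0, 2.0, 3.0]
print(throughput([1.0, 1.0], 2))               # 2/3 cycles completed per time unit
```

In this single-class case the cost is only proportional to the number of servers times the number of jobs; the multi-class case with one class per core does not enjoy this simplicity.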
On one hand, we note that G is defined in the entire design space (with the first order approximation of the program dimension or workload space). Understanding the general properties of G over this space will allow the properties of individual points in the design space to be understood, just as in functional analysis (see Figure 1). On the other hand, we also note that the number-of-cores and number-of-threads-per-core dimensions create scalability barriers that prevent us from being able to effectively calculate G. This is because the computational complexity for G is $O(N_S M^{N+K})$, where M is the average number of threads per core and $N_S$ is the number of queuing servers per core. Our experiments on an Intel Core Duo T2400 1.83 GHz processor showed that for $N_S = M = 2$ and K = 1, it takes about 24 hours to compute the generation function for a 20-core system. Clearly, it is computationally too expensive to cover the entire number-of-cores and number-of-threads dimensions. In the following, we develop an iterative procedure to overcome this scalability barrier.
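The exponential growth can be made concrete with a short back-of-the-envelope sketch. It assumes, purely for illustration, that run time scales exactly with the stated complexity and takes the reported 24 hours at 20 cores as the reference point; the function names are hypothetical.

```python
def g_cost(n_cores, threads_per_core, servers_per_core, n_shared):
    """Operation count suggested by the stated complexity N_S * M**(N + K)."""
    return servers_per_core * threads_per_core ** (n_cores + n_shared)

def projected_hours(n_cores, ref_cores=20, ref_hours=24,
                    threads_per_core=2, servers_per_core=2, n_shared=1):
    """Scale the reported 24-hour run by the ratio of operation counts."""
    ratio = (g_cost(n_cores, threads_per_core, servers_per_core, n_shared)
             // g_cost(ref_cores, threads_per_core, servers_per_core, n_shared))
    return ref_hours * ratio

print(g_cost(20, 2, 2, 1))   # 4194304
print(projected_hours(30))   # 24576 hours, i.e., roughly 2.8 years
```

Under this assumption, each additional core with M = 2 doubles the cost, so a direct calculation for hundreds of cores is out of reach.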
C. An Iterative Procedure:
The difficulty in calculating G(x) lies in the fact that different cores interact with one another through shared resources. A key intuition is that the effect on each core due to resource sharing becomes more and more dependent on the first order statistics (i.e., mean values) and less sensitive to the higher order statistics (e.g., variances) or the actual distributions, as the number of cores sharing the resources increases (reminiscent of the Law of Large Numbers and the Central Limit Theorem in statistics and of Mean Field Theory in physics, although an actual formal analysis could be difficult). With this observation in mind, we were able to design an iterative procedure to decouple the interactions among cores, so that the performance of each core can be evaluated quickly as if it were a stand-alone core.
Assume there are $N_C$ cores sharing a common FIFO memory (extension to multiple shared resources is straightforward). Initially, we calculate the sojourn time $T_i(0)$ and throughput $\lambda_i(0)$ for single-core system i consisting of a core and the common memory (for i = 1, …, $N_C$). Then the initial mean sojourn time for all the cores, $T^*$, is calculated based on the following iteration formulae:
Then we enter an iteration loop as shown in Figure 6. At the n-th iteration, first the average sojourn time for the common memory, $T_m(n)$, is calculated based on a two-server queuing network (on the left of Figure 6), including a queuing server for the common memory and an M/M/$\infty$ queuing server characterized by the mean service time $T^*(n)$. There are a total of $\sum_{i=1}^{N_C} m_i$ threads circulating in this network, where $m_i$ is the number of active threads in core i. In other words, we approximate the aggregate effect of all the threads from all the cores on the common memory using a single M/M/$\infty$ queuing server with the mean service time $T^*(n)$. Then, we test whether $|T_m(n) - T_m(n-1)| < \varepsilon$ holds, for a predefined small value $\varepsilon$. If it does, we exit the loop and finish; otherwise we do the following.
The sojourn time $T_i(n)$ and throughput $\lambda_i(n)$ for core i (for i = 1, …, $N_C$) are updated based on the closed queuing network on the right of Figure 6. This time, the effect of the other cores on core i is approximated by a single M/M/$\infty$ server with the mean service time $T_m(n)$. There are $m_i$ threads circulating in this network. Finally, $T^*(n)$ is updated based on the iteration formulae, before going to the next iteration. Note that both steps involve only queuing network models that have closed-form solutions, which makes the iterations extremely fast. The iteration procedure is summarized in Figure 7.
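The structure of this decoupling loop can be sketched as follows. This is a simplified illustration under several assumptions, not the procedure as implemented by the authors: each core is reduced to a single M/M/1 CPU server, the two small networks are solved with exact single-class mean value analysis (a standard technique), and, because the formulae for $T^*$ are not reproduced in this text, $T^*$ is taken to be the thread-weighted mean of the per-core sojourn times. All function and parameter names are hypothetical.

```python
def mva(demands, is_delay, n_jobs):
    """Exact single-class MVA; every station is visited once per cycle."""
    queue = [0.0] * len(demands)
    throughput, residence = 0.0, list(demands)
    for n in range(1, n_jobs + 1):
        # Delay (infinite-server) stations add no queuing; others see the
        # queue length left behind with one job fewer in the network.
        residence = [d if delay else d * (1.0 + q)
                     for d, delay, q in zip(demands, is_delay, queue)]
        throughput = n / sum(residence)
        queue = [throughput * r for r in residence]
    return throughput, residence

def decouple(core_service, core_threads, mem_service, rel_eps=1e-3, max_iter=1000):
    """Return (per-core throughputs, memory sojourn time, iterations used)."""
    total = sum(core_threads)

    def aggregate(t_core):
        # ASSUMPTION: thread-weighted mean stands in for the T* formulae.
        return sum(m * t for m, t in zip(core_threads, t_core)) / total

    # Initial step: each core alone with the common memory.
    t_core = [mva([s, mem_service], [False, False], m)[1][0]
              for s, m in zip(core_service, core_threads)]
    t_star = aggregate(t_core)
    lam = [0.0] * len(core_service)
    t_mem_prev, t_mem, n_iter = None, mem_service, 0
    for n_iter in range(1, max_iter + 1):
        # Step 1: memory plus one delay server (mean T*) carrying all threads.
        t_mem = mva([mem_service, t_star], [False, True], total)[1][0]
        if t_mem_prev is not None and abs(t_mem - t_mem_prev) <= rel_eps * t_mem:
            break
        t_mem_prev = t_mem
        # Step 2: each core plus a delay server with mean T_m(n).
        results = [mva([s, t_mem], [False, True], m)
                   for s, m in zip(core_service, core_threads)]
        lam = [x for x, _ in results]
        t_core = [r[0] for _, r in results]
        t_star = aggregate(t_core)
    return lam, t_mem, n_iter

# Hypothetical example: 4 identical cores, 2 threads each, fast memory.
lam, t_mem, n_iter = decouple([10.0] * 4, [2] * 4, 1.0)
print(lam, t_mem, n_iter)
```

The stop condition mirrors the relative tolerance used in Section III (0.1% of $T_m(n)$). Each step solves only a two-station network, so the cost per iteration grows linearly with the number of cores and threads.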
III. TESTING
In this section, we test the accuracy of the queuing modeling techniques. The results are compared against those obtained by a simulation tool. Note that the simulation tool must be able to perform a significantly large number of simulation runs (e.g., $10^4$ to $10^6$) in a reasonable amount of time for many-core systems (e.g., with 1000 cores). The existing simulation tools are not up to the task. Fortunately, our simulation tool [15] is well suited for the purpose. In what follows, we briefly introduce the tool and the ideas for testing.
The simulation tool in [15] is a thread-level, event-driven simulator which simulates the thread-level activities only, in line with the thread-level modeling approach studied in this paper. In other words, the simulation tool takes the code path defined in Figure 2 as input for the simulation. An event is defined as a boundary between two segments, and the event type corresponds to the actual event that takes place on the right side of the boundary (e.g., the boundary between the first and the second segment is an event for memory access in Figure 2). The simulation jumps from one event to the next. This means that for a program involving a dozen resource accesses and a dozen threads, only a few hundred events need to be processed. This makes the technique extremely fast. The technique has been tested against cycle-accurate simulators, and all the performance data obtained with it are within 6% of those obtained by the cycle-accurate simulators [15].
[Figure 7. Iteration Algorithm. Recoverable steps: Step 1 uses $\mu^* = 1/T^*$ (with the label $T_m$); Step 2: get a sojourn time $T_i(n)$ for each core in Figure 6 and calculate $T^*(n)$; go to Step 1.]
To test the performance of the proposed core decoupling procedure, we consider a CMP with 1000 cores and core clusters sharing a common FIFO memory. There are two types of cores (Core Type 1 with 6 active threads and Core Type 2 with 9 active threads) and two types of core clusters (Core Type 3 with 8 active threads and Core Type 4 with 10 active threads), with 250 of each type, as given in Figure 8. Core Type 1 may model a CPU with a cache (with hit probability $p_{11}$, overlooking queuing effects), a local memory (with routing probability $p_{12}$), and routing probability $p_{1m}$ to access the common memory. Core Type 2 involves an additional server, modeling, e.g., an L2 cache. Core Type 3 models a two-core cluster with dedicated L1 and L2 caches and a local shared memory or L3 cache. Core Type 4 differs from Core Type 3 in just one of its cores, which runs an SMT CPU instead of a coarse-grained one (i.e., it is an M/M/m queue and all the rest are M/M/1 queues).
The parameter settings for each type of core are given as follows: For the simulation tool, simulation stops when all nodes in all cores execute at least $10^6$ events. For the iteration procedure, $\varepsilon$ in the iteration stop condition is set to 0.1% of $T_m(n)$.
The throughputs for the proposed Queuing Model (QM) and Simulation Tool (ST) at three different common memory service rates are listed in Tables II to IV. In each table, there are three columns with different common memory service rates representing three distinct cases. In the first case, the memory capacity is larger than the aggregate capacity of all cores (i.e., the memory is underloaded). In the second case, their capacities are almost the same. In the last case, the aggregate capacity of all cores is larger than the memory capacity (i.e., the shared memory is a potential bottleneck resource). Each column has three sub-columns, with the first showing the results from QM, the second the results from ST, and the last the difference between the two. Table II gives the results for the above parameter settings. Tables III and IV give the results with some parameter changes (see the parameter changes at the top of each table). It turns out that our iterative procedure is highly accurate and fast. For the cases in Tables II and IV, it takes fewer than 12 iterations to get the results. For the case in Table III, the number of iterations increases up to 26. For all the cases studied, the technique is three orders of magnitude faster than ST, finishing within a few seconds. This allows a large design space (or parameter space) to be scanned numerically. Moreover, for all the cases, the results are consistently within 5% of the simulation results.
One can further reduce the time complexity by running the iterative procedure only once for a given set of parameters to get an effective $T_m$, and then studying the performance of any target core over a range of workload parameters based on the closed queuing network in Figure 6 (the one on the right) with $T_m$ fixed. One can expect that changing the parameters of just one core out of 1000 cores should not significantly affect $T_m$, as is evidenced by the case study in Table IV compared with Table III. This approximation can reduce the time complexity by another two orders of magnitude.
As one can see, with the proposed technique, closed queuing network models can now be effectively used to explore the design space in Figure 5. A user of our technique may start with a coarse-grained scan of the entire space, which will allow the user to identify areas of further interest in the space, and then perform a finer-grained scan of those areas.
IV. RELATED WORK
Traditionally, simulation and benchmark testing are the dominant approaches to evaluating processor performance. Unfortunately, these approaches quickly become ineffective as the number of cores increases. Hence, there have been many alternative approaches that attempt to address this scalability issue. Statistical simulation (e.g., [1] [10] [11]) generates a short synthetic trace from a long real program trace and saves time by simulating the short synthetic trace. Partial simulation (e.g., [4] [5] [12]) reduces total simulation time by selectively measuring a subset of benchmarks. Design space exploration based on intelligent predictive algorithms trained on sampled benchmarks (e.g., [2] [3] [6] [7]) can predict the performance in the entire design space from simulations of a given benchmark over a small subset of the design space. However, most existing approaches have focused on the exploration of the microarchitectural design space and again quickly become ineffective as the numbers of cores and threads in the system increase. Moreover, the pace at which multicore architectures proliferate makes it difficult for the existing approaches to keep up, especially in terms of comparative performance analysis of different architectures. Our approach makes it possible to quickly identify the areas of interest in a large design space at coarse granularity, in which the existing finer-granularity tools can work efficiently to pinpoint the optimal operation points.
In terms of queuing network modeling, since Jackson's seminal work [25] However, very few analytical results are available for multicore processor analysis. In [21] , a mean value analysis of a multithreaded multicore processor is performed. The performance results reveal that there is a performance valley to be avoided as the number of threads increases, a phenomenon also found earlier in multiprocessor systems studied based on queuing network models [22] . Markovian Models are employed in [23] to model a cache memory subsystem with multithreading. However, to the best of our knowledge, the only work that attempts to model multithreaded multicore using queuing network model is given in [24] . But since only one job class (or chain) is used, the threads belonging to different cores cannot be explicitly identified and separated in the model and hence multicore effects are not fully accounted for.
Most relevant to our work is the work in [19] . In this work, a multiprocessor system with distributed shared memory is modeled using a closed queuing network model. Each computing subsystem is modeled as composed of three M/M/1 servers and a finite number of jobs of a given class. The three servers represent a multithreaded CPU with coarse-grained thread scheduling discipline, a FCFS memory, and a FCFS entry point to a cross-bar network connecting to other computing subsystems. The jobs belonging to the same class or subsystem represent the threads in that subsystem. The jobs of a given class have given probabilities to access local and remote memory resources. This closed queuing network model has product-form solution.
The above existing applications of queuing results to multithreaded multicore and multiprocessor systems are preliminary (i.e., within the small cone on the left in Figure 5). The only queuing discipline studied is the FCFS queue, which characterizes the coarse-grained thread scheduling discipline at a CPU and the FCFS queuing discipline for memory or interconnection network. No framework has ever been proposed that can cover a design space anywhere near the size of the one in Figure 5 and that allows system classes to be analyzed over the entire space.
Moreover, no existing queuing network model is capable of characterizing the dynamics of a program in terms of thread-level parallelism (i.e., different code segments use different numbers of threads). Although the traditional Fork-Join approach [14] can capture such an effect, it cannot be applied to multicore systems, simply because it assumes that different parallel code branches are handled by different processors, rather than by different threads belonging to the same processor or core. Another proposed approach [14], which is amenable to queuing analysis, is to use a hybrid open-and-closed queuing network with two job classes. The job class running in the closed loop emulates a fixed level of parallelism, whereas the job class in the open loop models the dynamics of thread-level parallelism. The problem with this approach is that it is difficult to match the model with the statistics of parallelism of the actual code.
V. CONCLUSIONS AND FUTURE WORK
This paper proposed a theoretical framework for design space exploration of many-core systems. The novelty of the framework lies in the fact that it works at the thread level and it studies the general properties of system classes over the entire space. Hence, it is free of scalability issues.
The framework can cover the entire design space in Figure 5 except the dynamics of multithreading. In the future, we plan to explore the following possible solution. We plan to use a set of ancillary thread classes with different delay loops to join and leave the queuing network modeling a core. It can be easily shown that with n job (or thread) classes and $2^{i-1}$ threads in the i-th class for i = 1, …, n, any number of threads in the range $[1, 2^{n+1}-1]$ can be generated in the core. For example, with n = 4, any number of threads in the range [1, 31] can be generated. The first thread class has only one thread in it. This thread class runs in the queuing network modeling the core. The remaining (n - 1) thread classes run in the delay loops. It can also be shown that by properly setting the delay value for each delay loop, the proposed model can match any distribution of parallelism (i.e., with probability $P_k$ that k threads are present in the core). The queuing network with these delay loops has a closed-form solution.
