Abstract: Many-core co-design is a complex task in which the application complexity design space, the heterogeneous many-core architecture design space, the parallel programming language design space, the simulator design space and the optimizer design space must be integrated through a binding process. The ensemble of these design spaces is called the many-core co-design spaces. A co-design automation process is indispensable for mastering the co-design complexity and cutting down the turnaround time. The co-design automation is framed to capture the dependencies across the many-core co-design spaces and to express the logic behind these interdependencies as a set of algorithms. The software modules of these algorithms interact with the remaining modules of the many-core co-design spaces to produce a power-performance optimized heterogeneous many-core architecture specific to the simultaneous execution of co-applications without space-time sharing. It is essential that such co-design automation has a built-in, user-customizable workload generator to benchmark the emerging many-core architecture. This customizability enables the generation of complex workloads with the desired computation complexity, communication complexity, control flow complexity and locality of reference, specified under a distribution and established on quantitative models. In addition, the customizable workload model aids the generation of what are called computational and communication surges. None of the current benchmark suites encompasses applications and kernels that can match the attributes of the customizable workload model proposed in this paper. These concepts are exemplified in a case study supported by simulation results gathered from the XYZ simulator.
A case study on the co-design automation is presented in Section IX, in which a many-core architecture is designed along with the correlated ISA, specific to a set of multiple applications for simultaneous execution without space-time sharing. As an outgrowth of the co-design automation, clones are created for this set of applications, and it is shown that they correlate well with the complexities (computation, communication, and control) of the original applications.
II. Heterogeneous many-core architecture benchmark suites: relevant papers
There are two classes of workloads. The first is for benchmarking high-performance computing systems, and the second is for application (classified and IP based) cloning [25] [26] [27] [28] to decide on a suitable commercially available high-performance computing system. The popular benchmark workloads are LINPACK [25], IBS [29] and SPEC [26]. The SPEC benchmark includes a set of applications and kernels. These applications are based on the different classes of algorithms used in science and engineering. SPEC includes several suites. The bzip2 workload is extensively used for file compression and has a compression stack of nine layers: run-length encoding, Burrows-Wheeler transform, move-to-front (MTF) transform, run-length encoding of the MTF result, Huffman coding, selection between multiple Huffman tables, unary (base-1) encoding, delta encoding and a sparse bit array [30]. The single-depot vehicle scheduling workload applies combinatorial optimization to graph-theoretic algorithms such as the traveling salesperson problem. Video compression involves image compression algorithms such as singular value decomposition and JPEG [31]. Astar uses shortest-path algorithms with imposed constraints such as move speed and passable/non-passable terrain. The Linpack benchmark [33] solves a dense system of linear equations using matrix algorithms such as LU decomposition and singular value decomposition. The Isolation Benchmark Suite has the following benchmarks: a CPU-intensive test, a memory-intensive test, a fork bomb, a disk I/O-intensive test, a network transmit-intensive test and a network receive-intensive test [34]. The "Memory stress test" continuously allocates memory using functions like calloc in Linux [29], whereas the "I/O stress test" performs continuous read-write disk operations involving heavy data movement.
Apart from SPEC, LINPACK and IBS, there are other benchmark suites which include complex workloads either in the form of real applications or application kernels. Among these, the notable ones are SPLASH2 [splash], PARSEC [parsec] and Rodinia [45]. In SPLASH2, the workloads are parallel programs chosen after characterization along several directions, namely speedup, load balancing, working sets, communication-to-computation ratio and issues related to spatial locality. For more details, refer to [splash]. In all, there are twelve workloads covering a wide range of applications and computational kernels. The PARSEC benchmark workloads are more advanced than those of SPLASH2. According to the PARSEC workload description, its diverse, multi-threaded applications employing state-of-the-art techniques are good enough to support research on emerging technology. In all, twelve workloads are described in [parsec]. The characterization of the PARSEC workloads follows lines similar to those of SPLASH2, namely parallelization, working sets, locality, communication-to-computation ratio and off-chip traffic. The PARSEC workloads provide parallel programs for the evaluation of chip multiprocessors.
The "Rodinia" benchmark suite introduced in [45] is meant for the performance evaluation of highly specialized computing systems, including GPUs, accelerators, FPGAs and STI Cells [46]. The suite is significant because it covers various types of workload patterns, parallelism and data sharing. It includes applications such as dynamic programming, dense linear algebra, MapReduce and graph traversal, with more expected in the future, as given in Table 1 of [45]. According to [Rodinia], the algorithms in the SPLASH2 workloads target homogeneous systems, have become obsolete and lack software pipelining. Unlike Rodinia, SPLASH and PARSEC do not support GPUs and accelerators. Also, the PARSEC applications support only a modest number of cores and hence cannot be ported to many-core systems such as GPUs. However, in building the power-performance efficient heterogeneous many-core architectures of the future, minimizing on-chip data movement becomes important [exa-1, exa-2]. A growing number of on-chip accelerators [Intel, nvidia] tends to drastically increase on-chip data movement. The Rodinia benchmark suite includes applications and computational kernels such as leukocyte tracking, back propagation, k-means and breadth-first search. As in other benchmark suites, the selection of workloads in Rodinia is based on a set of parameters.
A workload model generation method specific to applications, for exploring embedded system-level design, has been proposed in [35]. Here the generation methodology is based on a modified GCC compiler that captures the application characteristics on a realistic basis. The workload accuracy presented in [35] is improved in [36] by including workload extraction from precompiled libraries and by tracking the control flow more accurately. To overcome the time complexity of simulating large-scale applications when designing heterogeneous many-core architectures, proxy applications and proxy architectures are proposed [37] [38]. Such an approach is adopted in [39] to synthesize workloads for branch prediction and memory pattern (critical aspects of an application) for the benchmarks Barnes, Cholesky, Ocean-C, FFT and LUD. These thread-based workloads have reduced time complexity, yet match the above-mentioned benchmarks in CPI, cache hit rates and branch prediction with an accuracy of ±0.17% to ±11.7%.
However, to maintain very high accuracy for optimal power-performance scalability without compromising on time complexity, RTL workloads are generated and executed on FPGAs [40]. Another approach is to partition the application workload graph generation in a parallel environment
III. GENERIC MODEL FOR CUSTOMIZABLE WORKLOAD
A deeper analysis of the mechanism involved in selecting benchmark workloads (e.g. SPLASH2 vs. PARSEC vs. Rodinia) reveals that their lifespan is limited to a time window [rodinia] during which the complexities of technology-driven heterogeneous many-core architectures change very rapidly and many more challenging applications come forth. The most fundamental issue is that workloads need to be built on quantitative modeling of computational complexity, communication complexity, control flow complexity and locality of reference. Beyond this, quantification of the workload characteristics and customizability are other important factors, so that the workload characteristics can be scaled up along with the technology time frame and forthcoming applications in science, technology and engineering. Such a workload model is useful for both general-purpose and application-specific issues. The ultimate goal is a generic workload model with which a user can customize the computation, communication and control complexities and the locality of reference, and design either a comprehensive benchmark to evaluate overall performance or a benchmark for individual components, to suit one's needs. For example, the user can customize the locality of reference to exercise the cache, or customize the communication complexity with surge variations to track the network response.
Breaking the convention: An all-embracing workload model
There are several benchmarks [42-44] for evaluating the performance of the NoC, cache, scheduler and functional units, as well as overall performance, but they are not based on a generic customizable workload model, which is more flexible for generating complex benchmark suites. To develop such a generic workload model, the computation, communication and control complexities and the locality of reference need to be effectively quantified. Though all of these complexities are major constituents of an application, the control flow complexity and the locality of reference play a major part in characterizing applications. A graph-theoretic workload model is given in Fig. 3. This model is based on the methodology presented in the upcoming section. The edge weights are a function of the amount of data, in bytes, communicated from one node to the other and of the frequency of communication. Each node weight is a numeric, semi-numeric or non-numeric algorithm, or a general-purpose operation. The algorithms (the node weights shown in Fig. 3), the in-degree and out-degree of the nodes, the nodes per level, the total number of nodes across all levels and the number of levels are decided based on the specified distributions of these complexities.
Both in the introduction section and in this section, the essentials of a workload model in the context of ever-changing, technology-driven heterogeneous many-core architecture have been brought out. The all-embracing model includes the following.
Modeling Computation Complexity
The computational complexity of an algorithm is independent of the architecture, whereas its execution time may vary with the architectural characteristics and its communication complexity varies considerably upon parallelization. The conventional computational model [knuth] is adopted for workload generation.
Modeling Communication Complexity
In this subsection, a quantitative model for the communication complexity of an application is presented. As an example, consider the DAG shown in Fig. 3, in which the node weights are algorithms and the edges are the links establishing communication across the hyper nodes.
Communication Structure Complexity:
(See the communication graphs of MST, TSP and LUD and the sigma model for more examples.)
Building the design space for Application Complexity Modeling
In the introduction section, the importance of modeling the application complexity is discussed, in order to fix the characteristics of the heterogeneous core architectures, the types of functional units, the cache levels and the interconnect network for achieving binding. In general, applications encompass different classes of algorithms apart from general-purpose operations. This means that the computational and communication complexities of the various classes of algorithms need to be analyzed and modeled. The various models presented constitute the design space for application complexity modeling. Communication models have been proposed for randomized protocols, nondeterministic protocols and average-case protocols with regard to communication across processes, and it is always the lower bound that is presented. However, no communication model exists for an algorithm [1]. The communication complexity of an algorithm is quantified as a function of the size of the data (data set size) communicated between computing vertices in the communication graph and of the level of dependency involved as these data sets flow along the hyper edges linking the computing vertices. Let $V_1$ and $V_3$ be adjacent hyper vertices at levels $i$ and $k$ respectively. The depth index of the hyper edge between the two adjacent hyper vertices $(V_1, V_3)$ is given by $|i - k|$; the adjacency level is non-zero and positive. In general, the depth index of a hyper edge of the communication graph is $|i - k|$, where $i, k \in \{1, 2, \ldots, n\}$ and $n$, the number of levels, is positive.
The depth indices give the dependency between hyper graph vertices in the hyper graph workload. The hyper edge weight $e_{pq}$ between adjacent vertices $p$ and $q$ is defined as
$$e_{pq} = d_{pq} D_{pq} \qquad (1)$$
where $D_{pq}$ is the data set size transferred across the adjacent hyper vertices $p$ and $q$, and
$$d_{pq} = |i - k| \qquad (2)$$
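Equations (1) and (2) can be illustrated with a minimal Python sketch. The function names `depth_index` and `edge_weight` and the byte count in the example are assumptions made for illustration; they are not part of the tooling described in this paper.

```python
def depth_index(level_p: int, level_q: int) -> int:
    """Depth index d_pq = |i - k| for hyper vertices at levels i and k (eq. 2)."""
    d = abs(level_p - level_q)
    if d == 0:
        # The text requires the depth index to be non-zero and positive.
        raise ValueError("adjacent hyper vertices must lie on different levels")
    return d


def edge_weight(level_p: int, level_q: int, data_set_size: float) -> float:
    """Hyper edge weight e_pq = d_pq * D_pq (eq. 1)."""
    return depth_index(level_p, level_q) * data_set_size


if __name__ == "__main__":
    # Hypothetical edge: a vertex at level 0 sends 4096 bytes to a vertex at level 2.
    print(edge_weight(0, 2, 4096))  # 8192
```

An edge that skips a level is weighted twice as heavily as an edge of the same data set size between neighbouring levels, which is how the depth index expresses the dependency across levels.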
External Complexity:
Complexity considering fan-in alone: Let $D_{15}, D_{25}, D_{05}, D_{35}$ be the data set sizes of the incoming edges, and let $d_{iq}$ be the corresponding depth indices across the adjacent hyper vertices (algorithms), as defined previously: $d_{15}$ across $A_1$ and $A_5$ is one, $d_{25}$ across $A_2$ and $A_5$ is one, $d_{35}$ across $A_3$ and $A_5$ is one, and $d_{05}$ across $A_0$ and $A_5$ is two. Let $CEF_{\mathrm{in},i}$ be the external communication complexity of hyper vertex $i$ with respect to fan-in.
Depth indices of the edges:
$$d_{15} = |1 - 2| = 1, \qquad d_{25} = |1 - 2| = 1$$
and similarly for $d_{35}$ and $d_{05}$.
Edge weights:
$$e_{15} = D_{15}\, d_{15}, \qquad e_{25} = D_{25}\, d_{25}$$
and similarly for $e_{35}$ and $e_{05}$.
$$CEF_{\mathrm{in},5} = [D_{15}, D_{25}, D_{35}, D_{05}]\,[d_{15}, d_{25}, d_{35}, d_{05}]^{T} \qquad (3)$$
Complexity considering fan-out alone: Let $D_{58}, D_{59}$ be the data set sizes of the outgoing edges, and let $d_{op}$ be the corresponding depth indices across the adjacent hyper vertices (algorithms), as per the definition given above: $d_{58}$ between $A_5$ and $A_8$ is one and $d_{59}$ between $A_5$ and $A_9$ is one. Let $CEF_{\mathrm{out},i}$ be the communication complexity of hyper vertex $i$ with respect to fan-out.
Depth indices of the edges:
$$d_{58} = 1, \qquad d_{59} = 1$$
Edge weights:
$$e_{58} = D_{58}\, d_{58}, \qquad e_{59} = D_{59}\, d_{59}$$
$$CEF_{\mathrm{out},5} = [D_{58}, D_{59}]\,[d_{58}, d_{59}]^{T} \qquad (4)$$
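As a worked illustration of equations (3) and (4), the following Python sketch evaluates the fan-in and fan-out complexity of hyper vertex A5 as a dot product of data set sizes and depth indices. The depth indices are those stated in the text; the data set sizes in bytes and the function name `external_complexity` are made-up assumptions.

```python
from typing import Sequence


def external_complexity(data_sizes: Sequence[float], depth_indices: Sequence[int]) -> float:
    """Return [D...] . [d...]^T, i.e. the sum of the edge weights D * d of one vertex."""
    if len(data_sizes) != len(depth_indices):
        raise ValueError("one depth index is needed per data set size")
    return sum(D * d for D, d in zip(data_sizes, depth_indices))


# Depth indices for A5 as stated in the text, ordered d15, d25, d35, d05.
FAN_IN_DEPTHS = [1, 1, 1, 2]
# Depth indices of the outgoing edges, ordered d58, d59.
FAN_OUT_DEPTHS = [1, 1]

if __name__ == "__main__":
    # Hypothetical data set sizes in bytes, ordered D15, D25, D35, D05.
    fan_in = external_complexity([100, 200, 300, 50], FAN_IN_DEPTHS)
    # Hypothetical data set sizes in bytes, ordered D58, D59.
    fan_out = external_complexity([400, 150], FAN_OUT_DEPTHS)
    print(fan_in, fan_out)  # 700 550
```

With these numbers the edge from A0 contributes 100 rather than 50 to the fan-in complexity, because its depth index is two.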
Internal Complexity:
The internal communication complexity of the hyper vertex $A_5$ is the communication complexity due to the fan-ins or fan-outs present inside the hyper vertex. Let $CI_5$ represent the overall internal communication complexity, which is either the communication complexity due to the fan-ins ($CIF_{\mathrm{in},5}$) or the communication complexity due to the fan-outs ($CIF_{\mathrm{out},5}$): $CI_5 = CIF_{\mathrm{in},5}$ or $CI_5 = CIF_{\mathrm{out},5}$. The communication complexity of a single vertex (hyper vertex or vertex), in contrast, is the sum of both its fan-in and fan-out components. The vectors in these models are bound to be very long, which gives greater insight into generating highly complex workloads. However, these vector complexity measures can also be given under a desired distribution. Accordingly, one can specify distribution measures for communication and computation complexity in the respective bands of varying levels, assuming a hypothetical, very large hyper graph workload; these distributions can be varied across different bands. The resulting hyper graph workload model will then be given in terms of large vectors for the different bands, corresponding to the given distribution. In this paper, only the vector length is taken into account in generating the hyper graph workload model.
The relative intensity of communication complexity is given in terms of the following cases:
CASE I: Communication complexity is intense: both the depth index and the data set size are large.
CASE II: Communication complexity is medium: the depth index is high and the data set size is small.
CASE III: Communication complexity is medium: the depth index is low and the data set size is large.
CASE IV: Communication complexity is low: both the depth index and the data set size are small.
In high-performance computing system design, it is necessary to analyze the computational structures involved in the various phases of the application and, more importantly, the interaction or dependency across these computational structures. This analysis helps to understand the computation complexity and the communication complexity, the latter being nothing but the dependency across the various computation structures.
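The four cases can be expressed as a small classifier. This is only a sketch: the text does not say where "small" ends and "large" begins, so the thresholds `depth_threshold` and `size_threshold` below are hypothetical parameters that a user would have to choose.

```python
def communication_intensity(depth_index: int, data_set_size: float,
                            depth_threshold: int = 2,
                            size_threshold: float = 1024.0) -> str:
    """Map (depth index, data set size) to the intensity named in CASE I to CASE IV."""
    deep = depth_index >= depth_threshold       # "large"/"high" depth index (assumed cut-off)
    large = data_set_size >= size_threshold     # "large" data set size (assumed cut-off)
    if deep and large:
        return "intense"   # CASE I
    if deep or large:
        return "medium"    # CASE II (deep, small) or CASE III (shallow, large)
    return "low"           # CASE IV


if __name__ == "__main__":
    print(communication_intensity(3, 4096))  # intense
    print(communication_intensity(1, 64))    # low
```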
Control flow Complexity:
Based on the conditional statements, the execution flow takes one of the fan-out paths 1 to n (refer to Fig. 5). The control flow model is defined as a probability vector which decides the path of the execution flow. The control flow vector is $\{p_1, p_2, p_3, \ldots, p_n\}$, where $n$ is the fan-out, greater than 1, and $p_i$ is the probability associated with fan-out path $i$: $p_i \neq p_j \ \forall\, n > 1$, $p_i = p_j \ \forall\, n$ (fan-out). The execution flow follows the path of highest probability. These probabilities are generated under a normal distribution and lie in the interval [0, 1]. The inclusion of this control flow model in the hyper graph describing the workload is shown in Fig. 5.
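A minimal Python sketch of this control flow model follows. The mean and standard deviation of the normal distribution, the clamping of samples to [0, 1], and the names `control_flow_vector` and `select_path` are assumptions, since the text only states that the probabilities are normally distributed and that the most probable path is taken.

```python
import random
from typing import List, Optional


def control_flow_vector(fan_out: int, mean: float = 0.5, stddev: float = 0.2,
                        rng: Optional[random.Random] = None) -> List[float]:
    """Draw one probability per fan-out path from a normal distribution, clamped to [0, 1]."""
    if fan_out < 2:
        raise ValueError("the control flow vector is defined for a fan-out greater than 1")
    rng = rng or random.Random()
    return [min(1.0, max(0.0, rng.gauss(mean, stddev))) for _ in range(fan_out)]


def select_path(probabilities: List[float]) -> int:
    """Return the 1-based index of the fan-out path with the highest probability."""
    return max(range(len(probabilities)), key=probabilities.__getitem__) + 1


if __name__ == "__main__":
    vector = control_flow_vector(4, rng=random.Random(0))
    print(vector, "-> path", select_path(vector))
```

Passing a seeded `random.Random` makes a generated workload reproducible, which matters when the same control flow has to be replayed on different candidate architectures.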
Locality of reference
In order to benchmark the performance of multi-level cache architectures, which is greatly affected by mapping heuristics, the (spatio-temporal) locality of reference of a workload is varied probabilistically. Unconditional and conditional loops are vital in determining the spatio-temporal locality of reference in a large class of workloads. To introduce drastic variation in the locality of reference, the loop (or array) indices cannot be deterministic values but should be specified under an appropriate distribution. This is highlighted in the following loop example:
Let 'a' be the starting address, defined under a random distribution; 'b' be the varying incremental address, under a random distribution; and 'c' be the ending address, also under a random distribution.
Let Za, Zb be the indexed variables.
for ( a; (a + b < c) || (loop_count < condition); b ) do
    Za : f1(a, b);
    Zb : f2(a, b);
    loop_count++;
End
In the above loop, the base index and the increment address are specified under a distribution. This leads to random memory addressing. The indexed variables Zi correspond to a set of expressions such that the variables are changed under the same distribution as the base index. To make things more complex, the functions f1, f2, f3 can be made functions of the loop indices and the loop increment.
The user can customize the loop models by randomly varying the locality of reference and include such a model as an algorithm within the ALGOBANK. When these loop models become hyper-vertices in the hyper graph workload, the communication complexity of these hyper-vertices is calculated.
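The loop model can be sketched in Python as an address-trace generator. Uniform distributions, the wrap-around of addresses into `memory_size`, and the name `random_stride_trace` are assumptions made for illustration; `min_iterations` plays the role of `condition` in the loop above.

```python
import random
from typing import List, Optional


def random_stride_trace(memory_size: int, min_iterations: int, max_stride: int,
                        rng: Optional[random.Random] = None) -> List[int]:
    """Return the addresses touched by a loop whose start, stride and end are random."""
    rng = rng or random.Random()
    a = rng.randrange(0, memory_size // 2)             # starting address 'a'
    c = rng.randrange(memory_size // 2, memory_size)   # ending address 'c'
    b = rng.randint(1, max_stride)                     # incremental address 'b'
    trace: List[int] = []
    loop_count = 0
    # Same continuation test as the loop in the text: either condition keeps it running.
    while (a + b < c) or (loop_count < min_iterations):
        trace.append(a % memory_size)                  # wrap so every access stays in range
        a += b
        b = rng.randint(1, max_stride)                 # the stride itself varies per iteration
        loop_count += 1
    return trace


if __name__ == "__main__":
    print(random_stride_trace(1024, 8, 16, rng=random.Random(1)))
```

A small `max_stride` keeps successive accesses close together (high spatial locality), while a stride larger than a cache line spreads them out, so this single parameter already gives a crude handle on the locality of reference.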
IV. User customizable workload: implementation
A complex workload has to be designed using the various (C3L) models: the computation complexity model, the communication complexity model, the control flow complexity model and the locality of reference model. Such a workload is very comprehensive because the workload generation can be customized to benchmark individual architectural components. As special cases, specific workloads with a computational surge and a communication surge can be generated to stress the high-performance computing system beyond its limit of computational elasticity (beyond this limit, the system shows odd behavior such as a hang). The comprehensiveness and customizability of the workload are explained in the next section on results and analysis.
Fig. 6. Graph-theoretic user customizable workload design process (refer to Fig. 3). The As, Bs and Cs are numeric, semi-numeric and non-numeric algorithms, including general-purpose operations. The constraints are the in-degree and the out-degree (fan-in and fan-out) of the nodes of the graph-theoretic workload model at the inter and intra levels, under user-specified distributions.
The algorithm for generating the graph-theoretic workload is shown below. The complex multiple workloads chosen for ALLDE are described in Fig. 8 to Fig. 9. Fig. 7 shows a schematic of the process.
V. Design Automation: relevant papers
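To make the role of the constraints concrete, the following Python sketch builds a small layered DAG of the kind described. It is not the generation algorithm of this paper: the uniform choice of nodes per level and of fan-in, the labels in `ALGORITHMS`, and the name `generate_workload` are all illustrative assumptions, and only the fan-in is constrained here.

```python
import random
from typing import Dict, List, Optional, Tuple

ALGORITHMS = ["A", "B", "C"]  # placeholders for numeric, semi-numeric, non-numeric algorithms


def generate_workload(levels: int, max_nodes_per_level: int, max_fan_in: int,
                      rng: Optional[random.Random] = None
                      ) -> Tuple[Dict[int, Tuple[int, str]], List[Tuple[int, int]]]:
    """Build a layered DAG: nodes map id -> (level, algorithm); edges point to deeper levels."""
    rng = rng or random.Random()
    nodes: Dict[int, Tuple[int, str]] = {}
    edges: List[Tuple[int, int]] = []
    earlier: List[int] = []   # ids of the nodes on all previous levels
    next_id = 0
    for level in range(levels):
        current: List[int] = []
        for _ in range(rng.randint(1, max_nodes_per_level)):
            nodes[next_id] = (level, rng.choice(ALGORITHMS))
            if earlier:
                # Fan-in is drawn uniformly here; a user-specified distribution could replace it.
                fan_in = rng.randint(1, min(max_fan_in, len(earlier)))
                edges.extend((src, next_id) for src in rng.sample(earlier, fan_in))
            current.append(next_id)
            next_id += 1
        earlier.extend(current)
    return nodes, edges


if __name__ == "__main__":
    nodes, edges = generate_workload(4, 3, 2, rng=random.Random(7))
    print(nodes)
    print(edges)
```

Because every edge runs from an earlier level to a later one, the result is acyclic by construction, and edges that skip levels yield depth indices greater than one.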
One of the earliest research papers on design automation dealing with both architecture and application is by William Rosenbluth [1], which describes a design automation methodology for system architecture with a CAD workflow using the LSI components of those days. Theoretically, the problems of design automation are NP-hard, for example partitioning, module selection, placement, fault detection and wiring in VLSI circuits [2]. With the advancement of nanotechnology and the evolution of heterogeneous many-core architecture, techniques for the automatic generation of application-specific multiprocessors have acquired greater emphasis [3]. In [3], a design flow for such an automatic generation process, inclusive of communication co-processors optimized for the applications, is portrayed. However, this design flow is indifferent to power-performance optimization, its main focus being a shorter design cycle. Interesting research on heterogeneous many-core architecture using only a single ISA, heterogeneous many-core optimization for chip multiprocessors and optimal design space exploration has been reported by Rakesh Kumar et al. [4] [5] [6] [7]. There are fundamental differences between the work of [4] [5] [6] [7] and the work presented in this paper and the companion papers [9, 10] concerning the design of heterogeneous multi-core architecture. The concepts devised in [4] [5] [6] [7] concern the exploration of a single design space (the heterogeneous multi-core architecture design space) for an optimal (power-area) solution, whereas this paper concerns co-design automation for exploring the many-core co-design spaces to build a power-performance optimal heterogeneous many-core architecture. Further, the network architecture, an important component, particularly in heterogeneous multi-core processor design, has not been considered in [4] [5] [6] [7].
Often, there is a strong need to explore a huge design space to arrive at power-performance efficient heterogeneous core architectures. There are a number of design space exploration tools, such as [7], to meet application demands on power-performance and chip-area efficiency and to reduce the design cycle time. Here, software simulation and search algorithms are the essentials. RTL simulation achieves cycle accuracy but is time consuming, while software simulation lacks accuracy. Hence FPGAs are used to perform hardware emulation, which provides cycle-accurate results with reduced simulation time [16]. To speed up the software simulation and the associated search algorithms for exploring the design space, application-specific heterogeneous many-core architecture simulators are developed [17].
VI. MANY-CORE CO-DESIGN AUTOMATION: ZOOMING IN TO THE DESIGN SPACE PROCESSES
There are several research project reports, apart from a number of research papers [], stressing the need to evolve a co-design methodology for designing future high-performance heterogeneous many-core computing systems. Though these project reports and research papers establish a strong necessity to pursue and evolve a co-design methodology, with an obvious emphasis on technology, none of them provides a comprehensive solution.
The two companion papers [9, 10] and this paper together attempt to provide a comprehensive solution for designing heterogeneous many-core architecture, backed up by extensive simulation results [10]. Paper [9] focuses on the binding of the many-core co-design spaces, whereas [10] deals with the simulator and optimization design spaces. This paper unfolds a unique methodology for co-design automation encompassing all the design spaces, probably for the first time.
The complexity of the co-design process is such that it demands a carefully thought-out automation technique to design power-efficient high-performance computing systems. Co-design automation is the complex task of handling the integration of the different design spaces; a higher-level abstraction of it is illustrated in Fig. 1 and detailed further in Fig. 11 and Fig. 12.
An interesting and inherent aspect of this co-design automation, portrayed in Fig. 11, is that all the core and uncore components, including the heterogeneous inter-core networks and the workload mapping, are evolved accounting for their dependencies specific to ALLDE and the parallel programming language. The algorithms meant for the design automation need to be extremely clever to arrive at the desired solution while satisfying the constraints imposed on them during the exploration of the many-core co-design spaces. An efficient process flow for co-design automation needs to be fixed carefully, bearing in mind the dependencies across all these design spaces.
The phases of co-design automation:
The co-design automation flow is depicted in Fig. 11 and Fig. 12, showing the dependencies across the many-core co-design spaces that are conceptualized in [9]. The input phase, the many-core co-design space interaction phase, the results and analysis phase and the output phase (leading to the target architecture) together make it possible to determine the types of cores and their respective counts. Several algorithms involved in this are discussed in the subsequent section. The massive complexity of co-design automation is explicit in Fig. 12. In-depth discussion and analysis among architecture design experts, application experts, optimization and algorithm experts, parallel programming language experts and simulation experts are indispensable prior to the start of the co-design automation process, which is meant for the simultaneous execution of multiple applications without space-time sharing.
VII. Many-Core Formation: core types and respective counts
