ABSTRACT Single-instruction set architecture (ISA) heterogeneous multi-processor architecture is promising for developing multi-processor system-on-chips (MPSoCs). In this architecture, all processors execute the same instruction set, yet exhibit different performance and power behavior, since the processors may have different micro-architectures. Therefore, systems with this architecture combine the ease of developing new functions offered by the homogeneous architecture with the ease of customizing the resource allocation for high energy efficiency offered by the heterogeneous architecture. However, for an MPSoC utilizing the target architecture, a key design issue is how to select the set of processors so that the target system achieves good performance while the cost of the chip is constrained to the expected value. To solve this, we propose a processor allocation method for MPSoCs with single-ISA heterogeneous multi-core architecture. The goal of the proposed method is to automatically synthesize the allocation of cores for the given workload so that the performance is optimized while the resource constraint is met. To the best of our knowledge, this is the first work that tackles the processor allocation problem for MPSoCs with the target architecture. To bring out the best performance of a hardware configuration, the proposed algorithm also synthesizes the software design of task mapping for a selected hardware configuration. The experimental results show that, compared with the homogeneous architecture that uses only the lowest-cost and lowest-performance cores, even if the number of cores is set to the maximum parallelism degree of the target workload, the proposed method achieves up to 8.25% performance improvement among all the cases we evaluated while the area constraint is met.
Compared with the architecture with all high performance but large cores, when the number of cores is also set to the same as the maximum parallelism degree of the target workload, the proposed method has at most 11.5% of performance degradation, while the area cost is reduced by 60.7%.
I. INTRODUCTION
System-on-Chips (SoCs) that are commonly utilized in consumer electronics enter the multi-core architecture era due to the increasing number of functions that a device can support. Multi-core SoCs (MPSoCs) utilized in consumer electronics usually have low-power, high energy efficiency and short time-to-market requirements. Since the number of hardware modules and application kernels in a mobile device increases, the complexity of a system also increases. Therefore, it is very important to have an underlying architecture that is easy for system designers to find a design that achieves these requirements. Modern MPSoCs utilize either heterogeneous or homogeneous architecture. Heterogeneous architecture has the advantage of customizing hardware modules according to the target system's needs to achieve high energy efficiency.
However, due to the high design complexity, developers also need long design times to find good designs even with various design automation tools. On the other hand, homogeneous architecture has the advantage of short time-to-market since developing new functionalities can be achieved by writing software applications. Nevertheless, energy-efficiency of homogeneous architecture is worse compared to that of the heterogeneous one. To achieve both energy-efficiency and short time-to-market, Single-ISA (Instruction Set Architecture) heterogeneous architecture [1] has been proposed, and is considered as one emerging underlying architecture for MPSoCs.
In this architecture, all processors are from the same series. The reason is that processors of the same series implement the same instruction set architecture, yet different generations of cores have different execution capabilities due to their different micro-architecture designs, e.g. different numbers of pipeline stages or different accelerators for special computations. When executing the same task, cores with high computation capabilities achieve lower latency than cores with a simple micro-architecture. However, high computation capabilities also come with high resource or area cost. For example, under the 45nm technology, both Cortex-A9 and Cortex-A8 use the ARMv7-A ISA. However, Cortex-A9 has a clock rate of 1000MHz, and a single core takes about 4.068 mm² when the cache is not considered, while Cortex-A8 has a clock frequency of 800MHz, and each core takes only 2.709 mm² when no cache is included [2], [3].
Therefore, for MPSoCs that utilize the single-ISA heterogeneous architecture, one important design issue is how to get the most performance out of a limited area budget. That is, under a given area constraint, which micro-architecture types, and how many cores of each type, should be allocated so that the performance of the target workloads is optimized while the resource constraint is met. With limited resources, the allocation of cores greatly affects the maximum parallelism that the system can support and the critical path execution time that the system can achieve. As shown in Fig. 1(a), if we choose simple and small cores over complex cores, many tasks can be executed in parallel, but the execution times of a single task and of the critical path are longer than on a complex and large core. When a system allocates large and complex cores, as shown in Fig. 1(b), the number of tasks that can be executed in parallel is reduced, but the execution times of a single task and of the critical path are also shortened. So, to maximize performance with limited resources, the allocation of processors should be carefully designed according to the needs of the target workload, such as the maximum parallelism degree of the workload and the need for complex cores to shorten the execution time of a single task. To find an allocation of processors for a given workload and resource constraint, one simple method is to exhaustively try all possible solutions and select the best one. However, in modern MPSoCs, the number of cores that can be allocated in one chip may vary from a few to tens, or even hundreds. To bring out the best performance of each hardware configuration, the software design, e.g. task mapping, should also be designed accordingly. For modern systems, it is very common to have tens to hundreds of kernel functions in the target workloads.
Therefore, from the above discussions, we can see that the solution space grows exponentially with the number of cores and tasks in the target workload. Exhaustively trying all possible solutions is very time-consuming and impractical. Therefore, a design automation method that is able to provide viable solutions in a reasonable time for system designers to start with would be helpful to shorten the design time.
In this paper, we propose a design automation method for processor allocation of MPSoCs with single-ISA heterogeneous multi-processor architecture. The goal of the proposed method is to decide which types of cores, i.e. which kinds of micro-architecture, and how many cores of each type are to be allocated in the system, so that the performance of the target workload is maximized while the given area constraint is met. To the best of our knowledge, this is the first work that tackles the processor allocation problem for MPSoCs with the single-ISA heterogeneous multi-core architecture. As mentioned earlier, since the solution space of our processor allocation problem grows exponentially with the number of cores and tasks of the target workloads, it is very difficult to find the optimal solution in a reasonable time. To solve this, we propose a greedy heuristic that finds a viable solution in a reasonable time. We verify the proposed algorithm on a set of synthetic and real-world workloads. When utilizing various generations of ARM cores [2], [3], the experimental results show that, compared to the naive configuration of all simple cores, where the number of cores is set to the maximum parallelism degree of the target workload, the configuration suggested by the proposed method achieves up to 8.25% performance improvement under the same SoC area constraint. Compared to the architecture with all high-performance but large cores, when the number of cores is likewise set to the maximum parallelism degree of the target workload, the proposed method has at most 11.5% performance degradation, while the area cost is reduced by 60.7%.
The rest of this paper is organized as follows. System specifications and formal problem formulation of the proposed algorithm are described in Section II. The proposed algorithm is presented in Section III. The experimental results are discussed in Section IV. We briefly review the related works in Section V. Finally, Section VI concludes this paper.
II. SYSTEM MODELS AND PROBLEM FORMULATION
In this section, we present the data structures and models that we utilize to represent the software behavior and hardware configuration of the target system. With the software and hardware models, we then formulate the synthesis problem discussed in this paper formally. All definitions of the variables utilized in the software and hardware models, and the synthesis targets, are listed in Table 1.
A. HARDWARE MODEL
The hardware model represents the hardware architecture of the target system, which is mainly composed of processors that execute the same ISA with various micro-architectures. Therefore, we use the set C = {c_0, c_1, ..., c_i} to denote the set of core types that execute the same ISA but with various micro-architecture designs. Each core type c_i ∈ C has three attributes, T(c_i), F(c_i), and A(c_i), which respectively represent the number of reference CPU cycles, the clock frequency, and the area cost of c_i under the reference technology. Among all c_i ∈ C, we select the type with the shortest clock cycle time and take its clock cycle time as the reference clock cycle time. For example, if core c_i has the highest clock frequency of 2GHz, we set c_i as the reference core, and T(c_i) is set to one. If another core type c_x has a clock frequency of 1GHz, its T(c_x) would be two cycles.
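As an illustration of the hardware model, the following Python sketch derives T(c) from the clock frequencies. The names CoreType and reference_cycles are hypothetical and not part of the proposed method. The 2GHz/1GHz pair reproduces the example above with placeholder areas, and the Cortex-A9 and Cortex-A8 figures are the 45nm numbers quoted in Section I.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CoreType:
    name: str
    freq_mhz: float  # F(c): clock frequency
    area_mm2: float  # A(c): area cost under the reference technology


def reference_cycles(core, library):
    """T(c): length of one clock cycle of `core`, in reference clock cycles.

    The reference cycle is that of the fastest core type in the library.
    """
    f_ref = max(c.freq_mhz for c in library)
    return f_ref / core.freq_mhz


# Example from the text: a 2GHz reference core and a 1GHz core.
# The area values here are placeholders, not measured numbers.
fast = CoreType("c_i", 2000.0, 1.0)
slow = CoreType("c_x", 1000.0, 1.0)
print(reference_cycles(fast, [fast, slow]))  # 1.0
print(reference_cycles(slow, [fast, slow]))  # 2.0

# 45nm figures quoted in Section I (caches excluded).
a9 = CoreType("Cortex-A9", 1000.0, 4.068)
a8 = CoreType("Cortex-A8", 800.0, 2.709)
print(reference_cycles(a8, [a9, a8]))  # 1.25
```

Under this reading, one Cortex-A8 cycle corresponds to 1.25 reference cycles when Cortex-A9 is the reference core.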
B. SOFTWARE MODEL
The software model is used to capture the target workload's behavior, including the execution flow among kernel functions, the maximum parallelism degree of the workloads, and data sharing among kernel functions.
We use a task and data flow graph, which is a directed graph, to represent the workload behavior. In a task and data flow graph G = (V, E), V is the set of vertices that represent the kernel functions of the workload, and E is the set of directed edges that represent data flows among tasks. Every edge e_i ∈ E is associated with d_{e_i}, the ID of the data block that is transferred over the edge. We follow the definition of data block in [4], where a data block is a collection of scalars or arrays. We use the set D to denote the set of data blocks accessed in the target workload, and each d_i ∈ D is associated with size(d_i) to indicate the size (in bits) of d_i. Each v_i ∈ V has a field t_{v_i} that records the number of reference CPU cycles for executing v_i on the slowest core. In addition to its data block ID d_{e_i}, every e_i ∈ E is associated with two attributes, parent(e_i) and child(e_i), that indicate its parent node and child node, respectively. Note that parent(e_i) = ∅ or child(e_i) = ∅ indicates that data block d_{e_i} is stored to memory or retrieved from memory.
C. PROBLEM FORMULATION
With the given software and hardware models, we formulate the target synthesis problem as follows.
Given: Task graphs G, data block library D, PE library C, area constraint Area. Area is given by the maximum area cost that a designer would like under the expected technology.
Synthesis target: The proposed method synthesizes the hardware configuration of processor allocation and the software configuration of task mapping. Detailed descriptions of the two synthesis targets are listed below.
• Processor allocation: For each core type c_i ∈ C, we have to decide the number of cores of type c_i to be allocated in the target system. We use NA(c_i) to indicate the number of cores of type c_i allocated in the system. To meet the area constraint, the total area cost
$$\sum_{c_i \in C} NA(c_i) \times A(c_i)$$
should be no more than Area. We also use the set P to indicate the set of processors allocated in the target system, where each p_i ∈ P has id(p_i) and type(p_i) to indicate its ID and core type, respectively.
• Task mapping: For every v_i ∈ V, the proposed method decides which processor in P should execute v_i. We use the function θ : V → P to denote this mapping.
Goal: Find the processor allocation and task mapping such that the latency of the critical path of the target workload is minimized while the total area cost is no more than Area.
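The two synthesis targets and the area constraint can be made concrete with a short Python sketch. The helper names total_area and build_processors, the allocation NA, and the mapping theta are all hypothetical illustrations. The core areas are the 45nm figures quoted in Section I, and the 20 mm² limit is the constraint used in Section IV.

```python
def total_area(na, area):
    """Return the sum over core types of NA(c) * A(c), in mm^2."""
    return sum(count * area[ctype] for ctype, count in na.items())


def build_processors(na):
    """Expand NA into the processor set P as a list of (id, type) pairs."""
    types = [ctype for ctype, count in na.items() for _ in range(count)]
    return list(enumerate(types))


# A(c) at 45nm as quoted in Section I (caches excluded).
AREA = {"Cortex-A9": 4.068, "Cortex-A8": 2.709}
AREA_LIMIT = 20.0  # mm^2, the constraint used in Section IV

na = {"Cortex-A9": 2, "Cortex-A8": 3}  # hypothetical allocation
processors = build_processors(na)
theta = {"v0": 0, "v1": 0, "v2": 3}  # hypothetical mapping: task -> processor id

print(round(total_area(na, AREA), 3))                    # 16.263
print(total_area(na, AREA) <= AREA_LIMIT)                # True
print(total_area({"Cortex-A9": 5}, AREA) <= AREA_LIMIT)  # False: 20.34 mm^2
print(processors[theta["v2"]])                           # (3, 'Cortex-A8')
```

With these figures, five Cortex-A9 cores alone already exceed the 20 mm² budget, which is why mixing core types matters under this constraint.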
III. PROCESSOR ALLOCATION ALGORITHM FOR MPSoCs WITH SINGLE-ISA HETEROGENEOUS ARCHITECTURE
In this section, we present the proposed synthesis method. As mentioned in Section I, the solution space of our synthesis problem grows exponentially with the number of cores and tasks in the target workload. We find that even a simplified version of the problem is NP-complete. That is, if we ignore task mapping and consider only the performance value and area cost of each core type, the problem of allocating processors under an area constraint is equivalent to the Knapsack problem, which is known to be NP-complete. In our synthesis problem, the performance value of a selected processor cannot be judged by its clock frequency alone, since the real performance of a selected processor should be evaluated with tasks running on it. Therefore, based on the above discussion, the target problem is at least as hard as this simplified version, i.e. it is NP-hard. To find a viable solution for our synthesis problem in a reasonable time, we propose a heuristic-based method. The idea of the proposed heuristic is to reduce the solution space by reducing the number of possible task mappings. This is achieved by first finding the sets of tasks that should be allocated to the same processor so that the on-chip traffic is minimized. Then, for each set of tasks, the proposed heuristic decides the type of core to execute the task set. The flow of the algorithm is shown in Fig. 2. Tasks on the same execution path can only be executed serially, and mapping serially executed tasks to different cores cannot gain the performance advantage brought by parallel execution. Moreover, tasks on the same execution path share data among them. Allocating these tasks to the same processor also reduces the need for on-chip data transfers. Therefore, as shown in Fig. 2, we first partition tasks into groups, where each group contains tasks on the same execution path and is considered as the unit for task mapping.
With this grouping step, we can reasonably and effectively reduce the solution space in task mapping. Next, we perform processor allocation under resource constraint based on the paths we derived in the first step. The details of the two steps are presented in Section III-A and Section III-B, respectively. The time complexity of the proposed heuristic is discussed in Section III-C.
A. FORMING TASK GROUPS BASED ON EXECUTION PATHS
As mentioned earlier, tasks on the same execution path should be executed in serial, and mapping tasks on the same path to the same processor also has the advantage of reducing the time of transferring shared data among various processors. Therefore, in this step, we greedily group tasks that are on the same path and yield the most shared data on the path.
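The greedy grouping of this step can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name form_task_groups is hypothetical, the workload is assumed to be a DAG with at most one edge per task pair, "closest to the root" is taken as the breadth-first distance from any root, and ties are broken by input order.

```python
from collections import deque


def form_task_groups(nodes, edges):
    """Greedily group tasks along execution paths.

    nodes: list of task ids.
    edges: dict mapping (parent, child) to the shared data size in bits.
    """
    children = {v: [] for v in nodes}
    has_parent = set()
    for parent, child in edges:
        children[parent].append(child)
        has_parent.add(child)

    # Breadth-first distance from the roots (assumes a DAG with >= 1 root).
    depth = {v: 0 for v in nodes if v not in has_parent}
    queue = deque(depth)
    while queue:
        v = queue.popleft()
        for c in children[v]:
            if c not in depth:
                depth[c] = depth[v] + 1
                queue.append(c)

    assigned, groups = set(), []
    # sorted() is stable, so ties keep the input order of `nodes`.
    for start in sorted(nodes, key=lambda v: depth[v]):
        if start in assigned:
            continue
        group, curr = [start], start
        assigned.add(start)
        while True:
            candidates = [c for c in children[curr] if c not in assigned]
            if not candidates:
                break
            # Follow the unassigned child that shares the most data.
            nxt = max(candidates, key=lambda c: edges[(curr, c)])
            group.append(nxt)
            assigned.add(nxt)
            curr = nxt
        groups.append(group)
    return groups


# Hypothetical five-task workload; edge weights are shared bits.
nodes = ["a", "b", "c", "d", "e"]
edges = {("a", "b"): 64, ("a", "c"): 128, ("b", "d"): 32,
         ("c", "d"): 256, ("d", "e"): 16}
print(form_task_groups(nodes, edges))  # [['a', 'c', 'd', 'e'], ['b']]
```

In this example the path a, c, d, e carries the largest shared data at every step, so task b is left in a group of its own.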
To achieve this, the process starts from the node that is closest to the root and is not yet assigned to any group. This node is placed in a new group and set as the current node. Then, the child of the current node that shares the most data with the current node and is not yet assigned to any group is added to the current node's task group. The process then sets the current node to the newly added node, and repeats until the current node has no child node. The whole process repeats until all nodes are assigned to groups. Once the task groups are decided, we can estimate the path execution time on the reference processor, i.e. the processor with the lowest clock frequency, since the task execution times (in reference clock cycles) and the amount of data transferred among tasks are known. The pseudo code of forming task groups is shown in Algorithm 1.
Algorithm 1 Forming Task Groups Based on Execution Paths
  repeat
    create a new group starting from v_i that has no parent node, or whose parent nodes are all selected into groups; set curr_v = v_i
    repeat
      for all children of curr_v that are not yet assigned to any group
        find the child node v_child with the most shared data with curr_v
      add v_child to the group; set curr_v = v_child
    until curr_v has no child node
  until all nodes are assigned to groups
B. PROCESSOR ALLOCATION
The processor allocation step decides the exact allocation of processors for the target system. In addition to physical processor allocation, this step also decides the mapping of the paths formed in the first step, described in Section III-A, to the allocated cores. The goal of the proposed processor allocation method is to find the set of processors that optimizes the target system's performance, i.e., that minimizes the execution latency of the path with the longest execution time as much as possible.
The proposed processor allocation method is composed of two phases. In the first phase, each path is mapped to an individual core first. Since some paths may have disjoint execution times, or have very short execution latencies, merging these paths to the same processor may not affect overall workload execution time. So, in the second phase, the method merges such paths to the same processor so that the resource originally occupied by each path can be released. The resources released from merging paths can be utilized for the critical path to use a processor that is larger and more efficient than its current one, and to further reduce the execution time of the critical path. Details of the two phases are described in Section III-B1 and Section III-B2, respectively.
1) PHASE 1: ALLOCATING PROCESSOR FOR EACH PATH
Since the goal of processor allocation is minimizing the path execution time, the idea of this phase is greedily mapping paths with long execution times to high-performance cores as long as the remaining resource is sufficient. Therefore, starting from the path with the longest execution time, we greedily allocate the core with the highest performance to each of the path when the unused area resource is sufficient.
When the remaining area is not enough to allocate the core with the highest performance, we need to consider replacing an already allocated high-performance core with a smaller one so that more area is released for the paths that are not yet mapped to any core. To minimize the performance degradation induced by changing to a smaller core, we would like to find the path that suffers the least performance degradation when mapped to a smaller processor. To quantify the performance gain a path obtains when mapped to a certain core, we develop a metric called the K value, which represents the performance gain per unit area cost obtained by allocating a selected processor. For a path path_i that is mapped to processor p_k, its K(path_i, p_k) value is defined as follows.
where t(path_i, p_area_ref) and t(path_i, p_k) indicate the path execution times of mapping path_i to processors p_area_ref and p_k, respectively. p_area_ref indicates a processor whose type is the area reference core, i.e. the core with the lowest area cost and performance. Therefore, the path with the smallest K value is the path that gains the least performance from being allocated a core bigger than the reference core, and moving this path to a small and low-performance core would hurt system performance the least. Assume the path with the smallest K value is path_sk and it is mapped to core p_biggest, and the path to be mapped is path_curr.
Algorithm 2 Allocating Processor for Each Path
  repeat
    starting from the path path_i with the longest execution latency
    if the biggest core fits, i.e. A(c_0) <= a then
      map path_i to processor p_i with type c_0
      NA(c_0)++; P = P ∪ {p_i}; type(p_i) = c_0
      a -= A(c_0)
    else
      find path_sk that is mapped to a core and has the smallest K
      if K(path_i, c_0) > K(path_sk, c_0) then
        map path_i to p_sk
        find the largest c_x that meets A(c_x) < a
        ...
    end if
  until all paths in PATH are mapped to a processor
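To make the role of the K value concrete, the Python sketch below picks the downgrade candidate among mapped paths. It rests on two loudly stated assumptions: the form of K used here (latency saved relative to the area reference core, divided by the area of the selected core) is only one plausible reading of "performance gain per unit area cost" and is not the authors' definition, and the path execution time is simply cycles divided by clock frequency, with data transfers ignored. All function names and path sizes are hypothetical.

```python
def path_time_us(cycles, freq_mhz):
    """Execution time in microseconds, ignoring data transfers (assumption)."""
    return cycles / freq_mhz


def k_value(cycles, core, ref_core):
    """ASSUMED form of K: latency saved versus the area reference core,
    divided by the area of the selected core."""
    saved = (path_time_us(cycles, ref_core["freq_mhz"])
             - path_time_us(cycles, core["freq_mhz"]))
    return saved / core["area_mm2"]


def downgrade_candidate(mapped, cores, ref_core):
    """Return the mapped path with the smallest K value.

    mapped: dict path name -> (cycles, core type name).
    """
    return min(
        mapped,
        key=lambda p: k_value(mapped[p][0], cores[mapped[p][1]], ref_core),
    )


# 45nm figures quoted in Section I; Cortex-A8 plays the area reference core.
cores = {
    "Cortex-A9": {"freq_mhz": 1000.0, "area_mm2": 4.068},
    "Cortex-A8": {"freq_mhz": 800.0, "area_mm2": 2.709},
}
ref_core = cores["Cortex-A8"]

# Hypothetical paths, both currently mapped to the biggest core.
mapped = {"path_0": (8000, "Cortex-A9"), "path_1": (4000, "Cortex-A9")}

for name, (cycles, ctype) in mapped.items():
    print(name, round(k_value(cycles, cores[ctype], ref_core), 3))
# path_0 0.492
# path_1 0.246
print(downgrade_candidate(mapped, cores, ref_core))  # path_1
```

Under these assumptions the shorter path gains less from the big core per unit area, so it is the one to move to a smaller core when area runs short.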
2) PHASE 2: MERGING PATHS FOR ALLOCATING HIGH-PERFORMANCE CORES FOR CRITICAL PATHS
As mentioned earlier, the goal of this step is to combine paths with short execution latencies or disjoint execution intervals on the same processor. In this way, the overall execution time of the workload is unaffected, while resources are released for the critical paths to utilize a larger and higher-performance core to improve the overall performance.
Starting from the path with the least execution latency, called path_s, we try to merge the path with all other paths in the system. Assume the two paths to be merged are path_s and path_o, and they are originally mapped to p_s and p_o, respectively. The overall workload execution times of merging path_s and path_o on p_s and on p_o are evaluated separately. The paths are merged onto the smallest processor that yields an overall workload execution time no longer than that of the original configuration. If merging path_s and path_o always prolongs the workload execution time, we keep the original setting. If path_s and path_o are merged onto the same processor, we consider that a new path composed of path_s and path_o is generated. The area released by the merged paths is utilized by the critical path, i.e. the path with the longest execution time, to trade for a larger and higher-performance core than its current one. The process repeats until no proper merging can be found, or the critical paths are all mapped to the largest and highest-performance cores. The pseudo code of this phase is shown in Algorithm 3.
Algorithm 3 Merging Paths for Allocating High-Performance Cores for Critical Paths
  ...
      map path_l with the longest execution time to a larger core if the area is sufficient
    end if
  until all paths are tried
until the critical path is on the largest core
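A single merge-and-upgrade step of this phase can be sketched in Python as below. This is a toy model under stated assumptions rather than the authors' evaluation procedure: each path is an independent job with a ready time and a cycle count, a processor runs its paths serially in order of ready time, and dependencies and data transfers between paths are ignored. The names makespan and try_merge and the three example paths are hypothetical; the core figures are the 45nm numbers quoted in Section I.

```python
FREQ_MHZ = {"Cortex-A9": 1000.0, "Cortex-A8": 800.0}
AREA_MM2 = {"Cortex-A9": 4.068, "Cortex-A8": 2.709}


def makespan(paths, groups):
    """Toy workload execution time: each processor runs its paths serially
    in order of ready time; different processors run in parallel."""
    end = 0.0
    for core, members in groups:
        t = 0.0
        for name in sorted(members, key=lambda n: paths[n][0]):
            ready_us, cycles = paths[name]
            t = max(t, ready_us) + cycles / FREQ_MHZ[core]
        end = max(end, t)
    return end


def try_merge(paths, groups, i, j):
    """Merge groups i and j onto the smaller of their two processors that
    does not prolong the workload; return the new groups, or None."""
    base = makespan(paths, groups)
    (core_i, members_i), (core_j, members_j) = groups[i], groups[j]
    rest = [g for k, g in enumerate(groups) if k not in (i, j)]
    for core in sorted({core_i, core_j}, key=AREA_MM2.get):
        candidate = rest + [(core, members_i + members_j)]
        if makespan(paths, candidate) <= base:
            return candidate
    return None


# Hypothetical paths: name -> (ready time in us, cycles).
paths = {"crit": (0.0, 10000), "early": (0.0, 1600), "late": (5.0, 2400)}
groups = [("Cortex-A8", ["crit"]), ("Cortex-A8", ["early"]),
          ("Cortex-A8", ["late"])]

before = makespan(paths, groups)
merged = try_merge(paths, groups, 1, 2)  # start from the shortest path
released = AREA_MM2["Cortex-A8"]         # one Cortex-A8 is no longer needed

# Spend the released area on a larger core for the critical path.
extra = AREA_MM2["Cortex-A9"] - AREA_MM2["Cortex-A8"]
if merged is not None and extra <= released:
    merged[0] = ("Cortex-A9", merged[0][1])
    print(before, makespan(paths, merged))  # 12.5 10.0
```

Here the two short paths have disjoint execution intervals, so sharing one Cortex-A8 leaves the workload time unchanged, and the released area is enough to move the critical path to a Cortex-A9.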
C. TIME COMPLEXITY ANALYSIS
The time complexity of the proposed method is discussed in this section. The major steps of the proposed method are task group formation described in Section III-A, phase 1 of the processor allocation step described in Section III-B1, and phase 2 of the processor allocation step described in Section III-B2.
For the task group formation described in Section III-A, the time complexity of forming task groups is O(|V|) since every vertex is visited once. Evaluating the path execution times needs at most O(|V| + |E|) since all vertices and edges should be visited to estimate the task execution times and data transfer times. For phase 1 of the processor allocation step, sorting the paths according to their execution times would need O(|V|^2) in the worst case. Since the evaluation of the K value for each path needs only constant time, there would be at most O(|V|) evaluations of the K value. For phase 2 of the processor allocation step, the major loop would be executed O(|V|^2) times in the worst case. In each iteration, the workload execution time should be evaluated with the new configuration, where each evaluation takes O(|V| + |E|) time. Therefore, the time complexity of phase 2 of the processor allocation step would be O(|V|^3 + |V|^2 |E|), which dominates the time complexity of the whole synthesis process.
IV. EXPERIMENTAL RESULTS

A. EXPERIMENTAL SETUP
The proposed synthesis method is evaluated by a set of synthetic benchmarks generated by the graph generator TGFF [5] and a workload generated from real applications. For the synthetic workloads, we generate four random task graphs. All task graphs have maximum parallelism of eight. The number of vertexes range from 26 to 51. The number of edges ranges from 24 to 53. The priority of a task is given according to its distance to the root node. Tasks that are closer to the root node have higher priorities. The priorities of tasks that have the same distance to their root nodes are randomly assigned. In addition to synthetic task sets, we also evaluate the proposed method by a set of real-world applications, CRC+JPEG, which is the mix of consumer and telecommunication benchmark suites from Embedded System Synthesis Benchmark Suites (E3S) [6] . E3S is a collection of task graphs which are built from the Embedded Microprocessor Benchmark Consortium (EEMBC) benchmark suites [7] . CRC+JPEG has 15 tasks and 14 edges.
For the cores utilized in our evaluations, we assume the core library is composed of Cortex-A9, Cortex-A8 and Cortex-A7 from the ARM series. When using 45nm technology, the area cost and clock frequency of each of the core is shown in Table 2 , where the numbers of Cortex-A9 and Cortex-A8 are obtained from [2] . For the numbers of Cortex-A7, we can only find the numbers under the 28nm technology, and we obtain the 45nm numbers by scaling the area and clock frequency according to the transistor size and the performance projections of various technologies in [8] .
In our experiments, we set the area constraint to 20 mm². We implement the proposed method in C++, and run the processor allocation method on a host machine with an Intel Core i5-2400 at 1.3GHz clock frequency and 8GB DDR3-1333MHz main memory.
B. ANALYSIS OF THE RESULTS
In this section, we analyze the synthesis results of the proposed processor allocation method. Fig. 3 shows the performance of various workloads under the same area constraint. The ''all small core'' and ''our method'' bars indicate the results of assigning all smallest cores and the configuration synthesized by our method, respectively. In this set of experiments, all the performance results are normalized to that of the all-small-core configuration. The experimental results show that, although the all-small-core configuration provides the most execution parallelism, our method still achieves better performance. For the synthetic workload TG4, our method achieves 8.25% performance improvement over the all-small-core configuration that uses eight small cores. For TG4, our method suggests two largest cores, three middle cores and two small cores. For the real-world application workload, our method still achieves 0.6% performance improvement over the all-small-core configuration. For all workloads utilized in this set of experiments, our method shows better results than the all-small-core configuration. We observe that, for these workloads, only a few paths have execution times close to that of the critical path. This indicates that these workloads require only a few large and high-performance cores to reduce the execution times of long paths, rather than many small cores to increase execution parallelism.
FIGURE 4. Comparisons of performance and area cost of using all big cores and the processor allocation synthesized by our method.
Fig. 4 shows the performance and area reduction achieved by our method compared to the system with all big and high-performance cores. All the performance results and area costs are normalized to those of the all-big-core configuration, which is marked by ''all big core'' in the figure. In this set of experiments, the number of big cores allocated in the system is decided according to the maximum parallelism degree of each workload. For our synthesis method, the area constraint is still set to 20 mm². We can observe that, even with much less resource than the all-big-core configuration, our method has at most 11.5% performance degradation, while the area reduction is as much as 60.7%. For the real-world application, the performance results are almost the same, while the area cost is largely reduced to 61.4% of the all-big-core configuration. This set of results shows that, even under limited resources, the proposed method can still achieve performance close to the full-blown hardware configuration.
Table 3 shows the time to synthesize a configuration under various workload sizes. We can see that, even for a task graph that has 51 tasks and 47 edges, the proposed method needs less than one second to synthesize a configuration for system designers to start with. This shows that the proposed method can help system designers greatly shorten the time to find candidate designs.
V. RELATED WORK
In this section, we first discuss related works about the architectural and OS designs for systems with the Single-ISA heterogeneous multi-core architecture [1] , [9] , [10] . Due to the performance heterogeneity, properly scheduling tasks executed on cores with various computation capabilities is also an important design issue, and several works have been proposed to tackle this problem [11] - [13] .
In [1] , Kumar et al. first propose the concept of single-ISA heterogeneous multi-core architecture to achieve high energy-efficiency and short development time in the multicore era. To effectively manage the computation resources of the target architecture, Li et al. [9] present a comprehensive study of OS supports for heterogeneous architectures in which cores have asymmetric performance and overlapping. Souza et al. [10] show how one can effectively use a regular fabric to provide a number of different possible heterogeneous configurations while still sustaining the same ISA.
For the single-ISA heterogeneous architecture, Xu et al. [11] observed that the architecture is well suited to achieving high energy efficiency. However, to achieve high energy efficiency, it is important that the OS has an asymmetry-aware scheduling method. Therefore, they propose to utilize offline analysis, which is able to obtain the exact initial task-to-processor mapping, and has more accurate information than online methods to achieve higher energy efficiency. Chen and John [12] proposed a program scheduling method that projects the core's configuration and the program's resource demand to a unified multi-dimensional space. Then, the method uses the weighted Euclidean distance between these two to guide the program scheduling. Observing that a program usually executes in various phases, Sawalha et al. [13] propose to schedule threads executing on the target architecture based on the correlation between program phases and the performance of those phases on any particular core type. The obtained correlation is then used to drive appropriate scheduling decisions. For a system that uses the processor configuration derived from the processor allocation method proposed in this paper, all the scheduling methods discussed above can be utilized in the system.
VI. CONCLUSION
We proposed the first processor allocation method for MPSoCs with single-ISA heterogeneous multi-core architecture, which is considered to be a promising platform for developing MPSoCs. The goal of the proposed method is to find a proper processor allocation and task mapping configuration such that the target workload's execution time is optimized while the given resource/area constraint is met. Since the solution space of the target synthesis problem grows exponentially with the number of tasks in the target workload and the number of cores, we proposed a heuristic-based method that first decides the groups of tasks that should be mapped to the same processor, and then performs processor allocation and task mapping. The experimental results show that the proposed method effectively reduced the solution space, and synthesized a good-quality configuration in a reasonable time. Under the same area constraint, the results synthesized by our method achieved up to 8.25% performance improvement over the system with all simple cores, which provides the maximum execution parallelism in the system. The results also showed that the proposed method synthesized a configuration for a workload with up to 36 tasks and 53 edges within one second.
YI-JUNG CHEN received the B.S. and M.S. degrees from National Chi Nan University, Taiwan, in 2000 and 2002, respectively, and the Ph.D. degree from National Taiwan University, Taiwan, in 2010. She is currently an Assistant Professor with the Department of Computer Science and Information Engineering, National Chi Nan University. Her research interests include memory system design and system-level synthesis for multi-core architecture.
WEN-WEI CHANG received the B.S. degree from the National Taichung University of Education, Taichung, Taiwan, in 2014. He is currently pursuing the master's degree with the Department of Computer Science and Information Engineering, National Chi Nan University. His research interests include system-level synthesis and memory system synthesis for single-ISA heterogeneous multi-core architecture.
CHIA-YIN LIU is currently pursuing the master's degree with the Department of Computer Science and Information Engineering, National Chi Nan University. Her research interests include thermal-aware system-level synthesis and memory system design for MPSoCs with 3-D-stacked memories.
CHENG-EN WU received the B.S. degree from Providence University, Taichung, Taiwan, in 2014. He is currently pursuing the master's degree with the Department of Computer Science and Information Engineering, National Chi Nan University. His research interests include thermal-aware system-level synthesis and memory system design for MPSoCs with 3-D-stacked memories.
BO-YUAN CHEN received the B.S. degree from Providence University, Taichung, Taiwan, in 2015. He is currently pursuing the master's degree with the Department of Computer Science and Information Engineering, National Chi Nan University. His research interests include memory-aware system-level synthesis for MPSoCs with embedded FPGAs.
MING-YING TSAI received the B.S. degree from National Chi Nan University, Nantou, Taiwan, in 2016, where she is currently pursuing the master's degree with the Department of Computer Science and Information Engineering. Her research interests include power-aware system design for single-ISA heterogeneous multi-core architecture.
