ABSTRACT
1. INTRODUCTION
Microprocessors are at the core of high performance computing systems, but they provide flexible computing at the expense of performance [1]. An Application Specific Integrated Circuit (ASIC) supports fixed functionality and superior performance for an application, but it restricts the flexibility of the architecture. A newer computing paradigm, Reconfigurable Systems (RS) [2], promises greater flexibility without compromising performance. Complex applications such as MIMO, OFDM and image processing are therefore accelerated by reconfigurable architectures, which achieve higher performance by reducing the instruction fetch, decode and execute bottleneck [1] [2] [3]. An RS allows custom digital circuits to be configured dynamically and modified via software. This ability to create and modify digital logic circuits without physically altering the hardware provides a more flexible and lower cost solution for real time applications. Dynamic reconfiguration of an application is enabled by the availability of high density programmable logic chips called Field Programmable Gate Arrays (FPGAs). High Speed Computing Systems (HSCS) should therefore have one or more resources of this kind (Reconfigurable System on Chip (RSoC) [15], MOLEN architecture [26]) as Processing Elements (PEs) to enhance the speed of real time applications. The computing platforms described in [26] [27] [28] [29] integrate similar resources through a high speed network to support the execution of parallel applications and are called Homogeneous Computing Systems. The efficiency of a homogeneous computing system critically depends on the methods used in [22] [23] [24] [26] to schedule the tasks of parallel applications. On the other hand, a diverse set of resources interconnected with a high speed network provides a new computing platform [20] [21] called a Heterogeneous Computing System, which can support the execution of computationally intensive parallel and distributed applications.
An emerging computing platform that integrates an array of programmable logic resources and soft core processors on a single chip [15] [16] [17] [18] is called a Heterogeneous Reconfigurable Computing System (HRCS). The HRCS platform is an emerging research paradigm that offers cost effective solutions for computationally intensive applications through hardware reuse, and many multimedia applications [2] have been accelerated by HRCS.
In real time, the tasks of a parallel application must share the resources of an HSCS effectively in order to enhance the execution speed of the application, and this can be achieved through an effective scheduling mechanism. Many researchers have presented techniques for mapping multiple tasks to HSCS with the aims of "minimizing the execution time of an application" and "efficient utilization of resources". In this paper, we review various existing scheduling methodologies for HSCS. Task scheduling models are basically of two types, static and dynamic. Static scheduling: all information needed for scheduling, such as the structure of the parallel application, the execution time of individual tasks and the communication cost between tasks, must be known in advance [10] [12] [13] [14]. Dynamic scheduling: the scheduling decisions are made at runtime, and the aim is not only to reduce the execution time but also to minimize the communication overheads [8] [20] [24] [26]. Static and dynamic scheduling heuristics for HSCS are reviewed in the following chapters. In general, scheduling heuristics are classified into four categories: list scheduling algorithms [20], clustering algorithms [11], duplication algorithms [22] and genetic algorithms. Among them, list scheduling algorithms provide schedules of good quality, and their performance is comparable across all categories of applications [20]. In this paper, we therefore concentrate on list scheduling, which has three steps: task selection, processor selection and status update. The remainder of the paper is organized as follows: task scheduling for homogeneous computing systems is covered in chapter 2, heterogeneous computing systems in chapter 3, reconfigurable computing systems in chapter 4 and heterogeneous reconfigurable computing systems in chapter 5; the review summary chart is given in chapter 6 and the paper is concluded in chapter 7.
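The three steps of list scheduling named above (task selection, processor selection, status update) can be sketched as a short loop. This is a minimal illustration, not the algorithm of any cited paper: it assumes identical processors, an acyclic task graph, no communication cost, and a priority value supplied by the caller. The names `list_schedule`, `cost`, `deps` and `priority` are hypothetical.

```python
def list_schedule(cost, deps, n_procs, priority):
    """Place every task on one of n_procs identical processors.

    cost:     dict task -> execution time
    deps:     dict task -> iterable of predecessor tasks (graph assumed acyclic)
    priority: dict task -> number, larger means scheduled earlier (assumed given)
    Returns dict task -> (processor, start, finish).
    """
    proc_free = [0] * n_procs          # time at which each processor becomes idle
    placed = {}
    remaining = set(cost)
    while remaining:
        # Step 1, task selection: highest-priority task whose predecessors are placed.
        ready = [t for t in remaining
                 if all(p in placed for p in deps.get(t, ()))]
        task = min(ready, key=lambda t: (-priority[t], t))
        # Step 2, processor selection: the processor on which the task starts earliest.
        data_ready = max((placed[p][2] for p in deps.get(task, ())), default=0)
        proc = min(range(n_procs), key=lambda i: max(proc_free[i], data_ready))
        start = max(proc_free[proc], data_ready)
        finish = start + cost[task]
        # Step 3, status update: record the placement and advance the processor clock.
        placed[task] = (proc, start, finish)
        proc_free[proc] = finish
        remaining.remove(task)
    return placed


# Illustrative diamond-shaped task graph: a -> {b, c} -> d
cost = {"a": 2, "b": 3, "c": 1, "d": 2}
deps = {"b": ["a"], "c": ["a"], "d": ["b", "c"]}
priority = {"a": 4, "b": 3, "c": 2, "d": 1}
schedule = list_schedule(cost, deps, n_procs=2, priority=priority)
```

The heuristics reviewed in the following chapters differ mainly in how they compute the priority used in step 1 and the selection rule used in step 2.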
SCHEDULING MODELS FOR HOMOGENEOUS COMPUTING SYSTEMS
Homogeneous computing refers to systems built from multiple soft core processors of the same kind, which bring parallelism to application execution. The parallelism is achieved by effective task scheduling. Static and dynamic list scheduling techniques for microprocessor based systems are summarized in [5]. In [5], task scheduling is based on a cost function, which is an attribute of the tasks of an application. Several list scheduling algorithms have been proposed [5]. Rate Monotonic Algorithm: The Rate Monotonic (RM) algorithm [27] is a static priority based scheduling algorithm, which assigns the highest priority to the most frequent task and the lowest priority to the least frequent task in the system. RM selects the highest priority task to execute first, and the remaining tasks are executed in order of their priority. RM can therefore only be used in statically defined systems, and the schedulable bound of the RM algorithm is less than 100%. For this reason, researchers in task scheduling moved towards dynamic priority based scheduling algorithms. Earliest Deadline First Algorithm: The Earliest Deadline First (EDF) algorithm [28] uses the deadline of a task as its cost function. The task with the earliest deadline has the highest priority, whereas the task with the latest deadline has the lowest priority. The major advantages of the EDF algorithm are that the priorities are dynamic, so the periods of tasks can be changed dynamically, and that the schedulable bound for any task set is 100%; however, there is no control over which tasks fail during a transient overload. Minimum Laxity First Algorithm: The Minimum Laxity First (MLF) algorithm [29] [4] follows dynamic scheduling: it assigns a laxity to each task in the system and selects the task with the minimum laxity to execute next. The laxity is a measure of the flexibility available in scheduling a task and is defined as follows: Laxity = deadline time − current time − execution time.
MLF also has a 100% schedulable bound, like EDF, and there is no way to control which tasks are guaranteed to execute during a transient overload. Maximum Urgency First Algorithm: The Maximum Urgency First (MUF) algorithm [29] combines static and dynamic priority scheduling. In the MUF algorithm, each task is given an urgency, which is a combination of two fixed priorities and one dynamic priority. The static priorities are defined once and do not change during execution, whereas the dynamic priority is assigned at runtime and is inversely proportional to the laxity of the task. The MUF scheduler considers the static priority first and then the dynamic priority. A low cost task scheduling approach [26] is described for Distributed Memory Machines (DMM) based on heuristics such as EDF and MLF, and it states that List Scheduling with Dynamic Priorities (LSDP) gives better results than List Scheduling with Static Priorities (LSSP). Task Duplication based Scheduling (TDS) [22] for distributed memory machines is designed to reduce inter processor communication. A Modified TDS (MTDS) is described in [23], and it generates a shorter schedule than TDS [22]. A Dynamic Critical Path (DCP) scheduling algorithm [24] is proposed for multiprocessors, where DCP finds the critical path of a task graph and rearranges the schedule on each processor dynamically. A duplication based scheduling strategy called the Selective Duplication (SD) algorithm is developed in [25] for multiprocessor systems with the aim of exploiting the available scheduling holes effectively without sacrificing efficiency. In [25], the application is modelled as a DAG and the target machine is represented as $M = (P, [L_{ij}], [h_{ij}])$, where $P = \{p_1, p_2, \ldots, p_p\}$ is the set of $p$ homogeneous processors, $[L_{ij}]$ is a $p \times p$ matrix describing the interconnection network topology and $[h_{ij}]$ is a $p \times p$ matrix giving the minimum distance in number of hops between processors $p_i$ and $p_j$.
The SD algorithm [25] is compared with the existing duplication based algorithms TDS [22] and MTDS [23] and with non-duplication scheduling algorithms with respect to Normalized Schedule Length (NSL) and efficiency.
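The EDF and MLF selection rules described in this chapter differ only in the key used to pick the next task. The following minimal sketch shows both rules side by side using the laxity definition given above. The `Task` record and its field names are assumptions made for illustration, and the task values are invented.

```python
from dataclasses import dataclass


@dataclass
class Task:
    name: str
    deadline: float    # absolute deadline
    remaining: float   # execution time still required


def laxity(task, now):
    # Laxity = deadline time - current time - execution time
    return task.deadline - now - task.remaining


def pick_edf(ready):
    # EDF: the task with the earliest deadline has the highest priority.
    return min(ready, key=lambda t: t.deadline)


def pick_mlf(ready, now):
    # MLF: the task with the minimum laxity is executed next.
    return min(ready, key=lambda t: laxity(t, now))


# Illustrative ready queue (values are hypothetical)
ready = [Task("A", deadline=10, remaining=2),
         Task("B", deadline=12, remaining=9)]
edf_choice = pick_edf(ready)          # A: its deadline is earlier
mlf_choice = pick_mlf(ready, now=0)   # B: laxity 3 is smaller than A's laxity 8
```

The example is chosen so that the two rules disagree: task B has the later deadline but almost no slack, so MLF runs it first while EDF does not.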
SCHEDULING MODELS FOR HETEROGENEOUS COMPUTING SYSTEMS
Heterogeneous computing refers to systems that have more than one kind of processing element, and it improves application performance when diverse kinds of execution are required. Two application scheduling algorithms, Heterogeneous Earliest Finish Time (HEFT) and Critical-Path-on-a-Processor (CPOP), are formulated in [20] for a bounded number of heterogeneous processors. HEFT has two phases: a task prioritizing phase, which orders the tasks by their upward rank, and a processor selection phase, which assigns each task to the processor that gives it the earliest finish time. HEFT has a time complexity of O(e × q) for e edges and q processors. CPOP uses the critical path, which is the sum of computation time and inter task communication time, as its cost function and has a time complexity of O(e × p) for e edges and p processors. The HEFT algorithm outperforms the other algorithms in terms of SLR and speedup, while the CPOP algorithm outperforms the related work in terms of average SLR. On average, the HEFT algorithm [20] is faster than the CPOP algorithm by 10 percent, the Mapping Heuristic (MH) algorithm by 32 percent, the Dynamic Level Scheduling (DLS) algorithm by 84 percent and the Levelized-Min Time (LMT) algorithm by 48 percent. A high performance static scheduling algorithm called the Longest Dynamic Critical Path (LDCP) algorithm is presented in [21] for Heterogeneous Distributed Computing Systems (HeDCS). To compute the LDCP, the HeDCS is modelled with m heterogeneous processors, and for each processor $P_j$ the application is represented as a Directed Acyclic Graph (DAGP$_j$) in which the size of each task is set to its computation cost on that processor. The DAGP nodes are assigned an upward rank (URank) [21], which acts as the cost function that prioritizes them for scheduling; the URank is the sum of the execution time on the processor and the communication cost between adjacent tasks. The LDCP scheduling algorithm outperforms both the HEFT [20] and DLS algorithms in terms of Normalized Schedule Length (NSL) and speedup.
A generalized fixed priority CPU scheduling model with the notion of a preemption threshold is developed in [10], and it bridges the gap between preemptive and non-preemptive scheduling models in real time. The scheduling model [10] addresses the problem of finding an optimal priority ordering and preemption threshold assignment for tasks that are independent and do not suspend themselves, under the assumption that context switching overheads are negligible. The model [10] introduces only as much preemptability as is needed to achieve feasibility, and it improves schedulability by reducing scheduling overheads through a minimum number of preemptions.
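The upward rank used as a cost function above can be computed recursively from the exit task backwards. The sketch below assumes the recursive form commonly given in the HEFT literature, rank(t) = cost(t) + max over successors s of (comm(t, s) + rank(s)); it uses a single cost per task, so it does not reproduce the per-processor DAGP construction of LDCP [21]. The function name, argument names and graph values are hypothetical.

```python
from functools import lru_cache


def upward_ranks(cost, succ, comm):
    """Return the upward rank of every task.

    cost: dict task -> computation cost (one value per task, an assumption)
    succ: dict task -> list of successor tasks
    comm: dict (task, successor) -> communication cost on that edge
    """
    @lru_cache(maxsize=None)
    def rank(t):
        # An exit task has no successors, so its rank is its own cost.
        tail = max((comm[(t, s)] + rank(s) for s in succ.get(t, ())), default=0)
        return cost[t] + tail

    return {t: rank(t) for t in cost}


# Illustrative task graph: a -> {b, c} -> d
cost = {"a": 3, "b": 2, "c": 4, "d": 1}
succ = {"a": ["b", "c"], "b": ["d"], "c": ["d"]}
comm = {("a", "b"): 1, ("a", "c"): 2, ("b", "d"): 3, ("c", "d"): 1}
ranks = upward_ranks(cost, succ, comm)
# Tasks are then scheduled in order of decreasing rank.
order = sorted(ranks, key=lambda t: (-ranks[t], t))
```

Sorting by decreasing upward rank always places a task before its successors, which is why the rank can serve directly as a list scheduling priority.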
SCHEDULING MODELS FOR RECONFIGURABLE COMPUTING SYSTEMS
Reconfigurable computing is an emerging paradigm that satisfies the simultaneous demand for application flexibility and performance. The ability to customize its architecture to support concurrent computation and parallel application execution gives a Reconfigurable Computing System (RCS) a performance benefit over general purpose processors. A Parameterized Module Scheduling (PMS) algorithm for RCS [8] [4] addresses the problem of scheduling and mapping the non-preemptive tasks of an application task graph to a platform with variable Reconfigurable Logic Units (RLUs), using the concepts of parameterized modules and variable silicon area. The scheduling system [8] uses Dynamic Programming (DP) to schedule the tasks, and it is described in three parts: the application in the form of a task graph, the computing environment, and the performance criterion used to reach the scheduling goal. Here, the performance criterion is the schedule length $L$, i.e. the actual Finish Time (FT) of the exit task $v_{exit}$, $L = FT(v_{exit})$, and the goal is to minimize the schedule length of an application. The scheduling algorithm [8] uses the b-level of a task as the rank function to prioritize the tasks of an application, where the b-level of a task node $V_i$ is the length of the longest path from $V_i$ to the exit task node. Loop Kernel Pipelining Mapping (LKPM) [2] is proposed for Coarse Grained Reconfigurable Architectures (CGRA) to optimize Data Intensive Applications (DIA). In [2], the Program Information Aided Control Dataflow Task Graph (PIA-CDTG) represents the functionality and behaviour of the DIA, the Virtual Instruction Dataflow Graph (Vi-DFG) represents the behaviour of critical loop kernels, and the Reconfigurable Architecture Graph (RAG) represents the loop self pipelining and loop iteration behaviour of the CGRA. An M×N CGRA can be represented by RAG = (PE, C), where $PE_{ij}$ consists of a memory PE (mPE) and a computation PE (cPE), $0 \le i \le M$, $0 \le j \le N$, and C describes the data relevance dependency.
The LKPM maps the control conditions of a loop to mPEs and the body of the loop to cPEs to increase the throughput of the DIA. A dynamic scheduling and placement algorithm [11] has been proposed for RS based on the finishing time mobility of the tasks. The model in [11] integrates an online placement algorithm with the scheduling model to support FPGA clusters. Here the FPGA is divided into slots or clusters, and arriving tasks are placed inside one of the clusters depending on their execution end time. To enhance the efficiency of the device [11], the width of the clusters is varied at runtime when needed, and the host processor controls the mapping of the hardware task code to the FPGA as an executable circuit. Online scheduling of real time tasks on reconfigurable computing systems [12] [13] [14] is formalized with the objectives of reducing configuration overheads through resource reuse, minimizing the total execution time and decreasing the task rejection ratio. The model in [12] is a combination of a window based stuffing algorithm and the KAMER [11] placement algorithm. The model in [13] focuses on independent real time tasks, and each task is defined by the tuple $T_i = \{w_i, h_i, e_i, a_i, d_i, r_i\}$, where $w_i$, $h_i$, $e_i$, $a_i$, $d_i$ and $r_i$ represent the width, height, execution time, arrival time, deadline and reconfiguration time of the task respectively. The schedulable bound of these algorithms [12, 13] is less than 100%. A heuristic approach to scheduling periodic real time tasks on Reconfigurable Hardware (RH) [14] formalizes two scheduling algorithms, EDF-Next Fit (EDF-NF) and Merge Server Distribute Load (MSDL), for preemptive periodic tasks. MSDL constructs a set of servers by merging sets of tasks for parallel execution, and the resulting servers are then scheduled for sequential execution on the FPGA with EDF-NF.
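The task tuple used in [13] (width, height, execution time, arrival time, deadline, reconfiguration time) maps naturally onto a small record type. The sketch below pairs it with a simple admission test to show how the fields interact. The test itself is an assumption made for illustration (the task must fit in the free area and must start no later than deadline minus reconfiguration and execution time); it is not the acceptance rule of [12] or [13], and the names `HwTask` and `can_accept` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HwTask:
    w: int      # width in reconfigurable cells
    h: int      # height in reconfigurable cells
    e: float    # execution time
    a: float    # arrival time
    d: float    # absolute deadline
    r: float    # reconfiguration time

    @property
    def area(self):
        return self.w * self.h

    def latest_start(self):
        # Reconfiguration and execution must both complete before the deadline.
        return self.d - self.r - self.e


def can_accept(task, free_area, start):
    """Assumed admission test: enough area, and the start time lies between
    the arrival time and the latest feasible start."""
    return task.area <= free_area and task.a <= start <= task.latest_start()


# Illustrative task (values are hypothetical)
task = HwTask(w=4, h=2, e=5.0, a=0.0, d=12.0, r=2.0)
accepted_early = can_accept(task, free_area=10, start=3.0)
accepted_late = can_accept(task, free_area=10, start=6.0)
accepted_small = can_accept(task, free_area=6, start=3.0)
```

A task that fails such a test would count towards the task rejection ratio that the online schedulers above try to reduce.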
SCHEDULING ALGORITHMS FOR HETEROGENEOUS RECONFIGURABLE COMPUTING SYSTEMS
A computing platform called the MOLEN polymorphic processor is presented in [26], and it incorporates both general purpose and custom computing processing elements. The MOLEN processor also incorporates an arbitrary number of programmable units to support both hardware and software tasks. An efficient multi task scheduler for runtime reconfigurable systems [9] proposes a new parameter called Time-Improvement as the cost function for a compiler assisted scheduling algorithm. The Time-Improvement heuristic is defined in terms of the reduction in task execution time and the distance to the next call. The scheduling system in [9] targets the MOLEN polymorphic processor [26]; it assigns less CPU intensive tasks and the control of tasks to the General Purpose Processor (GPP), whereas computing intensive tasks are assigned to the FPGA. The task scheduler in [9] outperforms previous algorithms and accelerates task execution by 4% up to 20%. Online scheduling of Software Tasks (ST), Hardware Tasks (HT) and Hybrid Tasks (HST) is proposed in [6] for a CPU-FPGA platform, where an ST executes only on the CPU, an HT executes only on the FPGA and an HST can execute on both the CPU and the FPGA. The scheduling model [6] uses the reserved time of tasks as its cost function, and it integrates task allocation, placement and task migration modules. An online HW/SW partitioning and co-scheduling algorithm [3] is proposed for an environment of a GPP and a Reconfigurable Processing Unit (RPU), in which the Hardware Earliest Finish Time (HEFT) and Software Earliest Finish Time (SEFT) are calculated for the tasks of an application. The difference between HEFT and SEFT is used to partition the tasks, and the EFT is used to build the scheduled task list for both the GPP and the RPU. An overview of task co-scheduling for a µP and FPGA environment is given in [7] [31] from the viewpoint of different communities such as Embedded Computing (EC), Heterogeneous Computing (HC) and Reconfigurable Hardware (RH).
The Reconfigurable Computing Co-scheduler (ReCoS) [7] integrates the strengths of HC and RH scheduling to handle RC system constraints such as the number of FFs, LUTs, multiplexers and CLBs, communication overheads, reconfiguration overheads, throughput and power constraints. Compared with EC, RC and RH scheduling algorithms, the ReCoS algorithm shows an improvement in the optimal schedule search time and in the execution time of an application. Hardware supported task scheduling for dynamically reconfigurable SoC (RSoC) is described in [15] to utilize the RSoC resources effectively for multi task applications. Task systems in [15] are represented as a modified Directed Acyclic Graph (DAG), and the task graph is defined as the tuple $G = (V, E_d, E_c, P)$, where $V$ is the set of nodes, $E_d$ and $E_c$ are the sets of directed data edges and control edges respectively, and $P$ represents the set of probabilities associated with $E_c$. The RSoC architecture in [15] comprises a general purpose embedded processor along with two L1 caches (data and instruction) and a number of reconfigurable logic units on a single chip. The paper [15] concludes that the performance of Dynamic Scheduling (DS) does not degrade as the complexity of the problem increases, whereas the performance of Static Scheduling (SS) declines, and that DS outperforms SS when both the task system complexity and the degree of dynamism increase. A compiler assisted runtime scheduler [16] is designed for the MOLEN architecture, where the compiler describes the runtime system as a Configuration Call Graph (CCG). The CCG in [16] provides two parameters for each task, the distance to the next call and the frequency of future calls, and these parameters act as the cost function for the scheduler. Communication aware online task scheduling for partially reconfigurable systems [17] distributes the tasks over the 2D area based on the data communication time of the tasks.
The scheduler in [17] can run on the host processor, and the expected end time of a task is $t_{end} = t_{latest} + t_{conf} + t_{read} + t_{exec} + t_{write}$, where $t_{latest}$ is the completion time of the already scheduled task, $t_{conf}$ is the task configuration time, $t_{read}$ is the data/memory read time, $t_{exec}$ is the task execution time and $t_{write}$ is the data/memory write time. HW/SW co-design techniques [18] are described for dynamically reconfigurable architectures with the aim of deciding the execution order of events at runtime based on their EDF. The authors demonstrate a HW/SW partitioning algorithm, a co-design methodology with dynamic scheduling for discrete event systems and a dynamic reconfigurable computing multi-context scheduling algorithm. These three co-design techniques [18] minimize the application execution time by parallelizing event execution under the control of the host processor, for both shared memory and local memory based Dynamic Reconfigurable Logic (DRL) architectures. When the number of DRL cells is three or more, the techniques in [18] yield better optimization for the shared memory architecture than for the local memory architecture. A HW/SW partitioning algorithm [30] is presented that partitions tasks into software tasks and hardware tasks based on their waiting time. A layer model [20] provides a systematic way of using dynamically reconfigurable hardware and also reduces the error-proneness of the system components. The layer model [20] comprises six layers, listed here from bottom to top. The first (lowest) layer, the Hardware Layer, represents the reconfigurable hardware; the second, the Configuration Layer, interfaces with the configuration port of the FPGA; the third, the Positioning Layer, assigns a position to each partial bit-stream; the fourth, the Allocation Layer, manages the FPGA resources for incoming modules; the fifth, the Module Management Layer, provides access to all modules (tasks) that are loaded into the system; and the sixth, the Application Layer, represents the application as a task graph. Layer models of this kind help in designing efficient operating systems for HRCS.
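The expected end time used by the scheduler in [17] is a sum of five terms: the completion time of the already scheduled task, the configuration time, the data/memory read time, the execution time and the data/memory write time. The sketch below evaluates that sum and, as an illustrative assumption that is not taken from [17], picks the candidate region with the smallest expected end time. The parameter names, region names and timing values are hypothetical.

```python
def expected_end_time(t_latest, t_conf, t_read, t_exec, t_write):
    """Completion time of the already scheduled task, plus configuration,
    data/memory read, execution and data/memory write times."""
    return t_latest + t_conf + t_read + t_exec + t_write


def earliest_region(regions, t_conf, t_read, t_exec, t_write):
    """Assumed selection rule: choose the region whose expected end time is smallest.

    regions: dict region name -> completion time of the task already scheduled there
    """
    return min(regions,
               key=lambda name: expected_end_time(regions[name], t_conf,
                                                  t_read, t_exec, t_write))


# Illustrative values in arbitrary time units
end = expected_end_time(t_latest=10, t_conf=2, t_read=1, t_exec=5, t_write=1)
best = earliest_region({"left": 10, "right": 7},
                       t_conf=2, t_read=1, t_exec=5, t_write=1)
```

Because the read and write terms enter the sum directly, a task with heavy data transfer finishes later even when its execution time is short, which is the effect a communication aware scheduler has to account for.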
SUMMARY CHART FOR SCHEDULING MODELS FOR HIGH SPEED COMPUTING SYSTEMS
The summary chart of scheduling methodologies in Table 1 lists the author and paper reference in the first column, the nature of the scheduling algorithm (static or dynamic) in the second column, the task handling behaviour (single task or multiple tasks supported) in the third column, the nature of the computing resources in the target computing platform (microprocessor, FPGA, or an integration of µP and FPGA) in the fourth column, the nature of the host platform on which the scheduling methodology executes (µP or FPGA) in the fifth column, the targeted performance metrics (schedulable bound, execution speed enhancement and resource optimization) in the sixth column, the cost function used to prioritize the tasks of an application when preparing the scheduled task list in the seventh column, and the future scope and remarks on the methodology in the eighth column. The overview and summary chart of the scheduling methodologies described for HSCS is given in Table 1.
CONCLUSION AND FUTURE SCOPE
Real time applications can be optimized only when the HSCS has diverse resources to support parallel processing, and scheduling algorithms play a crucial role in distributing tasks to those resources. In this paper, we have reviewed various High Speed Computing Systems that support the runtime requirements of applications and have prepared a summary chart of existing scheduling methodologies. Homogeneous computing systems provide parallel processing to applications at the expense of the number of resources; heterogeneous computing systems support distributed applications at the expense of communication between resources; reconfigurable systems bring dynamic reconfiguration at runtime to the application at the expense of soft core processor efficiency; and finally HRCS provides an optimal solution for computing real time applications by integrating both soft core and hard core processors as computing elements. The summary chart shows that dynamic scheduling methodologies with multitask support are effective in speeding up real time applications on HSCS, but the scheduling model always runs on soft core processors, which degrades the efficiency of the scheduling model at runtime. There is therefore a need for researchers to develop scheduling models that can run on a hard core processor, which would enhance the efficiency (speed) of the scheduler at runtime.
