Today's embedded systems operate under increasingly dynamic conditions. First, computational workloads can be either variable by nature or adjustable. Moreover, as many devices are battery-powered, it is common to apply runtime power management techniques, which result in a dynamic power budget. This paper presents a design methodology for multi-core systems, based on dataflow specification, that can deal with various contexts. We optimize the original dataflow considering various working conditions and then autonomously adapt it to a pre-defined optimal form in response to context changes. Such context changes at runtime might cause non-negligible delay or power budget violation. To overcome these problems, an efficient mode switching method is proposed. We show the effectiveness of the proposed technique with a real-life case study, stereo-vision, and synthetic benchmarks.
I. INTRODUCTION
Due to the ever increasing functional and computational demands, traditional uni-core microprocessors or microcontrollers are no longer an effective solution in the design of embedded or cyber-physical systems. Though it is possible to deal with the heavy computational workloads by having specialized accelerators or coprocessors, it is not suitable to have such customized designs for embedded or cyber-physical systems in terms of economic feasibility or time-to-market. So, it is nowadays prevalent to use commercial-off-the-shelf multi-core processors [2] in embedded systems.
The workloads in embedded or cyber-physical systems are synthesized into multiple tasks or threads running on the multi-core processor. Then, an important combinatorial optimization problem arises to determine which workload is executed on which core. This procedure is called mapping optimization and considered to be one of the most important decisions in the multi-core system design [3]. Thus far, the mapping optimization has been performed offline since it is typical to assume that the design parameters, such as task execution time or power budget, are given as fixed values. However, this assumption is no longer valid in today's embedded systems. Firstly, the design complexity and level of abstraction continue to get higher. Moreover, the working environment or incoming workloads are increasingly dynamic.
The associate editor coordinating the review of this manuscript and approving it for publication was Mohammad Anwar Hossain.
There are many sources of dynamism in today's embedded systems. First of all, the workload characteristics themselves become dynamic as they are closely related to physical processes [4]. In some use-cases, software implementations offer different execution modes, resulting in variable execution times [5]. Further, to take advantage of the tradeoff between the execution time and the quality of outcome, attempts have been made to optimize the multi-core system over variable execution times [6]. Another example of a dynamic working environment can be found in systems with an energy harvester, where the power budget is variable [7].
In the above-mentioned cases, the variable design parameters originate from or are related to input stimuli or given operating conditions. Thus, it is practically impossible to capture the dynamic behaviors of varying parameters accurately in the design-time optimization. To overcome this issue, we propose in this paper a dataflow-based multi-core embedded system design methodology that systematically considers the dynamic factors that cause considerable variations in system status or design parameters.
The overall framework of the proposed technique is illustrated in Fig. 1. It is composed of two offline steps, state derivation and individual design space explorations (DSEs), and a runtime management part. In the first offline step, state derivation, the user needs to enumerate the set of all possible system states or contexts, between which the system may alternate at runtime. The user should also describe when, or under which conditions, transitions among the derived states happen. Then, a finite state machine (FSM) description is automatically built out of the original dataflow specification. In the second step, each state of the derived FSM is individually customized from the original dataflow and optimized with respect to its specific condition. In this individual optimization, the topology of the original dataflow is modified to be more suitable to the associated context. Once these offline phases are completed, we obtain the customized dataflow topologies and their optimized mappings for all possible contexts, stored in the database. At runtime, whenever necessary, the dataflow topology as well as its mapping decision can be adjusted in response to a context change caused by state transitions of the derived FSM.
System state or context changes may be triggered either by internal execution mode changes or by external events, both of which are considered in this work. For the former case, quality-of-service (QoS) changes in software execution are considered. In some computer-vision applications for embedded systems, for instance, it is common to consider a number of QoS levels [6], [8]. Such flexible QoS levels have also been studied at the circuit level to explore the energy-QoS tradeoff in the video codec domain [9], [10]. Other than the vision or video codec applications, there is a wide variety of algorithms that are associated with such QoS-time tradeoffs [11], so-called anytime algorithms. In such algorithms, the task execution can be interrupted at any time and the quality of outcome is proportional to the amount of execution time. Taking advantage of this relation, an algorithm can be operated with reduced QoS, e.g., lower resolution, in favor of better energy-efficiency. The latter case, state changes triggered by external events, is considered with a varying power budget in this work. Today's embedded systems are often mobile or deployed in extreme environments in which periodic maintenance is not feasible. Thus, in many cases, they do not have stable power sources. Energy harvesting techniques [7] or runtime power management [8] are popularly adopted in the design of embedded systems, resulting in a varying power budget.
The basic dataflow adaptation framework that we present in this paper was originally proposed in our previous work [1]. In this work, we complete the framework by addressing the mode switching overheads, which remained as future work. The overhead of the naive mode switching is identified and quantitatively discussed in this work. Then, we propose a light-weight mode switching method to deal with context changes more promptly and efficiently. The overhead of the light-weight switching is also studied in terms of memory usage and then alleviated by buffer sharing.
The rest of this paper is organized as follows. In the next section, we review the previous research that deals with dynamism in the design of embedded systems. Then, in Section III, the model we assume in this work is illustrated, followed by the problem formulation. In Section IV, we present how a dataflow application is optimized considering various system states by the proposed technique. Further, in Section V, we present how the proposed technique enables efficient mode switching at runtime, in terms of switching delay and memory usage. In Section VI, we show the effectiveness of the proposed technique with a real-life case study, computer-vision for drones. Quantitative evaluations of the real-life benchmark are shown in Section VII in terms of power consumption and QoS, followed by concluding remarks in Section VIII.
II. RELATED WORK
There have been a number of studies on how to adapt dataflow applications to dynamic systems. Stuijk et al. [12] proposed Scenario-Aware DataFlow (SADF), in which the dataflow specification also captures a control part that determines the active part of the dataflow. SADF has not only improved the expression capability of the dataflow model, but also enabled more sophisticated design-time analysis of dynamic systems. A scenario-based design flow for multi-core embedded systems has been proposed by Schor et al. [13], in which, unlike SADF, environmental or execution mode changes are expressed as different states in an FSM. To cope with varying functionalities for different contexts, each state in the FSM is associated with a different set of dataflow applications. Jung et al. [14] proposed a multi-mode dataflow model and its mapping/scheduling technique on multiprocessor systems. They devised the Mode Transition Machine (MTM), a simplified FSM, to describe the transitions between different contexts. In addition, an offline/online hybrid mapping optimization was performed to minimize resource usage with respect to a given throughput constraint. In the above-mentioned works, system designers are responsible for describing all possible scenarios of the dynamic system. Therefore, these approaches are not suitable or scalable when the number of possible scenarios is prohibitively large or the granularity of the specification is too fine.
In order to alleviate the effort of considering all dynamic cases at the specification phase, online optimization approaches have been proposed. Choi et al. [15] proposed to run dataflow applications with an initial configuration and then modify it as necessary at runtime. To be more specific, the bottleneck of the dataflow graph in terms of throughput is identified on the fly and the execution of the identified task is accelerated by means of task duplication. Zhang et al. [16] also proposed a similar online detection and dataflow adaptation framework, called ADAPT, to optimize the throughput even in dynamic execution scenarios. In addition to the task duplication, they also considered flexible pipeline stage management. Tuveri et al. [17] proposed an online remapping technique of process networks for shared-memory multi-processor platforms, where the mapping decision is reconfigured on the fly in response to dynamic workloads.
Another notable online approach is the Parameterized and Interfaced dataflow Meta-Model (PiMM), whose application to the synchronous dataflow model is called PiSDF, proposed by Desnos et al. [18]. They enabled dataflow models to flexibly adapt their properties or topologies by means of parameterizable interfaces. Based on PiSDF, Heulot et al. [19] presented the Synchronous Parameterized and Interfaced Dataflow Embedded Runtime (SPiDER) framework, where dataflow configurations can be adapted at runtime. In SPiDER, the mapping decision, as well as the dataflow configuration, is made by a master called GRT (Global RunTime), which could be a potential bottleneck in large systems.
Offline/online hybrid approaches have also been studied in the optimization of dynamic embedded systems. Quan and Pimentel [20] proposed a hybrid framework, called Scenario-based run-time Adaptive Resource Allocation (SARA), where multiple mapping solutions are precomputed by design-time optimization. At runtime, in response to workload scenario changes, those precomputed mapping solutions are combined and further optimized. Diguet et al. [21] also proposed a similar offline/online hybrid optimization approach, where the online decision making is performed by means of an adaptive closed-loop model. Such a hybrid approach can also be found for other processing elements; Wildermann et al. [22] introduced a hybrid multi-mode optimization framework for reconfigurable FPGAs.
The proposed technique is different from the previous studies reviewed above in the following aspects.
1) We consider more generalized system state changes. In the scenario-aware design frameworks, it was simply assumed that the dynamism originates in execution mode changes or scenarios [12]-[14] that are mostly initiated by users or the algorithms themselves. That is the reason why, in their techniques, different contexts or modes manifest themselves as different sets of active dataflow graphs that are described by users at the specification phase. In this work, however, we deal with more general system status changes so that dynamism can exist even in a single dataflow, as previously exemplified by QoS or power budget changes.
2) In contrast to the existing techniques [13], [14], [21], [22], we do not restrict ourselves to different mapping solutions; the dataflow topology can also be altered in order to better optimize the parallelism degree.
3) The proposed technique is more general in applying various kinds of design constraints and objectives. Note that the existing online optimization techniques [15], [16], [20] only focused on maximizing throughput without guaranteeing any design properties such as worst-case power or latency. Due to this fundamental limitation of online optimization, they are not suitable for the design of real-time or power-constrained systems.
4) Lastly, we propose an efficient mode-switching method and a memory sharing technique associated with it, based on the POSIX framework, which is widely used in multi-core systems.
III. PROBLEM DEFINITION
In this section, we describe the system models in dataflow specification, mapping, variable workload, and parallelization, followed by the description of a couple of optimization problems. Note that the basic models are identical to those of the preliminary version of this paper [1] .
A. SYSTEM MODEL
1) DATAFLOW APPLICATION
An application is described as a dataflow graph, which is defined as a tuple $\langle V, E \rangle$, where $V$ and $E$ denote the sets of vertices and directed edges, respectively. Each element $v \in V$ denotes a constituent task of the application. Note that each task $v$ is associated with a positive integer $pr_v$ as its priority, which is also a part of the description. Two tasks in a dataflow application may have an execution dependency. That is, if task $v_2$ can only be executed after the completion of task $v_1$, $v_2$ is said to be dependent upon $v_1$, and this dependency is described as a directed edge from vertex $v_1$ to $v_2$ in the dataflow graph. A dependency of $v_d$ on $v_s$ is represented as a tuple $\langle v_s, v_d \rangle$, and $E$ is the set of all execution dependencies in the given dataflow. A dataflow graph is associated with a latency constraint $T$, by which the execution of the graph should be completed.
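As a minimal sketch of the model above, the following Python fragment encodes a dataflow graph as a set of prioritized tasks, a set of dependency edges, and a latency constraint. The class name `DataflowGraph`, its field and method names, and the three-task example are illustrative assumptions and are not part of the original specification.

```python
from dataclasses import dataclass, field


@dataclass
class DataflowGraph:
    """Dataflow graph <V, E> with per-task priorities and a latency constraint T."""
    priority: dict                              # task -> positive integer pr_v
    edges: set = field(default_factory=set)     # (v_s, v_d): v_d depends on v_s
    latency_constraint: float = 0.0             # T

    def predecessors(self, task):
        """Tasks that must complete before `task` may start."""
        return {src for (src, dst) in self.edges if dst == task}

    def ready_tasks(self, finished):
        """Unfinished tasks whose predecessors have all completed."""
        return sorted(
            t for t in self.priority
            if t not in finished and self.predecessors(t) <= finished
        )


# Hypothetical three-task example: v2 and v3 both depend on v1.
graph = DataflowGraph(
    priority={"v1": 1, "v2": 2, "v3": 3},
    edges={("v1", "v2"), ("v1", "v3")},
    latency_constraint=25,
)
initially_ready = graph.ready_tasks(finished=set())
after_v1 = graph.ready_tasks(finished={"v1"})
```

In this example only `v1` is ready at the start, and both `v2` and `v3` become ready once `v1` has completed.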
2) MULTI-CORE ARCHITECTURE AND MAPPING/SCHEDULING
A multi-core system consists of a set of cores $C$ that can communicate with each other through fast shared memory. The mapping decision, i.e., which $v \in V$ is executed on which $c \in C$, is described as a function $map : V \rightarrow C$.
When two or more tasks are mapped on a core and ready to be executed at the same time, the task with the lowest $pr$ value among them is chosen for scheduling. We assume self-timed scheduling; thus, as soon as an application instance is completed, the successive instance is invoked. Even when two communicating tasks are mapped on different cores, no additional communication costs are applied in the schedule, thanks to the fast shared memory between cores.
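The scheduling rule above can be sketched as a small non-preemptive list scheduler. This is only an illustration under stated assumptions: the function name `list_schedule`, the dictionary-based inputs, and the five-task example graph with its execution times are hypothetical; the sketch covers a single graph iteration (not the back-to-back invocations of self-timed execution), assumes an acyclic graph with positive execution times, and models communication as free.

```python
def list_schedule(exec_time, priority, edges, mapping):
    """Schedule one graph iteration; returns task -> (core, start, finish).

    exec_time: task -> duration, priority: task -> pr (lower value wins),
    edges: set of (src, dst) dependencies, mapping: task -> core.
    """
    preds = {t: {s for (s, d) in edges if d == t} for t in exec_time}
    core_free = {core: 0 for core in set(mapping.values())}
    schedule = {}

    def earliest_start(task):
        # A task can start once its inputs are produced and its core is free.
        data_ready = max((schedule[p][2] for p in preds[task]), default=0)
        return max(data_ready, core_free[mapping[task]])

    while len(schedule) < len(exec_time):
        ready = [t for t in exec_time
                 if t not in schedule and all(p in schedule for p in preds[t])]
        # Earliest start first; ties are broken by the lowest pr value.
        task = min(ready, key=lambda t: (earliest_start(t), priority[t]))
        start = earliest_start(task)
        finish = start + exec_time[task]
        schedule[task] = (mapping[task], start, finish)
        core_free[mapping[task]] = finish
    return schedule


# Hypothetical example: v2a and v2b are two instances of a parallelized task.
exec_time = {"v1": 10, "v2a": 5, "v2b": 5, "v3": 10, "v4": 10}
priority = {"v1": 1, "v2a": 2, "v2b": 2, "v3": 3, "v4": 4}
edges = {("v1", "v2a"), ("v1", "v2b"), ("v1", "v4"),
         ("v2a", "v3"), ("v2b", "v3")}
mapping = {"v1": "c1", "v2a": "c2", "v2b": "c1", "v3": "c1", "v4": "c2"}

schedule = list_schedule(exec_time, priority, edges, mapping)
latency = max(finish for (_, _, finish) in schedule.values())
```

Here `v2a` and `v4` both become executable on `c2` at time 10, and `v2a` is picked first because its priority value is lower; `v4` then runs from 15 to 25, so the end-to-end latency is 25.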
3) POWER CONSUMPTION
Since most multi-core processors are implemented in CMOS technology, the power consumption of a core can be modeled as the sum of dynamic and static power dissipations. The dynamic power of core $c$ is denoted as $P^{dyn}_c$ and is only counted when a task is being executed. On the other hand, the static power $P^{stat}_c$ is counted even when the core is idle, executing no tasks. While we do not explicitly apply dynamic power management, we allow a designer to choose to turn off some cores at all times. In this case, the power consumption of core $c$ is $P^{idle}_c$. Note that each core in the system can be associated with its own distinct power parameters.
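One way to write this power model compactly, averaged over one schedule, is given below. The symbols $b_c$ (the time during which core $c$ executes tasks), $\Lambda$ (the length of the schedule), and $\bar{P}_c$ are introduced here only for illustration, and summing the per-core averages into a system-level figure $\bar{P}$ is an assumption rather than a statement taken from the model description.

```latex
\bar{P}_c =
\begin{cases}
  P^{idle}_c & \text{if core } c \text{ is turned off,} \\[4pt]
  P^{stat}_c + \dfrac{b_c}{\Lambda}\, P^{dyn}_c & \text{otherwise,}
\end{cases}
\qquad
\bar{P} = \sum_{c \in C} \bar{P}_c .
```

A core that is busy for the whole schedule ($b_c = \Lambda$) thus consumes $P^{stat}_c + P^{dyn}_c$ on average.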
4) VARIABLE WORKLOAD
As stated in Section I, an application can operate at several different QoS levels. In order to quantify the impact of a QoS level on the system's behavior, a positive integer $l \in [l_{min}, l_{max}]$ is introduced, in which a bigger value indicates a higher QoS level. Note that the execution time of a task is a monotonically increasing function of the QoS level $l$, as typically assumed in anytime algorithms [6]. That is, the execution time of task $v$ on core $c$ is denoted as $exec_{v,c}(l)$, and $exec_{v,c}(l_{min}) \le exec_{v,c}(l) \le exec_{v,c}(l_{max})$.
5) PARALLELIZABLE TASKS
We assume that some dataflow vertices can be executed on multiple processing elements at the same time. This is possible as they handle multiple independent data sets at the same time, which is compliant with state-of-the-art parallel programming models such as OpenCL [23] or CUDA [24] . We enable this parallelization technique in the dataflow specification by modifying its topology. For that, a dataflow task v is associated with a maximum parallelism degree mpd(v).
That is, if $mpd(v) = n$, task $v$ can be executed on at most $n$ cores at the same time. At design time, we optimize the parallelism degree of each task and modify the dataflow topology properly tailored to a certain context. For instance, let us assume that the original dataflow is given as
B. PROBLEM FORMULATION
Based on the models described above, we formulate the two optimization problems that we aim to solve in this paper as follows. In both problems, the original dataflow $\langle V, E \rangle$, the multi-core architecture $C$, and the execution time information $exec$ are given as input, and the expected output is the modified dataflow $\langle V', E' \rangle$ and its mapping decision $map$. As a time constraint, the end-to-end latency of the schedule with respect to $map$ should always be less than or equal to $T$.
1) QoS-CONTROLLED-POWER-MINIMIZATION (QCPM)
In this problem configuration, context changes arise from the application's requirements. That is, depending on which context the system operates in, it has different QoS requirements. Therefore, in the mapping optimization, a set of QoS levels $L$ is given as input and it is assumed that the QoS requirement may vary at runtime within $L$. No matter which QoS level it is associated with, the time constraint should be satisfied. The optimization objective is to minimize the average power consumption. A typical example of QCPM can be found in the co-design of control quality and computational cost for computer-vision applications in UAVs [6].
2) POWER-CONSTRAINED-QoS-MAXIMIZATION (PCQM)
In this problem configuration, unlike the previous one, the dynamism is from its working environment: power supply. So, the power consumption, in this case, is not an objective to be minimized, but a constraint. That is, during a certain period, the power consumption of the system should be less than or equal to the power constraint PC. As we aim at devising an optimization technique that can adapt at runtime, we let PC change dynamically over time. In this configuration, the optimization objective is set to maximize the QoS levels with respect to the power budget. The PCQM problem can be applied to the adaptive mapping/scheduling optimization in transiently-powered systems with energy harvesting, e.g. [7] , where power budget may vary at runtime.
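The two problems can be summarized side by side as follows. This is a hedged restatement rather than the authors' formal definition: $\bar{P}(map, l)$ (the average power of the schedule induced by $map$ at QoS level $l$) and $\lambda(map, l)$ (its end-to-end latency) are symbols introduced here for illustration, and the modified dataflow is written as $\langle V', E' \rangle$.

```latex
% QCPM: solved once for every QoS level l in L
\min_{\langle V', E' \rangle,\; map} \; \bar{P}(map, l)
\quad \text{s.t.} \quad \lambda(map, l) \le T

% PCQM: solved once for every possible power budget PC
\max_{\langle V', E' \rangle,\; map,\; l} \; l
\quad \text{s.t.} \quad \lambda(map, l) \le T, \quad
\bar{P}(map, l) \le PC, \quad
l \in [l_{min}, l_{max}]
```

In QCPM the QoS level is an input and power is the objective, whereas in PCQM the QoS level becomes a decision variable and power becomes a constraint.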
IV. PROPOSED DATAFLOW ADAPTATION TECHNIQUE
As shown in Fig. 1 , the proposed technique consists of two design time optimization steps and a run time management part.
A. STATE DERIVATION
In the first offline optimization phase, to enable efficient runtime adaptation, the set of all possible system contexts (states) should first be defined by the system designer for the given problem. Once these are defined, in line with the existing approaches [13], [14], a fully connected FSM, each of whose states denotes one of the defined contexts, is automatically synthesized. That is, the system can make an arbitrary transition from one state to any other. The generated FSM is recorded as an XML file in a format compatible with [13]. Some problem-specific restrictions can be manually applied after the automatic FSM derivation. For instance, in case no radical changes are allowed, the states in the FSM may have transitions only to the neighboring ones. Note that no extra states are added in this customization; we only consider removing some transitions from the original fully connected FSM.
For QCPM, context changes are triggered by the varying requirements of the application; thus, the system states can be defined as the set of possible QoS levels, $L$, and we generate $|L|$ FSM states. A task is associated with different execution times in different states. That is, task $v$ on core $c$ has an execution time of $exec_{v,c}(l_1)$ in a state with QoS level $l_1$, while it is $exec_{v,c}(l_2)$ in another state where the level is $l_2$. In the PCQM problem, the number of states is equal to the number of possible power budgets, and a different power constraint ($PC$) is applied for each power budget.
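The state derivation step can be sketched as follows. The function name `derive_fsm`, the `neighbors_only` flag, and the choice to exclude self-transitions are assumptions made for this illustration; the XML output format of [13] is not reproduced here.

```python
from itertools import permutations


def derive_fsm(contexts, neighbors_only=False):
    """Build a fully connected FSM over the given contexts.

    contexts: ordered list of contexts (QoS levels for QCPM, power budgets
    for PCQM). With neighbors_only=True, transitions between non-adjacent
    contexts are removed; no state is ever added.
    """
    states = list(contexts)
    # Every ordered pair of distinct states is a transition.
    transitions = set(permutations(states, 2))
    if neighbors_only:
        index = {state: i for i, state in enumerate(states)}
        transitions = {(a, b) for (a, b) in transitions
                       if abs(index[a] - index[b]) == 1}
    return states, transitions


# Hypothetical QCPM instance with L = {1, 2, 3}.
states, full = derive_fsm([1, 2, 3])
_, restricted = derive_fsm([1, 2, 3], neighbors_only=True)
```

For three QoS levels the fully connected FSM has six transitions, and the neighbor-only customization keeps four of them.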
B. DESIGN SPACE EXPLORATIONS (DSE)
In the second offline optimization step, as shown in Fig. 1, the dataflow associated with each state of the derived FSM is individually optimized in mapping and topology, as the states could have different constraints or design parameters. Since the multi-core mapping of dataflow applications is a well-known NP-hard problem, we build a genetic algorithm (GA) based DSE engine by customizing a publicly available metaheuristic solver framework, Opt4J [25]. Once the dataflow of each state is optimized by the GA engine, it is also recorded in an XML file format compatible with [13].
For the dataflow topology modification, we adopt the notion of replication in Expandable Process Network (EPN) [26]. EPN is a variation of the dataflow model in which some tasks can be instantiated multiple times to exploit data parallelism. During the optimization, the degrees of replication of the parallelizable tasks are determined together with the mapping. Fig. 2(a) shows the genotype structure we assemble to solve the QCPM and PCQM problems. It is mainly composed of two parts: QoS and mapping. Note that the QoS part exists only for PCQM, since the QoS level is already given as a fixed value in each state in the QCPM problem. The QoS part holds a single positive integer $l \in [l_{min}, l_{max}]$ as the QoS level in the corresponding state. The mapping part optimizes the dataflow topology and mapping at the same time. Each task $v \in V$ is assigned as many slots as $mpd(v)$. Thus, the total genotype length is $\sum_{v \in V} mpd(v)$ for QCPM, while it is $\sum_{v \in V} mpd(v) + 1$ for PCQM. At each slot, a non-negative integer value is assigned during the DSE: a positive value indicates the id of the core that the task instance is to be mapped on, whereas zero is interpreted as no valid mapping. Let us take the dataflow shown in Fig. 2(b) as an example, where $v_2$ is the only task whose $mpd$ is bigger than 1. Then, as illustrated in Fig. 2(c), three integer values are to be determined in the three slots for $v_2$. As the chosen integers are 2, 1, 0, $v_2$ has been determined to be mapped on core id 2 and core id 1. In other words, $v_2$ is parallelized (or duplicated) into two instances that are mapped on c1 and c2.
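The genotype layout described above can be decoded along the following lines. The function name `decode_genotype`, the flat-list representation, and the values chosen for every task other than $v_2$ are assumptions made for illustration; only the slots 2, 1, 0 of $v_2$ are taken from the example in the text.

```python
def decode_genotype(genes, mpd, has_qos_gene):
    """Decode a flat genotype into (qos_level, task -> list of core ids).

    genes: list of non-negative integers, mpd: task -> maximum parallelism
    degree (in genotype order), has_qos_gene: True for PCQM, False for QCPM.
    """
    position = 0
    qos = None
    if has_qos_gene:
        qos = genes[0]
        position = 1
    instances = {}
    for task, degree in mpd.items():
        slots = genes[position:position + degree]
        position += degree
        # A zero slot means "no valid mapping", i.e. the instance is unused.
        instances[task] = [core for core in slots if core != 0]
    if position != len(genes):
        raise ValueError("genotype length does not match the sum of mpd values")
    return qos, instances


# Hypothetical four-task dataflow in which only v2 is parallelizable.
mpd = {"v1": 1, "v2": 3, "v3": 1, "v4": 1}
qcpm_genes = [1, 2, 1, 0, 1, 2]        # length = sum of mpd values = 6
pcqm_genes = [2] + qcpm_genes          # one extra gene for the QoS level

_, qcpm_instances = decode_genotype(qcpm_genes, mpd, has_qos_gene=False)
pcqm_qos, pcqm_instances = decode_genotype(pcqm_genes, mpd, has_qos_gene=True)
```

The slots 2, 1, 0 of `v2` decode to two instances on cores 2 and 1, so the number of non-zero slots directly gives the chosen parallelism degree.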
The modified topology of Fig. 2(b) resulting from Fig. 2(c) is illustrated in Fig. 2(e). Note that two interface tasks, split and merge, are inserted to parallelize $v_2$. The interface tasks do not perform any meaningful computation, but only split (or merge) the incoming (or outgoing) data to (or from) the duplicated task instances. Therefore, they do not appear in the schedule, as they take a negligible amount of time to execute.
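A possible way to derive the modified edge set from the chosen parallelism degrees is sketched below. The function name `expand_topology` and the naming scheme for interface tasks and replicas (`split_v2`, `merge_v2`, `v2#0`) are hypothetical and only serve the illustration.

```python
def expand_topology(edges, instances):
    """Insert split/merge interface tasks around every replicated task.

    edges: set of (src, dst) in the original dataflow,
    instances: task -> number of instances chosen by the DSE (default 1).
    """
    def replicated(task):
        return instances.get(task, 1) > 1

    new_edges = set()
    for src, dst in edges:
        # Outgoing data of a replicated task leaves through its merge task,
        # incoming data enters through its split task.
        tail = f"merge_{src}" if replicated(src) else src
        head = f"split_{dst}" if replicated(dst) else dst
        new_edges.add((tail, head))
    for task, count in instances.items():
        if count > 1:
            for i in range(count):
                replica = f"{task}#{i}"
                new_edges.add((f"split_{task}", replica))
                new_edges.add((replica, f"merge_{task}"))
    return new_edges


# Hypothetical chain v1 -> v2 -> v3 in which v2 is duplicated twice.
modified = expand_topology({("v1", "v2"), ("v2", "v3")}, {"v2": 2})
```

The resulting graph routes `v1` into `split_v2`, fans out to `v2#0` and `v2#1`, and joins them in `merge_v2` before `v3`.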
Each individual of a population generated by the GA engine is a possible mapping candidate. The candidates are evaluated in terms of the predefined objectives. In QCPM, we aim at minimizing the average power consumption. Thus, we build a schedule with respect to the fixed mapping encoded in the gene by means of priority-based list scheduling. The scheduling result of the mapping solution in Fig. 2(c) is shown in Fig. 2(d) as a Gantt chart. For simplicity, we assume that all tasks' execution times are equal to 10 at the QoS level of 2. Note that an instance of $v_2$ and $v_4$ become executable at the same time on c2 at time 10. In this case, $v_2$ is chosen over $v_4$ as per the given priority setting, i.e., $pr_{v_2} < pr_{v_4}$. Once the schedule is obtained, the average power consumption is calculated with $P^{dyn}$, $P^{stat}$, and $P^{idle}$. In the case of Fig. 2(d), the power consumption of c1 is $P^{dyn}_{c1} + P^{stat}_{c1}$ as it is always busy. On the contrary, c3 is never used in the schedule and thus can be turned off, resulting in a power consumption of $P^{idle}_{c3}$. Given the schedule, c2 consumes as much power as $\frac{25 \cdot P^{stat}_{c2} + 15 \cdot P^{dyn}_{c2}}{25}$. In PCQM, on the other hand, the objective is to maximize a single integer value, $l$, which is encoded in the QoS part, as long as the time and power constraints are satisfied.
In addition to the objectives, the candidate solutions need to be checked against the given constraints. In both problems, they should satisfy the latency constraint, i.e., the time difference between the release of the first task and the completion of the last task should always be less than or equal to $T$. In the case of Fig. 2(d), the schedule is valid as the latency is equal to $T = 25$. If this condition is not met in a candidate solution, it is penalized with $\infty$ and $-\infty$ in QCPM and PCQM, respectively. Additionally, in PCQM, the power constraint is enforced. That is, if the average power consumption of a mapping solution is bigger than $PC$, its objective value is set to $-\infty$, which is interpreted as one of the worst solutions.
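The evaluation of a candidate, including the penalties described above, can be sketched as follows. The function names `average_power` and `fitness`, the rule that a core with no work is treated as turned off, the summation of per-core averages, and all numeric power values are assumptions for illustration; only the busy times (25 for c1, 15 for c2, none for c3) and $T = 25$ follow the example of Fig. 2(d).

```python
import math


def average_power(busy_time, schedule_length, p_dyn, p_stat, p_idle):
    """Sum of per-core average power over one schedule.

    A core without any busy time is assumed to be turned off (P_idle);
    otherwise it pays P_stat all the time and P_dyn while executing.
    """
    total = 0.0
    for core in p_stat:
        busy = busy_time.get(core, 0)
        if busy == 0:
            total += p_idle[core]
        else:
            total += p_stat[core] + p_dyn[core] * busy / schedule_length
    return total


def fitness(problem, latency, power, T, qos=None, PC=None):
    """Objective value with infinite penalties for constraint violations."""
    if problem == "QCPM":      # minimized: average power
        return power if latency <= T else math.inf
    if problem == "PCQM":      # maximized: QoS level
        feasible = latency <= T and power <= PC
        return qos if feasible else -math.inf
    raise ValueError(f"unknown problem: {problem}")


# Hypothetical power parameters (arbitrary units), identical for all cores.
p_dyn = {"c1": 2.0, "c2": 2.0, "c3": 2.0}
p_stat = {"c1": 0.5, "c2": 0.5, "c3": 0.5}
p_idle = {"c1": 0.1, "c2": 0.1, "c3": 0.1}

power = average_power({"c1": 25, "c2": 15}, 25, p_dyn, p_stat, p_idle)
qcpm_ok = fitness("QCPM", latency=25, power=power, T=25)
qcpm_late = fitness("QCPM", latency=30, power=power, T=25)
pcqm_ok = fitness("PCQM", latency=25, power=power, T=25, qos=2, PC=5.0)
pcqm_over = fitness("PCQM", latency=25, power=power, T=25, qos=2, PC=4.0)
```

With these illustrative numbers the schedule is feasible for a power budget of 5.0 but is rejected with $-\infty$ once the budget drops to 4.0.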
C. RUNTIME MANAGEMENT
Once the above two-step design time optimizations are performed, we obtain an FSM, in which each state is associated with an optimal dataflow modification and mapping solution. State transitions may be triggered either by external events (e.g. in PCQM) or internal requests from applications (e.g. in QCPM). We do not elaborate on the implementation of the event handling, which is not our novelty and beyond the scope of this paper. We borrow the hierarchical controller and event-management schemes proposed by Schor et al. [13] in the implementation.
Some state transitions may change the topology, as well as the mapping, of the original dataflow, which requires runtime task migrations. The overhead of migrating active tasks from one core to another is known to be non-negligible even for homogeneous-ISA multi-cores [27]. In most existing works [13], [14], the dataflow tasks are implemented as POSIX threads [28], where all the active tasks are artificially stopped and a new set of tasks is created at each state transition. These stopping and starting procedures introduce considerable time and power overhead during transitions. We observed that such a naive implementation of mode switching results in violations of the power constraint or delayed switching. Thus, in the following section, we propose a light-weight mode-switching method for POSIX-based multi-core systems.
V. LIGHT-WEIGHT MODE SWITCHING
As discussed in Section IV-C, the overhead of mode switching can be prohibitively big. This overhead is mainly due to the migration costs of tasks ($V$) and channel buffers ($E$) between tasks. To solve this problem, we propose a light-weight mode switching with buffer sharing in this section.
Typically, in the existing POSIX-based dataflow applications on multi-core systems [13] , [14] , a state transition is associated with stopping the old tasks and (re)creating new ones. To be more specific, all tasks belonging to the old state are stopped and the resources assigned to them are relinquished by invoking pthread_exit. On the other hand, a new set of tasks that correspond to the new state of the transition are (re)created and properly initiated with required resources by invoking pthread_create. Regarding the channel buffers, dynamic memory release (free) and allocation (malloc) procedures are properly invoked on each state transition.
Note that, in this approach, allocating and releasing resources each time a state transition occurs results in considerable overhead if state transitions are frequent. In order to avoid such heavy task stopping and (re)creation overheads, we propose to create all possible tasks a priori for the light-weight mode switching. In a given state, only the subset of tasks belonging to that state is selectively activated and all others remain paused. By doing so, the overheads associated with (re)creating/stopping tasks on each transition can be effectively reduced. For that, we modify the FSM transition of the basic framework [13] so that transitions do not cause POSIX threads to be created or stopped. Instead, as illustrated in Fig. 3, we insert an initial FSM state (surrounded by a dotted outline) that is responsible for installing (initially creating) all possible tasks and their associated buffers. All tasks are blocked by default in pthread_cond_wait, and, whenever a transition occurs, a defined set of tasks is released from blocking.
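The pause/resume idea can be illustrated with the following sketch. It is written in Python, so `threading.Condition` stands in for the POSIX condition variable on which the tasks block; the class name `LightweightSwitcher`, the one-firing-per-activation behavior, and the state and task names are assumptions made purely to keep the example small and deterministic, and do not describe the actual runtime of [13].

```python
import threading


class LightweightSwitcher:
    """All task threads are created once; a switch only wakes a subset."""

    def __init__(self, tasks_per_state):
        self.tasks_per_state = tasks_per_state
        self.cond = threading.Condition()
        self.state = None          # initial state: every task exists but is blocked
        self.running = True
        self.done = set()          # tasks that already fired in the current state
        all_tasks = sorted({t for ts in tasks_per_state.values() for t in ts})
        self.firings = {t: 0 for t in all_tasks}
        self.threads = [
            threading.Thread(target=self._task_body, args=(t,), daemon=True)
            for t in all_tasks
        ]
        for thread in self.threads:
            thread.start()

    def _runnable(self, task):
        active = self.tasks_per_state.get(self.state, ())
        return task in active and task not in self.done

    def _task_body(self, task):
        while True:
            with self.cond:
                # Analogue of blocking on a condition variable until activated.
                self.cond.wait_for(lambda: not self.running or self._runnable(task))
                if not self.running:
                    return
                self.firings[task] += 1     # stand-in for one firing of the task
                self.done.add(task)
                self.cond.notify_all()

    def switch(self, state):
        """Activate the tasks of `state`; no thread is created or destroyed."""
        with self.cond:
            self.state = state
            self.done = set()
            self.cond.notify_all()
            # For a deterministic demo, wait until each activated task fired once.
            self.cond.wait_for(lambda: self.done == set(self.tasks_per_state[state]))

    def shutdown(self):
        with self.cond:
            self.running = False
            self.cond.notify_all()
        for thread in self.threads:
            thread.join()


# Hypothetical two-state system: the second instance of v2 exists only in "high".
switcher = LightweightSwitcher({
    "low": ["v1", "v2#0", "v3"],
    "high": ["v1", "v2#0", "v2#1", "v3"],
})
thread_count = len(switcher.threads)
for next_state in ["low", "high", "low"]:
    switcher.switch(next_state)
switcher.shutdown()
firings = dict(switcher.firings)
```

Four threads are created up front and reused across all three transitions; the task `v2#1` stays blocked except while the `high` state is active.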
While this pause/resume scheme can effectively reduce the switching time, its benefit comes at the cost of increased memory usage, because all tasks for all possible states must be resident in primary memory even when they are not activated. To alleviate this memory overhead, we apply buffer sharing between tasks in different states. Note that two channel buffers belonging to different states are never actively used at the same time. Thus, we do not dynamically allocate buffers at runtime on state transitions; instead, tasks belonging to different states share a single statically allocated buffer by only changing the pointers. For that, when the dataflow topology for each state is fixed after the DSE step, the number and size of the channel buffers used in all states are investigated. Then, a global shared channel buffer, which is big enough to accommodate all channel buffers of any single state, is allocated. On a state transition, the allocated shared buffers are connected to the set of tasks that are activated. Through this shared buffer scheme, we can reduce the memory overhead that accompanies the light-weight mode switching.
This light-weight mode switching with buffer sharing is enabled by modifying the FSM obtained through the two-stage design-time optimization. We artificially insert an initialization state into the original FSM for instantiating all the tasks and allocating the shared buffer, as shown in Fig. 3. This inserted state does not change the functionality of the FSM since no other state has a transition to the inserted one. It is implemented using POSIX APIs on top of a publicly available multi-core runtime manager [13].
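The sizing of the shared buffer can be sketched as follows. The function names `plan_shared_buffer` and `channel_view`, the sequential placement of channels inside the pool, and the channel names and sizes are assumptions for illustration; in particular, any data that might have to survive a transition is ignored here.

```python
def plan_shared_buffer(channels_per_state):
    """Size one shared pool and place the channels of every state in it.

    channels_per_state: state -> {channel name: size in bytes}.
    Returns (pool_size, state -> {channel name: byte offset}).
    """
    # The pool must hold all channel buffers of any single state.
    pool_size = max(sum(sizes.values()) for sizes in channels_per_state.values())
    offsets = {}
    for state, sizes in channels_per_state.items():
        cursor = 0
        offsets[state] = {}
        for channel, size in sizes.items():
            offsets[state][channel] = cursor
            cursor += size
    return pool_size, offsets


def channel_view(pool, channels_per_state, offsets, state, channel):
    """Return the slice of the pool used by `channel` in `state` (no copy)."""
    start = offsets[state][channel]
    return memoryview(pool)[start:start + channels_per_state[state][channel]]


# Hypothetical channel sizes for two states.
channels = {
    "low": {"e1": 1000, "e2": 500},
    "high": {"e1": 1000, "e3": 400, "e4": 400, "e5": 500},
}
pool_size, offsets = plan_shared_buffer(channels)
pool = bytearray(pool_size)                       # allocated once, statically
naive_size = sum(sum(sizes.values()) for sizes in channels.values())
e5_view = channel_view(pool, channels, offsets, "high", "e5")
```

In this example the shared pool needs 2300 bytes, compared with 3800 bytes if the buffers of both states were kept resident separately; switching states only changes which offsets the tasks use.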
VI. CASE STUDY: STEREO-VISION FOR DRONES
In this section, we verify the applicability and effectiveness of the proposed technique with a case study on an embedded system: stereo-vision for drones. One of the important challenges of autonomous drones is obstacle avoidance [29]. Many solutions take advantage of expensive sensor modules such as radar or lidar [30] for detecting and avoiding obstacles. However, due to the high manufacturing cost, this approach is not generally applicable to low-end drones. An affordable alternative to this sensor-based solution is to use 2-D scenes obtained from two cameras and detect obstacles in the scenes by means of image processing. This process is called stereo-vision, and it requires considerable computational cost, as will be discussed later. The output of stereo-vision is a depth map in which the brightness of each pixel indicates how close the corresponding point is to the cameras. This depth map is used to avoid obstacles in the autonomous drone control algorithm.
We believe that stereo-vision for drones is a well-suited benchmark for the validation of the proposed technique, as it is real-time constrained, battery-powered, and computationally intensive. We implemented the stereo-vision algorithm on top of a multi-core embedded board and installed it on a commercial low-end drone. Note that, for controllable and repeatable comparisons, we did not perform the experiments by actually flying the drone; instead, the drone was moved manually by hand.
A. HARDWARE PLATFORM
Fig. 4 shows the hardware configuration of the drone system implemented as the case study. In order to capture the scenes with two horizontally aligned cameras, we use DUO3D [31], which weighs less than 30 g and is capable of capturing 752 × 480 images at 45 fps. For the real-time handling of the 2-D scene streams from the cameras, we implement the stereo-vision algorithm (elaborated in the following subsection) and run it on a multi-core embedded board, Odroid-XU3 [32]. The multi-core processor that Odroid-XU3 is equipped with is a big.LITTLE octa-core AP with quad Cortex-A15s (big, performant but power consuming) and quad Cortex-A7s (little, slow but power efficient). For the derivation and calibration of the power model ($P^{dyn}$, $P^{stat}$, and $P^{idle}$), we used an on-board current sensor, INA231, which is integrated in Odroid-XU3 by default.
B. STEREO-VISION AND ITS DATAFLOW MODELING
Stereo-vision obtains two separate images from the two horizontally placed cameras, just like human eyes, and calculates a pixel-wise depth map by comparing the two. Among the several well-known stereo-vision algorithms, we adopted the block matching (BM) algorithm [33], in which the depth of each pixel is estimated by examining the Sum of Absolute Differences (SAD) values of the surrounding block. Fig. 5 illustrates the principle of the BM stereo-vision algorithm. First, it determines which pixel in the right image corresponds to a given pixel in the left image. This pixel-wise correspondence is resolved by choosing the candidate position that minimizes the SAD value. In other words, to find the match for a pixel in the left image, the location that minimizes the sum of absolute differences over the 9 surrounding pixel values is taken as the pixel's position in the right image. Once the pixel matching is completed, we can determine how far the object associated with this pixel is from the cameras by comparing the positions of the two matched pixels. If the displacement between the two pixels is large, the object is judged to be near, and vice versa. It is worth mentioning that BM is highly parallel (the candidate positions can be examined simultaneously without any dependencies) and computationally intensive (the calculation of SAD involves a large number of subtractions and additions).
Fig. 6 shows how the BM algorithm is described in the dataflow specification. First, two images are obtained in the source task (Input Image). The obtained images are then divided into a number of sub-images in the following task (Split Image). These sub-images are processed by the main tasks, denoted as Stereo Match, which are the most computationally demanding. Note that Stereo Match offers plenty of parallelism, since the block matching procedures are completely independent pixel by pixel.
That is, in our model, the maximum parallelism degree, mpd, of this task is greater than 1. The stereo matching results of the sub-images are merged back in Merge Image, and, finally, the depth map is acquired in the last task (Mode Decision). In our implementation, a pixel of the depth map is an integer value between 0 and 255 representing the relative distance of a subject, with 255 being the nearest. For QCPM, Mode Decision is responsible for deciding the required QoS level, as will be explained in the next section. To keep the drone system reliable, stereo-vision is required to operate at no less than 4 fps, i.e., the latency constraint is set to T = 250 ms.
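The block matching principle described in this subsection can be illustrated with a minimal sketch. It is not the implementation used in the case study: the function names, the 3 × 3 block (`half=1`, matching the 9 surrounding pixels mentioned above), the search range `max_disp`, and the plain nested-list image format are all assumptions made for illustration.

```python
def sad(left, right, y, xl, xr, half):
    """Sum of absolute differences between the block centered at (y, xl)
    in the left image and the block centered at (y, xr) in the right image."""
    total = 0
    for dy in range(-half, half + 1):
        for dx in range(-half, half + 1):
            total += abs(left[y + dy][xl + dx] - right[y + dy][xr + dx])
    return total


def block_match(left, right, max_disp, half=1):
    """Return a disparity map with the same size as the inputs.

    For every interior pixel of the left image, each candidate displacement
    d in [0, max_disp] is scored with SAD and the minimizer is kept.
    Border pixels are left at 0. A larger disparity means a nearer object.
    """
    h, w = len(left), len(left[0])
    disp = [[0] * w for _ in range(h)]
    for y in range(half, h - half):
        for x in range(half, w - half):
            best_d, best_cost = 0, None
            for d in range(max_disp + 1):
                if x - d - half < 0:      # candidate block would leave the image
                    break
                cost = sad(left, right, y, x, x - d, half)
                if best_cost is None or cost < best_cost:
                    best_d, best_cost = d, cost
            disp[y][x] = best_d
    return disp
```

Every pixel is scored independently of all others, which is the property that allows Stereo Match to be duplicated across cores after the image is split.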
VII. EXPERIMENTS
A. BENCHMARKS
In this section, we quantitatively validate the effectiveness of the proposed technique in both QCPM and PCQM with the case study presented in Section VI. In addition, to examine how the proposed technique scales to bigger applications, we also perform the evaluations using synthetic dataflow examples. For that purpose, we used SDF3 [34] to generate three synthetic dataflow graphs, which consist of 10, 50, and 100 tasks, respectively. Among the generated tasks, the top 30 % in terms of execution time are selected, and their mpd values are set equal to the number of cores, i.e., 8 in the target processor. While the execution times for the multiple QoS levels in the case study are profiled by experiments, as will be explained in the following subsection, we artificially impose multiple QoS levels and their execution times in the synthetic benchmarks. More specifically, the execution time assigned by SDF3 is assumed to be the execution time at the highest QoS level, and that of a lower level is modeled based on equation (1). The synthetic dataflow topologies and their execution time information are illustrated in Fig. 13 and Table 3 in the Appendix.
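The selection of parallelizable tasks in the synthetic benchmarks can be sketched as follows. The function names, the dictionary representation of a graph, and the floor rounding of the 30 % threshold are assumptions; the text above only states that the top 30 % of tasks by execution time receive an mpd equal to the number of cores.

```python
def select_parallel_tasks(exec_times, percent=30):
    """Return the names of the top `percent` % of tasks by execution time.

    exec_times maps a task name to its execution time. Integer arithmetic
    is used so that the cut-off is an exact floor (an assumed rounding rule).
    """
    k = len(exec_times) * percent // 100
    ranked = sorted(exec_times, key=exec_times.get, reverse=True)
    return set(ranked[:k])


def assign_mpd(exec_times, num_cores=8, percent=30):
    """Map each selected task to mpd = num_cores; other tasks are not listed."""
    return {task: num_cores for task in select_parallel_tasks(exec_times, percent)}
```

For a 10-task graph this marks the three longest-running tasks as parallelizable up to 8 instances.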
B. QoS AND POWER MODELINGS OF STEREO-VISION
In the case study presented above, stereo-vision is used to enable efficient obstacle avoidance in drones. Since more detailed depth maps are more helpful for obstacle avoidance, it makes sense to interpret the variable resolutions of the stereo-vision processing as different QoS levels. In other words, during the flight, the stereo-vision algorithm is allowed to adjust the resolution of the input images fed to it by the cameras as necessary. When no meaningful obstacles or objects are detected in the input images, it may lower the QoS level, i.e., the image resolution, in exchange for better power efficiency. Conversely, when the algorithm starts detecting objects in the images, it is better to switch to a higher QoS level in favor of enhanced avoidance capability.
In this experiment, five image resolutions are considered: l_1 = 320 × 240, l_2 = 384 × 288, l_3 = 480 × 320, l_4 = 600 × 480, and l_5 = 640 × 480. The resolution of the next stereo-vision processing is decided in the last task of Fig. 6, Mode Decision, based on the current depth map as follows. When no considerably near object is detected, i.e., all pixels in the calculated depth map are less than or equal to 55, it sets the next QoS level to the lowest one, l_1. Otherwise, if a depth map pixel between 56 and 105 is found, the level is switched to l_2. Similarly, any depth map pixel between 106 and 155 (156 and 205) results in a state transition to l_3 (l_4). In all other cases, the QoS level is set to the highest one, l_5.
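The decision rule above can be written down as a small function. This is a sketch under one stated assumption: when pixels fall into several ranges, the nearest pixel (the maximum depth map value) determines the level. The names `QOS_RESOLUTIONS` and `next_qos_level` are illustrative only.

```python
# QoS level index -> input resolution (width, height), as listed in the text.
QOS_RESOLUTIONS = {
    1: (320, 240),
    2: (384, 288),
    3: (480, 320),
    4: (600, 480),
    5: (640, 480),
}


def next_qos_level(depth_map):
    """Pick the next QoS level from a depth map of integers in [0, 255].

    255 denotes the nearest subject. The level is chosen from the nearest
    pixel in the map (assumed interpretation of the thresholds).
    """
    nearest = max(max(row) for row in depth_map)
    if nearest <= 55:
        return 1
    if nearest <= 105:
        return 2
    if nearest <= 155:
        return 3
    if nearest <= 205:
        return 4
    return 5
```

The returned index can then be used to look up the resolution for the next frame, for example `QOS_RESOLUTIONS[next_qos_level(depth_map)]`.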
We repeatedly profiled the execution time of Stereo Match at each of the above resolutions, {320 × 240, 384 × 288, 480 × 320, 600 × 480, 640 × 480}. Fig. 7 shows that the average execution time of Stereo Match increases with the QoS level on both the big and little cores. As assumed in the workload model, the execution time is modeled as a monotonically increasing function, exec, of the QoS level. Based on the repeated profiling results, we approximate exec by quadratic polynomials for the two types of cores:
$$
\begin{aligned}
exec(l)_{SM,big} &= 1.3 \cdot 10^{-9} \cdot l^{2} + 0.00016 \cdot l - 0.5 \\
exec(l)_{SM,little} &= 4.1 \cdot 10^{-9} \cdot l^{2} + 0.00058 \cdot l - 23
\end{aligned}
\tag{1}
$$
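To give a feel for equation (1), the following sketch evaluates both polynomials at the five resolutions. Two assumptions are made that the text does not state explicitly: l is taken to be the number of pixels (width × height) and the result is read as milliseconds. The helper `min_instances` is a hypothetical, idealized estimate of how many parallel Stereo Match instances would be needed to fit within T = 250 ms if the work split perfectly; it ignores the other tasks and any splitting overhead, so it is not the result of the DSE.

```python
import math

LATENCY_MS = 250.0  # latency constraint T from the case study
RESOLUTIONS = [(320, 240), (384, 288), (480, 320), (600, 480), (640, 480)]


def exec_big(l):
    """Equation (1), big core; l assumed to be the pixel count."""
    return 1.3e-9 * l ** 2 + 0.00016 * l - 0.5


def exec_little(l):
    """Equation (1), little core; l assumed to be the pixel count."""
    return 4.1e-9 * l ** 2 + 0.00058 * l - 23.0


def min_instances(exec_ms, deadline_ms=LATENCY_MS):
    """Idealized lower bound on parallel instances needed to meet the deadline."""
    return max(1, math.ceil(exec_ms / deadline_ms))


for width, height in RESOLUTIONS:
    pixels = width * height
    print(width, height,
          round(exec_big(pixels), 2), min_instances(exec_big(pixels)),
          round(exec_little(pixels), 2), min_instances(exec_little(pixels)))
```

Under these assumptions, a single little core fits the lowest resolution comfortably, while the highest resolution exceeds 250 ms on one little core and therefore needs either parallel instances or a big core.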
During the repeated profiling, we also measured the power consumption to calibrate the power model. P_dyn = 2.05 Watts, P_stat = 0.26 Watts, and P_idle = 0.15 Watts were obtained for the big core, while P_dyn = 0.18 Watts, P_stat = 0.0675 Watts, and P_idle = 0.0275 Watts were obtained for the little core. The target multi-core processor has a special deep power-down mode for the big cores, in which all four big cores are turned off together. This can significantly reduce the P_idle value. In line with this feature, we slightly modify the power model: if the four big cores are turned off at the same time, each big core is associated with another idle power value, P'_idle < P_idle, which is measured to be 0.03 Watts. This deep power-down is not applicable to the little cores, since at least one little core must remain active to serve the OS kernel.
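The effect of the deep power-down mode on idle power can be sketched with the measured values above. The function below only accounts for the idle contribution of unused cores. It assumes that a core counts as idle when no task is mapped to it and that the big cluster enters deep power-down whenever none of its cores is used. The name `idle_power` and its interface are hypothetical.

```python
P_IDLE_BIG = 0.15        # Watts, idle big core (cluster still powered)
P_IDLE_BIG_DEEP = 0.03   # Watts per big core when all four are turned off
P_IDLE_LITTLE = 0.0275   # Watts, idle little core
NUM_BIG = 4
NUM_LITTLE = 4


def idle_power(active_big, active_little):
    """Idle power (Watts) contributed by the cores that run no task.

    At least one little core must stay active to serve the OS kernel.
    """
    if not 0 <= active_big <= NUM_BIG:
        raise ValueError("active_big out of range")
    if not 1 <= active_little <= NUM_LITTLE:
        raise ValueError("active_little out of range")
    if active_big == 0:
        big = NUM_BIG * P_IDLE_BIG_DEEP            # deep power-down of the cluster
    else:
        big = (NUM_BIG - active_big) * P_IDLE_BIG  # remaining big cores idle normally
    little = (NUM_LITTLE - active_little) * P_IDLE_LITTLE
    return big + little
```

With these numbers, waking a single big core raises the idle contribution of the big cluster from 0.12 Watts to 0.45 Watts, which is one reason configurations without big cores are attractive at low QoS levels.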
C. EXPERIMENT RESULTS OF QCPM AND PCQM
1) QoS-CONTROLLED-POWER-MINIMIZATION (QCPM)
In this experiment, the adaptability of stereo-vision to QoS requirement variations is validated in terms of power minimization. Given the five QoS levels presented above, an individual DSE was performed for each level, and the results are illustrated in Fig. 8. The modified topologies for the five levels are illustrated in Fig. 14 in the Appendix. We observed that a higher QoS level requires a higher degree of parallelism in the dataflow application, and the proposed technique successfully adapts the dataflow application to increase parallelism as the QoS level increases, as shown in Fig. 14. At the lowest QoS level, the application is optimized to use only one little core, with an average power consumption of 0.45 Watts for executing the stereo-vision dataflow application. At the other extreme, the highest level, the optimal configuration uses 2 big and 4 little cores, and the average power consumption is 4.13 Watts.
Now that the optimal configurations for the different QoS levels have been obtained, we show how the proposed technique actually adapts in a specific scenario. An example runtime QoS change scenario for QCPM is illustrated in Fig. 9. In this case, the application was executed for about 53 seconds along a particular flying route. According to the mode decision policy presented in the case study section, 9 QoS level changes were detected during this flight scenario.
The benefit of the proposed adaptation technique is shown in Table 1 in comparison to the statically optimized configurations. If the avoidance capability of the drone is the primary design concern, it is best to always maintain the highest QoS level (l_5) in the static optimization. Compared to this case, the proposed adaptation approach achieves an average power saving of 53.03 %. One may instead enforce reduced QoS levels at design time, say l_1 to l_3, to obtain better power efficiency. In these cases, however, considerable QoS degradation arises, i.e., the total number of pixels processed by stereo-vision during the flight decreases significantly.
We performed the same experiments with the three synthetic dataflow benchmarks to validate the generality and scalability of the proposed technique. Their dataflow graphs have 10, 50, and 100 tasks (vertices), and they are referred to as #10, #50, and #100, respectively, in the table. Overall, we observed the same tendency: the adaptive approach outperformed the statically optimized ones when both the average power and QoS are considered.
2) POWER-CONSTRAINED-QoS-MAXIMIZATION (PCQM)
In the PCQM optimization, a context change is triggered by the fluctuating power supply. In the first phase of the offline optimization, we derive the FSM whose five states denote power budgets of 4.5, 4.0, 3.0, 2.0, and 0.5 Watts, respectively. The five individually optimized configurations are plotted in Fig. 10 as 2-D coordinates of QoS and average power consumption. At the smallest power budget, 0.5 Watts, the system does not have the luxury of turning on more than one little core and, due to the time constraint, its maximized QoS stays at the minimum, 320 × 240. In contrast, at the maximum power budget of 4.5 Watts, it can afford the highest QoS, 640 × 480. In this way, in the static optimizations, the power-optimal and QoS-optimal points fall into separate configurations. The modified topologies for the five power budgets are illustrated in Fig. 15 in the Appendix. Higher power budgets tend to drive the dataflow application toward a higher degree of parallelism, as the objective is to maximize the amount of data processed within the latency constraint. Again, the proposed technique successfully adapts the dataflow application to have more duplicated tasks as the power budget increases, as shown in Fig. 15.
Similarly to the previous experiment, we applied an example power budget variation scenario, as depicted in Fig. 11. In the testing scenario of 70 seconds, 10 power budget change events occur, and we observed that the configuration is adjusted by the proposed technique: each time the power budget changes, the system is adapted to one of the five precomputed configurations. Table 2 shows the average power consumption and the total number of pixels processed in stereo-vision as an indicator of QoS. The proposed solution effectively maintained the highest possible QoS while always keeping the average power consumption below the transiently varying power budget. For the entire flight, the average power consumption was 2.14 Watts. We also performed the PCQM optimization for the synthetic benchmarks, and the same observation was made, as shown in the table.
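The runtime side of PCQM amounts to looking up a precomputed state whenever the budget changes. A minimal sketch of such a lookup is shown below, assuming that an arbitrary budget value should map to the state with the largest budget that does not exceed it; the paper's FSM may instead be driven directly by discrete budget events. The name `select_state` is illustrative.

```python
from bisect import bisect_right

# Power budgets (Watts) of the five FSM states, in ascending order.
STATE_BUDGETS_W = [0.5, 2.0, 3.0, 4.0, 4.5]


def select_state(budget_w):
    """Return the budget of the state to switch to for the given power budget.

    The largest state budget that does not exceed budget_w is chosen, so the
    configuration optimized for that state never plans for more power than
    is currently available.
    """
    i = bisect_right(STATE_BUDGETS_W, budget_w)
    if i == 0:
        raise ValueError("power budget below the smallest precomputed state")
    return STATE_BUDGETS_W[i - 1]
```

Because the five configurations are computed offline, the online cost of a context change is this lookup plus the mode switch itself.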
D. MODE SWITCHING
Note that, in the experiments performed thus far, the naive mode switching method has been used, where all the currently running tasks are stopped first (pthread_exit), and only after that is a new set of tasks brought into memory and started (pthread_create). In this case, transitions are likely to result in prohibitively long delays. In the next experiment, we compare the proposed light-weight mode switching with the naive implementation. Fig. 12 shows a comparison of the transition overheads in terms of average switching delay and peak memory usage in QCPM and PCQM of stereo-vision. With the naive switching, as shown in Fig. 12(a), the average delay for mode switching in QCPM was 233.79 ms, which means that it typically misses 1 to 2 frames at each transition during the entire scenario of Fig. 9. With the light-weight mode switching, on the other hand, the average switching time of stereo-vision in QCPM is significantly reduced to 2.07 ms.
The gain in switching time comes at the cost of increased memory. As shown in Fig. 12(a), the peak memory usage of the scenario in Fig. 11 increases from 12.73 MB to 21.36 MB. By further applying buffer sharing, the peak memory usage is reduced to 16.43 MB. Note also that buffer sharing imposes a small additional time overhead of about 0.02 ms for connecting the shared buffer during a state transition. However, the overall memory footprint is still bigger than that of the naive method. This is because all tasks in all states must be resident in memory in the proposed light-weight mode switching. Further optimizations considering the trade-off between memory usage and switching efficiency remain as future work.
As shown in Fig. 12(b), we observed essentially the same tendency in PCQM. In PCQM, it is particularly important to keep the system under the given power budget. Note that the power consumption during a transition is not explicitly modeled in the DSE. Thus, we empirically measured the transient power consumption profile of the scenario depicted in Fig. 11 and observed power budget violations in 4 transitions (out of 9). By applying the light-weight mode switching with buffer sharing, the average mode switching time was reduced to about 2 ms, similarly to QCPM. It is noteworthy that the power budget violations were completely removed by the light-weight mode switching with buffer sharing. This indicates that the proposed mode switching is also effective in terms of power consumption.
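The idea of keeping all tasks of all states resident, rather than destroying and recreating threads at each transition, can be sketched as follows. This is only an illustration in Python threads, not the POSIX-thread implementation evaluated above: the class `ResidentModes`, its methods, and the queue-based hand-off are assumptions, and buffer sharing is not modeled.

```python
import queue
import threading


class ResidentModes:
    """One worker thread per mode, created once and kept blocked while inactive."""

    def __init__(self, handlers):
        # handlers maps a mode name to the function that processes one item.
        self._inbox = {mode: queue.Queue() for mode in handlers}
        self._outbox = queue.Queue()
        self._threads = [
            threading.Thread(target=self._worker, args=(mode, fn), daemon=True)
            for mode, fn in handlers.items()
        ]
        for thread in self._threads:
            thread.start()
        self.active = next(iter(handlers))

    def _worker(self, mode, handler):
        while True:
            item = self._inbox[mode].get()   # blocks while this mode is inactive
            if item is None:                 # sentinel used by shutdown()
                return
            self._outbox.put(handler(item))

    def switch(self, mode):
        """Mode switch: no thread is created or destroyed, only the target changes."""
        if mode not in self._inbox:
            raise KeyError(mode)
        self.active = mode

    def process(self, item):
        """Send one item (not None) to the active mode and wait for its result."""
        self._inbox[self.active].put(item)
        return self._outbox.get()

    def shutdown(self):
        for inbox in self._inbox.values():
            inbox.put(None)
        for thread in self._threads:
            thread.join()
```

The trade-off matches the one reported above: switching is cheap because nothing is created at transition time, but every mode's threads and buffers occupy memory for the whole run.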
VIII. CONCLUSION AND SUMMARY
In this paper, we propose a context-aware adaptive optimization technique for dataflow applications tailored to multi-core embedded systems, together with an efficient mode switching method. With the proposed technique, runtime dynamic factors such as varying QoS requirements or a fluctuating power budget can now be considered in the offline optimization of multi-core embedded systems. For efficient mode transitions, we propose a light-weight mode switching method, which is implemented with POSIX threads. The memory overhead caused by the light-weight switching is alleviated by buffer sharing between tasks in different states. We demonstrate the effectiveness of the proposed technique with a real-life case study on the design and optimization of a computer-vision application for drones, as well as with synthetic dataflow benchmarks. The proposed technique makes it possible to find a reasonable compromise between power consumption and QoS.
As future work, it needs to be investigated how to derive more suitable and compact FSM topologies in the state derivation step with respect to the given application and contexts. This could enable more sophisticated switching with better memory management. Further, it also remains as future work to combine the proposed technique with power management schemes such as Dynamic Voltage-Frequency Scaling (DVFS) or Dynamic Power Management (DPM).
APPENDIX
Fig. 13 illustrates the synthetic dataflow benchmarks that are evaluated in Section VII. The execution times of the tasks (vertices) shown in Fig. 13 are summarized in Table 3. The modified topologies of Fig. 6 derived from DSE are shown in Fig. 14 and Fig. 15 for QCPM and PCQM, respectively.
TABLE 3. Execution times for the tasks in synthetic dataflow benchmarks presented in Fig. 13.
ACKNOWLEDGMENT
A preliminary version of this article appeared in the proceedings of the Design Automation Conference (DAC) 2018, under the title "Context-Aware Dataflow Adaptation Technique for Low-Power Multi-Core Embedded Systems" [1].
