Abstract-Single microcontroller embedded systems cannot easily satisfy the computational requirements of systems observing physical phenomena via multiple channels at high sampling rates. Flash FPGAs can provide the necessary trade-off between adaptivity and computational power; however, fewer developers are familiar with them. Thus, we propose a soft multi-core architecture in the fabric, forming a loosely coupled network with a queue-based messaging framework for inter-core communication. This platform provides parallel improvements (as per Amdahl's Law) and a familiar Harvard abstraction. The nesC language was chosen for programming, as it enables modularity and assignment of independent tasks to cores. The single-core development environment was augmented to help with the transition to the new architecture. A cycle-accurate system simulator, called Avrora, was enhanced to fully support multi-core platforms and whole sensor networks. The architecture provides better power consumption and response time properties for time-critical applications by effectively pipelining tasks.
I. INTRODUCTION
Complete embedded systems are now designed with only a central MicroController Unit (MCU) and a set of integrated and external peripherals. However, new use cases are emerging whose processing requirements cannot be met by these conventional architectures. A subset of Cyber-Physical Systems (CPSs) observe physical phenomena and process the recordings instantly. These new applications have more complex computational requirements. Low power operation and high computational throughput may only be met together by reconfigurable parallel computing and the most recent silicon technology.
Field-Programmable Gate Arrays (FPGAs) can implement arbitrary digital circuits. For high-throughput, computation-intensive tasks, they outperform general-purpose MCUs. Unfortunately, for resource-constrained battery-operated devices, they consume too much power. Conventional FPGAs store their configuration in an external memory. Every time they start up, the contents of said memory have to be read, resulting in an initial phase when the device is already powered on, but not yet configured. During this short time period conventional FPGAs draw high inrush currents. However, flash-based FPGA technology stores the configuration directly on chip, making the power saving feature of duty-cycling viable. The main issue of FPGA technology is the fundamentally different programming paradigm. MCUs prominently employ simple imperative languages, like C, describing algorithms in a sequential manner. FPGAs, on the other hand, follow a Register-Transfer Level (RTL) programming concept with naturally more concurrent languages, like Very high speed integrated circuits Hardware Description Language (VHDL) [1]. Certain classes of calculations can be simply expressed with the latter, resulting in efficient Intellectual Property (IP) core implementations. Yet, in most cases of high-level algorithm development, the RTL abstraction is cumbersome.
Thus, instead of utilizing the hardware directly, soft processing cores should be instantiated in the fabric, which can then be programmed the conventional way. Consequently, developers can use familiar languages and development environments. This necessitates a software development framework that supports multi-core platforms. Our research focuses on this direction, but with the main idea of instantiating not one, but several soft cores, in order to have a single-chip multi-core embedded architecture. The central motivation for our approach stems from the observation that contemporary embedded CPSs have to perform many loosely connected high-throughput tasks in a tightly timed parallel manner.
Our concept involves the functional decomposition of complex applications to identify autonomous modular parts. Components associated with certain functionality are assigned to individual soft processing cores, which only have limited responsibility. From the programmer's point of view, this approach yields simpler per-core programs and reduced complexity. From an architectural point of view, the different cores may have different parameters, for example, lower clock rates, which may reduce power consumption. But, as the cores only serve certain limited purposes, latency and response time can still be lower than in the single MCU case.
The outlined system concept has many associated challenges. Parallel architectures have already been researched, but never widely employed in the resource-constrained embedded application field. The described reconfigurable computing-based architecture will have to support integration of hardware and software components. The assignment of these components to cores also has to be simple, with automated tools helping developers find feasible solutions.
The contributions of this paper include an end-to-end design approach that yields multi-core applications with event-driven inter-core communication, a detailed simulation environment, and a case study.
II. HARDWARE ARCHITECTURE
The benefits of this architecture are reduced complexity and simpler programs. The independent parts are much more predictable and exhibit a more easily understandable behavior, while still providing lower latency and shorter response times. Also, cores can run at their own optimal speeds, forming a Globally Asynchronous Locally Synchronous (GALS) system. The GALS design can achieve power savings due to two main factors. Firstly, computing cores can choose lower clock rates. Secondly, there is no power dissipation from routing a global clock signal across the Integrated Circuit (IC). The clock distribution network is significantly easier to design at the local level. Global clock skew and slew rate issues are also alleviated since the design complexity of each synchronous block is reduced and easier to manage. Furthermore, the presence of several independent clocks can reduce switching noise, and GALS systems also tend to be more resilient towards ElectroMagnetic Interference (EMI) [2].
1) MarmotE hardware platform:
We developed a low-power, multi-channel, wireless sensor node called MarmotE [3] (see Fig. 1) to serve as an example for the above described novel flash FPGA platform concept. Our most important goal was to enable experimentation with power saving techniques (such as energy harvesting), various analog sensor and radio front-ends, reconfigurable processing (such as cross-layer optimization for Radio Frequency (RF) communication), and embedded multi-core computing approaches. The platform follows a modular layered approach, and can be physically and logically divided into three parts. The bottom layer manages energy, featuring power monitoring and interfaces for batteries, wall power, and other sources, e.g., energy harvesting units. The middle layer is responsible for domain conversion, digital processing based on the SmartFusion flash FPGA, and high-speed connectivity such as Universal Serial Bus (USB) or Ethernet. The application-specific front-end layer has baseband amplifiers and carries an RF chip for wireless communication. The stacked architecture makes it possible to seamlessly replace the top-layer radio front-end and the bottom-layer power supply modules, while keeping the same mixed-signal processing module intact.
2) AVR HP soft processor:
The AVR HP is a soft processor IP that may be employed in a soft multi-core architecture. It is a hyper pipelined (thus the HP extension in the name) version of an AVR type processor written in Verilog. The AVR platform in general is a Harvard architecture and Reduced Instruction Set Computing (RISC) type processor.
Digital logic is implemented in fabric, and thus, the key question here is how many of the MCUs may be instantiated. If the cores are the same, it is easy to see that certain parts are instantiated redundantly several times without actually being fully utilized. It is this realization that sparked interest in conjoined processor architectures, where certain operational units, e.g., the multiplier or the divider, may be shared among many concurrent cores. In a sense, the most advanced form of this sharing is System Hyper Pipelining (SHP), where every part of the processor is shared, except for core registers, which store the current state of the processor. The AVR HP is designed along these lines. Synthesis and place-and-route results indicate that up to around 10 AVR HP cores could fit in the fabric.
A. Soft multi-core architecture
We will assume that the number of cores is limited to a fairly low number, on the order of ten. Thus, every core can be directly connected to those it has to communicate with. Alternatively, a bus system may be employed [4], but it is very likely not necessary. Resource allocation and deployment, scheduling, and optional load balancing can be centralized in a manager; there is no need for distributed solutions.
The proposed final hardware architecture can be categorized as a loosely coupled, circuit switched, Asymmetric MultiProcessing (AMP) system made up of Harvard type cores. Cores can be connected to hardware peripherals or connected to special purpose processing blocks implemented within fabric. The key part of the architecture is the introduction of a queue-based messaging framework for inter-core communication, which utilizes both FPGA fabric and block memories, and has associated components in software as well.
Cores are assigned individual block memories, which are divided into three parts:
• Program memory or, if the code is too big to fit, instruction cache (completely transparent for the core)
• Data memory (including static and dynamic data)
• Message queue (one memory mapped message queue for every core)
Data memory is distributed and not shared among cores, and (in order to avoid race conditions) other processors can only influence the content of it indirectly through messages. As for the program memory, if individual program memories are large enough to hold the whole code, the binaries may be loaded directly into individual program memories.
Within this framework each core and IP core has a dedicated hardware message queue also utilizing block memory. Inter-core communication is exclusively done by sending short messages to these queues. The queues can hold messages of arbitrary format.
Asynchronous or parallel execution is achieved by putting messages in the core's message queue, with a process running on the core solely dedicated to sequentially removing said messages, and running associated functions. The queue write operation is a cooperation of different memory handlers.
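The dispatch process described above can be illustrated with a minimal sketch. It is written in Python rather than nesC for brevity, and it is not the platform's implementation: the names `Core`, `register`, `post`, and `run_pending` are hypothetical, and a Python deque stands in for the memory-mapped hardware queue.

```python
from collections import deque


class Core:
    """Toy model of one soft core with its own message queue (hypothetical names)."""

    def __init__(self, name):
        self.name = name
        self.queue = deque()   # stands in for the memory-mapped hardware queue
        self.handlers = {}     # message type -> function to run on this core
        self.log = []          # results of executed handlers, for inspection

    def register(self, msg_type, handler):
        self.handlers[msg_type] = handler

    def post(self, msg_type, payload):
        # Called by any core (including this one) to request asynchronous work.
        self.queue.append((msg_type, payload))

    def run_pending(self):
        # The dedicated dispatch process: remove messages in FIFO order and
        # run the function associated with each one.
        while self.queue:
            msg_type, payload = self.queue.popleft()
            self.log.append(self.handlers[msg_type](payload))


dsp = Core("dsp")
dsp.register("sample", lambda value: value * 2)   # placeholder processing step
dsp.post("sample", 21)
dsp.post("sample", 4)
dsp.run_pending()
```

After `run_pending()` returns, the queue is empty and `dsp.log` holds the handler results in arrival order, which mirrors the sequential removal of messages described in the text.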
The queue-based messaging framework provides rudimentary routing as well, but because communicating cores are expected to be hard wired in a direct point-to-point manner to each others' queues, the routing problem is reduced to mere switching. This has the advantage that the hardware parts of the framework become simple logic functions, and hence, less fabric is required when synthesized.
In fact, the limiting factor of the framework is the available block memory and not the fabric. For instance, message queues have to be big enough to hold all the messages without overflowing throughout the normal operation of the device. In-depth analysis is unavoidable for obtaining the exact required minimum queue and memory sizes for a given application. The Avrora simulator, described in later sections, can be employed for this step.
To maintain backward compatibility, the queue-based messaging framework was added in such a way that it does not interfere with already existing system parts. Using the memory handler, the cores are able to push messages into the queue of any other core, even into their own. The queue size is limited to ten entries for the multi-core platform and can be extended if need be. The message-passing process employs a request-reply approach. The transmitter first requests an empty slot in the message queue, to which the receiver replies whether the request was granted or not. If granted, the transmitter sends the actual message.
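The request-reply handshake can be sketched as follows. This is an illustrative Python model under stated assumptions, not the hardware logic: the class `MessageQueue`, its methods, and the helper `send` are hypothetical names, and the bookkeeping of granted-but-unfilled slots is one possible way to keep a grant valid until the message arrives.

```python
class MessageQueue:
    """Bounded receiver-side queue; the default capacity of ten follows the text."""

    def __init__(self, capacity=10):
        self.capacity = capacity
        self.slots = []       # messages already delivered
        self.reserved = 0     # slots granted but not yet filled (assumption)

    def request_slot(self):
        # Step 1: the transmitter asks for an empty slot; the receiver replies
        # with True (granted) or False (denied).
        if len(self.slots) + self.reserved < self.capacity:
            self.reserved += 1
            return True
        return False

    def deliver(self, message):
        # Step 2: the actual message may only follow a granted request.
        if self.reserved == 0:
            raise RuntimeError("no slot was granted")
        self.reserved -= 1
        self.slots.append(message)


def send(queue, message):
    """Return True if the message was accepted, False if the request was denied."""
    if not queue.request_slot():
        return False
    queue.deliver(message)
    return True


rx = MessageQueue()
results = [send(rx, n) for n in range(12)]
```

With nothing draining the queue, the first ten sends are granted and the last two are denied, so the transmitter learns about a full queue before it transfers any payload.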
III. PROGRAMMING PARADIGM
The ultimate goal is to augment the single-core programming approach so that it enables programmers to easily distribute applications over many cores. Ideally, existing single-core software projects should be fairly simple to migrate to the new multi-core architecture. The resulting multi-core application should then have improved properties compared to the old one, particularly with regard to reduced event misses. Practically speaking, this means requirements for structural composability and modularity. The network embedded systems C (nesC) programming language provides the necessary modularity and fits the multi-core concept; thus, it was chosen as the base language.
A. The TinyOS framework and nesC
The nesC language, widely employed in the Wireless Sensor Network (WSN) community, addresses the lack of modularity and reusable components in C [5]. It is a component-based event-driven extension of C meant for a framework called the TinyOS platform [6]. TinyOS is an Operating System (OS) designed to run on resource-constrained hardware platforms. The core OS requires only 400 bytes of code and data memory, combined.
From a code structuring point of view, applications for the framework at the topmost level can be described in terms of graphs. The nodes are components, which encapsulate functionality and state, and expose a subset of them through interfaces. Edges are bi-directional connections between components via these interfaces. For these connections, or wirings in nesC terminology, only the interfaces need to be specified, not the associated components. More specifically, applications are a combination of hypergraphs and multigraphs. This is because several components can be wired up using the same interface, and components may have many parallel connections.
From an execution model point of view, the OS does not support thread-based concurrency, in which thread stacks would consume precious memory. Instead, it schedules and provides asynchronous, deferred execution of non-time-critical and computationally intensive operations referred to as tasks. Tasks are independent, but do not run truly concurrently in the single execution thread of a single core, as there is no pre-emption. Tasks run to completion, so they can be considered atomic with respect to each other, but not with respect to interrupts. A task can be thought of as a chain of (subsequent and branching) function calls entering components through their interfaces. These task call trees are rooted in the scheduler. Task executions are requested either by a hardware interrupt or by a previous task that has posted a deferred task.
Due to the lack of pre-emption, an approach was needed to break up long tasks into smaller, manageable pieces. This is achieved by using split-phase interfaces. These interfaces provide a way to initiate an operation in a non-blocking manner, and have a callback that signals the completion of the operation later on within the context of a different task.
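The interplay of run-to-completion tasks and split-phase interfaces can be sketched with a toy scheduler. This is a Python illustration of the concept only, not TinyOS code: `post`, `run_scheduler`, the `Sensor` class, and the fixed reading of 42 are all hypothetical placeholders.

```python
from collections import deque

task_queue = deque()   # FIFO of posted tasks; each one runs to completion
trace = []             # records the order in which things happen


def post(task):
    task_queue.append(task)


def run_scheduler():
    # No pre-emption: a task finishes before the next one starts.
    while task_queue:
        task_queue.popleft()()


class Sensor:
    """Split-phase read: read() returns at once, read_done is signaled later."""

    def __init__(self, read_done):
        self.read_done = read_done

    def read(self):
        trace.append("read requested")
        # Completion is deferred to a separate task instead of blocking here.
        post(lambda: self.read_done(42))   # 42 is a placeholder reading


def on_read_done(value):
    trace.append(f"read done: {value}")


def timer_fired():
    Sensor(on_read_done).read()
    trace.append("timer task finished")


post(timer_fired)
run_scheduler()
```

The trace shows that the requesting task finishes before the completion callback runs in the context of a different task, which is the property that later allows a split-phase call to be redirected to another core.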
1) Sense and Forward application example:
The "Sense and Forward" application illustrates some of the major tasks of WSNs. The embedded system has to measure some value in a periodic manner, and then send a radio message to the rest of the network based on the measurement results. Since the communication is expected to be cooperative, nodes are also required to intercept each other's radio messages and process and retransmit them as needed, thus forming a multihop network.
The actual working of the program is straightforward. The "MainC" configuration signals the "booted" event on start up, which starts the radio interface using the "ActiveMessageC" configuration. When the radio interface is up and running, the hardware timer is initialized using the "Timer" configuration. From that point on the hardware timer will periodically signal the "fired" event, which will call the "Sensor" and read a value. The "SignalProcessingP" module will then take this value, do some sort of processing, and call the "SendRadioMsg" command in the "AMSenderIFP" module. The latter is really nothing more than a module dedicated to holding the state variables (i.e., message buffers and flags) associated with the radio stack, and to preparing and sending a message. The "Packet" interface provides a simple way to handle message buffers.
a) Partitioning: The partitioning process has two options: cutting components or cutting along interfaces. Cutting a component is conceivable, but unlikely, because this would mean that there are independent parts within the same component. The very idea of a component is to collect related functionality. So, the basic approach is to cut along interfaces. However, all interfaces have to stay connected, so if one of the connections is moved to another core, some wrapper component has to take over that interface. It has to wrap the whole underlying messaging framework in a transparent manner.
b) The task call tree interpretation: The component to core assignment is not just a graph partitioning problem, because that omits the time and causality aspects of the execution model. Causality, in this context, means the function call order. Computation is expressed as nested, branching function calls within the context of tasks. Hence, it is crucial to see how partitioning and mapping (using wrapper components and the queue-based messaging framework) affects not just the component graph but the call trees as well.
Several function calls can take place within one component without ever leaving it; such calls are irrelevant for a partitioning in which the smallest entity is a whole component. Hence, for all practical purposes, the partitioning described here has a granularity not going beyond the component level. So, the task call trees discussed here describe how components call each other via their interfaces (see Fig. 4).
The partitioning decisions are based on the task call trees. Thus, the main types of task call trees have to be identified.
• Disjoint tree: Two task call trees are disjoint if they use disjoint sets of components. Disjoint trees can run in a trivially parallel manner.
• Conjoined tree: Two trees sharing a set of components. A shared component is a likely sign of shared resource access, which means that both task call trees have to be executed on the same core. However, there are certain cases, where components do not contain resources, but rather provide services, e.g., mathematical functions. These components have no persistent inner states and affect no peripherals, thus they can be safely copied.
The partitioning is about finding disjoint trees that may run in parallel. If such trees are not found, conjoined trees may be turned into disjoint trees if the shared components can be safely duplicated. If the above described classification does not result in satisfactory partitioning, the number of components within a task call tree has to be reduced to shorten its run time. The rerooted cutting technique is in effect the splitting of a task call tree at a split-phase interface into two disjoint trees. For example, the independent sequential initialization of peripherals can be split up this way. On the single-core system, a task call tree may sequentially visit components (dedicated to peripherals), and execute their initialization commands. On a multi-core system some of the peripherals may be assigned to different cores. So, the split-phase call to the peripheral component's initialization command may be redirected through the messaging framework. This way the original task call tree returns immediately once the messaging framework has sent the message. The new call tree on the other core will be initiated by the messaging framework having received the initialization command call message. So, the messaging framework will call the same component, and the peripheral gets initialized.
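The tree classification can be illustrated with a short Python sketch that treats each task call tree as the set of components it visits. The function name `classify`, the label "separable", and the component names other than "ActiveMessageC" are assumptions made for the example, not part of the tool set.

```python
def classify(tree_a, tree_b, copyable=frozenset()):
    """Classify two task call trees given as sets of component names.

    Returns "disjoint" if no component is shared, "separable" if every shared
    component is copyable (duplication makes the trees disjoint), and
    "conjoined" otherwise.
    """
    shared = set(tree_a) & set(tree_b)
    if not shared:
        return "disjoint"
    if shared <= set(copyable):
        return "separable"
    return "conjoined"


# Hypothetical component sets for three task call trees.
sense = {"TimerC", "SensorC", "MathC"}
radio = {"ActiveMessageC", "MathC"}
log = {"TimerC", "LoggerC"}

a = classify(sense, radio)                      # share MathC, not yet copyable
b = classify(sense, radio, copyable={"MathC"})  # MathC may be duplicated
c = classify(radio, log)                        # nothing in common
```

Trees reported as "conjoined" have to stay on one core, unless rerooted cutting at a split-phase interface can separate them.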
c) Partitioning abstraction:
For partitioning purposes an abstraction was devised to easily describe the process. The goal was to give a simple way to model the hypergraph partitioning problem.
The abstraction defines cores, so that during the parallel development workflow the developer has to assign some components to these computing cores. There are no restrictions within the abstraction on the assignment process. However, in practice:
• If a resource or state is associated with a core, then modules directly working with said state or resource have to run on the same core as well.
The abstraction defines assigned or dedicated components to specific cores, but it also defines unassigned components, which may be associated with any core. The final assignment of such components is automatically performed by the development tool set. However, it is the developer's responsibility to mark certain components and edges in order to guide this process.
Some components may be marked "copyable", which indicates that they can be safely replicated on different cores. In case of conjoined trees, these are the shared components which may be duplicated thereby safely separating the call trees. The developer has the responsibility to verify that for the application at hand, a given component can be considered copyable. Some guidelines:
• Components may not access core specific peripherals or resources.
• Components may use their state variables only temporarily. They cannot have a real permanent state that affects overall code execution.
Some interfaces may be defined as "cuttable", meaning they may connect components residing on different cores. This definition is very much application dependent, and certain cuttable interfaces in one case may not be regarded as such in other cases. Some guidelines:
• If a signal or event of a module returns a value of any kind, the interface is not cuttable. The interface has to be split-phase.
• The interface must not be timing critical, as the communication through these interfaces is not guaranteed and might involve the queue-based messaging framework.
• No pointer parameters may be passed, as cores have no shared memory.
• Transmission data rates cannot be too high, as message queue block memories are limited in size.
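The guidelines for cuttable interfaces lend themselves to a simple checklist. The following Python sketch is only an illustration: the record `InterfaceInfo`, its fields, the interface names, and the numeric rate limit are hypothetical, and in practice the developer decides what counts as timing critical or too fast for a given application.

```python
from dataclasses import dataclass


@dataclass
class InterfaceInfo:
    """Hypothetical summary of one interface, as a developer might record it."""
    name: str
    events_return_values: bool
    timing_critical: bool
    has_pointer_params: bool
    bytes_per_second: int


def cuttable_violations(info, max_bytes_per_second):
    """Return the guidelines the interface violates (empty list = cuttable)."""
    problems = []
    if info.events_return_values:
        problems.append("returns a value; must be split-phase")
    if info.timing_critical:
        problems.append("timing critical")
    if info.has_pointer_params:
        problems.append("passes pointers; cores share no memory")
    if info.bytes_per_second > max_bytes_per_second:
        problems.append("data rate too high for the message queue")
    return problems


# Assumed example interfaces and an assumed rate limit of 1000 bytes/s.
read_if = InterfaceInfo("Read", False, False, False, 200)
send_if = InterfaceInfo("Send", False, False, True, 200)

ok = cuttable_violations(read_if, max_bytes_per_second=1000)
bad = cuttable_violations(send_if, max_bytes_per_second=1000)
```

An empty result only means that none of the listed guidelines is violated; whether the interface is actually cuttable remains an application-dependent decision.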
2) Multi-core Sense and Forward application example:
We allocated three cores and assigned them to different partitions of the Sense and Forward application. Existing configurations and modules are left intact, and the queue-based messaging framework (enclosed with a dashed, blue box) was added. In between cores, message queues were added that represent a simplified version of the hardware message queues. Each core has one queue, and an additional fabric-based processing algorithm (depicted as a simple white box with no markings) has one as well. On the software side, queues have associated modules, which have the sole task of putting messages in and reading messages from the hardware. Message handler blocks use this functionality. These modules are responsible for creating the right message format and routing the messages to their proper destination. Finally, wrapper modules provide the original interfaces for the original components, and manage the transformation of values into messages and the other way around.
The program flow is the same as seen in the single core case, only this time most command calls and event signaling have to go through the messaging framework. The solution is however transparent, and pre-existing modules and configurations require no rewrites. All the new framework related modules can be computer generated.
IV. ENVIRONMENT
The development framework is meant to help with the transition of single-core projects to parallel solutions. Before actual deployment, a simulator helps answer questions such as how many and what type of cores a certain application needs, and how these should be connected.
A. Framework
The goal of the framework is to provide support such that developers are able to designate top-level components for different processing units. An automated process checks the feasibility of the assignment, and if it is indeed a feasible solution, the assignment is automatically generated.
1) Multi-core project generation:
Only widely available, free, preferably open source and standardized tools and technologies were employed. The backbone of the framework became the Extensible Markup Language (XML) file format, the Extensible Stylesheet Language Transformations (XSLT) transformation standard, and the FreeMarker template engine with the corresponding FreeMarker Template Language (FTL). Also, all of the tools working with these formats are written in Java, and hence the whole framework is platform agnostic and easily portable. The shell scripts, which glue together the individual tools, were written for the bash shell, and are the most platform-specific parts. But these scripts are little more than a collection of subsequent command calls, not complex algorithms; hence, they can be easily rewritten for any platform.
Fig. 6 shows the complete framework (utilizing the TinyOS development environment) for code generation. The top part of the figure shows the conventional single-core development process. The developer starts out by creating a single-core project, which (compiled by the nesC compiler) results, among others, in a platform-specific binary and an XML file ("wiring-check.xml") describing the project. This XML contains information regarding the project code (parsed by the compiler), including definitions of components, interfaces, variables, event and command parameters, return values, etc. Also, at this point the developer has the option to take the compiled binaries and run them on the Avrora cycle-accurate simulator to verify functionality and detect bugs early.
Once satisfied with the results, the developer can move on to the multi-core project generation phase. This means that the developer has to create a partitioning guide. The partitioning abstraction, introduced in the previous section, is employed here to describe the desired partitioning.
The developer-defined partitioning may not be viable; thus, it is crucial to verify the proposed component separation. This is a two-step procedure. The first step only deals with top-level components; hence, it is called top-level partitioning. The second step handles partitioning issues of the whole component hierarchy, and is thus called hierarchical partitioning.
The outcome of the feasibility check is a valid full partitioning guide. This file is made up of mostly the same information as the input partitioning guide, but it also lists all the top level components assigned to cores. In itself this file still does not hold the information necessary to generate actual code. Thus, a subsequent step is necessary during which the full partitioning guide and the single-core project descriptor files are processed, and the actual code generating information is extracted.
The final step towards a multi-core solution is the actual code generation, which employs a predefined template for the queue-based messaging framework architecture. The template is general in the sense that it does not assume any particular number or interconnection of cores. It also does not specify components, except for the components that are part of the queue-based messaging framework itself. The template is universally applicable for any type of top-level arrangement.
2) Top-level partitioning: Top-level partitioning takes care of component partitioning on the topmost level. The feasibility test algorithm serves two purposes. First, it verifies that the partitioning is feasible given the list of copyable components, cuttable interfaces, and dedicated components. Second, the algorithm assigns so far unassigned components.
a) Feasibility check:
The devised algorithm for the feasibility check is an application specific variant of the preorder Depth-First Search (DFS). For all practical purposes, it is safe to assume that the number of components on the topmost level is adequately limited, hence, the run time of this algorithm is not critical.
The algorithm itself is implemented in a recursive manner due to the declarative nature of XSLT (which does not even support variables). At its heart the algorithm is recursively repeating the same steps. In a nutshell,
• take a list of components as input,
• remove the first component,
• check the first component to see if it is assigned to a different core,
• generate a new list from this first component's neighbors,
• and finally recursively analyze the rest of the input list (without the first component).
The recursive function takes two lists of components as input.
• The first list "L" is just a collection of components (assumed to reside on the same core) that we would like the algorithm to check for contradicting assignments. A contradiction, in this case, means that any component in "L" is already assigned to a different core than the given core.
• The second list "potential_L" is the set of components that have been visited once and seem to check out, meaning that they can be potentially dedicated to the given core. However, this is just a temporary state, and these components are not fully verified yet. Their status can change depending on what the algorithm finds during recursive steps.
The algorithm returns a state and another component list. The state simply indicates whether the algorithm has found anything contradicting the assignment or not. This list holds all the (directly or indirectly) connecting components that may be safely assigned to the same core as the original input list.
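The recursive structure of the check can be sketched in Python. This is a reading of the description above, not the XSLT implementation: the function name `check`, the dictionaries `assigned` and `neighbors`, and the component "FilterC" are assumptions, and the sketch supposes that `neighbors` already omits cuttable interfaces and copyable components.

```python
def check(L, potential_L, core, assigned, neighbors):
    """Recursive pre-order DFS sketch of the feasibility check.

    L           : components still to be examined for `core`
    potential_L : components visited so far that look compatible
    assigned    : dict, component -> core it is dedicated to (if any)
    neighbors   : dict, component -> components wired to it through
                  non-cuttable interfaces (assumed precomputed)
    Returns (ok, components that may share `core` with the input list).
    """
    if not L:
        return True, potential_L
    first, rest = L[0], L[1:]
    if first in potential_L:                    # already visited
        return check(rest, potential_L, core, assigned, neighbors)
    if assigned.get(first, core) != core:       # contradiction found
        return False, potential_L
    # Visit the neighbors of `first` before the rest of the input list.
    ok, visited = check(list(neighbors.get(first, [])), potential_L + [first],
                        core, assigned, neighbors)
    if not ok:
        return False, visited
    return check(rest, visited, core, assigned, neighbors)


# Hypothetical wiring: SensorC - SignalProcessingP - FilterC.
neighbors = {
    "SensorC": ["SignalProcessingP"],
    "SignalProcessingP": ["SensorC", "FilterC"],
    "FilterC": ["SignalProcessingP"],
}

feasible = check(["SensorC"], [], 0, {"SensorC": 0}, neighbors)
conflict = check(["SensorC"], [], 0, {"SensorC": 0, "FilterC": 1}, neighbors)
```

No variable is reassigned after its first binding, which keeps the sketch close to the declarative style that XSLT imposes.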
b) Assigning undedicated components:
The second main objective is the assignment of undedicated components to cores. This is performed in a greedy manner by successively executing the above algorithm with different sets of input components dedicated to different cores. For example, the first run would have components dedicated to the first core as input, and thus would return a list of all the components that can be assigned to the first core. Similarly, the second run would have components dedicated to the second core as input, but all the components assigned to the first core would be on a prohibited list, and hence the algorithm would be forbidden to reassign them to the second core.
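The greedy pass can be sketched as follows. For self-containment, the Python example uses a compact iterative reachability search in place of the recursive check, which is a substitution made for the example; the names `reachable`, `greedy_assign`, and the components "FilterC", "RadioC", and "QueueC" are hypothetical.

```python
def reachable(seeds, neighbors, prohibited):
    """Components connected to `seeds`, or None if a prohibited one is reached."""
    seen, stack = [], list(seeds)
    while stack:
        comp = stack.pop()
        if comp in seen:
            continue
        if comp in prohibited:
            return None
        seen.append(comp)
        stack.extend(neighbors.get(comp, []))
    return seen


def greedy_assign(dedicated, neighbors):
    """dedicated: dict, core -> list of components fixed to that core."""
    result, taken = {}, set()
    for core, seeds in dedicated.items():
        # Components dedicated to other cores are off limits, as are
        # components already claimed by an earlier run.
        others = {c for k, v in dedicated.items() if k != core for c in v}
        group = reachable(seeds, neighbors, taken | others)
        if group is None:
            raise ValueError(f"partitioning infeasible for core {core}")
        result[core] = group
        taken |= set(group)
    return result


# Hypothetical non-cuttable wirings: two independent clusters.
neighbors = {
    "SensorC": ["FilterC"], "FilterC": ["SensorC"],
    "RadioC": ["QueueC"], "QueueC": ["RadioC"],
}
assignment = greedy_assign({0: ["SensorC"], 1: ["RadioC"]}, neighbors)
```

Because the pass is greedy, the result depends on the order in which cores are processed; a component reachable from two cores goes to whichever core is handled first, or the run fails if it is dedicated elsewhere.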
3) Hierarchical partitioning: Given a viable top-level partitioning (in an automatically generated full partitioning guide file), the hierarchical partitioning check is performed next. This phase is meant to verify that said partitioning remains viable even if the component hierarchy is taken into account. Hence instead of a top-level connectivity hypergraph, this step analyzes the component containment graph.
a) Feasibility check:
The main issue here is that certain components can exist simultaneously at different points in the hierarchy. These are not copies of the same component, but are indeed a single instance of one component. In other words, the containment graph is not a tree, and it may very well be cyclic. Given the above abstraction, the shared components may only be assigned to different cores simultaneously if they are copyable, in which case each core receives its own copy. The feasibility test algorithm verifies that the partitioning is viable given the list of copyable components and dedicated components.
The algorithm is implemented in a recursive manner similarly to the top-level partitioning case. The recursive function takes two lists of components as input.
• The second list "ignore_L" is effectively the set of copyable components, which can be safely disregarded when searching for components instantiated multiple times on different cores.
The algorithm returns a state and another component list. The state simply indicates whether the algorithm has found anything contradicting the assignment or not. This list holds all the connecting components that may be safely assigned to the same core as the original input list.
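The core idea of the hierarchical check, namely that a shared component may appear under several cores only if it is copyable, can be sketched in Python. The functions `contained` and `hierarchy_conflicts` and all component names are hypothetical, and the sketch reports conflicts directly instead of reproducing the recursive two-list formulation of the tool.

```python
def contained(root, contains, seen=None):
    """All components reachable from `root` in the containment graph.

    The graph may be cyclic, so visited components are tracked.
    """
    seen = set() if seen is None else seen
    if root in seen:
        return seen
    seen.add(root)
    for child in contains.get(root, []):
        contained(child, contains, seen)
    return seen


def hierarchy_conflicts(top_level, contains, ignore_L):
    """top_level: dict, top-level component -> core.

    Returns the non-copyable components that end up on more than one core.
    """
    cores_of = {}
    for top, core in top_level.items():
        for comp in contained(top, contains):
            cores_of.setdefault(comp, set()).add(core)
    return sorted(c for c, cores in cores_of.items()
                  if len(cores) > 1 and c not in ignore_L)


# Hypothetical hierarchy: both applications contain the same two components.
contains = {
    "AppA": ["HplTimerC", "MathC"],
    "AppB": ["HplTimerC", "MathC"],
    "HplTimerC": ["AppA"],          # cycle, to show that traversal terminates
}
top_level = {"AppA": 0, "AppB": 1}
conflicts = hierarchy_conflicts(top_level, contains, ignore_L={"MathC"})
```

Here "MathC" is ignored because it is marked copyable, while the shared timer component is reported as the reason the partitioning is not viable.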
B. Simulation
The open-source, discrete-time Avrora simulator [7] is cycle accurate and supports the simulation of a network of nodes. This makes it possible to directly test binaries developed for various platforms, and to evaluate how the signal processing and the radio stack hold up for radio networking.
In Avrora, MCUs are fully simulated (memory, registers, Input/Output (IO), etc.), and peripherals are modeled as Finite-State Machines (FSMs). Instead of the whole network and all the peripherals running in a synchronous lockstep manner, each MCU core is given a separate thread, which is solely responsible for simulating the core and the connected peripherals. Also, even within a thread, the core and peripherals are not simulated for every cycle. Instead, a single event queue is employed for every core and its connected peripherals, and the simulator essentially deals only with events as they happen. Thus, for example, timers do not require clock cycle simulations; they only have to place an event in the queue. However, this could quickly lead to nodes drifting apart in time to a degree where inter-node communication becomes impossible. Hence, the key to the ideas described above is the loose synchronization of nodes, which in essence is a method to stop individual node simulator threads from getting too far ahead.
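The combination of per-node event queues and loose synchronization can be sketched as follows. This is a single-threaded illustration and not Avrora's implementation, which is written in Java and uses one thread per node; the `Node` class, the `run` function, and the `window` parameter are assumptions. Each node jumps directly from event to event, but may only process events that lie within `window` time units of the earliest pending event in the whole network.

```python
import heapq


class Node:
    """One simulated node with its own clock and event queue (hypothetical)."""

    def __init__(self, name):
        self.name = name
        self.now = 0
        self.events = []  # heap of (time, sequence number, label)
        self._seq = 0

    def schedule(self, delay, label):
        heapq.heappush(self.events, (self.now + delay, self._seq, label))
        self._seq += 1

    def step(self):
        time, _, label = heapq.heappop(self.events)
        self.now = time  # jump straight to the event, no per-cycle work
        return label


def run(nodes, window):
    """Process all events while keeping nodes within `window` of each other."""
    log = []
    while True:
        pending = [n for n in nodes if n.events]
        if not pending:
            return log
        # Loose synchronization: nobody may run past this horizon.
        horizon = min(n.events[0][0] for n in pending) + window
        for n in pending:
            while n.events and n.events[0][0] <= horizon:
                label = n.step()
                log.append((n.now, n.name, label))


a, b = Node("a"), Node("b")
a.schedule(10, "timer")
a.schedule(500, "adc")
b.schedule(100, "radio")
log = run([a, b], window=50)
```

Node a handles its timer event at time 10 but is then held back until node b has processed its radio event at time 100, so a never runs far ahead of b. A larger `window` allows more independent progress at the cost of a coarser shared notion of time.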
1) Modifications to support multi-core simulation:
The original simulator assumed that platforms are single-core. The WSN simulation only supported nodes of the same type (although the program code could differ). Also, the simulator employed a time resolution that was equal to the clock rate of the MCUs.
For the multi-core simulation, the structure remained mainly unchanged, with the addition of another container entity within the "Platform" called System of Elements (SoE). An SoE contains devices, peripherals, cores, and other SoEs. Examples of the use of SoEs include the modeling of Systems on Chip (SoCs), or of soft cores instantiated within FPGA fabric.
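The recursive nature of the SoE container can be pictured with a small data-structure sketch. It is written in Python for brevity and does not reflect Avrora's Java classes; the field names, the `all_cores` helper, and the clock frequencies are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Core:
    name: str
    clock_hz: int  # illustrative value only


@dataclass
class SoE:
    """System of Elements: holds devices, peripherals, cores, and other SoEs."""

    name: str
    devices: list = field(default_factory=list)
    peripherals: list = field(default_factory=list)
    cores: list = field(default_factory=list)
    children: list = field(default_factory=list)  # nested SoEs

    def all_cores(self):
        """Return the cores of this SoE and of every nested SoE."""
        found = list(self.cores)
        for child in self.children:
            found.extend(child.all_cores())
        return found


# Soft cores inside FPGA fabric, nested within a platform-level SoE.
fabric = SoE("fpga_fabric", cores=[Core("soft0", 8_000_000), Core("soft1", 8_000_000)])
platform = SoE("platform", cores=[Core("mcu", 16_000_000)], children=[fabric])
core_names = [c.name for c in platform.all_cores()]
```

Because an SoE can contain other SoEs, a single traversal reaches every core on the platform regardless of how deeply it is nested.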
V. CASE STUDY
The concept described above has been tested on a Structural Health Monitoring (SHM) application employing real measurement data. SHM deals with the assessment of the inner condition of civil structures, such as bridges. One way to perform this is by monitoring material cracking events, which have to be handled in real time, as extensive buffering is not an option. The event handling rate of the system may be improved by assigning different parts of the signal processing to different cores, i.e., pipelining. Throughput could also be improved with a single MCU if the clock rate were high enough, but the increased dynamic power consumption associated with this approach prohibits it. There is a direct connection between event loss probability and consumed power; however, multi-core and single-core systems perform differently (Fig. 7).
In the figure, dotted lines indicate the performance of single-core systems, while solid and dashed lines show a 10-core multi-core system. Blue lines indicate system performance based on actual measured Inter-Arrival Times (IATs), while red lines show the performance based on an equivalent Markovian Arrival Process with two states (MAP(2)) model. In all cases there is a clear trade-off between reliability and power consumption. Higher reliability requires faster processing, which requires higher clock speeds, which in turn increase power consumption. However, when the different systems are compared, the multi-core system yields better reliability at the same level of power consumption. For example, based on the actual measured data, an event loss probability of 0.04 requires -17 dBm power for a single-core system, and only -26 dBm for the parallel pipelined approach. Also, for -23 dBm power the single-core system provides an event loss probability of 0.14, while for the parallel solution it is less than 0.04. A more elaborate version of the case study is presented in [8].
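To put the quoted figures on a linear scale, the short calculation below converts them to milliwatts, assuming the values are power levels in dBm (decibels relative to 1 mW). The function name `dbm_to_mw` is introduced here for illustration only.

```python
def dbm_to_mw(dbm):
    """Convert a power level in dBm to milliwatts: P = 10^(dBm / 10) mW."""
    return 10 ** (dbm / 10)


single_core_mw = dbm_to_mw(-17)  # power for 0.04 loss probability, single core
multi_core_mw = dbm_to_mw(-26)   # power for the same reliability, pipelined

# A 9 dB gap corresponds to a factor of 10^0.9, i.e. roughly 7.9.
ratio = single_core_mw / multi_core_mw
```

Under this reading, the 9 dB difference means that the pipelined system reaches an event loss probability of 0.04 with roughly one eighth of the power needed by the single-core system.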
