Operating system designs in future wireless sensor networks by Watfa, Mohamed et al.
University of Wollongong 
Research Online 
University of Wollongong in Dubai - Papers University of Wollongong in Dubai 
January 2010 
Operating system designs in future wireless sensor networks 
Mohamed Watfa 
University of Wollongong, mwatfa@uow.edu.au 
Mohamed Moubarak 
Ali Kashani 
American Univ. of Beirut 
Recommended Citation 
Watfa, Mohamed; Moubarak, Mohamed; and Kashani, Ali: Operating system designs in future wireless 
sensor networks 2010, 1201-1214. 
https://ro.uow.edu.au/dubaipapers/79 
Operating System Design in Future Wireless 
Sensor Networks 
Mohamed K. Watfa 
University of Wollongong in Dubai, UAE 
Email: Mohamed.Watfa11@gmail.com 
Mohamed Moubarak and Ali Kashani 
American University of Beirut, Beirut, Lebanon 
Email: {mm27, ak13}@aub.edu.lb 
Abstract— Traditional operating systems do not take into 
consideration the limitations in space and energy of wireless 
sensor networks. Thus, contemporary architectural 
demands in terms of power, heat, size and cost will not be 
satisfactorily met by such uniprocessing design. Also, the 
transition to multithreaded, multi-core designs places a 
greater responsibility on programmers and software for 
improving performance which is becoming increasingly 
important as sensor nodes are migrating towards dual 
processor designs. By analyzing and summarizing the 
activity of a system, one could locate sections of code that 
have a potential to generate enhanced performance. First, 
this paper studies the differences between different 
operating system designs introducing a thread-driven 
scheduling algorithm which focuses on the value of 
preemption to overcome the energy tradeoff brought by 
event-driven systems. We then devise efficient techniques 
that will enable us to locate sections in OS code that could 
behave more efficiently when parallelized, especially in 
terms of energy consumption. Finally, we provide 
simulation results that will validate our proposed 
techniques. 
Index Terms— Design, Energy consumption, Multi-core, 
Parallelism, TinyOS, Wireless Sensor Networks 
I. INTRODUCTION
Recent advances in computing technology, wireless 
technology, digital electronics, and MEMS (Micro-
Electro-Mechanical-Systems) have led to the creation of a 
new class of low cost, low power, small sized, 
multifunctional devices. These devices are called sensor 
nodes, nodes, or sensors. In essence, they are wireless, 
battery powered, smart sensors that have the ability to 
locally process data, communicate in short distances, and 
form ad hoc wireless networks with other sensors. 
Existing operating systems do not meet the requirements 
imposed by current and future sensor networks and hence 
the work on applicable operating systems has begun.  
Based on “Optimizing the Value of Preemption in Embedded Sensor 
Nodes”, by M. Watfa and S. Moubarak which appeared in the 
Proceedings of the International Conference on Embedded Systems and 
Applications (ESA'08), Las Vegas 2008.
The de facto operating system for wireless sensor nodes 
is TinyOS [1]. TinyOS has a simple design, similar to 
that of network interfaces. Hence as expected, TinyOS is 
event-driven. The scheduler in TinyOS is a simple non-
preemptive FIFO scheduler. That is, tasks run in order of 
arriving and run to completion, without being preempted 
by other tasks. Another embedded operating system 
designed for wireless sensor nodes is MOS [2]. Unlike 
TinyOS, MOS is thread-driven. That is, tasks are 
preempted by the scheduler for other (higher priority) 
tasks to run. This provides the aspect of virtualization 
desired in operating systems. Although other operating 
systems also exist in the field such as SOS [3], all 
operating systems conform to one of two design 
philosophies, event-driven and thread-driven. The choice 
of which design to adopt is not made abruptly; instead, it 
is thoroughly investigated, since it has a significant impact 
on the performance of the system for the rest of its 
lifetime. The importance of choosing between an event-driven 
system and a thread-driven one has motivated us to 
contribute to the field. Any application, algorithm or 
protocol will have to conform to the chosen design, hence 
carrying with it the design’s advantages and 
disadvantages. Making the choice at an early stage 
obliges the designer to go back to existing results of prior 
experiences and theoretical analysis. Event-driven 
systems are assumed to perform better under constrained 
environments. Yet they lack some system functionality 
and impose their own difficulties. However, thread-
driven systems provide high concurrency with 
preemption, allowing the use of real-time applications. 
Previous research has shown the ability of such systems 
to outperform event-driven ones. Yet, in some cases such 
as high system load, the thread-driven approach tends to 
consume more energy. Designers will then have to 
choose between low energy consumption and high concurrency. The 
thread-driven approach has more scope for optimization 
and is therefore chosen here to overcome the energy consumption 
tradeoff imposed by event-driven systems. The first part 
of this paper studies the differences between different 
operating system designs introducing a thread-driven 
scheduling algorithm focusing on the value of preemption 
to overcome the energy tradeoff brought by event-driven 
systems. 
JOURNAL OF NETWORKS, VOL. 5, NO. 10, OCTOBER 2010 1201
© 2010 ACADEMY PUBLISHER
doi:10.4304/jnw.5.10.1201-1214
As multi-core processors find their way into 
embedded devices, it is interesting to see how embedded 
software can adapt to such technology. The rapid 
advance of multi-processor technology in 
embedded devices raises the possibility of multi-processor 
wireless sensor nodes in the near future. WSN 
operating systems, however, are not designed to make use 
of multiple processors on a single chip. To analyze the 
performance of WSN operating systems on multi-processors, 
it is thus of extreme importance to locate 
potential parallelism first. Sensor node architectures such 
as the Instra-Node are heading towards multi-threaded or 
dual processor designs. This is not yet the case, however, 
with sensor node software. Parallelizing software for 
future multi-core sensor nodes offers the challenge of 
deciding where to parallelize code. This is a delicate step 
towards making full use of future sensor node hardware 
while achieving maximum performance. The second part 
of this paper aims at establishing a level of appreciation 
for the role of performance evaluation in locating 
potential parallelism to improve system performance. 
When mentioning potential parallelism, we refer to 
sections in a program that can be separated or divided 
among different threads or CPU-cores to improve the 
performance of the global system.  
To summarize, our contribution in this paper is 
multifold and involves the following: 
1- We define the notions of event-driven and thread-
driven systems and investigate the differences 
between each model. 
2- We introduce a simple and energy efficient 
preemption algorithm targeting single core 
embedded wireless sensor network operating 
systems resulting in a significant decrease in the 
number of context switches. 
3- We illustrate the significance of multi-
core/processor system architecture in current 
sensor operating system designs. 
4- We provide an algorithm that identifies potential 
parallelism in existing single-threaded wireless 
sensor node applications.  
The rest of this paper is organized as follows: In Section 
2, we provide some definitions and terminology used 
throughout the paper. Related research work is 
summarized in Section 3. Section 4 presents an optimized 
OS scheduler. Section 5 discusses some evaluation criteria 
of parallelized systems. A parallelized algorithm is 
presented in Section 6.  We present the simulation results 
in Section 7 and conclude this paper in Section 8. 
II.  DEFINITIONS
A.  Events and Threads 
Before investigating the difference between the 
event-driven design and the thread-driven one, we will 
describe the two designs according to the existing 
operating systems. This is because some authors describe 
an event-driven system with a preemptive scheduler, but 
since our existing event-driven operating systems do not 
adopt that kind of scheduler, we will describe our event-
driven model as non-preemptive as well. Any comparison 
that will be done later will be based on the design 
described in this section. We will start with the event-
driven approach.  
Event-driven models consist of event handlers that 
continuously wait for events to issue tasks such as packet 
arrivals to be processed. Since tasks may arrive at a pace 
faster than that of the processor, tasks are queued. The 
scheduler of the event-driven model selects the tasks 
from the queue to be processed in a FIFO fashion. The 
selected task is then put on the processor and processed to 
completion, uninterrupted by other tasks. After the 
completion of the entire task, the scheduler can select the 
next task to process and so on as depicted in Figure 1. 
Figure 1. Event-driven execution model allows one process at a time. 
Figure 2. A thread-driven execution model simulates parallel execution 
on several CPUs.  
Thread-driven systems on the other hand deal with 
tasks in a different way as depicted in Figure 2. When a 
task is created, it is queued. The scheduler selects a 
thread from the queue in any fashion; let us assume a 
round robin scheduler, like the one in MOS. The thread is 
put on the processor for a certain time slot after which the 
thread is preempted (interrupted) and another thread is 
put on the processor. By allowing multiple threads to 
execute preemptively, the system acts as if there are 
multiple processors, one for each thread. This increases 
concurrency; however, preemption (context 
switching) is very expensive in terms of time, energy and 
memory. Another problem is that executing threads may 
share a resource. Semaphores or monitors should be used 
to ensure safety and a reliable flow. Using the thread-driven 
design also allows a thread that is waiting for an 
I/O device to be blocked, allowing other threads to 
execute while the I/O request is processed. This approach 
increases the processor utilization. Furthermore, a 
separate stack has to be maintained for each thread. Stack 
analysis techniques are used to predict the size of the 
stack on MMU-less hardware. Thus multi-threading is a 
package containing stack management, memory 
management on thread creation, and preemptive 
scheduling.  
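The contrast between the two execution models described above can be sketched in a few lines of Python. This is only an illustrative model, not code from TinyOS or MOS: the function names, the task names and the abstract "time units" are assumptions made for the example. It shows that a FIFO run-to-completion scheduler never preempts, while a round robin scheduler pays for its concurrency with quantum expiries.

```python
from collections import deque

def run_to_completion(tasks):
    """Event-driven model: FIFO order, each task runs uninterrupted.

    tasks: list of (name, length) pairs in arrival order.
    Returns the completion order and the number of preemptions,
    which is always zero in this model.
    """
    order = [name for name, _ in tasks]
    return order, 0

def round_robin(tasks, quantum):
    """Thread-driven model: each thread runs for at most `quantum` units.

    Returns the completion order and the number of times a thread's
    quantum expired before the thread had finished.
    """
    queue = deque(tasks)
    order, preemptions = [], 0
    while queue:
        name, remaining = queue.popleft()
        remaining -= min(quantum, remaining)
        if remaining == 0:
            order.append(name)
        else:
            preemptions += 1              # thread goes back to the queue
            queue.append((name, remaining))
    return order, preemptions

# Hypothetical workload: one long sensing task and two short tasks.
tasks = [("sense", 6), ("forward", 2), ("route", 3)]
print(run_to_completion(tasks))
print(round_robin(tasks, quantum=2))
```

With this workload the short tasks finish before the long sensing task under round robin, at the cost of three quantum expiries; under run to completion they must wait for the sensing task, but no preemption occurs.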
Event-driven programming has been highly 
advertised in recent years as the best way to approach 
concurrent applications [4]. However, further research 
has shown that this belief is 
not completely true. The arguments in favor of the event-driven 
model are that it uses an inexpensive (non-preemptive) 
scheduling technique, requires no stack 
management and provides a safe control flow (no locks 
and semaphores) [4]. Moreover, event-driven systems are 
highly portable since they do not require the extra stack 
support for multi-threading. They also have a smaller 
memory footprint. However, in [5], the authors have shown 
that event-driven systems could still have the same 
performance as thread-driven systems. 
Programmer Experience
According to [6], event programming is tedious, 
unstructured, and repetitive. In the event-driven design, 
the event loop is in control and not the programmer. So, 
the programmer will have to chop a program into a series 
of short programs. This is also required in order not to 
allow a long running task to monopolize the entire 
system. However, in a thread-driven implementation, the 
programmer is not concerned whether his program 
monopolizes the system or not, since the system itself 
will take care of that through its preemptive nature. 
Bounded Buffer Producer-Consumer Problem 
Due to the RAM limitations of embedded wireless 
sensors, buffers are small enough for the bounded 
buffer producer-consumer problem to occur in an event-driven 
system. When an event is filling a buffer in an 
event-driven system, the buffer will not be emptied by a 
consumer until the current event (the producer) has finished 
placing all of its data in the buffer. The buffer may 
stay full long enough for data to be lost, such as packets 
that cannot find space in the buffer. In a 
preemptive or thread-driven system, however, the buffer is 
occasionally emptied by other tasks running virtually in 
parallel, which avoids the bounded buffer 
producer-consumer problem. In event-driven systems, long-lived tasks 
may exist under high system load due to the complexity 
of the running applications. 
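A small simulation makes the bounded buffer argument concrete. The sketch below is a simplified model under stated assumptions: packets are counted rather than stored, the consumer empties the whole buffer each time it runs, and the function names and the `quantum` parameter are hypothetical. It counts how many packets a producer loses when it runs to completion, and when it is preempted periodically so that a consumer can drain the buffer.

```python
def dropped_run_to_completion(n_packets, capacity):
    """Producer runs to completion; the consumer cannot run in between."""
    buffered = dropped = 0
    for _ in range(n_packets):
        if buffered < capacity:
            buffered += 1
        else:
            dropped += 1          # buffer full, packet is lost
    return dropped

def dropped_preemptive(n_packets, capacity, quantum):
    """Producer is preempted after every `quantum` packets.

    Each preemption lets the consumer run and empty the buffer
    (a simplifying assumption of this sketch).
    """
    buffered = dropped = 0
    for i in range(1, n_packets + 1):
        if buffered < capacity:
            buffered += 1
        else:
            dropped += 1
        if i % quantum == 0:
            buffered = 0          # consumer drains the buffer
    return dropped

print(dropped_run_to_completion(10, capacity=4))
print(dropped_preemptive(10, capacity=4, quantum=3))
```

In this model no packet is lost as long as the producer is preempted at least once per `capacity` packets; a longer quantum reintroduces losses.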
Disadvantages of Preemption
Preemption has played an important role in drawing the 
line between event-driven systems and thread-driven 
ones. Several research papers show that all the fears of 
multi-threading come from preemption [6], [7]. To 
elaborate, let us look at the disadvantages of the thread-
driven approach. One argument against the thread-driven 
approach is the difficulty in writing code that handles 
synchronization through semaphores or monitors [7]. The 
reason why locks are needed as a form of synchronization 
is because threads may be using shared variables while 
they run preemptively. In other words, if an event-driven 
system had a preemptive scheduler, then that system 
would also have to take synchronization into account. 
Thus, the question whether the control flow is event-
driven or thread-driven is orthogonal to the question of 
whether those threads and events were preemptively 
scheduled. 
To illustrate the motivation behind our work, we 
performed some experiments to compare the performance 
of TinyOS and MOS under high system load as shown in 
Figure 3. Experiments comparing TinyOS and MOS have 
shown that under high system load, MOS consumes more 
energy. In these experiments, a binary tree topology is 
assumed. Depending on its position n in the tree, a 
sensor node might process varying numbers of packets. 
The behavior of a single node is emulated by applying a 
certain traffic pattern. The node under test was given 
varying sensing task lengths and a set of forwarding tasks 
to emulate each tree position n, hence each node was 
stressed depending on whether it is a leaf node or a 
forwarding node. The idle time was measured at every 
position n in the tree as an indication of the amount of 
energy conserved. The difference in idle time is directly 
related to context switches or preemption, since under 
high system load, the number of incoming packets 
increases the number of interrupts. Under low system 
load, MOS offers better concurrency and predictability, with 
energy consumption equal to that of the event-driven TinyOS. 
Figure 3. As traffic increases, MOS tends to spend more energy than 
TinyOS due to the overhead of context switches. 
B.  TinyOS 
To meet the tight constraints of WSNs, TinyOS 
adopted the event-driven approach as its concurrency 
model and is currently the standard OS for WSNs. 
TinyOS was designed to have a very small memory 
footprint; the core OS fits in less than 200 bytes 
of memory. TinyOS’ event-driven choice was based on 
the fact that it cuts down on stack sizes, since only one process 
can run at a time. Another reason is that it eliminates 
unnecessary context switches, which are infamous for 
their energy inefficiency. TinyOS is entirely made of a set 
of reusable system components and an energy efficient 
scheduler and hence has no kernel. Each component is 
made up of four parts, a set of commands, event handlers, 
a bundle of tasks and a fixed size frame for storage. The 
commands and events a component supports must be 
predefined to enhance modularity. Components in 
TinyOS are arranged hierarchically, with low level 
components closest to the hardware and higher level 
components forming the application layer, as shown in 
Figure 4. 
Figure 4. Visual representation of a TinyOS component. Upside-down 
triangles represent command handlers, triangles represent event 
handlers, upward dashed arcs represent signaled events and downward 
solid arcs represent issued commands.
Components are of three types: 
1. Hardware abstraction components: These are the 
lowest level components that map the physical 
hardware to the TinyOS component model. One 
such component is the RFM radio component which 
manipulates the pins connected to the RFM 
transceiver. 
2. Synthetic hardware components: These components 
simulate the behavior of hardware. For example, the 
Radio Byte component performs data encoding and 
decoding that can be performed by hardware. These 
components lie on top of the latter. 
3. High level software components: These components 
form the application layer and are responsible for 
data management and routing. Data fusion 
applications fall into this category as well. 
Since components are organized, some form of 
‘wiring’ or binding is required to make inter-component 
protocols clear. This is provided by a component through 
its commands and events. As mentioned earlier, a TinyOS 
component is made up of commands, events, tasks and a 
frame. Commands are the set of function calls or services 
that a component will request from other components. 
Event handlers implement the handling of results returned 
from previous commands. Those results are signaled by 
the component that provided the service in the form of an 
event to indicate completion of the service. Commands 
and events cannot block. Tasks, on the other hand, are a 
form of deferred computation. Most computational work 
is done through tasks. A component defines the tasks that 
it may post. When a task is posted, it is buffered until the 
scheduler, a simple FIFO scheduler, runs it. 
When no tasks are pending, the scheduler puts the CPU in 
sleep mode for energy efficiency. Only one task can 
run at a time and each runs to completion with respect to other tasks. Tasks may be 
preempted by commands or events. A task should be 
short so as not to delay other tasks. Finally, the fixed 
size frame is used to depict the state of the component by 
storing parameters. The fixed size and static allocation of 
the frame allow for simpler memory management at 
compile time.  
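The task mechanism described above can be illustrated with a toy scheduler. The following Python sketch is not TinyOS code (TinyOS components are written in nesC); the class name, the `capacity` limit and the `sample` and `send` tasks are hypothetical and serve only to show posting, FIFO run-to-completion execution, and the transition to sleep when no task is pending.

```python
from collections import deque

class FifoTaskScheduler:
    """Toy model of a non-preemptive FIFO task scheduler."""

    def __init__(self, capacity=8):
        self.capacity = capacity      # assumed fixed-size task queue
        self.queue = deque()
        self.log = []

    def post(self, task):
        """Buffer a task; report failure if the queue is full."""
        if len(self.queue) >= self.capacity:
            return False
        self.queue.append(task)
        return True

    def run(self):
        """Run pending tasks to completion in FIFO order, then sleep."""
        while self.queue:
            task = self.queue.popleft()
            task(self)                # a task may post further tasks
        self.log.append("sleep")      # stands in for the CPU sleep mode

def sample(sched):
    sched.log.append("sample")
    sched.post(send)                  # defer the follow-up work

def send(sched):
    sched.log.append("send")

sched = FifoTaskScheduler()
sched.post(sample)
sched.run()
print(sched.log)
```

Keeping each task short, as the text recommends, matters here because `run` never interrupts a task once it has started.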
C.  The Multi-* Technology 
The “Multi” prefix appears throughout modern 
advances in computing and 
communication. 
Terms such as multiprocessor, multicore, 
multitask and multithreading are often used ambiguously in 
terms of architecture, structure, functionality and 
purpose. 
In what follows, we give a 
definition of each of the concepts mentioned above to 
make them clear and precise. 
1. Multiprocessor Technology: Multiprocessor 
system can be defined as comprising 2 or more 
independent central processing units (CPUs), 
which only share a common back-end data bus 
interface.  One of the drawbacks of such 
architecture is the implementation cost in terms of 
multiple chips and bus requirements. 
2. Multicore Technology: Multicore, or on chip-level 
multiprocessor, can be defined as multiple 
processors (CPUs) on a single hardware chip. Each 
processor has its own L1 cache, while the L2 
cache, the main memory unit (MMU) and the data 
bus interface are shared among the multiple 
processors. The significance of multicore 
technology is that performance similar to that of 
multiprocessor system can be achieved for lower 
cost since much of the computing resources 
mentioned earlier are not duplicated but shared. 
3. Multi-task Technology: Multitasking is a method 
in which multiple tasks/processes, which are 
programs under execution, share common 
processing resources such as CPU and the MMU. 
Originally dependent on multiprocessor 
technology, multitasking required 2 or more 
processors for tasks to run simultaneously. Early 
operating systems were “single task” systems, 
meaning that only a single task/process can be 
executed at a time (e.g. Win 3.11). However, 
modern operating systems (e.g. Windows XP, 
UNIX, Mac OS) give the impression of 
parallel execution by efficiently 
scheduling running applications and switching 
between them in assigned time slots, 
as if actual multitasking were taking place. 
The multiprocessor/core environment is 
transparent to the application where the operating 
system acts as an interface, mapping and 
scheduling tasks over available processors. 
4. Multithread Technology: Taking multitasking to 
a higher level, multithreading divides selected 
operations within a single task and maps them onto 
individual threads. These threads can then 
be executed in parallel on a multiprocessor or multicore system. The 
advantage of this technique is that efficiency and 
performance are pushed even further for each 
task, process and thread. 
III. RELATED WORK
The related research work can be divided into two 
different focus groups: 
A.  OS Design Related Work  
In [8], the authors make a first attempt at optimizing 
the low level implementation of thread-driven operating 
systems, in order to achieve event-driven performance. 
First, the authors perform stack analysis and use control 
flow information created at compile time to predict the 
size of the stack. Then, they provide a single stack 
implementation for all running threads, as opposed to the 
traditional technique of creating a stack for each thread, 
thus cutting down on space. The authors also tackle 
energy consumption by coming up with a new scheduling 
technique that depends on a variable timer, as opposed to 
the traditional fixed quantum, thus saving on computation 
latency. However, they did not take into account the large 
overhead produced by context switches. Their results still 
perform worse than event-driven systems, but with a 
great improvement compared to other thread-driven 
systems. Our work is greatly motivated and influenced by 
the works of [9] and [7]. In [9], the authors make a first 
step in studying the cost of preemption. The authors 
present a theoretical scheduling model which 
incorporates the cost of preemption. They show that 
preemptive algorithms, such as shortest remaining 
processing time, are theoretically optimal but are 
impractical because they do not take into consideration 
the cost of context switches. Moreover, the authors 
provide an algorithm, “wait to preempt”, which 
aggregates arriving processes and then runs them after a 
certain amount of work is done, which depends on the 
cost of preemption. However the authors aim at 
minimizing total flow time, which is the total time that 
the jobs spend in the system since arrival until they are 
run to completion. The cost of preemption introduced 
does not depend on energy consumed or on the CPU 
cycles. The algorithm is strictly based on the size of 
processes and also assumes the knowledge of the size of 
the smallest process.  The authors in [7] comparatively 
evaluate the performance of MOS and TinyOS. Their 
work measures the memory foot-print, event processing 
and energy efficiency of the two operating systems. The 
experiments aimed at comparing the performance of 
event-driven systems against thread-driven ones. The 
results show that the event-driven system, specifically 
TinyOS, has smaller memory foot-print and better energy 
consumption at high system loads. Whereas the thread-
driven MOS has better real time performance and 
predictability with similar energy consumption at low 
system loads. According to these results, a tradeoff exists 
when choosing among those systems. The same authors 
in [7] attempted to overcome this tradeoff later on in [10] 
and [11]. In [10], the authors focus on improving energy 
efficiency in MOS by tuning its preemptive scheduler. 
Their modifications included removing the idle thread, 
which ran whenever no tasks were runnable. Also, time 
slicing between equally prioritized threads was removed; 
if needed, the user should explicitly include it. Finally, the 
linked list queues were replaced by a single array, which 
makes addition and deletion costly. This tuning technique 
improves the energy efficiency of MOS; however, it is specific 
to MOS, whereas our approach targets thread-driven systems in general. 
B.  Multi-Core Related Work 
Recent developments in hardware, in the form of 
fully programmable media processing devices, allow the 
re-use of design efforts, which can dramatically decrease 
production and design costs. In [12], the authors 
suggest a novel approach that exploits advances in the 
consumer-electronics industry by 
using a multiprocessor 
architecture as an infrastructure for executing one of the most 
resource demanding applications in 
high definition multimedia: H.264 decoding. They 
suggest partitioning the H.264 application over the 
multi-processor environment by data partitioning 
rather than functional partitioning, since a 
comparison between the two approaches concluded that the 
former ensures locality of data, load balancing 
among the multiple processors, system scalability without 
the need to rewrite the software, and simplicity of 
implementation. The experiments showed 
that the proposed data partitioning scheme 
leads to a significant bandwidth reduction of 65% over 
the traditional functional scheme. After proposing the 
data partitioning scheme as a solution for H.264 
decoding, the authors considered a single specific data partition size and shape, 
namely a staircase shape.  
In [13], the authors introduce two techniques for 
aiding programmers in parallelizing loops via “loop 
profiling”. When trying to parallelize sequential code, a 
logical first step might be to find which loops are doing 
the most work. The concept of loop-centric profiling aims 
to give the programmer a more complete view of where 
time is spent in a program. Loop-centric profiling is 
similar in nature to the traditional call graph, but also 
identifies parent-child relationships and self/total 
execution counts for loops in addition to functions. In 
[14], the authors propose a Multi-Processor Operating 
Systems (MPOS) emulation framework for Multi-Processor 
Systems-On-Chips (MPSoCs) that provides 
efficient evaluation of thermal management strategies at 
the architectural and OS levels. A MPOS framework, 
based on the Field Programmable Gate Array (FPGA), is 
proposed which consists of 4 cores with a customized 
version of uClinux (Linux for embedded system) running 
on each. A Task Migration module and a Communication 
Module along with the OSs comprise the HW/SW 
abstraction layer. Using hardware sniffers, a built-in 
library calculates the temperature of each core. A 
proposed thermal-aware policy initiates a task migration 
process based on the temperature threshold attained by a 
currently executing core. Whenever a core reaches this 
threshold, the task is migrated to another colder core, thus 
maintaining the overall temperature of the MPSoC. The 
authors suggest installing an OS on every core, which 
affects the overall performance due to the OS-OS 
communication overhead and the increase in design 
complexity. Tasks on the same processor share a common 
private memory space, whereas tasks running on different 
cores communicate via a shared memory space. This 
design adds significant overhead when migrating tasks 
between cores, since a lot of data must be transferred 
and most of the bandwidth is consumed. In 
[15], the authors discuss three possible techniques for 
loosening the constraints forced by control flow on 
parallelism: speculative execution, control dependence 
analysis, and executing multiple flows of control 
simultaneously. Simulations of execution traces are used 
to evaluate these techniques and to find the limits of 
parallelism for machines that utilize different 
combinations of them. The ultimate goal is to 
design an Oracle machine where branch outcomes are 
known in advance, thus no instructions have to wait for 
branches to be resolved. Since such a machine is 
unrealistic in terms of hardware resources and 
complexity, such techniques need to be examined. 
IV. OPTIMIZED PREEMPTION TECHNIQUES
As mentioned earlier, we start with a single core 
design, where the main fears of multi-threading come 
from the value of preemption, and we tackle this 
problem by introducing an energy efficient preemption 
optimization. This is an example of a research effort 
aimed at analyzing the performance of WSN EOSs; 
more precisely, the aim is to analyze the 
performance of only one part of the operating system, 
the scheduler. Our algorithm aims at optimizing the 
number of context switches in thread-driven systems 
under high system loads. This is done by directly 
optimizing the number of preemptions. There are two 
scenarios that need to be taken into consideration under 
high system load. The first arises when sensing tasks are long: 
when smaller tasks arrive, the longer sensing task is 
continuously preempted, as shown in Figure 5. This 
causes preemption overhead, which is worse when tasks are 
longer. The second scenario does not involve the size of 
incoming tasks; instead it involves the frequency at which 
they arrive. At high frequencies, processes tend to 
preempt each other irrespective of their sizes.  
Figure 5. Without taking into consideration the size of the process, 
scheduling may cause context switch overhead. 
Figure 6.  Using our algorithm, only one context switch is needed in the 
same scenario of Figure 5. 
Taking these scenarios into consideration, our 
algorithm works as follows. First, run processes 
preemptively in a round robin fashion. After some work 
has been done, preempt the currently running process if it 
is long, and run small processes to completion without 
preemption. Again, after some work has been done, go 
back to step one of the algorithm and repeat. The 
algorithm depends on three values: 
the size of a small process, the size of a 
long process, and a certain amount of work 
done. The idea, as illustrated in Figure 6, is to create 
preemption free periods without affecting concurrency by 
deferring small processes and running them to 
completion. The following sections elaborate on the 
choice of these three values. 
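The two alternating phases can be sketched as follows. This is a minimal Python model under several assumptions that are not taken from the paper: all processes are present at the start, a single threshold `small` classifies a process by its remaining work (anything larger is treated as long), the amount of work per phase is the parameter `work`, and a context switch is counted whenever the running process changes. The function and parameter names are hypothetical.

```python
from collections import deque

def two_phase_schedule(procs, quantum, small, work):
    """Alternate a round robin phase and a preemption free phase.

    procs: list of (name, length) pairs. Every `work` units the
    scheduler toggles phase. In the preemption free phase, processes
    whose remaining work is at most `small` run to completion and
    longer processes wait. Returns (completion order, context switches).
    """
    queue = deque(procs)
    order, switches, last = [], 0, None
    preemptive, budget = True, work
    while queue:
        if not preemptive and not any(r <= small for _, r in queue):
            preemptive, budget = True, work   # nothing small left to run
        if preemptive:
            name, rem = queue.popleft()
            run = min(quantum, rem)
        else:
            # first small process in queue order, run to completion
            idx = next(i for i, (_, r) in enumerate(queue) if r <= small)
            name, rem = queue[idx]
            del queue[idx]
            run = rem
        if last is not None and last != name:
            switches += 1
        last = name
        rem -= run
        budget -= run
        if rem == 0:
            order.append(name)
        else:
            queue.append((name, rem))
        if budget <= 0:
            preemptive, budget = not preemptive, work
    return order, switches

procs = [("sense", 6), ("a", 3), ("b", 3)]
# An infinite work budget never leaves the round robin phase.
print(two_phase_schedule(procs, quantum=1, small=3, work=float("inf")))
print(two_phase_schedule(procs, quantum=1, small=3, work=3))
```

For this particular workload the model needs fewer context switches with the preemption free phase than with plain round robin, while the completion order is unchanged. The size of the gain depends on the chosen parameters, so the sketch should not be read as a general performance claim.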
A. Process Sizes 
Accurately determining the size of a process is almost 
impossible yet is a very crucial piece of information. 
Several scheduling algorithms used in the field depend on 
the size of a process. One approach to predict the size of 
the process is called aging. The size of a process depends 
on the amount of time it has spent on the CPU during 
previous runs. Hence the update is continuously updated. 
Formally, assume a process spent time T0 on the first run 
and T1 on the second run. The new estimate is the 
weighted sum of these two runs, that is aT0 + (1 - a)T1, 
where a is the chosen weight. However our approach in 
determining the size of a process is simpler and is based 
on the quantum size.  and  are discussed in more detail 
later. 
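The aging rule above can be sketched in a few lines. This is only an illustration of the weighted sum aT0 + (1 - a)T1; the function names, the default weight of 0.5, and the idea of folding the rule over more than two runs are assumptions made for the example, not part of the proposed scheduler.

```python
def aged_estimate(previous_estimate, last_run, a=0.5):
    """Weighted sum a*T0 + (1 - a)*T1, with T0 the previous estimate
    and T1 the most recent observed CPU time."""
    if not 0.0 <= a <= 1.0:
        raise ValueError("weight a must lie in [0, 1]")
    return a * previous_estimate + (1.0 - a) * last_run


def estimate_after_runs(run_times, a=0.5):
    """Apply the aging rule run after run (assumed extension to n runs)."""
    estimate = run_times[0]              # the first run seeds the estimate
    for observed in run_times[1:]:
        estimate = aged_estimate(estimate, observed, a)
    return estimate
```

With a = 0.5, runs of 8 and 4 time units give an estimate of 6; a smaller a makes the estimate follow the most recent run more closely.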
B. Work Done 
The proposed algorithm mainly depends on the amount of work done between adaptations. This value denotes the time at which the scheduling algorithm should adapt to optimize the number of context switches. It does so by entering a preemption-free period, in which small processes are run to completion with respect to each other, since small processes are handled quickly and easily. After another period of the same length, the scheduler returns to its original state, allowing longer processes to run. The algorithm is illustrated in Figure 7. The value could be tuned for better performance and could be determined by experimentation. Our choice is discussed in the following section. Using this approach, we might incur some delay in terms of the amount of time processes wait to be scheduled. One method that can be used to reduce this latency is to enhance CPU utilization. When the clock interrupt handler determines the end of a quantum, a context switch occurs. However, the clock keeps issuing interrupts at a certain rate. Since most of these interrupts are unhandled, a considerable amount of energy is wasted in triggering them. To overcome this problem, a variable timer was implemented such that the rate at which interrupts occur depends on the upcoming timeout request. The variable timer manages timeout requests from threads and sets the clock-tick rate accordingly. Variable timers are not feasible in conventional OSs, where the number of threads is very large. However, in networked nodes, the number of threads is small enough to allow for a variable timer.
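To make the variable timer idea concrete, the following minimal sketch keeps pending timeout requests in a min-heap and programs the next interrupt for the earliest deadline instead of ticking at a fixed rate. The class name, method names, and the use of abstract integer ticks are hypothetical; the actual implementation is not described at this level of detail.

```python
import heapq


class VariableTimer:
    """Tracks timeout requests from threads and reports when the next
    interrupt is actually needed (illustrative sketch only)."""

    def __init__(self):
        self._pending = []               # min-heap of (deadline, thread_id)

    def request_timeout(self, thread_id, now, delay):
        heapq.heappush(self._pending, (now + delay, thread_id))

    def next_interrupt(self, now):
        """Ticks until the earliest deadline, or None if nothing is pending."""
        if not self._pending:
            return None
        deadline, _ = self._pending[0]
        return max(0, deadline - now)

    def fire(self, now):
        """Remove and return the threads whose deadlines have passed."""
        expired = []
        while self._pending and self._pending[0][0] <= now:
            _, thread_id = heapq.heappop(self._pending)
            expired.append(thread_id)
        return expired
```

Because the heap only ever holds one entry per outstanding request, this bookkeeping stays cheap when the number of threads is small, which is the situation on a sensor node.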
C. Implementation 
In this section, we discuss implementation specifics, namely the choices of the three values introduced above. Before doing so, we need to present the two different types of context switches: voluntary and involuntary. A voluntary context switch occurs when a job or process gives up its time quantum voluntarily, for example due to an IO request. An involuntary context switch, on the other hand, occurs when a process uses up its quantum but still has work to do. In this case the kernel preempts the process to run another one. We are only interested in optimizing the number of involuntary context switches. We mentioned previously that we use the quantum to determine the size of a process.
Figure 7. After each quantum, we check if a certain amount of work is done. If so, check if the running process is long. If so, preempt it and run only small processes to completion without preemption.
Figure 8. Short and long processes are identified by quantum size.
This is done as follows. On each clock tick, the kernel checks if the current process has used up its quantum. Processes are given a fixed quantum and are not preempted before the quantum is done. A process may require more than one quantum to finish. So if the kernel determines the end of the current process' quantum, the kernel will preempt the process, causing an involuntary context switch. The scheduler will place the preempted process in the appropriate place in the scheduling queue and pick another process to run. When a process gives up the CPU for an IO request, the part of the quantum that it used is recorded. So when the process gets its request and is put back on the CPU, it is not given a full quantum again; it is only given the quantum it had left. However, if the process was preempted due to an involuntary context switch, it is given a full quantum again, as shown in Figure 8. Thus, we have the notion of a small process and a large process depending on the remaining quantum size. More precisely, if a process has a full quantum, it is a long process; otherwise it is a short process. As for the amount of work done, we represent it in terms of time spent. Another possibility would be to represent the work done as a ratio of preemption cost and the size of the smallest task. However, for the sake of simplicity, we set this value to 100 quanta. In other words, every 100 quanta, the scheduler readapts to optimize preemption.
Example of our context switch aware scheduler. 
Interrupt Handler {
    if (elapsed == quantum) {
          Scheduler (work_done++) } }
Scheduler {
    if (work_done < 100) {
          Optimize () } …
}
Optimize {
    PickShortProcs () }
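The behavior described in this section can also be sketched in Python. This is an illustrative reading of the text, not the implemented scheduler: the names QUANTUM, AdaptiveMode, and pick_next, the quantum length of 10 ticks, and the fallback to the head of the queue when no short process is ready are all assumptions. A process counts as long when it holds a full quantum and as short otherwise, and the mode toggles every 100 quanta.

```python
QUANTUM = 10          # ticks per quantum (hypothetical value)
WORK_PERIOD = 100     # quanta between adaptations, as chosen in the text


def is_long(remaining_quantum, quantum=QUANTUM):
    """A process holding a full quantum is treated as long."""
    return remaining_quantum == quantum


class AdaptiveMode:
    """Toggles between round-robin and preemption-free periods."""

    def __init__(self, period=WORK_PERIOD):
        self.period = period
        self.quanta_done = 0
        self.preemption_free = False

    def on_quantum_end(self):
        """Call once per elapsed quantum; returns the mode now in force."""
        self.quanta_done += 1
        if self.quanta_done >= self.period:
            self.quanta_done = 0
            self.preemption_free = not self.preemption_free
        return self.preemption_free


def pick_next(ready, preemption_free):
    """ready is a list of (name, remaining_quantum) in queue order."""
    if preemption_free:
        for name, remaining in ready:
            if not is_long(remaining):
                return name              # short processes run first
    return ready[0][0] if ready else None   # default round-robin choice
```

In a preemption-free period the first short process in the queue is chosen; if none is ready, the sketch falls back to the round-robin choice so that the CPU never sits idle while work is queued.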
V. EVALUATION CRITERIA OF PARALLELIZED SYSTEMS
Parallelism suffers from several challenges that limit 
the transformation of uniprocessor, single-threaded 
applications to parallelized multi-threaded systems. 
In performing such a transformation process, the 
following constraints are significant: 
Inter-core/processor communication: When dealing with a multicore environment, inter-core communication must be taken into consideration, especially in terms of latency in time and clock cycles, which is evident when two or more cores share common resources or data. Approaches for minimizing this delay should be carefully evaluated.
Data Dependency: Parallelism is tightly related to the concept of data dependence. Any transformation approach should respect such dependence for the parallelization to be successful. Data should be produced and consumed in exactly the same order as in the original, pre-transformed application. In terms of load-store order, data dependency can take the following forms:
1. True Dependence:
a. X = ...
b. ... = X
The dependence ensures that the second statement receives the value computed by the first. This type of dependence is also known as flow dependence.
2. Anti Dependence:
a. ... = X
b. X = ...
The dependence prevents the interchange of a and b, which could lead to statement a incorrectly using the value computed by statement b.
3. Output Dependence:
Both statements write into the same location:
a. X = ...
b. X = ...
This dependence prevents an interchange that might cause a later statement to read the wrong value. For example, in the code fragment:
c. X = 1
d. ...
e. X = 2
f. W = X * Y
statement e must not be allowed to move before statement c; otherwise Y would be incorrectly multiplied by 1, rather than 2, in f.
Control Dependency: Besides data dependency, 
control dependency is a critical issue to be 
considered when parallelizing. Statements which will 
not be executed unless the corresponding predicate 
(conditional branch) is resolved are considered to be 
control dependent on that predicate. Consider the 
following simple example:  
if (a < 0) 
b = 1; 
c = 2; 
While the assignment b = 1 is executed only if a < 0, the 
assignment c = 2 is always executed regardless of the 
value of a. We say that b = 1 is control dependent on the 
condition a < 0 and that c = 2 is control independent. We 
refer to the branch on which an instruction is control 
dependent as its control dependence branch. 
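The three data dependence forms listed above can be checked mechanically from the sets of locations each statement reads and writes. The sketch below assumes a hypothetical representation in which every statement is a dictionary with 'reads' and 'writes' sets and the first statement precedes the second in program order; it is meant only to illustrate the definitions.

```python
def classify_dependences(first, second):
    """Return the dependence kinds that order `first` before `second`.

    Each statement is a dict with 'reads' and 'writes' sets of
    variable names (a representation assumed for this example).
    """
    found = set()
    if first["writes"] & second["reads"]:
        found.add("true")      # write then read: flow dependence
    if first["reads"] & second["writes"]:
        found.add("anti")      # read then write
    if first["writes"] & second["writes"]:
        found.add("output")    # write then write
    return found


# Statements e (X = 2) and f (W = X * Y) from the fragment above.
stmt_e = {"reads": set(), "writes": {"X"}}
stmt_f = {"reads": {"X", "Y"}, "writes": {"W"}}
kinds = classify_dependences(stmt_e, stmt_f)
```

An empty result means the two statements touch disjoint data and may be reordered or run in parallel as far as data dependence is concerned; control dependence has to be checked separately.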
VI. METHODOLOGY AND PROPOSED PARALLELIZED 
ALGORITHM
We define the problem of locating potential parallelism in the EOS as a framework which consists of three stages:
1. Creating an abstract model
2. Partitioning the abstract model
3. Analyzing the performance of the partitioned system
The first stage involves representing the actual code of the OS as an annotated acyclic graph. This approach abstracts away unnecessary details in the code, which helps to generalize the problem. In this case, we use a control flow graph (CFG) [16] as a representation. The second stage is based on the abstract model. Using one of the techniques mentioned earlier (data or functional), the abstract model is divided into threads. Since the abstract model represents the code, partitioning the CFG partitions the code itself. The final stage involves evaluating the performance of the partitioned system. The results of this stage indicate whether there is potential in parallelizing the system at hand. In the following section, we present the proposed algorithm in detail.
Our goal is to come up with an algorithm that 
identifies potential parallelism in existing single-threaded 
wireless sensor node applications. Figure 9 presents a 
snapshot of the 2 main components of the algorithm.  As 
an example, we will be examining a multimedia image 
encoding application for wireless sensor networks used 
for surveillance and monitoring purposes. The algorithm 
proposed is solely based on information flow analysis via 
data/control dependency in which the control flow and 
data/control dependency is carefully examined to identify 
data definitions/usage in the application’s code. Such an 
image encoder is characterized as a resource demanding 
application which may suffer significantly from 
limitations and constraints in the wireless sensor 
networks context such as limited energy and resources.  
The main advantage of parallelizing single-threaded applications into multi-threaded counterparts on a multicore system is that the number of execution cycles per core is reduced significantly, allowing each core in the system to operate at a lower frequency and thus leading to a reduction in overall energy consumption and an improvement in performance. As a step toward this improvement, our algorithm identifies whether parallelism exists in current single-threaded applications. After the image encoder is fed into our algorithm, a control flow graph (CFG), consisting of nodes/blocks and edges flowing between nodes, is first generated. Each node/block consists of one or more instruction-level statements that are tightly related to each other. Next, a program dependence graph (PDG), constructed by identifying control and data dependencies between the nodes/blocks, is generated; our proposed algorithm works on this graph. Figures 10a and 10b represent the CFG and the PDG respectively.
Figure 9. Lines 1-5: initialization statements; Lines 6-14: identify and initialize the first extracted thread of independent nodes; Lines 15-32: the thread extraction process starts; Lines 34-55: the main function responsible for extracting threads.
The algorithm consists of three major steps:
1. Initialize three main variables: We start by initializing the sets StartingNodes, VisitedNodes, and ExtractedThreads. StartingNodes is a set containing one or more nodes having the maximum number of incoming edges. A large number of incoming edges illustrates the significance of a node with respect to other nodes in terms of data/control dependency, and such a node should therefore be included in at least one of the extracted threads. VisitedNodes is a set that is incrementally updated with each newly visited node as the algorithm proceeds. ExtractedThreads is the main variable, which represents a list of lists of nodes. Each list is a thread containing the nodes that the algorithm has chosen to include due to their dependency.
2. Grouping: Next, all independent nodes having no data/control dependency with other nodes are grouped together into a single thread, which is the first thread to be executed separately on one of the cores. Such nodes are selected by choosing the nodes with InDegree = OutDegree = 0. When working with graphs, InDegree is the number of incoming edges of a node, while OutDegree is the number of outgoing edges.
3. Iterative Selection: Left with the most significant nodes along with their data/control dependencies, we iteratively pick nodes from the StartingNodes set to be included in the next extracted thread, in this case node B0. One of the nodes directly dependent on B0, namely B1, is considered for the first iteration, and then we recursively check whether there are any nodes depending on B1 but not directly depending on B0. B6 is the only node depending on B1, so it is added to the current thread along with B0 and B1. Each time a node is traversed, it is marked as visited by adding it to the VisitedNodes set. With B6 now being the current node, we keep recursively checking, for every node, the set of nodes it depends on and the set of nodes depending on it. For example, node B6 does not have any node depending on it, but it directly depends on B3. B3 only has node B4 depending on it, while it directly depends on B2.
However, when reaching node B2, B9 is not included even though it depends on B2, because it directly depends on the starting node B0 and will be traversed during the next iteration. Since all the dependencies in this iteration are covered, a second thread is extracted, consisting of 6 nodes strictly depending on each other: B0->B1->B6->B3->B4->B2. With node B9 being the only unvisited node depending on B0, the same scenario is executed, which gives a third and final thread consisting of B0->B9->B2->B1, as depicted in Figure 11.
Figure 10. The CFG (a) and PDG (b). Red edges represent control dependency, while black edges represent data dependency.
Figure 11. The final output. 3 separate threads sharing nodes marked in 
green. Such nodes could be synchronized using any inter-thread 
communication mechanism. 
Note that no restrictions or validations are imposed on the nodes which the current node directly depends on when it is traversed recursively. This is because a node cannot be executed unless the nodes it depends on are included in the same thread. However, nodes that depend on it need not be included in the same thread and will be included in one of the subsequently extracted threads. As a result, since we managed to extract more than one thread, including the first thread containing the independent nodes, we can conclude that parallelism exists and the single-threaded application can be mapped onto a multi-core/processor system.
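The first two steps of the algorithm (choosing StartingNodes and grouping independent nodes) are simple enough to sketch directly. The example below is an assumption-laden illustration: the graph is hypothetical and is not the PDG of Figure 10, the function names are invented, and an edge (src, dst) is taken to mean that src depends on dst, so that a heavily depended-upon node has many incoming edges. The iterative selection of step 3 is not covered.

```python
def degrees(nodes, edges):
    """Count incoming and outgoing edges per node."""
    indeg = {n: 0 for n in nodes}
    outdeg = {n: 0 for n in nodes}
    for src, dst in edges:
        outdeg[src] += 1
        indeg[dst] += 1
    return indeg, outdeg


def starting_nodes(nodes, edges):
    """Nodes with the maximum number of incoming edges (step 1)."""
    if not nodes:
        return set()
    indeg, _ = degrees(nodes, edges)
    top = max(indeg.values())
    return {n for n in nodes if indeg[n] == top}


def independent_thread(nodes, edges):
    """Nodes with InDegree = OutDegree = 0, grouped into one thread (step 2)."""
    indeg, outdeg = degrees(nodes, edges)
    return [n for n in nodes if indeg[n] == 0 and outdeg[n] == 0]


# Hypothetical dependence graph, not taken from the paper's figures.
nodes = ["B0", "B1", "B2", "B3", "B4", "B5"]
edges = [("B1", "B0"), ("B2", "B0"), ("B3", "B2")]
extracted_threads = [independent_thread(nodes, edges)]
```

For this toy graph, B0 is the only starting node and B4 and B5 form the first extracted thread, which could run on its own core without any synchronization.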
VII. EXPERIMENTAL ANALYSIS
Our simulation analysis is divided into two main 
experiments: 
A.  Experiment 1 
In the first experiment, we study the performance of our optimized scheduler. We have implemented a benchmark suite that simulates a system under high load. Our benchmark assumes a tree topology as shown in Figure 12. Nodes with larger height h have more work to process, while nodes with lower h are less loaded. To simulate the load relative to the position in the tree, the benchmark uses two variables: the frequency fs at which packets arrive and the sensing duration ls. By varying these values, the position hi in the routing tree is simulated. In our simulation, we are only interested in nodes that experience high system loads, illustrated in Figure 12. This is because the overhead of context switches only appears then. In our benchmark, high system load is represented by values of fs and ls of 300000 CPU cycles and 1000 ms respectively. Moreover, 4 copies of the benchmark were run at once to simulate the existence of 4 neighboring nodes. Our benchmarking suite was run for one minute before and after implementing our scheduling algorithm. The performance of the system was monitored and plotted to show the change in energy consumption and the effect on event processing.
Figure 12.  Network routing topology forming a tree. The greater the 
height h, the closer the node is to the sink or the root. The high system 
load area is the area of interest. 
Energy Consumption
We have shown in previous sections the effect of context switches on the energy efficiency of a system: the more context switches, the more energy is consumed. We argue that if we decrease the number of context switches while still doing the same amount of work, we obtain better energy consumption. From the OS perspective, energy is not measured by the amount of current dissipated; instead it is measured by idle time. The energy efficiency of an OS is how much idle time it can provide for the CPU. By sparing the CPU some of its cycles, the result is better energy consumption. In the first experiment, the number of CPU cycles spent is plotted before and after our implementation.
The results illustrated in Figure 13 are an indication of the percentage of idle time. The number of CPU cycles spent after our optimization is lower than the number spent without it. This is because we reduced the number of context switches and therefore reduced the total amount of processing the CPU has to perform. In the time frame of the experiment, the same number of packets was delivered before and after, and the sensing tasks had the same length as well. Yet, due to the reduction in the number of times the CPU has to switch between processes, the CPU does less work. This is a direct indication of both idle time and energy consumption, i.e. the fewer the cycles, the more the CPU is idle and the more energy is conserved.
Figure 13. Number of CPU ticks decreased using our algorithm.  
In the second experiment, the total number of context switches is monitored. As mentioned earlier, we simulate packets coming from 4 different neighbors. The amount of processing done for each neighbor is monitored and the number of context switches is calculated as well. In Figure 14, the number of context switches due to each neighbor is plotted before and after our optimization. The plot shows a significant decrease in the number of context switches due to our optimization. Overall, our algorithm reduces the total number of context switches due to processing packets coming from all neighbors by more than 70 percent.
Figure 14. The number of context switches is optimized. 
Event Processing
Although we have optimized preemption, this was 
expected to incur an overhead in terms of delay. Our next 
experiments investigate this delay and its effect on event 
processing. Figure 15 presents the effect of our 
optimization on the predictability or real-time operation 
of the system. The average processing time is calculated 
and plotted before and after our optimization. The 
average is the total processing time spent for all 
neighbors divided by the number of neighbors. The delay 
incurred by our algorithm is hence the difference between the average processing time before and after. As shown in the plot, this difference is very small; hence event processing is only slightly affected. This delay is affected by the choice of the work-done parameter discussed in earlier sections.
Figure 15. Event processing is slightly affected by the optimization.
We were also interested in investigating the relation 
between the size of processes and behavior of context 
switches. As the number of long processes increases, the 
number of context switches is expected to increase. 
Moreover, our algorithm has more potential for 
conserving energy when there are enough small processes 
to run without preempting longer tasks. For example, if the number of short processes is small, the scheduler will go back to its default (round-robin) state before the full amount of work has been done; otherwise the scheduler would cause a deadlock. If small processes cannot cover the preemption-free period, the scheduler will run long and short processes as if it were a round-robin scheduler, since it will always go back to its default state. However, we know this is not often the case at high system load. This is
illustrated in Figure 16. The number of context switches 
increases steadily and at a low rate as small processes 
arrive. At time = 40 sec, a significant decrease in the 
number of short processes causes a rapid increase in the 
number of context switches. The plot also shows that the 
percentage of small processes is not very high. This 
means that the number of voluntary context switches is 
low, and the overhead is due to involuntary context 
switches. Since short processes have smaller quanta, 
processes that perform voluntary context switches are 
fewer. This is because a smaller quantum is a result of a 
voluntary context switch in the first place. Hence 
voluntary context switches do not dominate the overhead 
of preemption which justifies our focus on involuntary 
context switches. 
Figure 16. The relation between percentage of short processes and 
context switch behavior 
B.  Experiment 2 
In the second experiment, we analyze and predict the performance of our proposed parallel algorithm. Our software partitioning assumes multiple CPUs at the hardware level. If this is not the case, partitioning will have a negative effect on the system by overwhelming it with threads. As a result, energy consumption will increase dramatically and concurrency will be much more complicated and unstable. On the other hand, by having multiple CPUs, we are exploiting the potential that TinyOS has for better performance. We illustrate how a sensor node's performance would change using our partitioning algorithm as the number of CPUs increases. As the number of CPUs or cores increases, tasks are scheduled accordingly, resulting in fewer cycles per core. For example, if we partitioned a task into 4 threads, a single core would have to execute all 4 threads, whereas with 4 cores each core would execute a single thread with fewer context switches. Fewer context switches result in better energy consumption.
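The 4-thread example can be generalized with a small sketch that spreads threads over cores as evenly as possible. The even distribution and the function name are assumptions made for illustration; the paper does not specify how threads are mapped to cores.

```python
def threads_per_core(num_threads, num_cores):
    """Spread threads over cores as evenly as possible (assumed policy)."""
    if num_cores < 1:
        raise ValueError("at least one core is required")
    base, extra = divmod(num_threads, num_cores)
    return [base + 1 if core < extra else base for core in range(num_cores)]
```

A core that holds a single thread never has to switch between threads of the partitioned task, so threads_per_core(4, 4) corresponds to the switch-free case in the text, while threads_per_core(4, 1) places all four threads on one core.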
Figure 17. Energy is consumed due to inter-thread communication. Threads communicate when passing variables, which justifies why functional partitioning consumes more synchronization energy.
Context switching due to multiple tasks is the dominant consumer of CPU cycles and thus of energy. Since fewer tasks run on each core, fewer context switches are required. Scheduling algorithms may also be adapted to minimize these context switches. Again, as the number of tasks on each single CPU decreases, context switches also decrease, resulting in better energy consumption. As the cost of context switching diminishes, what dominates is the cost of inter-thread communication. Inter-thread communication occurs when two threads sharing a variable communicate the value of that variable. Another example is two functions communicating parameters; in this case, communication overhead appears if functional partitioning is used. Inter-thread communication overhead is not analogous to context switch overhead. That is, inter-thread communication overhead does not increase as the number of threads increases. It actually depends on the technique used to partition a task into threads. Data partitioning, for example, produces more threads than functional partitioning. However, it requires less inter-thread communication, since the technique itself removes dependencies within a task. Functional partitioning divides a task into separate functions, so there is more scope for communication. The predicted results for the simulation are presented in Figure 17.
VIII. CONCLUSIONS AND FUTURE WORK 
In this paper, we studied the evolution of operating system designs for future wireless sensor nodes. We first showed that the value of preemption has a great impact on the design and implementation of operating systems. We introduced a simple and energy-efficient preemption algorithm targeting embedded wireless sensor network operating systems. We implemented our algorithm on an embedded operating system and evaluated its performance. Our algorithm is general and portable in the sense that it can be applied on any preemptive platform. Moreover, we have shown a significant decrease in the number of context switches using our algorithm. Our algorithm also maintains the predictable nature of the preemptive system. We also illustrated the significance of multi-core/processor system architecture in current hardware designs, especially with the current trend of wireless sensor network devices being pushed along the same line of production. We presented the importance of migrating existing WSN applications into multi-threaded applications capable of taking full advantage of multi-processor architecture. Our algorithm was able to extract multiple threads out of single-threaded applications, where data and control dependencies were carefully examined and analyzed so that they are preserved in the extracted threads. Expected improvements in terms of fewer execution cycles per core and lower energy consumption were examined.
As part of our future work, we plan to provide a deeper investigation of the effect of our algorithm on processing latency. We also intend to investigate different values for the amount of work done between adaptations and their effect on delay. A theoretical analysis of our algorithm will be provided in an extended version of this paper. An investigation involving more wireless sensor OSs is required to determine other bottlenecks. Our future work also includes the simulation of our results on multi-processor sensor nodes. We also need to investigate the consequences of such migration at the network level and check whether it would affect the overall network performance.
REFERENCES
[1] J. Hill, R. Szewczyk, A. Woo, S. Hollar, D. Culler, and K. Pister, “System Architecture Directions for Networked Sensors,” Proceedings of the ninth international conference on Architectural support for programming languages and operating systems, ACM Press, New York, USA, November 2000, pp. 93-104.
[2] S. Bhatti, J. Carlson, H. Dai, J. Deng, J. Rose, A. Sheth, B. Shucker, C. Gruenwald, A. Torgerson, and R. Han, “MOS: An Embedded Multithreaded Operating System for Wireless Micro Sensor Platforms,” ACM/Kluwer Mobile Networks and Applications Journal, Special Issue on Wireless Sensor Networks, Kluwer Academic Publishers, Hingham, USA, August 2005, pp. 563-579.
[3] C. Han, R. Kumar, R. Shea, E. Kohler, and M. Srivastava, “A dynamic operating system for sensor nodes,” Proceedings of the third international conference on Mobile systems, applications, and services, ACM Press, New York, USA, June 2005, pp. 163-176.
[4] R. von Behren, J. Condit, and E. Brewer, “Why Events Are A Bad Idea (for high-concurrency servers),” Proceedings of HotOS IX: The ninth Workshop on Hot Topics in Operating Systems, USENIX Association, Hawaii, USA, May 2003, pp. 19-24.
[5] H. Lauer and R. Needham, “On the Duality of Operating System Structures,” Proceedings of the second international Symposium on Operating Systems, IRIA, Rocquencourt, France, October 1978; reprinted in Operating Systems Review, April 1979, pp. 3-19.
[6] A. Gustafsson, “Threads Without the Pain,” Queue, ACM Press, New York, USA, November 2005, pp. 34-41.
[7] C. Duffy, U. Roedig, G. Herbert, and C. Sreenan, “An Experimental Comparison of Event Driven and Multi-Threaded Sensor Node Operating Systems,” Proceedings of the fifth Annual IEEE International Conference on Pervasive Computing and Communications Workshops, IEEE Computer Society, White Plains, New York, USA, March 2007, pp. 267-271.
[8] H. Kim and H. Cha, “Multithreading optimization techniques for sensor network operating systems,” Wireless Sensor Networks, Springer, Heidelberg, Berlin, April 2007, pp. 293-308.
[9] Y. Bartal, S. Leonardi, G. Shallom, and R. Sitters, “On the value of preemption in scheduling,” Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques, Springer, Heidelberg, Berlin, August 2006, pp. 39-48.
[10] C. Duffy, U. Roedig, G. Herbert, and C. Sreenan, “Improving the Energy Efficiency of the MANTIS Kernel,” Proceedings of the fourth IEEE European Workshop on Wireless Sensor Networks, IEEE Computer Society Press, Delft, Netherlands, January 2007.
[11] C. Duffy, U. Roedig, G. Herbert, and C. Sreenan, “Adding Preemption to TinyOS,” Proceedings of the fourth workshop on embedded networked sensors, ACM Press, Cork, Ireland, June 2007.
[12] Erik B. van der Tol, Egbert G. T. Jaspers, Rob H. Gelderblom, “Mapping of H.264 Decoding on a Multiprocessor Architecture”
[13] Salvatore Carta, Michele Pittau, Andrea Acquaviva, Pablo G. Del Valle, David Atienza, Giovanni De Micheli, Fernando Rincon, Luca Benini, Jose M. Mendias, “Multi-Processor Operating System Emulation Framework with Thermal Feedback for Systems-on-Chip”, GLSVLSI 2007
[14] Yang Yu, Loren J. Rittle, Jason B. LeBrun, Vartika Bhandari, “Supporting Concurrent Applications in Wireless Sensor Networks,” Proceedings of the 4th international conference on Embedded Networked Sensor Systems, SenSys '06
[15] Monica S. Lam and Robert P. Wilson, “Limits of Control Flow on Parallelism”, ACM, 199
[16] Mary Jean Harrold, Gregg Rothermel, Alex Orso, “Representation and Analysis of Software”
Dr. Mohamed K. Watfa is 
currently in the college of Computer 
Science and Engineering at the 
University of Wollongong in Dubai.  
Before that he was at the Computer 
Science department at the American 
University of Beirut (AUB). He 
received his Ph.D. from the School 
of Electrical and Computer 
Engineering at the University of 
Oklahoma in Norman, OK, USA 
in 2006 and was one of the youngest PhD holders in the history of
the university. He obtained his BS in Computer Science from 
American University of Beirut in 2002 and his Masters degree 
in Engineering Science from the University of Toledo, OH, 
USA in 2003. He was one of the youngest PhD holders to 
graduate from his university at the age of 23. He was also on the 
dean’s honors list and was given a number of prestigious awards 
and scholarships. His research interests include wireless sensor 
networks, intelligent systems, Vehicular Ad-hoc Networks, 
wireless networking, resource management, energy issues, 
tracking, routing, and performance measures. He is the author of 
a number of books, the guest editor of a number of international 
journals, and the organizer of a number of international 
conferences. He also held a position as a lead network engineer 
at different networking companies. He is a professional member 
of the ACM and IEEE. He has more than 40 journal and 
conference publications ranked among the top in his research 
domain. 
Mohamed A. Moubarak is a 
Software Engineer at Consolidated 
Contractors Company. He received his 
B.S and M.S degrees in computer 
science from the American University 
of Beirut (AUB), Beirut, Lebanon. He 
is the author of the book chapter, 
Embedded Operating Systems in 
Wireless Sensor Networks (coauthored 
with Mohamed K. Watfa) which 
appeared in the book Guide to 
Wireless Sensor Networks. His current research interests 
include wireless sensor networks, embedded operating systems, 
performance evaluation and benchmarking, scheduling 
algorithms. 
Ali H. Kashani is a 
telecommunication engineer at MHD 
Telecom S.A.R.L. He received his B.S 
& M.S in computer science from the 
American University of Beirut (AUB), 
Beirut, Lebanon. His research interests 
include wireless sensor networks, 
clustering and routing in MANETs, 
energy efficiency in wireless ad-hoc 
networks, and parallel & multi-threaded distributed algorithms 
on multi-core systems. 
