Analytical techniques are either highly inaccurate or require unaffordable computational costs when used to evaluate concurrency in complex multiprocessor and multicore systems. Recent studies show that simulation techniques are being used in engineering and scientific research more than ever. Simulation can be a promising alternative to analytical techniques because complex and expensive real-world systems can be modeled as simplified representations that include only the relevant aspects of the problem. As billions of transistors become available on a single chip, the total number of CPUs in a multiprocessor system and the total number of cores in a multicore architecture are expected to grow significantly. As the number of processing cores increases, currently available simulators will become inadequate to simulate the concurrency of future computation-intensive systems. Therefore, we propose a simulation technique to effectively model concurrency in multiprocessor and multicore systems. The proposed VisualSim-accelerated concurrency modeling approach is able to evaluate system performance. We model concurrency in a multiprocessor shared-memory system and measure two performance metrics: average response time and task completion time. We find the proposed concurrency modeling technique easy, flexible, and capable of determining deadlock and starvation, balancing loads, and scheduling tasks.
Introduction
The growing addition of functionality makes applications more complex, and such complex applications require more processing power. Because processing elements (PEs) are becoming faster and cheaper, powerful multiprocessor systems are built to satisfy the rising needs [1, 2]. In a multiprocessor system, the instructions of a concurrent program are divided and executed on different processors in parallel to improve performance. At a higher level of abstraction, a multiprocessor or multicore system can be considered simply a collection of processing elements that can load and store data in a global, shared memory. While a single-processor system dynamically schedules multiple processes on a single processor, a multiprocessor system dynamically schedules multiple processes onto various processors. The mapping scheme of single-processor systems is appropriate for processes that are not collectively involved in a parallel computing task, as is the case with many users or user programs. In a parallel program, however, each part of the program (determined by the partitioning scheme) runs on a separate processor/core, which implies a static one-to-one mapping, i.e., one process per processor for optimum efficiency. Figures 1(a), 1(b), and 1(c) show a single-processor (i.e., single-core), a multiprocessor, and a multicore system, respectively [3] [4] [5] [6].
Figure 1: Single-processor, multiprocessor, and multicore systems. (a) A single-processor (i.e., single-core) system. (b) A multiprocessor system. (c) A multicore system.
Unlike sequential systems, the processes that make up a concurrent system interact with each other for their share of the common resources. Simultaneous use of shared resources is the source of many difficulties. Race conditions involving shared resources can result in unpredictable system behavior. Introducing mutual exclusion can prevent race conditions, but it can lead to problems such as deadlock and starvation. The design of concurrent systems therefore needs reliable techniques for coordinating execution, data exchange, memory allocation, and execution scheduling in order to minimize response time and maximize throughput.
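The effect of mutual exclusion on a shared resource can be illustrated with a minimal sketch. It is not part of the VisualSim model; the function name `run_workers`, the thread count, and the iteration count are assumptions chosen only for illustration. Three threads stand in for three processors, and a shared counter stands in for the shared memory.

```python
import threading


def run_workers(num_workers, increments, use_lock):
    """Let several threads increment one shared counter; return its final value."""
    state = {"counter": 0}  # stands in for the shared memory
    lock = threading.Lock()

    def worker():
        for _ in range(increments):
            if use_lock:
                # Mutual exclusion: one thread at a time does the read-modify-write.
                with lock:
                    state["counter"] += 1
            else:
                # Unprotected read-modify-write: updates may be lost under contention.
                state["counter"] += 1

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return state["counter"]


protected_total = run_workers(num_workers=3, increments=10_000, use_lock=True)
unprotected_total = run_workers(num_workers=3, increments=10_000, use_lock=False)
print(protected_total, unprotected_total)
```

With the lock, the total is always 30000. Without it, the total can be lower on some interpreters and runs, which is the unpredictable behavior a race condition produces; whether a loss is actually observed depends on the runtime.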
The concurrency of a system can be reflected in the control program structure. Concurrency can be used to speed up the response to user interaction by offloading time-consuming tasks to separate processors. Throughput can be improved by using multiple processors to manage communication and device latency. However, the advantages of concurrency may be offset by the increased complexity of concurrent multiprocessor systems. The processors must communicate properly in order to divide and solve a problem efficiently. With currently available simulation tools, concurrency can be modeled only up to a certain limit. None of these techniques is efficient and adequate for future architecture exploration.
In this paper, we present a promising simulation technique that is capable of analyzing concurrency and performance metrics of a multiprocessor system at a higher level of abstraction. Section 2 discusses concurrency in multiprocessor systems. Concurrency modeling using LTSA/FSP is illustrated in Section 3. Section 4 presents the proposed concurrency modeling technique. Results are discussed in Section 5. Finally, we conclude our work in Section 6.
Concurrency in Multicore Architecture and Multiprocessor Systems
Concurrency can be defined as the appearance of simultaneous execution of processes or transactions, obtained by interleaving the execution of multiple pieces of work. Logical concurrency (logically simultaneous processing) requires interleaved execution on a single PE. Parallelism in a multiprocessor system is a situation in which two or more identical changes occur independently, that is, physically simultaneous processing. Parallelism can be considered physical concurrency that involves multiple PEs to complete a task. Both logical and physical concurrency require controlled access to the shared resources. We use the two terms interchangeably in this paper. Each processor may handle more than one process at a time. Therefore, the total number of active processes in a multiprocessor system may be greater than the number of available processors [7] [8] [9].
Modeling Concurrency
Three important evaluation techniques are measurement, analytical, and simulation (also known as modeling).
• Measurement: the target system must exist, so measurement is not an option for future systems.
• Analytical: analytical techniques are either highly inaccurate or require unaffordable computational costs to evaluate multiprocessor systems.
• Simulation: simulation techniques have strong potential for modeling multiprocessor systems.
Modeling a multiprocessor system involves the following issues: concurrency, ordering, speed, and time. Concurrency must be modeled. Tasks from different processors should be able to access shared resources (i.e., memory) concurrently. This concurrent execution is achieved by an interleaving or time-sharing algorithm, because only one process/processor can use the shared memory at any point in time. To guarantee fair execution, tasks from the same processor are executed in the order in which they are generated, and tasks from different processors are assigned priorities. Speed is not considered, so a processor can take any amount of time to process a task. Time is also not considered: for two independent tasks T1 and T2, both orders T1 → T2 and T2 → T1 are acceptable.
Deadlock and Starvation
Deadlock and starvation are two common problems in concurrent multiprocessor systems where many processes share a mutually exclusive resource. Deadlock refers to the circumstance in which two or more processes are each waiting for another to release a resource. As a result, the processes involved in the deadlock cannot finish their tasks. Starvation refers to a situation in which a process is continually denied the resources it needs. As a result, the starving process cannot finish its task [9].
There are four necessary "Coffman" conditions for a deadlock to occur. Deadlock can occur only in systems where all four conditions hold at the same time [8].
• No pre-emption: resources cannot be pre-empted. A requesting processor cannot gain access to the requested resource(s) until the holding processor is done and gives them up.
• Mutual exclusion (ME): there is at least one non-sharable resource. Only one processor can use the non-sharable resource at any time.
• Hold and wait: processes already holding resources may request new resources.
• Circular wait: two or more processes form a circular chain in which each process waits for a resource that the next process in the chain holds.
Deadlock can be prevented by enforcing one or more of the following:
• ME is applied only to non-sharable resources. Processes can access sharable resources at any time.
• If a processor P holds some resources and requests more resources that are not available, then all resources held by P are pre-empted.
• No holding is allowed while waiting.
• No circular waiting is allowed.
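The circular-wait condition can be checked mechanically. The following is a minimal sketch, assuming the system state is available as a "wait-for" mapping from each process to the processes whose resources it is waiting for. The function name `find_circular_wait` and the process names are hypothetical and are not part of LTSA or VisualSim.

```python
def find_circular_wait(waits_for):
    """Return a list of processes forming a circular wait, or None if there is none.

    waits_for maps each process to the processes holding a resource it waits for.
    """
    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on the current path, fully explored
    colour = {}
    path = []

    def visit(p):
        colour[p] = GREY
        path.append(p)
        for q in waits_for.get(p, ()):
            state = colour.get(q, WHITE)
            if state == GREY:
                # q is already on the current path, so the chain closes on itself.
                return path[path.index(q):]
            if state == WHITE:
                cycle = visit(q)
                if cycle:
                    return cycle
        path.pop()
        colour[p] = BLACK
        return None

    for p in waits_for:
        if colour.get(p, WHITE) == WHITE:
            cycle = visit(p)
            if cycle:
                return cycle
    return None


deadlocked = {"P1": ["P2"], "P2": ["P3"], "P3": ["P1"]}
safe = {"P1": ["P2"], "P2": ["P3"], "P3": []}
print(find_circular_wait(deadlocked))  # ['P1', 'P2', 'P3']
print(find_circular_wait(safe))        # None
```

A depth-first search is used here because a circular wait is exactly a cycle in the wait-for graph. Breaking any edge of the reported cycle, for example by pre-empting one process's resources, removes that particular circular wait.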
Concurrency Modeling Using LTSA
Modeling concurrency is becoming more important as we approach the era of billions of transistors on a single chip. Only a few methods and tools are available for modeling concurrency. The Labeled Transition System Analyzer (LTSA) is a verification tool for concurrent systems introduced by [10]. In this section, we discuss LTSA as a concurrency modeling tool.
LTSA automatically checks whether the specification of a concurrent system satisfies the properties required of its behavior. LTSA also supports specification animation to facilitate interactive exploration of system behavior.
A system in LTSA is modeled as a set of interacting finite state machines. The properties required of the system are also modeled as state machines. Each component of a specification is described as an LTS, which contains all the states the component may reach and all the transitions it may perform. LTSA supports a process algebra notation (FSP) for concise description of component behavior. The tool allows the LTS corresponding to an FSP specification to be viewed graphically.
We now discuss how concurrency can be modeled in FSP notation and verified using LTSA. If P1 and P2 are two separate processes, then (P1||P2) represents the concurrent execution of P1 and P2. The operator || is the parallel composition operator. Figure 2 shows a system with two processes and one processing element. P1 and P2 use the PE by time-sharing, and both are completed at time t1. In the example that follows, the action scratch is concurrent with both think and talk, but the action think must happen before the action talk [10].
We now look at how LTSA checks the processes. Figure 4 shows the FSP code to simulate the diagram in Figure 3(c). Figure 5 shows the possible trace to deadlock. At the beginning, we may start with think or scratch. If we follow the predicted trace (scratch → think → talk), the deadlock happens. LTSA can thus be used to predict deadlock. However, one must know FSP notation to use LTSA. Also, LTSA cannot be used for performance evaluation. We introduce a concurrency modeling technique that is capable of analyzing concurrency and performance metrics of a multiprocessor system at a higher level of abstraction.
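The interleaving semantics of parallel composition can be sketched outside LTSA. The following Python fragment is an illustration only, not FSP and not how LTSA is implemented. It assumes the two processes are the purely sequential ITCH = (scratch->STOP) and CONVERSE = (think->talk->STOP), share no actions, and can therefore each be represented as a tuple of action names. The function name `interleavings` is hypothetical.

```python
def interleavings(a, b):
    """Yield every interleaving of action sequences a and b that keeps each one's order."""
    if not a or not b:
        yield tuple(a) + tuple(b)
        return
    # Either the first process moves next, or the second one does.
    for rest in interleavings(a[1:], b):
        yield (a[0],) + rest
    for rest in interleavings(a, b[1:]):
        yield (b[0],) + rest


ITCH = ("scratch",)
CONVERSE = ("think", "talk")

traces = sorted(interleavings(ITCH, CONVERSE))
for trace in traces:
    print(" -> ".join(trace))
# scratch -> think -> talk
# think -> scratch -> talk
# think -> talk -> scratch
```

The three traces show that scratch may occur before, between, or after think and talk, while think always precedes talk. Every complete trace ends with both processes at STOP, a state with no further transitions, which is consistent with the deadlock trace described above.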
Proposed Concurrency Modeling Technique
In this work, we develop an efficient simulation method that is capable of analyzing concurrency and other complexities of multiprocessor systems at a higher level of abstraction. VisualSim (a system-level simulation tool) is used to model the multiprocessor system and to simulate the concurrency. Two performance metrics, namely average response time and task completion time, are measured in this experiment.
Simulated Architecture
In a concurrent system, a large task is divided into small tasks that are distributed among the processors. We simulate a simplified concurrent system with three processors working together to solve problems. Each processor has its own local memory to do its task. Figure 6 shows the simulated architecture.
Figure 6: Concurrent multiprocessor architecture.
Each processor (P-1 to P-3) needs access to the shared memory to perform its job. Processors submit their requests for the shared memory to the scheduler, and the scheduler allows one processor at a time, depending on the policy, to access the shared memory. Two or more processors cannot access the shared memory at the same time.
Assumptions
The following assumptions are made to model and run the VisualSim simulation.
1. Pre-emption, priority, and ME are used to avoid deadlock and starvation.
2. Only one task can access the shared memory at any point in time. Tasks from the same processor are therefore assigned numbers and executed in order. Similarly, tasks from different processors are assigned priorities.
3. A processor can take all the time it needs to process a task. So, for two independent tasks T1 and T2, both orders T1 → T2 and T2 → T1 are considered acceptable.
4. To keep the simulation program simple, we ignore other irrelevant details.
System Parameters
Various task groups and scheduling schemes are used to run the simulation. Each task may have a start time, a mean time (the interval after which the next task may be generated), a priority, and an ME indicator. We use three scheduling schemes based on first-come, first-served (FCFS), priority with pre-emption, and round-robin (RR) with time-slicing. Table 1 shows Schedule 1, which is a simple FCFS scheme: no priority and no ME are involved. Schedule 2 is FCFS with pre-emption (PreE), priority, and no ME, as shown in Table 2. Table 3 shows Schedule 3, which is RR (time slice TS = 1 for all tasks) with PreE, priority, and no ME. Three tasks (Task-1 from Proc-1, Task-2 from Proc-2, and Task-3 from Proc-3) and the various scheduling schemes are used to run the simulation model.
ITCH = (scratch->STOP).
CONVERSE = (think->talk->STOP).
||CONVERSE_ITCH = (ITCH || CONVERSE).

VisualSim Model
The VisualSim model is shown in Figure 7. VisualSim is a system-level simulation tool from Mirabilis Design [11]. In VisualSim, a system is described in three major parts: Architecture, Behavior, and Workload. Architecture includes the major elements such as processor and memory. Behavior describes the actions performed on the system. Workload captures the transactions that traverse the system during the simulation. Mapping between behavior and architecture is performed using virtual execution. The VisualSim simulator is suitable for performance modeling in the early analysis of multicore systems [12]. The processors submit their tasks to the scheduler. The scheduler allows a task to access the shared memory according to the scheduling policy.
Results and Discussions
In this work, we present a simulation method which is capable of analyzing concurrency and other complexities of multiprocessor systems. First we discuss how this method can be used to determine deadlock and starvation. Then we present the performance metrics obtained using this technique.
Deadlock and Starvation
First, Task-1 (from Proc-1), Task-2 (from Proc-2), and Task-3 (from Proc-3) are generated at time 0.0 with the same priority. On various occasions, VisualSim generates an exception message, as shown in Figure 8. One of those situations is deadlock, where none of the tasks is executed even though there are enough free resources.
Second, Task-1 (from Proc-1), Task-2 (from Proc-2), and Task-3 (from Proc-3) are generated at time 0.0 with priorities 1, 2, and 3, respectively, where a bigger number means a higher priority. At time 0.0 the scheduler queue is empty, so Task-3 starts at 0.0. After the first time slice, Task-3 gives up the resources. Task-2 starts at 1.0 because it has the higher priority (higher than Task-1's priority). After the second time slice, Task-2 gives up the resources.
Task-3 starts again at 2.0 because it has the higher priority (higher than Task-1's priority). If the system generates more instances of Task-2 and Task-3 before the current ones are finished, then Task-1 keeps starving, as shown in Figure 9.
Figure 10 shows the simulation output for the FCFS schedule (no pre-emption). Task-1 (from Proc-1) is issued at time 0.0 with priority 1, when the scheduler queue is empty, so Task-1 starts at 0.0. Task-2 (from Proc-2) is issued at 2.0 with priority 2 (higher than Task-1's priority). However, Task-1's mutual exclusion option is set to 'Yes', which means that once Task-1 has access to the resources, it does not give them up until the task is completed. So Task-2 waits for 1.0 unit of time until Task-1 finishes at 3.0; at that time the scheduler is free and Task-2 starts. Similarly, Task-3 (from Proc-3) is issued at 4.0 with priority 3 (higher than the priorities of Task-2 and Task-1), waits for 2.0 units of time, starts at 6.0, and completes at 9.0.
Now we investigate the impact of FCFS with pre-emption and ME. Figure 11 shows the simulation output for this schedule. Task-1 (from Proc-1) is issued at time 0.0 with priority 1, when the scheduler queue is empty, so Task-1 starts at 0.0 as before. Task-2 (from Proc-2) is issued at 2.0 with priority 2 (higher than Task-1's priority). This time Task-1's mutual exclusion option is set to 'No', which means that when a higher-priority task arrives, Task-1 must give up the resources even though it is not completed. Task-1 therefore gives up the resources at 2.0 and Task-2 starts immediately. Similarly, Task-3 (from Proc-3) is issued at 4.0 with priority 3 (higher than the priorities of Task-1 and Task-2), starts at 4.0, and completes at 7.0. Task-2 and Task-1 are completed later (at times 8.0 and 9.0, respectively), in order of their priorities. Figures 10 and 11 clearly indicate that scheduling has a significant impact on load balancing in a multiprocessor system.
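The two timelines described above can be reproduced with a small discrete-time sketch. This is not the VisualSim model; it is a simplified stand-in in which time advances in unit steps and a single shared resource is assumed. The function name `simulate` and the tuple layout are hypothetical, and the execution time of 3.0 per task is an assumption inferred from the start and finish times quoted in the text.

```python
def simulate(tasks, preemptive):
    """Simulate one shared resource in unit time steps.

    tasks: list of (name, issue_time, duration, priority); larger number = higher priority.
    Returns {name: (start_time, finish_time)}.
    """
    remaining = {name: duration for name, _, duration, _ in tasks}
    start, finish = {}, {}
    running = None
    t = 0
    while len(finish) < len(tasks):
        ready = [task for task in tasks if task[1] <= t and task[0] not in finish]
        if ready:
            if preemptive:
                # Highest priority runs; among equals, the earliest issued wins.
                running = max(ready, key=lambda task: (task[3], -task[1]))[0]
            elif running is None:
                # FCFS: pick the earliest issued task; it keeps the resource until done.
                running = min(ready, key=lambda task: task[1])[0]
            start.setdefault(running, t)
            remaining[running] -= 1
            if remaining[running] == 0:
                finish[running] = t + 1
                running = None
        t += 1
    return {name: (start[name], finish[name]) for name, *_ in tasks}


# Issue times and priorities follow the text; durations of 3 are assumed.
tasks = [("Task-1", 0, 3, 1), ("Task-2", 2, 3, 2), ("Task-3", 4, 3, 3)]

print(simulate(tasks, preemptive=False))
# {'Task-1': (0, 3), 'Task-2': (3, 6), 'Task-3': (6, 9)}
print(simulate(tasks, preemptive=True))
# {'Task-1': (0, 9), 'Task-2': (2, 8), 'Task-3': (4, 7)}
```

Under these assumptions, the non-pre-emptive run matches the FCFS description (Task-2 starts at 3.0, Task-3 runs from 6.0 to 9.0), and the pre-emptive run matches the second description (Task-3 completes at 7.0, Task-2 at 8.0, and Task-1 at 9.0).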
The impact of scheduling (i.e., load balancing) on response time and completion time is presented in the following subsections.
Load-Balancing by Scheduling
Performance Analysis
Response time is a measure of the time a system takes to react to a given input (from the request to the first reaction). In Figure 12, the response time of Task-1 is 0.0 units of time. The average response time versus schedules is presented in Figure 12. All three tasks start at 0.0, and the mean time is 10.0 for all. Simulation results show that Schedule 3 offers the best performance for this task group.
Completion time is the time a system takes to perform a task (from start to finish). In Figure 13, the completion time of Task-1 is 3.0 units of time. The total completion time versus schedules is presented in Figure 13. All three tasks start at 0.0, and the mean time is 10.0 for all. Results show that Schedules 1 and 2 offer the best performance.
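The two metrics can be computed directly from a schedule trace. The sketch below is an illustration under stated assumptions: the function names are hypothetical, and the input values are the issue, start, and finish times quoted earlier for the non-pre-emptive FCFS walk-through, not the task group plotted in Figures 12 and 13.

```python
def average_response_time(schedule):
    """Mean of (start - issue) over all tasks: request to first reaction."""
    return sum(start - issue for issue, start, _ in schedule.values()) / len(schedule)


def completion_times(schedule):
    """Per-task completion time, measured from start to finish."""
    return {name: finish - start for name, (_, start, finish) in schedule.items()}


# name: (issue_time, start_time, finish_time), taken from the FCFS walk-through.
fcfs = {
    "Task-1": (0.0, 0.0, 3.0),
    "Task-2": (2.0, 3.0, 6.0),
    "Task-3": (4.0, 6.0, 9.0),
}

print(average_response_time(fcfs))           # 1.0
print(completion_times(fcfs))                # each task: 3.0
print(sum(completion_times(fcfs).values()))  # 9.0
```

For this trace the response times are 0.0, 1.0, and 2.0, giving an average of 1.0, and each task has a completion time of 3.0, which agrees with the value stated for Task-1.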
Conclusion
Measurement and analytical techniques have proven inappropriate for evaluating concurrency in large multiprocessor, multicore, and manycore systems. Several simulation methods have been proposed to model the concurrency of complex systems with many CPUs/cores. However, currently available simulation techniques are not effective for modeling concurrency and analyzing the performance of high-performance computing systems. For example, LTSA is used for concurrency modeling; it can determine deadlock but cannot evaluate system performance. In this paper, we present a concurrency modeling technique that is capable of analyzing concurrency, performance, and other complexities of multiprocessor systems and multicore/manycore architectures. The proposed approach uses the VisualSim tool to model and run simulation programs. We model concurrency in a shared-memory system with three processors and measure average response time and task completion time. Experimental results show that the proposed technique can be used to determine system deadlock and starvation, if any. The technique can also be used for balancing loads and scheduling tasks among the cores. We find the proposed concurrency modeling technique easy, flexible, and better than other existing concurrency modeling techniques.
