In recent years, a variety of technologies have been developed for designing on-chip networks for multicore systems. Network interfaces differ mainly in how the network is physically connected to the multicore system and in the data path they use. Communication in a multicore system is transmitted as data packets: whenever a message is sent over the network, it is first segmented into packets and then into fixed-length flow control digits (flits). To measure the area, energy and latency overheads of various interconnection topologies, we use the Multi2Sim simulator as a research test bed for exploring the tradeoffs between performance and power and between performance and area, whose analysis can guide further optimization.
I. INTRODUCTION
The topology of a network on chip determines how the cores are interconnected, and therefore which paths data can take across the network. The routing algorithm selects the specific path a message takes from source to destination. Flow control protocols then determine how the data actually traverses the assigned route, including when and where the data leaves a core. The microarchitecture implements the routing and flow control protocols and shapes the circuit implementation. The growing number of cores on a die is a matter of serious concern. The literature still lacks a precise understanding of how to design the interconnection framework, and we know very little about how it interacts with the rest of the multicore architecture. A topology determines the number of hops data must traverse as well as the interconnect lengths between hops, and therefore has a significant influence on network latency. Because routing data through routers and links consumes energy, the effect of topology on hop count also directly affects network energy consumption.
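The dependence of hop count on topology can be illustrated with a small calculation. The following is a minimal sketch, assuming minimal (shortest-path) routing, uniform traffic between all pairs of distinct nodes, and a hypothetical 16-node network laid out as a ring or as a 4 x 4 grid; the function names are illustrative and are not part of Multi2Sim.

```python
from itertools import product

def ring_hops(a, b, n):
    """Shortest hop count between nodes a and b on a bidirectional ring of n nodes."""
    d = abs(a - b)
    return min(d, n - d)

def mesh_hops(a, b):
    """Manhattan distance between (x, y) coordinates on a 2D mesh."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

def torus_hops(a, b, side):
    """Like the mesh, but each dimension wraps around (side x side torus)."""
    dx = abs(a[0] - b[0])
    dy = abs(a[1] - b[1])
    return min(dx, side - dx) + min(dy, side - dy)

def average_hops(nodes, hops):
    """Average hop count over all ordered pairs of distinct nodes."""
    pairs = [(a, b) for a in nodes for b in nodes if a != b]
    return sum(hops(a, b) for a, b in pairs) / len(pairs)

N, SIDE = 16, 4  # assumed sizes, chosen only for illustration
ring_nodes = list(range(N))
grid_nodes = list(product(range(SIDE), repeat=2))

avg_ring = average_hops(ring_nodes, lambda a, b: ring_hops(a, b, N))
avg_mesh = average_hops(grid_nodes, mesh_hops)
avg_torus = average_hops(grid_nodes, lambda a, b: torus_hops(a, b, SIDE))

print(f"ring  {avg_ring:.3f}")
print(f"mesh  {avg_mesh:.3f}")
print(f"torus {avg_torus:.3f}")
```

Under these assumptions the average hop count is about 4.27 for the ring, 2.67 for the mesh and 2.13 for the torus. Hop count is only a proxy: real latency also depends on link length, contention and router delay.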
II. RELATED WORKS
The term core refers to a processing unit that reads instructions and carries out specific tasks. Instructions are arranged so that, executed in real time, they produce the computing experience we know today. Everyday examples that require a processing core include opening a folder, typing into a word-processing document and drawing the desktop environment. A graphics card used for gaming contains hundreds of processing cores that work on data in parallel alongside the computer's own processing cores. The design of cores is therefore extremely complex and varies between brands and even between models.
In 1965 Gordon Moore stated that the number of transistors on a chip will approximately double each year (he later refined this, in 1975, to every two years). What is often quoted as Moore's Law is David House's revision that computer performance will double every 18 months. In the early 1970s, Intel manufactured the 4-bit 4004, the first microprocessor, which was essentially a number-crunching machine. Shortly afterwards Intel produced the 8008 and 8080 chips, both 8-bit. Motorola followed with its 6800, comparable to Intel's 8080. Both companies then went on to fabricate 16-bit microprocessors. Intel's 16-bit 8086 became the basis for the company's 32-bit processors and, later, the popular Pentium lineup used in the first consumer PCs [1].
Each generation of processors was designed to be smaller and faster, and as a result dissipated more heat and consumed more power. From Intel's 8086 through the Pentium 4, performance increased gradually from one generation to the next because of rising processor frequency. The Pentium 4 alone ranged from 1.3 to 3.8 GHz over its 8 years of evolution. Meanwhile, the physical size of chips decreased while the number of transistors per chip increased. Rising clock speeds also raise the temperature across the chip, eventually to dangerous levels. Frequency scaling drove the industry for a decade, but chip designers still needed a better way to improve performance. In response to growing demand, designers turned to the idea of placing additional processing cores on the same chip. In theory, performance was expected to double while heat dissipation would be lower.
In 2000, the SPEC CINT2000 benchmark suite was released; it comprises over 5.9 trillion instructions when executed with the reference inputs. Researchers rely heavily on simulators to analyze, debug and validate new designs before implementation. Modern hardware such as a 3.06 GHz Pentium 4 [5] takes about 31 minutes to complete the benchmark. By comparison, even one of the fastest detailed single-processor superscalar models can simulate only about one million instructions per second on the same hardware, which means over 72 days to finish one invocation of the SPEC CINT2000 suite. When additional features are included, such as cache-coherent memories and the configuration needed to boot the Linux kernel, simulation becomes even slower. A fully configured cache-coherent system simulator runs only 300,000 instructions per second, which translates to 228 days for the SPEC CINT2000 suite.
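These estimates follow from dividing the instruction count by the simulation rate. The short sketch below repeats the arithmetic using the rounded figures quoted above (5.9 trillion instructions, one million and 300,000 instructions per second); the function name is illustrative only.

```python
SECONDS_PER_DAY = 86_400

def simulation_days(instructions, instr_per_second):
    """Wall-clock days needed to simulate `instructions` at a given rate."""
    return instructions / instr_per_second / SECONDS_PER_DAY

SUITE_INSTRUCTIONS = 5.9e12  # rounded lower bound quoted for SPEC CINT2000

detailed = simulation_days(SUITE_INSTRUCTIONS, 1_000_000)
full_system = simulation_days(SUITE_INSTRUCTIONS, 300_000)

print(f"detailed superscalar model: {detailed:.1f} days")
print(f"cache-coherent full system: {full_system:.1f} days")
```

With these rounded inputs the detailed model comes out at roughly 68 days and the cache-coherent system at roughly 228 days. The quoted figure of over 72 days is therefore consistent with an instruction count somewhat above 5.9 trillion or a rate slightly below one million instructions per second.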
This slowdown makes clear that there is a constant need for accurate ways to speed up simulation. When power consumption became a barrier for microprocessors around the year 2000, it motivated many university researchers to explore scalable designs such as the MIT Raw microprocessor, the Stanford Smart Memories project, the Stanford Merrimac streaming supercomputer, the MIT Scale project, the UW WaveScalar, the UT Austin TRIPS and the UC Davis Synchroscalar. Power consumption and wire delays have limited the continued scaling of centralized systems and made multicore architectures increasingly popular.
In 2007, a study on migration from electronics to photonics in multicore processors by an engineer from the National University of Singapore suggested that the resistance of metal wires causes the bottleneck in interconnects. Replacing aluminum with copper improves interconnect performance only temporarily; a sustainable solution that maintains the current pace of progress calls for optical interconnects as an alternative to metallic wires. Manycore microprocessors are also likely to push performance per chip from the 10 gigaflop to the 10 teraflop range [15].

Table 1. Summary of related work

Contribution: Proposed a design that incorporates a software mechanism to manage contexts, a fast communication system, and a locking system to guarantee mutual exclusion.
Limitations: Although several contributions have shown that the effective throughput of an SMT processor is greater than that of a standard superscalar processor, it cannot be stated with certainty that the result will be as good in homogeneous multitasking mode with code generated by a parallelizing compiler.

Contribution: Conducted power-performance simulations of several SMT and CMP models using cores of varying complexity. The analyses identify efficient pipeline dimensions and outline the implications of using a power-performance efficiency metric for multi-core design.
Limitations: Power and performance are taken as the primary metrics; however, area and interconnect effects will become significant in CMP designs with a larger number of cores.

Contribution: Studied software structures and techniques for quickly simulating modern cache-coherent multiprocessors by amortizing the time spent simulating the memory system and branch predictors.
Limitations: Increased checkpoint size and the need to know in advance the microarchitectural details that must be warmed.

R. Ubal (2007)
Keywords: Microprocessor, interconnection networks.
Contribution: Presented the Multi2Sim simulation framework, which models the major components of upcoming systems and is intended to cover the limitations of existing simulators.
Limitations: Very complex.

Bryan Schauer (2008)
Keywords: Coherence protocols, efficiency of multicore processors.
Contribution: Multicore processors are architected to adhere to reasonable power consumption, heat dissipation and cache coherence protocols.
Limitations: The difficulty of teaching parallel programming techniques (since most programmers are versed in sequential programming) and of redesigning current applications to run optimally on a multicore system must still be overcome.

Zhoujia Xu (2008)
Keywords: Microprocessor performance, bandwidth performance.
Contribution: The introduction of copper in place of aluminum has temporarily improved interconnect performance, but a more disruptive solution will be required to keep the current pace of progress; optical interconnect is an interesting alternative to metallic wires.
Limitations: To make the optical leap, efficient handling of optical signals at low cost is required.

Yaser
Contribution: Transmitting data simultaneously over the network and developing the design as a hierarchical network makes the network scalable.
Limitations: To increase reliability and utilization, fault-tolerant techniques can be used to provide quality of service in this structure. In addition, efficient routing algorithms matched to the presented structure and its features, used to find the optimal route, could reduce the delay between transmitter and receiver while keeping it constant.

Liyaqat Nazir (2016)
Keywords: Network-on-chip, virtual channels, buffers.
Contribution: Presented a performance analysis of various elastic buffering techniques intended for the design of microarchitecture routers for NoC implementation, compared alternative full-throughput elastic buffers with other buffering approaches, and evaluated them for the credit-based flow control protocol used by NoC routers to communicate with neighboring routers.

Swati Rustogi (2017)
Keywords: Multi-core, data mining, parallelism, Apriori.
Contribution: Proposed an improved Apriori method for multi-core environments.
Limitations: The method can be further examined from the perspective of load on the cores.
III. SIMULATION FRAMEWORK
In past years the SimpleScalar simulators became very widely used [8]. SimpleScalar, which models an out-of-order superscalar processor, also served as the basis for some of the Multi2Sim modules. A number of extensions have been added to SimpleScalar to model particular aspects of superscalar processors more precisely. However, SimpleScalar is difficult to adapt to recent parallel microarchitectures without modifying its underlying structure. Even so, two SimpleScalar add-ons were developed to support multithreading, in the SSMT [9] and M-Sim [10] simulators. These tools are suitable for modeling designs based on simultaneous multithreaded processors, with the restriction that they execute a set of sequential workloads. Their fixed strategy for sharing resources among threads is a further limitation.
The Turandot simulator [11, 12] is another effort, which models a PowerPC architecture. It has been extended with simultaneous multithreading (SMT) and multicore support, and has been put to practical use in the PowerTimer tool [13]. Turandot extensions for parallel microarchitectures are mentioned in several research works (e.g., [14]) but are not available as open source. Both SimpleScalar and Turandot are application-only tools: the simulator runs an application directly and simulates its interaction with a virtual underlying operating system. Such tools do not support the architecture-specific privileged instruction set, since applications are not allowed to execute it. Their advantage is that the application's execution is isolated, so statistics are not affected by the simulation of a real operating system. Multi2Sim also falls into the application-only category.
A key feature of some chip simulators is the "timing-first" approach, introduced by GEMS and adopted in Multi2Sim. In this approach, a timing module tracks the state of the processor pipeline as instructions traverse it. A functional module is then called to execute the instructions dynamically until they reach the commit stage. Correct execution paths are thus always guaranteed by a previously developed, robust simulator. Multi2Sim can be downloaded as a compressed tar file and has been tested on 32-bit and 64-bit machines running Ubuntu Linux. Compiling the simulator requires the libbfd library, which is not present by default in some Linux distributions. All simulated executables must be compiled statically, since dynamic linking is not supported. Even for a program composed of a single source file, the resulting executable usually has a minimum size of about 4 MB, since all libraries are linked into it [17, 25].
The following commands, entered in a terminal, unpack and compile the simulator:

    tar xzf multi2sim.tar.gz
    cd multi2sim
    ./configure
    make

In a simulator, loading an application is the operation in which an executable file is mapped into different virtual memory regions. In a physical system, the operating system is responsible for this operation. Like other simulators (e.g. SimpleScalar), Multi2Sim does not aim to simulate an entire operating system and is restricted to running compiled applications. The loading process must therefore be handled by the simulator itself at initialization time.
The executable files produced by gcc follow the ELF (Executable and Linkable Format) specification. The format is designed to cover shared libraries, core dumps and object code. An ELF file is made up of an ELF header, a set of segments and a set of sections. Typically, one or more sections are enclosed in a segment. ELF sections are identified by a name and contain data used for program loading or debugging. They are labeled with a set of flags that indicate their type and the way they have to be handled during program loading [3].

Copyright © 2017 MECS I.J. Computer Network and Information Security, 2017, 11, 52-62

The libbfd library is used to analyze the ELF file. It provides the operations needed to query the sections of the executable file and access their data. The loader module scans all sections and extracts their main attributes: starting address, size, flags and data. When the flags mark a section as loadable, its data is copied into memory at the corresponding starting address. The loading stage is followed by the initialization of the process stack. The stack is a memory region of dynamically variable length that grows toward lower addresses. The virtual address of its initial position is "0x7fffffff". Local variables and parameters are stored in the program stack, and while an application executes, the stack pointer (register $sp) is managed by the application's own code. When the program starts, however, it expects some data to be already present in the stack, such as the program arguments and environment variables. Once the ELF file has been loaded into virtual memory and the simulator configuration has been analyzed, simulation is organized into two main parts:
• Functional Simulation: The functional engine models the MIPS32 machine. It is built as an independent library and provides an interface to the rest of the simulator. This simulator kernel offers functions to create and destroy software contexts, initiate application loading, enumerate existing contexts, query their status, execute a new instruction and handle speculative execution.
• Detailed Simulation: In Multi2Sim, "execution-driven" simulation is performed by the detailed simulator on top of the functional engine contained in libkernel. In each cycle, the context state is updated through a sequence of calls to the kernel. The detailed simulator analyzes the nature of each machine instruction just executed and accounts for the operation latencies incurred by the hardware structures.
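The section-loading and stack-initialization steps described before the two phases can be sketched as follows. This is a simplified illustration and not Multi2Sim's loader: sections are given as plain records instead of being read through libbfd, the ELF flag SHF_ALLOC stands in for the "loadable" flag, memory is a sparse dictionary of bytes, and the names `Section`, `load_sections` and `init_stack` are hypothetical.

```python
from dataclasses import dataclass

SHF_ALLOC = 0x2          # ELF section flag: section occupies memory at run time
STACK_BASE = 0x7FFFFFFF  # initial stack address quoted in the text

@dataclass
class Section:
    name: str
    addr: int    # starting virtual address
    flags: int   # ELF section flags
    data: bytes  # section contents

def load_sections(sections):
    """Copy every loadable section into a sparse byte-addressed memory."""
    memory = {}
    for sec in sections:
        if sec.flags & SHF_ALLOC:  # skip debug/comment sections
            for offset, byte in enumerate(sec.data):
                memory[sec.addr + offset] = byte
    return memory

def init_stack(memory, payload):
    """Place initial data just below STACK_BASE; the stack grows downward.

    Returns the resulting stack pointer (lowest address written).
    """
    sp = STACK_BASE - len(payload)
    for offset, byte in enumerate(payload):
        memory[sp + offset] = byte
    return sp

sections = [
    Section(".text", 0x400000, SHF_ALLOC | 0x4, b"\x01\x02"),
    Section(".comment", 0x0, 0x0, b"not loaded"),
]
memory = load_sections(sections)
sp = init_stack(memory, b"arg0\x00")
print(hex(sp), len(memory))
```

A real loader would also honor section alignment and lay out the argument vector and environment pointers in the format the C runtime expects; the sketch only shows the flag test and the downward-growing stack.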
IV. MULTI-PROCESSING & PIPELINING DURING SIMULATION
The pipeline is divided into five stages. The fetch stage reads instructions from the cache and places them in the IFQ (Instruction Fetch Queue). The decode/rename stage takes instructions from the IFQ, renames their registers and allocates an entry for each of them in the ROB (Reorder Buffer). Once their input operands are available, decoded instructions are placed in the RQ (Ready Queue). In the issue stage, instructions from the RQ are sent to the appropriate functional unit. In the Ex (execute) stage, the functional units perform the operation and write the result back to the register file. Finally, the commit stage retires instructions from the ROB in program order [20] [21] [22] [23] [24]. This pipeline is comparable to the one modeled by the SimpleScalar tool set [8], except that it uses a ROB, an IQ (Instruction Queue) and a physical register file in place of an integrated RUU (Register Update Unit).
In a multithreaded pipeline [16], the sharing strategy of each stage can be varied, with the Ex stage being the only exception. Multithreading achieves higher overall throughput by sharing the functional units located in the Ex stage, which increases their utilization and thus performance [17] [18] [19]. Fig.2 depicts two pipeline models that differ in how stages are shared. In Fig.2(a) all stages are shared among threads, whereas in Fig.2(b) all stages except Ex are replicated as many times as there are hardware threads. Multi2Sim supports variable stage sharing strategies. Depending on the stage sharing and thread selection policies, a multithreaded design can be classified as fine-grain (FGMT), coarse-grain (CGMT) or simultaneous multithreading (SMT).
Factors such as performance, power and area budget, bandwidth, technology and system software all affect the search for the best chip design in a multiprocessing environment. Recent research is oriented toward comprehensive analysis of the implementation issues for a class of chip multiprocessor interconnection networks.
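The role of the ROB in the pipeline described above, where instructions may finish execution out of order but are retired in program order, can be sketched with a toy model. The sketch assumes that all instructions are independent (operands always ready), that functional units are unlimited, and that any number of instructions can commit per cycle; the latencies and the `simulate` function are hypothetical and do not reflect Multi2Sim's implementation.

```python
from collections import deque
from dataclasses import dataclass

@dataclass
class Instr:
    seq: int            # position in program order
    latency: int        # assumed execution latency in cycles (>= 1)
    done_at: int = -1   # cycle at which execution finishes (-1: not issued)

def simulate(latencies, issue_width=2):
    """Return (completion order, [(seq, commit cycle), ...])."""
    rob = deque(Instr(seq, lat) for seq, lat in enumerate(latencies))
    ready = deque(rob)  # RQ: every instruction is assumed to be ready
    in_flight, completed, committed = [], [], []
    cycle = 0
    while rob:
        # Execute: record instructions whose result is produced this cycle.
        for ins in [i for i in in_flight if i.done_at == cycle]:
            in_flight.remove(ins)
            completed.append(ins.seq)
        # Commit: retire only from the ROB head, i.e. in program order.
        while rob and rob[0].done_at != -1 and rob[0].done_at <= cycle:
            committed.append((rob.popleft().seq, cycle))
        # Issue: send up to issue_width ready instructions to functional units.
        for _ in range(min(issue_width, len(ready))):
            ins = ready.popleft()
            ins.done_at = cycle + ins.latency
            in_flight.append(ins)
        cycle += 1
    return completed, committed

completed, committed = simulate([3, 1, 1], issue_width=3)
print("completion order:", completed)
print("commit order:    ", [seq for seq, _ in committed])
```

In this example the two short instructions finish before the long one that precedes them, yet all three commit in program order once the oldest instruction has completed.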
Our work presents a comparative study of three interconnect networks: ring, mesh and torus. Simulation experiments were performed to find the combination of core count and network that gives the best performance.
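To make the structural difference between the three networks concrete, the sketch below builds each topology as a set of bidirectional links and reports the link count (a rough proxy for wiring cost) and the diameter (the worst-case hop count). The grid shapes assumed for 2, 4, 8 and 16 cores (1x2, 2x2, 2x4, 4x4) and all function names are illustrative assumptions, not the configuration used in the experiments.

```python
from collections import deque

def ring_links(n):
    """Bidirectional ring: node i is linked to node (i + 1) mod n."""
    return {frozenset((i, (i + 1) % n)) for i in range(n) if n > 1}

def grid_links(w, h, wrap):
    """2D mesh (wrap=False) or torus (wrap=True) on a w x h grid."""
    links = set()
    for y in range(h):
        for x in range(w):
            for dx, dy in ((1, 0), (0, 1)):
                nx, ny = x + dx, y + dy
                if wrap:
                    nx, ny = nx % w, ny % h
                elif nx >= w or ny >= h:
                    continue
                a, b = y * w + x, ny * w + nx
                if a != b:  # a dimension of size 1 has no wrap link
                    links.add(frozenset((a, b)))
    return links

def diameter(n, links):
    """Largest shortest-path hop count between any two nodes (BFS)."""
    adj = {i: set() for i in range(n)}
    for link in links:
        a, b = tuple(link)
        adj[a].add(b)
        adj[b].add(a)
    worst = 0
    for src in range(n):
        dist = {src: 0}
        queue = deque([src])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        worst = max(worst, max(dist.values()))
    return worst

SHAPES = {2: (2, 1), 4: (2, 2), 8: (4, 2), 16: (4, 4)}  # assumed layouts
for n, (w, h) in SHAPES.items():
    for name, links in (("ring", ring_links(n)),
                        ("mesh", grid_links(w, h, wrap=False)),
                        ("torus", grid_links(w, h, wrap=True))):
        print(f"{n:2d} cores {name:5s} links={len(links):2d} "
              f"diameter={diameter(n, links)}")
```

With very small grids the wraparound links of the torus coincide with existing mesh links, so the two topologies only start to differ once a dimension has more than two nodes.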
A. Experimental Setup for Ring Topology Interconnect:
An experiment was performed for the Ring Topology Interconnect in which the number of cores was varied as 2, 4, 8 and 16. The simulation benchmark was configured through a separate configuration file for each core count.
The configuration file for 8 cores serves as the representative example. Simulation results for the Ring Topology Interconnect show a significant improvement in "Dispatch IPC" as the number of cores grows, but only up to a point: "Dispatch IPC" increases initially and remains constant beyond 4 cores. "Issue IPC" also increases up to 4 cores and then reaches a steady state. "Commit IPC" improves significantly for 2 and 4 cores; beyond 4 cores, no significant improvement is seen. The average latency increases for 2 and 4 cores.
B. Experimental Setup for Mesh Topology Interconnect:
For this topology too, "Dispatch IPC" increased up to 4 cores and remained constant afterwards. "Issue IPC" showed the same trend as "Dispatch IPC", increasing up to 4 cores with no remarkable variation after that. "Commit IPC" improved from 2 to 4 cores, and no significant improvement was observed beyond 4 cores. As the number of cores is increased from 2 to 4, the average latency increases. Beyond 4 cores, i.e. for 8 and 16 cores, the average latency remains constant. It can therefore be concluded that 4 cores is the optimal number for the Mesh Topology Interconnect too.
C. Experimental Setup for Torus Topology Interconnect:
The experiments performed for the Torus Topology Interconnect varied the number of cores as 2, 4, 8 and 16. The simulation was set up through a separate configuration file for each core count.
The configuration file for 8 cores serves as the representative example. The simulation results for the Torus Topology Interconnect again show no significant improvement in "Dispatch IPC" beyond 4 cores: it increases up to 4 cores and remains the same afterwards. As the number of cores is increased from 2 to 4, "Issue IPC" increases proportionally; beyond 4 cores it remains almost constant. "Commit IPC" changes rapidly from 2 to 4 cores, but for 8 and 16 cores no significant improvement was seen. The average latency increases for 2 and 4 cores and then remains constant for 8 and 16 cores. Hence it can be concluded that the Torus Topology Interconnect gives its best result with 4 cores. The three topologies were then compared at their most suitable core count, i.e. 4 cores, on "Dispatch IPC", "Issue IPC", "Commit IPC" and "Average Latency".
VI. CONCLUSIONS
In the simulation environment, various multicore processing scenarios were observed, and the optimal number of cores was found to be 4. The combined performance of the Ring, Mesh and Torus interconnect network topologies at 4 cores was therefore observed and compared on dispatch IPC, issue IPC, commit IPC and average latency. Since this work is devoted to finding the best interconnect network topology, the average latency is used to compare the networks. The average latency is observed to be minimum for the Torus Topology, and all IPCs are also observed to be minimum. It is concluded that the Torus with 4 cores is the most suitable interconnect network topology for the design. The "timing-first" scheme supported by the Multi2Sim framework allows efficiency and robustness to be addressed in a customized manner. It also makes it possible to run simulation experiments at a variety of levels of detail. What is distinctive in this context is the execution of "timing-first" simulation together with functional simulation. Hence there is no need to simulate a whole operating system; executing parallel workloads with dynamic thread creation is sufficient. The simulation framework we used was developed to adopt the key attributes of popular simulators, such as the separation of functional and timing simulation, SMT and multiprocessor support, and cache coherence. The simulator also supports execution-driven simulation, as SimpleScalar does. This design allows the functional kernel to be encapsulated as an independent library and the definition of the instruction set to be kept in a central file (machine.def).
