359 research outputs found

    Processor allocator for chip multiprocessors

    Full text link
    Chip MultiProcessor (CMP) architectures consisting of many cores connected through Network-on-Chip (NoC) are becoming main computing platforms for research and computer centers, and in the future for commercial solutions. In order to effectively use CMPs, operating system is an important factor and it should support a multiuser environment in which many parallel jobs are executed simultaneously. It is done by the processor management system of the operating system, which consists of two components: Job Scheduler (JS) and Processor Allocator (PA). The JS is responsible for job scheduling that deals with selection of the next job to be executed, while the task of the PA is processor allocation that selects a set of processors for the job selected by the JS. In this thesis, the PA architecture for the NoC-based CMP is explored. The idea of the PA hardware implementation and its integration on one die together with processing elements of CMP is presented. Such an approach requires the PA to be fast as well as area and energy efficient, because it is only a small component of the CMP. The architecture of hardware version of a PA is presented. The main factor of the structure is a type of processor allocation algorithm, employed inside. Thus, all important allocation techniques are intensively investigated and new schemes are proposed. All of them are compared using experimentation system. The PA driven by the described allocation techniques is synthesized on FPGA and crucial energy and area consumption together with performance parameters are extracted. The proposed CMP uses NoC as interconnection architecture. Therefore, all main NoC structures are studied and tested. Most important parameters such as topology, flow control and routing algorithms are presented and discussed. For the proposed NoC structures, an energy model is proposed and described. Finally, the synthesized PAs and NoCs are evaluated in a simulation system, where NoC-based CMP is created. The experimental environment took into consideration energy and traffic balance characteristics. As a result, the most efficient PA and NoC for CMP are presented

    Framework for simulation of fault tolerant heterogeneous multiprocessor system-on-chip

    Full text link
    Due to the ever growing requirement in high performance data computation, current Uniprocessor systems fall short of hand to meet critical real-time performance demands in (i) high throughput (ii) faster processing time (iii) low power consumption (iv) design cost and time-to-market factors and more importantly (v) fault tolerant processing. Shifting the design trend to MPSOCs is a work-around to meet these challenges. However, developing efficient fault tolerant task scheduling and mapping techniques requires optimized algorithms that consider the various scenarios in Multiprocessor environments. Several works have been done in the past few years which proposed simulation based frameworks for scheduling and mapping strategies that considered homogenous systems and error avoidance techniques. However, most of these works inadequately describe today\u27s MPSOC trend because they were focused on the network domain and didn\u27t consider heterogeneous systems with fault tolerant capabilities; In order to address these issues, this work proposes (i) a performance driven scheduling algorithm (PD SA) based on simulated annealing technique (ii) an optimized Homogenous-Workload-Distribution (HWD) Multiprocessor task mapping algorithm which considers the dynamic workload on processors and (iii) a dynamic Fault Tolerant (FT) scheduling/mapping algorithm to employ robust application processing system. The implementation was accompanied by a heterogeneous Multiprocessor system simulation framework developed in systemC/C++. The proposed framework reads user data, set the architecture, execute input task graph and finally generate performance variables. This framework alleviates previous work issues with respect to (i) architectural flexibility in number-of-processors, processor types and topology (ii) optimized scheduling and mapping strategies and (iii) fault-tolerant processing capability focusing more on the computational domain; A set of random as well as application specific STG benchmark suites were run on the simulator to evaluate and verify the performance of the proposed algorithms. The simulations were carried out for (i) scheduling policy evaluation (ii) fault tolerant evaluation (iii) topology evaluation (iv) Number of processor evaluation (v) Mapping policy evaluation and (vi) Processor Type evaluation. The results showed that PD scheduling algorithm showed marginally better performance than EDF with respect to utilization, Execution-Time and Power factors. The dynamic Fault Tolerant implementation showed to be a viable and efficient strategy to meet real-time constraints without posing significant system performance degradation. Torus topology gave better performance than Tile with respect to task completion time and power factors. Executing highly heterogeneous Tasks showed higher power consumption and execution time. Finally, increasing the number of processors showed a decrease in average Utilization but improved task completion time and power consumption; Based on the simulation results, the system designer can compare tradeoffs between a various design choices with respect to the performance requirement specifications. In general, designing an optimized Multiprocessor scheduling and mapping strategy with added fault tolerant capability will enable to develop efficient Multiprocessor systems which meet future performance goal requirements. This is the substance of this work

    Cycle-accurate evaluation of reconfigurable photonic networks-on-chip

    Get PDF
    There is little doubt that the most important limiting factors of the performance of next-generation Chip Multiprocessors (CMPs) will be the power efficiency and the available communication speed between cores. Photonic Networks-on-Chip (NoCs) have been suggested as a viable route to relieve the off- and on-chip interconnection bottleneck. Low-loss integrated optical waveguides can transport very high-speed data signals over longer distances as compared to on-chip electrical signaling. In addition, with the development of silicon microrings, photonic switches can be integrated to route signals in a data-transparent way. Although several photonic NoC proposals exist, their use is often limited to the communication of large data messages due to a relatively long set-up time of the photonic channels. In this work, we evaluate a reconfigurable photonic NoC in which the topology is adapted automatically (on a microsecond scale) to the evolving traffic situation by use of silicon microrings. To evaluate this system's performance, the proposed architecture has been implemented in a detailed full-system cycle-accurate simulator which is capable of generating realistic workloads and traffic patterns. In addition, a model was developed to estimate the power consumption of the full interconnection network which was compared with other photonic and electrical NoC solutions. We find that our proposed network architecture significantly lowers the average memory access latency (35% reduction) while only generating a modest increase in power consumption (20%), compared to a conventional concentrated mesh electrical signaling approach. When comparing our solution to high-speed circuit-switched photonic NoCs, long photonic channel set-up times can be tolerated which makes our approach directly applicable to current shared-memory CMPs

    A system-level multiprocessor system-on-chip modeling framework

    Get PDF
    We present a system-level modeling framework to model system-on-chips (SoC) consisting of heterogeneous multiprocessors and network-on-chip communication structures in order to enable the developers of today's SoC designs to take advantage of the flexibility and scalability of network-on-chip and rapidly explore high-level design alternatives to meet their system requirements. We present a modeling approach for developing high-level performance models for these SoC designs and outline how this system-level performance analysis capability can be integrated into an overall environment for efficient SoC design. We show how a hand-held multimedia terminal, consisting of JPEG, MP3 and GSM applications, can be modeled as a multiprocessor SoC in our framework

    Simulation Of Multi-core Systems And Interconnections And Evaluation Of Fat-Mesh Networks

    Get PDF
    Simulators are very important in computer architecture research as they enable the exploration of new architectures to obtain detailed performance evaluation without building costly physical hardware. Simulation is even more critical to study future many-core architectures as it provides the opportunity to assess currently non-existing computer systems. In this thesis, a multiprocessor simulator is presented based on a cycle accurate architecture simulator called SESC. The shared L2 cache system is extended into a distributed shared cache (DSC) with a directory-based cache coherency protocol. A mesh network module is extended and integrated into SESC to replace the bus for scalable inter-processor communication. While these efforts complete an extended multiprocessor simulation infrastructure, two interconnection enhancements are proposed and evaluated. A novel non-uniform fat-mesh network structure similar to the idea of fat-tree is proposed. This non-uniform mesh network takes advantage of the average traffic pattern, typically all-to-all in DSC, to dedicate additional links for connections with heavy traffic (e.g., near the center) and fewer links for lighter traffic (e.g., near the periphery). Two fat-mesh schemes are implemented based on different routing algorithms. Analytical fat-mesh models are constructed by presenting the expressions for the traffic requirements of personalized all-to-all traffic. Performance improvements over the uniform mesh are demonstrated in the results from the simulator. A hybrid network consisting of one packet switching plane and multiple circuit switching planes is constructed as the second enhancement. The circuit switching planes provide fast paths between neighbors with heavy communication traffic. A compiler technique that abstracts the symbolic expressions of benchmarks' communication patterns can be used to help facilitate the circuit establishment

    Data routing in multicore processors using dimension increment method

    Full text link
    A Deadlock-free routing algorithm can be generated for arbitrary interconnection network using the concept of virtual channels but the virtual channels will lead to more complex algorithms and more demands of NOC resource. In this thesis, we study a Torus topology for NOC application, design its structure and propose a routing algorithm exploiting the characteristics of NOC. We have chosen a typical 16 (4 by 4) routers Torus and propose the corresponding route algorithm. In our algorithm, all the channels are assigned 4 different dimensions (n0,n1,n2 & n3). By following the dimension increment method, we break the dependent route circles, and avoid dead lock and live-lock and avoid the overhead of virtual channels. Xilinx offers two soft core processors, namely Picoblaze and Microblaze. The Picoblaze processor is 8-bit configurable processor core. These soft processor cores offer designers tremendous flexibility during the design process, allowing the designers to configure the processor to meet the needs of their systems (e.g., adding custom instructions or including/excluding particular data path coprocessors) and to quickly integrate the processor within any FPGA. Unlike single chip Microprocessor/FPGA systems using hard-core processors, soft processor cores allow designers to incorporate varying numbers of processors within a single FPGA design depending on an application’s needs.Soft processor cores implemented using FPGAs typically have higher power consumption and decreased performance compared with hard-core processors. Key features of the Picoblaze processor, as well as other soft processor cores, include the user configurable options that allow a designer to tailor the processor’s functionality to their specific design. The proposed design implements sixteen instances of a soft processor, Picoblaze, connected in a torus topology. Data is passed from one processor to another employing a routing algorithm which is based on dimension increment method. Thus we design an NOC with multiple microcontrollers and related logic, synthesize the process and test its performance in a simulation environment

    Porting the Sisal functional language to distributed-memory multiprocessors

    Get PDF
    Parallel computing is becoming increasingly ubiquitous in recent years. The sizes of application problems continuously increase for solving real-world problems. Distributed-memory multiprocessors have been regarded as a viable architecture of scalable and economical design for building large scale parallel machines. While these parallel machines can provide computational capabilities, programming such large-scale machines is often very difficult due to many practical issues including parallelization, data distribution, workload distribution, and remote memory latency. This thesis proposes to solve the programmability and performance issues of distributed-memory machines using the Sisal functional language. The programs written in Sisal will be automatically parallelized, scheduled and run on distributed-memory multiprocessors with no programmer intervention. Specifically, the proposed approach consists of the following steps. Given a program written in Sisal, the front end Sisal compiler generates a directed acyclic graph(DAG) to expose parallelism in the program. The DAG is partitioned and scheduled based on loop parallelism. The scheduled DAG is then translated to C programs with machine specific parallel constructs. The parallel C programs are finally compiled by the target machine specific compilers to generate executables. A distributed-memory parallel machine, the 80-processor ETL EM-X, has been chosen to perform experiments. The entire procedure has been implemented on the EMX multiprocessor. Four problems are selected for experiments: bitonic sorting, search, dot-product and Fast Fourier Transform. Preliminary execution results indicate that automatic parallelization of the Sisal programs based on loop parallelism is effective. The speedup for these four problems is ranging from 17 to 60 on a 64-processor EM-X. Preliminary experimental results further indicate that programming distributed-memory multiprocessors using a functional language indeed frees the programmers from lowl-evel programming details while allowing them to focus on algorithmic performance improvement

    Master of Science

    Get PDF
    thesisIntegrated circuits often consist of multiple processing elements that are regularly tiled across the two-dimensional surface of a die. This work presents the design and integration of high speed relative timed routers for asynchronous network-on-chip. It researches NoC's efficiency through simplicity by directly translating simple T-router, source-routing, single-flit packet to higher radix routers. This work is intended to study performance and power trade-offs adding higher radix routers, 3D topologies, Virtual Channels, Accurate NoC modeling, and Transmission line communication links. Routers with and without virtual channels are designed and integrated to arrayed communication networks. Furthermore, the work investigates 3D networks with diffusive RC wires and transmission lines on long wrap interconnects

    Novel Cache Hierarchies with Photonic Interconnects for Chip Multiprocessors

    Full text link
    [ES] Los procesadores multinúcleo actuales cuentan con recursos compartidos entre los diferentes núcleos. Dos de estos recursos compartidos, la cache de último nivel y el ancho de banda de memoria principal, pueden convertirse en cuellos de botella para el rendimiento. Además, con el crecimiento del número de núcleos que implementan los diseños más recientes, la red dentro del chip también se convierte en un cuello de botella que puede afectar negativamente al rendimiento, ya que las redes tradicionales pueden encontrar limitaciones a su escalabilidad en el futuro cercano. Prácticamente la totalidad de los diseños actuales implementan jerarquías de memoria que se comunican mediante rápidas redes de interconexión. Esta organización es eficaz dado que permite reducir el número de accesos que se realizan a memoria principal y la latencia media de acceso a memoria. Las caches, la red de interconexión y la memoria principal, conjuntamente con otras técnicas conocidas como la prebúsqueda, permiten reducir las enormes latencias de acceso a memoria principal, limitando así el impacto negativo ocasionado por la diferencia de rendimiento existente entre los núcleos de cómputo y la memoria. Sin embargo, compartir los recursos mencionados es fuente de diferentes problemas y retos, siendo uno de los principales el manejo de la interferencia entre aplicaciones. Hacer un uso eficiente de la jerarquía de memoria y las caches, así como contar con una red de interconexión apropiada, es necesario para sostener el crecimiento del rendimiento en los diseños tanto actuales como futuros. Esta tesis analiza y estudia los principales problemas e inconvenientes observados en estos dos recursos: la cache de último nivel y la red dentro del chip. En primer lugar, se estudia la escalabilidad de las tradicionales redes dentro del chip con topología de malla, así como esta puede verse comprometida en próximos diseños que cuenten con mayor número de núcleos. Los resultados de este estudio muestran que, a mayor número de núcleos, el impacto negativo de la distancia entre núcleos en la latencia puede afectar seriamente al rendimiento del procesador. Como solución a este problema, en esta tesis proponemos una de red de interconexión óptica modelada en un entorno de simulación detallado, que supone una solución viable a los problemas de escalabilidad observados en los diseños tradicionales. A continuación, esta tesis dedica un esfuerzo importante a identificar y proponer soluciones a los principales problemas de diseño de las jerarquías de memoria actuales como son, por ejemplo, el sobredimensionado del espacio de cache privado, la existencia de réplicas de datos y rigidez e incapacidad de adaptación de las estructuras de cache. Aunque bien conocidos, estos problemas y sus efectos adversos en el rendimiento pueden ser evitados en procesadores de alto rendimiento gracias a la enorme capacidad de la cache de último nivel que este tipo de procesadores típicamente implementan. Sin embargo, en procesadores de bajo consumo, no existe la posibilidad de contar con tales capacidades y hacer un uso eficiente del espacio disponible es crítico para mantener el rendimiento. Como solución a estos problemas en procesadores de bajo consumo, proponemos una novedosa organización de jerarquía de dos niveles cache que utiliza una red de interconexión óptica. Los resultados obtenidos muestran que, comparado con diseños convencionales, el consumo de energía estática en la arquitectura propuesta es un 60% menor, pese a que los resultados de rendimiento presentan valores similares. Por último, hemos extendido la arquitectura propuesta para dar soporte tanto a aplicaciones paralelas como secuenciales. Los resultados obtenidos con la esta nueva arquitectura muestran un ahorro de hasta el 78 % de energía estática en la ejecución de aplicaciones paralelas.[CA] Els processadors multinucli actuals compten amb recursos compartits entre els diferents nuclis. Dos d'aquests recursos compartits, la memòria d’últim nivell i l'ample de banda de memòria principal, poden convertir-se en colls d'ampolla per al rendiment. A mes, amb el creixement del nombre de nuclis que implementen els dissenys mes recents, la xarxa dins del xip també es converteix en un coll d'ampolla que pot afectar negativament el rendiment, ja que les xarxes tradicionals poden trobar limitacions a la seva escalabilitat en el futur proper. Pràcticament la totalitat dels dissenys actuals implementen jerarquies de memòria que es comuniquen mitjançant rapides xarxes d’interconnexió. Aquesta organització es eficaç ates que permet reduir el nombre d'accessos que es realitzen a memòria principal i la latència mitjana d’accés a memòria. Les caches, la xarxa d’interconnexió i la memòria principal, conjuntament amb altres tècniques conegudes com la prebúsqueda, permeten reduir les enormes latències d’accés a memòria principal, limitant així l'impacte negatiu ocasionat per la diferencia de rendiment existent entre els nuclis de còmput i la memòria. No obstant això, compartir els recursos esmentats és font de diversos problemes i reptes, sent un dels principals la gestió de la interferència entre aplicacions. Fer un us eficient de la jerarquia de memòria i les caches, així com comptar amb una xarxa d’interconnexió apropiada, es necessari per sostenir el creixement del rendiment en els dissenys tant actuals com futurs. Aquesta tesi analitza i estudia els principals problemes i inconvenients observats en aquests dos recursos: la memòria cache d’últim nivell i la xarxa dins del xip. En primer lloc, s'estudia l'escalabilitat de les xarxes tradicionals dins del xip amb topologia de malla, així com aquesta es pot veure compromesa en propers dissenys que compten amb major nombre de nuclis. Els resultats d'aquest estudi mostren que, a major nombre de nuclis, l'impacte negatiu de la distància entre nuclis en la latència pot afectar seriosament al rendiment del processador. Com a solució' a aquest problema, en aquesta tesi proposem una xarxa d’interconnexió' òptica modelada en un entorn de simulació detallat, que suposa una solució viable als problemes d'escalabilitat observats en els dissenys tradicionals. A continuació, aquesta tesi dedica un esforç important a identificar i proposar solucions als principals problemes de disseny de les jerarquies de memòria actuals com son, per exemple, el sobredimensionat de l'espai de memòria cache privat, l’existència de repliques de dades i la rigidesa i incapacitat d’adaptació' de les estructures de memòria cache. Encara que ben coneguts, aquests problemes i els seus efectes adversos en el rendiment poden ser evitats en processadors d'alt rendiment gracies a l'enorme capacitat de la memòria cache d’últim nivell que aquest tipus de processadors típicament implementen. No obstant això, en processadors de baix consum, no hi ha la possibilitat de comptar amb aquestes capacitats, i fer un us eficient de l'espai disponible es torna crític per mantenir el rendiment. Com a solució a aquests problemes en processadors de baix consum, proposem una nova organització de jerarquia de dos nivells de memòria cache que utilitza una xarxa d’interconnexió òptica. Els resultats obtinguts mostren que, comparat amb dissenys convencionals, el consum d'energia estàtica en l'arquitectura proposada és un 60% menor, malgrat que els resultats de rendiment presenten valors similars. Per últim, hem estes l'arquitectura proposada per donar suport tant a aplicacions paral·leles com seqüencials. Els resultats obtinguts amb aquesta nova arquitectura mostren un estalvi de fins al 78 % d'energia estàtica en l’execució d'aplicacions paral·leles.[EN] Current multicores face the challenge of sharing resources among the different processor cores. Two main shared resources act as major performance bottlenecks in current designs: the off-chip main memory bandwidth and the last level cache. Additionally, as the core count grows, the network on-chip is also becoming a potential performance bottleneck, since traditional designs may find scalability issues in the near future. Memory hierarchies communicated through fast interconnects are implemented in almost every current design as they reduce the number of off-chip accesses and the overall latency, respectively. Main memory, caches, and interconnection resources, together with other widely-used techniques like prefetching, help alleviate the huge memory access latencies and limit the impact of the core-memory speed gap. However, sharing these resources brings several concerns, being one of the most challenging the management of the inter-application interference. Since almost every running application needs to access to main memory, all of them are exposed to interference from other co-runners in their way to the memory controller. For this reason, making an efficient use of the available cache space, together with achieving fast and scalable interconnects, is critical to sustain the performance in current and future designs. This dissertation analyzes and addresses the most important shortcomings of two major shared resources: the Last Level Cache (LLC) and the Network on Chip (NoC). First, we study the scalability of both electrical and optical NoCs for future multicoresand many-cores. To perform this study, we model optical interconnects in a cycle-accurate multicore simulation framework. A proper model is required; otherwise, important performance deviations may be observed otherwise in the evaluation results. The study reveals that, as the core count grows, the effect of distance on the end-to-end latency can negatively impact on the processor performance. In contrast, the study also shows that silicon nanophotonics are a viable solution to solve the mentioned latency problems. This dissertation is also motivated by important design concerns related to current memory hierarchies, like the oversizing of private cache space, data replication overheads, and lack of flexibility regarding sharing of cache structures. These issues, which can be overcome in high performance processors by virtue of huge LLCs, can compromise performance in low power processors. To address these issues we propose a more efficient cache hierarchy organization that leverages optical interconnects. The proposed architecture is conceived as an optically interconnected two-level cache hierarchy composed of multiple cache modules that can be dynamically turned on and off independently. Experimental results show that, compared to conventional designs, static energy consumption is improved by up to 60% while achieving similar performance results. Finally, we extend the proposal to support both sequential and parallel applications. This extension is required since the proposal adapts to the dynamic cache space needs of the running applications, and multithreaded applications's behaviors widely differ from those of single threaded programs. In addition, coherence management is also addressed, which is challenging since each cache module can be assigned to any core at a given time in the proposed approach. For parallel applications, the evaluation shows that the proposal achieves up to 78% static energy savings. In summary, this thesis tackles major challenges originated by the sharing of on-chip caches and communication resources in current multicores, and proposes new cache hierarchy organizations leveraging optical interconnects to address them. The proposed organizations reduce both static and dynamic energy consumption compared to conventional approaches while achieving similar performance; which results in better energy efficiency.Puche Lara, J. (2021). Novel Cache Hierarchies with Photonic Interconnects for Chip Multiprocessors [Tesis doctoral]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/165254TESI

    Classification of networks-on-chip in the context of analysis of promising self-organizing routing algorithms

    Full text link
    This paper contains a detailed analysis of the current state of the network-on-chip (NoC) research field, based on which the authors propose the new NoC classification that is more complete in comparison with previous ones. The state of the domain associated with wireless NoC is investigated, as the transition to these NoCs reduces latency. There is an assumption that routing algorithms from classical network theory may demonstrate high performance. So, in this article, the possibility of the usage of self-organizing algorithms in a wireless NoC is also provided. This approach has a lot of advantages described in the paper. The results of the research can be useful for developers and NoC manufacturers as specific recommendations, algorithms, programs, and models for the organization of the production and technological process.Comment: 10 p., 5 fig. Oral presentation on APSSE 2021 conferenc
    • …
    corecore