12 research outputs found

    Tightly-Coupled and Fault-Tolerant Communication in Parallel Systems

    Full text link
    The demand for processing power is increasing steadily. In the past, single processor architectures clearly dominated the markets. As instruction level parallelism is limited in most applications, significant performance can only be achieved in the future by exploiting parallelism at the higher levels of thread or process parallelism. As a consequence, modern “processors” incorporate multiple processor cores that form a single shared memory multiprocessor. In such systems, high performance devices like network interface controllers are connected to processors and memory like every other input/output device over a hierarchy of peripheral interconnects. Thus, one target must be to couple coprocessors physically closer to main memory and to the processors of a computing node. This removes the overhead of today’s peripheral interconnect structures. Such a step is the direct connection of HyperTransport (HT) devices to Opteron processors, which is presented in this thesis. Also, this work analyzes how communication from a device to processors can be optimized on the protocol level. As today’s computing nodes are shared memory systems, the cache coherence protocol is the central protocol for data exchange between processors and devices. Consequently, the analysis extends to classes of devices that are cache coherence protocol aware. Also, the concept of a transfer cache is proposed in this thesis, which reduces latency significantly even for non-coherent devices. The trend to the exploitation of process and thread level parallelism leads to a steady increase of system sizes. Networks that are used in such large systems are very susceptible to both hard and transient faults. Most transient fault rates are constant per bit that is stored or transmitted. With increasing system sizes and higher clock frequencies, the number of faults in time increases drastically. In the end, the error rate may rise at a level where high level error recovery becomes too costly if lower layers do not perform error correction that is transparent to the layers above. The second part of this thesis describes a direct interconnection network that provides a reliable transport service even without the use of end-to-end protocols. Also, a novel hardware based solution for intermediate routing is developed in this thesis, which allows an efficient, deadlock free routing around faulty links

    Performance Analysis of Complex Shared Memory Systems

    Get PDF
    Systems for high performance computing are getting increasingly complex. On the one hand, the number of processors is increasing. On the other hand, the individual processors are getting more and more powerful. In recent years, the latter is to a large extent achieved by increasing the number of cores per processor. Unfortunately, scientific applications often fail to fully utilize the available computational performance. Therefore, performance analysis tools that help to localize and fix performance problems are indispensable. Large scale systems for high performance computing typically consist of multiple compute nodes that are connected via network. Performance analysis tools that analyze performance problems that arise from using multiple nodes are readily available. However, the increasing number of cores per processor that can be observed within the last decade represents a major change in the node architecture. Therefore, this work concentrates on the analysis of the node performance. The goal of this thesis is to improve the understanding of the achieved application performance on existing hardware. It can be observed that the scaling of parallel applications on multi-core processors differs significantly from the scaling on multiple processors. Therefore, the properties of shared resources in contemporary multi-core processors as well as remote accesses in multi-processor systems are investigated and their respective impact on the application performance is analyzed. As a first step, a comprehensive suite of highly optimized micro-benchmarks is developed. These benchmarks are able to determine the performance of memory accesses depending on the location and coherence state of the data. They are used to perform an in-depth analysis of the characteristics of memory accesses in contemporary multi-processor systems, which identifies potential bottlenecks. However, in order to localize performance problems, it also has to be determined to which extend the application performance is limited by certain resources. Therefore, a methodology to derive metrics for the utilization of individual components in the memory hierarchy as well as waiting times caused by memory accesses is developed in the second step. The approach is based on hardware performance counters, which record the number of certain hardware events. The developed micro-benchmarks are used to selectively stress individual components, which can be used to identify the events that provide a reasonable assessment for the utilization of the respective component and the amount of time that is spent waiting for memory accesses to complete. Finally, the knowledge gained from this process is used to implement a visualization of memory related performance issues in existing performance analysis tools. The results of the micro-benchmarks reveal that the increasing number of cores per processor and the usage of multiple processors per node leads to complex systems with vastly different performance characteristics of memory accesses depending on the location of the accessed data. Furthermore, it can be observed that the aggregated throughput of shared resources in multi-core processors does not necessarily scale linearly with the number of cores that access them concurrently, which limits the scalability of parallel applications. It is shown that the proposed methodology for the identification of meaningful hardware performance counters yields useful metrics for the localization of memory related performance limitations

    Integrated System Architectures for High-Performance Internet Servers

    Full text link
    Ph.D.Computer Science and EngineeringUniversity of Michiganhttp://deepblue.lib.umich.edu/bitstream/2027.42/90845/1/binkert-thesis.pd

    Adding Token Counting to Directory-Based Cache Coherence

    Get PDF
    The coherence protocol is a first-order design concern in multicore designs. Directory protocols are naturally scalable, as they place no restrictions on the interconnect and have minimal bandwidth requirements; however, this scalability comes at the cost of increased sharing latency due to indirection. In contrast, broadcast-based systems such as snooping protocols and token coherence reduce latency of sharing misses by sending requests directly to other processors. Unfortunately, their reliance on totally ordered interconnects and/or broadcast limits their scalability. This work introduces PATCH (Predictive/Adaptive Token Counting Hybrid), a coherence protocol that provides the scalability of directory protocols while opportunistically using available bandwidth to reduce sharing latency. PATCH extends a standard directory protocol to track tokens and use token counting rules for enforcing coherence permissions. Token counting allows PATCH to support direct requests on an unordered interconnect, while a novel mechanism called token tenure uses local processor timeouts and the directory’s per-block point of ordering at the home node to guarantee forward progress without relying on broadcast. PATCH makes three main contributions. First, PATCH uses direct request prioritization to match the performance of broadcast-based protocols without restricting scalability. Second, PATCH introduces token tenure, which provides broadcast-free forward progress for token counting protocols. Finally, PATCH provides greater scalability than directory protocols when using inexact encodings of sharers because only processors holding tokens need to acknowledge requests. Overall, PATCH is a “one-size-fits-all” coherence protocol that dynamically adapts to work well for small systems, large systems, and anywhere in betwee

    Acceleration of the hardware-software interface of a communication device for parallel systems

    Full text link
    During the last decades the ever growing need for computational power fostered the development of parallel computer architectures. Applications need to be parallelized and optimized to be able to exploit modern system architectures. Today, scalability of applications is more and more limited both by development resources, as programming of complex parallel applications becomes increasingly demanding, and by the fundamental scalability issues introduced by the cost of communication in distributed memory systems. Lowering the latency of communication is mandatory to increase scalability and serves as an enabling technology for programming of distributed memory systems at a higher abstraction layer using higher degrees of compiler driven automation. At the same time it can increase performance of such systems in general. In this work, the software/hardware interface and the network interface controller functions of the EXTOLL network architecture, which is specifically designed to satisfy the needs of low-latency networking for high-performance computing, is presented. Several new architectural contributions are made in this thesis, namely a new efficient method for virtual-tophysical address-translation named ATU and a novel method to issue operations to a virtual device in an optimal way which has been termed Transactional I/O. This new method needs changes in the architecture of the host CPU the device is connected to. Two additional methods that emulate most of the characteristics of Transactional I/O are developed and employed in the development of the EXTOLL hardware to facilitate usage together with contemporary CPUs. These new methods heavily leverage properties of the HyperTransport interface used to connect the device to the CPU. Finally, this thesis also introduces an optimized remote-memory-access architecture for efficient split-phase transactions and atomic operations. The complete architecture has been prototyped using FPGA technology enabling a more precise analysis and verification than is possible using simulation alone. The resulting design utilizes 95 % of a 90 nm FPGA device and reaches speeds of 200 MHz and 156 MHz in the different clock domains of the design. The EXTOLL software stack is developed and a performance evaluation of the software using the EXTOLL hardware is performed. The performance evaluation shows an excellent start-up latency value of 1.3 μs, which competes with the most advanced networks available, in spite of the technological performance handicap encountered by FPGA technology. The resulting network is, to the best of the knowledge of the author, the fastest FPGA-based interconnection network for commodity processors ever built

    Virtualization services: scalable methods for virtualizing multicore systems

    Get PDF
    Multi-core technology is bringing parallel processing capabilities from servers to laptops and even handheld devices. At the same time, platform support for system virtualization is making it easier to consolidate server and client resources, when and as needed by applications. This consolidation is achieved by dynamically mapping the virtual machines on which applications run to underlying physical machines and their processing cores. Low cost processor and I/O virtualization methods efficiently scaled to different numbers of processing cores and I/O devices are key enablers of such consolidation. This dissertation develops and evaluates new methods for scaling virtualization functionality to multi-core and future many-core systems. Specifically, it re-architects virtualization functionality to improve scalability and better exploit multi-core system resources. Results from this work include a self-virtualized I/O abstraction, which virtualizes I/O so as to flexibly use different platforms' processing and I/O resources. Flexibility affords improved performance and resource usage and most importantly, better scalability than that offered by current I/O virtualization solutions. Further, by describing system virtualization as a service provided to virtual machines and the underlying computing platform, this service can be enhanced to provide new and innovative functionality. For example, a virtual device may provide obfuscated data to guest operating systems to maintain data privacy; it could mask differences in device APIs or properties to deal with heterogeneous underlying resources; or it could control access to data based on the ``trust' properties of the guest VM. This thesis demonstrates that extended virtualization services are superior to existing operating system or user-level implementations of such functionality, for multiple reasons. First, this solution technique makes more efficient use of key performance-limiting resource in multi-core systems, which are memory and I/O bandwidth. Second, this solution technique better exploits the parallelism inherent in multi-core architectures and exhibits good scalability properties, in part because at the hypervisor level, there is greater control in precisely which and how resources are used to realize extended virtualization services. Improved control over resource usage makes it possible to provide value-added functionalities for both guest VMs and the platform. Specific instances of virtualization services described in this thesis are the network virtualization service that exploits heterogeneous processing cores, a storage virtualization service that provides location transparent access to block devices by extending the functionality provided by network virtualization service, a multimedia virtualization service that allows efficient media device sharing based on semantic information, and an object-based storage service with enhanced access control.Ph.D.Committee Chair: Schwan, Karsten; Committee Member: Ahamad, Mustaq; Committee Member: Fujimoto, Richard; Committee Member: Gavrilovska, Ada; Committee Member: Owen, Henry; Committee Member: Xenidis, Jim

    Broadcast-oriented wireless network-on-chip : fundamentals and feasibility

    Get PDF
    Premi extraordinari doctorat UPC curs 2015-2016, àmbit Enginyeria de les TICRecent years have seen the emergence and ubiquitous adoption of Chip Multiprocessors (CMPs), which rely on the coordinated operation of multiple execution units or cores. Successive CMP generations integrate a larger number of cores seeking higher performance with a reasonable cost envelope. For this trend to continue, however, important scalability issues need to be solved at different levels of design. Scaling the interconnect fabric is a grand challenge by itself, as new Network-on-Chip (NoC) proposals need to overcome the performance hurdles found when dealing with the increasingly variable and heterogeneous communication demands of manycore processors. Fast and flexible NoC solutions are needed to prevent communication become a performance bottleneck, situation that would severely limit the design space at the architectural level and eventually lead to the use of software frameworks that are slow, inefficient, or less programmable. The emergence of novel interconnect technologies has opened the door to a plethora of new NoCs promising greater scalability and architectural flexibility. In particular, wireless on-chip communication has garnered considerable attention due to its inherent broadcast capabilities, low latency, and system-level simplicity. Most of the resulting Wireless Network-on-Chip (WNoC) proposals have set the focus on leveraging the latency advantage of this paradigm by creating multiple wireless channels to interconnect far-apart cores. This strategy is effective as the complement of wired NoCs at moderate scales, but is likely to be overshadowed at larger scales by technologies such as nanophotonics unless bandwidth is unrealistically improved. This dissertation presents the concept of Broadcast-Oriented Wireless Network-on-Chip (BoWNoC), a new approach that attempts to foster the inherent simplicity, flexibility, and broadcast capabilities of the wireless technology by integrating one on-chip antenna and transceiver per processor core. This paradigm is part of a broader hybrid vision where the BoWNoC serves latency-critical and broadcast traffic, tightly coupled to a wired plane oriented to large flows of data. By virtue of its scalable broadcast support, BoWNoC may become the key enabler of a wealth of unconventional hardware architectures and algorithmic approaches, eventually leading to a significant improvement of the performance, energy efficiency, scalability and programmability of manycore chips. The present work aims not only to lay the fundamentals of the BoWNoC paradigm, but also to demonstrate its viability from the electronic implementation, network design, and multiprocessor architecture perspectives. An exploration at the physical level of design validates the feasibility of the approach at millimeter-wave bands in the short term, and then suggests the use of graphene-based antennas in the terahertz band in the long term. At the link level, this thesis provides an insightful context analysis that is used, afterwards, to drive the design of a lightweight protocol that reliably serves broadcast traffic with substantial latency improvements over state-of-the-art NoCs. At the network level, our hybrid vision is evaluated putting emphasis on the flexibility provided at the network interface level, showing outstanding speedups for a wide set of traffic patterns. At the architecture level, the potential impact of the BoWNoC paradigm on the design of manycore chips is not only qualitatively discussed in general, but also quantitatively assessed in a particular architecture for fast synchronization. Results demonstrate that the impact of BoWNoC can go beyond simply improving the network performance, thereby representing a possible game changer in the manycore era.Avenços en el disseny de multiprocessadors han portat a una àmplia adopció dels Chip Multiprocessors (CMPs), que basen el seu potencial en la operació coordinada de múltiples nuclis de procés. Generacions successives han anat integrant més nuclis en la recerca d'alt rendiment amb un cost raonable. Per a que aquesta tendència continuï, però, cal resoldre importants problemes d'escalabilitat a diferents capes de disseny. Escalar la xarxa d'interconnexió és un gran repte en ell mateix, ja que les noves propostes de Networks-on-Chip (NoC) han de servir un tràfic eminentment variable i heterogeni dels processadors amb molts nuclis. Són necessàries solucions ràpides i flexibles per evitar que les comunicacions dins del xip es converteixin en el pròxim coll d'ampolla de rendiment, situació que limitaria en gran mesura l'espai de disseny a nivell d'arquitectura i portaria a l'ús d'arquitectures i models de programació lents, ineficients o poc programables. L'aparició de noves tecnologies d'interconnexió ha possibilitat la creació de NoCs més flexibles i escalables. En particular, la comunicació intra-xip sense fils ha despertat un interès considerable en virtut de les seva baixa latència, simplicitat, i bon rendiment amb tràfic broadcast. La majoria de les Wireless NoC (WNoC) proposades fins ara s'han centrat en aprofitar l'avantatge en termes de latència d'aquest nou paradigma creant múltiples canals sense fils per interconnectar nuclis allunyats entre sí. Aquesta estratègia és efectiva per complementar a NoCs clàssiques en escales mitjanes, però és probable que altres tecnologies com la nanofotònica puguin jugar millor aquest paper a escales més grans. Aquesta tesi presenta el concepte de Broadcast-Oriented WNoC (BoWNoC), un nou enfoc que intenta rendibilitzar al màxim la inherent simplicitat, flexibilitat, i capacitats broadcast de la tecnologia sense fils integrant una antena i transmissor/receptor per cada nucli del processador. Aquest paradigma forma part d'una visió més àmplia on un BoWNoC serviria tràfic broadcast i urgent, mentre que una xarxa convencional serviria fluxos de dades més pesats. En virtut de la escalabilitat i del seu suport broadcast, BoWNoC podria convertir-se en un element clau en una gran varietat d'arquitectures i algoritmes poc convencionals que milloressin considerablement el rendiment, l'eficiència, l'escalabilitat i la programabilitat de processadors amb molts nuclis. El present treball té com a objectius no només estudiar els aspectes fonamentals del paradigma BoWNoC, sinó també demostrar la seva viabilitat des dels punts de vista de la implementació, i del disseny de xarxa i arquitectura. Una exploració a la capa física valida la viabilitat de l'enfoc usant tecnologies longituds d'ona milimètriques en un futur proper, i suggereix l'ús d'antenes de grafè a la banda dels terahertz ja a més llarg termini. A capa d'enllaç, la tesi aporta una anàlisi del context de l'aplicació que és, més tard, utilitzada per al disseny d'un protocol d'accés al medi que permet servir tràfic broadcast a baixa latència i de forma fiable. A capa de xarxa, la nostra visió híbrida és avaluada posant èmfasi en la flexibilitat que aporta el fet de prendre les decisions a nivell de la interfície de xarxa, mostrant grans millores de rendiment per una àmplia selecció de patrons de tràfic. A nivell d'arquitectura, l'impacte que el concepte de BoWNoC pot tenir sobre el disseny de processadors amb molts nuclis no només és debatut de forma qualitativa i genèrica, sinó també avaluat quantitativament per una arquitectura concreta enfocada a la sincronització. Els resultats demostren que l'impacte de BoWNoC pot anar més enllà d'una millora en termes de rendiment de xarxa; representant, possiblement, un canvi radical a l'era dels molts nuclisAward-winningPostprint (published version

    Broadcast-oriented wireless network-on-chip : fundamentals and feasibility

    Get PDF
    Premi extraordinari doctorat UPC curs 2015-2016, àmbit Enginyeria de les TICRecent years have seen the emergence and ubiquitous adoption of Chip Multiprocessors (CMPs), which rely on the coordinated operation of multiple execution units or cores. Successive CMP generations integrate a larger number of cores seeking higher performance with a reasonable cost envelope. For this trend to continue, however, important scalability issues need to be solved at different levels of design. Scaling the interconnect fabric is a grand challenge by itself, as new Network-on-Chip (NoC) proposals need to overcome the performance hurdles found when dealing with the increasingly variable and heterogeneous communication demands of manycore processors. Fast and flexible NoC solutions are needed to prevent communication become a performance bottleneck, situation that would severely limit the design space at the architectural level and eventually lead to the use of software frameworks that are slow, inefficient, or less programmable. The emergence of novel interconnect technologies has opened the door to a plethora of new NoCs promising greater scalability and architectural flexibility. In particular, wireless on-chip communication has garnered considerable attention due to its inherent broadcast capabilities, low latency, and system-level simplicity. Most of the resulting Wireless Network-on-Chip (WNoC) proposals have set the focus on leveraging the latency advantage of this paradigm by creating multiple wireless channels to interconnect far-apart cores. This strategy is effective as the complement of wired NoCs at moderate scales, but is likely to be overshadowed at larger scales by technologies such as nanophotonics unless bandwidth is unrealistically improved. This dissertation presents the concept of Broadcast-Oriented Wireless Network-on-Chip (BoWNoC), a new approach that attempts to foster the inherent simplicity, flexibility, and broadcast capabilities of the wireless technology by integrating one on-chip antenna and transceiver per processor core. This paradigm is part of a broader hybrid vision where the BoWNoC serves latency-critical and broadcast traffic, tightly coupled to a wired plane oriented to large flows of data. By virtue of its scalable broadcast support, BoWNoC may become the key enabler of a wealth of unconventional hardware architectures and algorithmic approaches, eventually leading to a significant improvement of the performance, energy efficiency, scalability and programmability of manycore chips. The present work aims not only to lay the fundamentals of the BoWNoC paradigm, but also to demonstrate its viability from the electronic implementation, network design, and multiprocessor architecture perspectives. An exploration at the physical level of design validates the feasibility of the approach at millimeter-wave bands in the short term, and then suggests the use of graphene-based antennas in the terahertz band in the long term. At the link level, this thesis provides an insightful context analysis that is used, afterwards, to drive the design of a lightweight protocol that reliably serves broadcast traffic with substantial latency improvements over state-of-the-art NoCs. At the network level, our hybrid vision is evaluated putting emphasis on the flexibility provided at the network interface level, showing outstanding speedups for a wide set of traffic patterns. At the architecture level, the potential impact of the BoWNoC paradigm on the design of manycore chips is not only qualitatively discussed in general, but also quantitatively assessed in a particular architecture for fast synchronization. Results demonstrate that the impact of BoWNoC can go beyond simply improving the network performance, thereby representing a possible game changer in the manycore era.Avenços en el disseny de multiprocessadors han portat a una àmplia adopció dels Chip Multiprocessors (CMPs), que basen el seu potencial en la operació coordinada de múltiples nuclis de procés. Generacions successives han anat integrant més nuclis en la recerca d'alt rendiment amb un cost raonable. Per a que aquesta tendència continuï, però, cal resoldre importants problemes d'escalabilitat a diferents capes de disseny. Escalar la xarxa d'interconnexió és un gran repte en ell mateix, ja que les noves propostes de Networks-on-Chip (NoC) han de servir un tràfic eminentment variable i heterogeni dels processadors amb molts nuclis. Són necessàries solucions ràpides i flexibles per evitar que les comunicacions dins del xip es converteixin en el pròxim coll d'ampolla de rendiment, situació que limitaria en gran mesura l'espai de disseny a nivell d'arquitectura i portaria a l'ús d'arquitectures i models de programació lents, ineficients o poc programables. L'aparició de noves tecnologies d'interconnexió ha possibilitat la creació de NoCs més flexibles i escalables. En particular, la comunicació intra-xip sense fils ha despertat un interès considerable en virtut de les seva baixa latència, simplicitat, i bon rendiment amb tràfic broadcast. La majoria de les Wireless NoC (WNoC) proposades fins ara s'han centrat en aprofitar l'avantatge en termes de latència d'aquest nou paradigma creant múltiples canals sense fils per interconnectar nuclis allunyats entre sí. Aquesta estratègia és efectiva per complementar a NoCs clàssiques en escales mitjanes, però és probable que altres tecnologies com la nanofotònica puguin jugar millor aquest paper a escales més grans. Aquesta tesi presenta el concepte de Broadcast-Oriented WNoC (BoWNoC), un nou enfoc que intenta rendibilitzar al màxim la inherent simplicitat, flexibilitat, i capacitats broadcast de la tecnologia sense fils integrant una antena i transmissor/receptor per cada nucli del processador. Aquest paradigma forma part d'una visió més àmplia on un BoWNoC serviria tràfic broadcast i urgent, mentre que una xarxa convencional serviria fluxos de dades més pesats. En virtut de la escalabilitat i del seu suport broadcast, BoWNoC podria convertir-se en un element clau en una gran varietat d'arquitectures i algoritmes poc convencionals que milloressin considerablement el rendiment, l'eficiència, l'escalabilitat i la programabilitat de processadors amb molts nuclis. El present treball té com a objectius no només estudiar els aspectes fonamentals del paradigma BoWNoC, sinó també demostrar la seva viabilitat des dels punts de vista de la implementació, i del disseny de xarxa i arquitectura. Una exploració a la capa física valida la viabilitat de l'enfoc usant tecnologies longituds d'ona milimètriques en un futur proper, i suggereix l'ús d'antenes de grafè a la banda dels terahertz ja a més llarg termini. A capa d'enllaç, la tesi aporta una anàlisi del context de l'aplicació que és, més tard, utilitzada per al disseny d'un protocol d'accés al medi que permet servir tràfic broadcast a baixa latència i de forma fiable. A capa de xarxa, la nostra visió híbrida és avaluada posant èmfasi en la flexibilitat que aporta el fet de prendre les decisions a nivell de la interfície de xarxa, mostrant grans millores de rendiment per una àmplia selecció de patrons de tràfic. A nivell d'arquitectura, l'impacte que el concepte de BoWNoC pot tenir sobre el disseny de processadors amb molts nuclis no només és debatut de forma qualitativa i genèrica, sinó també avaluat quantitativament per una arquitectura concreta enfocada a la sincronització. Els resultats demostren que l'impacte de BoWNoC pot anar més enllà d'una millora en termes de rendiment de xarxa; representant, possiblement, un canvi radical a l'era dels molts nuclisAward-winningPostprint (published version

    Thin Hypervisor-Based Security Architectures for Embedded Platforms

    Get PDF
    Virtualization has grown increasingly popular, thanks to its benefits of isolation, management, and utilization, supported by hardware advances. It is also receiving attention for its potential to support security, through hypervisor-based services and advanced protections supplied to guests. Today, virtualization is even making inroads in the embedded space, and embedded systems, with their security needs, have already started to benefit from virtualization’s security potential. In this thesis, we investigate the possibilities for thin hypervisor-based security on embedded platforms. In addition to significant background study, we present implementation of a low-footprint, thin hypervisor capable of providing security protections to a single FreeRTOS guest kernel on ARM. Backed by performance test results, our hypervisor provides security to a formerly unsecured kernel with minimal performance overhead, and represents a first step in a greater research effort into the security advantages and possibilities of embedded thin hypervisors. Our results show that thin hypervisors are both possible and beneficial even on limited embedded systems, and sets the stage for more advanced investigations, implementations, and security applications in the future

    Optimizations and Cost Models for multi-core architectures: an approach based on parallel paradigms

    Get PDF
    The trend in modern microprocessor architectures is clear: multi-core chips are here to stay, and researchers expect multiprocessors with 128 to 1024 cores on a chip in some years. Yet the software community is slowly taking the path towards parallel programming: while some works target multi-cores, these are usually inherited from the previous tools for SMP architectures, and rarely exploit specific characteristics of multi-cores. But most important, current tools have no facilities to guarantee performance or portability among architectures. Our research group was one of the first to propose the structured parallel programming approach to solve the problem of performance portability and predictability. This has been successfully demonstrated years ago for distributed and shared memory multiprocessors, and we strongly believe that the same should be applied to multi-core architectures. The main problem with performance portability is that optimizations are effective only under specific conditions, making them dependent on both the specific program and the target architecture. For this reason in current parallel programming (in general, but especially with multi-cores) optimizations usually follows a try-and-decide approach: each one must be implemented and tested on the specific parallel program to understand its benefits. If we want to make a step forward and really achieve some form of performance portability, we require some kind of prediction of the expected performance of a program. The concept of performance modeling is quite old in the world of parallel programming; yet, in the last years, this kind of research saw small improvements: cost models to describe multi-cores are missing, mainly because of the increasing complexity of microarchitectures and the poor knowledge of specific implementation details of current processors. In the first part of this thesis we prove that the way of performance modeling is still feasible, by studying the Tilera TilePro64. The high number of cores on-chip in this processor (64) required the use of several innovative solutions, such as a complex interconnection network and the use of multiple memory interfaces per chip. For these features the TilePro64 can be considered an insight of what to expect in future multi-core processors. The availability of a cycle-accurate simulator and an extensive documentation allowed us to model the architecture, and in particular its memory subsystem, at the accuracy level required to compare optimizations In the second part, focused on optimizations, we cover one of the most important issue of multi-core architectures: the memory subsystem. In this area multi-core strongly differs in their structure w.r.t off-chip parallel architectures, both SMP and NUMA, thus opening new opportunities. In detail, we investigate the problem of data distribution over the memory controllers in several commercial multi-cores, and the efficient use of the cache coherency mechanisms offered by the TilePro64 processor. Finally, by using the performance model, we study different implementations, derived from the previous optimizations, of a simple test-case application. We are able to predict the best version using only profiled data from a sequential execution. The accuracy of the model has been verified by experimentally comparing the implementations on the real architecture, giving results within 1 − 2% of accuracy
    corecore