6 research outputs found

    Reducing consistency traffic and cache misses in the avalanche multiprocessor

    Journal Article: For a parallel architecture to scale effectively, communication latency between processors must be minimized. We have found that a large number of avoidable cache misses stem from hardwired write-invalidate coherency protocols, which often exhibit high miss rates due to excessive invalidations and subsequent reloading of shared data. In the Avalanche project at the University of Utah, we are building a 64-node multiprocessor designed to reduce the end-to-end communication latency of both shared memory and message passing programs. As part of our design efforts, we are evaluating the potential performance benefits and implementation complexity of providing hardware support for multiple coherency protocols. Using a detailed architecture simulation of Avalanche, we have found that support for multiple consistency protocols can reduce the time parallel applications spend stalled on memory operations by up to 66% and overall execution time by up to 31%. Most of this reduction in memory stall time is due to a novel release-consistent multiple-writer write-update protocol implemented using a write state buffer.
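    The protocol trade-off described above can be made concrete with a toy model. The sketch below is not the Avalanche simulator or its write state buffer; it simply counts stall cycles for a single shared cache line under an assumed producer-consumer access pattern, with invented miss and update latencies, to show why write-invalidate sharing forces repeated reloads that a write-update scheme avoids.

```python
# Toy model (not the Avalanche simulator): count coherence-induced stall cycles for
# one shared line under write-invalidate vs. write-update sharing.
# The access pattern and latencies below are illustrative assumptions.

MISS_PENALTY = 100   # cycles to reload the line after an invalidation (assumed)
UPDATE_COST  = 10    # cycles to push an update to each sharer (assumed)

def simulate(protocol, writes=1000, readers=4):
    """One producer writes the line `writes` times; after each write every
    reader touches the line once (a common producer-consumer pattern)."""
    stall = 0
    valid = [True] * readers                   # does each reader hold a valid copy?
    for _ in range(writes):
        if protocol == "invalidate":
            valid = [False] * readers          # the write invalidates all sharers
        else:                                  # "update"
            stall += UPDATE_COST * readers     # the writer pushes new data to sharers
            valid = [True] * readers
        for r in range(readers):
            if not valid[r]:
                stall += MISS_PENALTY          # reader reloads the shared line
                valid[r] = True
    return stall

for p in ("invalidate", "update"):
    print(p, simulate(p), "stall cycles")
```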

    Avalanche: A communication and memory architecture for scalable parallel computing

    Technical report: As the gap between processor and memory speeds widens, system designers will inevitably incorporate increasingly deep memory hierarchies to maintain the balance between processor and memory system performance. At the same time, most communication subsystems are permitted access only to main memory and not to a processor's top-level cache. As memory latencies increase, this lack of integration between the memory and communication systems will seriously impede interprocessor communication performance and limit effective scalability. In the Avalanche project we are redesigning the memory architecture of a commercial RISC multiprocessor, the HP PA-RISC 7100, to include a new multi-level, context-sensitive cache that is tightly coupled to the communication fabric. The primary goal of Avalanche's integrated cache and communication controller is attacking end-to-end communication latency in all of its forms. This includes cache misses induced by excessive invalidations and reloading of shared data under write-invalidate coherence protocols, as well as cache misses induced by depositing incoming message data in main memory and faulting it into the cache. An execution-driven simulation study of Avalanche's architecture indicates that it can reduce cache stalls by 5-60% and overall execution times by 10-28%.
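    As a rough illustration of the second source of latency mentioned above (message data deposited in main memory and later faulted into the cache), the back-of-envelope sketch below estimates the receive-side cache-fill stall for a message of a given size. The line size and miss penalty are assumed values, not parameters of the Avalanche design.

```python
# Back-of-envelope estimate (illustrative, not measured Avalanche numbers): extra
# receive-side stall when an incoming message lands in main memory and must then be
# faulted into the cache line by line, versus being placed in the cache directly.

LINE_SIZE    = 32     # bytes per cache line (assumed)
MISS_PENALTY = 100    # cycles to fault one line in from main memory (assumed)

def receive_stall_cycles(msg_bytes, deposit_in_memory=True):
    lines = -(-msg_bytes // LINE_SIZE)      # ceiling division: cache lines touched
    return lines * MISS_PENALTY if deposit_in_memory else 0

for size in (256, 4096):
    print(size, "B message:",
          receive_stall_cycles(size, True), "vs",
          receive_stall_cycles(size, False), "cycles of cache-fill stall")
```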

    Investigación de nuevas metodologías para la planificación de sistemas de tiempo real multinúcleo mediante técnicas no convencionales (Research into new methodologies for scheduling multicore real-time systems using non-conventional techniques)

    Doctoral thesis (compendium of articles): Real-time systems are characterised by temporal constraints whose fulfilment guarantees the acceptable operation of the system. In hard real-time systems in particular, these temporal constraints must not be violated. Such systems are typically applied in areas such as aviation, railway safety, satellites and process control, among others; a missed deadline in a hard real-time system can therefore lead to a catastrophic failure. The scheduling of real-time systems is an area in which various methodologies, heuristics and algorithms are studied and applied in an attempt to allocate the CPU without missing any deadline. The use of multicore computing systems is an increasingly common option in hard real-time systems, owing among other reasons to their high computational performance and their ability to run multiple processes in parallel. On the other hand, multicore systems introduce a new problem: the contention that arises from sharing hardware resources. The source of this contention is the interference that can occur between tasks allocated to different cores when they try to access the same shared resource simultaneously, typically shared memory. This added interference can cause deadlines to be missed, making the schedule infeasible. This thesis proposes new non-conventional scheduling methodologies and strategies to address the interference problem in multicore systems, covering scheduling algorithms, task-to-core allocation algorithms, temporal models and schedulability analyses. The results of this work have been published in several journal articles in the field, which present these new proposals for the challenges of task scheduling. Most of the articles follow a similar structure: the context is introduced, the existing problem is identified, a proposal to solve or improve the scheduling results is presented, experiments are then carried out to evaluate the proposed methodology in practice, the results are analysed, and finally conclusions about the proposal are drawn. The results of the non-conventional methodologies proposed in the articles comprising this thesis show improved scheduling performance compared with classical algorithms in the area, in particular reduced interference and a higher schedulability rate.
    This thesis was carried out within the framework of two national research projects. One of them is PRECON-I4, which pursues predictable and dependable computing systems for Industry 4.0. The other is PRESECREL, which pursues models and platforms for predictable, secure and dependable industrial computing systems. Both PRECON-I4 and PRESECREL are coordinated projects funded by the Ministerio de Ciencia, Innovación y Universidades and FEDER funds (AEI/FEDER, UE). The Universitat Politècnica de València, the Universidad de Cantabria and the Universidad Politécnica de Madrid participate in both projects, and IKERLAN S. COOP I.P. also participates in PRESECREL. In addition, part of the results of this thesis have been used to validate the allocation of temporal resources in critical systems within the METROPOLIS project (PLEC2021-007609). Aceituno Peinado, JM. (2024). Investigación de nuevas metodologías para la planificación de sistemas de tiempo real multinúcleo mediante técnicas no convencionales [Tesis doctoral]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/203212
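    The abstract does not spell out the schedulability analyses the thesis proposes. As a rough illustration of where cross-core interference enters such an analysis, the sketch below runs classical fixed-priority response-time analysis with a per-job interference allowance folded into each term. The task set, the interference bounds, and the way they are added are assumptions made for illustration, not the models from the thesis.

```python
import math

# Classical response-time analysis for fixed-priority preemptive scheduling, with a
# crude per-job interference allowance I added to stand in for shared-memory
# contention. Tasks are (C, T, I) = (WCET, period == deadline, interference bound
# per job), listed highest priority first. All values are invented.
TASKS = [(1.0, 5.0, 0.2), (2.0, 10.0, 0.4), (3.0, 20.0, 0.6)]

def response_time(i, tasks):
    C, T, I = tasks[i]
    R = C + I
    while True:
        hp = sum(math.ceil(R / Tj) * (Cj + Ij)          # preemptions by
                 for Cj, Tj, Ij in tasks[:i])           # higher-priority tasks
        R_next = C + I + hp
        if R_next > T:              # deadline (== period) missed
            return None
        if R_next == R:             # fixed point reached
            return R
        R = R_next

for i, (C, T, I) in enumerate(TASKS):
    R = response_time(i, TASKS)
    print(f"task {i}: R = {R} (deadline {T})" if R else f"task {i}: unschedulable")
```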

    ADAM: a decentralized parallel computer architecture featuring fast thread and data migration and a uniform hardware abstraction

    Thesis (Ph.D.), Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2002. Includes bibliographical references (p. 247-256). The furious pace of Moore's Law is driving computer architecture into a realm where the speed of light is the dominant factor in system latencies. The number of clock cycles needed to span a chip is increasing, while the number of bits that can be accessed within a clock cycle is decreasing. Hence, it is becoming more difficult to hide latency. One alternative is to reduce latency by migrating threads and data, but the overhead of existing implementations has previously made migration an unserviceable solution. I present an architecture, implementation, and mechanisms that reduce the overhead of migration to the point where migration is a viable supplement to other latency-hiding mechanisms, such as multithreading. The architecture is abstract, and presents programmers with a simple, uniform, fine-grained multithreaded parallel programming model with implicit memory management. In other words, the spatial nature and implementation details (such as the number of processors) of a parallel machine are entirely hidden from the programmer. Compiler writers are encouraged to devise programming languages for the machine that guide a programmer to express their ideas in terms of objects, since objects exhibit an inherent physical locality of data and code. The machine implementation can then leverage this locality to automatically distribute data and threads across the physical machine using a set of high-performance migration mechanisms. An implementation of this architecture could migrate a null thread in 66 cycles, over a factor of 1000 improvement over previous work. Performance also scales well; the time required to move a typical thread is only 4 to 5 times that of a null thread. Data migration performance is similar, and scales linearly with data block size. Since the performance of the migration mechanism is on par with that of an L2 cache, the implementation simulated in my work has no data caches and relies instead on multithreading and the migration mechanism to hide and reduce access latencies. By Andrew "bunnie" Huang. Ph.D.
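    To give a feel for the numbers quoted in the abstract, the sketch below combines the roughly 66-cycle null-thread migration and the 4-5x factor for a typical thread with an assumed remote-access latency and an assumed per-word data-migration cost, and estimates how many remote accesses a migration must save before it pays off. Everything except the two figures taken from the abstract is invented for illustration.

```python
# Rough break-even estimate using the figures quoted in the abstract: ~66 cycles for
# a null-thread migration and 4-5x that for a typical thread. The remote-access
# latency and per-word data-migration cost are invented, not numbers from the thesis.

NULL_THREAD_MIGRATION  = 66      # cycles (from the abstract)
TYPICAL_THREAD_FACTOR  = 4.5     # typical thread ~ 4-5x a null thread (abstract)
REMOTE_ACCESS_LATENCY  = 120     # cycles per remote memory access (assumed)
DATA_MIGRATION_PER_WORD = 2      # cycles per word moved; linear in size (assumed)

def migrate_thread_cost():
    return NULL_THREAD_MIGRATION * TYPICAL_THREAD_FACTOR

def migrate_data_cost(words):
    return NULL_THREAD_MIGRATION + DATA_MIGRATION_PER_WORD * words

def break_even_accesses(migration_cost):
    """Remote accesses after which migrating (then accessing locally) wins,
    assuming local accesses are effectively free by comparison."""
    return migration_cost / REMOTE_ACCESS_LATENCY

for label, cost in (("thread", migrate_thread_cost()),
                    ("64-word block", migrate_data_cost(64))):
    print(f"migrating a {label}: ~{cost:.0f} cycles, "
          f"pays off after ~{break_even_accesses(cost):.1f} remote accesses")
```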

    ADAM: A Decentralized Parallel Computer Architecture Featuring Fast Thread and Data Migration and a Uniform Hardware Abstraction

    The furious pace of Moore's Law is driving computer architecture into a realm where the speed of light is the dominant factor in system latencies. The number of clock cycles needed to span a chip is increasing, while the number of bits that can be accessed within a clock cycle is decreasing. Hence, it is becoming more difficult to hide latency. One alternative is to reduce latency by migrating threads and data, but the overhead of existing implementations has previously made migration an unserviceable solution. I present an architecture, implementation, and mechanisms that reduce the overhead of migration to the point where migration is a viable supplement to other latency-hiding mechanisms, such as multithreading. The architecture is abstract, and presents programmers with a simple, uniform, fine-grained multithreaded parallel programming model with implicit memory management. In other words, the spatial nature and implementation details (such as the number of processors) of a parallel machine are entirely hidden from the programmer. Compiler writers are encouraged to devise programming languages for the machine that guide a programmer to express their ideas in terms of objects, since objects exhibit an inherent physical locality of data and code. The machine implementation can then leverage this locality to automatically distribute data and threads across the physical machine using a set of high-performance migration mechanisms. An implementation of this architecture could migrate a null thread in 66 cycles, over a factor of 1000 improvement over previous work. Performance also scales well; the time required to move a typical thread is only 4 to 5 times that of a null thread. Data migration performance is similar, and scales linearly with data block size. Since the performance of the migration mechanism is on par with that of an L2 cache, the implementation simulated in my work has no data caches and relies instead on multithreading and the migration mechanism to hide and reduce access latencies.