38 research outputs found

    Proposta de uma arquitetura multi-threading voltada para sistemas multi processos (Proposal for a multi-threading architecture aimed at multi-process systems)

    This paper proposes the development of a multi-threading architecture capable of extracting both instruction-level parallelism and the parallelism available among the different processes run by operating systems on shared workstations and network servers. The proposed architecture relieves the operating system of its most CPU-time-consuming activities, such as scheduling and context switching between processes. The main components of the architecture are presented here, along with the basic algorithms to be executed by the stages of the superscalar pipeline. Sistemas Distribuidos - Redes Concurrencia. Red de Universidades con Carreras en Informática (RedUNCI)

    A Markov-model-based framework for supporting real-time generation of synthetic memory references effectively and efficiently

    Driven by several real-life case studies and in-lab developments, synthetic memory reference generation has a long tradition in computer science research. The goal is to reproduce the execution of an arbitrary program, whose generated traces can later be used for simulations and experiments. In this paper we investigate this research context and provide the principles and algorithms of a Markov-model-based framework for supporting real-time generation of synthetic memory references effectively and efficiently. Specifically, our approach is based on a novel machine learning algorithm that we call the Hierarchical Hidden/non-Hidden Markov Model (HHnHMM). Experimental results conclude the paper.
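
As a rough illustration of Markov-based trace synthesis, the sketch below fits a plain first-order Markov chain over address strides and walks it to emit synthetic addresses. It is not the HHnHMM algorithm from the abstract, whose details are not given here; the function names `fit_stride_chain` and `generate`, the stride representation, and the fallback rule are all assumptions made for this example.

```python
import random
from collections import Counter, defaultdict


def fit_stride_chain(trace):
    """Count transitions between consecutive address strides in a trace."""
    strides = [b - a for a, b in zip(trace, trace[1:])]
    transitions = defaultdict(Counter)
    for prev, nxt in zip(strides, strides[1:]):
        transitions[prev][nxt] += 1
    return strides[0], transitions


def generate(start_addr, first_stride, transitions, length, seed=0):
    """Emit `length` synthetic addresses by walking the stride chain."""
    rng = random.Random(seed)
    addr, stride = start_addr, first_stride
    out = [addr]
    while len(out) < length:
        addr += stride
        out.append(addr)
        successors = transitions.get(stride)
        if successors:
            # Sample the next stride in proportion to observed counts.
            choices, weights = zip(*successors.items())
            stride = rng.choices(choices, weights=weights)[0]
        # Assumed fallback: a stride with no recorded successor is reused.
    return out


# Hypothetical input trace: sequential 4-byte accesses with occasional jumps.
trace = [0, 4, 8, 12, 64, 68, 72, 76, 128]
first, chain = fit_stride_chain(trace)
synthetic = generate(start_addr=0, first_stride=first, transitions=chain, length=10)
```

Modelling strides rather than absolute addresses keeps the state space small for this toy case; a hierarchical or hidden-state model would presumably capture longer-range program phases that this sketch ignores.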

    Accelerating Generic Graph Neural Networks via Architecture, Compiler, Partition Method Co-Design

    Graph neural networks (GNNs) have shown significant accuracy improvements in a variety of graph learning domains, sparking considerable research interest. To translate these accuracy improvements into practical applications, it is essential to develop high-performance and efficient hardware acceleration for GNN models. However, designing GNN accelerators faces two fundamental challenges: the high bandwidth requirement of GNN models and the diversity of GNN models. Previous works have addressed the first challenge by using more expensive memory interfaces to achieve higher bandwidth. For the second challenge, existing works either support specific GNN models or have generic designs with poor hardware utilization. In this work, we tackle both challenges simultaneously. First, we identify a new type of partition-level operator fusion, which we utilize to internally reduce the high bandwidth requirement of GNNs. Next, we introduce partition-level multi-threading to schedule the concurrent processing of graph partitions, utilizing different hardware resources. To further reduce the extra on-chip memory required by multi-threading, we propose fine-grained graph partitioning to generate denser graph partitions. Importantly, these three methods make no assumptions about the targeted GNN models, addressing the challenge of model variety. We implement these methods in a framework called SwitchBlade, consisting of a compiler, a graph partitioner, and a hardware accelerator. Our evaluation demonstrates that SwitchBlade achieves an average speedup of 1.85× and energy savings of 19.03× compared to the NVIDIA V100 GPU. Additionally, SwitchBlade delivers performance comparable to state-of-the-art specialized accelerators.

    DTaPO: Dynamic thermal-aware performance optimization for dark silicon many-core systems

    Future many-core systems need to handle high power density and chip temperature effectively. Some cores in many-core systems need to be turned off, or kept 'dark', to manage chip power and thermal density. This phenomenon is known as the dark silicon problem, and it prevents many-core systems from exploiting a large number of processing cores for improved performance. This paper presents DTaPO, a dynamic thermal-aware performance optimization technique for dark silicon many-core systems that optimizes system performance under a temperature constraint. The proposed technique uses both task migration and dynamic voltage and frequency scaling (DVFS) to optimize the performance of a many-core system while keeping the system temperature within a safe operating limit. Task migration puts hot cores in low-power states and moves tasks to cooler dark cores to aggressively reduce chip temperature while maintaining high overall system performance. To reduce the task migration overhead due to cold start, the source core (i.e., the active core) keeps its L2 cache content during the initial migration phase, and the destination core (i.e., the dark core) can access it to reduce the impact of cold-start misses. Moreover, the proposed technique limits task migration to cores that share the last-level cache (LLC). In the case of a major thermal violation with no cooler cores available, DVFS is used to reduce the hot cores' temperature gradually by reducing their frequency. Experimental results for different threshold temperatures show that DTaPO can keep the average system temperature below the thermal limit. The execution time penalty is reduced by up to 18% compared with using only DVFS, for all thermal thresholds. Moreover, the average peak temperature is reduced by up to 10.8 °C. In addition, the experimental results show that DTaPO improves the system's performance by up to 80% compared to optimal sprinting patterns (OSP) and reduces the temperature by up to 13.6 °C.
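
The migrate-first, DVFS-as-fallback policy described in this abstract can be sketched as a single control step. This is only an illustration under stated assumptions, not the authors' implementation: the `Core` fields, the `thermal_step` function, the 100 MHz frequency step, the 800 MHz floor, and the rule of picking the coolest dark core in the same LLC group are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Core:
    core_id: int
    llc_group: int   # cores with the same value share a last-level cache
    temp_c: float
    freq_mhz: int
    active: bool     # False means the core is currently dark


def thermal_step(cores, limit_c, freq_step_mhz=100, min_freq_mhz=800):
    """Return the actions for one control interval (all thresholds assumed)."""
    actions = []
    claimed = set()  # dark cores already chosen as migration targets
    for hot in [c for c in cores if c.active and c.temp_c > limit_c]:
        # Only dark cores that share the hot core's LLC and are cooler qualify.
        candidates = [
            c for c in cores
            if not c.active
            and c.llc_group == hot.llc_group
            and c.temp_c < hot.temp_c
            and c.core_id not in claimed
        ]
        if candidates:
            dest = min(candidates, key=lambda c: c.temp_c)
            claimed.add(dest.core_id)
            actions.append(("migrate", hot.core_id, dest.core_id))
        else:
            # No cooler dark core available: lower the frequency one step.
            new_freq = max(min_freq_mhz, hot.freq_mhz - freq_step_mhz)
            actions.append(("dvfs", hot.core_id, new_freq))
    return actions


# Hypothetical four-core system with two LLC groups.
cores = [
    Core(0, 0, 92.0, 2000, True),
    Core(1, 0, 55.0, 2000, False),
    Core(2, 1, 95.0, 2000, True),
    Core(3, 1, 60.0, 2000, True),
]
plan = thermal_step(cores, limit_c=80.0)
```

In this example core 0 can migrate to dark core 1 in its own LLC group, while core 2 has no dark neighbour and falls back to a frequency reduction.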

    Evaluation of Interconnection Network Performance under Heavy Nonuniform Loads

    Many simulation-based performance studies of interconnection networks are carried out using synthetic workloads under the assumption of independent traffic sources. We show that this assumption, although it may be useful for some traffic patterns, can lead to deceptive performance results for loads beyond saturation. Network throughput varies so much among the network nodes that average throughput no longer reflects the performance of the network as a whole. We propose the use of burst-synchronized traffic sources, which better reflect the behavior of parallel applications at high loads. A performance study of a restrictive injection mechanism is used to illustrate the different results obtained using independent and non-independent traffic sources.

    Mage: Online Interference-Aware Scheduling in Multi-Scale Heterogeneous Systems

    Heterogeneity has grown in popularity both at the core and server level as a way to improve both performance and energy efficiency. However, despite these benefits, scheduling applications in heterogeneous machines remains challenging. Additionally, when these heterogeneous resources accommodate multiple applications to increase utilization, resources are prone to contention, destructive interference, and unpredictable performance. Existing solutions examine heterogeneity either across or within a server, leading to missed performance and efficiency opportunities. We present Mage, a practical interference-aware runtime that optimizes performance and efficiency in systems with intra- and inter-server heterogeneity. Mage leverages fast and online data mining to quickly explore the space of application placements, and determine the one that minimizes destructive interference between co-resident applications. Mage continuously monitors the performance of active applications, and, upon detecting QoS violations, it determines whether alternative placements would prove more beneficial, taking into account any overheads from migration. Across 350 application mixes on a heterogeneous CMP, Mage improves performance by 38% and up to 2x compared to a greedy scheduler. Across 160 mixes on a heterogeneous cluster, Mage improves performance by 30% on average and up to 52% over the greedy scheduler, and by 11% over the combination of Paragon [15] for inter- and intra-server heterogeneity

    A Rapid Prototyping Environment for Wireless Communication Embedded Systems

    This paper introduces a rapid prototyping methodology which overcomes important barriers in the design and implementation of digital signal processing (DSP) algorithms and systems on embedded hardware platforms, such as cellular phones. This paper describes rapid prototyping in terms of a simulation/prototype bridge and in terms of appropriate language design. The simulation/prototype bridge combines the strengths of simulation and of prototyping, allowing the designer to develop and evaluate next-generation communications systems, partly in simulation on a host computer and partly as a prototype on embedded hardware. Appropriate language design allows designers to express a communications system as a block diagram, in which each block represents an algorithm specified by a set of equations. Software tools developed for this paper implement both concepts, and have been successfully used in the development of a next-generation code division multiple access (CDMA) cellular wireless communications system. Nokia; Texas Instruments; The Texas Advanced Technology Program; National Science Foundation