38 research outputs found
Proposal of a multi-threading architecture aimed at multi-process systems
This paper proposes the development of a multi-threading architecture capable of extracting both instruction-level parallelism and the parallelism available among the different processes run by operating systems on shared workstations and network servers. The proposed architecture relieves the operating system of its most CPU-time-consuming activities, such as scheduling and context switching between processes. The main components of the architecture are presented here, along with the basic algorithms to be executed by the stages of the superscalar pipeline.
Sistemas Distribuidos - Redes Concurrencia; Red de Universidades con Carreras en Informática (RedUNCI)
A markov-model-based framework for supporting real-time generation of synthetic memory references effectively and efficiently
Driven by several real-life case studies and in-lab developments,
synthetic memory reference generation has a
long tradition in computer science research. The goal is
to reproduce the execution of an arbitrary program, so that
the generated traces can later be used for simulations
and experiments. In this paper we investigate this research
context and provide principles and algorithms of a
Markov-Model-based framework for supporting real-time
generation of synthetic memory references effectively and
efficiently. Specifically, our approach is based on a novel
Machine Learning algorithm we call Hierarchical Hidden/non Hidden
Markov Model (HHnHMM). Experimental
results conclude this paper.
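To make the idea of Markov-model-based trace generation concrete, here is a minimal sketch. It uses a plain first-order Markov chain fitted to an observed address trace, not the HHnHMM algorithm from the paper, whose details are not given in the abstract. The function names `fit_markov_chain` and `generate_references` and the sample addresses are hypothetical.

```python
import random
from collections import defaultdict


def fit_markov_chain(trace):
    """Count first-order transitions between consecutive references."""
    counts = defaultdict(lambda: defaultdict(int))
    for cur, nxt in zip(trace, trace[1:]):
        counts[cur][nxt] += 1
    # Convert nested defaultdicts to plain dicts for easier inspection.
    return {state: dict(succ) for state, succ in counts.items()}


def generate_references(model, start, length, seed=0):
    """Walk the chain, sampling each next address by observed frequency."""
    rng = random.Random(seed)
    state = start
    out = [state]
    while len(out) < length:
        successors = model.get(state)
        if not successors:
            # Assumption: on a dead end, restart from the initial address.
            state = start
        else:
            states = list(successors)
            weights = [successors[s] for s in states]
            state = rng.choices(states, weights=weights, k=1)[0]
        out.append(state)
    return out


# Hypothetical observed trace of memory addresses.
observed = [0x1000, 0x1004, 0x1008, 0x1000, 0x1004,
            0x2000, 0x1000, 0x1004, 0x1008]
model = fit_markov_chain(observed)
synthetic = generate_references(model, start=0x1000, length=20)
print([hex(a) for a in synthetic])
```

Because generation only needs a table lookup and one weighted draw per reference, a model of this kind can emit references on the fly, which is the property a real-time generator needs.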
Accelerating Generic Graph Neural Networks via Architecture, Compiler, Partition Method Co-Design
Graph neural networks (GNNs) have shown significant accuracy improvements in
a variety of graph learning domains, sparking considerable research interest.
To translate these accuracy improvements into practical applications, it is
essential to develop high-performance and efficient hardware acceleration for
GNN models. However, designing GNN accelerators faces two fundamental
challenges: the high bandwidth requirement of GNN models and the diversity of
GNN models. Previous works have addressed the first challenge by using more
expensive memory interfaces to achieve higher bandwidth. For the second
challenge, existing works either support specific GNN models or have generic
designs with poor hardware utilization.
In this work, we tackle both challenges simultaneously. First, we identify a
new type of partition-level operator fusion, which we utilize to internally
reduce the high bandwidth requirement of GNNs. Next, we introduce
partition-level multi-threading to schedule the concurrent processing of graph
partitions, utilizing different hardware resources. To further reduce the extra
on-chip memory required by multi-threading, we propose fine-grained graph
partitioning to generate denser graph partitions. Importantly, these three
methods make no assumptions about the targeted GNN models, addressing the
challenge of model variety. We implement these methods in a framework called
SwitchBlade, consisting of a compiler, a graph partitioner, and a hardware
accelerator. Our evaluation demonstrates that SwitchBlade achieves an average
speedup and energy savings compared to the
NVIDIA V100 GPU. Additionally, SwitchBlade delivers performance comparable to
state-of-the-art specialized accelerators.
DTAPO: Dynamic thermal-aware performance optimization for dark silicon many-core systems
Future many-core systems need to handle high power density and chip temperature effectively. Some cores in many-core systems need to be turned off, or kept ‘dark’, to manage chip power and thermal density. This phenomenon is known as the dark silicon problem, and it prevents many-core systems from exploiting a large number of processing cores to improve performance. This paper presents a dynamic thermal-aware performance optimization (DTaPO) technique for optimizing the performance of dark silicon many-core systems under a temperature constraint. The proposed technique uses both task migration and dynamic voltage and frequency scaling (DVFS) to optimize the performance of a many-core system while keeping the system temperature within a safe operating limit. Task migration puts hot cores in low-power states and moves their tasks to cooler dark cores to aggressively reduce chip temperature while maintaining high overall system performance. To reduce the task migration overhead due to cold start, the source core (i.e., the active core) keeps its L2 cache content during the initial migration phase, and the destination core (i.e., the dark core) can access it to reduce the impact of cold-start misses. Moreover, the proposed technique limits task migration to cores that share the last-level cache (LLC). In the case of a major thermal violation with no cooler cores available, DVFS is used to reduce the hot cores' temperature gradually by lowering their frequency. Experimental results for different threshold temperatures show that DTaPO can keep the average system temperature below the thermal limit. The execution time penalty is reduced by up to 18% compared with using only DVFS for all thermal thresholds, and the average peak temperature is reduced by up to 10.8 °C. In addition, the experimental results show that DTaPO improves the system's performance by up to 80% compared to optimal sprinting patterns (OSP) and reduces the temperature by up to 13.6 °C.
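The control policy described above (migrate first, fall back to DVFS) can be sketched roughly as follows. This is an illustrative simplification under stated assumptions, not the authors' implementation: the `Core` fields, the `thermal_step` function, the frequency step, and the frequency floor are all hypothetical, and the sketch ignores the L2 cache hand-off and the distinction between minor and major violations.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Core:
    core_id: int
    llc_group: int               # cores with the same group share an LLC
    temp_c: float
    freq_mhz: int
    task: Optional[str] = None   # None means the core is dark


def thermal_step(cores, threshold_c, freq_step_mhz=200, min_freq_mhz=800):
    """One control step: migrate hot tasks if possible, else throttle."""
    actions = []
    for hot in cores:
        if hot.task is None or hot.temp_c <= threshold_c:
            continue  # dark or within the thermal limit
        # Only dark, cooler cores behind the same LLC are candidates.
        candidates = [c for c in cores
                      if c.task is None
                      and c.llc_group == hot.llc_group
                      and c.temp_c < threshold_c]
        if candidates:
            target = min(candidates, key=lambda c: c.temp_c)
            target.task, hot.task = hot.task, None  # hot core goes dark
            actions.append(("migrate", hot.core_id, target.core_id))
        else:
            # No cooler dark core available: lower the frequency one step.
            hot.freq_mhz = max(min_freq_mhz, hot.freq_mhz - freq_step_mhz)
            actions.append(("dvfs", hot.core_id, hot.freq_mhz))
    return actions


# Hypothetical four-core chip with two LLC groups.
cores = [
    Core(0, 0, 92.0, 2000, "A"),
    Core(1, 0, 55.0, 2000, None),
    Core(2, 1, 95.0, 2000, "B"),
    Core(3, 1, 97.0, 2000, "C"),
]
actions = thermal_step(cores, threshold_c=85.0)
print(actions)
```

In this example, core 0 can hand its task to the dark core 1 in the same LLC group, while cores 2 and 3 have no dark neighbor and are throttled instead.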
Evaluation of Interconnection Network Performance under Heavy Nonuniform Loads
Many simulation-based performance studies of interconnection networks are carried out using synthetic workloads under the assumption of independent traffic sources. We show that this assumption, although it may be useful for some traffic patterns, can lead to deceptive performance results for loads beyond saturation. Network throughput varies so much among the network nodes that average throughput no longer reflects network performance as a whole. We propose the use of burst-synchronized traffic sources, which better reflect the behavior of parallel applications at high loads. A performance study of a restrictive injection mechanism is used to illustrate the different results obtained using independent and non-independent traffic sources.
Mage: Online Interference-Aware Scheduling in Multi-Scale Heterogeneous Systems
Heterogeneity has grown in popularity both at the core and server level as a
way to improve both performance and energy efficiency. However, despite these
benefits, scheduling applications in heterogeneous machines remains
challenging. Additionally, when these heterogeneous resources accommodate
multiple applications to increase utilization, resources are prone to
contention, destructive interference, and unpredictable performance. Existing
solutions examine heterogeneity either across or within a server, leading to
missed performance and efficiency opportunities. We present Mage, a practical
interference-aware runtime that optimizes performance and efficiency in systems
with intra- and inter-server heterogeneity. Mage leverages fast and online data
mining to quickly explore the space of application placements, and determine
the one that minimizes destructive interference between co-resident
applications. Mage continuously monitors the performance of active
applications, and, upon detecting QoS violations, it determines whether
alternative placements would prove more beneficial, taking into account any
overheads from migration. Across 350 application mixes on a heterogeneous CMP,
Mage improves performance by 38% and up to 2x compared to a greedy scheduler.
Across 160 mixes on a heterogeneous cluster, Mage improves performance by 30%
on average and up to 52% over the greedy scheduler, and by 11% over the
combination of Paragon [15] for inter- and intra-server heterogeneity.
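The placement decision described above can be illustrated with a small sketch. It swaps Mage's online data mining for a brute-force search over all placements, which is only practical for toy inputs, and every name and number in it (`placement_cost`, `best_placement`, `should_migrate`, the penalty and interference tables) is a hypothetical stand-in for predictions the runtime would have to produce.

```python
from itertools import combinations, product


def placement_cost(placement, platform_penalty, interference):
    """Predicted slowdown: platform mismatch plus co-residency interference."""
    cost = sum(platform_penalty[(app, srv)] for app, srv in placement.items())
    for a, b in combinations(sorted(placement), 2):
        if placement[a] == placement[b]:  # co-resident on the same server
            cost += interference[frozenset((a, b))]
    return cost


def best_placement(apps, servers, platform_penalty, interference):
    """Exhaustively search all app-to-server assignments (toy sizes only)."""
    candidates = (dict(zip(apps, combo))
                  for combo in product(servers, repeat=len(apps)))
    return min(candidates,
               key=lambda p: placement_cost(p, platform_penalty, interference))


def should_migrate(current, candidate, platform_penalty, interference,
                   migration_overhead):
    """Migrate only if the predicted gain outweighs the migration cost."""
    gain = (placement_cost(current, platform_penalty, interference)
            - placement_cost(candidate, platform_penalty, interference))
    return gain > migration_overhead


# Hypothetical applications, servers, and predicted penalties.
apps = ["web", "db", "batch"]
servers = ["big", "small"]
platform_penalty = {
    ("web", "big"): 0.0, ("web", "small"): 0.2,
    ("db", "big"): 0.0, ("db", "small"): 0.3,
    ("batch", "big"): 0.1, ("batch", "small"): 0.1,
}
interference = {
    frozenset(("web", "db")): 0.5,
    frozenset(("web", "batch")): 0.1,
    frozenset(("db", "batch")): 0.4,
}
current = {"web": "big", "db": "big", "batch": "big"}
best = best_placement(apps, servers, platform_penalty, interference)
print(best)
print(should_migrate(current, best, platform_penalty, interference, 0.3))
```

The search space grows exponentially with the number of applications, which is why a practical runtime needs a faster way to explore placements than the enumeration used here.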
A Rapid Prototyping Environment for Wireless Communication Embedded Systems
This paper introduces a rapid prototyping methodology which overcomes important barriers in the design and implementation of digital signal processing (DSP) algorithms and systems on embedded hardware platforms, such as cellular phones. This paper describes rapid prototyping in terms of a simulation/prototype bridge and in terms of appropriate language design. The simulation/prototype bridge combines the strengths of simulation and of prototyping, allowing the designer to develop and evaluate next-generation communications systems, partly in simulation on a host computer and partly as a prototype on embedded hardware. Appropriate language design allows designers to express a communications system as a block diagram, in which each block represents an algorithm specified by a set of equations. Software tools developed for this paper implement both concepts, and have been successfully used in the development of a next-generation code division multiple access (CDMA) cellular wireless communications system.
Nokia; Texas Instruments; The Texas Advanced Technology Program; National Science Foundation