830 research outputs found
Speculative Segmented Sum for Sparse Matrix-Vector Multiplication on Heterogeneous Processors
Sparse matrix-vector multiplication (SpMV) is a central building block for
scientific software and graph applications. Recently, heterogeneous processors
composed of different types of cores attracted much attention because of their
flexible core configuration and high energy efficiency. In this paper, we
propose a compressed sparse row (CSR) format based SpMV algorithm utilizing
both types of cores in a CPU-GPU heterogeneous processor. We first
speculatively execute segmented sum operations on the GPU part of a
heterogeneous processor and generate a possibly incorrect results. Then the CPU
part of the same chip is triggered to re-arrange the predicted partial sums for
a correct resulting vector. On three heterogeneous processors from Intel, AMD
and nVidia, using 20 sparse matrices as a benchmark suite, the experimental
results show that our method obtains significant performance improvement over
the best existing CSR-based SpMV algorithms. The source code of this work is
downloadable at https://github.com/bhSPARSE/Benchmark_SpMV_using_CSRComment: 22 pages, 8 figures, Published at Parallel Computing (PARCO
Efficient Proactive Caching for Supporting Seamless Mobility
We present a distributed proactive caching approach that exploits user
mobility information to decide where to proactively cache data to support
seamless mobility, while efficiently utilizing cache storage using a congestion
pricing scheme. The proposed approach is applicable to the case where objects
have different sizes and to a two-level cache hierarchy, for both of which the
proactive caching problem is hard. Additionally, our modeling framework
considers the case where the delay is independent of the requested data object
size and the case where the delay is a function of the object size. Our
evaluation results show how various system parameters influence the delay gains
of the proposed approach, which achieves robust and good performance relative
to an oracle and an optimal scheme for a flat cache structure.Comment: 10 pages, 9 figure
Compilation Optimizations to Enhance Resilience of Big Data Programs and Quantum Processors
Modern computers can experience a variety of transient errors due to the surrounding environment, known as soft faults. Although the frequency of these faults is low enough to not be noticeable on personal computers, they become a considerable concern during large-scale distributed computations or systems in more vulnerable environments like satellites. These faults occur as a bit flip of some value in a register, operation, or memory during execution. They surface as either program crashes, hangs, or silent data corruption (SDC), each of which can waste time, money, and resources. Hardware methods, such as shielding or error correcting memory (ECM), exist, though they can be difficult to implement, expensive, and may be limited to only protecting against errors in specific locations. Researchers have been exploring software detection and correction methods as an alternative, commonly trading either overhead in execution time or memory usage to protect against faults.
Quantum computers, a relatively recent advancement in computing technology, experience similar errors on a much more severe scale. The errors are more frequent, costly, and difficult to detect and correct. Error correction algorithms like Shor’s code promise to completely remove errors, but they cannot be implemented on current noisy intermediate-scale quantum (NISQ) systems due to the low number of available qubits. Until the physical systems become large enough to support error correction, researchers instead have been studying other methods to reduce and compensate for errors.
In this work, we present two methods for improving the resilience of classical processes, both single- and multi-threaded. We then introduce quantum computing and compare the nature of errors and correction methods to previous classical methods. We further discuss two designs for improving compilation of quantum circuits. One method, focused on quantum neural networks (QNNs), takes advantage of partial compilation to avoid recompiling the entire circuit each time. The other method is a new approach to compiling quantum circuits using graph neural networks (GNNs) to improve the resilience of quantum circuits and increase fidelity. By using GNNs with reinforcement learning, we can train a compiler to provide improved qubit allocation that improves the success rate of quantum circuits
System-Scenario Methodology to Design a Highly Reliable Radiation-Hardened Memory for Space Applications
Cache memory circuits are one of the concerns of computing systems, especially in terms of power consumption, reliability, and high performance. Voltage-scaling techniques can be used to reduce the total power consumption of the caches. However, aggressive voltage scaling significantly increases the probability of memory failure, especially in environments with high radiation levels, such as space. It is, therefore, important to deploy techniques to deal with reliability issues along with voltage scaling. In this chapter, we present a system-scenario methodology for radiation-hardened memory design to keep the reliability during voltage scaling. Although any SRAM array can benefit from the design, we frame our study on the recently proposed radiation-hardened cell, Nwise, which provides high level of tolerance against single event and multi event upsets in memories. To reduce the power consumption while upholding reliability, we leverage the system-scenario-based design methodology to optimize the energy consumption in applications, where system requirements vary dynamically at run time. We demonstrate the use of the methodology with a use case related to satellite systems and solar activity. Our simulations show that we achieve up to 49.3% power consumption saving compared to using a cache design with a fixed nominal power supply level
Resource-aware scheduling for 2D/3D multi-/many-core processor-memory systems
This dissertation addresses the complexities of 2D/3D multi-/many-core processor-memory systems, focusing on two key areas: enhancing timing predictability in real-time multi-core processors and optimizing performance within thermal constraints. The integration of an increasing number of transistors into compact chip designs, while boosting computational capacity, presents challenges in resource contention and thermal management. The first part of the thesis improves timing predictability. We enhance shared cache interference analysis for set-associative caches, advancing the calculation of Worst-Case Execution Time (WCET). This development enables accurate assessment of cache interference and the effectiveness of partitioned schedulers in real-world scenarios. We introduce TCPS, a novel task and cache-aware partitioned scheduler that optimizes cache partitioning based on task-specific WCET sensitivity, leading to improved schedulability and predictability. Our research explores various cache and scheduling configurations, providing insights into their performance trade-offs. The second part focuses on thermal management in 2D/3D many-core systems. Recognizing the limitations of Dynamic Voltage and Frequency Scaling (DVFS) in S-NUCA many-core processors, we propose synchronous thread migrations as a thermal management strategy. This approach culminates in the HotPotato scheduler, which balances performance and thermal safety. We also introduce 3D-TTP, a transient temperature-aware power budgeting strategy for 3D-stacked systems, reducing the need for Dynamic Thermal Management (DTM) activation. Finally, we present 3QUTM, a novel method for 3D-stacked systems that combines core DVFS and memory bank Low Power Modes with a learning algorithm, optimizing response times within thermal limits. This research contributes significantly to enhancing performance and thermal management in advanced processor-memory systems
Data Storage and Dissemination in Pervasive Edge Computing Environments
Nowadays, smart mobile devices generate huge amounts of data in all sorts of gatherings.
Much of that data has localized and ephemeral interest, but can be of great use if shared
among co-located devices. However, mobile devices often experience poor connectivity,
leading to availability issues if application storage and logic are fully delegated to a
remote cloud infrastructure. In turn, the edge computing paradigm pushes computations
and storage beyond the data center, closer to end-user devices where data is generated
and consumed. Hence, enabling the execution of certain components of edge-enabled
systems directly and cooperatively on edge devices.
This thesis focuses on the design and evaluation of resilient and efficient data storage
and dissemination solutions for pervasive edge computing environments, operating with
or without access to the network infrastructure. In line with this dichotomy, our goal can
be divided into two specific scenarios. The first one is related to the absence of network
infrastructure and the provision of a transient data storage and dissemination system
for networks of co-located mobile devices. The second one relates with the existence of
network infrastructure access and the corresponding edge computing capabilities.
First, the thesis presents time-aware reactive storage (TARS), a reactive data storage
and dissemination model with intrinsic time-awareness, that exploits synergies between
the storage substrate and the publish/subscribe paradigm, and allows queries within a
specific time scope. Next, it describes in more detail: i) Thyme, a data storage and dis-
semination system for wireless edge environments, implementing TARS; ii) Parsley, a
flexible and resilient group-based distributed hash table with preemptive peer relocation
and a dynamic data sharding mechanism; and iii) Thyme GardenBed, a framework
for data storage and dissemination across multi-region edge networks, that makes use of
both device-to-device and edge interactions.
The developed solutions present low overheads, while providing adequate response
times for interactive usage and low energy consumption, proving to be practical in a
variety of situations. They also display good load balancing and fault tolerance properties.Resumo
Hoje em dia, os dispositivos móveis inteligentes geram grandes quantidades de dados
em todos os tipos de aglomerações de pessoas. Muitos desses dados têm interesse loca-
lizado e efêmero, mas podem ser de grande utilidade se partilhados entre dispositivos
co-localizados. No entanto, os dispositivos móveis muitas vezes experienciam fraca co-
nectividade, levando a problemas de disponibilidade se o armazenamento e a lógica das
aplicações forem totalmente delegados numa infraestrutura remota na nuvem. Por sua
vez, o paradigma de computação na periferia da rede leva as computações e o armazena-
mento para além dos centros de dados, para mais perto dos dispositivos dos utilizadores
finais onde os dados são gerados e consumidos. Assim, permitindo a execução de certos
componentes de sistemas direta e cooperativamente em dispositivos na periferia da rede.
Esta tese foca-se no desenho e avaliação de soluções resilientes e eficientes para arma-
zenamento e disseminação de dados em ambientes pervasivos de computação na periferia
da rede, operando com ou sem acesso à infraestrutura de rede. Em linha com esta dico-
tomia, o nosso objetivo pode ser dividido em dois cenários específicos. O primeiro está
relacionado com a ausência de infraestrutura de rede e o fornecimento de um sistema
efêmero de armazenamento e disseminação de dados para redes de dispositivos móveis
co-localizados. O segundo diz respeito à existência de acesso à infraestrutura de rede e
aos recursos de computação na periferia da rede correspondentes.
Primeiramente, a tese apresenta armazenamento reativo ciente do tempo (ARCT), um
modelo reativo de armazenamento e disseminação de dados com percepção intrínseca
do tempo, que explora sinergias entre o substrato de armazenamento e o paradigma pu-
blicação/subscrição, e permite consultas num escopo de tempo específico. De seguida,
descreve em mais detalhe: i) Thyme, um sistema de armazenamento e disseminação de
dados para ambientes sem fios na periferia da rede, que implementa ARCT; ii) Pars-
ley, uma tabela de dispersão distribuída flexível e resiliente baseada em grupos, com
realocação preventiva de nós e um mecanismo de particionamento dinâmico de dados; e
iii) Thyme GardenBed, um sistema para armazenamento e disseminação de dados em
redes multi-regionais na periferia da rede, que faz uso de interações entre dispositivos e
com a periferia da rede.
As soluções desenvolvidas apresentam baixos custos, proporcionando tempos de res-
posta adequados para uso interativo e baixo consumo de energia, demonstrando serem
práticas nas mais diversas situações. Estas soluções também exibem boas propriedades de balanceamento de carga e tolerância a faltas
Building Internet caching systems for streaming media delivery
The proxy has been widely and successfully used to cache the static Web objects fetched by a client so that the subsequent clients requesting the same Web objects can be served directly from the proxy instead of other sources faraway, thus reducing the server\u27s load, the network traffic and the client response time. However, with the dramatic increase of streaming media objects emerging on the Internet, the existing proxy cannot efficiently deliver them due to their large sizes and client real time requirements.;In this dissertation, we design, implement, and evaluate cost-effective and high performance proxy-based Internet caching systems for streaming media delivery. Addressing the conflicting performance objectives for streaming media delivery, we first propose an efficient segment-based streaming media proxy system model. This model has guided us to design a practical streaming proxy, called Hyper-Proxy, aiming at delivering the streaming media data to clients with minimum playback jitter and a small startup latency, while achieving high caching performance. Second, we have implemented Hyper-Proxy by leveraging the existing Internet infrastructure. Hyper-Proxy enables the streaming service on the common Web servers. The evaluation of Hyper-Proxy on the global Internet environment and the local network environment shows it can provide satisfying streaming performance to clients while maintaining a good cache performance. Finally, to further improve the streaming delivery efficiency, we propose a group of the Shared Running Buffers (SRB) based proxy caching techniques to effectively utilize proxy\u27s memory. SRB algorithms can significantly reduce the media server/proxy\u27s load and network traffic and relieve the bottlenecks of the disk bandwidth and the network bandwidth.;The contributions of this dissertation are threefold: (1) we have studied several critical performance trade-offs and provided insights into Internet media content caching and delivery. Our understanding further leads us to establish an effective streaming system optimization model; (2) we have designed and evaluated several efficient algorithms to support Internet streaming content delivery, including segment caching, segment prefetching, and memory locality exploitation for streaming; (3) having addressed several system challenges, we have successfully implemented a real streaming proxy system and deployed it in a large industrial enterprise
- …