NoSQ: Store-Load Communication without a Store Queue

Author: Martin Milo
Roth Amir
Sha Tingting
Publication venue: ScholarlyCommons
Publication date: 01/12/2006

This paper presents NoSQ (short for No Store Queue), a microarchitecture that performs store-load communication without a store queue and without executing stores in the out-of-order engine. NoSQ implements store-load communication using speculative memory bypassing (SMB), the dynamic short-circuiting of DEF-store-load-USE chains to DEF-USE chains. Whereas previous proposals used SMB as an opportunistic complement to conventional store queue-based forwarding, NoSQ uses SMB as a store queue replacement. NoSQ relies on two supporting mechanisms. The first is an advanced store-load bypassing predictor that, for a given dynamic load, can predict whether that load will bypass and the identity of the communicating store. The second is an efficient verification mechanism for both bypassed and non-bypassed loads using in-order load re-execution with an SMB-aware store vulnerability window (SVW) filter. The primary benefit of NoSQ is a simple, fast datapath that does not contain store-load forwarding hardware; all loads get their values either from the data cache or from the register file. Experiments show that this simpler design, despite being more speculative, slightly outperforms a conventional store queue-based design on most benchmarks (by 2% on average).
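
To make the DEF-store-load-USE short-circuiting concrete, here is a minimal sketch of how a register renamer could apply SMB. It is an illustration under stated assumptions, not the paper's design: the function `rename`, the tuple-based instruction encoding, and the `bypass_prediction` map (load index to predicted store index) are all hypothetical, and the in-order re-execution that verifies each prediction is omitted.

```python
from itertools import count


def rename(instrs, bypass_prediction):
    """Rename a straight-line instruction list, short-circuiting predicted bypasses.

    instrs: tuples such as ("def", "r1"), ("store", "r1", "A"),
            ("load", "r2", "A"), ("use", "r2")  (hypothetical encoding).
    bypass_prediction: maps a load's index to the index of the store
            predicted to feed it (stands in for the bypassing predictor).
    """
    fresh = (f"p{i}" for i in count())  # supply of new physical registers
    reg_map = {}      # architectural register -> physical register
    store_data = {}   # store index -> physical register holding its data
    renamed = []
    for idx, (op, *args) in enumerate(instrs):
        if op == "def":
            (dst,) = args
            reg_map[dst] = next(fresh)
            renamed.append(("def", reg_map[dst]))
        elif op == "store":
            src, addr = args
            store_data[idx] = reg_map[src]
            renamed.append(("store", reg_map[src], addr))
        elif op == "load":
            dst, addr = args
            store_idx = bypass_prediction.get(idx)
            if store_idx is not None:
                # Predicted bypass: the load's destination simply aliases the
                # store's data register, so the value comes from the register file.
                reg_map[dst] = store_data[store_idx]
                renamed.append(("bypassed_load", reg_map[dst], addr))
            else:
                # No bypass predicted: the load reads the data cache.
                reg_map[dst] = next(fresh)
                renamed.append(("cache_load", reg_map[dst], addr))
        elif op == "use":
            (src,) = args
            renamed.append(("use", reg_map[src]))
    return renamed


program = [("def", "r1"), ("store", "r1", "A"), ("load", "r2", "A"), ("use", "r2")]
with_smb = rename(program, {2: 1})   # load at index 2 predicted to bypass from store 1
without_smb = rename(program, {})    # no prediction: load goes to the cache
```

With the prediction in place, the final `use` reads the same physical register that the `def` wrote, which is the DEF-USE chain the abstract describes; without it, the `use` depends on a separate register filled by the cache load.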

Crossref

ScholarlyCommons@Penn

Control-flow speculation through value prediction for superscalar processors

Author: González Colás Antonio María
González González José
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/01/1999

In this paper, we introduce a new branch predictor that predicts the outcomes of branches by predicting the value of their inputs and performing an early computation of their results according to the predicted values. The design of a hybrid predictor comprising our branch predictor and a correlating branch predictor is presented. We also propose a new selector that chooses the most reliable prediction for each branch. This selector is based on the path followed to reach the branch. Results for immediate updates show a significant improvement with respect to a conventional hybrid predictor for different size configurations. In addition, the proposed hybrid predictor with a size of 8 KB achieves the same miss ratio as a conventional one of 64 KB. Performance evaluation for a dynamically-scheduled superscalar processor, with realistic updates, shows a speed-up of 11% despite its higher latency (up to 4 cycles). Peer reviewed. Postprint (published version).
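
The core idea, predicting a branch by predicting its input value and evaluating the condition early, can be sketched as follows. This is only an illustration: the abstract does not say which value predictor is used, so the stride predictor here is an assumption, and `StrideValuePredictor`, `predict_branch`, and the use of the branch PC as the table key are hypothetical names and choices. The hybrid predictor and path-based selector are not modeled.

```python
import operator


class StrideValuePredictor:
    """Predicts the next value of a branch input as last value + last stride."""

    def __init__(self):
        self.last = {}     # key -> most recent observed value
        self.stride = {}   # key -> difference between the last two values

    def predict(self, key):
        if key not in self.last:
            return None    # no history yet, so no prediction
        return self.last[key] + self.stride.get(key, 0)

    def update(self, key, value):
        if key in self.last:
            self.stride[key] = value - self.last[key]
        self.last[key] = value


def predict_branch(vp, pc, cond, rhs):
    """Evaluate the branch condition early on the predicted input value.

    Returns True/False for a predicted outcome, or None when the value
    predictor has no history for this branch.
    """
    value = vp.predict(pc)
    if value is None:
        return None
    return cond(value, rhs)


# Loop-closing branch `i < 4` at a hypothetical PC 0x40, with i = 0, 1, 2 seen so far.
vp = StrideValuePredictor()
for i in (0, 1, 2):
    vp.update(0x40, i)
taken = predict_branch(vp, 0x40, operator.lt, 4)  # predicted input is 3
```

Because the outcome is computed from a predicted operand rather than from outcome history, a predictor of this kind can anticipate the loop exit on the first pass through the loop, provided the input value follows a predictable pattern.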

UPCommons. Portal del coneixement obert de la UPC

Store Vulnerability Window (SVW): Re-Execution Filtering for Enhanced Load/Store Optimization

Author: Roth Amir
Publication venue: ScholarlyCommons
Publication date: 01/01/2004

A high-bandwidth, low-latency load-store unit is a critical component of a dynamically scheduled processor. Unfortunately, it is also one of the most complex and non-scalable components. Recently, several researchers have proposed techniques that simplify the core load-store unit and improve its scalability in exchange for the in-order pre-retirement re-execution of some subset of the loads in the program. We call such techniques load/store optimizations. One recent optimization attacks load queue (LQ) scalability by replacing the expensive associative search that is used to enforce intra- and inter-thread ordering with load re-execution. A second attacks store queue (SQ) scalability by speculatively filtering some load accesses and some store entries from it. The speculatively accessed, speculatively populated SQ can be made smaller and faster, but load re-execution is required to verify the speculation. A third uses a hardware table to identify redundant loads and skip their execution altogether. Redundant load elimination is highly accurate but not 100% accurate, so re-execution is needed to flag false eliminations. Unfortunately, the inherent benefits of load/store optimizations are mitigated by re-execution itself. Re-execution contends for cache bandwidth with store retirement, and serializes load re-execution with subsequent store retirement. If a particular technique requires a sufficient number of load re-executions, the cost of these re-executions will outweigh the benefits of the technique entirely and may even produce drastic slowdowns. This is the case for the SQ technique. Store Vulnerability Window (SVW) is a new mechanism that reduces the re-execution requirements of a given load/store optimization significantly, by an average of 85% across the three load/store optimizations we study. This reduction relieves cache port contention and removes many of the dynamic serialization events that contribute the bulk of re-execution’s cost, and allows these techniques to perform up to their full potential. For the scalable SQ optimization, this means the chance to perform at all. Without SVW, this technique posts significant slowdowns. SVW is a simple scheme based on monotonic store sequence numbering and a novel application of Bloom filtering. The cost of an effective SVW implementation is a 1KB buffer and a 2B field per LQ entry.
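
The combination of monotonic store sequence numbers and a Bloom-style table can be sketched as below. This is a simplified model under assumptions, not the paper's implementation: the class `SVWFilter`, its method names, the single hash function, and the 8-byte address granularity are hypothetical, and sequence-number wraparound is ignored. The default of 512 entries is chosen only because 512 two-byte entries would add up to the 1KB buffer the abstract mentions.

```python
class SVWFilter:
    """Filters load re-executions using store sequence numbers (a sketch)."""

    def __init__(self, num_entries=512):
        # Each entry holds the sequence number of the last retired store
        # whose address hashed to that entry (0 means no store yet).
        self.table = [0] * num_entries
        self.retired_ssn = 0  # sequence number of the youngest retired store

    def _index(self, addr):
        # Assumed hash: 8-byte granularity, then modulo the table size.
        return (addr >> 3) % len(self.table)

    def retire_store(self, addr):
        """Assign the next sequence number to a retiring store and record it."""
        self.retired_ssn += 1
        self.table[self._index(addr)] = self.retired_ssn
        return self.retired_ssn

    def snapshot(self):
        """Sequence number a load records when it executes: stores up to and
        including this one cannot have invalidated the load's value."""
        return self.retired_ssn

    def must_reexecute(self, addr, load_ssn):
        """A load re-executes only if a store younger than its snapshot may
        have written an address that hashes to the same entry."""
        return self.table[self._index(addr)] > load_ssn


svw = SVWFilter()
load_ssn = svw.snapshot()       # a load executes before any store retires
svw.retire_store(0x1000)        # a store to 0x1000 retires afterwards
conflict = svw.must_reexecute(0x1000, load_ssn)    # same address: re-execute
unrelated = svw.must_reexecute(0x1008, load_ssn)   # different entry: filtered out
aliased = svw.must_reexecute(0x2000, load_ssn)     # hash collision with 0x1000
```

As with any Bloom-style filter, collisions only cause unnecessary re-executions (as for `0x2000` here), never missed ones, so the filter stays safe while removing most of the re-execution traffic.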

ScholarlyCommons@Penn

Computing graph neural networks: A survey from algorithms to accelerators

Author: Abadal Cavallé Sergi
Alarcón Cot Eduardo José
Guirado Liñan Robert
Jain Akshay
López Alonso Jorge
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 01/12/2022

Graph Neural Networks (GNNs) have exploded onto the machine learning scene in recent years owing to their capability to model and learn from graph-structured data. Such an ability has strong implications in a wide variety of fields whose data are inherently relational, for which conventional neural networks do not perform well. Indeed, as recent reviews can attest, research in the area of GNNs has grown rapidly and has led to the development of a variety of GNN algorithm variants as well as to the exploration of ground-breaking applications in chemistry, neurology, electronics, or communication networks, among others. At the current stage of research, however, the efficient processing of GNNs is still an open challenge for several reasons. Besides their novelty, GNNs are hard to compute due to their dependence on the input graph, their combination of dense and very sparse operations, or the need to scale to huge graphs in some applications. In this context, this article aims to make two main contributions. On the one hand, a review of the field of GNNs is presented from the perspective of computing. This includes a brief tutorial on the GNN fundamentals, an overview of the evolution of the field in the last decade, and a summary of operations carried out in the multiple phases of different GNN algorithm variants. On the other hand, an in-depth analysis of current software and hardware acceleration schemes is provided, from which a hardware-software, graph-aware, and communication-centric vision for GNN accelerators is distilled. This work is possible thanks to funding from the European Union’s Horizon 2020 research and innovation programme under Grant No. 863337 (WiPLASH project) and the Spanish Ministry of Economy and Competitiveness under contract TEC2017-90034-C2-1-R (ALLIANCE project) that receives funding from FEDER. Peer reviewed. Postprint (published version).

UPCommons. Portal del coneixement obert de la UPC

Horizontally distributed inference of deep neural networks for AI-enabled IoT

Author: Campos Bastos Celso
Fernández Riverola Florentino
Rodríguez Conde Iván
Publication venue: Sistemas Informáticos de Nova Xeración
Publication date: 01/02/2023

Motivated by the pervasiveness of artificial intelligence (AI) and the Internet of Things (IoT) in the current “smart everything” scenario, this article provides a comprehensive overview of the most recent research at the intersection of both domains. It focuses on the design and development of specific mechanisms for enabling collaborative inference across edge devices towards the in situ execution of highly complex state-of-the-art deep neural networks (DNNs), despite the resource-constrained nature of such infrastructures. In particular, the review discusses the most salient approaches conceived along those lines, elaborating on the specificities of the partitioning schemes and the parallelism paradigms explored. It provides an organized and schematic discussion of the underlying workflows and associated communication patterns, as well as the architectural aspects of the DNNs that have driven the design of such techniques. It also highlights both the primary challenges encountered at the design and operational levels and the specific adjustments or enhancements explored in response to them. Funding: Agencia Estatal de Investigación | Ref. DPI2017-87494-R; Ministerio de Ciencia e Innovación | Ref. PDC2021-121644-I00; Xunta de Galicia | Ref. ED431C 2022/03-GR.

Investigo

Directory of Open Access Journals

Memory dependence prediction using store sets

Author: George Z. Chrysos
Hesson J.
Joel S. Emer
Lipasti M.
Steely S.
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date

Crossref

Design of a distributed memory unit for clustered microarchitectures

Author: Bieschewski Stefan
Publication venue: Universitat Politècnica de Catalunya
Publication date: 01/01/2013

Power constraints led to the end of exponential growth in single-processor performance, which characterized the semiconductor industry for many years. Single-chip multiprocessors have allowed the performance growth to continue so far. Yet, Amdahl’s law asserts that the overall performance of future single-chip multiprocessors will depend crucially on single-processor performance. In a multiprocessor, a small growth in single-processor performance can justify the use of significant resources. Partitioning the layout of critical components can improve the energy efficiency and ultimately the performance of a single processor. In a clustered microarchitecture, parts of these components form clusters. Instructions are processed locally in the clusters and benefit from the smaller size and complexity of the clusters' components. Because the clusters together process a single instruction stream, communications between clusters are necessary and introduce an additional cost. This thesis proposes the design of a distributed memory unit and first-level cache in the context of a clustered microarchitecture. While the partitioning of other parts of the microarchitecture has been well studied, the distribution of the memory unit and the cache has received comparatively little attention. The first proposal consists of a set of cache bank predictors. Eight different predictor designs are compared based on cost and accuracy. The second proposal is the distributed memory unit. The load and store queues are split into smaller queues for distributed disambiguation. The mapping of memory instructions to cache banks is delayed until addresses have been calculated. We show how disambiguation can be implemented efficiently with unordered queues. A bank predictor is used to map instructions that consume memory data near the data origin. We show that this organization significantly reduces both energy usage and latency. The third proposal introduces Dispatch Throttling and Pre-Access Queues. These mechanisms avoid load/store queue overflows that are a result of the late allocation of entries. The fourth proposal introduces Memory Issue Queues, which add functionality to select instructions for execution and re-execution to the memory unit. The fifth proposal introduces Conservative Deadlock Aware Entry Allocation. This mechanism is a deadlock-safe issue policy for the Memory Issue Queues. Deadlocks can result from certain queue allocations because entries are allocated out of order instead of in order as in traditional architectures. The sixth proposal is the Early Release of Load Queue Entries. Architectures with weak memory ordering such as Alpha, PowerPC or ARMv7 can take advantage of this mechanism to release load queue entries before the commit stage. Together, these proposals allow significantly smaller and more energy-efficient load queues without the need for energy-hungry recovery mechanisms and without performance penalties. Finally, we present a detailed study that compares the proposed distributed memory unit to a centralized memory unit and confirms its advantages of reduced energy usage and improved performance.

LAReferencia - Red Federada de Repositorios Institucionales de Publicaciones Científicas Latinoamericanas

UPCommons. Portal del coneixement obert de la UPC

Tesis Doctorals en Xarxa

Secretaría de Estado de Cultura

Design of a receiver for measurement of real-time ionospheric reflection height

Author: Raghavendar Changalvala
Publication venue
Publication date: 01/08/2005

Thesis (M.S.) University of Alaska Fairbanks, 2005.

The HF (high frequency) radar at Kodiak Island, Alaska, is part of the SuperDARN (Super Dual Auroral Radar Network) network of radars designed to detect echoes from ionospheric field-aligned density irregularities. Normal azimuth scans of the radar begin on whole-minute boundaries, leading to 12 s of downtime between scans. The radar makes use of this downtime by stepping through eight different frequencies for each beam direction using 1 or 2 s integration periods. A new receiver system has been developed at Poker Flat Research Range (PFRR) to utilize the ground scatter returns from the radar's sounding mode of operation and calculate the ionospheric virtual reflection height. This would result in considerable improvement in the accuracy of critical frequency and Angle Of Arrival (AOA) estimations made by the Kodiak SuperDARN.

Contents: Introduction -- Background -- Structure of the ionosphere -- Photoionization -- Recombination -- Layers -- Ionospheric refraction -- Ionospheric propagation -- Reflection at vertical incidence -- Virtual height concept -- Oblique incidence -- Motivation -- Problem statement and proposed solution -- Equipment overview -- Basic radar definitions -- Overview of the HF radar at Kodiak -- Frequency operation -- Sounding mode -- Antennas -- Power -- Receiver antenna -- Reflector analysis -- GPS clock card -- Clock card specifications -- Overview of PCI card control/status registers -- The synchronized generator: GPS mode outline -- Software time capture -- Event time capture -- Receiver card -- specifications -- The system design and implementation -- Specifications -- The pulse sequence -- The QNX operating system -- Configuring the clock card -- Configuring the GC214 -- Sampling -- Mixing -- Decimation -- Filtering -- Resampling -- GC214 latency -- Gain -- Data header format -- Direct memory access (DMA) -- DMA buffer creation -- RAM--disk -- External trigger synchronization -- Signal processing code -- Link budget -- Results and future work -- Final code -- Results -- Errors -- Applications -- Future work -- Bibliography

ScholarWorks@UA

Physical Register Reference Counting

Author: Roth Amir
Publication venue: ScholarlyCommons
Publication date: 01/01/2008

Several recently proposed techniques including CPR (Checkpoint Processing and Recovery) and NoSQ (No Store Queue) rely on reference counting to manage physical registers. However, the register reference counting mechanism itself has received surprisingly little attention. This paper fills this gap by describing potential register reference counting schemes for NoSQ, CPR, and a hypothetical NoSQ/CPR hybrid. Although reference counts have previously been described in terms of binary counters, we find that they are actually more naturally represented as matrices. Binary representations can be used as an optimization in specific situations.
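
One way to picture a matrix-style reference count is a bit matrix with one row per physical register and one column per potential referrer, where a register is free exactly when its row is all zeros. The sketch below illustrates that reading only; the abstract does not define the matrix layout, so the row/column interpretation, the class `RefCountMatrix`, and its method names are assumptions.

```python
class RefCountMatrix:
    """Bit-matrix reference counts: rows are physical registers, columns are
    referrer slots (for example in-flight instructions or checkpoints)."""

    def __init__(self, num_pregs, num_slots):
        self.num_slots = num_slots
        self.rows = [0] * num_pregs  # each row is stored as an integer bitmask

    def add_ref(self, preg, slot):
        """Mark that `slot` holds a reference to `preg`."""
        self.rows[preg] |= 1 << slot

    def drop_ref(self, preg, slot):
        """Clear a single reference from `slot` to `preg`."""
        self.rows[preg] &= ~(1 << slot)

    def clear_slot(self, slot):
        """Drop every reference held by `slot` by clearing its whole column."""
        mask = ~(1 << slot)
        self.rows = [row & mask for row in self.rows]

    def count(self, preg):
        """Number of referrers, i.e. what a binary counter would have stored."""
        return bin(self.rows[preg]).count("1")

    def is_free(self, preg):
        """A register can be reclaimed when no slot references it."""
        return self.rows[preg] == 0


refs = RefCountMatrix(num_pregs=4, num_slots=8)
refs.add_ref(2, 0)    # slot 0 references physical register 2
refs.add_ref(2, 5)    # slot 5 shares the same register
refs.clear_slot(0)    # slot 0 goes away; register 2 is still referenced
```

Under this reading, releasing every reference held by one referrer is a single column clear, whereas a plain counter would need to know which registers to decrement.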

ScholarlyCommons@Penn