The CMS Event Builder
The data acquisition system of the CMS experiment at the Large Hadron
Collider will employ an event builder which will combine data from about 500
data sources into full events at an aggregate throughput of 100 GByte/s.
Several architectures and switch technologies have been evaluated for the DAQ
Technical Design Report by measurements with test benches and by simulation.
This paper describes studies of an EVB test-bench based on 64 PCs acting as
data sources and data consumers and employing both Gigabit Ethernet and Myrinet
technologies as the interconnect. In the case of Ethernet, protocols based on
Layer-2 frames and on TCP/IP are evaluated. Results from ongoing studies,
including measurements of throughput and scaling, are presented.
The architecture of the baseline CMS event builder will be outlined. The
event builder is organised into two stages with intelligent buffers in between.
The first stage contains 64 switches performing a first level of data
concentration by building super-fragments from fragments of 8 data sources. The
second stage combines the 64 super-fragments into full events. This
architecture allows installation of the second stage of the event builder in
steps, with the overall throughput scaling linearly with the number of switches
in the second stage. Possible implementations of the components of the event
builder are discussed and the expected performance of the full event builder is
outlined.
Comment: Conference CHEP0
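The two-stage dataflow described in the abstract can be sketched as a toy model. The counts (8 fragments per super-fragment, 64 super-fragments) follow the text; the fragment payloads and function names are illustrative stand-ins, not the real CMS data formats or software.

```python
# Toy sketch of the two-stage event builder: a first stage concentrates
# fragments from 8 data sources into super-fragments, a second stage
# combines the 64 super-fragments into one full event.

N_SOURCES_PER_SF = 8    # first stage: 8 fragments -> 1 super-fragment
N_SUPER_FRAGMENTS = 64  # second stage: 64 super-fragments -> 1 full event

def first_stage(fragments):
    """First level of data concentration: group fragments 8 at a time."""
    assert len(fragments) == N_SOURCES_PER_SF * N_SUPER_FRAGMENTS
    return [fragments[i:i + N_SOURCES_PER_SF]
            for i in range(0, len(fragments), N_SOURCES_PER_SF)]

def second_stage(super_fragments):
    """Combine the 64 super-fragments into a full event."""
    assert len(super_fragments) == N_SUPER_FRAGMENTS
    return [frag for sf in super_fragments for frag in sf]

fragments = [f"frag-{i}" for i in range(512)]  # ~500 data sources
event = second_stage(first_stage(fragments))
print(len(event))  # 512 fragments reassembled into one event
```

Because the second stage only ever sees 64 super-fragment streams, extra second-stage switches can be added in steps, which is why the text expects throughput to scale linearly with their number.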
Switching techniques in data-acquisition systems for future experiments
An overview of the current state of development of parallel event-building techniques is given, with emphasis on future applications in the high-rate experiments proposed at the Large Hadron Collider (LHC). The paper describes the main architectural options in parallel event builders, the proposed event-building architectures for LHC experiments, and the use of standard networking protocols for event building and their limitations. The main issues around the potential use of circuit switching, message switching and packet switching are examined. Results from various laboratory demonstrator systems are presented.
Klessydra-T: Designing Vector Coprocessors for Multi-Threaded Edge-Computing Cores
Computation intensive kernels, such as convolutions, matrix multiplication
and Fourier transform, are fundamental to edge-computing AI, signal processing
and cryptographic applications. Interleaved-Multi-Threading (IMT) processor
cores are interesting to pursue energy efficiency and low hardware cost for
edge-computing, yet they need hardware acceleration schemes to run heavy
computational workloads. Following a vector approach to accelerate
computations, this study explores possible alternatives to implement vector
coprocessing units in RISC-V cores, showing the synergy between IMT and
data-level parallelism in the target workloads.
Comment: Final revision accepted for publication in IEEE Micro Journal
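The synergy between IMT and data-level parallelism rests on amortizing instruction fetch over many elements: one vector instruction does the work of VLEN scalar ones. A minimal sketch of that idea follows; the VLEN value and the dot-product kernel are illustrative choices, not the Klessydra-T design.

```python
# Scalar vs. vector-style multiply-accumulate: the vector version issues
# one "instruction" per VLEN-element chunk instead of one per element.

VLEN = 8  # elements per vector register (an arbitrary choice for illustration)

def scalar_dot(a, b):
    acc = 0
    for x, y in zip(a, b):  # one fetched MAC instruction per element
        acc += x * y
    return acc

def vector_dot(a, b):
    acc = 0
    for i in range(0, len(a), VLEN):           # one vector MAC per VLEN elements
        va, vb = a[i:i + VLEN], b[i:i + VLEN]  # vector register operands
        acc += sum(x * y for x, y in zip(va, vb))  # elementwise MAC + reduce
    return acc

a = list(range(32))
b = list(range(32))
print(scalar_dot(a, b) == vector_dot(a, b))  # True: same result, fewer "fetches"
```

In an IMT core the scalar pipeline keeps interleaving threads while the vector unit churns through such chunks, which is the combination the paper explores.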
3D-SoftChip: A novel 3D vertically integrated adaptive computing system [thesis]
At present, as we enter the nano- and giga-scaled integrated-circuit era, there are many system design challenges which must be overcome to resolve problems in current systems. The dramatically increased non-recurring engineering (NRE) cost, the abruptly shortened Time-to-Market (TTM) period and the ever-widening design productivity gap are good examples illustrating the problems in current systems. To cope with these problems, the concept of an Adaptive Computing System is becoming a critical technology for next-generation computing systems. The other big problem is an explosion in the interconnection wire requirements in standard planar technology, resulting from the very high data-bandwidth requirements demanded for real-time communications and multimedia signal processing. The concept of 3D vertical integration of 2D planar chips becomes an attractive solution to combat the ever-increasing interconnect wire requirements. As a result, this research proposes the concept of a novel 3D integrated adaptive computing system, which we term 3D-ACSoC. The architecture and advanced system design methodology of the proposed 3D-SoftChip as a forthcoming giga-scaled integrated-circuit computing system is introduced, along with high-level system modeling and functional verification in the early design stage using SystemC.
Movement of vector elements inside a de-coupled vector processing unit for high-performance memory operations
This thesis is part of the eProcessor project. Within it, the BSC is developing a RISC-V based decoupled vector accelerator. This accelerator must support the execution of vector memory instructions. More specifically, I have worked on the development of a set of modules that move data between the vector registers and the memory hierarchy, and that handle the correct mapping of the elements. For this task, it is essential to produce a design that meets the requirements of the project. This is a first implementation that may receive updates in the future.
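The element movement the thesis describes can be illustrated with a toy model of unit-stride and strided vector memory accesses, loosely modeled on RISC-V vector loads and stores. The function names, the flat-list memory, and the parameters are all illustrative assumptions, not the eProcessor design.

```python
# Toy model of mapping vector elements between memory and a vector register.

def vector_load(memory, base, vl, stride=1):
    """Gather vl elements from memory starting at base with a given stride."""
    return [memory[base + i * stride] for i in range(vl)]

def vector_store(memory, base, vreg, stride=1):
    """Scatter the elements of a vector register back to memory."""
    for i, value in enumerate(vreg):
        memory[base + i * stride] = value

mem = list(range(32))
vreg = vector_load(mem, base=0, vl=4, stride=4)  # strided load: [0, 4, 8, 12]
vector_store(mem, base=16, vreg=vreg)            # unit-stride store
print(vreg, mem[16:20])
```

A real decoupled unit must additionally reorder, align and buffer these transfers against the cache hierarchy, which is where the complexity the thesis addresses lives.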
Spatz: A Compact Vector Processing Unit for High-Performance and Energy-Efficient Shared-L1 Clusters
While parallel architectures based on clusters of Processing Elements (PEs)
sharing L1 memory are widespread, there is no consensus on how lean their PE
should be. Architecting PEs as vector processors holds the promise to greatly
reduce their instruction fetch bandwidth, mitigating the Von Neumann Bottleneck
(VNB). However, due to their historical association with supercomputers,
classical vector machines include micro-architectural tricks to improve the
Instruction Level Parallelism (ILP), which increases their instruction fetch
and decode energy overhead. In this paper, we explore for the first time vector
processing as an option to build small and efficient PEs for large-scale
shared-L1 clusters. We propose Spatz, a compact, modular 32-bit vector
processing unit based on the integer embedded subset of the RISC-V Vector
Extension version 1.0. A Spatz-based cluster with four Multiply-Accumulate
Units (MACUs) needs only 7.9 pJ per 32-bit integer multiply-accumulate
operation, 40% less energy than an equivalent cluster built with four Snitch
scalar cores. We analyzed Spatz's performance by integrating it within MemPool,
a large-scale many-core shared-L1 cluster. The Spatz-based MemPool system
achieves up to 285 GOPS when running a 256x256 32-bit integer matrix
multiplication, 70% more than the equivalent Snitch-based MemPool system. In
terms of energy efficiency, the Spatz-based MemPool system achieves up to 266
GOPS/W when running the same kernel, more than twice the energy efficiency of
the Snitch-based MemPool system, which reaches 128 GOPS/W. These results show
the viability of lean vector processors as high-performance and
energy-efficient PEs for large-scale clusters with tightly-coupled L1 memory.
Comment: 9 pages. Accepted for publication in the 2022 International
Conference on Computer-Aided Design (ICCAD 2022)
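The headline figures can be cross-checked with a little arithmetic; the implied Snitch throughput and cluster power below are derived from the abstract's numbers, not stated in it.

```python
# Sanity-check the ratios quoted in the abstract.
spatz_gops = 285                      # peak throughput, 256x256 int32 matmul
snitch_gops = spatz_gops / 1.7        # Spatz is quoted as "70% more"
spatz_eff, snitch_eff = 266, 128      # energy efficiency in GOPS/W

print(round(snitch_gops))             # implied Snitch-based peak, ~168 GOPS
print(spatz_eff / snitch_eff > 2)     # "more than twice the energy efficiency"
print(round(spatz_gops / spatz_eff, 2))  # implied cluster power, ~1.07 W
```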
Threading model optimization of the AEMB Microprocessor
AEMB is a 32-bit multithreaded RISC processor. It is a soft-core processor designed for FPGA implementation and available as open source. The processor runs the instruction set of the MicroBlaze processor developed by Xilinx. The current threading model in AEMB is a fine-grained model that interleaves threads one instruction at a time, with a separate register set for each thread. This project aims at understanding the architecture of the AEMB and improving the performance of its threading model. The chosen optimization is to change the current threading model to a coarse-grained one that switches threads on branch instructions. The advantage of this approach is that the pipeline no longer has to stall on every branch instruction executed, as the processor will be executing instructions from another thread. Thus, branches cause the processor to stall only when there are back-to-back branch instructions, or when two branch instructions are separated by a single instruction and the first of them has no delay slot. This is an improvement over the previous case, where the processor stalls for one cycle on any branch instruction encountered. The disadvantage of the coarse-grained threading model is that data hazards that cannot be forwarded can now cause the processor to stall for up to three cycles in the worst case, compared to a one-cycle stall in the old model. As for area consumption on FPGA, synthesis showed that the modified core utilizes double the number of LUTs of the original AEMB, but there was no significant increase in the number of registers. Further quantitative analysis is necessary to determine the total gain in performance by running suitable benchmarks on both versions of the processor. The results are expected to be in favor of the new design if the improved case is more common than the negatively affected cases.
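The stall trade-off can be illustrated with a toy model, not the AEMB pipeline itself: the fine-grained baseline is charged one stall per branch, while the coarse-grained model (which switches threads on a branch) stalls only when branches arrive back-to-back. For simplicity the sketch ignores the one-gap/no-delay-slot case and the multi-cycle data-hazard stalls mentioned above.

```python
# 'B' marks a branch instruction, '.' any other instruction.

def fine_grained_stalls(trace):
    """Old model: one stall cycle for every branch executed."""
    return trace.count('B')

def coarse_grained_stalls(trace):
    """New model: switching threads on a branch hides the stall unless
    another branch follows immediately (simplified back-to-back case)."""
    return sum(1 for a, b in zip(trace, trace[1:]) if a == 'B' and b == 'B')

trace = "..B...B..BB..B."
print(fine_grained_stalls(trace))    # 5 stalls in the old model
print(coarse_grained_stalls(trace))  # 1 stall (the back-to-back pair)
```

Whether the trade pays off depends on exactly the frequency question the abstract raises: how often branches cluster versus how often unforwardable hazards occur.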
Comparative Study of Keccak SHA-3 Implementations
This paper conducts an extensive comparative study of state-of-the-art solutions for implementing the SHA-3 hash function. SHA-3, a pivotal component in modern cryptography, has spawned numerous implementations across diverse platforms and technologies. This research aims to provide valuable insights into selecting and optimizing Keccak SHA-3 implementations. Our study encompasses an in-depth analysis of hardware, software, and software–hardware (hybrid) solutions. We assess the strengths, weaknesses, and performance metrics of each approach. Critical factors, including computational efficiency, scalability, and flexibility, are evaluated across different use cases. We investigate how each implementation performs in terms of speed and resource utilization. This research aims to improve the knowledge of cryptographic systems, aiding in the informed design and deployment of efficient cryptographic solutions. By providing a comprehensive overview of SHA-3 implementations, this study offers a clear understanding of the available options and equips professionals and researchers with the necessary insights to make informed decisions in their cryptographic endeavors.
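As a minimal software reference point for the implementations such a survey compares, Python's standard library ships SHA-3 (the Keccak-based FIPS 202 digests) in `hashlib`, which is handy for validating any hardware or hybrid implementation against known test vectors.

```python
import hashlib

# SHA3-256 of the classic "abc" input; the expected value is the
# published FIPS 202 test vector for this message.
digest = hashlib.sha3_256(b"abc").hexdigest()
print(digest)
assert digest == ("3a985da74fe225b2045c172d6bd390bd"
                  "855f086e3e9d525b46bfe24511431532")
```

Cross-checking a candidate implementation against such vectors is the usual first step before comparing its speed and resource utilization.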