The CMS Event Builder
The data acquisition system of the CMS experiment at the Large Hadron
Collider will employ an event builder which will combine data from about 500
data sources into full events at an aggregate throughput of 100 GByte/s.
Several architectures and switch technologies have been evaluated for the DAQ
Technical Design Report by measurements with test benches and by simulation.
This paper describes studies of an EVB test-bench based on 64 PCs acting as
data sources and data consumers and employing both Gigabit Ethernet and Myrinet
technologies as the interconnect. In the case of Ethernet, protocols based on
Layer-2 frames and on TCP/IP are evaluated. Results from ongoing studies,
including measurements of throughput and scaling, are presented.
The architecture of the baseline CMS event builder will be outlined. The
event builder is organised into two stages with intelligent buffers in between.
The first stage contains 64 switches performing a first level of data
concentration by building super-fragments from fragments of 8 data sources. The
second stage combines the 64 super-fragments into full events. This
architecture allows installation of the second stage of the event builder in
steps, with the overall throughput scaling linearly with the number of switches
in the second stage. Possible implementations of the components of the event
builder are discussed and the expected performance of the full event builder is
outlined.
Comment: Conference CHEP0
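The two-stage dataflow described in the abstract can be sketched as a toy model. The counts (8 fragments per super-fragment, 64 super-fragments) follow the text; the fragment payloads and function names are illustrative stand-ins, not the real CMS data formats or software.

```python
# Toy sketch of the two-stage event builder: a first stage concentrates
# fragments from 8 data sources into super-fragments, a second stage
# combines the 64 super-fragments into one full event.

N_SOURCES_PER_SF = 8    # first stage: 8 fragments -> 1 super-fragment
N_SUPER_FRAGMENTS = 64  # second stage: 64 super-fragments -> 1 full event

def first_stage(fragments):
    """First level of data concentration: group fragments 8 at a time."""
    assert len(fragments) == N_SOURCES_PER_SF * N_SUPER_FRAGMENTS
    return [fragments[i:i + N_SOURCES_PER_SF]
            for i in range(0, len(fragments), N_SOURCES_PER_SF)]

def second_stage(super_fragments):
    """Combine the 64 super-fragments into a full event."""
    assert len(super_fragments) == N_SUPER_FRAGMENTS
    return [frag for sf in super_fragments for frag in sf]

fragments = [f"frag-{i}" for i in range(512)]  # ~500 data sources
event = second_stage(first_stage(fragments))
print(len(event))  # 512 fragments reassembled into one event
```

Because the second stage only ever sees 64 super-fragment streams, extra second-stage switches can be added in steps, which is why the text expects throughput to scale linearly with their number.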
Switching techniques in data-acquisition systems for future experiments
An overview of the current state of development of parallel event-building techniques is given, with emphasis on future applications in the high-rate experiments proposed at the Large Hadron Collider (LHC). The paper describes the main architectural options in parallel event builders, the proposed event-building architectures for LHC experiments, and the use of standard networking protocols for event building and their limitations. The main issues around the potential use of circuit switching, message switching and packet switching are examined. Results from various laboratory demonstrator systems are presented.
Klessydra-T: Designing Vector Coprocessors for Multi-Threaded Edge-Computing Cores
Computation intensive kernels, such as convolutions, matrix multiplication
and Fourier transform, are fundamental to edge-computing AI, signal processing
and cryptographic applications. Interleaved-Multi-Threading (IMT) processor
cores are interesting to pursue energy efficiency and low hardware cost for
edge-computing, yet they need hardware acceleration schemes to run heavy
computational workloads. Following a vector approach to accelerate
computations, this study explores possible alternatives to implement vector
coprocessing units in RISC-V cores, showing the synergy between IMT and
data-level parallelism in the target workloads.
Comment: Final revision accepted for publication in IEEE Micro Journal
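The synergy between IMT and data-level parallelism rests on amortizing instruction fetch over many elements: one vector instruction does the work of VLEN scalar ones. A minimal sketch of that idea follows; the VLEN value and the dot-product kernel are illustrative choices, not the Klessydra-T design.

```python
# Scalar vs. vector-style multiply-accumulate: the vector version issues
# one "instruction" per VLEN-element chunk instead of one per element.

VLEN = 8  # elements per vector register (an arbitrary choice for illustration)

def scalar_dot(a, b):
    acc = 0
    for x, y in zip(a, b):  # one fetched MAC instruction per element
        acc += x * y
    return acc

def vector_dot(a, b):
    acc = 0
    for i in range(0, len(a), VLEN):           # one vector MAC per VLEN elements
        va, vb = a[i:i + VLEN], b[i:i + VLEN]  # vector register operands
        acc += sum(x * y for x, y in zip(va, vb))  # elementwise MAC + reduce
    return acc

a = list(range(32))
b = list(range(32))
print(scalar_dot(a, b) == vector_dot(a, b))  # True: same result, fewer "fetches"
```

In an IMT core the scalar pipeline keeps interleaving threads while the vector unit churns through such chunks, which is the combination the paper explores.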
3D-SoftChip: A novel 3D vertically integrated adaptive computing system [thesis]
At present, as we enter the nano- and giga-scaled integrated-circuit era, there are many system design challenges which must be overcome to resolve problems in current systems. The dramatically increased non-recurring engineering (NRE) cost, the abruptly shortened Time-to-Market (TTM) period and the ever-widening design productivity gap are good examples illustrating the problems in current systems. To cope with these problems, the concept of an Adaptive Computing System is becoming a critical technology for next-generation computing systems. The other big problem is an explosion in the interconnection wire requirements in standard planar technology, resulting from the very high data-bandwidth requirements demanded for real-time communications and multimedia signal processing. The concept of 3D vertical integration of 2D planar chips becomes an attractive solution to combat the ever-increasing interconnect wire requirements. As a result, this research proposes the concept of a novel 3D integrated adaptive computing system, which we term 3D-ACSoC. The architecture and advanced system design methodology of the proposed 3D-SoftChip as a forthcoming giga-scaled integrated-circuit computing system is introduced, along with high-level system modeling and functional verification in the early design stage using SystemC.
Movement of vector elements inside a de-coupled vector processing unit for high-performance memory operations
This thesis is part of the eProcessor project. Within it, the BSC is developing a RISC-V based decoupled vector accelerator. This accelerator must support the execution of vector memory instructions. More specifically, I have worked on the development of a set of modules that move data between the vector registers and the memory hierarchy, and that handle the correct mapping of the elements. For this task, it is essential to produce a design that meets the requirements of the project. This is a first implementation that may receive updates in the future.
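The element movement the thesis describes can be illustrated with a toy model of unit-stride and strided vector memory accesses, loosely modeled on RISC-V vector loads and stores. The function names, the flat-list memory, and the parameters are all illustrative assumptions, not the eProcessor design.

```python
# Toy model of mapping vector elements between memory and a vector register.

def vector_load(memory, base, vl, stride=1):
    """Gather vl elements from memory starting at base with a given stride."""
    return [memory[base + i * stride] for i in range(vl)]

def vector_store(memory, base, vreg, stride=1):
    """Scatter the elements of a vector register back to memory."""
    for i, value in enumerate(vreg):
        memory[base + i * stride] = value

mem = list(range(32))
vreg = vector_load(mem, base=0, vl=4, stride=4)  # strided load: [0, 4, 8, 12]
vector_store(mem, base=16, vreg=vreg)            # unit-stride store
print(vreg, mem[16:20])
```

A real decoupled unit must additionally reorder, align and buffer these transfers against the cache hierarchy, which is where the complexity the thesis addresses lives.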
Spatz: A Compact Vector Processing Unit for High-Performance and Energy-Efficient Shared-L1 Clusters
While parallel architectures based on clusters of Processing Elements (PEs)
sharing L1 memory are widespread, there is no consensus on how lean their PE
should be. Architecting PEs as vector processors holds the promise to greatly
reduce their instruction fetch bandwidth, mitigating the Von Neumann Bottleneck
(VNB). However, due to their historical association with supercomputers,
classical vector machines include micro-architectural tricks to improve the
Instruction Level Parallelism (ILP), which increases their instruction fetch
and decode energy overhead. In this paper, we explore for the first time vector
processing as an option to build small and efficient PEs for large-scale
shared-L1 clusters. We propose Spatz, a compact, modular 32-bit vector
processing unit based on the integer embedded subset of the RISC-V Vector
Extension version 1.0. A Spatz-based cluster with four Multiply-Accumulate
Units (MACUs) needs only 7.9 pJ per 32-bit integer multiply-accumulate
operation, 40% less energy than an equivalent cluster built with four Snitch
scalar cores. We analyzed Spatz's performance by integrating it within MemPool,
a large-scale many-core shared-L1 cluster. The Spatz-based MemPool system
achieves up to 285 GOPS when running a 256x256 32-bit integer matrix
multiplication, 70% more than the equivalent Snitch-based MemPool system. In
terms of energy efficiency, the Spatz-based MemPool system achieves up to 266
GOPS/W when running the same kernel, more than twice the energy efficiency of
the Snitch-based MemPool system, which reaches 128 GOPS/W. These results show
the viability of lean vector processors as high-performance and
energy-efficient PEs for large-scale clusters with tightly-coupled L1 memory.
Comment: 9 pages. Accepted for publication in the 2022 International
Conference on Computer-Aided Design (ICCAD 2022)
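The headline figures can be cross-checked with a little arithmetic; the implied Snitch throughput and cluster power below are derived from the abstract's numbers, not stated in it.

```python
# Sanity-check the ratios quoted in the abstract.
spatz_gops = 285                      # peak throughput, 256x256 int32 matmul
snitch_gops = spatz_gops / 1.7        # Spatz is quoted as "70% more"
spatz_eff, snitch_eff = 266, 128      # energy efficiency in GOPS/W

print(round(snitch_gops))             # implied Snitch-based peak, ~168 GOPS
print(spatz_eff / snitch_eff > 2)     # "more than twice the energy efficiency"
print(round(spatz_gops / spatz_eff, 2))  # implied cluster power, ~1.07 W
```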
Threading model optimization of the AEMB Microprocessor
AEMB is a 32-bit multithreaded RISC processor. It is a soft-core processor designed for FPGA implementation and available as open source. The processor runs the instruction set of the MicroBlaze processor developed by Xilinx. The current threading model in AEMB is a fine-grained model that interleaves threads one instruction at a time, with a separate register set for each thread. This project aims at understanding the architecture of the AEMB and improving the performance of its threading model. The chosen optimization is to change the current threading model to a coarse-grained one that switches threads on branch instructions. The advantage of this approach is that the pipeline no longer has to stall on every branch instruction executed, as the processor will be executing instructions from another thread. Thus, branches cause the processor to stall only when there are back-to-back branch instructions, or when two branch instructions are separated by a single instruction and the first of them has no delay slot. This is an improvement over the previous case, where the processor stalls for one cycle on any branch instruction encountered. The disadvantage of the coarse-grained threading model is that data hazards that cannot be forwarded can now cause the processor to stall for up to three cycles in the worst case, compared to a one-cycle stall in the old model. As for area consumption on FPGA, synthesis showed that the modified core utilizes double the number of LUTs of the original AEMB, but there was no significant increase in the number of registers. Further quantitative analysis is necessary to determine the total gain in performance by running suitable benchmarks on both versions of the processor. The results are expected to be in favor of the new design if the improved case is more common than the negatively affected cases.
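The stall trade-off can be illustrated with a toy model, not the AEMB pipeline itself: the fine-grained baseline is charged one stall per branch, while the coarse-grained model (which switches threads on a branch) stalls only when branches arrive back-to-back. For simplicity the sketch ignores the one-gap/no-delay-slot case and the multi-cycle data-hazard stalls mentioned above.

```python
# 'B' marks a branch instruction, '.' any other instruction.

def fine_grained_stalls(trace):
    """Old model: one stall cycle for every branch executed."""
    return trace.count('B')

def coarse_grained_stalls(trace):
    """New model: switching threads on a branch hides the stall unless
    another branch follows immediately (simplified back-to-back case)."""
    return sum(1 for a, b in zip(trace, trace[1:]) if a == 'B' and b == 'B')

trace = "..B...B..BB..B."
print(fine_grained_stalls(trace))    # 5 stalls in the old model
print(coarse_grained_stalls(trace))  # 1 stall (the back-to-back pair)
```

Whether the trade pays off depends on exactly the frequency question the abstract raises: how often branches cluster versus how often unforwardable hazards occur.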
Comparative Study of Keccak SHA-3 Implementations
This paper conducts an extensive comparative study of state-of-the-art solutions for implementing the SHA-3 hash function. SHA-3, a pivotal component in modern cryptography, has spawned numerous implementations across diverse platforms and technologies. This research aims to provide valuable insights into selecting and optimizing Keccak SHA-3 implementations. Our study encompasses an in-depth analysis of hardware, software, and software–hardware (hybrid) solutions. We assess the strengths, weaknesses, and performance metrics of each approach. Critical factors, including computational efficiency, scalability, and flexibility, are evaluated across different use cases. We investigate how each implementation performs in terms of speed and resource utilization. This research aims to improve the knowledge of cryptographic systems, aiding in the informed design and deployment of efficient cryptographic solutions. By providing a comprehensive overview of SHA-3 implementations, this study offers a clear understanding of the available options and equips professionals and researchers with the necessary insights to make informed decisions in their cryptographic endeavors.
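As a minimal software reference point for the implementations such a survey compares, Python's standard library ships SHA-3 (the Keccak-based FIPS 202 digests) in `hashlib`, which is handy for validating any hardware or hybrid implementation against known test vectors.

```python
import hashlib

# SHA3-256 of the classic "abc" input; the expected value is the
# published FIPS 202 test vector for this message.
digest = hashlib.sha3_256(b"abc").hexdigest()
print(digest)
assert digest == ("3a985da74fe225b2045c172d6bd390bd"
                  "855f086e3e9d525b46bfe24511431532")
```

Cross-checking a candidate implementation against such vectors is the usual first step before comparing its speed and resource utilization.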