Prototyping of petalets for the Phase-II Upgrade of the silicon strip tracking detector of the ATLAS Experiment
In the high luminosity era of the Large Hadron Collider, the HL-LHC, the
instantaneous luminosity is expected to reach unprecedented values, resulting
in about 200 proton-proton interactions in a typical bunch crossing. To cope
with the resultant increase in occupancy, bandwidth and radiation damage, the
ATLAS Inner Detector will be replaced by an all-silicon system, the Inner
Tracker (ITk). The ITk consists of a silicon pixel and a strip detector and
exploits the concept of modularity. Prototyping and testing of various strip
detector components has been carried out. This paper presents the developments
and results obtained with reduced-size structures equivalent to those foreseen
to be used in the forward region of the silicon strip detector. Referred to as
petalets, these structures are built around a composite sandwich with embedded
cooling pipes and electrical tapes for routing the signals and power. Detector
modules built using electronic flex boards and silicon strip sensors are glued
on both the front and back side surfaces of the carbon structure. Details are
given on the assembly, testing and evaluation of several petalets. Measurement
results of both mechanical and electrical quantities are shown. Moreover, an
outlook is given for improved prototyping plans for large structures.
Comment: 22 pages for submission for Journal of Instrumentation
Scalable communication for high-order stencil computations using CUDA-aware MPI
Modern compute nodes in high-performance computing provide a tremendous level
of parallelism and processing power. However, as arithmetic performance has
been observed to increase at a faster rate relative to memory and network
bandwidths, optimizing data movement has become critical for achieving strong
scaling in many communication-heavy applications. This performance gap has been
further accentuated by the introduction of graphics processing units, which
can provide several times higher throughput in data-parallel tasks than
central processing units. In this work, we explore the computational aspects of
iterative stencil loops and implement a generic communication scheme using
CUDA-aware MPI, which we use to accelerate magnetohydrodynamics simulations
based on high-order finite differences and third-order Runge-Kutta integration.
We put particular focus on improving intra-node locality of workloads. In
comparison to a theoretical performance model, our implementation exhibits
strong scaling from one to devices at -- efficiency in
sixth-order stencil computations when the problem domain consists of
-- cells.
Comment: 17 pages, 15 figures
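As a rough illustration of the halo-exchange pattern behind a distributed high-order stencil, the sketch below applies a sixth-order central difference to a 1D periodic array that is split into blocks. It is not the paper's implementation: the list copies stand in for CUDA-aware MPI transfers, and the names `exchange_halos`, `derivative_interior`, and `distributed_derivative` are hypothetical. It assumes the array length is divisible by the number of blocks and that each block holds at least three cells.

```python
import math

# Sixth-order central-difference coefficients for the first derivative,
# listed for offsets -3..+3. A radius of 3 means three halo cells per side.
COEFFS = (-1 / 60, 3 / 20, -3 / 4, 0.0, 3 / 4, -3 / 20, 1 / 60)
RADIUS = 3


def derivative_interior(padded, h):
    """Apply the stencil to a block that already carries RADIUS halo cells per side."""
    n = len(padded) - 2 * RADIUS
    return [
        sum(c * padded[i + k] for k, c in enumerate(COEFFS)) / h
        for i in range(n)
    ]


def exchange_halos(blocks):
    """Periodic halo exchange between neighbouring blocks.

    Assumption: plain list copies stand in for the send/receive calls
    that a real multi-device code would issue.
    """
    p = len(blocks)
    padded = []
    for r in range(p):
        left = blocks[(r - 1) % p][-RADIUS:]   # last cells of the left neighbour
        right = blocks[(r + 1) % p][:RADIUS]   # first cells of the right neighbour
        padded.append(left + blocks[r] + right)
    return padded


def distributed_derivative(values, parts, h):
    """Split the domain into `parts` blocks, exchange halos, and differentiate."""
    size = len(values) // parts
    blocks = [values[r * size:(r + 1) * size] for r in range(parts)]
    out = []
    for block in exchange_halos(blocks):
        out.extend(derivative_interior(block, h))
    return out


if __name__ == "__main__":
    n = 64
    h = 2 * math.pi / n
    f = [math.sin(i * h) for i in range(n)]
    df = distributed_derivative(f, 4, h)
    print(max(abs(d - math.cos(i * h)) for i, d in enumerate(df)))
```

Because every block receives exactly the cells its stencil reaches, the result does not depend on how many blocks the domain is split into; only the volume of halo traffic changes.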
Automated Scratchpad Mapping and Allocation for Embedded Processors
Embedded system-on-chip processors such as the Texas Instruments C66 DSP and the IBM Cell provide the programmer with a software-controlled on-chip memory to supplement a traditional but simple two-level cache. By decomposing data sets and their corresponding workload into small subsets that fit within this on-chip memory, the processor can potentially achieve equivalent or better performance, power efficiency, and area efficiency than with its sophisticated cache. However, program-controlled on-chip memory requires a shift in the responsibility for management and allocation from the hardware to the programmer. Specifically, this requires the explicit mapping of program arrays to specific types of on-chip memory structure and the addition of supporting code that allocates and manages the on-chip memory. Previous work in tiling focuses on automated loop transformations but is hardware agnostic and does not incorporate a performance model of the underlying memory design. In this work we explore the relationship between mapping and allocation of tiles for stencil loops and linear algebra kernels on the Texas Instruments Keystone II DSP platform.
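The idea of decomposing a kernel into tiles that fit a software-managed memory can be sketched with a tiled matrix multiply. This is a minimal illustration, not the mapping or allocation scheme studied in the work: the small Python lists `a_tile`, `b_tile`, and `acc` merely stand in for scratchpad buffers, and the helper names `scratchpad_footprint` and `tiled_matmul` are hypothetical.

```python
def scratchpad_footprint(tile, elem_bytes=4):
    """Bytes needed to hold one A tile, one B tile, and one accumulator tile."""
    return 3 * tile * tile * elem_bytes


def tiled_matmul(a, b, n, tile):
    """Multiply two n x n matrices, staging explicit tiles for each block of work."""
    c = [[0.0] * n for _ in range(n)]
    for i0 in range(0, n, tile):
        i1 = min(i0 + tile, n)
        for j0 in range(0, n, tile):
            j1 = min(j0 + tile, n)
            # Accumulator tile lives in the (simulated) scratchpad.
            acc = [[0.0] * (j1 - j0) for _ in range(i1 - i0)]
            for k0 in range(0, n, tile):
                k1 = min(k0 + tile, n)
                # Explicit copy-in: on real hardware these would be DMA transfers.
                a_tile = [row[k0:k1] for row in a[i0:i1]]
                b_tile = [row[j0:j1] for row in b[k0:k1]]
                for i, a_row in enumerate(a_tile):
                    for k, a_ik in enumerate(a_row):
                        b_row = b_tile[k]
                        for j in range(len(b_row)):
                            acc[i][j] += a_ik * b_row[j]
            # Explicit copy-out of the finished tile to main memory.
            for i, row in enumerate(acc):
                c[i0 + i][j0:j1] = row
    return c
```

Under these assumptions, the tile size would be chosen so that `scratchpad_footprint(tile)` stays within the on-chip memory available to the kernel.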
Memory controller for vector processor
To manage power and memory-wall effects, the HPC industry supports FPGA reconfigurable accelerators and vector processing cores for data-intensive scientific applications. FPGA-based vector accelerators are used to increase the performance of high-performance application kernels. Adding more vector lanes does not improve performance if the processor/memory performance gap dominates. In addition, if on/off-chip communication time becomes more critical than computation time, performance degrades. The system incurs multiple delays due to the application's irregular data arrangement and complex scheduling scheme. Therefore, just like generic scalar processors, all classes of vector machines, from vector supercomputers to vector microprocessors, are required to have data management and access units that improve the on/off-chip bandwidth and hide main memory latency. In this work, we propose an Advanced Programmable Vector Memory Controller (PVMC), which boosts noncontiguous vector data accesses by integrating descriptors of memory patterns, a specialized on-chip memory, a memory manager in hardware, and multiple DRAM controllers. We implemented and validated the proposed system on an Altera DE4 FPGA board. The PVMC is also integrated with an ARM Cortex-A9 processor on the Xilinx Zynq All-Programmable System on Chip architecture. We compare the performance of the PVMC system with vector and scalar processor systems without PVMC. When compared with a baseline vector system, the results show that the PVMC system transfers data sets 1.40x to 2.12x faster, achieves speedups of 2.01x to 4.53x for 10 applications, and consumes 2.56 to 4.04 times less energy.
Peer Reviewed. Postprint (author's final draft)
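One way to picture a descriptor of a memory pattern is as a compact record that expands into the addresses of a noncontiguous access. The sketch below is purely illustrative and does not reflect the PVMC's actual descriptor format: the `StridedDescriptor` fields (`base`, `stride`, `count`, `block`) and the `gather` helper are assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StridedDescriptor:
    """Hypothetical descriptor: `count` blocks of `block` elements, `stride` apart."""
    base: int
    stride: int
    count: int
    block: int = 1

    def addresses(self):
        """Yield every element address covered by the pattern, in order."""
        for i in range(self.count):
            start = self.base + i * self.stride
            yield from range(start, start + self.block)


def gather(memory, desc):
    """Collect a noncontiguous pattern into a contiguous buffer."""
    return [memory[addr] for addr in desc.addresses()]


if __name__ == "__main__":
    memory = list(range(100))
    # Column 2 of a 4 x 5 row-major matrix stored at address 0.
    column = StridedDescriptor(base=2, stride=5, count=4)
    print(gather(memory, column))
```

A single descriptor of this kind replaces a stream of individual address computations, which is the general motivation for handing access patterns to a memory controller.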
A Device to Record Natural Daily Wrist Motion
We introduce a new device to record and store wrist motion activity data. The device was motivated by the fact that these data can be used to detect periods of eating or to count the number of bites consumed. No similar device is available on the market. The device uses components that have only recently been introduced to the market, and newer techniques suited to low-quantity production. The production cost of the device was $52, similar to other fitness trackers on the market. The device was capable of recording wrist motion activity for 24 hours and was similar in weight to a wristwatch.