367 research outputs found

    Prototyping of petalets for the Phase-II Upgrade of the silicon strip tracking detector of the ATLAS Experiment

    In the high luminosity era of the Large Hadron Collider, the HL-LHC, the instantaneous luminosity is expected to reach unprecedented values, resulting in about 200 proton-proton interactions in a typical bunch crossing. To cope with the resultant increase in occupancy, bandwidth and radiation damage, the ATLAS Inner Detector will be replaced by an all-silicon system, the Inner Tracker (ITk). The ITk consists of a silicon pixel and a strip detector and exploits the concept of modularity. Prototyping and testing of various strip detector components has been carried out. This paper presents the developments and results obtained with reduced-size structures equivalent to those foreseen to be used in the forward region of the silicon strip detector. Referred to as petalets, these structures are built around a composite sandwich with embedded cooling pipes and electrical tapes for routing the signals and power. Detector modules built using electronic flex boards and silicon strip sensors are glued on both the front and back side surfaces of the carbon structure. Details are given on the assembly, testing and evaluation of several petalets. Measurement results of both mechanical and electrical quantities are shown. Moreover, an outlook is given for improved prototyping plans for large structures. Comment: 22 pages for submission for Journal of Instrumentation

    Scalable communication for high-order stencil computations using CUDA-aware MPI

    Modern compute nodes in high-performance computing provide a tremendous level of parallelism and processing power. However, as arithmetic performance has been observed to increase at a faster rate relative to memory and network bandwidths, optimizing data movement has become critical for achieving strong scaling in many communication-heavy applications. This performance gap has been further accentuated with the introduction of graphics processing units, which can provide several times higher throughput in data-parallel tasks than central processing units. In this work, we explore the computational aspects of iterative stencil loops and implement a generic communication scheme using CUDA-aware MPI, which we use to accelerate magnetohydrodynamics simulations based on high-order finite differences and third-order Runge-Kutta integration. We put particular focus on improving intra-node locality of workloads. In comparison to a theoretical performance model, our implementation exhibits strong scaling from one to 64 devices at 50%--87% efficiency in sixth-order stencil computations when the problem domain consists of 256^3--1024^3 cells. Comment: 17 pages, 15 figures

    Automated Scratchpad Mapping and Allocation for Embedded Processors

    Embedded system-on-chip processors such as the Texas Instruments C66 DSP and the IBM Cell provide the programmer with a software-controlled on-chip memory to supplement a traditional but simple two-level cache. By decomposing data sets and their corresponding workload into small subsets that fit within this on-chip memory, the processor can potentially achieve equivalent or better performance, power efficiency, and area efficiency than with its sophisticated cache. However, program-controlled on-chip memory requires a shift in the responsibility for management and allocation from the hardware to the programmer. Specifically, this requires the explicit mapping of program arrays to specific types of on-chip memory structure and the addition of supporting code that allocates and manages the on-chip memory. Previous work in tiling focuses on automated loop transformations but is hardware agnostic and does not incorporate a performance model of the underlying memory design. In this work we will explore the relationship between mapping and allocation of tiles for stencil loops and linear algebra kernels on the Texas Instruments Keystone II DSP platform.

    Memory controller for vector processor

    To manage power and memory wall effects, the HPC industry supports FPGA reconfigurable accelerators and vector processing cores for data-intensive scientific applications. FPGA-based vector accelerators are used to increase the performance of high-performance application kernels. Adding more vector lanes does not improve performance if the processor/memory performance gap dominates. In addition, if on/off-chip communication time becomes more critical than computation time, performance degrades. The system incurs multiple delays due to the application's irregular data arrangement and complex scheduling scheme. Therefore, just like generic scalar processors, all classes of vector machines, from vector supercomputers to vector microprocessors, are required to have data management and access units that improve the on/off-chip bandwidth and hide main memory latency. In this work, we propose an Advanced Programmable Vector Memory Controller (PVMC), which boosts noncontiguous vector data accesses by integrating descriptors of memory patterns, a specialized on-chip memory, a memory manager in hardware, and multiple DRAM controllers. We implemented and validated the proposed system on an Altera DE4 FPGA board. The PVMC is also integrated with an ARM Cortex-A9 processor on the Xilinx Zynq All-Programmable System on Chip architecture. We compare the performance of a system with vector and scalar processors without PVMC. When compared with a baseline vector system, the results show that the PVMC system transfers data sets 1.40x to 2.12x faster, achieves speedups between 2.01x and 4.53x for 10 applications, and consumes 2.56 to 4.04 times less energy. Peer Reviewed. Postprint (author's final draft)

    A Device to Record Naturally Daily Wrist Motion

    We introduce a new device to record and store wrist motion activity data. The motivation for creating this device was that such data can be used to detect periods of eating or the number of bites consumed. There is no similar device available on the market. This device uses components that have recently been introduced to the market, and newer techniques that can be used for low-quantity production. The production cost for this device was $52, similar to other fitness trackers on the market. The device was capable of recording wrist motion activity for 24 hours and was similar in weight to a wristwatch.