
    CXL Memory as Persistent Memory for Disaggregated HPC: A Practical Approach

    In the landscape of High-Performance Computing (HPC), the quest for efficient and scalable memory solutions remains paramount. The advent of Compute Express Link (CXL) introduces a promising avenue with its potential to function as a Persistent Memory (PMem) solution in the context of disaggregated HPC systems. This paper presents a comprehensive exploration of CXL memory's viability as a candidate for PMem, supported by physical experiments conducted on cutting-edge multi-NUMA nodes equipped with CXL-attached memory prototypes. Our study not only benchmarks the performance of CXL memory but also illustrates the seamless transition from traditional PMem programming models to CXL, reinforcing its practicality. To substantiate our claims, we establish a tangible CXL prototype using an FPGA card embodying CXL 1.1/2.0 compliant endpoint designs (Intel FPGA CXL IP). Performance evaluations, executed through the STREAM and STREAM-PMem benchmarks, showcase CXL memory's ability to mirror PMem characteristics in App-Direct and Memory Mode while achieving impressive bandwidth metrics with Intel 4th generation Xeon (Sapphire Rapids) processors. The results elucidate the feasibility of CXL memory as a persistent memory solution, outperforming previously established benchmarks. In contrast to published DCPMM results, our CXL-DDR4 memory module offers comparable bandwidth to local DDR4 memory configurations, albeit with a moderate decrease in performance. The modified STREAM-PMem application underscores the ease of transitioning programming models from PMem to CXL, further confirming the practicality of adopting CXL memory. Comment: 12 pages, 9 figures
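
    As a concrete illustration of the programming-model point above, the following minimal C sketch runs a STREAM-style triad kernel on arrays explicitly placed on a chosen NUMA node, such as one backed by CXL-attached memory. It is not the paper's benchmark code; the node id, array size and use of libnuma are illustrative assumptions.

        /* Minimal sketch (not the paper's code): a STREAM-style triad kernel with
         * the arrays placed on a specific NUMA node, e.g. one exposing CXL-attached
         * memory. Node id and array size are illustrative assumptions.
         * Build with: gcc -O2 stream_cxl.c -lnuma */
        #include <stdio.h>
        #include <stdlib.h>
        #include <numa.h>

        #define N (1L << 26)        /* 64 Mi doubles per array (~512 MiB each) */
        #define CXL_NODE 2          /* hypothetical NUMA node backed by CXL memory */

        int main(void) {
            if (numa_available() < 0) { fprintf(stderr, "NUMA not available\n"); return 1; }
            double *a = numa_alloc_onnode(N * sizeof(double), CXL_NODE);
            double *b = numa_alloc_onnode(N * sizeof(double), CXL_NODE);
            double *c = numa_alloc_onnode(N * sizeof(double), CXL_NODE);
            if (!a || !b || !c) { fprintf(stderr, "allocation failed\n"); return 1; }
            for (long i = 0; i < N; i++) { b[i] = 1.0; c[i] = 2.0; }
            const double scalar = 3.0;
            for (long i = 0; i < N; i++)   /* triad: a[i] = b[i] + scalar * c[i] */
                a[i] = b[i] + scalar * c[i];
            printf("a[0] = %f\n", a[0]);
            numa_free(a, N * sizeof(double));
            numa_free(b, N * sizeof(double));
            numa_free(c, N * sizeof(double));
            return 0;
        }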

    Dependability investigation of wireless short range embedded systems: hardware platform oriented approach

    A new direction in short-range wireless applications has appeared in the form of high-speed data communication devices for distances of hundreds of meters. Behind these embedded applications lies a complex heterogeneous architecture. Moreover, these short-range communications are being introduced into critical applications, where dependability/reliability is mandatory. Thus, dependability concerns around reliability evaluation become a major challenge in these systems and pose several questions to answer. Obviously, in such systems, the reliability attribute has to be investigated for various components and at different abstraction levels. In this paper, we discuss the investigation of dependability in wireless short-range systems. We present a hardware platform for wireless system dependability analysis as an alternative to time-consuming simulation techniques. The platform is built using several instances of a commercial FPGA platform available on the market. We describe the different steps of building the wireless hardware platform for short-range system dependability analysis. We then show how this hardware-platform-based dependability investigation framework enables a highly interactive approach. Based on this platform, we introduce a new methodology and a flow to investigate the different parts of system dependability at different abstraction levels. The benefits of using the proposed framework are threefold: first, it covers the whole system (HW/SW digital part, mixed RF part, and wireless part); second, the hardware platform makes it possible to explore the application’s reliability under real environmental conditions, taking into account the effect of environmental threats on the system; and last, the wireless platform built for dependability investigation provides a fast investigation approach compared with the time-consuming co-simulation technique.
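
    To make the dependability-evaluation idea more concrete, the sketch below is a purely software analogue of fault injection: it flips a random bit in a data frame and checks whether a simple checksum detects the corruption. The paper's platform injects faults in FPGA hardware under real radio conditions; the frame size, XOR checksum and injection scheme here are illustrative assumptions only.

        /* Illustrative software analogue of fault injection (not the paper's
         * hardware platform): flip one random bit in a frame and test whether a
         * simple XOR checksum detects it. */
        #include <stdio.h>
        #include <stdint.h>
        #include <stdlib.h>
        #include <time.h>

        static uint8_t checksum(const uint8_t *buf, size_t len) {
            uint8_t c = 0;
            for (size_t i = 0; i < len; i++) c ^= buf[i];   /* simple XOR checksum */
            return c;
        }

        int main(void) {
            uint8_t frame[32];
            for (size_t i = 0; i < sizeof frame; i++) frame[i] = (uint8_t)i;
            uint8_t ref = checksum(frame, sizeof frame);

            srand((unsigned)time(NULL));
            size_t byte = (size_t)rand() % sizeof frame;    /* inject a single bit flip */
            frame[byte] ^= (uint8_t)(1u << (rand() % 8));

            printf("fault %sdetected\n",
                   checksum(frame, sizeof frame) != ref ? "" : "NOT ");
            return 0;
        }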

    Management and Service-aware Networking Architectures (MANA) for Future Internet Position Paper: System Functions, Capabilities and Requirements

    Future Internet (FI) research and development threads have recently been gaining momentum all over the world, and as such the international race to create a new-generation Internet is in full swing: GENI, Asia Future Internet, Future Internet Forum Korea, European Union Future Internet Assembly (FIA). This is a position paper identifying the research orientation, with a time horizon of 10 years, together with the key challenges for the capabilities in the Management and Service-aware Networking Architectures (MANA) part of the Future Internet (FI), allowing for parallel and federated Internet(s).

    Is there a Moore's law for quantum computing?

    There is a common wisdom according to which many technologies can progress following some exponential law, like the empirical Moore's law that was validated for over half a century by the growth of the number of transistors in chips. As a technology still in the making, with many potential promises, quantum computing is supposed to follow the pack and grow inexorably to maturity. The Holy Grail in that domain is a large quantum computer with thousands of error-corrected logical qubits, themselves made of thousands, if not more, physical qubits. These would enable molecular simulations as well as factoring 2048-bit RSA keys, among other use cases taken from the book of classically intractable computing problems. How far are we from this? Less than 15 years, according to many predictions. We will see in this paper that Moore's empirical law cannot easily be translated to an equivalent in quantum computing. Qubits have various figures of merit that will not progress magically thanks to some new manufacturing capability. However, some equivalents of Moore's law may be at play inside and outside the quantum realm, for example with quantum computing's enabling technologies, cryogenics and control electronics. Algorithms, software tools and engineering also play a key role as enablers of quantum computing progress. While much of quantum computing's future outcome depends on qubit fidelities, these are progressing rather slowly, particularly at scale. We will finally see that other figures of merit will come into play and potentially change the landscape, like the quality of computed results and the energetics of quantum computing. Although scientific and technological in nature, this inventory has broad business implications for investment, education and cybersecurity-related decision-making processes. Comment: 32 pages, 24 figures
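
    To give a feel for the logical-versus-physical qubit numbers mentioned above, the following back-of-the-envelope C sketch estimates, under an assumed surface-code error model, the code distance and physical-qubit count needed to reach a target logical error rate. The error rates, the 0.1*(p/p_th)^((d+1)/2) scaling and the 4,000 logical-qubit figure are common rough assumptions, not results from this paper.

        /* Back-of-the-envelope sketch: physical-qubit overhead under an assumed
         * surface-code model. All numbers are illustrative, not from the paper. */
        #include <stdio.h>
        #include <math.h>

        int main(void) {
            double p_phys       = 1e-3;   /* assumed physical error rate */
            double p_thresh     = 1e-2;   /* assumed surface-code threshold */
            double p_log_target = 1e-15;  /* target logical error rate per cycle */

            /* Rough scaling: p_logical ~ 0.1 * (p_phys/p_thresh)^((d+1)/2).
             * Find the smallest odd code distance d that meets the target. */
            int d = 3;
            while (0.1 * pow(p_phys / p_thresh, (d + 1) / 2.0) > p_log_target)
                d += 2;

            long phys_per_logical = 2L * d * d;  /* ~2*d^2 data + ancilla qubits */
            long logical_qubits   = 4000;        /* order of magnitude often quoted for RSA-2048 */
            printf("distance d = %d, ~%ld physical qubits per logical qubit,\n"
                   "~%ld physical qubits in total\n",
                   d, phys_per_logical, phys_per_logical * logical_qubits);
            return 0;
        }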

    Energy reconstruction on the LHC ATLAS TileCal upgraded front end: feasibility study for a sROD co-processing unit

    Dissertation presented in fulfilment of the requirements for the degree of Master of Science in Physics, 2016. The Phase-II upgrade of the Large Hadron Collider at CERN in the early 2020s will enable an order of magnitude increase in the data produced, unlocking the potential for new physics discoveries. In the ATLAS detector, the upgraded Hadronic Tile Calorimeter (TileCal) Phase-II front end read out system is currently being prototyped to handle a total data throughput of 5.1 TB/s, up from the current 20.4 GB/s. The FPGA based Super Read Out Driver (sROD) prototype must perform an energy reconstruction algorithm on 2.88 GB/s of raw data, or 275 million events per second. Due to the very high level of proficiency required and the time consuming nature of FPGA firmware development, it may be more effective to implement certain complex energy reconstruction and monitoring algorithms on a general purpose, CPU based sROD co-processor. Hence, the feasibility of a general purpose ARM System on Chip based co-processing unit (PU) for the sROD is determined in this work. A PCI-Express test platform was designed and constructed to link two ARM Cortex-A9 SoCs via their PCI-Express Gen-2 x1 interfaces. Test results indicate that the latency of the PCI-Express interface is sufficiently low and the data throughput is superior to that of alternative interfaces such as Ethernet, for use as an interconnect for the SoCs to the sROD. CPU performance benchmarks were performed on five ARM development platforms to determine the CPU integer, floating point and memory system performance as well as energy efficiency. To complement the benchmarks, Fast Fourier Transform and Optimal Filtering (OF) applications were also tested. Based on the test results, in order for the PU to process 275 million events per second with OF, within the 6 μs timing budget of the ATLAS triggering system, a cluster of three Tegra-K1, Cortex-A15 SoCs connected to the sROD via a Gen-2 x8 PCI-Express interface would be suitable. A high level design for the PU is proposed which surpasses the requirements for the sROD co-processor and can also be used in a general purpose, high data throughput system, with 80 Gb/s Ethernet and 15 GB/s PCI-Express throughput, using four X-Gene SoCs.
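
    As a rough illustration of what the Optimal Filtering (OF) step computes, the sketch below reconstructs a pulse amplitude as a weighted sum of pedestal-subtracted ADC samples. The weights, pedestal and seven-sample format are placeholder assumptions, not the actual TileCal constants or the dissertation's implementation.

        /* Minimal sketch of an Optimal Filtering style amplitude estimate: the
         * reconstructed energy is a weighted sum of pedestal-subtracted ADC
         * samples. Weights and sample values are illustrative assumptions. */
        #include <stdio.h>

        #define NSAMPLES 7

        static double of_amplitude(const double s[NSAMPLES],
                                   const double a[NSAMPLES],
                                   double pedestal) {
            double amp = 0.0;
            for (int i = 0; i < NSAMPLES; i++)
                amp += a[i] * (s[i] - pedestal);   /* A = sum_i a_i * (S_i - ped) */
            return amp;
        }

        int main(void) {
            /* Hypothetical weights (in a real OF derivation they come from the
             * pulse shape and noise autocorrelation) and one digitized pulse. */
            const double a[NSAMPLES] = { -0.05, 0.10, 0.35, 0.45, 0.25, 0.05, -0.05 };
            const double s[NSAMPLES] = { 50, 55, 120, 180, 150, 90, 60 };
            printf("amplitude = %.1f ADC counts\n", of_amplitude(s, a, 50.0));
            return 0;
        }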

    DAMOV: A New Methodology and Benchmark Suite for Evaluating Data Movement Bottlenecks

    Data movement between the CPU and main memory is a first-order obstacle against improving performance, scalability, and energy efficiency in modern systems. Computer systems employ a range of techniques to reduce overheads tied to data movement, spanning from traditional mechanisms (e.g., deep multi-level cache hierarchies, aggressive hardware prefetchers) to emerging techniques such as Near-Data Processing (NDP), where some computation is moved close to memory. Our goal is to methodically identify potential sources of data movement over a broad set of applications and to comprehensively compare traditional compute-centric data movement mitigation techniques to more memory-centric techniques, thereby developing a rigorous understanding of the best techniques to mitigate each source of data movement. With this goal in mind, we perform the first large-scale characterization of a wide variety of applications, across a wide range of application domains, to identify fundamental program properties that lead to data movement to/from main memory. We develop the first systematic methodology to classify applications based on the sources contributing to data movement bottlenecks. From our large-scale characterization of 77K functions across 345 applications, we select 144 functions to form the first open-source benchmark suite (DAMOV) for main memory data movement studies. We select a diverse range of functions that (1) represent different types of data movement bottlenecks, and (2) come from a wide range of application domains. Using NDP as a case study, we identify new insights about the different data movement bottlenecks and use these insights to determine the most suitable data movement mitigation mechanism for a particular application. We open-source DAMOV and the complete source code for our new characterization methodology at https://github.com/CMU-SAFARI/DAMOV. Comment: Our open source software is available at https://github.com/CMU-SAFARI/DAMOV
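
    For intuition about what it means to label a function memory-bound, the sketch below applies a simple roofline-style test: it compares a function's arithmetic intensity against the machine's compute-to-bandwidth ridge point. This is only an illustrative stand-in; DAMOV's actual methodology classifies functions using a richer set of locality and cache-miss metrics, and all counter values here are assumed.

        /* Illustrative roofline-style check (not DAMOV's methodology): decide
         * whether a profiled function looks memory-bound from assumed counters. */
        #include <stdio.h>

        int main(void) {
            /* Assumed machine balance and profiled values for one function. */
            double peak_flops = 3.0e12;   /* peak FLOP/s of the machine */
            double peak_bw    = 2.0e11;   /* peak bytes/s from main memory */
            double flops      = 4.0e9;    /* FLOPs executed by the function */
            double dram_bytes = 8.0e9;    /* bytes moved to/from main memory */

            double intensity   = flops / dram_bytes;    /* FLOP per byte */
            double ridge_point = peak_flops / peak_bw;  /* machine balance */

            printf("arithmetic intensity: %.2f FLOP/byte (ridge point %.2f)\n",
                   intensity, ridge_point);
            printf("%s-bound candidate\n",
                   intensity < ridge_point ? "memory" : "compute");
            return 0;
        }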

    Enabling the use of embedded and mobile technologies for high-performance computing

    In the late 1990s, powerful economic forces led to the adoption of commodity desktop processors in High-Performance Computing (HPC). This transformation has been so effective that the November 2016 TOP500 list is still dominated by the x86 architecture. In 2016, the largest commodity market in computing is not PCs or servers, but mobile computing, comprising smartphones and tablets, most of which are built with ARM-based Systems on Chips (SoC). This suggests that once mobile SoCs deliver sufficient performance, they can help reduce the cost of HPC. This thesis addresses this question in detail. We analyze the trend in mobile SoC performance, comparing it with the similar trend in the 1990s. Through the development of real system prototypes and their performance analysis, we assess the feasibility of building an HPC system based on mobile SoCs. Through simulation of future mobile SoCs, we identify the missing features and suggest improvements that would enable the use of future mobile SoCs in an HPC environment. Thus, we present design guidelines for future generations of mobile SoCs, and for HPC systems built around them, enabling a new class of cheap supercomputers.

    Evaluation of low-power architectures in a scientific computing environment

    HPC (High Performance Computing) represents, together with theory and experiments, the third pillar of science. Through HPC, scientists can simulate phenomena otherwise impossible to study. The need to perform larger and more accurate simulations requires HPC to improve every day. HPC is constantly looking for new computational platforms that can improve cost and power efficiency. The Mont-Blanc project is an EU-funded research project that aims to study new hardware and software solutions that can improve the efficiency of HPC systems. The vision of the project is to leverage the fast-growing market of mobile devices to develop the next generation of supercomputers. In this work we contribute to the objectives of the Mont-Blanc project by evaluating the performance of production scientific applications on innovative low-power architectures. In order to do so, we describe our experience porting and evaluating state-of-the-art scientific applications on the Mont-Blanc prototype, the first HPC system built with commodity low-power embedded technology. We then extend our study to compare off-the-shelf ARMv8 platforms. We finally discuss the most significant issues encountered during the development of the Mont-Blanc prototype system.