CXL Memory as Persistent Memory for Disaggregated HPC: A Practical Approach
In the landscape of High-Performance Computing (HPC), the quest for efficient
and scalable memory solutions remains paramount. The advent of Compute Express
Link (CXL) introduces a promising avenue with its potential to function as a
Persistent Memory (PMem) solution in the context of disaggregated HPC systems.
This paper presents a comprehensive exploration of CXL memory's viability as a
candidate for PMem, supported by physical experiments conducted on cutting-edge
multi-NUMA nodes equipped with CXL-attached memory prototypes. Our study not
only benchmarks the performance of CXL memory but also illustrates the seamless
transition from traditional PMem programming models to CXL, reinforcing its
practicality.
To substantiate our claims, we establish a tangible CXL prototype using an
FPGA card embodying CXL 1.1/2.0 compliant endpoint designs (Intel FPGA CXL IP).
Performance evaluations, executed through the STREAM and STREAM-PMem
benchmarks, showcase CXL memory's ability to mirror PMem characteristics in
App-Direct and Memory Mode while achieving impressive bandwidth metrics with
Intel 4th generation Xeon (Sapphire Rapids) processors.
The results elucidate the feasibility of CXL memory as a persistent memory
solution, outperforming previously established benchmarks. In contrast to
published DCPMM results, our CXL-DDR4 memory module offers comparable bandwidth
to local DDR4 memory configurations, albeit with a moderate decrease in
performance. The modified STREAM-PMem application demonstrates the ease of
transitioning programming models from PMem to CXL, underscoring the
practicality of adopting CXL memory.

Comment: 12 pages, 9 figures
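The abstract only names the programming-model transition; as a hedged
illustration of the App-Direct-style model it refers to, the C sketch below
maps a file on a DAX-mounted filesystem backed by CXL memory and makes a
store durable. The mount point /mnt/cxl, the file name, and the region size
are assumptions for illustration, not details from the paper.

```c
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define REGION_SIZE (64UL << 20)  /* 64 MiB, illustrative */

int main(void)
{
    /* Path on a hypothetical DAX-mounted filesystem backed by CXL memory. */
    int fd = open("/mnt/cxl/stream.dat", O_CREAT | O_RDWR, 0644);
    if (fd < 0) { perror("open"); return 1; }
    if (ftruncate(fd, REGION_SIZE) != 0) { perror("ftruncate"); return 1; }

    /* With fsdax, loads and stores in this mapping go straight to the
     * CXL-attached memory, bypassing the page cache. */
    char *base = mmap(NULL, REGION_SIZE, PROT_READ | PROT_WRITE,
                      MAP_SHARED, fd, 0);
    if (base == MAP_FAILED) { perror("mmap"); return 1; }

    memcpy(base, "hello, cxl", 11);

    /* Persistence point: flush the dirtied range. On real PMem/CXL
     * deployments this is typically a CLWB + SFENCE sequence (e.g. via
     * libpmem's pmem_persist); msync is the portable stand-in here. */
    if (msync(base, 4096, MS_SYNC) != 0) perror("msync");

    munmap(base, REGION_SIZE);
    close(fd);
    return 0;
}
```

In Memory Mode, by contrast, the CXL device would appear as an ordinary
(NUMA) memory node, with no explicit persistence calls involved.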
Dependability investigation of wireless short range embedded systems: hardware platform oriented approach
A new direction in short-range wireless applications has appeared in the form of high-speed data communication devices for distances of hundreds of meters. Behind these embedded applications lies a complex heterogeneous architecture. Moreover, these short-range communications are being introduced into critical applications, where dependability and reliability are mandatory. Dependability concerns around reliability evaluation therefore become a major challenge in these systems and pose several questions to answer. Obviously, in such systems, the reliability attribute has to be investigated for various components and at different abstraction levels. In this paper, we discuss the investigation of dependability in wireless short-range systems. We present a hardware platform for wireless system dependability analysis as an alternative to time-consuming simulation techniques. The platform is built using several instances of one of the commercial FPGA platforms available on the market. We describe the different steps of building the wireless hardware platform for short-range system dependability analysis. Then, we show how this hardware-platform-based dependability investigation framework can be a highly interactive approach. Based on this platform, we introduce a new methodology and flow to investigate the different parts of system dependability at different abstraction levels. The benefits of the proposed framework are threefold: first, it takes care of the whole system (HW/SW digital part, mixed RF part, and wireless part); second, the hardware platform makes it possible to explore the application's reliability under real environmental conditions, taking into account the effect of environmental threats on the system; and last, the wireless platform built for dependability investigation offers a fast investigation approach in comparison with time-consuming co-simulation techniques.
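The platform described above performs its fault-injection experiments in FPGA
hardware; purely as an illustration of the kind of trial it runs, this hedged
C sketch injects single-bit faults into a frame in software and checks whether
a CRC-8 (polynomial 0x07, an arbitrary choice not taken from the paper)
detects them.

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Bitwise CRC-8 with polynomial x^8 + x^2 + x + 1 (0x07). */
static uint8_t crc8(const uint8_t *data, size_t len)
{
    uint8_t crc = 0;
    for (size_t i = 0; i < len; i++) {
        crc ^= data[i];
        for (int b = 0; b < 8; b++)
            crc = (crc & 0x80) ? (uint8_t)((crc << 1) ^ 0x07)
                               : (uint8_t)(crc << 1);
    }
    return crc;
}

int main(void)
{
    uint8_t frame[16] = "wireless-frame";
    uint8_t ref = crc8(frame, sizeof frame);

    srand(42);
    int detected = 0, trials = 10000;
    for (int t = 0; t < trials; t++) {
        uint8_t faulty[16];
        memcpy(faulty, frame, sizeof frame);
        /* Inject a transient fault: flip one random bit of the frame. */
        faulty[rand() % 16] ^= (uint8_t)(1u << (rand() % 8));
        if (crc8(faulty, sizeof faulty) != ref)
            detected++;
    }
    printf("detected %d / %d injected faults\n", detected, trials);
    return 0;
}
```

Since any CRC detects all single-bit errors, this particular campaign reports
full coverage; multi-bit bursts, as produced by real channel threats, are
where such campaigns become informative.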
Management and Service-aware Networking Architectures (MANA) for Future Internet Position Paper: System Functions, Capabilities and Requirements
Future Internet (FI) research and development threads have recently been gaining momentum all over the world, and the international race to create a new-generation Internet is in full swing: GENI, Asia Future Internet, Future Internet Forum Korea, European Union Future Internet Assembly (FIA). This is a position paper identifying the research orientation, with a time horizon of 10 years, together with the key challenges for the capabilities of the Management and Service-aware Networking Architectures (MANA) part of the Future Internet (FI), allowing for parallel and federated Internet(s).
Is there a Moore's law for quantum computing?
There is a common wisdom according to which many technologies progress
according to some exponential law, like the empirical Moore's law that was
validated for over half a century by the growth of transistor counts in
chips. As a technology still in the making, with many potential promises,
quantum computing is supposed to follow the pack and grow inexorably to
maturity. The Holy Grail in that domain is a large quantum computer with
thousands of error-corrected logical qubits, each made of thousands, if not
more, physical qubits. These would enable molecular simulations as well as
factoring 2048-bit RSA keys, among other use cases drawn from the book of
classically intractable computing problems. How far are we from this? Less
than 15 years, according to many predictions. We will see in this paper that
Moore's empirical law cannot easily be translated to an equivalent in quantum
computing. Qubits have various figures of merit that won't progress magically
thanks to some new manufacturing technique. However, some equivalents of
Moore's law may be at play inside and outside the quantum realm, as with
quantum computing's enabling technologies, cryogenics and control
electronics. Algorithms, software tools and engineering also play a key role
as enablers of quantum computing progress. While much of quantum computing's
future outcome depends on qubit fidelities, these are progressing rather
slowly, particularly at scale. We will finally see that other figures of
merit will come into play and potentially change the landscape, like the
quality of computed results and the energetics of quantum computing. Although
scientific and technological in nature, this inventory has broad business
implications for investment, education and cybersecurity-related
decision-making processes.

Comment: 32 pages, 24 figures
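The "thousands of physical qubits per logical qubit" figure can be made
concrete with the standard surface-code scaling law. The worked numbers below
are illustrative assumptions, not values taken from the paper.

```latex
% Logical error rate per cycle at code distance d:
%   p_L \approx A \, (p / p_{\mathrm{th}})^{(d+1)/2}
% Assume p = 10^{-3}, p_th = 10^{-2}, A = 0.1, target p_L = 10^{-15}:
\[
  \frac{d+1}{2} \;\ge\; \frac{\log_{10}(p_L / A)}{\log_{10}(p / p_{\mathrm{th}})}
  = \frac{-14}{-1} = 14
  \quad\Longrightarrow\quad d \ge 27
\]
% Physical qubits per logical qubit (data plus ancilla), roughly 2d^2:
\[
  n \approx 2d^2 = 2 \times 27^2 = 1458
\]
```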
Energy reconstruction on the LHC ATLAS TileCal upgraded front end: feasibility study for a sROD co-processing unit
Dissertation presented in fulfilment of the requirements for the degree of
Master of Science in Physics, 2016.

The Phase-II upgrade of the Large Hadron Collider at CERN in the early 2020s
will enable an order of magnitude increase in the data produced, unlocking the
potential for new physics discoveries. In the ATLAS detector, the upgraded Hadronic
Tile Calorimeter (TileCal) Phase-II front end read out system is currently being
prototyped to handle a total data throughput of 5.1 TB/s, from the current 20.4 GB/s.
The FPGA-based Super Read Out Driver (sROD) prototype must perform an energy
reconstruction algorithm on 2.88 GB/s of raw data, or 275 million events per second.
Due to the very high level of proficiency required and the time-consuming nature of
FPGA firmware development, it may be more effective to implement certain complex
energy reconstruction and monitoring algorithms on a general-purpose, CPU-based
sROD co-processor. Hence, the feasibility of a general-purpose ARM System-on-Chip
based co-processing unit (PU) for the sROD is determined in this work.
A PCI-Express test platform was designed and constructed to link two ARM
Cortex-A9 SoCs via their PCI-Express Gen-2 x1 interfaces. Test results indicate that
the latency of the PCI-Express interface is sufficiently low and the data throughput is
superior to that of alternative interfaces such as Ethernet, for use as an interconnect
for the SoCs to the sROD. CPU performance benchmarks were performed on five ARM
development platforms to determine the CPU integer, floating point, and memory
system performance, as well as energy efficiency. To complement the benchmarks,
Fast Fourier Transform and Optimal Filtering (OF) applications were also tested.
Based on the test results, in order for the PU to process 275 million events per
second with OF, within the 6 μs timing budget of the ATLAS triggering system, a
cluster of three Tegra-K1, Cortex-A15 SoCs connected to the sROD via a Gen-2 x8
PCI-Express interface would be suitable. A high-level design for the PU is proposed
which surpasses the requirements for the sROD co-processor and can also be used
in a general-purpose, high-data-throughput system, with 80 Gb/s Ethernet and
15 GB/s PCI-Express throughput, using four X-Gene SoCs.
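For scale: at 2.88 GB/s and 275 million events per second, each event carries
roughly 10.5 bytes of raw data, so the per-event budget is dominated by the
reconstruction arithmetic itself. As a hedged sketch of the Optimal Filtering
step benchmarked above, the C fragment below reconstructs a pulse amplitude as
a weighted sum of digitized samples; the sample values, weights, and pedestal
are placeholders, since real OF coefficients are derived from the pulse shape
and noise autocorrelation.

```c
#include <stdio.h>

#define NSAMPLES 7  /* TileCal digitizes 7 samples per pulse */

/* Optimal Filtering amplitude: weighted sum of pedestal-subtracted samples. */
static double of_amplitude(const double samples[NSAMPLES],
                           const double weights[NSAMPLES],
                           double pedestal)
{
    double amp = 0.0;
    for (int i = 0; i < NSAMPLES; i++)
        amp += weights[i] * (samples[i] - pedestal);
    return amp;
}

int main(void)
{
    /* Hypothetical pulse samples (ADC counts) and OF weights. */
    const double samples[NSAMPLES] = {50, 52, 120, 480, 310, 110, 60};
    const double weights[NSAMPLES] = {-0.1, 0.0, 0.3, 0.6, 0.3, 0.0, -0.1};

    printf("reconstructed amplitude: %.1f ADC counts\n",
           of_amplitude(samples, weights, 50.0));
    return 0;
}
```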
DAMOV: A New Methodology and Benchmark Suite for Evaluating Data Movement Bottlenecks
Data movement between the CPU and main memory is a first-order obstacle
against improving performance, scalability, and energy efficiency in modern
systems. Computer systems employ a range of techniques to reduce overheads tied
to data movement, spanning from traditional mechanisms (e.g., deep multi-level
cache hierarchies, aggressive hardware prefetchers) to emerging techniques such
as Near-Data Processing (NDP), where some computation is moved close to memory.
Our goal is to methodically identify potential sources of data movement over a
broad set of applications and to comprehensively compare traditional
compute-centric data movement mitigation techniques to more memory-centric
techniques, thereby developing a rigorous understanding of the best techniques
to mitigate each source of data movement.
With this goal in mind, we perform the first large-scale characterization of
a wide variety of applications, across a wide range of application domains, to
identify fundamental program properties that lead to data movement to/from main
memory. We develop the first systematic methodology to classify applications
based on the sources contributing to data movement bottlenecks. From our
large-scale characterization of 77K functions across 345 applications, we
select 144 functions to form the first open-source benchmark suite (DAMOV) for
main memory data movement studies. We select a diverse range of functions that
(1) represent different types of data movement bottlenecks, and (2) come from a
wide range of application domains. Using NDP as a case study, we identify new
insights about the different data movement bottlenecks and use these insights
to determine the most suitable data movement mitigation mechanism for a
particular application. We open-source DAMOV and the complete source code for
our new characterization methodology at https://github.com/CMU-SAFARI/DAMOV.
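DAMOV's actual methodology is a multi-step workflow combining profiling and
simulation; purely as a hedged illustration of the classification idea, the C
sketch below buckets a function using two common memory-boundedness metrics.
The metric names and thresholds are invented for illustration and are not the
paper's.

```c
#include <stdio.h>

typedef enum { COMPUTE_BOUND, CACHE_SENSITIVE, DRAM_BANDWIDTH_BOUND } klass;

/* Toy classifier: bucket a function by last-level-cache misses per
 * kilo-instruction (MPKI) and arithmetic intensity (FLOPs per byte). */
static klass classify(double llc_mpki, double flops_per_byte)
{
    if (llc_mpki < 1.0)
        return COMPUTE_BOUND;       /* data mostly served by the caches */
    if (flops_per_byte > 0.25)
        return CACHE_SENSITIVE;     /* misses hurt, but reuse exists */
    return DRAM_BANDWIDTH_BOUND;    /* streaming, little reuse: NDP candidate */
}

int main(void)
{
    /* A high-MPKI, low-intensity function lands in the NDP-friendly bucket. */
    printf("%d\n", classify(12.0, 0.05)); /* prints DRAM_BANDWIDTH_BOUND (2) */
    return 0;
}
```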
Enabling the use of embedded and mobile technologies for high-performance computing
In the late 1990s, powerful economic forces led to the adoption of commodity desktop processors in High-Performance Computing (HPC). This transformation has been so effective that the November 2016 TOP500 list is still dominated by the x86 architecture.
In 2016, the largest commodity market in computing is not PCs or servers, but mobile computing, comprising smartphones and tablets, most of which are built with ARM-based Systems on Chip (SoCs). This suggests that once mobile SoCs deliver sufficient performance, they can help reduce the cost of HPC.
This thesis addresses this question in detail. We analyze the trend in mobile SoC performance, comparing it with the similar trend in the 1990s. Through the development of real system prototypes and their performance analysis, we assess the feasibility of building an HPC system based on mobile SoCs. Through simulation of future mobile SoCs, we identify the missing features and suggest improvements that would enable the use of future mobile SoCs in an HPC environment.
Thus, we present design guidelines for future generations of mobile SoCs, and for HPC systems built around them, enabling a new class of cheap supercomputers.
Evaluation of low-power architectures in a scientific computing environment
HPC (High-Performance Computing) represents, together with theory and
experiments, the third pillar of science. Through HPC, scientists can
simulate phenomena otherwise impossible to study. The need to perform larger
and more accurate simulations requires HPC to improve every day.
HPC is constantly looking for new computational platforms that can improve
cost and power efficiency. The Mont-Blanc project is an EU-funded research
project that aims to study new hardware and software solutions that can
improve the efficiency of HPC systems. The vision of the project is to
leverage the fast-growing market of mobile devices to develop the next
generation of supercomputers.
In this work we contribute to the objectives of the Mont-Blanc project by
evaluating the performance of production scientific applications on
innovative low-power architectures. To do so, we describe our experiences
porting and evaluating state-of-the-art scientific applications on the
Mont-Blanc prototype, the first HPC system built with commodity low-power
embedded technology. We then extend our study to compare off-the-shelf ARMv8
platforms. We finally discuss the most significant issues encountered during
the development of the Mont-Blanc prototype system.
- …