112 research outputs found

    Using Graph Properties to Speed-up GPU-based Graph Traversal: A Model-driven Approach

    While it is well-known and acknowledged that the performance of graph algorithms is heavily dependent on the input data, there has been surprisingly little research to quantify and predict the impact the graph structure has on performance. Parallel graph algorithms, running on many-core systems such as GPUs, are no exception: most research has focused on how to efficiently implement and tune different graph operations on a specific GPU. However, the performance impact of the input graph has only been taken into account indirectly, as a result of the graphs used to benchmark the system. In this work, we present a case study investigating how to use the properties of the input graph to improve the performance of breadth-first search (BFS) graph traversal. To do so, we first study the performance variation of 15 different BFS implementations across 248 graphs. Using this performance data, we show that significant speedup can be achieved by combining the best implementation for each level of the traversal. To make use of this data-dependent optimization, we must correctly predict the relative performance of the algorithms per graph level, and enable dynamic switching to the optimal algorithm for each level at runtime. We use the collected performance data to train a binary decision tree, enabling high-accuracy predictions and fast switching. We demonstrate empirically that our decision tree is both fast enough to allow dynamic switching between implementations, without noticeable overhead, and accurate enough in its prediction to enable significant BFS speedup. We conclude that our model-driven approach (1) enables BFS to outperform state-of-the-art GPU algorithms, and (2) can be adapted for other BFS variants, other algorithms, or more specific datasets.
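
    The per-level switching idea can be illustrated with a small, CPU-only sketch. The names and heuristic below are placeholders: choose_strategy stands in for the trained decision tree, and only two classic expansion strategies (top-down and bottom-up) stand in for the 15 GPU implementations studied in the paper.

        def choose_strategy(frontier_size, unvisited, avg_degree):
            """Placeholder for the trained decision tree: pick an implementation per level."""
            # Heuristic stand-in: large frontiers favour bottom-up expansion.
            return "bottom-up" if frontier_size * avg_degree > unvisited else "top-down"

        def bfs_switching(adj, source):
            n = len(adj)
            dist = [-1] * n
            dist[source] = 0
            frontier, level = [source], 0
            avg_degree = sum(len(a) for a in adj) / n
            while frontier:
                strategy = choose_strategy(len(frontier), dist.count(-1), avg_degree)
                next_frontier = []
                if strategy == "top-down":
                    # Expand from the frontier outwards.
                    for u in frontier:
                        for v in adj[u]:
                            if dist[v] == -1:
                                dist[v] = level + 1
                                next_frontier.append(v)
                else:
                    # Bottom-up: every unvisited vertex looks for a parent in the frontier.
                    in_frontier = set(frontier)
                    for v in range(n):
                        if dist[v] == -1 and any(u in in_frontier for u in adj[v]):
                            dist[v] = level + 1
                            next_frontier.append(v)
                frontier, level = next_frontier, level + 1
            return dist

        # Tiny undirected example graph as adjacency lists.
        print(bfs_switching([[1, 2], [0, 3], [0, 3], [1, 2, 4], [3]], 0))  # [0, 1, 1, 2, 3]

    In the real system the prediction and the switch happen between GPU kernel launches, which is why the predictor must be cheap relative to a single BFS level.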

    Efficient trimming for strongly connected components calculation

    Strongly Connected Components (SCCs) are useful for many applications, such as community detection and personalized recommendation. Determining the SCCs of a graph, however, can be very expensive, and parallelization is not an easy way out: the parallelization itself is challenging, and its performance impact varies non-trivially with the input graph structure. This variability is due to trivial components, i.e., SCCs consisting of a single vertex, which lead to significant workload imbalance. Trimming is an effective method to remove trivial components, but it is inefficient when used on graphs with few trivial components. In this work, we propose FB-AI-Trim, a parallel SCC algorithm with selective trimming. Our algorithm decides dynamically, at runtime, how to trim the input graph. To this end, we train a neural network to predict, using topological graph information, whether trimming is beneficial for performance. We evaluate FB-AI-Trim using 173 unseen graphs, and compare it against four different static trimming models. Our results demonstrate that, over the set of graphs, FB-AI-Trim is the fastest algorithm. Furthermore, FB-AI-Trim is, in 80% of the cases, less than 10% slower than the best-performing model on a single graph. Finally, FB-AI-Trim shows significant performance degradation on less than 3% of the graphs.
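
    The trimming step itself is simple to sketch: vertices with in-degree zero or out-degree zero can only form single-vertex SCCs, so they are peeled off iteratively before the full SCC computation runs. The minimal sequential sketch below illustrates this; FB-AI-Trim's actual contribution, the neural network that predicts whether this step pays off for a given graph, is represented here only by the should_trim flag.

        def trim_trivial_sccs(vertices, edges, should_trim=True):
            """Return (trivial_sccs, remaining_vertices) for a directed graph."""
            if not should_trim:          # in FB-AI-Trim this decision comes from the learned predictor
                return [], set(vertices)
            remaining, trivial, changed = set(vertices), [], True
            while changed:
                changed = False
                indeg = {v: 0 for v in remaining}
                outdeg = {v: 0 for v in remaining}
                for u, v in edges:
                    if u in remaining and v in remaining and u != v:
                        outdeg[u] += 1
                        indeg[v] += 1
                for v in list(remaining):
                    if indeg[v] == 0 or outdeg[v] == 0:   # cannot be part of a larger SCC
                        trivial.append(v)
                        remaining.remove(v)
                        changed = True
            return trivial, remaining

        # Example: 0 -> 1 -> 2 -> 1; vertex 0 is trivial, vertices 1 and 2 form a real SCC.
        print(trim_trivial_sccs([0, 1, 2], [(0, 1), (1, 2), (2, 1)]))  # ([0], {1, 2})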

    Graph-Optimizer: Towards Predictable Large-Scale Graph Processing Workloads

    Graph- and hardware-specific optimisations lead to orders-of-magnitude improvements in performance, energy, and cost over conventional graph processing methods. Typical big data platforms, such as Apache MapReduce and Apache Spark, rely on generic primitives, exhibiting poor performance and high financial and environmental costs. Even where optimised basic graph operations (BGOs) exist, the tools to combine them into real-world applications are lacking. Furthermore, graph topology and dynamics (i.e., changes in the number and content of vertices and edges) lead to high variability in computational needs. Primitive predictive models demonstrate that they can enable algorithm selection and advanced auto-scaling techniques to ensure better performance, but no such models exist for energy consumption. In this work, we present the Graph-Optimizer tool. Graph-Optimizer uses optimised BGOs and composition rules to capture and model the workload. It combines the workload model with hardware and infrastructure models to predict performance and energy consumption. Combined with design-space exploration, these predictions are used to select co-designed workload implementations that fit a requested performance objective, and to guarantee their performance bounds during execution.
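
    The modelling approach can be illustrated with a deliberately simplified sketch: a workload expressed as counts of BGO work items is combined with a per-BGO hardware profile to predict runtime and energy. The names, cost numbers, and linear cost model below are hypothetical and only illustrate the composition of a workload model with a hardware model, not Graph-Optimizer's actual models.

        from dataclasses import dataclass

        @dataclass
        class HardwareProfile:
            seconds_per_item: dict   # BGO name -> seconds per processed edge/vertex (hypothetical)
            watts: float             # average power draw of the target machine

        def predict(workload, hw):
            """workload: BGO name -> amount of work (e.g. edges touched)."""
            runtime = sum(hw.seconds_per_item[bgo] * n for bgo, n in workload.items())
            energy = runtime * hw.watts          # Joules, assuming constant power draw
            return runtime, energy

        # Hypothetical GPU profile and a PageRank-like workload touching ~1e9 edges per BGO.
        gpu = HardwareProfile({"scatter": 2e-10, "gather": 3e-10, "reduce": 1e-10}, watts=250.0)
        runtime, energy = predict({"scatter": 1e9, "gather": 1e9, "reduce": 1e8}, gpu)
        print(f"predicted: {runtime:.2f} s, {energy:.1f} J")   # predicted: 0.51 s, 127.5 J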

    Lessons Learned Migrating CUDA to SYCL: A HEP Case Study with ROOT RDataFrame

    The world's largest particle accelerator, located at CERN, produces petabytes of data that need to be analysed efficiently to study the fundamental structures of our universe. ROOT is an open-source C++ data analysis framework, developed for this purpose. Its high-level data analysis interface, RDataFrame, currently only supports CPU parallelism. Given the increasing heterogeneity in computing facilities, it becomes crucial to efficiently support GPGPUs to take advantage of the available resources. SYCL allows for a single-source implementation, which enables support for different architectures. In this paper, we describe a CUDA implementation and the migration process to SYCL, focusing on a core high energy physics operation in RDataFrame -- histogramming. We detail the challenges that we faced when integrating SYCL into a large and complex code base. Furthermore, we perform an extensive comparative performance analysis of two SYCL compilers, AdaptiveCpp and DPC++, and the reference CUDA implementation. We highlight the performance bottlenecks that we encountered, and the methodology used to detect these. Based on our findings, we provide actionable insights for developers of SYCL applications.
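
    The operation being ported is conceptually simple. The Python sketch below (not CUDA, SYCL, or the ROOT RDataFrame API) shows the privatisation-plus-reduction structure that GPU histogramming kernels typically use: each worker fills a private partial histogram and the partials are merged at the end, avoiding contended atomic updates on a single shared histogram.

        import numpy as np

        def parallel_histogram(values, n_bins, lo, hi, n_workers=4):
            chunks = np.array_split(np.asarray(values, dtype=float), n_workers)
            partials = []
            for chunk in chunks:                          # each iteration stands in for one worker
                bins = ((chunk - lo) / (hi - lo) * n_bins).astype(int)
                bins = np.clip(bins, 0, n_bins - 1)       # out-of-range values land in the edge bins
                partials.append(np.bincount(bins, minlength=n_bins))
            return np.sum(partials, axis=0)               # merge step: reduction over workers

        data = np.random.default_rng(0).normal(size=100_000)
        hist = parallel_histogram(data, n_bins=20, lo=-4.0, hi=4.0)
        print(hist.sum())  # 100000: every value lands in exactly one bin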

    Model Parallelism on Distributed Infrastructure: A Literature Review from Theory to LLM Case-Studies

    Neural networks have become a cornerstone of machine learning. As these networks continue to grow more complex, so does the underlying hardware and software infrastructure for training and deployment. In this survey we answer three research questions: "What types of model parallelism exist?", "What are the challenges of model parallelism?", and "What is a modern use-case of model parallelism?" We answer the first question by looking at how neural networks can be expressed as operator graphs and parallelised along the available dimensions; the dimensions along which neural networks can be parallelised are intra-operator and inter-operator. We answer the second question by collecting and listing both the implementation challenges for each type of parallelism and the problem of optimally partitioning the operator graph. We answer the last question by collecting and listing how parallelism is applied in modern multi-billion-parameter transformer networks, to the extent that this is possible with the limited information shared about these networks.
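
    The two dimensions can be made concrete with a toy numpy sketch, where the "devices" are just Python callables and the network is a stack of two linear layers; this is an illustration of the concepts only.

        import numpy as np

        rng = np.random.default_rng(0)
        x = rng.normal(size=(8, 16))                         # a batch of activations
        W1, W2 = rng.normal(size=(16, 32)), rng.normal(size=(32, 4))

        # Inter-operator (pipeline) parallelism: whole operators placed on different devices.
        device_0 = lambda a: a @ W1                          # layer 1 lives on device 0
        device_1 = lambda a: a @ W2                          # layer 2 lives on device 1
        pipeline_out = device_1(device_0(x))

        # Intra-operator (tensor) parallelism: a single operator split across devices.
        W1_shards = np.split(W1, 2, axis=1)                  # column-wise shards of layer 1
        partials = [x @ shard for shard in W1_shards]        # each device computes its shard
        tensor_out = np.concatenate(partials, axis=1)        # all-gather of the partial outputs

        assert np.allclose(tensor_out, x @ W1)               # same result as the unsplit operator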

    Performance Engineering for Graduate Students: A View from Amsterdam

    HPC relies on experts to design, implement, and tune (computational science) applications that can efficiently use current (super)computing systems. As such, we strongly believe we must educate our students to ensure their ability to drive these activities, together with the domain experts. To this end, in 2017 we designed a performance engineering course that, inspired by several conference-like tutorials, covers the principles and practice of performance engineering: benchmarking, performance modeling, and performance improvement. In this paper, we describe the goals, learning objectives, and structure of the course, share students' feedback and evaluation data, and discuss the lessons learned. After teaching the course seven times, we find that the course is tough (as expected) but very well received, with high scores and several students continuing on the path of performance engineering during and after their master's studies.

    Reduced Simulations for High-Energy Physics, a Middle Ground for Data-Driven Physics Research

    Subatomic particle track reconstruction (tracking) is a vital task in High-Energy Physics experiments. Tracking is exceptionally computationally challenging, and fielded solutions, relying on traditional algorithms, do not scale linearly. Machine Learning (ML) assisted solutions are a promising answer. We argue that a complexity-reduced problem description, and the data representing it, will facilitate the solution exploration workflow. We provide the REDuced VIrtual Detector (REDVID), a combined complexity-reduced detector model and particle collision event simulator. REDVID is intended as a simulation-in-the-loop, both to generate synthetic data efficiently and to simplify the challenge of ML model design. The fully parametric nature of our tool with regard to system-level configuration, in contrast to physics-accurate simulations, allows simplified data to be generated for research and education at different levels. Owing to the reduced complexity, REDVID is computationally efficient, which we showcase with cost figures for a multitude of simulation benchmarks. As a simulation and a generative tool for ML-assisted solution design, REDVID is highly flexible, reusable, and open source. Reference data sets generated with REDVID are publicly available. Data generated using REDVID has enabled the rapid development of multiple novel ML model designs, work that is currently ongoing.
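
    The idea of a complexity-reduced, fully parametric event simulator can be illustrated with a toy sketch: straight-line "tracks" crossing a configurable set of concentric 2D detector layers produce labelled hits that an ML model could be trained on. The geometry and interface below are invented for illustration and are not REDVID's.

        import numpy as np

        def simulate_event(n_tracks=5, layer_radii=(1.0, 2.0, 3.0, 4.0), seed=0):
            """Generate labelled (track_id, x, y) hits for one simplified collision event."""
            rng = np.random.default_rng(seed)
            angles = rng.uniform(0.0, 2.0 * np.pi, size=n_tracks)   # one direction per track
            hits = []
            for track_id, phi in enumerate(angles):
                for r in layer_radii:
                    # A straight track from the origin crosses every layer exactly once.
                    hits.append((track_id, r * np.cos(phi), r * np.sin(phi)))
            return hits

        hits = simulate_event()
        print(len(hits), hits[0])   # 20 labelled hits; the labels make the data usable for supervised ML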