957 research outputs found

    LIPIcs, Volume 261, ICALP 2023, Complete Volume

    Get PDF
    LIPIcs, Volume 261, ICALP 2023, Complete Volume

    Constraint-based simulation of virtual crowds

    Get PDF
    Central to simulating pedestrian crowds is their motion and behaviour. Understanding how pedestrians move is required to simulate and predict scenarios involving crowds of people. Pedestrian behaviours enhance the range of motions people can demonstrate, resulting in greater variety, believability, and accuracy. Models with complex computations and motion are difficult to extend with additional behaviours, because their structure is not designed in a way that is generally compatible with collision avoidance behaviours. To address this issue, this thesis will research a pedestrian model that can simulate collision response alongside a wide range of additional behaviours. The model will do so by using constraints: limits on the velocity of a person's movement. The proposed model will use constraints as its core computation; by describing behaviours in terms of constraints, those behaviours can be combined with the proposed model. Pedestrian simulations strike a balance between model complexity and runtime speed. Some models focus entirely on the complexity and accuracy of people, while others focus on creating believable yet lightweight and performant simulations. Believable crowds look realistic to human observation but do not hold up to numerical analysis under scrutiny. The larger the population, and the more complex the motion of people, the slower the simulation runs. One route for improving the performance of software is to use Graphics Processing Units (GPUs), devices whose theoretical performance far exceeds that of equivalent multi-core CPUs. The research literature tends to focus on either the accuracy or the performance optimisations of pedestrian crowd simulations, which suggests there is an opportunity to create more accurate models that run relatively quickly. Real time is a useful measure of model runtime: a simulation that runs in real time can be interactive and respond live to user input. Increasing the performance of the model allows larger and more complex models to be simulated, which in turn increases the range of applications the model can represent. This thesis will develop a performant pedestrian simulation that runs in real time. It will explore how suitable the model is for GPU acceleration, and what performance gains can be obtained by implementing the model on the GPU.
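
    The abstract does not spell out its constraint formulation; purely as an illustrative sketch (the half-plane representation, projection scheme, and numbers below are assumptions, not the thesis's model), a velocity constraint can be treated as a half-plane in velocity space and several constraints combined by projecting the preferred velocity onto the violated ones:

        import numpy as np

        def project_velocity(preferred, constraints, sweeps=10):
            """Resolve a preferred velocity against half-plane velocity constraints.

            Each constraint is a pair (normal, offset) requiring dot(normal, v) <= offset,
            e.g. a collision-avoidance limit on velocity in a given direction. The
            preferred velocity is repeatedly projected onto each violated half-plane,
            so the result approximately satisfies all constraints at once.
            """
            v = np.asarray(preferred, dtype=float)
            for _ in range(sweeps):              # a few sweeps suffice for this demo
                for normal, offset in constraints:
                    n = np.asarray(normal, dtype=float)
                    excess = np.dot(n, v) - offset
                    if excess > 0.0:             # constraint violated: project onto its boundary
                        v = v - excess * n / np.dot(n, n)
            return v

        # A pedestrian wanting to walk right at 1.4 m/s, limited by a speed cap along x
        # and a constraint steering away from a nearby neighbour (illustrative values).
        v = project_velocity([1.4, 0.0],
                             [(np.array([1.0, 0.0]), 1.2),    # at most 1.2 m/s along +x
                              (np.array([0.6, 0.8]), 0.5)])   # keep clear of a neighbour
        print(v)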

    Analysing and Reducing Costs of Deep Learning Compiler Auto-tuning

    Get PDF
    Deep Learning (DL) is significantly impacting many industries, including automotive, retail and medicine, enabling autonomous driving, recommender systems and genomics modelling, amongst other applications. At the same time, demand for complex and fast DL models is continually growing. The most capable models tend to exhibit the highest operational costs, primarily due to their large computational resource footprint and inefficient utilisation of the computational resources employed by DL systems. In an attempt to tackle these problems, DL compilers and auto-tuners emerged, automating the traditionally manual task of DL model performance optimisation. While auto-tuning improves model inference speed, it is a costly process, which limits its wider adoption within DL deployment pipelines. The high operational costs associated with DL auto-tuning have multiple causes. During operation, DL auto-tuners explore large search spaces consisting of billions of tensor programs to propose potential candidates that improve DL model inference latency. Subsequently, DL auto-tuners measure candidate performance in isolation on the target device, which constitutes the majority of auto-tuning compute time. Suboptimal candidate proposals, combined with their serial measurement on an isolated target device, lead to prolonged optimisation time and reduced resource availability, ultimately reducing the cost-efficiency of the process. In this thesis, we investigate the reasons behind prolonged DL auto-tuning and quantify their impact on optimisation costs, revealing directions for improved DL auto-tuner design. Based on these insights, we propose two complementary systems: Trimmer and DOPpler. Trimmer improves tensor program search efficacy by filtering out poorly performing candidates, and controls end-to-end auto-tuning using cost objectives that monitor optimisation cost. Simultaneously, DOPpler breaks long-held assumptions about serial candidate measurement by successfully parallelising measurements intra-device, with minimal penalty to optimisation quality. Through extensive experimental evaluation of both systems, we demonstrate that they significantly improve the cost-efficiency of auto-tuning (by up to 50.5%) across a plethora of tensor operators, DL models, auto-tuners and target devices.
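
    Trimmer and DOPpler are not described at API level here; the sketch below only illustrates, under assumed names and signatures, the two generic ideas in the abstract: filtering candidate tensor programs with a cheap cost-model score before expensive on-device measurement, and stopping once a measurement budget (a cost objective) is spent. It is not the systems' actual interface.

        import heapq

        def tune(candidates, predict_score, measure_latency,
                 keep_top_k=32, measurement_budget_s=600.0):
            """Generic cost-aware auto-tuning loop (illustrative only).

            candidates           -- iterable of candidate tensor programs
            predict_score        -- cheap cost-model estimate (higher = likely faster)
            measure_latency      -- expensive on-device measurement, returns
                                    (latency_s, measurement_cost_s)
            keep_top_k           -- only the most promising candidates are measured
            measurement_budget_s -- total measurement time allowed (the cost objective)
            """
            # Filter: rank candidates by the cost model and keep only the top-k.
            shortlist = heapq.nlargest(keep_top_k, candidates, key=predict_score)

            best, best_latency, spent = None, float("inf"), 0.0
            for cand in shortlist:
                if spent >= measurement_budget_s:   # cost objective reached: stop tuning
                    break
                latency, cost = measure_latency(cand)
                spent += cost
                if latency < best_latency:
                    best, best_latency = cand, latency
            return best, best_latency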

    2023-2024 Boise State University Undergraduate Catalog

    Get PDF
    This catalog is primarily for and directed at students. However, it serves many audiences, such as high school counselors, academic advisors, and the public. In this catalog you will find an overview of Boise State University and information on admission, registration, grades, tuition and fees, financial aid, housing, student services, and other important policies and procedures. Most of this catalog, though, is devoted to describing the various programs and courses offered at Boise State.

    General Course Catalog [2022/23 academic year]

    Get PDF
    General Course Catalog, 2022/23 academic year

    ControlPULP: A RISC-V On-Chip Parallel Power Controller for Many-Core HPC Processors with FPGA-Based Hardware-In-The-Loop Power and Thermal Emulation

    Full text link
    High-Performance Computing (HPC) processors are nowadays integrated Cyber-Physical Systems demanding complex and high-bandwidth closed-loop power and thermal control strategies. To efficiently satisfy real-time multi-input multi-output (MIMO) optimal power requirements, high-end processors integrate an on-die power controller system (PCS). While traditional PCSs are based on a simple microcontroller (MCU)-class core, more scalable and flexible PCS architectures are required to support advanced MIMO control algorithms for managing the ever-increasing number of cores, power states, and process, voltage, and temperature variability. This paper presents ControlPULP, an open-source HW/SW RISC-V parallel PCS platform consisting of a single-core MCU with fast interrupt handling, coupled with a scalable multi-core programmable cluster accelerator and a specialized DMA engine for the parallel acceleration of real-time power management policies. ControlPULP relies on FreeRTOS to schedule a reactive power control firmware (PCF) application layer. We demonstrate ControlPULP in a power management use case targeting a next-generation 72-core HPC processor. We first show that the multi-core cluster accelerates the PCF, achieving a 4.9x speedup compared to single-core execution, enabling more advanced power management algorithms within the control hyper-period at a shallow area overhead, about 0.1% of the area of a modern HPC CPU die. We then assess the PCS and PCF by designing an FPGA-based, closed-loop emulation framework that leverages the heterogeneous SoC paradigm, achieving DVFS tracking with a mean deviation within 3% of the plant's thermal design power (TDP) against a software-equivalent model-in-the-loop approach. Finally, we show that the proposed PCF compares favorably with an industry-grade control algorithm under computationally intensive workloads. (Comment: 33 pages, 11 figures)
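
    ControlPULP's power control firmware runs on a RISC-V MCU and cluster under FreeRTOS; as a loose, language-agnostic illustration only (the proportional corrections and parameter names below are assumptions, not the paper's control algorithm), one control iteration over per-core telemetry might look like:

        def control_step(freq, temp, power, f_min, f_max, t_max, tdp, k_t=0.05, k_p=0.02):
            """One simplified power/thermal control iteration over all cores (illustrative only).

            freq, temp, power -- per-core frequency (GHz), temperature (C) and power (W) readings
            Returns new per-core frequency targets, reduced when a core exceeds its thermal
            limit or when the whole chip exceeds its power budget (TDP), then clamped to
            the allowed DVFS range.
            """
            total_power = sum(power)
            chip_correction = k_p * max(0.0, total_power - tdp) / len(freq)  # shared power capping term
            new_freq = []
            for f, t in zip(freq, temp):
                f = f - k_t * max(0.0, t - t_max)   # per-core thermal correction
                f = f - chip_correction             # chip-level power capping
                new_freq.append(min(f_max, max(f_min, f)))
            return new_freq

        # Example: three cores, one running hot, with the chip slightly over budget.
        print(control_step([2.6, 2.6, 2.6], [70.0, 92.0, 75.0], [40.0, 45.0, 40.0],
                           f_min=0.8, f_max=2.6, t_max=85.0, tdp=120.0))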

    Parallel Longest Increasing Subsequence and van Emde Boas Trees

    Full text link
    This paper studies parallel algorithms for the longest increasing subsequence (LIS) problem. Let n be the input size and k be the LIS length of the input. Sequentially, LIS is a simple problem that can be solved using dynamic programming (DP) in O(n log n) work. However, parallelizing LIS is a long-standing challenge. We are unaware of any parallel LIS algorithm that has optimal O(n log n) work and non-trivial parallelism (i.e., Õ(k) or o(n) span). This paper proposes a parallel LIS algorithm that costs O(n log k) work, Õ(k) span, and O(n) space, and is much simpler than the previous parallel LIS algorithms. We also generalize the algorithm to a weighted version of LIS, which maximizes the weighted sum for all objects in an increasing subsequence. To achieve a better work bound for the weighted LIS algorithm, we designed parallel algorithms for the van Emde Boas (vEB) tree, which has the same structure as the sequential vEB tree, and supports work-efficient parallel batch insertion, deletion, and range queries. We also implemented our parallel LIS algorithms. Our implementation is lightweight, efficient, and scalable. On input size 10^9, our LIS algorithm outperforms a highly-optimized sequential algorithm (with O(n log k) cost) on inputs with k ≤ 3×10^5. Our algorithm is also much faster than the best existing parallel implementation by Shen et al. (2022) on all input instances. (Comment: to be published in Proceedings of the 35th ACM Symposium on Parallelism in Algorithms and Architectures (SPAA '23).)
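
    For reference, the sequential O(n log k) dynamic program that the abstract compares against can be sketched with a patience-style binary search; this is the standard textbook baseline, not the paper's parallel algorithm or its vEB-tree machinery.

        from bisect import bisect_left

        def lis_length(a):
            """Length of the longest strictly increasing subsequence, in O(n log k) time.

            tails[i] holds the smallest possible tail of an increasing subsequence
            of length i+1 seen so far; each element either extends the longest
            subsequence found so far or tightens one of the tails.
            """
            tails = []
            for x in a:
                i = bisect_left(tails, x)   # first tail >= x
                if i == len(tails):
                    tails.append(x)         # x extends the longest subsequence
                else:
                    tails[i] = x            # x gives a smaller tail for length i+1
            return len(tails)

        assert lis_length([3, 1, 4, 1, 5, 9, 2, 6]) == 4   # e.g. 1, 4, 5, 9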

    Time series analysis acceleration with advanced vectorization extensions

    Get PDF
    Time series analysis is an important research topic and a key step in monitoring and predicting events in many fields. Recently, the Matrix Profile method, and particularly two of its Euclidean-distance-based implementations—SCRIMP and SCAMP—have become the state-of-the-art approaches in this field. Those algorithms bring the possibility of obtaining exact motifs and discords from a time series, which can be used to infer events, predict outcomes, detect anomalies and more. While the matrix profile is embarrassingly parallelizable, we find that auto-vectorization techniques fail to fully exploit the SIMD capabilities of modern CPU architectures. In this paper, we develop custom-vectorized SCRIMP and SCAMP implementations based on AVX2 and AVX-512 extensions, which we combine with multithreading techniques aimed at exploiting the potential of the underlying architectures. Our experimental evaluation, conducted using real data, shows a performance improvement of more than 4× with respect to auto-vectorization. This work has been supported by the Government of Spain under project PID2019-105396RB-I00, and Junta de Andalucía under projects P18-FR-3433 and UMA18-FEDERJA-197.
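
    The AVX2/AVX-512 kernels themselves are not shown in the abstract; as a rough illustration only (NumPy stands in for the hand-written SIMD code, and the helper below is hypothetical), the core primitive that SCRIMP/SCAMP vectorise, the z-normalised distance profile of one query subsequence against every subsequence of a series, can be written data-parallel as:

        import numpy as np
        from numpy.lib.stride_tricks import sliding_window_view

        def distance_profile(ts, query):
            """z-normalised Euclidean distances from `query` to every subsequence of `ts`.

            Each row of `windows` is one subsequence; all distances are computed in a
            single vectorised pass, conceptually the per-lane work the paper maps onto
            AVX2/AVX-512 registers.
            """
            q = np.asarray(query, dtype=float)
            m = len(q)
            windows = sliding_window_view(np.asarray(ts, dtype=float), m)
            q = (q - q.mean()) / q.std()
            w = (windows - windows.mean(axis=1, keepdims=True)) / windows.std(axis=1, keepdims=True)
            return np.linalg.norm(w - q, axis=1)

        # Self-match demo: the best match for the subsequence starting at 100 is index 100.
        ts = np.sin(np.linspace(0, 20, 500)) + 0.05 * np.random.randn(500)
        print(distance_profile(ts, ts[100:150]).argmin())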

    Approaches to Conflict-free Replicated Data Types

    Full text link
    Conflict-free Replicated Data Types (CRDTs) allow optimistic replication in a principled way. Different replicas can proceed independently, being available even under network partitions, and always converging deterministically: replicas that have received the same updates will have equivalent state, even if received in different orders. After a historical tour of the evolution from sequential data types to CRDTs, we present in detail the two main approaches to CRDTs, operation-based and state-based, including two important variations, the pure operation-based and the delta-state based. Intended as a tutorial for prospective CRDT researchers and designers, it provides solid coverage of the essential concepts, clarifying some misconceptions which frequently occur, but also presents some novel insights gained from considerable experience in designing both specific CRDTs and approaches to CRDTs. (Comment: 36 pages)
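
    As a concrete example of the state-based approach, the classic grow-only counter (a textbook CRDT, not code taken from the paper) keeps one entry per replica and merges states by element-wise maximum, so replicas converge regardless of the order in which updates arrive:

        class GCounter:
            """State-based grow-only counter CRDT.

            Each replica only increments its own entry; `merge` takes the element-wise
            maximum of two states, which is commutative, associative and idempotent,
            so any two replicas that have exchanged states converge to the same value.
            """
            def __init__(self, replica_id):
                self.replica_id = replica_id
                self.counts = {}                 # replica id -> local increment count

            def increment(self, n=1):
                self.counts[self.replica_id] = self.counts.get(self.replica_id, 0) + n

            def merge(self, other):
                for rid, c in other.counts.items():
                    self.counts[rid] = max(self.counts.get(rid, 0), c)

            def value(self):
                return sum(self.counts.values())

        a, b = GCounter("a"), GCounter("b")
        a.increment(2); b.increment(3)
        a.merge(b); b.merge(a)
        assert a.value() == b.value() == 5       # both replicas converge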