775 research outputs found

    Multirate as a hardware paradigm

    Get PDF
    Journal ArticleArchitecture and circuit design are the two most effective means of reducing power in CMOS VLSI. Mathematical manipulations, based on applying ideas from multirate signal processing have been applied to create high performance, low power architectures. To illustrate this approach, two case studies are presented - one concerns the design of a fast Fourier transforms(FFT) device, while the other one is concerned with the design of analog-to-digital converters

    Improving MPI Threading Support for Current Hardware Architectures

    Get PDF
    Threading support for Message Passing Interface (MPI) has been defined in the MPI standard for more than twenty years. While many standard-compliance MPI implementations fully support multithreading, the threading support in MPI still cannot provide the optimal performance on the same level as the non-threading environment. The performance disparity leads to low adoption rate from applications, and eventually, lesser interest in optimizing MPI threading support. However, with the current advancement in computation hardware, the number of CPU core per packet is growing drastically. Using shared-memory MPI communication has become more costly. MPI threading without local communication is one of the alternatives and the some interests are shifting back toward threading to MPI.In this work, we investigate different approaches to leverage the power of thread parallelism and tools to help us to raise the multi-threaded MPI performance to reasonable level. We propose a novel multi-threaded MPI benchmark with multiple communication patterns to stress multiple points of the MPI implementation, with the ability to switch between using MPI process and threads for quick comparison between two modes. Enabling the us, and the others MPI developers to stress test their implementation design.We address the interoperability between MPI implementation and threading frameworks by introducing the thread-synchronization object, an object that gives the MPI implementation more control over user-level thread, allowing for more thread utilization in MPI. In our implementation, the synchronization object relieves the lock contention on the internal progress engine and able to achieve up to 7x the performance of the original implementation. Moving forward, we explore the possibility of harnessing the true thread concurrency. We proposed several strategies to address the bottlenecks in MPI implementation. From our evaluation, with our novel threading optimization, we can achieve up to 22x the performance comparing to the legacy MPI designs

    Algorithms and architectures for the multirate additive synthesis of musical tones

    Get PDF
    In classical Additive Synthesis (AS), the output signal is the sum of a large number of independently controllable sinusoidal partials. The advantages of AS for music synthesis are well known as is the high computational cost. This thesis is concerned with the computational optimisation of AS by multirate DSP techniques. In note-based music synthesis, the expected bounds of the frequency trajectory of each partial in a finite lifecycle tone determine critical time-invariant partial-specific sample rates which are lower than the conventional rate (in excess of 40kHz) resulting in computational savings. Scheduling and interpolation (to suppress quantisation noise) for many sample rates is required, leading to the concept of Multirate Additive Synthesis (MAS) where these overheads are minimised by synthesis filterbanks which quantise the set of available sample rates. Alternative AS optimisations are also appraised. It is shown that a hierarchical interpretation of the QMF filterbank preserves AS generality and permits efficient context-specific adaptation of computation to required note dynamics. Practical QMF implementation and the modifications necessary for MAS are discussed. QMF transition widths can be logically excluded from the MAS paradigm, at a cost. Therefore a novel filterbank is evaluated where transition widths are physically excluded. Benchmarking of a hypothetical orchestral synthesis application provides a tentative quantitative analysis of the performance improvement of MAS over AS. The mapping of MAS into VLSI is opened by a review of sine computation techniques. Then the functional specification and high-level design of a conceptual MAS Coprocessor (MASC) is developed which functions with high autonomy in a loosely-coupled master- slave configuration with a Host CPU which executes filterbanks in software. Standard hardware optimisation techniques are used, such as pipelining, based upon the principle of an application-specific memory hierarchy which maximises MASC throughput

    Design and Implementation of a Motif-based Compression Algorithm for Biometric Signals

    Get PDF
    Wearable devices are becoming a natural and economic means to gather biometric data from users: this thesis is centered around lossy data compression techniques, whose aim is to minimize the amount of information that is to be stored on their onboard memory and subsequently transmitted over wireless interfaces. A new class of codebook based (CB) compression algorithms is proposed, designed to be energy efficient, online and amenable to any type of signal exhibiting recurrent patternsope

    The Wavelet Transform for Image Processing Applications

    Get PDF

    Providing Insight into the Performance of Distributed Applications Through Low-Level Metrics

    Get PDF
    The field of high-performance computing (HPC) has always dealt with the bleeding edge of computational hardware and software to achieve the maximum possible performance for a wide variety of workloads. When dealing with brand new technologies, it can be difficult to understand how these technologies work and why they work the way they do. One of the more prevalent approaches to providing insight into modern hardware and software is to provide tools that allow developers to access low-level metrics about their performance. The modern HPC ecosystem supports a wide array of technologies, but in this work, I will be focusing on two particularly influential technologies: The Message Passing Interface (MPI), and Graphical Processing Units (GPUs).For many years, MPI has been the dominant programming paradigm in HPC. Indeed, over 90% of applications that are a part of the U.S. Exascale Computing Project plan to use MPI in some fashion. The MPI Standard provides programmers with a wide variety of methods to communicate between processes, along with several other capabilities. The high-level MPI Profiling Interface has been the primary method for profiling MPI applications since the inception of the MPI Standard, and more recently the low-level MPI Tool Information Interface was introduced.Accelerators like GPUs have been increasingly adopted as the primary computational workhorse for modern supercomputers. GPUs provide more parallelism than traditional CPUs through a hierarchical grid of lightweight processing cores. NVIDIA provides profiling tools for their GPUs that give access to low-level hardware metrics.In this work, I propose research in applying low-level metrics to both the MPI and GPU paradigms in the form of an implementation of low-level metrics for MPI, and a new method for analyzing GPU load imbalance with a synthetic efficiency metric. I introduce Software-based Performance Counters (SPCs) to expose internal metrics of the Open MPI implementation along with a new interface for exposing these counters to users and tool developers. I also analyze a modified load imbalance formula for GPU-based applications that uses low-level hardware metrics provided through nvprof in a hierarchical approach to take the internal load imbalance of the GPU into account

    Time in SCCharts

    Get PDF
    Synchronous languages, such as the recently proposed SCCharts language, have been designed for the rigorous specification of real-time systems. Their sound semantics, which builds on an abstraction from physical execution time, make these languages appealing, in particular for safety-critical systems. However, they traditionally lack built-in support for physical time. This makes it rather cumbersome to express things like time-outs or periodic executions within the language. We here propose several mechanisms to reconcile the synchronous paradigm with physical time. Specifically, we propose extensions to the SCCharts language to express clocks and execution periods within the model. We draw on several sources, in particular timed automata, the Clock Constraint Specification Language, and the recently proposed concept of dynamic ticks. We illustrate how these extensions can be mapped to the SCChart language core, with minimal requirements on the run-time system, and we argue that the same concepts could be applied to other synchronous languages such as Esterel, Lustre or SCADE
    • …
    corecore