
    Exploiting Fine-Grain Ordered Parallelism in Dense Matrix Algorithms

    Dense linear algebra kernels are critical for wireless applications, and the oncoming proliferation of 5G only amplifies their importance. Many such matrix algorithms are inductive and exhibit ample fine-grain ordered parallelism: multiple computations flow with fine-grain producer/consumer dependences, and the iteration domain is not easily tileable. Synchronization overheads make multi-core parallelism ineffective, and the non-tileable iterations make the vector-VLIW approach less effective, especially for the typically modest-sized matrices. Because CPUs and DSPs lose an order of magnitude in performance and hardware utilization, costly and inflexible ASICs are often employed in signal processing pipelines; a programmable accelerator with similar performance, power, and area would be highly desirable. We find that fine-grain ordered parallelism can be exploited by supporting: 1) fine-grain stream-based communication and synchronization; 2) inductive data-reuse and memory access patterns; 3) implicit vector masking for partial vectors; and 4) hardware specialization for dataflow criticality. In this work we propose REVEL, a next-generation DSP architecture. It supports the above features in its ISA and microarchitecture, and further uses a novel vector-stream control paradigm to reduce control overheads. Across a suite of linear algebra kernels, REVEL outperforms equally provisioned DSPs by 4.6x-37x in latency and achieves 8.3x higher performance per mm². It requires only 2.2x more power to match the performance of ideal ASICs, at about 55% of their combined area.
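
    To make "fine-grain ordered parallelism" concrete, the sketch below (ours, not from the paper) shows forward substitution, a typical inductive kernel: each result feeds all later iterations through a fine-grain producer/consumer chain, and the triangular iteration domain resists tiling into fixed-size vectors.

    ```python
    import numpy as np

    def forward_substitution(L, b):
        """Solve L @ x = b for lower-triangular L.

        Each x[i] depends on all earlier x[j] (j < i) through a
        fine-grain producer/consumer chain, and the inner loop's trip
        count grows with i, so the iteration domain is not tileable.
        """
        n = len(b)
        x = np.zeros(n)
        for i in range(n):            # consumer of x[0..i-1]
            s = b[i]
            for j in range(i):        # inductive (growing) inner loop
                s -= L[i, j] * x[j]
            x[i] = s / L[i, i]        # producer for all later iterations
        return x

    L = np.tril(np.random.rand(6, 6)) + 6 * np.eye(6)
    b = np.random.rand(6)
    assert np.allclose(forward_substitution(L, b), np.linalg.solve(L, b))
    ```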

    A Scalable Platform for Distributed Object Tracking across a Many-camera Network

    Advances in deep neural networks (DNN) and computer vision (CV) algorithms have made it feasible to extract meaningful insights from large-scale deployments of urban cameras. Tracking an object of interest across the camera network in near real-time is a canonical problem. However, current tracking platforms have two key limitations: 1) they are monolithic and proprietary, and lack the ability to rapidly incorporate sophisticated tracking models; and 2) they are less responsive to dynamism across wide-area computing resources that include edge, fog, and cloud abstractions. We address these gaps using Anveshak, a runtime platform for composing and coordinating distributed tracking applications. It provides a domain-specific dataflow programming model to intuitively compose a tracking application, supporting contemporary CV advances like query fusion and re-identification, and enabling dynamic scoping of the camera network's search space to avoid wasted computation. We also offer tunable batching and data-dropping strategies for dataflow blocks deployed on distributed resources to respond to network and compute variability; these strategies balance tracking accuracy, real-time performance, and the active camera-set size. We illustrate the concise expressiveness of the programming model for 44 tracking applications. Our detailed experiments for a network of 1000 camera feeds on modest resources exhibit the tunable scalability, performance, and quality trade-offs enabled by our dynamic tracking, batching, and dropping strategies.
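
    As an illustration of the tunable batching and data-dropping strategies described above, here is a minimal sketch (hypothetical; Anveshak's actual API is not shown in the abstract) of a dataflow block that batches its inputs and drops the oldest item under backlog:

    ```python
    from collections import deque

    class TunableBlock:
        """Hypothetical dataflow block with tunable batching/dropping."""

        def __init__(self, fn, batch_size=4, max_queue=32):
            self.fn = fn
            self.batch_size = batch_size  # amortizes per-call overhead
            self.max_queue = max_queue    # bounds latency under load
            self.queue = deque()

        def push(self, item):
            if len(self.queue) >= self.max_queue:
                self.queue.popleft()      # drop the oldest item
            self.queue.append(item)

        def step(self):
            n = min(self.batch_size, len(self.queue))
            batch = [self.queue.popleft() for _ in range(n)]
            return self.fn(batch) if batch else None
    ```

    Raising batch_size trades latency for throughput, while lowering max_queue trades completeness for responsiveness, mirroring the accuracy/performance/camera-set trade-offs the platform exposes.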

    On reconfigurable tiled multi-core programming

    A hierarchical reconfigurable tiled architecture has been proposed for a generic, flexible, and efficient array antenna receiver platform. The architecture provides a flexible reconfigurable solution, but partitioning, mapping, modelling, and programming such systems remains an issue. A semantic model has been presented to support the specification, design, and implementation. The semantic model is used for partitioning the application, evaluating the consequences, and mapping to an architecture; design space exploration allows us to adapt the partitioning and mapping to an architecture, or vice versa.

    With tiled reconfigurable cores as the basis for the architecture, this paper explores the different options for processing cores and their suitability with respect to the design flow of the semantic model approach. Trade-offs in granularity, depending on flexibility and efficiency, allow interesting design evaluations, especially for programmability. This work therefore represents an important step forward in the design flow for designing and using multi-core tiled architectures.

    Parallel Wavelet Schemes for Images

    In this paper, we introduce several new schemes for the calculation of discrete wavelet transforms of images. These schemes reduce the number of steps and, as a consequence, reduce the number of synchronizations on parallel architectures. As an additional useful property, the proposed schemes can also reduce the number of arithmetic operations. The schemes are primarily demonstrated on the CDF 5/3 and CDF 9/7 wavelets employed in the JPEG 2000 image compression standard. However, the presented method is general and can be applied to any wavelet transform. As a result, our scheme requires only two memory barriers for the 2-D CDF 5/3 transform, compared to four barriers in the original separable form or three barriers in a recently published non-separable scheme. Our reasoning is supported by exhaustive experiments on high-end graphics cards.
    Comment: This is a preprint of the article that appeared in the Journal of Real-Time Image Processing. The final publication is available at Springer via http://doi.org/10.1007/s11554-016-0646-
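
    For reference, the sketch below (ours) gives one 1-D level of the reversible CDF 5/3 lifting transform in its standard two-step JPEG 2000 form; on a GPU each lifting step is one parallel pass, so the step count sets the number of synchronization barriers that the paper's reorganized schemes reduce.

    ```python
    import numpy as np

    def cdf53_forward(x):
        """One level of the reversible CDF 5/3 lifting transform.

        Predict the odd samples from their even neighbours, then update
        the even samples from the new details. Assumes len(x) is even
        and uses symmetric boundary extension as in JPEG 2000.
        """
        x = np.asarray(x, dtype=np.int64)
        even, odd = x[0::2].copy(), x[1::2].copy()

        # Predict: d[n] = x[2n+1] - floor((x[2n] + x[2n+2]) / 2)
        right = np.append(even[1:], even[-1])   # symmetric extension
        d = odd - (even + right) // 2

        # Update: s[n] = x[2n] + floor((d[n-1] + d[n] + 2) / 4)
        left = np.insert(d[:-1], 0, d[0])       # symmetric extension
        s = even + (left + d + 2) // 4
        return s, d                             # approximation, detail
    ```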

    Current Trends and Future Research Directions for Interactive Music

    This review explains and compares the different software and formalisms used in music interaction: sequencers, computer-assisted improvisation, meta-instruments, score following, asynchronous dataflow languages, synchronous dataflow languages, process calculi, temporal constraints, and interactive scores. Formal approaches have the advantage of providing rigorous semantics of the behavior of the model and proving correctness during execution. The main disadvantage of formal approaches is the lack of commercial tools.

    A Coprocessor for Accelerating Visual Information Processing

    Visual information processing will play an increasingly important role in future electronics systems. In many applications, e.g. video surveillance cameras, the data throughput of microprocessors is not sufficient and the power consumption is too high. Instruction profiling on a typical test algorithm has shown that pixel address calculations are the dominant operations to be optimized. Therefore AddressLib, a structured scheme for pixel addressing, was developed that can be accelerated by AddressEngine, a coprocessor for visual information processing. In this paper, the architectural design of AddressEngine is described, which in the first step supports a subset of AddressLib. Dataflow and memory organization are optimized during architectural design. AddressEngine was implemented on an FPGA and tested with the MPEG-7 Global Motion Estimation algorithm. Results on processing speed and circuit complexity are given and compared to a pure software implementation. The next step will be support for the full AddressLib, including segment addressing. An outlook on further investigations into dynamic reconfiguration capabilities is given.
    Comment: Submitted on behalf of EDAA (http://www.edaa.com/)
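
    The sketch below is illustrative only (the abstract does not spell out AddressLib's actual scheme); it shows the kind of row-major pixel address arithmetic that dominates such kernels and that a coprocessor can generate in hardware.

    ```python
    def pixel_address(base, stride, x, y, bpp=1):
        """Row-major pixel address: base + y*stride + x*bpp."""
        return base + y * stride + x * bpp

    def block_addresses(base, stride, x0, y0, w, h):
        """Addresses of a w x h block, e.g. one motion-estimation
        candidate; video inner loops repeat this arithmetic per pixel."""
        return [pixel_address(base, stride, x0 + dx, y0 + dy)
                for dy in range(h) for dx in range(w)]
    ```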

    Exploring the Equivalence between Dynamic Dataflow Model and Gamma - General Abstract Model for Multiset mAnipulation

    With the increasing search for computational models in which the expression of parallelism occurs naturally, some paradigms arise as options for the next generation of computers. In this context, dynamic Dataflow and Gamma (General Abstract Model for Multiset mAnipulation) emerge as interesting computational model choices. In the dynamic Dataflow model, operations are performed as soon as their associated operands are available, without relying on a Program Counter to dictate the execution order of instructions. The Gamma paradigm is based on a parallel multiset rewriting scheme. It provides a non-deterministic execution model inspired by an abstract chemical machine metaphor, where operations are formulated as reactions that occur freely among matching elements belonging to the multiset. In this work, equivalence relations between the dynamic Dataflow and Gamma paradigms are exposed and explored, and methods to convert from the Dataflow to the Gamma paradigm and vice versa are provided. It is shown that vertices and edges of a dynamic Dataflow graph can correspond, respectively, to reactions and multiset elements in the Gamma paradigm. Implementation aspects of execution environments that could be mutually beneficial to both models are also discussed. This work offers the scientific community the possibility of profiting from both parallel programming models, contributing added versatility to researchers and developers. Finally, to the best of our knowledge, the similarity relations between the dynamic Dataflow and Gamma models presented here have not been reported in any previous work.
    Comment: Study submitted to IPDPS 2019 - IEEE International Parallel and Distributed Processing Symposium
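
    A minimal sketch (ours) of the Gamma execution model: a reaction fires non-deterministically on matching multiset elements until a stable state is reached. Under the correspondence above, the reaction acts as a Dataflow vertex and the multiset elements as tokens on its edges.

    ```python
    import random

    def gamma(multiset, reaction, condition):
        """Naive sequential emulation of a Gamma program: repeatedly
        pick a matching pair and replace it with the reaction's products
        until no pair satisfies the condition (a 'stable' state)."""
        ms = list(multiset)
        while True:
            pairs = [(i, j) for i in range(len(ms)) for j in range(len(ms))
                     if i != j and condition(ms[i], ms[j])]
            if not pairs:
                return ms
            i, j = random.choice(pairs)   # non-deterministic choice
            products = reaction(ms[i], ms[j])
            ms = [e for k, e in enumerate(ms) if k not in (i, j)] + products

    # Classic example: maximum of a multiset via (x, y) -> (x) if x >= y.
    print(gamma([3, 1, 4, 1, 5], lambda x, y: [x], lambda x, y: x >= y))  # [5]
    ```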

    Research Challenges for Heterogeneous CPS Design

    Heterogeneous computing is widely used at all levels of computing, from the data center to the edge, due to its power/performance characteristics. However, heterogeneity presents challenges. Interoperability, the management of workloads across heterogeneous resources, requires more careful design than is the case for homogeneous platforms. Cyber-physical systems present additional challenges. This article considers research challenges in heterogeneous CPS design, including interoperability, physical modeling, models of computation, self-awareness and adaptation, architecture, and scheduling.
    Comment: This is a pre-publication version of a paper that has been accepted for publication in IEEE Computer. The official/final version of the paper will be posted on IEEE Xplore.

    Multiprocessor Scheduling of a Multi-mode Dataflow Graph Considering Mode Transition Delay

    The Synchronous Data Flow (SDF) model is widely used for specifying signal processing or streaming applications. Since modern embedded applications have become more complex, with dynamic behavior changes at run time, several extensions of the SDF model have been proposed to specify these dynamic behavior changes while preserving the static analyzability of the SDF model. They assume that an application has a finite number of behaviors (or modes) and that each behavior (mode) is represented by an SDF graph; they are classified as multi-mode dataflow models in this paper. While several scheduling techniques exist for multi-mode dataflow models, none allows task migration between modes. Observing that the resource requirement can be further reduced if task migration is allowed, we propose a multiprocessor scheduling technique for a multi-mode dataflow graph that considers task migration between modes. Based on a genetic algorithm, the proposed technique schedules all SDF graphs in all modes simultaneously to minimize the resource requirement. To satisfy the throughput constraint, the proposed technique calculates the actual throughput requirement of each mode and the output buffer size needed to tolerate throughput jitter. We compare the proposed technique with a method that analyzes the SDF graphs of each execution mode separately and a method that does not allow task migration, on synthetic examples and three real applications: an H.264 decoder, a vocoder, and an LTE receiver.
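
    To ground the static analyzability that these models preserve, here is a minimal sketch (ours) of the classic SDF balance-equation solution; the resulting repetition vector underlies any per-mode throughput calculation. It assumes a connected, consistent graph.

    ```python
    from fractions import Fraction
    from math import lcm  # Python 3.9+

    def repetition_vector(edges, n):
        """Smallest positive integer firing counts r solving the SDF
        balance equations r[u]*prod = r[v]*cons for every edge.
        edges: list of (u, v, prod, cons) with vertices 0..n-1."""
        r = [None] * n
        r[0] = Fraction(1)
        changed = True
        while changed:                 # propagate rates over the graph
            changed = False
            for u, v, p, c in edges:
                if r[u] is not None and r[v] is None:
                    r[v] = r[u] * p / c
                    changed = True
                elif r[v] is not None and r[u] is None:
                    r[u] = r[v] * c / p
                    changed = True
        m = lcm(*(f.denominator for f in r))
        return [int(f * m) for f in r]

    # A -(2:3)-> B -(1:2)-> C fires in the ratio 3:2:1 per iteration.
    print(repetition_vector([(0, 1, 2, 3), (1, 2, 1, 2)], 3))  # [3, 2, 1]
    ```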

    TensorFlow: A system for large-scale machine learning

    TensorFlow is a machine learning system that operates at large scale and in heterogeneous environments. TensorFlow uses dataflow graphs to represent computation, shared state, and the operations that mutate that state. It maps the nodes of a dataflow graph across many machines in a cluster, and within a machine across multiple computational devices, including multicore CPUs, general-purpose GPUs, and custom-designed ASICs known as Tensor Processing Units (TPUs). This architecture gives flexibility to the application developer: whereas in previous "parameter server" designs the management of shared state is built into the system, TensorFlow enables developers to experiment with novel optimizations and training algorithms. TensorFlow supports a variety of applications, with particularly strong support for training and inference on deep neural networks. Several Google services use TensorFlow in production, we have released it as an open-source project, and it has become widely used for machine learning research. In this paper, we describe the TensorFlow dataflow model in contrast to existing systems and demonstrate the compelling performance that TensorFlow achieves for several real-world applications.
    Comment: 18 pages, 9 figures; v2 has a spelling correction in the metadata
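
    A minimal graph-mode example (runnable in modern TensorFlow through its tf.compat.v1 interface) of the dataflow model the paper describes: mutable shared state is a variable node in the graph, next to the ops that read and update it, rather than living in a separate parameter server.

    ```python
    import tensorflow as tf

    tf.compat.v1.disable_eager_execution()   # use the graph-based model

    # Nodes are operations, edges carry tensors, and a Variable holds
    # mutable state inside the dataflow graph itself.
    x = tf.compat.v1.placeholder(tf.float32, shape=(None, 3), name="x")
    w = tf.compat.v1.get_variable("w", shape=(3, 1))
    loss = tf.reduce_mean(tf.square(tf.matmul(x, w)))
    train = tf.compat.v1.train.GradientDescentOptimizer(0.1).minimize(loss)

    with tf.compat.v1.Session() as sess:     # maps the graph onto devices
        sess.run(tf.compat.v1.global_variables_initializer())
        sess.run(train, feed_dict={x: [[1.0, 2.0, 3.0]]})
    ```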