27 research outputs found

    Execution-time Prediction for Dynamic Streaming Applications with Task-level Parallelism

    Programmable multiprocessor systems-on-chip are becoming the preferred implementation platform for embedded streaming applications. This enables using more software components, which leads to large and frequent dynamic variations of data-dependent execution times. In this context, accurate and conservative prediction of execution times helps in maintaining good audio/video quality and reducing energy consumption by dynamic evaluation of the amount of on-chip resources needed by applications. To be effective, multiprocessor systems have to employ the available parallelism. The combination of task-level parallelism and task delay variations makes predicting execution times a very hard problem. So far, under these conditions, no appropriate techniques exist for the conservative prediction of execution times with the required accuracy. In this paper, we present a novel technique for this problem, exploiting the concept of scenario-based prediction, and taking into account the transient and periodic behavior of scenarios and the effect of scenario transitions. In our MPEG-4 shape-decoder case study, we observe no more than 11% average overestimation

    Heterogeneous multiprocessor for the management of real-time video and graphics streams

    This paper presents an application domain driven approach to the design of embedded systems on silicon, and it shows how this approach is used to design a chip for a multi-window TV application. We discuss all major design steps in a logical order, starting with an application domain analysis. This leads to the choice of Kahn data flow graphs as the programming paradigm for high-throughput signal applications. Based on this analysis, we designed a multiprocessor architecture which uses run-time reconfiguration. Finally, attention is directed towards the physical implementation and the deep-submicron problems we had to solve. The result is a chip that can manage up to 25 internal real-time video streams. The chip combines the flexibility of a programmable solution with the cost effectiveness of a consumer product

    A scalable implementation of a reconfigurable WCDMA rake receiver

    The demands in terms of processing performance, communication bandwidth and real-time throughput of new generation mobile communication applications (mobile and base-stations) are much higher than today's programmable processing architectures can deliver. On the other hand, standards and market uncertainties, non-recurring engineering costs, and lack of access to (or knowledge of) application IP will require the next generation of embedded computing platforms to be fully programmable. In terms of silicon cost and power, practical yet fully programmable embedded computing platforms are enabled by reconfigurable processors that replace fixed ASICs in current standard platforms. This paper explains the concepts behind a novel reconfigurable WCDMA Rake receiver and gives benchmark results. The proposed Rake receiver enables a high performance, yet flexible computing platform for WCDMA

    Constraint analysis for DSP code generation

    Code generation methods for digital signal processing (DSP) applications are hampered by the combination of tight timing constraints imposed by the performance requirements of DSP algorithms and resource constraints imposed by a hardware architecture. In this paper, we present a method for register binding and instruction scheduling based on the exploitation and analysis of the combination of resource and timing constraints. The analysis identifies implicit sequencing relations between operations in addition to the precedence constraints. Without the explicit modeling of these sequencing constraints, a scheduler is often not capable of finding a solution that satisfies the timing and resource constraints. The presented approach results in an efficient method to obtain high-quality instruction schedules with low register requirements

    On resource estimation of MPEG-4 video decoding for a multiprocessor architecture

    This paper addresses an efficient implementation of new emerging video algorithms like the coding of arbitrarily shaped video objects in the new MPEG-4 standard. This type of advanced multimedia application poses challenging requirements on embedded systems design with respect to decomposition and scalability, in order to meet real-time constraints. We study the design of networks-on-chip (NoC), which intrinsically satisfies these requirements [5]. A job scheduler needs to know the worst-case execution time (WCET) of a starting job to ensure that the job can meet its timing constraints. For the purpose of timing analysis, such as computing the WCET, a timing model has been applied which has a linear dependence on a set of input-dependent data parameters. We derive a linear timing model for MPEG-4 video object decoding from a running executable specification. Our timing model is computed and verified with an instruction-set simulator of a RISC processor element containing a flat local memory model. The derived model is accurate within 6% for the average execution time
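    As a rough illustration of what a timing model with linear dependence on input-dependent parameters looks like, the sketch below evaluates such a model and compares it with a measured cycle count. It is not the paper's model: the parameter meanings, the coefficients, and the measured value are all hypothetical and chosen only to make the arithmetic easy to follow.

```python
def predict_cycles(base, coeffs, params):
    """Evaluate a linear timing model: base + sum(coeff_i * param_i)."""
    if len(coeffs) != len(params):
        raise ValueError("coeffs and params must have the same length")
    return base + sum(c * p for c, p in zip(coeffs, params))


def relative_error(predicted, measured):
    """Relative deviation of the prediction from a measured value."""
    return abs(predicted - measured) / measured


# Hypothetical model: fixed overhead plus a per-block cost for two block types.
base = 1000            # assumed fixed cycles per job
coeffs = [120, 45]     # assumed cycles per block of each type
params = [30, 10]      # assumed input-dependent block counts for one job

predicted = predict_cycles(base, coeffs, params)   # 1000 + 3600 + 450
measured = 5000                                    # assumed simulator measurement
error = relative_error(predicted, measured)
print(f"predicted={predicted} cycles, relative error={error:.1%}")
```

    In practice the coefficients would be fitted to measurements, for example from an instruction-set simulator as the abstract describes; the fitting step is omitted here.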

    Parallel implementation of arbitrary-shaped MPEG-4 decoder for multiprocessor systems

    MPEG-4 is the first standard that combines synthetic objects, like 2D/3D graphics objects, with natural rectangular and non-rectangular video objects. The independent access to individual synthetic video objects for further manipulation creates a large space for future applications. This paper addresses the optimization of such complex multimedia algorithms for implementation on multiprocessor platforms. It is shown that when choosing the correct granularity of processing for enhanced parallelism and splitting time-critical tasks, a substantial improvement in processing efficiency can be obtained. In our work, we focus on the decoder for non-rectangular (also called arbitrary-shaped) video objects. In previous work, we motivated the use of a multiprocessor System-on-Chip (SoC) setup that satisfies the requirements on the overall computation capacity. We propose an optimization of the MPEG-4 algorithm to increase the decoding throughput and to use the multiprocessor architecture more efficiently. First, we present a modification of the Repetitive Padding to increase the pipelining at block level. We identified the part of the padding algorithm that can be executed in parallel with the DCT-coefficient decoding and modified the original algorithm into two communicating tasks. Second, we introduce a synchronization mechanism that allows the processing for the Extended Padding and postprocessing (Deblocking & Deringing) filters at block level. The first optimization results in about a 58% decrease in the computational requirements of the original Repetitive-Padding task. By introducing the previously proposed data-level parallelism and exploiting the inherent parallelism between the separated color components (Y, Cr, Cb), the computational savings are about 72% on average. Moreover, the proposed optimizations reduce the processing latency from the order of a frame to the order of a slice

    Efficient timing constraint derivation for optimally retiming high speed processing units

    Retiming, including pipelining, is applied to make the processing units (PUs) run at a required throughput rate with a minimum number of registers. In the first step, a timing analysis of a PU is performed which results in inequality constraints on the operations' retimings. The constraints, together with a cost function expressing the number of registers in a retimed PU, form an instance of an integer linear programming problem, which is solved to optimality in the second step. In this paper, we concentrate on the constraint derivation task. We present two new constraint derivation algorithms, one of which is more memory efficient and the other more run-time efficient. We show that the run-time efficient algorithm makes it possible to minimize the area of a huge standard cell network, possibly representing a complete IC, within acceptable run-time limits
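    The constraint-plus-cost structure described in this abstract can be illustrated with the classic Leiserson-Saxe retiming formulation, which is assumed here and may differ from the paper's own derivation. Each edge (u, v) carrying w registers carries w + r(v) - r(u) registers after retiming, a retiming is legal when no edge weight becomes negative, and the cost is the total register count. The graph and retiming values below are hypothetical, and the clock-period constraints that come from timing analysis are left out.

```python
def retimed_weights(edges, r):
    """Register count on each edge after retiming: w + r[v] - r[u]."""
    return [(u, v, w + r[v] - r[u]) for (u, v, w) in edges]


def is_legal(edges, r):
    """A retiming is legal if no edge ends up with a negative register count."""
    return all(w >= 0 for (_, _, w) in retimed_weights(edges, r))


def register_count(edges, r):
    """Cost function: total number of registers in the retimed graph."""
    return sum(w for (_, _, w) in retimed_weights(edges, r))


# Hypothetical graph: operation "a" fans out to "b" and "c" through one register each.
edges = [("s", "a", 0), ("a", "b", 1), ("a", "c", 1)]

no_retiming = {"s": 0, "a": 0, "b": 0, "c": 0}
merge_at_input = {"s": 0, "a": 1, "b": 0, "c": 0}   # move both registers in front of "a"
too_far = {"s": 0, "a": 2, "b": 0, "c": 0}          # would need negative registers

print(register_count(edges, no_retiming))      # registers before retiming
print(register_count(edges, merge_at_input))   # registers after retiming
print(is_legal(edges, too_far))                # legality of the over-aggressive retiming
```

    An integer linear programming solver would search over r for the legal assignment with the lowest register count; this sketch only evaluates given candidates.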