20 research outputs found

    ePUMA: A novel embedded parallel DSP platform for predictable computing


    The stream processing architecture

    A computer's operation is based on its processor, also known as the central processing unit (CPU). The basic structure of the CPU has remained largely the same for a long time. In recent years, however, the CPU has been joined by the graphics processing unit (GPU), which is specialized for heavy computational tasks. The GPU's operation is based on the stream processing architecture, which is very different from the CPU architecture. The purpose of this work is to explain the operating principle of the stream processing architecture and its areas of application. The work is a literature review. Based on the existing literature, it first covers the structure of the CPU and then the stream processing architecture. It examines the differences and similarities between the two structures, and also explores their strengths and weaknesses. Finally, it investigates the kinds of applications in which the stream processing architecture is worth exploiting. The work found that the CPU, thanks to its structure, is suited to all kinds of tasks. The stream processing architecture, however, is superior in heavy computational tasks that involve millions of arithmetic operations. A processor based on the stream processing architecture completes such tasks significantly faster and more efficiently than a CPU. The power of the stream processing architecture rests on its ability to exploit the mutual independence of the arithmetic operations an algorithm performs. Thanks to this independence, a large number of processing units can be placed in parallel so that they are all in use simultaneously, which yields very high computational throughput. Increasing the number of processing units is feasible because data control requires far less chip area than in a CPU; less area suffices because a processor built on the stream processing architecture is used only for tasks in which its strengths can be exploited. The stream processing architecture is used mainly in GPUs, but it can also be used in specialized processors designed for a particular application.
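The independence of per-element operations described in this abstract can be illustrated with a short sketch (mine, not from the thesis; the function names are hypothetical). When each output element depends only on its own input element, the work can be spread across any number of parallel lanes, which is exactly the property a stream processor exploits:

```python
import numpy as np

def brighten_sequential(pixels, gain):
    # CPU-style execution: one element at a time, in order.
    out = np.empty_like(pixels)
    for i in range(len(pixels)):
        out[i] = min(255, int(pixels[i] * gain))
    return out

def brighten_streaming(pixels, gain):
    # Stream-style execution: one data-parallel kernel over all elements at
    # once; NumPy's vectorized ops stand in for the parallel processing units.
    return np.minimum(255, (pixels * gain).astype(np.int64))

pixels = np.array([10, 100, 200, 250])
assert np.array_equal(brighten_sequential(pixels, 1.5),
                      brighten_streaming(pixels, 1.5))
```

Because no element of the output depends on any other, the second version can in principle run all elements simultaneously, which is why adding more lanes keeps paying off.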

    P2IP: A novel low-latency Programmable Pipeline Image Processor

    This paper presents a novel systolic Coarse-Grained Reconfigurable Architecture for real-time image and video processing called P2IP. The P2IP is a scalable architecture that combines the low-latency characteristic of systolic array architectures with a runtime-reconfigurable datapath. Reconfigurability of the P2IP enables it to perform a wide range of image pre-processing tasks directly on a pixel stream. The versatility of the P2IP is demonstrated through three image processing algorithms mapped onto the architecture and implemented on an FPGA-based platform. The obtained results show that the P2IP can achieve up to 129 fps in Full HD 1080p and 32 fps in 4K 2160p, which makes it suitable for modern high-definition applications.
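The idea of a runtime-reconfigurable datapath over a pixel stream can be sketched in software (this is my own illustrative model, not the paper's design; stage names are hypothetical). Each stage consumes and produces a stream, and "reconfiguring" the pipeline is simply choosing a different ordered list of stages:

```python
def threshold(stream, level=128):
    # Binarize each incoming pixel against a fixed level.
    for p in stream:
        yield 255 if p >= level else 0

def invert(stream):
    # Photometric negative of each incoming pixel.
    for p in stream:
        yield 255 - p

def run_pipeline(pixels, stages):
    # Chain the stages so pixels flow through them one after another,
    # the way a systolic pipeline passes data between processing elements.
    stream = iter(pixels)
    for stage in stages:
        stream = stage(stream)
    return list(stream)

print(run_pipeline([10, 200, 128], [invert, threshold]))  # [255, 0, 0]
```

In hardware the stages run concurrently on successive pixels, which is where the low latency of the systolic organization comes from; the generator chain only mimics the dataflow structure.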

    Precision-Energy-Throughput Scaling Of Generic Matrix Multiplication and Convolution Kernels Via Linear Projections

    Generic matrix multiplication (GEMM) and one-dimensional convolution/cross-correlation (CONV) kernels often constitute the bulk of the compute- and memory-intensive processing within image/audio recognition and matching systems. We propose a novel method to scale the energy and processing throughput of GEMM and CONV kernels for such error-tolerant multimedia applications by adjusting the precision of computation. Our technique applies linear projections to the input matrix or signal data during the top-level GEMM and CONV blocking and reordering. The GEMM and CONV kernel processing then uses the projected inputs, and the results are accumulated to form the final outputs. Throughput and energy scaling take place by changing the number of projections computed by each kernel, which in turn produces approximate results, i.e. changes the precision of the performed computation. Results derived from a voltage- and frequency-scaled ARM Cortex A15 processor running face recognition and music matching algorithms demonstrate that the proposed approach allows for a 280%-440% increase in processing throughput and a 75%-80% decrease in energy consumption against optimized GEMM and CONV kernels without any impact on the obtained recognition or matching accuracy. Even higher gains can be obtained if one is willing to tolerate some reduction in the accuracy of the recognition and matching applications.
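The projection idea can be sketched as follows (an illustrative reconstruction under my own assumptions, not the authors' code; the paper's specific projections may differ). A GEMM C = A·B is approximated by first projecting the shared inner dimension through an n×k matrix P, so the kernel performs two skinny multiplications whose cost scales with the number of projections k:

```python
import numpy as np

def approx_gemm(A, B, P):
    # Project the shared inner dimension through P (n x k), then multiply:
    # two skinny GEMMs whose cost along the inner dimension is O(k), not O(n).
    return (A @ P) @ (P.T @ B)

rng = np.random.default_rng(0)
A = rng.standard_normal((8, 32))
B = rng.standard_normal((32, 8))

# With a full orthonormal P (k = n), the projection loses nothing:
Q, _ = np.linalg.qr(rng.standard_normal((32, 32)))
assert np.allclose(approx_gemm(A, B, Q), A @ B)

# Keeping only some columns of Q (fewer projections) reduces the work per
# kernel call and trades away precision, which is the throughput/energy knob.
approx = approx_gemm(A, B, Q[:, :16])
```

Changing k at runtime changes both the arithmetic count and the approximation error, which is what lets throughput and energy scale smoothly on an error-tolerant workload.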

    Stream Processor Development using Multi-Threshold NULL Convention Logic Asynchronous Design Methodology

    Decreasing transistor feature size has led to an increase in the number of transistors in integrated circuits (IC), allowing for the implementation of more complex logic. However, such logic also requires more complex clock tree synthesis (CTS) to avoid timing violations, as the clock must reach many more gates over larger areas. Thus, timing analysis requires significantly more computing power and designer involvement than in the past. For these reasons, IC designers have been pushed to look beyond conventional synchronous (SYNC) architecture and explore novel methodologies such as asynchronous, self-timed design. This dissertation evaluates the nominal active energy, voltage-scaled active energy, and leakage power dissipation across two cores of a stream processor: Smoothing Filter (SF) and Histogram Equalization (HEQ). Both cores were implemented in Multi-Threshold NULL Convention Logic (MTNCL) and clock-gated synchronous methodologies using a gate-level netlist, avoiding architectural discrepancies and guaranteeing impartial comparisons. MTNCL designs consumed more active energy than their synchronous counterparts due to the dual-rail encoding system; however, high-threshold-voltage (High-Vt) transistors used in MTNCL threshold gates reduced leakage power dissipation by up to 227%. During voltage-scaling simulations, MTNCL circuits showed a high level of robustness, as the output results were logically valid across all voltage sweeps without any additional circuitry. SYNC circuits, however, needed extra logic, such as a DVS controller, to adjust the circuit's speed when VDD changed. Although SYNC circuits still consumed less average energy, MTNCL circuit power gains accelerated when switching to lower voltage domains.
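The dual-rail encoding blamed here for the extra active energy can be modeled in a few lines (a simplified sketch of standard NCL conventions, mine rather than the dissertation's). Every bit uses two wires, and a NULL spacer separates successive data wavefronts, so rails toggle far more often than a single-rail synchronous wire; the hysteresis of a threshold gate such as TH22 is what makes the circuit self-timed:

```python
# Dual-rail bit encoding: (rail0, rail1).
NULL, DATA0, DATA1 = (0, 0), (1, 0), (0, 1)

def encode(bit):
    # Assert exactly one rail per valid data value.
    return DATA1 if bit else DATA0

def decode(rails):
    # NULL is the spacer between data wavefronts, not a logic value.
    if rails == NULL:
        return None
    return 1 if rails == DATA1 else 0

def th22(a, b, state):
    # TH22 threshold gate with hysteresis: asserts when both inputs are 1,
    # deasserts only when both return to 0, and otherwise holds its state.
    if a == 1 and b == 1:
        return 1
    if a == 0 and b == 0:
        return 0
    return state

assert decode(encode(1)) == 1
assert th22(1, 1, 0) == 1 and th22(1, 0, 1) == 1 and th22(0, 0, 1) == 0
```

The hold behavior is also why the MTNCL outputs stay logically valid under voltage scaling: a gate simply waits for complete DATA or NULL wavefronts instead of racing a clock edge.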

    Error tolerant multimedia stream processing: There's plenty of room at the top (of the system stack)

    There is a growing realization that the expected fault rates and energy dissipation stemming from increases in CMOS integration will lead to the abandonment of traditional system reliability in favor of approaches that offer resilience to hardware-induced errors across the application, runtime support, architecture, device and integrated-circuit (IC) layers. Commercial stakeholders of multimedia stream processing (MSP) applications, such as information retrieval, stream mining systems, and high-throughput image and video processing systems, already feel the strain of inadequate system-level scaling and robustness under ever-increasing user demand. While such applications can tolerate certain imprecision in their results, today's MSP systems do not support a systematic way to exploit this aspect for cross-layer system resilience. However, research is currently emerging that attempts to utilize the error-tolerant nature of MSP applications for this purpose. This is achieved by modifications to all layers of the system stack, from algorithms and software to the architecture and device layer, and even the IC digital logic synthesis itself. Unlike conventional processing, which aims for worst-case performance and accuracy guarantees, error-tolerant MSP attempts to provide guarantees for the expected performance and accuracy. In this paper we review recent advances in this field from an MSP and a system (layer-by-layer) perspective, and attempt to foresee some of the components of future cross-layer error-tolerant system design that may influence the multimedia and the general computing landscape within the next ten years.