17 research outputs found

    A memory-centric approach to enable timing-predictability within embedded many-core accelerators

    There is increasing interest among real-time systems architects in multi- and many-core accelerated platforms. The main obstacle to the adoption of such devices in industrial settings is the difficulty of tightly estimating the multiple interferences that may arise among the parallel components of the system, in particular concurrent accesses to shared memory and communication resources. Existing worst-case execution time (WCET) analyses are extremely pessimistic, especially when applied to systems composed of hundreds to thousands of cores, which significantly limits the adoption of these platforms in real-time systems. In this paper, we study how the predictable execution model (PREM), a memory-aware approach to enabling timing predictability in real-time systems, can be successfully adopted on multi- and many-core heterogeneous platforms. Using a state-of-the-art multi-core platform as a testbed, we validate that it is possible to obtain an order-of-magnitude improvement in the WCET bounds of parallel applications if data movements are adequately orchestrated in accordance with PREM. We identify which system parameters most affect the performance opportunities offered by this approach, both on average and in the worst case, taking a first step towards predictable many-core systems.
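The PREM idea of separating memory and compute activity can be illustrated with a minimal sketch (not taken from the paper; all names below are illustrative): each task first copies everything it will touch into core-local storage, then computes without any shared-memory traffic, so a scheduler only needs to serialize the short memory phases across cores.

```python
# Toy sketch of a PREM-style phased task (illustrative, not the paper's code).

def memory_phase(shared_input, local_buffer):
    """Copy all data the task will touch into core-local storage.
    This is the task's only shared-memory access."""
    local_buffer[:] = shared_input

def compute_phase(local_buffer):
    """Operate exclusively on local data: contention-free, predictable WCET."""
    return sum(x * x for x in local_buffer)

def prem_task(shared_input):
    local = [0] * len(shared_input)
    memory_phase(shared_input, local)   # schedulable, bounded copy
    return compute_phase(local)         # runs without memory interference

print(prem_task([1, 2, 3]))             # 14
```

Because each core's only shared access happens in `memory_phase`, a scheduler can bound contention by ensuring the memory phases of different cores do not overlap, which is the property the WCET improvement rests on.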

    Acceleration of Data Compression with Parallel Architectures

    This bachelor's thesis deals with the use of parallel architectures, in particular GPUs, to accelerate selected lossless compression algorithms based on a statistical method, together with transformations that change the entropy of the input data to achieve a better compression ratio. The thesis also summarizes general background on parallel architectures and the options for programming them, mainly using NVIDIA CUDA and OpenCL.
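As a toy illustration of an entropy-changing pre-transform of the kind the thesis mentions (the thesis's actual transforms and coders are not specified here), a move-to-front transform maps repeated symbols to small indices, skewing the distribution so a statistical coder such as a Huffman coder compresses better:

```python
# Move-to-front (MTF) transform: one common entropy-changing pre-transform
# (illustrative example; not necessarily the transform used in the thesis).

def mtf_encode(data: bytes) -> list[int]:
    """Replace each byte by its index in a recency list, then move that
    byte to the front. Runs of repeated symbols become runs of zeros."""
    alphabet = list(range(256))
    out = []
    for b in data:
        i = alphabet.index(b)
        out.append(i)
        alphabet.insert(0, alphabet.pop(i))
    return out

print(mtf_encode(b"aaabbb"))  # [97, 0, 0, 98, 0, 0]
```

The output distribution is heavily skewed towards small values, which is exactly what lowers the entropy seen by the downstream statistical coder.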

    Multibranch Autocorrelation Method for Doppler Estimation in Underwater Acoustic Channels

    In underwater acoustic (UWA) communications, Doppler estimation is one of the major stages in a receiver. Two Doppler estimation methods are often used: the cross-ambiguity function (CAF) method and the single-branch autocorrelation (SBA) method. The former results in accurate estimation but with high complexity, whereas the latter is less complicated but also less accurate. In this paper, we propose and investigate a multibranch autocorrelation (MBA) Doppler estimation method. The proposed method can be used in communication systems with periodically transmitted pilot signals or repetitive data transmission. For comparison of the Doppler estimation methods, we investigate an orthogonal frequency-division multiplexing (OFDM) communication system in multiple dynamic scenarios using the Waymark simulator, allowing virtual UWA signal transmission between a moving transmitter and receiver. For the comparison, we also use OFDM signals recorded in a sea trial. The comparison shows that the receiver with the proposed MBA Doppler estimation method outperforms the receiver with the SBA method, and its detection performance is close to that of the receiver with the CAF method, but with a significantly lower complexity.
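The autocorrelation idea behind the SBA/MBA family can be sketched with a simplified toy model (this is not the paper's algorithm; the signal model and parameters below are illustrative): a pilot block is transmitted repetitively, Doppler time-scaling changes the repetition period seen at the receiver, and each "branch" evaluates the autocorrelation at one candidate lag, with the strongest branch giving the period estimate.

```python
# Toy multibranch-style period estimator from repetitive transmission
# (illustrative sketch, not the MBA method from the paper).
import cmath
import random

def autocorr_at_lag(x, lag):
    """Magnitude of the autocorrelation of x at the given lag."""
    n = len(x) - lag
    return abs(sum(x[i] * x[i + lag].conjugate() for i in range(n)))

def estimate_period(x, candidate_lags):
    """One autocorrelation branch per candidate lag; return the best lag."""
    return max(candidate_lags, key=lambda lag: autocorr_at_lag(x, lag))

# Synthetic example: a nominal pilot compressed by Doppler to a 90-sample
# repetition period should be found among candidate lags 80..110.
random.seed(0)
period = 90
block = [cmath.exp(2j * cmath.pi * random.random()) for _ in range(period)]
rx = block * 3                              # repetitive transmission
lag_hat = estimate_period(rx, range(80, 111))
print(lag_hat)                              # 90
```

The ratio of the nominal period to the estimated lag then gives the time-scaling (Doppler) factor; evaluating many branches in parallel is what trades the SBA method's single guess for accuracy approaching the CAF method.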

    Scalable Hierarchical Instruction Cache for Ultra-Low-Power Processors Clusters

    High performance and energy efficiency are critical requirements for Internet of Things (IoT) end-nodes. Exploiting tightly-coupled clusters of programmable processors (CMPs) has recently emerged as a suitable solution to address this challenge. One of the main bottlenecks limiting the performance and energy efficiency of these systems is the instruction cache architecture, due to its criticality in terms of timing (i.e., maximum operating frequency), bandwidth, and power. We propose a hierarchical instruction cache tailored to ultra-low-power tightly-coupled processor clusters, where a relatively large cache (L1.5) is shared by L1 private caches through a two-cycle latency interconnect. To address the performance loss caused by L1 capacity misses, we introduce a next-line prefetcher with cache probe filtering (CPF) from L1 to L1.5. We optimize the core instruction fetch (IF) stage by removing the critical core-to-L1 combinational path. We present a detailed comparison of instruction cache architectures' performance and energy efficiency for parallel ultra-low-power (ULP) clusters. Focusing on the implementation, our two-level instruction cache provides better scalability than existing shared caches, delivering up to 20% higher operating frequency. On average, the proposed two-level cache improves maximum performance by up to 17% compared to the state-of-the-art while delivering similar energy efficiency for most relevant applications.
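The two-level structure described above can be sketched with a toy cache model; the sizes, the FIFO replacement policy, and the 4-instruction line size below are illustrative assumptions, not the paper's parameters. A private L1 miss first probes the shared L1.5, and a next-line prefetcher pushes the following line into L1.5 so that sequential fetch streams mostly hit:

```python
# Toy two-level instruction cache with next-line prefetch
# (illustrative model; parameters are not those of the paper).

class Cache:
    def __init__(self, n_lines):
        self.n_lines = n_lines
        self.lines = []                 # FIFO list of cached line addresses

    def lookup(self, line):
        return line in self.lines

    def fill(self, line):
        if line in self.lines:
            return
        if len(self.lines) >= self.n_lines:
            self.lines.pop(0)           # evict oldest (FIFO, for simplicity)
        self.lines.append(line)

def fetch(pc, l1, l15, stats):
    line = pc // 4                      # assume 4 instructions per cache line
    if l1.lookup(line):
        stats["l1_hit"] += 1
    elif l15.lookup(line):              # L1 miss served by shared L1.5
        stats["l15_hit"] += 1
        l1.fill(line)
    else:                               # miss in both levels
        stats["miss"] += 1
        l15.fill(line)
        l1.fill(line)
    l15.fill(line + 1)                  # next-line prefetch into shared L1.5

l1, l15 = Cache(4), Cache(16)
stats = {"l1_hit": 0, "l15_hit": 0, "miss": 0}
for pc in range(64):                    # sequential fetch of 64 instructions
    fetch(pc, l1, l15, stats)
print(stats)                            # {'l1_hit': 48, 'l15_hit': 15, 'miss': 1}
```

On this sequential stream only the very first line misses both levels; every later line is already in L1.5 thanks to the prefetcher, which is the behavior that masks L1 capacity misses in the real design.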

    Multicore implementation of a fixed-complexity tree-search detector for MIMO communications

    [EN] Multicore systems allow the efficient implementation of signal processing algorithms for communication systems due to their high parallel processing capabilities. In this paper, we present a high-throughput multicore implementation of a fixed-complexity tree-search-based detector of interest for MIMO wireless communication systems. Experimental results confirm that this implementation accelerates the data detection stage for different constellation sizes and numbers of subcarriers. This work was supported by the TEC2009-13741 project of the Spanish Ministry of Science, by the PROMETEO/2009/013 and ACOMP/2012/076 projects of the Generalitat Valenciana, and by the Vicerrectorado de Investigacion de la UPV through the Programa de Apoyo a la Investigacion y Desarrollo (PAID-05-11-2898).
    Ramiro Sánchez, C.; Roger Varea, S.; Gonzalez, A.; Almenar Terré, V.; Vidal Maciá, A. M. (2013). Multicore implementation of a fixed-complexity tree-search detector for MIMO communications. The Journal of Supercomputing 65(3):1010–1019. https://doi.org/10.1007/s11227-012-0839-x
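The fixed-complexity tree search underlying such detectors (the FSD of Barbero and Thompson) can be sketched on a toy real-valued 2x2 system; the constellation, channel matrix, and noise values below are illustrative, not taken from the paper. The first tree layer is fully enumerated while each remaining layer keeps only the closest symbol, so the number of visited branches is fixed and the per-core workload is regular, which is what makes the detector easy to parallelize:

```python
# Toy fixed-complexity sphere-decoder sketch for a real-valued 2x2 system
# (illustrative; the paper's detector operates on larger complex systems).

CONST = [-3, -1, 1, 3]                  # real 4-PAM "constellation"

def fsd_detect(R, y):
    """Detect s from y = R s + n, with R upper-triangular 2x2."""
    best, best_cost = None, float("inf")
    for s1 in CONST:                    # full expansion at the top layer
        r1 = y[1] - R[1][1] * s1
        cost1 = r1 * r1
        # single best child at the lower layer (hard decision)
        s0 = min(CONST, key=lambda s: abs(y[0] - R[0][0] * s - R[0][1] * s1))
        r0 = y[0] - R[0][0] * s0 - R[0][1] * s1
        cost = cost1 + r0 * r0
        if cost < best_cost:
            best, best_cost = [s0, s1], cost
    return best

R = [[1.0, 0.4], [0.0, 1.0]]            # upper-triangular channel factor
s = [3, -1]                             # transmitted symbols
y = [R[0][0] * s[0] + R[0][1] * s[1] + 0.05,   # received samples with
     R[1][1] * s[1] - 0.03]                    # small additive noise
print(fsd_detect(R, y))                 # [3, -1]
```

Since each top-layer branch is independent, the branches map naturally onto cores or SIMD lanes, which is the parallelism the multicore implementation exploits.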

    Algorithm/Architecture Co-Exploration of Visual Computing: Overview and Future Perspectives

    Concurrently exploring both algorithmic and architectural optimizations is a new design paradigm. This survey paper addresses the latest research and future perspectives on the simultaneous development of video coding, processing, and computing algorithms with emerging platforms that have multiple cores and reconfigurable architectures. As the algorithms in forthcoming visual systems become increasingly complex, many applications must have different profiles with different levels of performance. Hence, with the expectation that the visual experience will continuously improve, it is critical that advanced platforms provide higher performance, better flexibility, and lower power consumption. To achieve these goals, algorithm and architecture co-design is significant for characterizing the algorithmic complexity used to optimize the targeted architecture. This paper shows that the seamless weaving of previously autonomous visual computing algorithms with multicore or reconfigurable architectures will inevitably become the leading trend in future video technology.

    On the Hard-Real-Time Scheduling of Embedded Streaming Applications

    Computer Systems, Imagery and Media