17 research outputs found
A memory-centric approach to enable timing-predictability within embedded many-core accelerators
There is increasing interest among real-time systems architects in multi- and many-core accelerated platforms. The main obstacle to the adoption of such devices in industrial settings is the difficulty of tightly estimating the multiple interferences that may arise among the parallel components of the system. This concerns, in particular, concurrent accesses to shared memory and communication resources. Existing worst-case execution time analyses are extremely pessimistic, especially when adopted for systems composed of hundreds to thousands of cores. This significantly limits the potential for the adoption of these platforms in real-time systems. In this paper, we study how the predictable execution model (PREM), a memory-aware approach to enable timing predictability in real-time systems, can be successfully adopted on multi- and many-core heterogeneous platforms. Using a state-of-the-art multi-core platform as a testbed, we validate that it is possible to obtain an order-of-magnitude improvement in the WCET bounds of parallel applications if data movements are adequately orchestrated in accordance with PREM. We identify which system parameters most affect the performance opportunities offered by this approach, both on average and in the worst case, taking a first step towards predictable many-core systems.
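The core PREM idea of separating contended memory traffic from contention-free computation can be sketched in a few lines of Python. This is only a toy illustration under assumed names (`prem_interval`, `mem_phase_lock`): PREM targets bare-metal embedded code, and the lock here merely models the rule that at most one core performs its memory phase at a time.

```python
import threading

mem_phase_lock = threading.Lock()  # models the one-memory-phase-at-a-time rule

def prem_interval(shared_data, start, end):
    """One PREM 'predictable interval': memory phase, then compute phase."""
    # Memory phase: copy the working set into a core-local buffer.
    # Serialized across cores so shared-memory accesses never interfere.
    with mem_phase_lock:
        local = shared_data[start:end]
    # Compute phase: touches only the local buffer, so it runs free of
    # shared-memory contention and its WCET can be bounded more tightly.
    return sum(x * x for x in local)

data = list(range(16))
results = {}

def worker(core_id, lo, hi):
    results[core_id] = prem_interval(data, lo, hi)

threads = [threading.Thread(target=worker, args=(i, i * 8, (i + 1) * 8))
           for i in range(2)]
for t in threads: t.start()
for t in threads: t.join()
print(results[0] + results[1])  # sum of squares 0..15 = 1240
```

Because compute phases never touch shared memory, their interference-free execution times can be analyzed in isolation, which is where the order-of-magnitude WCET improvement comes from.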
Acceleration of Data Compression with Parallel Architectures
This bachelor thesis deals with the use of parallel architectures, in particular GPUs, for the acceleration of selected lossless compression algorithms based on a statistical method, together with transformations that change the entropy of the input data to achieve a better compression ratio. The thesis also summarizes general information about parallel architectures and the options for programming them, mainly using NVIDIA CUDA and OpenCL.
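As an illustration of how an entropy-changing transformation helps a statistical coder (a generic sketch, not the specific algorithms evaluated in the thesis), a move-to-front (MTF) transform turns runs of repeated symbols into runs of small indices, lowering the zero-order entropy that a Huffman or arithmetic coder would then exploit:

```python
from collections import Counter
from math import log2

def entropy(seq):
    """Zero-order Shannon entropy of a sequence, in bits per symbol."""
    counts = Counter(seq)
    n = len(seq)
    return -sum(c / n * log2(c / n) for c in counts.values())

def mtf(data):
    """Move-to-front: emit each symbol's index, then move it to the front."""
    alphabet = sorted(set(data))
    out = []
    for s in data:
        i = alphabet.index(s)
        out.append(i)
        alphabet.insert(0, alphabet.pop(i))
    return out

text = "aaaabbbbccccaaaa"
print(entropy(text), entropy(mtf(text)))  # MTF output has lower entropy
```

The transform itself is sequential per stream, which is why GPU implementations typically split the input into independent blocks processed in parallel.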
Multibranch Autocorrelation Method for Doppler Estimation in Underwater Acoustic Channels
In underwater acoustic (UWA) communications, Doppler estimation is one of the major stages in a receiver. Two Doppler estimation methods are often used: the cross-ambiguity function (CAF) method and the single-branch autocorrelation (SBA) method. The former results in accurate estimation but with high complexity, whereas the latter is less complicated but also less accurate. In this paper, we propose and investigate a multibranch autocorrelation (MBA) Doppler estimation method. The proposed method can be used in communication systems with periodically transmitted pilot signals or repetitive data transmission. For comparison of the Doppler estimation methods, we investigate an orthogonal frequency-division multiplexing (OFDM) communication system in multiple dynamic scenarios using the Waymark simulator, allowing virtual UWA signal transmission between a moving transmitter and receiver. For the comparison, we also use OFDM signals recorded in a sea trial. The comparison shows that the receiver with the proposed MBA Doppler estimation method outperforms the receiver with the SBA method, and its detection performance is close to that of the receiver with the CAF method, but with a significantly lower complexity.
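The multibranch principle can be illustrated with a simplified narrowband sketch: when an L-sample pilot repeats, the autocorrelation at lags L, 2L, ... carries a phase proportional to the Doppler-induced frequency shift, and averaging several branches improves robustness over a single lag. This is an illustrative toy only (function name assumed); real UWA Doppler estimation must also handle time compression/dilation of the waveform, not just a frequency shift.

```python
import cmath
from math import pi

def mba_freq_estimate(x, L, branches):
    """Estimate a normalized frequency offset from a signal containing
    repetitions of an L-sample pilot, averaging several autocorrelation
    branches (lags L, 2L, ...) rather than a single one."""
    est = []
    for m in range(1, branches + 1):
        lag = m * L
        # autocorrelation of the signal at this branch's lag
        r = sum(x[n + lag] * x[n].conjugate() for n in range(len(x) - lag))
        # for x[n] = p[n mod L] * exp(j*2*pi*f*n), angle(r) = 2*pi*f*lag
        est.append(cmath.phase(r) / (2 * pi * lag))
    return sum(est) / len(est)

# repeated pilot distorted by a small Doppler-like frequency shift
L, f = 32, 0.002
pilot = [cmath.exp(2j * pi * (k * k) / L) for k in range(L)]  # chirp-like pilot
x = [pilot[n % L] * cmath.exp(2j * pi * f * n) for n in range(4 * L)]
print(mba_freq_estimate(x, L, branches=3))  # ≈ 0.002
```

Longer-lag branches give finer phase resolution per hertz but wrap sooner, so the usable number of branches depends on the expected Doppler range.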
Scalable Hierarchical Instruction Cache for Ultra-Low-Power Processor Clusters
High performance and energy efficiency are critical requirements for Internet of Things (IoT) end-nodes. Exploiting tightly coupled clusters of programmable processors (CMPs) has recently emerged as a suitable solution to address this challenge. One of the main bottlenecks limiting the performance and energy efficiency of these systems is the instruction cache architecture, due to its criticality in terms of timing (i.e., maximum operating frequency), bandwidth, and power. We propose a hierarchical instruction cache tailored to ultra-low-power tightly coupled processor clusters, where a relatively large cache (L1.5) is shared by L1 private caches through a two-cycle-latency interconnect. To address the performance loss caused by L1 capacity misses, we introduce a next-line prefetcher with cache probe filtering (CPF) from L1 to L1.5. We optimize the core instruction fetch (IF) stage by removing the critical core-to-L1 combinational path. We present a detailed comparison of instruction cache architectures' performance and energy efficiency for parallel ultra-low-power (ULP) clusters. Focusing on the implementation, our two-level instruction cache provides better scalability than existing shared caches, delivering up to 20% higher operating frequency. On average, the proposed two-level cache improves maximum performance by up to 17% compared to the state of the art while delivering similar energy efficiency for most relevant applications.
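The benefit of a next-line prefetcher with cache probe filtering can be shown with a toy software cache model (all class and parameter names here are assumptions for illustration; the paper's design is a hardware cache hierarchy): on each fetch, the next line is prefetched only if a probe shows it is absent, which avoids wasting bandwidth on lines already cached.

```python
from collections import OrderedDict

class ToyCache:
    """Tiny fully associative LRU cache holding `capacity` line addresses."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.lines = OrderedDict()
        self.misses = 0

    def probe(self, line):  # CPF: check presence without fetching
        return line in self.lines

    def access(self, line):
        if line in self.lines:
            self.lines.move_to_end(line)  # refresh LRU position
            return
        self.misses += 1
        self.fill(line)

    def fill(self, line):
        self.lines[line] = True
        if len(self.lines) > self.capacity:
            self.lines.popitem(last=False)  # evict least recently used line

def run(addresses, line_size=4, prefetch=False):
    l1 = ToyCache(capacity=8)
    for a in addresses:
        line = a // line_size
        l1.access(line)
        # next-line prefetch, filtered by a probe so lines already present
        # are not fetched again (saving lower-level bandwidth)
        if prefetch and not l1.probe(line + 1):
            l1.fill(line + 1)
    return l1.misses

stream = list(range(256))  # sequential instruction fetch pattern
print(run(stream), run(stream, prefetch=True))  # prefetching removes misses
```

On a purely sequential stream the prefetcher hides essentially all demand misses; real instruction streams with branches see a smaller but still substantial reduction.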
Multicore implementation of a fixed-complexity tree-search detector for MIMO communications
[EN] Multicore systems allow the efficient implementation of signal processing algorithms for communication systems thanks to their high parallel processing capabilities. In this paper, we present a high-throughput multicore implementation of a fixed-complexity tree-search-based detector of interest for MIMO wireless communication systems. Experimental results confirm that this implementation accelerates the data detection stage for different constellation sizes and numbers of subcarriers. This work was supported by the TEC2009-13741 project of the Spanish Ministry of Science, by the PROMETEO/2009/013 and ACOMP/2012/076 projects of the Generalitat Valenciana, and by the Vicerrectorado de Investigacion de la UPV through the Programa de Apoyo a la Investigacion y Desarrollo (PAID-05-11-2898). Ramiro Sánchez, C.; Roger Varea, S.; Gonzalez, A.; Almenar Terré, V.; Vidal Maciá, A. M. (2013). Multicore implementation of a fixed-complexity tree-search detector for MIMO communications. The Journal of Supercomputing 65(3):1010-1019. https://doi.org/10.1007/s11227-012-0839-x
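The appeal of fixed-complexity tree search for multicore mapping is that every branch does the same amount of work, so branches parallelize trivially. A minimal sketch for a 2x2 real-valued channel with a 4-PAM constellation (function and variable names are assumptions, not the paper's implementation): fully enumerate the constellation at the first tree level, then take a single hard decision per branch at the remaining level.

```python
from math import hypot, inf

PAM4 = [-3, -1, 1, 3]  # real-valued 4-PAM constellation

def fsd_2x2(H, y):
    """Fixed-complexity tree search for y = H s + n, 2x2 real channel:
    full expansion at the top level, single expansion below, no pruning,
    so the workload is fixed and easy to split across cores."""
    # QR decomposition of the 2x2 channel (Gram-Schmidt)
    (h11, h12), (h21, h22) = H
    r11 = hypot(h11, h21)
    q1 = (h11 / r11, h21 / r11)
    r12 = q1[0] * h12 + q1[1] * h22
    u2 = (h12 - r12 * q1[0], h22 - r12 * q1[1])
    r22 = hypot(*u2)
    q2 = (u2[0] / r22, u2[1] / r22)
    # rotate the received vector: z = Q^T y, so z = R s + noise
    z1 = q1[0] * y[0] + q1[1] * y[1]
    z2 = q2[0] * y[0] + q2[1] * y[1]
    best, best_cost = None, inf
    for s2 in PAM4:  # full expansion at the first detected level
        # single expansion below: slice the interference-cancelled estimate
        s1 = min(PAM4, key=lambda c: abs((z1 - r12 * s2) / r11 - c))
        cost = (z2 - r22 * s2) ** 2 + (z1 - r12 * s2 - r11 * s1) ** 2
        if cost < best_cost:
            best, best_cost = (s1, s2), cost
    return best

H = [[1.0, 0.4], [0.3, 1.2]]
s = (3, -1)
y = [H[0][0] * s[0] + H[0][1] * s[1], H[1][0] * s[0] + H[1][1] * s[1]]
print(fsd_2x2(H, y))  # noiseless detection recovers (3, -1)
```

Each iteration of the loop over `s2` is independent, which is exactly the structure the paper's multicore (and related GPU) implementations exploit.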
Algorithm/Architecture Co-Exploration of Visual Computing: Overview and Future Perspectives
Concurrently exploring both algorithmic and architectural optimizations is a new design paradigm. This survey paper addresses the latest research and future perspectives on the simultaneous development of video coding, processing, and computing algorithms with emerging platforms that have multiple cores and reconfigurable architectures. As the algorithms in forthcoming visual systems become increasingly complex, many applications must have different profiles with different levels of performance. Hence, with the expectation that the visual experience will continuously improve, it is critical that advanced platforms provide higher performance, better flexibility, and lower power consumption. To achieve these goals, algorithm and architecture co-design is essential for characterizing the algorithmic complexity used to optimize the targeted architecture. This paper shows that the seamless weaving together of previously autonomous visual computing algorithms and multicore or reconfigurable architectures will unavoidably become the leading trend in the future of video technology.
On the Hard-Real-Time Scheduling of Embedded Streaming Applications
Computer Systems, Imagery and Media