17 research outputs found
A memory-centric approach to enable timing-predictability within embedded many-core accelerators
There is increasing interest among real-time systems architects in multi- and many-core accelerated platforms. The main obstacle to the adoption of such devices in industrial settings is the difficulty of tightly estimating the multiple interferences that may arise among the parallel components of the system. This concerns, in particular, concurrent accesses to shared memory and communication resources. Existing worst-case execution time analyses are extremely pessimistic, especially when adopted for systems composed of hundreds to thousands of cores. This significantly limits the potential for the adoption of these platforms in real-time systems. In this paper, we study how the predictable execution model (PREM), a memory-aware approach to enable timing predictability in real-time systems, can be successfully adopted on multi- and many-core heterogeneous platforms. Using a state-of-the-art multi-core platform as a testbed, we validate that it is possible to obtain an order-of-magnitude improvement in the WCET bounds of parallel applications if data movements are adequately orchestrated in accordance with PREM. We identify which system parameters most affect the performance opportunities offered by this approach, both on average and in the worst case, taking a first step towards predictable many-core systems.
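The core PREM idea of separating contended memory traffic from contention-free computation can be sketched in a few lines of Python. This is only a toy illustration under assumed names (`prem_interval`, `mem_phase_lock`): PREM targets bare-metal embedded code, and the lock here merely models the rule that at most one core performs its memory phase at a time.

```python
import threading

mem_phase_lock = threading.Lock()  # models the one-memory-phase-at-a-time rule

def prem_interval(shared_data, start, end):
    """One PREM 'predictable interval': memory phase, then compute phase."""
    # Memory phase: copy the working set into a core-local buffer.
    # Serialized across cores so shared-memory accesses never interfere.
    with mem_phase_lock:
        local = shared_data[start:end]
    # Compute phase: touches only the local buffer, so it runs free of
    # shared-memory contention and its WCET can be bounded more tightly.
    return sum(x * x for x in local)

data = list(range(16))
results = {}

def worker(core_id, lo, hi):
    results[core_id] = prem_interval(data, lo, hi)

threads = [threading.Thread(target=worker, args=(i, i * 8, (i + 1) * 8))
           for i in range(2)]
for t in threads: t.start()
for t in threads: t.join()
print(results[0] + results[1])  # sum of squares 0..15 = 1240
```

Because compute phases never touch shared memory, their interference-free execution times can be analyzed in isolation, which is where the order-of-magnitude WCET improvement comes from.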
Acceleration of Data Compression with Parallel Architectures
This bachelor thesis deals with the use of parallel architectures, in particular GPUs, for the acceleration of selected lossless compression algorithms based on a statistical method, together with transformations that change the entropy of the input data to achieve a better compression ratio. The thesis also summarizes general information about parallel architectures and the options for programming them, mainly using NVIDIA CUDA and OpenCL.
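As an illustration of how an entropy-changing transformation helps a statistical coder (a generic sketch, not the specific algorithms evaluated in the thesis), a move-to-front (MTF) transform turns runs of repeated symbols into runs of small indices, lowering the zero-order entropy that a Huffman or arithmetic coder would then exploit:

```python
from collections import Counter
from math import log2

def entropy(seq):
    """Zero-order Shannon entropy of a sequence, in bits per symbol."""
    counts = Counter(seq)
    n = len(seq)
    return -sum(c / n * log2(c / n) for c in counts.values())

def mtf(data):
    """Move-to-front: emit each symbol's index, then move it to the front."""
    alphabet = sorted(set(data))
    out = []
    for s in data:
        i = alphabet.index(s)
        out.append(i)
        alphabet.insert(0, alphabet.pop(i))
    return out

text = "aaaabbbbccccaaaa"
print(entropy(text), entropy(mtf(text)))  # MTF output has lower entropy
```

The transform itself is sequential per stream, which is why GPU implementations typically split the input into independent blocks processed in parallel.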
Multibranch Autocorrelation Method for Doppler Estimation in Underwater Acoustic Channels
In underwater acoustic (UWA) communications, Doppler estimation is one of the major stages in a receiver. Two Doppler estimation methods are often used: the cross-ambiguity function (CAF) method and the single-branch autocorrelation (SBA) method. The former results in accurate estimation but with high complexity, whereas the latter is less complicated but also less accurate. In this paper, we propose and investigate a multibranch autocorrelation (MBA) Doppler estimation method. The proposed method can be used in communication systems with periodically transmitted pilot signals or repetitive data transmission. For comparison of the Doppler estimation methods, we investigate an orthogonal frequency-division multiplexing (OFDM) communication system in multiple dynamic scenarios using the Waymark simulator, allowing virtual UWA signal transmission between a moving transmitter and receiver. For the comparison, we also use OFDM signals recorded in a sea trial. The comparison shows that the receiver with the proposed MBA Doppler estimation method outperforms the receiver with the SBA method, and its detection performance is close to that of the receiver with the CAF method, but with a significantly lower complexity.
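The multibranch principle can be illustrated with a simplified narrowband sketch: when an L-sample pilot repeats, the autocorrelation at lags L, 2L, ... carries a phase proportional to the Doppler-induced frequency shift, and averaging several branches improves robustness over a single lag. This is an illustrative toy only (function name assumed); real UWA Doppler estimation must also handle time compression/dilation of the waveform, not just a frequency shift.

```python
import cmath
from math import pi

def mba_freq_estimate(x, L, branches):
    """Estimate a normalized frequency offset from a signal containing
    repetitions of an L-sample pilot, averaging several autocorrelation
    branches (lags L, 2L, ...) rather than a single one."""
    est = []
    for m in range(1, branches + 1):
        lag = m * L
        # autocorrelation of the signal at this branch's lag
        r = sum(x[n + lag] * x[n].conjugate() for n in range(len(x) - lag))
        # for x[n] = p[n mod L] * exp(j*2*pi*f*n), angle(r) = 2*pi*f*lag
        est.append(cmath.phase(r) / (2 * pi * lag))
    return sum(est) / len(est)

# repeated pilot distorted by a small Doppler-like frequency shift
L, f = 32, 0.002
pilot = [cmath.exp(2j * pi * (k * k) / L) for k in range(L)]  # chirp-like pilot
x = [pilot[n % L] * cmath.exp(2j * pi * f * n) for n in range(4 * L)]
print(mba_freq_estimate(x, L, branches=3))  # ≈ 0.002
```

Longer-lag branches give finer phase resolution per hertz but wrap sooner, so the usable number of branches depends on the expected Doppler range.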
Scalable Hierarchical Instruction Cache for Ultra-Low-Power Processor Clusters
High performance and energy efficiency are critical requirements for Internet of Things (IoT) end-nodes. Exploiting tightly coupled clusters of programmable processors (CMPs) has recently emerged as a suitable solution to address this challenge. One of the main bottlenecks limiting the performance and energy efficiency of these systems is the instruction cache architecture, due to its criticality in terms of timing (i.e., maximum operating frequency), bandwidth, and power. We propose a hierarchical instruction cache tailored to ultra-low-power tightly coupled processor clusters, where a relatively large cache (L1.5) is shared by L1 private caches through a two-cycle-latency interconnect. To address the performance loss caused by L1 capacity misses, we introduce a next-line prefetcher with cache probe filtering (CPF) from L1 to L1.5. We optimize the core instruction fetch (IF) stage by removing the critical core-to-L1 combinational path. We present a detailed comparison of instruction cache architectures' performance and energy efficiency for parallel ultra-low-power (ULP) clusters. Focusing on the implementation, our two-level instruction cache provides better scalability than existing shared caches, delivering up to 20% higher operating frequency. On average, the proposed two-level cache improves maximum performance by up to 17% compared to the state of the art while delivering similar energy efficiency for most relevant applications.
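The benefit of a next-line prefetcher with cache probe filtering can be shown with a toy software cache model (all class and parameter names here are assumptions for illustration; the paper's design is a hardware cache hierarchy): on each fetch, the next line is prefetched only if a probe shows it is absent, which avoids wasting bandwidth on lines already cached.

```python
from collections import OrderedDict

class ToyCache:
    """Tiny fully associative LRU cache holding `capacity` line addresses."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.lines = OrderedDict()
        self.misses = 0

    def probe(self, line):  # CPF: check presence without fetching
        return line in self.lines

    def access(self, line):
        if line in self.lines:
            self.lines.move_to_end(line)  # refresh LRU position
            return
        self.misses += 1
        self.fill(line)

    def fill(self, line):
        self.lines[line] = True
        if len(self.lines) > self.capacity:
            self.lines.popitem(last=False)  # evict least recently used line

def run(addresses, line_size=4, prefetch=False):
    l1 = ToyCache(capacity=8)
    for a in addresses:
        line = a // line_size
        l1.access(line)
        # next-line prefetch, filtered by a probe so lines already present
        # are not fetched again (saving lower-level bandwidth)
        if prefetch and not l1.probe(line + 1):
            l1.fill(line + 1)
    return l1.misses

stream = list(range(256))  # sequential instruction fetch pattern
print(run(stream), run(stream, prefetch=True))  # prefetching removes misses
```

On a purely sequential stream the prefetcher hides essentially all demand misses; real instruction streams with branches see a smaller but still substantial reduction.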
Multicore implementation of a fixed-complexity tree-search detector for MIMO communications
[EN] Multicore systems allow the efficient implementation of signal processing algorithms for communication systems thanks to their high parallel processing capabilities. In this paper, we present a high-throughput multicore implementation of a fixed-complexity tree-search-based detector of interest for MIMO wireless communication systems. Experimental results confirm that this implementation accelerates the data detection stage for different constellation sizes and numbers of subcarriers. This work was supported by the TEC2009-13741 project of the Spanish Ministry of Science, by the PROMETEO/2009/013 and ACOMP/2012/076 projects of the Generalitat Valenciana, and by the Vicerrectorado de Investigacion de la UPV through the Programa de Apoyo a la Investigacion y Desarrollo (PAID-05-11-2898). Ramiro Sánchez, C.; Roger Varea, S.; Gonzalez, A.; Almenar Terré, V.; Vidal Maciá, A. M. (2013). Multicore implementation of a fixed-complexity tree-search detector for MIMO communications. The Journal of Supercomputing 65(3):1010-1019. https://doi.org/10.1007/s11227-012-0839-x
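The appeal of fixed-complexity tree search for multicore mapping is that every branch does the same amount of work, so branches parallelize trivially. A minimal sketch for a 2x2 real-valued channel with a 4-PAM constellation (function and variable names are assumptions, not the paper's implementation): fully enumerate the constellation at the first tree level, then take a single hard decision per branch at the remaining level.

```python
from math import hypot, inf

PAM4 = [-3, -1, 1, 3]  # real-valued 4-PAM constellation

def fsd_2x2(H, y):
    """Fixed-complexity tree search for y = H s + n, 2x2 real channel:
    full expansion at the top level, single expansion below, no pruning,
    so the workload is fixed and easy to split across cores."""
    # QR decomposition of the 2x2 channel (Gram-Schmidt)
    (h11, h12), (h21, h22) = H
    r11 = hypot(h11, h21)
    q1 = (h11 / r11, h21 / r11)
    r12 = q1[0] * h12 + q1[1] * h22
    u2 = (h12 - r12 * q1[0], h22 - r12 * q1[1])
    r22 = hypot(*u2)
    q2 = (u2[0] / r22, u2[1] / r22)
    # rotate the received vector: z = Q^T y, so z = R s + noise
    z1 = q1[0] * y[0] + q1[1] * y[1]
    z2 = q2[0] * y[0] + q2[1] * y[1]
    best, best_cost = None, inf
    for s2 in PAM4:  # full expansion at the first detected level
        # single expansion below: slice the interference-cancelled estimate
        s1 = min(PAM4, key=lambda c: abs((z1 - r12 * s2) / r11 - c))
        cost = (z2 - r22 * s2) ** 2 + (z1 - r12 * s2 - r11 * s1) ** 2
        if cost < best_cost:
            best, best_cost = (s1, s2), cost
    return best

H = [[1.0, 0.4], [0.3, 1.2]]
s = (3, -1)
y = [H[0][0] * s[0] + H[0][1] * s[1], H[1][0] * s[0] + H[1][1] * s[1]]
print(fsd_2x2(H, y))  # noiseless detection recovers (3, -1)
```

Each iteration of the loop over `s2` is independent, which is exactly the structure the paper's multicore (and related GPU) implementations exploit.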
Algorithm/Architecture Co-Exploration of Visual Computing: Overview and Future Perspectives
Concurrently exploring both algorithmic and architectural optimizations is a new design paradigm. This survey paper addresses the latest research and future perspectives on the simultaneous development of video coding, processing, and computing algorithms with emerging platforms that have multiple cores and reconfigurable architectures. As the algorithms in forthcoming visual systems become increasingly complex, many applications must have different profiles with different levels of performance. Hence, with the expectation that the visual experience will continuously improve, it is critical that advanced platforms provide higher performance, better flexibility, and lower power consumption. To achieve these goals, algorithm and architecture co-design is essential for characterizing the algorithmic complexity used to optimize the targeted architecture. This paper shows that the seamless weaving together of previously autonomous visual computing algorithms and multicore or reconfigurable architectures will unavoidably become the leading trend in the future of video technology.
On the Hard-Real-Time Scheduling of Embedded Streaming Applications
Computer Systems, Imagery and Media