975 research outputs found

    LIPIcs, Volume 251, ITCS 2023, Complete Volume

    Get PDF
    LIPIcs, Volume 251, ITCS 2023, Complete Volum

    Guided rewriting and constraint satisfaction for parallel GPU code generation

    Get PDF
    Graphics Processing Units (GPUs) are notoriously hard to optimise for manually due to their scheduling and memory hierarchies. What is needed are good automatic code generators and optimisers for such parallel hardware. Functional approaches such as Accelerate, Futhark and LIFT leverage a high-level algorithmic Intermediate Representation (IR) to expose parallelism and abstract the implementation details away from the user. However, producing efficient code for a given accelerator remains challenging. Existing code generators depend on the user input to choose a subset of hard-coded optimizations or automated exploration of implementation search space. The former suffers from the lack of extensibility, while the latter is too costly due to the size of the search space. A hybrid approach is needed, where a space of valid implementations is built automatically and explored with the aid of human expertise. This thesis presents a solution combining user-guided rewriting and automatically generated constraints to produce high-performance code. The first contribution is an automatic tuning technique to find a balance between performance and memory consumption. Leveraging its functional patterns, the LIFT compiler is empowered to infer tuning constraints and limit the search to valid tuning combinations only. Next, the thesis reframes parallelisation as a constraint satisfaction problem. Parallelisation constraints are extracted automatically from the input expression, and a solver is used to identify valid rewriting. The constraints truncate the search space to valid parallel mappings only by capturing the scheduling restrictions of the GPU in the context of a given program. A synchronisation barrier insertion technique is proposed to prevent data races and improve the efficiency of the generated parallel mappings. The final contribution of this thesis is the guided rewriting method, where the user encodes a design space of structural transformations using high-level IR nodes called rewrite points. These strongly typed pragmas express macro rewrites and expose design choices as explorable parameters. The thesis proposes a small set of reusable rewrite points to achieve tiling, cache locality, data reuse and memory optimisation. A comparison with the vendor-provided handwritten kernel ARM Compute Library and the TVM code generator demonstrates the effectiveness of this thesis' contributions. With convolution as a use case, LIFT-generated direct and GEMM-based convolution implementations are shown to perform on par with the state-of-the-art solutions on a mobile GPU. Overall, this thesis demonstrates that a functional IR yields well to user-guided and automatic rewriting for high-performance code generation

    The Sparse Abstract Machine

    Full text link
    We propose the Sparse Abstract Machine (SAM), an abstract machine model for targeting sparse tensor algebra to reconfigurable and fixed-function spatial dataflow accelerators. SAM defines a streaming dataflow abstraction with sparse primitives that encompass a large space of scheduled tensor algebra expressions. SAM dataflow graphs naturally separate tensor formats from algorithms and are expressive enough to incorporate arbitrary iteration orderings and many hardware-specific optimizations. We also present Custard, a compiler from a high-level language to SAM that demonstrates SAM's usefulness as an intermediate representation. We automatically bind from SAM to a streaming dataflow simulator. We evaluate the generality and extensibility of SAM, explore the performance space of sparse tensor algebra optimizations using SAM, and show SAM's ability to represent dataflow hardware.Comment: 18 pages, 17 figures, 3 table

    A Wait-free Queue with Poly-logarithmic Worst-case Step Complexity

    Get PDF
    In this work, we introduce a novel linearizable wait-free queue implementation. Linearizability and lock-freedom are standard requirements for designing shared data structures. To the best of our knowledge, all of the existing linearizable lock-free queues in the literature have a common problem in their worst case, called the CAS Retry Problem. We show that our algorithm avoids this problem with the helping mechanism which we use and has a worst-case running time better than prior lock-free queues. The amortized number of steps for an Enqueue or Dequeue in our algorithm is O(log^2 p + log q), where p is the number of processes and q is the size of the queue when the operation is linearized

    LIPIcs, Volume 261, ICALP 2023, Complete Volume

    Get PDF
    LIPIcs, Volume 261, ICALP 2023, Complete Volum

    Modern data analytics in the cloud era

    Get PDF
    Cloud Computing ist die dominante Technologie des letzten Jahrzehnts. Die Benutzerfreundlichkeit der verwalteten Umgebung in Kombination mit einer nahezu unbegrenzten Menge an Ressourcen und einem nutzungsabhängigen Preismodell ermöglicht eine schnelle und kosteneffiziente Projektrealisierung für ein breites Nutzerspektrum. Cloud Computing verändert auch die Art und Weise wie Software entwickelt, bereitgestellt und genutzt wird. Diese Arbeit konzentriert sich auf Datenbanksysteme, die in der Cloud-Umgebung eingesetzt werden. Wir identifizieren drei Hauptinteraktionspunkte der Datenbank-Engine mit der Umgebung, die veränderte Anforderungen im Vergleich zu traditionellen On-Premise-Data-Warehouse-Lösungen aufweisen. Der erste Interaktionspunkt ist die Interaktion mit elastischen Ressourcen. Systeme in der Cloud sollten Elastizität unterstützen, um den Lastanforderungen zu entsprechen und dabei kosteneffizient zu sein. Wir stellen einen elastischen Skalierungsmechanismus für verteilte Datenbank-Engines vor, kombiniert mit einem Partitionsmanager, der einen Lastausgleich bietet und gleichzeitig die Neuzuweisung von Partitionen im Falle einer elastischen Skalierung minimiert. Darüber hinaus führen wir eine Strategie zum initialen Befüllen von Puffern ein, die es ermöglicht, skalierte Ressourcen unmittelbar nach der Skalierung auszunutzen. Cloudbasierte Systeme sind von fast überall aus zugänglich und verfügbar. Daten werden häufig von zahlreichen Endpunkten aus eingespeist, was sich von ETL-Pipelines in einer herkömmlichen Data-Warehouse-Lösung unterscheidet. Viele Benutzer verzichten auf die Definition von strikten Schemaanforderungen, um Transaktionsabbrüche aufgrund von Konflikten zu vermeiden oder um den Ladeprozess von Daten zu beschleunigen. Wir führen das Konzept der PatchIndexe ein, die die Definition von unscharfen Constraints ermöglichen. PatchIndexe verwalten Ausnahmen zu diesen Constraints, machen sie für die Optimierung und Ausführung von Anfragen nutzbar und bieten effiziente Unterstützung bei Datenaktualisierungen. Das Konzept kann auf beliebige Constraints angewendet werden und wir geben Beispiele für unscharfe Eindeutigkeits- und Sortierconstraints. Darüber hinaus zeigen wir, wie PatchIndexe genutzt werden können, um fortgeschrittene Constraints wie eine unscharfe Multi-Key-Partitionierung zu definieren, die eine robuste Anfrageperformance bei Workloads mit unterschiedlichen Partitionsanforderungen bietet. Der dritte Interaktionspunkt ist die Nutzerinteraktion. Datengetriebene Anwendungen haben sich in den letzten Jahren verändert. Neben den traditionellen SQL-Anfragen für Business Intelligence sind heute auch datenwissenschaftliche Anwendungen von großer Bedeutung. In diesen Fällen fungiert das Datenbanksystem oft nur als Datenlieferant, während der Rechenaufwand in dedizierten Data-Science- oder Machine-Learning-Umgebungen stattfindet. Wir verfolgen das Ziel, fortgeschrittene Analysen in Richtung der Datenbank-Engine zu verlagern und stellen das Grizzly-Framework als DataFrame-zu-SQL-Transpiler vor. Auf dieser Grundlage identifizieren wir benutzerdefinierte Funktionen (UDFs) und maschinelles Lernen (ML) als wichtige Aufgaben, die von einer tieferen Integration in die Datenbank-Engine profitieren würden. Daher untersuchen und bewerten wir Ansätze für die datenbankinterne Ausführung von Python-UDFs und datenbankinterne ML-Inferenz.Cloud computing has been the groundbreaking technology of the last decade. The ease-of-use of the managed environment in combination with nearly infinite amount of resources and a pay-per-use price model enables fast and cost-efficient project realization for a broad range of users. Cloud computing also changes the way software is designed, deployed and used. This thesis focuses on database systems deployed in the cloud environment. We identify three major interaction points of the database engine with the environment that show changed requirements compared to traditional on-premise data warehouse solutions. First, software is deployed on elastic resources. Consequently, systems should support elasticity in order to match workload requirements and be cost-effective. We present an elastic scaling mechanism for distributed database engines, combined with a partition manager that provides load balancing while minimizing partition reassignments in the case of elastic scaling. Furthermore we introduce a buffer pre-heating strategy that allows to mitigate a cold start after scaling and leads to an immediate performance benefit using scaling. Second, cloud based systems are accessible and available from nearly everywhere. Consequently, data is frequently ingested from numerous endpoints, which differs from bulk loads or ETL pipelines in a traditional data warehouse solution. Many users do not define database constraints in order to avoid transaction aborts due to conflicts or to speed up data ingestion. To mitigate this issue we introduce the concept of PatchIndexes, which allow the definition of approximate constraints. PatchIndexes maintain exceptions to constraints, make them usable in query optimization and execution and offer efficient update support. The concept can be applied to arbitrary constraints and we provide examples of approximate uniqueness and approximate sorting constraints. Moreover, we show how PatchIndexes can be exploited to define advanced constraints like an approximate multi-key partitioning, which offers robust query performance over workloads with different partition key requirements. Third, data-centric workloads changed over the last decade. Besides traditional SQL workloads for business intelligence, data science workloads are of significant importance nowadays. For these cases the database system might only act as data delivery, while the computational effort takes place in data science or machine learning (ML) environments. As this workflow has several drawbacks, we follow the goal of pushing advanced analytics towards the database engine and introduce the Grizzly framework as a DataFrame-to-SQL transpiler. Based on this we identify user-defined functions (UDFs) and machine learning inference as important tasks that would benefit from a deeper engine integration and investigate approaches to push these operations towards the database engine

    Využití softwarově definovaného rádia v oblasti SMART technologii

    Get PDF
    Modern telecommunication systems are rapidly evolving. This rapid development requires constant research and fast prototyping. This dissertation thesis focusses on deployment of software defined radio (SDR) in multiple application areas, including SMART technologies. SDR itself is a tool behind many breakthroughs in modern telecommunications, due to its major adaptability. It offers a comprehensive way of fast prototyping, which rely on suitable software platform. The field of telecommunications is ever-changing, due to the constant pressure on innovation. For this reason, it is desirable to test some of the alternative communication technologies. Visible light communication (VLC) system based on combination of virtual instrumentation and software defined radios was chosen for experimentation. This dissertation focusses on multiple versions of VLC system that were developed over the years. Each version is further discussed, and their advantages and disadvantages are presented. A draft of fourth and newest version is mentioned along with possible directions of the research. Results from multiple application areas are presented, which show the adaptability of the whole platform to different use cases including but not limited to: SMART technologies, automotive, nuclear waste disposal sites, or industry. It is demonstrated that the newest version of the system, which is based on OFDM modulation, can communicate up to 50 meters in closed environments and up to 35 meters in outdoor scenarios. This opens further research directions such as truck platooning or underwater communications.Moderní komunikační systémy jsou jednou z nejrychleji se rozvíjejících oblastí. Takového markantního posunu lze dosáhnout pouze skrze nový vývoj a aplikaci metodiky fast prototypingu. Tato disertace se zaměřuje na nasazení technologie softwarově definovaného rádia (SDR) v různých aplikačních oblastech. Samotné SDR je díky své adaptabilitě nástrojem, který stál na pozadí rozvoje mnoha moderních telekomunikačních systémů. Jedná se o ucelenou platformu pro fast prototyping, která se opírá o robustní softwarovou základnu. Právě telekomunikace jsou oblastí, kde je takové zařízení nedocenitelné, právě kvůli neustálému tlaku na inovace. Právě to je důvodem, proč je vhodné také testovat různé alternativní technologie pro přenos dat. Jednou z takových je komunikace viditelným spektrem světla (VLC), která je náplní této práce. Součástí praktické části je vývoj a popis několika verzí VLC systému založených na virtuální instrumentaci a SDR, které vznikly během autorova studia. Každá verze je samostatně popsána včetně výhod a nevýhod, které poskytují. Součástí je též první náčrt čtvrté verze, která bude součástí budoucího výzkumu. Prezentované výsledky z různých aplikačních oblastí jasně ukazují, že je celou platformu možné použít v různých aplikačních oblastech, včetně SMART technologií, automotive, úložišti jaderného odpadu anebo Průmyslu 4.0. Součástí jsou též výsledky z poslední verze, které dokazují, že je systém ve vnitřních prostorech komunikovat až na vzdálenost 50 metrů, zatímco ve venkovních podmínkách je to 35 metrů. Díky tomu je možné vytyčit nové oblasti výzkumu jako je například platooning (tandemová jízda) anebo podvodní komunikace.450 - Katedra kybernetiky a biomedicínského inženýrstvívyhově

    Modelling, Dimensioning and Optimization of 5G Communication Networks, Resources and Services

    Get PDF
    This reprint aims to collect state-of-the-art research contributions that address challenges in the emerging 5G networks design, dimensioning and optimization. Designing, dimensioning and optimization of communication networks resources and services have been an inseparable part of telecom network development. The latter must convey a large volume of traffic, providing service to traffic streams with highly differentiated requirements in terms of bit-rate and service time, required quality of service and quality of experience parameters. Such a communication infrastructure presents many important challenges, such as the study of necessary multi-layer cooperation, new protocols, performance evaluation of different network parts, low layer network design, network management and security issues, and new technologies in general, which will be discussed in this book

    Low Power Memory/Memristor Devices and Systems

    Get PDF
    This reprint focusses on achieving low-power computation using memristive devices. The topic was designed as a convenient reference point: it contains a mix of techniques starting from the fundamental manufacturing of memristive devices all the way to applications such as physically unclonable functions, and also covers perspectives on, e.g., in-memory computing, which is inextricably linked with emerging memory devices such as memristors. Finally, the reprint contains a few articles representing how other communities (from typical CMOS design to photonics) are fighting on their own fronts in the quest towards low-power computation, as a comparison with the memristor literature. We hope that readers will enjoy discovering the articles within

    Improving Data Locality in Applications through Execution Delegation

    Get PDF
    With the slowing or even death of Moore’s Law, computer system architectures are trending toward more CPU cores. This trend has driven systems researchers to explore novel ways of utilizing this computational power for improved efficiency and performance. One such approach is to use this power to help alleviate the memory wall problem through execution delegation. The memory wall problem describes the issue whereby system performance hits a wall that is dictated by the latency of accessing main memory. Using execution delegation, the execution of the application on one core is delegated to another core. The desired result is that the cores of the system are specialized to access mostly disjoint sets of data. In this way, data locality and, therefore, performance are improved. The aim of this work is to develop tools and methods for predicting situations in which execution delegation via user thread migration is useful for improving an application’s data locality. To this end, a microbenchmarking tool named Accesstest is used to perform a systematic study of execution delegation via user thread migration. Further, an approach, which makes use of a working set characterization tool named Accessprof, is developed to predict the qualitative impact of delegating an execution sequence. This prediction approach is verified and used to improve the Apache HTTP server’s performance by as much as 11%
    corecore