How Fast Can We Play Tetris Greedily With Rectangular Pieces?
Consider a variant of Tetris played on a board of width and infinite
height, where the pieces are axis-aligned rectangles of arbitrary integer
dimensions, the pieces can only be moved before letting them drop, and a row
does not disappear once it is full. Suppose we want to follow a greedy
strategy: let each rectangle fall where it will end up the lowest given the
current state of the board. To do so, we want a data structure which can always
suggest a greedy move. In other words, we want a data structure which maintains
a set of rectangles, supports queries which return where to drop the
rectangle, and updates which insert a rectangle dropped at a certain position
and return the height of the highest point in the updated set of rectangles. We
show via a reduction to the Multiphase problem [Pătrașcu, 2010] that on
a board of width , if the OMv conjecture [Henzinger et al., 2015]
is true, then both operations cannot be supported in time
simultaneously. The reduction also implies polynomial bounds from the 3-SUM
conjecture and the APSP conjecture. On the other hand, we show that there is a
data structure supporting both operations in time on
boards of width , matching the lower bound up to a factor.
Comment: Correction of typos and other minor corrections.
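The operations described above can be pictured with a brute-force skyline structure. This is a sketch of the problem setup only, not the paper's data structure, whose whole point is to beat this naive approach that scans every candidate position per query:

```python
# Naive skyline maintenance for greedy rectangle Tetris (illustrative only).
# heights[c] holds the current top of column c on a board of finite width.

class Skyline:
    def __init__(self, width):
        self.heights = [0] * width

    def query(self, w):
        """Return the leftmost x where a width-w rectangle lands lowest."""
        best_x, best_top = 0, max(self.heights[0:w])
        for x in range(1, len(self.heights) - w + 1):
            top = max(self.heights[x:x + w])
            if top < best_top:
                best_x, best_top = x, top
        return best_x

    def update(self, x, w, h):
        """Drop a w-by-h rectangle at x; return the new global maximum height."""
        top = max(self.heights[x:x + w])
        for c in range(x, x + w):
            self.heights[c] = top + h
        return max(self.heights)
```

Here both operations take time linear in the board width; the lower bound in the paper says that, under the stated conjectures, no data structure can make both operations simultaneously fast beyond a polynomial threshold.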
Improving Model-Based Software Synthesis: A Focus on Mathematical Structures
Computer hardware keeps increasing in complexity. Software design needs to keep up with this. The right models and abstractions empower developers to leverage the novelties of modern hardware. This thesis deals primarily with Models of Computation, as a basis for software design, in a family of methods called software synthesis.
We focus on Kahn Process Networks and dataflow applications as abstractions, both for programming and for deriving an efficient execution on heterogeneous multicores. The latter we accomplish by exploring the design space of possible mappings of computation and data to hardware resources. Mapping algorithms are not at the center of this thesis, however. Instead, we examine the mathematical structure of the mapping
space, leveraging its inherent symmetries or geometric properties to improve mapping methods in general.
This thesis thoroughly explores the process of model-based design, aiming to go beyond the more established software synthesis on dataflow applications. We start with the problem of assessing these methods through benchmarking, and go on to formally examine the general goals of benchmarks. In this context, we also consider the role modern machine learning methods play in benchmarking.
We explore different established semantics, stretching the limits of Kahn Process Networks. We also discuss novel models, like Reactors, which are designed to be a deterministic, adaptive model with time as a first-class citizen. By investigating abstractions and transformations in the Ohua language for implicit dataflow programming, we also focus on programmability.
The focus of the thesis is on the models and methods, but we evaluate them in diverse use-cases, generally centered around Cyber-Physical Systems. These include the 5G telecommunication standard, automotive and signal processing domains. We even go beyond embedded systems and discuss use-cases in GPU programming and microservice-based architectures.
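One of the symmetries mentioned above can be made concrete: on a platform with identical cores, task-to-core mappings that differ only by a permutation of cores are equivalent. The following sketch (hypothetical helper names, not the thesis's actual tooling) collapses each equivalence class to a canonical representative, shrinking the search space:

```python
from itertools import product

def canonical(mapping):
    """Relabel cores in order of first appearance, so mappings that differ
    only by a permutation of identical cores collapse to one representative."""
    relabel, out = {}, []
    for core in mapping:
        if core not in relabel:
            relabel[core] = len(relabel)
        out.append(relabel[core])
    return tuple(out)

def distinct_mappings(n_tasks, n_cores):
    """Enumerate task-to-core mappings up to core-permutation symmetry."""
    return {canonical(m) for m in product(range(n_cores), repeat=n_tasks)}
```

For 3 tasks on 3 identical cores there are 27 raw mappings but only 5 canonical ones (the set partitions of the tasks), which is the kind of reduction symmetry-aware mapping methods exploit.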
Compilation techniques for automatic extraction of parallelism and locality in heterogeneous architectures
[Abstract]
High performance computing has become a key enabler for innovation in science
and industry. This fact has unleashed a continuous demand of more computing
power that the silicon industry has satisfied with parallel and heterogeneous
architectures, and complex memory hierarchies. As a consequence, software
developers have been challenged to write new codes and rewrite the old
ones to be efficient in these new systems. Unfortunately, success cases are scarce
and require huge investments in human workforce. Current compilers generate
peak-performance binary code in monocore architectures. Building on this success,
this thesis explores new ideas in compiler design to overcome this challenge with
the automatic extraction of parallelism and locality. First, we present a new compiler
intermediate representation based on diKernels named KIR, which is insensitive
to syntactic variations in the source code and exposes multiple levels of
parallelism. On top of the KIR, we build a source-to-source approach that generates
parallel code annotated with compiler directives: OpenMP for multicores
and OpenHMPP for GPUs. Finally, we model program behavior from the point
of view of the memory accesses through the reconstruction of affine loops for sequential
and parallel codes. The experimental evaluations throughout the thesis
corroborate the effectiveness and efficiency of the proposed solutions.
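The reconstruction of affine loops from memory accesses can be illustrated in miniature: given a trace of (loop index, address) pairs, recover the affine access function addr = stride*i + base. This toy fitter is an assumption-laden sketch, not the thesis's actual algorithm:

```python
def fit_affine(trace):
    """Given (i, address) pairs observed for one loop index, recover the
    affine access function addr = stride*i + base, or return None if the
    trace is not affine in that index."""
    (i0, a0), (i1, a1) = trace[0], trace[1]
    stride = (a1 - a0) // (i1 - i0)   # candidate coefficient
    base = a0 - stride * i0           # candidate constant term
    for i, addr in trace:
        if addr != stride * i + base:
            return None               # some access breaks the affine model
    return stride, base
```

For a sequential scan of an array of 8-byte elements starting at address 1000, the trace [(0, 1000), (1, 1008), (2, 1016)] yields (8, 1000); an affine model of this kind is what makes locality and parallelism analyses tractable.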
Design Space Exploration and Resource Management of Multi/Many-Core Systems
The increasing demand of processing a higher number of applications and related data on computing platforms has resulted in reliance on multi-/many-core chips as they facilitate parallel processing. However, there is a desire for these platforms to be energy-efficient and reliable, and they need to perform secure computations for the interest of the whole community. This book provides perspectives on the aforementioned aspects from leading researchers in terms of state-of-the-art contributions and upcoming trends
Design and Code Optimization for Systems with Next-generation Racetrack Memories
With the rise of computationally expensive application domains such as machine learning, genomics, and fluids simulation, the quest for performance and energy-efficient computing has gained unprecedented momentum. The significant increase in computing and memory devices in modern systems has resulted in an unsustainable surge in energy consumption, a substantial portion of which is attributed to the memory system. The scaling of conventional memory technologies and their suitability for next-generation systems is also questionable. This has led to the emergence and rise of nonvolatile memory (NVM) technologies. Today, in different development stages, several NVM technologies are competing for rapid access to the market.
Racetrack memory (RTM) is one such nonvolatile memory technology that promises SRAM-comparable latency, reduced energy consumption, and unprecedented density compared to other technologies. However, RTM is sequential in nature, i.e., data in an RTM cell needs to be shifted to an access port before it can be accessed. These shift operations incur performance and energy penalties. An ideal RTM, requiring at most one shift per access, can easily outperform SRAM. However, in the worst-case shifting scenario, RTM can be an order of magnitude slower than SRAM.
This thesis presents an overview of the RTM device physics, its evolution, strengths and challenges, and its application in the memory subsystem. We develop tools that allow the programmability and modeling of RTM-based systems. For shift minimization, we propose a set of techniques including optimal, near-optimal, and evolutionary algorithms for efficient scalar and instruction placement in RTMs. For array accesses, we explore schedule and layout transformations that eliminate the longer overhead shifts in RTMs. We present an automatic compilation framework that analyzes static control flow programs and transforms the loop traversal order and memory layout to maximize accesses to consecutive RTM locations and minimize shifts. We develop a simulation framework called RTSim that models various RTM parameters and enables accurate architectural-level simulation.
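The shift cost that these placement techniques minimize can be sketched with a minimal model, assuming a single track with one access port (the function below is an illustrative simplification, not the thesis's cost model or RTSim):

```python
def shift_cost(layout, accesses, port=0):
    """Count shifts for an RTM track with a single access port.
    layout: list of variables, one per cell, in track order.
    accesses: sequence of variables read/written.
    The track must shift so the requested cell sits under the port."""
    pos = {v: i for i, v in enumerate(layout)}
    under_port = port   # cell currently aligned with the port
    shifts = 0
    for v in accesses:
        shifts += abs(pos[v] - under_port)
        under_port = pos[v]
    return shifts
```

For the access pattern a, b, a, b, the layout ['a', 'x', 'y', 'b'] costs 9 shifts while ['a', 'b', 'x', 'y'] costs 3, which is exactly why data placement matters so much for RTM performance.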
Finally, to demonstrate the RTM potential in non-Von-Neumann in-memory computing paradigms, we exploit its device attributes to implement logic and arithmetic operations. As a concrete use case, we implement an entire hyperdimensional computing framework in RTM to accelerate the language recognition problem. Our evaluation shows considerable performance and energy improvements compared to conventional Von-Neumann models and state-of-the-art accelerators.
Generalized strictly periodic scheduling analysis, resource optimization, and implementation of adaptive streaming applications
This thesis focuses on addressing four research problems in designing embedded streaming systems. Embedded streaming systems are those systems that process a stream of input data coming from the environment and generate a stream of output data going into the environment. For many embedded streaming systems, timing is a critical design requirement, in which the correct behavior depends both on the correctness of the output data and on the time at which the data is produced. An embedded streaming system subject to such a timing requirement is called a real-time system. Some examples of real-time embedded streaming systems can be found in various autonomous mobile systems, such as planes, self-driving cars, and drones.
To handle the tight timing requirements of such real-time embedded streaming systems, modern embedded systems have been equipped with hardware platforms, the so-called Multi-Processor Systems-on-Chip (MPSoC), that contain multiple processors, memories, interconnections, and other hardware peripherals on a single chip, to benefit from parallel execution. To efficiently exploit the computational capacity of an MPSoC platform, a streaming application which is going to be executed on the MPSoC platform must be expressed primarily in a parallel fashion, i.e., the application is represented as a set of parallel executing and communicating tasks. Then, the main challenge is how to schedule the tasks spatially, i.e., task mapping, and temporally, i.e., task scheduling, on the MPSoC platform such that all timing requirements are satisfied while making efficient utilization of available resources (e.g., processors, memory, energy, etc.) on the platform. Another challenge is how to implement and run the mapped and scheduled application tasks on the MPSoC platform. This thesis proposes several techniques to address the aforementioned two challenges.
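The task-mapping challenge can be illustrated with the classical building block that generalized scheduling analyses refine: first-fit allocation of periodic tasks using the EDF utilization bound on each processor. This is a textbook sketch, not the thesis's generalized strictly periodic analysis:

```python
def first_fit(tasks, n_procs):
    """tasks: list of (wcet, period) pairs for independent periodic tasks.
    Assign each task to the first processor whose total utilization stays
    <= 1 (the EDF schedulability bound on a single core). Returns a
    task-index -> processor mapping, or None if some task does not fit."""
    util = [0.0] * n_procs
    mapping = {}
    for idx, (wcet, period) in enumerate(tasks):
        u = wcet / period
        for p in range(n_procs):
            if util[p] + u <= 1.0 + 1e-12:   # tolerate float rounding
                util[p] += u
                mapping[idx] = p
                break
        else:
            return None   # no processor has enough spare utilization
    return mapping
```

Three tasks of utilization 0.5 each fit on two processors but not on one; the thesis's contribution lies in analyses and resource optimizations that go well beyond such a simple utilization test, e.g., for adaptive streaming applications with data-dependent behavior.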