250 research outputs found
On the fly type specialization without type analysis
Les langages de programmation typĂ©s dynamiquement tels que JavaScript et Python repoussent la vĂ©rification de typage jusquâau moment de lâexĂ©cution. Afin dâoptimiser la performance de ces langages, les implĂ©mentations de machines virtuelles pour langages dynamiques doivent tenter dâĂ©liminer les tests de typage dynamiques redondants. Cela se fait habituellement en utilisant une analyse dâinfĂ©rence de types. Cependant, les analyses de ce genre sont souvent coĂ»teuses et impliquent des compromis entre le temps de compilation et la prĂ©cision des rĂ©sultats obtenus. Ceci a conduit Ă la conception dâarchitectures de VM de plus en plus complexes.
Nous proposons le versionnement paresseux de blocs de base, une technique de compilation Ă la volĂ©e simple qui Ă©limine efficacement les tests de typage dynamiques redondants sur les chemins dâexĂ©cution critiques. Cette nouvelle approche gĂ©nĂšre paresseusement des versions spĂ©cialisĂ©es des blocs de base tout en propageant de lâinformation de typage contextualisĂ©e. Notre technique ne nĂ©cessite pas lâutilisation dâanalyses de programme coĂ»teuses, nâest pas contrainte par les limitations de prĂ©cision des analyses dâinfĂ©rence de types traditionnelles et Ă©vite la complexitĂ© des techniques dâoptimisation spĂ©culatives.
Trois extensions sont apportĂ©es au versionnement de blocs de base afin de lui donner des capacitĂ©s dâoptimisation interprocĂ©durale. Une premiĂšre extension lui donne la possibilitĂ© de joindre des informations de typage aux propriĂ©tĂ©s des objets et aux variables globales. Puis, la spĂ©cialisation de points dâentrĂ©e lui permet de passer de lâinformation de typage des fonctions appellantes aux fonctions appellĂ©es. Finalement, la spĂ©cialisation des continuations dâappels permet de transmettre le type des valeurs de retour des fonctions appellĂ©es aux appellants sans coĂ»t dynamique. Nous dĂ©montrons empiriquement que ces extensions permettent au versionnement de blocs de base dâĂ©liminer plus de tests de typage dynamiques que toute analyse dâinfĂ©rence de typage statique.Dynamically typed programming languages such as JavaScript and Python defer type checking to run time. In order to maximize performance, dynamic language virtual
machine implementations must attempt to eliminate redundant dynamic type checks. This is typically done using type inference analysis. However, type inference analyses
are often costly and involve tradeoffs between compilation time and resulting precision. This has lead to the creation of increasingly complex multi-tiered VM architectures.
We introduce lazy basic block versioning, a simple just-in-time compilation technique which effectively removes redundant type checks from critical code paths. This
novel approach lazily generates type-specialized versions of basic blocks on the fly while propagating context-dependent type information. This does not require the use of costly
program analyses, is not restricted by the precision limitations of traditional type analyses and avoids the implementation complexity of speculative optimization techniques.
Three extensions are made to the basic block versioning technique in order to give it interprocedural optimization capabilities. Typed object shapes give it the ability to
attach type information to object properties and global variables. Entry point specialization allows it to pass type information from callers to callees, and call continuation
specialization makes it possible to pass return value type information back to callers without dynamic overhead. We empirically demonstrate that these extensions enable
basic block versioning to exceed the capabilities of static whole-program type analyses
Recommended from our members
Complete spatial safety for C and C++ using CHERI capabilities
Lack of memory safety in commonly used systems-level languages such as C and C++ results in a constant stream of new exploitable software vulnerabilities and exploit techniques. Many exploit mitigations have been proposed and deployed over the years, yet none address the root issue: lack of memory safety. Most C and C++ implementations assume a memory model based on a linear array of bytes rather than an object-centric view. Whilst more efficient on contemporary CPU architectures, linear addresses cannot encode the target object, thus permitting memory errors such as spatial safety violations (ignoring the bounds of an object). One promising mechanism to provide memory safety is CHERI
(Capability Hardware Enhanced RISC Instructions), which extends existing processor architectures with capabilities that provide hardware-enforced checks for all accesses and can be used to prevent spatial memory violations. This dissertation prototypes and evaluates a pure-capability programming model (using CHERI capabilities for all pointers) to provide complete spatial memory protection for traditionally unsafe languages.
As the first step towards memory safety, all language-visible pointers can be implemented as capabilities. I analyse the programmer-visible impact of this change and refine the pure-capability programming model to provide strong source-level compatibility with existing code. Second, to provide robust spatial safety, language-invisible pointers (mostly arising from program linkage) such as those used for functions calls and global variable accesses must also be protected. In doing so, I highlight trade-offs between performance and privilege minimization for implicit and programmer-visible pointers. Finally, I present
CheriSH, a novel and highly compatible technique that protects against buffer overflows between fields of the same object, hereby ensuring that the CHERI spatial memory protection is complete.
I find that the byte-granular spatial safety provided by CHERI pure-capability code is not only stronger than most other approaches, but also incurs almost negligible performance overheads in common cases (0.1% geometric mean) and a worst-case overhead of only 23.3% compared to the insecure MIPS baseline. Moreover, I show that the pure-capability programming model provides near-complete source-level compatibility with existing programs. I evaluate this based on porting large widely used open-source applications such as PostgreSQL and WebKit with only minimal changes: fewer than 0.1% of source lines.
I conclude that pure-capability CHERI C/C++ is an eminently viable programming environment offering strong memory protection, good source-level compatibility and low performance overheads
Scaling finite difference methods in large eddy simulation of jet engine noise to the petascale: numerical methods and their efficient and automated implementation
Reduction of jet engine noise has recently become a new arena of competition between aircraft manufacturers. As a relatively new field of research in computational fluid dynamics (CFD), computational aeroacoustics (CAA) prediction of jet engine noise based on large eddy simulation (LES) is a robust and accurate tool that complements the existing theoretical and experimental approaches. In order to satisfy the stringent requirements of CAA on numerical accuracy, finite difference methods in LES-based jet engine noise prediction rely on the implicitly formulated compact spatial partial differentiation and spatial filtering schemes, a crucial component of which is an embedded solver for tridiagonal linear systems spatially oriented along the three coordinate directions of the computational space. Traditionally, researchers and engineers in CAA have employed manually crafted implementations of solvers including the transposition method, the multiblock method and the Schur complement method. Algorithmically, these solvers force a trade-off between numerical accuracy and parallel scalability. Programmingwise, implementing them for each of the three coordinate directions is tediously repetitive and error-prone. ^ In this study, we attempt to tackle both of these two challenges faced by researchers and engineers. We first describe an accurate and scalable tridiagonal linear system solver as a specialization of the truncated SPIKE algorithm and strategies for efficient implementation of the compact spatial partial differentiation and spatial filtering schemes. We then elaborate on two programming models tailored for composing regular grid-based numerical applications including finite difference-based LES of jet engine noise, one based on generalized elemental subroutines and the other based on functional array programming, and the accompanying code optimization and generation methodologies. Through empirical experiments, we demonstrate that truncated SPIKE-based spatial partial differentiation and spatial filtering deliver the theoretically promised optimal scalability in weak scaling conditions and can be implemented using the two programming models with performance on par with handwritten code while significantly reducing the required programming effort
SUDS : automatic parallelization for raw processors
Thesis (Ph. D.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2003.Includes bibliographical references (p. 177-181).A computer can never be too fast or too cheap. Computer systems pervade nearly every aspect of science, engineering, communications and commerce because they perform certain tasks at rates unachievable by any other kind of system built by humans. A computer system's throughput, however, is constrained by that system's ability to find concurrency. Given a particular target work load the computer architect's role is to design mechanisms to find and exploit the available concurrency in that work load. This thesis describes SUDS (Software Un-Do System), a compiler and runtime system that can automatically find and exploit the available concurrency of scalar operations in imperative programs with arbitrary unstructured and unpredictable control flow. The core compiler transformation that enables this is scalar queue conversion. Scalar queue conversion makes scalar renaming an explicit operation through a process similar to closure conversion, a technique traditionally used to compile functional languages. The scalar queue conversion compiler transformation is speculative, in the sense that it may introduce dynamic memory allocation operations into code that would not otherwise dynamically allocate memory. Thus, SUDS also includes a transactional runtime system that periodically checkpoints machine state, executes code speculatively, checks if the speculative execution produced results consistent with the original sequential program semantics, and then either commits or rolls back the speculative execution path. In addition to safely running scalar queue converted code, the SUDS runtime system safely permits threads to speculatively run in parallel and concurrently issue memory operations, even when the compiler is unable to prove that the reordered memory operations will always produce correct results.(cont.) Using this combination of compile time and runtime techniques, SUDS can find concurrency in programs where previous compiler based renaming techniques fail because the programs contain unstructured loops, and where Tomasulo's algorithm fails because it sequentializes mispredicted branches. Indeed, we describe three application programs, with unstructured control flow, where the prototype SUDS system, running in software on a Raw microprocessor, achieves speedups equivalent to, or better than, an idealized, and unrealizable, model of a hardware implementation of Tomasulo's algorithm.by Matthew Ian Frank.Ph.D
Low-Level Haskell Code: Measurements and Optimization Techniques
Haskell is a lazy functional language with a strong static type system and
excellent support for parallel programming. The language features of Haskell
make it easier to write correct and maintainable programs, but execution speed
often suffers from the high levels of abstraction. While much past research
focuses on high-level optimizations that take advantage of the functional
properties of Haskell, relatively little attention has been paid to the
optimization opportunities in the low-level imperative code generated during
translation to machine code. One problem with current low-level optimizations
is that their effectiveness is limited by the obscured control flow caused by
Haskell's high-level abstractions. My thesis is that trace-based optimization
techniques can be used to improve the effectiveness of low-level optimizations
for Haskell programs. I claim three unique contributions in this work.
The first contribution is to expose some properties of low-level Haskell codes
by looking at the mix of operations performed by the selected benchmark codes
and comparing them to the low-level codes coming from traditional programming
languages. The low-level measurements reveal that the control flow is obscured
by indirect jumps caused by the implementation of lazy evaluation,
higher-order functions, and the separately managed stacks used by Haskell
programs.
My second contribution is a study on the effectiveness of a dynamic binary
trace-based optimizer running on Haskell programs. My results show that while
viable program traces frequently occur in Haskell programs the overhead
associated with maintaing the traces in a dynamic optimization system outweigh
the benefits we get from running the traces. To reduce the runtime overheads,
I explore a way to find traces in a separate profiling step.
My final contribution is to build and evaluate a static trace-based optimizer
for Haskell programs. The static optimizer uses profiling data to find traces
in a Haskell program and then restructures the code around the traces to
increase the scope available to the low-level optimizer. My results show that
we can successfully build traces in Haskell programs, and the optimized code
yields a speedup over existing low-level optimizers of up to 86%
with an average speedup of 5% across 32 benchmarks
Recommended from our members
Scalable hardware memory disambiguation
This dissertation deals with one of the long-standing problems in Computer Architecture
â the problem of memory disambiguation. Microprocessors typically reorder
memory instructions during execution to improve concurrency. Such microprocessors
use hardware memory structures for memory disambiguation, known as LoadStore
Queues (LSQs), to ensure that memory instruction dependences are satisfied
even when the memory instructions execute out-of-order. A typical LSQ implementation
(circa 2006) holds all in-flight memory instructions in a physically centralized
LSQ and performs a fully associative search on all buffered instructions to ensure
that memory dependences are satisfied. These LSQ implementations do not scale
because they use large, fully associative structures, which are known to be slow and
power hungry. The increasing trend towards distributed microarchitectures further
exacerbates these problems. As on-chip wire delays increase and high-performance
processors become necessarily distributed, centralized structures such as the LSQ
can limit scalability.
This dissertation describes techniques to create scalable LSQs in both centralized
and distributed microarchitectures. The problems and solutions described
in this thesis are motivated and validated by real system designs. The dissertation
starts with a description of the partitioned primary memory system of the TRIPS
processor, of which the LSQ is an important component, and then through a series
of optimizations describes how the power, area, and centralization problems
of the LSQ can be solved with minor performance losses (if at all) even for large
number of in flight memory instructions. The four solutions described in this dissertation
â partitioning, filtering, late binding and efficient overflow management â
enable power-, area-efficient, distributed and scalable LSQs, which in turn enable
aggressive large-window processors capable of simultaneously executing thousands
of instructions.
To mitigate the power problem, we replaced the power-hungry, fully associative
search with a power-efficient hash table lookup using a simple address-based
Bloom filter. Bloom filters are probabilistic data structures used for testing set
membership and can be used to quickly check if an instruction with the same data
address is likely to be found in the LSQ without performing the associative search.
Bloom filters typically eliminate more than 80% of the associative searches and they
are highly effective because in most programs, it is uncommon for loads and stores
to have the same data address and be in execution simultaneously.
To rectify the area problem, we observe the fact that only a small fraction
of all memory instructions are dependent, that only such dependent instructions
need to be buffered in the LSQ, and that these instructions need to be in the LSQ
only for certain parts of the pipelined execution. We propose two mechanisms to
exploit these observations. The first mechanism, area filtering, is a hardware mechanism
that couples Bloom filters and dependence predictors to dynamically identify
and buffer only those instructions which are likely to be dependent. The second
mechanism, late binding, reduces the occupancy and hence size of the LSQ. Both of
these optimizations allows the number of LSQ slots to be reduced by up to one-half
compared to a traditional organization without any performance degradation.
Finally, we describe a new decentralized LSQ design for handling LSQ structural
hazards in distributed microarchitectures. Decentralization of LSQs, and to
a large extent distributed microarchitectures with memory speculation, has proved
to be impractical because of the high performance penalties associated with the
mechanisms for dealing with hazards. To solve this problem, we applied classic
flow-control techniques from interconnection networks for handling resource con-
flicts. The first method, memory-side buffering, buffers the overflowing instructions
in a separate buffer near the LSQs. The second scheme, execution-side NACKing,
sends the overflowing instruction back to the issue window from which it is later
re-issued. The third scheme, network buffering, uses the buffers in the interconnection
network between the execution units and memory to hold instructions when the
LSQ is full, and uses virtual channel flow control to avoid deadlocks. The network
buffering scheme is the most robust of all the overflow schemes and shows less than
1% performance degradation due to overflows for a subset of SPEC CPU 2000 and
EEMBC benchmarks on a cycle-accurate simulator that closely models the TRIPS
processor.
The techniques proposed in this dissertation are independent, architectureneutral
and their cumulative benefits result in LSQs that can be partitioned at a
fine granularity and have low design complexity. Each of these partitions selectively
buffers only memory instructions with true dependences and can be closely coupled
with the execution units thus minimizing power, area, and latency. Such LSQ
designs with near-ideal characteristics are well suited for microarchitectures with
thousands of instructions in-flight and may enable even more aggressive microarchitectures
in the future.Computer Science
Adaptive memory hierarchies for next generation tiled microarchitectures
Les Ășltimes dĂšcades el rendiment dels processadors i de les memĂČries ha millorat a diferent ritme, limitant el rendiment dels processadors i creant el conegut memory gap. Sol·lucionar aquesta diferĂšncia de rendiment Ă©s un camp d'investigaciĂł d'actualitat i que requereix de noves sol·lucions. Una sol·luciĂł a aquest problema sĂłn les memĂČries âcacheâ, que permeten reduĂŻr l'impacte d'unes latĂšncies de memĂČria creixents i que conformen la jerarquia de memĂČria. La majoria de d'organitzacions de les âcachesâ estan dissenyades per a uniprocessadors o multiprcessadors tradicionals. Avui en dia, perĂČ, el creixent nombre de transistors disponible per xip ha permĂšs l'apariciĂł de xips multiprocessador (CMPs). Aquests xips tenen diferents propietats i limitacions i per tant requereixen de jerarquies de memĂČria especĂfiques per tal de gestionar eficientment els recursos disponibles. En aquesta tesi ens hem centrat en millorar el rendiment i la eficiĂšncia energĂštica de la jerarquia de memĂČria per CMPs, des de les âcachesâ fins als controladors de memĂČria.
A la primera part d'aquesta tesi, s'han estudiat organitzacions tradicionals per les âcachesâ com les privades o compartides i s'ha pogut constatar que, tot i que funcionen bĂ© per a algunes aplicacions, un sistema que s'ajustĂ©s dinĂ micament seria mĂ©s eficient. TĂšcniques com el Cooperative Caching (CC) combinen els avantatges de les dues tĂšcniques perĂČ requereixen un mecanisme centralitzat de coherĂšncia que tĂ© un consum energĂštic molt elevat. Ăs per aixĂČ que en aquesta tesi es proposa el Distributed Cooperative Caching (DCC), un mecanisme que proporciona coherĂšncia en CMPs i aplica el concepte del cooperative caching de forma distribuĂŻda. Mitjançant l'Ășs de directoris distribuĂŻts s'obtĂ© una sol·luciĂł mĂ©s escalable i que, a mĂ©s, disposa d'un mecanisme de marcatge mĂ©s flexible i eficient energĂšticament.
A la segona part, es demostra que les aplicacions fan diferents usos de la âcacheâ i que si es realitza una distribuciĂł de recursos eficient es poden aprofitar els que estan infrautilitzats. Es proposa l'Elastic Cooperative Caching (ElasticCC), una organitzaciĂł capaç de redistribuĂŻr la memĂČria âcacheâ dinĂ micament segons els requeriments de cada aplicaciĂł. Una de les contribucions mĂ©s importants d'aquesta tĂšcnica Ă©s que la reconfiguraciĂł es decideix completament a travĂ©s del maquinari i que tots els mecanismes utilitzats es basen en estructures distribuĂŻdes, permetent una millor escalabilitat. ElasticCC no nomĂ©s Ă©s capaç de reparticionar les âcachesâ segons els requeriments de cada aplicaciĂł, sinĂł que, a mĂ©s a mĂ©s, Ă©s capaç d'adaptar-se a les diferents fases d'execuciĂł de cada una d'elles. La nostra avaluaciĂł tambĂ© demostra que la reconfiguraciĂł dinĂ mica de l'ElasticCC Ă©s tant eficient que gairebĂ© proporciona la mateixa taxa de fallades que una configuraciĂł amb el doble de memĂČria.Finalment, la tesi es centra en l'estudi del comportament de les memĂČries DRAM i els seus controladors en els CMPs. Es demostra que, tot i que els controladors tradicionals funcionen eficientment per uniprocessadors, en CMPs els diferents patrons d'accĂ©s obliguen a repensar com estan dissenyats aquests sistemes. S'han presentat mĂșltiples sol·lucions per CMPs perĂČ totes elles es veuen limitades per un compromĂs entre el rendiment global i l'equitat en l'assignaciĂł de recursos. En aquesta tesi es proposen els Thread Row Buffers (TRBs), una zona d'emmagatenament extra a les memĂČries DRAM que permetria guardar files de dades especĂfiques per a cada aplicaciĂł. Aquest mecanisme permet proporcionar un accĂ©s equitatiu a la memĂČria sense perjudicar el seu rendiment global.
En resum, en aquesta tesi es presenten noves organitzacions per la jerarquia de memĂČria dels CMPs centrades en la escalabilitat i adaptativitat als requeriments de les aplicacions. Els resultats presentats demostren que les tĂšcniques proposades proporcionen un millor rendiment i eficiĂšncia energĂštica que les millors tĂšcniques existents fins a l'actualitat.Processor performance and memory performance have improved at different rates during the last decades, limiting processor performance and creating the well known "memory gap". Solving this performance difference is an important research field and new solutions must be proposed in order to have better processors in the future. Several solutions exist, such as caches, that reduce the impact of longer memory accesses and conform the system memory hierarchy. However, most of the existing memory hierarchy organizations were designed for single processors or traditional multiprocessors. Nowadays, the increasing number of available transistors has allowed the apparition of chip multiprocessors, which have different constraints and require new ad-hoc memory systems able to efficiently manage memory resources. Therefore, in this thesis we have focused on improving the performance and energy efficiency of the memory hierarchy of chip multiprocessors, ranging from caches to DRAM memories.
In the first part of this thesis we have studied traditional cache organizations such as shared or private caches and we have seen that they behave well only for some applications and that an adaptive system would be desirable. State-of-the-art techniques such as Cooperative Caching (CC) take advantage of the benefits of both worlds. This technique, however, requires the usage of a centralized coherence structure and has a high energy consumption. Therefore we propose the Distributed Cooperative Caching (DCC), a mechanism to provide coherence to chip multiprocessors and apply the concept of cooperative caching in a distributed way. Through the usage of distributed directories we obtain a more scalable solution and, in addition, has a more flexible and energy-efficient tag allocation method.
We also show that applications make different uses of cache and that an efficient allocation can take advantage of unused resources. We propose Elastic Cooperative Caching (ElasticCC), an adaptive cache organization able to redistribute cache resources dynamically depending on application requirements. One of the most important contributions of this technique is that adaptivity is fully managed by hardware and that all repartitioning mechanisms are based on distributed structures, allowing a better scalability. ElasticCC not only is able to repartition cache sizes to application requirements, but also is able to dynamically adapt to the different execution phases of each thread. Our experimental evaluation also has shown that the cache partitioning provided by ElasticCC is efficient and is almost able to match the off-chip miss rate of a configuration that doubles the cache space.
Finally, we focus in the behavior of DRAM memories and memory controllers in chip multiprocessors. Although traditional memory schedulers work well for uniprocessors, we show that new access patterns advocate for a redesign of some parts of DRAM memories. Several organizations exist for multiprocessor DRAM schedulers, however, all of them must trade-off between memory throughput and fairness. We propose Thread Row Buffers, an extended storage area in DRAM memories able to store a data row for each thread. This mechanism enables a fair memory access scheduling without hurting memory throughput.
Overall, in this thesis we present new organizations for the memory hierarchy of chip multiprocessors which focus on the scalability and of the proposed structures and adaptivity to application behavior. Results show that the presented techniques provide a better performance and energy-efficiency than existing state-of-the-art solutions
Compiling Scala for Performance
Scala is a new programming language bringing together object-oriented and functional programming. Its defining features are uniformity and extensibility. Scala offers great flexibility for programmers, allowing them to grow the language through libraries. Oftentimes what seems like a language feature is in fact implemented in a library, effectively giving programmers the power of language designers. The downside of this flexibility is that familiar looking code may hide unexpected performance costs. It is important for Scala compilers to bring down this cost as much as possible. We identify several areas of impact for Scala performance: higher-order functions and closures, and generic containers used with primitive types. We present two complementary approaches for improving performance in these areas: optimizations and specialization. Compiler optimization can bring down the cost through a combination of aggressive inlining of higher-order functions, an extended version of copy-propagation and dead-code elimination. Both anonymous functions and boxing can be eliminated by this approach. We show on a number of benchmarks that these language features can be up to 5 times faster when properly optimized, on current day JVMs. We propose a new approach to compiling parametric polymorphism for performance at primitive types. We mix a homogeneous translation scheme with user-directed specialization for primitive types. Type parameters may be annotated to require specialization of code depending on them. We propose definition-site specialization for primitive types, achieving separate compilation and no boxing when both the definition and call site are specialized. Specialized classes are compatible with unspecialized code, and specialization agnostic code can work with specialized instances, meaning that specialization is opportunistic. We present a formalism of a small subset of Scala with specialization and prove that specialization preserves types. We implemented this translation in the Scala compiler and report on improvements on a set of benchmarks, showing that specialization can make programs more than two times faster
Design and optimisation of scientific programs in a categorical language
This thesis presents an investigation into the use of advanced computer languages for scientific computing, an examination of performance issues that arise from using such languages for such a task, and a step toward achieving portable performance from compilers by attacking these problems in a way that compensates for the complexity of and differences between modern computer architectures. The language employed is Aldor, a functional language from computer algebra, and the scientific computing area is a subset of the family of iterative linear equation solvers applied to sparse systems. The linear equation solvers that are considered have much common structure, and this is factored out and represented explicitly in the lan-guage as a framework, by means of categories and domains. The flexibility introduced by decomposing the algorithms and the objects they act on into separate modules has a strong performance impact due to its negative effect on temporal locality. This necessi-tates breaking the barriers between modules to perform cross-component optimisation. In this instance the task reduces to one of collective loop fusion and array contrac
- âŠ