274 research outputs found
End-to-End Application Cloning for Distributed Cloud Microservices with Ditto
We present Ditto, an automated framework for cloning end-to-end cloud
applications, both monolithic and microservices, which captures I/O and network
activity, as well as kernel operations, in addition to application logic. Ditto
takes a hierarchical approach to application cloning, starting with capturing
the dependency graph across distributed services, to recreating each tier's
control/data flow, and finally generating system calls and assembly that mimics
the individual applications. Ditto does not reveal the logic of the original
application, facilitating publicly sharing clones of production services with
hardware vendors, cloud providers, and the research community.
We show that across a diverse set of single- and multi-tier applications,
Ditto accurately captures their CPU and memory characteristics as well as their
high-level performance metrics, is portable across platforms, and facilitates a
wide range of system studies
DIA: A complexity-effective decoding architecture
Fast instruction decoding is a true challenge for the design of CISC microprocessors implementing variable-length instructions. A well-known solution to overcome this problem is caching decoded instructions in a hardware buffer. Fetching already decoded instructions avoids the need for decoding them again, improving processor performance. However, introducing such special--purpose storage in the processor design involves an important increase in the fetch architecture complexity. In this paper, we propose a novel decoding architecture that reduces the fetch engine implementation cost. Instead of using a special-purpose hardware buffer, our proposal stores frequently decoded instructions in the memory hierarchy. The address where the decoded instructions are stored is kept in the branch prediction mechanism, enabling it to guide our decoding architecture. This makes it possible for the processor front end to fetch already decoded instructions from the memory instead of the original nondecoded instructions. Our results show that using our decoding architecture, a state-of-the-art superscalar processor achieves competitive performance improvements, while requiring less chip area and energy consumption in the fetch architecture than a hardware code caching mechanism.Peer ReviewedPostprint (published version
Opportunistic acceleration of array-centric Python computation in heterogeneous environments
Dynamic scripting languages, like Python, are growing in popularity and increasingly used by non-expert programmers. These languages provide high level abstractions such as safe memory management, dynamic type handling and array bounds checking. The reduction in boilerplate code enables the concise expression of computation compared to statically typed and compiled languages. This improves programmer productivity. Increasingly, scripting languages are used by domain experts to write numerically intensive code in a variety of domains (e.g. Economics, Zoology, Archaeology and Physics). These programs are often used not just for prototyping but also in deployment. However, such managed program execution comes with a significant performance penalty arising from the interpreter having to decode and dispatch based on dynamic type checking.
Modern computer systems are increasingly equipped with accelerators such as GPUs. However, the massive speedups that can be achieved by GPU accelerators come at the cost of program complexity. Directly programming a GPU requires a deep understanding of the computational model of the underlying hardware architecture. While the complexity of such devices is abstracted by programming languages specialised for heterogeneous devices such as CUDA and OpenCL, these are dialects of the low-level C systems programming language used primarily by expert programmers.
This thesis presents the design and implementation of ALPyNA, a loop parallelisation and GPU code generation framework. A novel staged parallelisation approach is used to aggressively parallelise each execution instance of a loop nest. Loop dependence relationships that cannot be inferred statically are deferred for runtime analysis. At runtime, these dependences are augmented with runtime information obtained by introspection and the loop nest is parallelised. Parallel GPU kernels are customised to the runtime dependence graph, JIT compiled and executed.
A systematic analysis of the execution speed of loop nests is performed using 12 standard loop intensive benchmarks. The evaluation is performed on two CPU–GPU machines. One is a server grade machine while the other is a typical desktop. ALPyNA’s GPU kernels achieve orders of magnitude speedup over the baseline interpreter execution time (up to 16435x) and large speedups (up to 179.55x) over JIT compiled CPU code.
The varied performance of JIT compiled GPU code motivates the need for a sophisticated cost model to select the device providing the best speedups at runtime for varying domain sizes. This thesis describes a novel lightweight analytical cost model to determine the fastest device to execute a loop nest at runtime. The ALPyNA Cost Model (ACM) adapts to runtime dependence analysis and is parameterised on the hardware characteristics of the underlying target CPU or GPU. The cost model also takes into account the relative rate at which the interpreter is able to supply the GPU with computational work. ACM is re-targetable to other accelerator devices and only requires minimal install time profiling
Software Performance Engineering using Virtual Time Program Execution
In this thesis we introduce a novel approach to software performance engineering that is based
on the execution of code in virtual time. Virtual time execution models the timing-behaviour
of unmodified applications by scaling observed method times or replacing them with results
acquired from performance model simulation. This facilitates the investigation of "what-if" performance predictions of applications comprising an arbitrary combination of real code and
performance models. The ability to analyse code and models in a single framework enables
performance testing throughout the software lifecycle, without the need to to extract performance
models from code. This is accomplished by forcing thread scheduling decisions to take
into account the hypothetical time-scaling or model-based performance specifications of each
method. The virtual time execution of I/O operations or multicore targets is also investigated.
We explore these ideas using a Virtual EXecution (VEX) framework, which provides performance
predictions for multi-threaded applications. The language-independent VEX core is
driven by an instrumentation layer that notifies it of thread state changes and method profiling events; it is then up to VEX to control the progress of application threads in virtual time on top of the operating system scheduler. We also describe a Java Instrumentation Environment
(JINE), demonstrating the challenges involved in virtual time execution at the JVM level.
We evaluate the VEX/JINE tools by executing client-side Java benchmarks in virtual time
and identifying the causes of deviations from observed real times. Our results show that VEX
and JINE transparently provide predictions for the response time of unmodified applications
with typically good accuracy (within 5-10%) and low simulation overheads (25-50% additional
time). We conclude this thesis with a case study that shows how models and code can be
integrated, thus illustrating our vision on how virtual time execution can support performance
testing throughout the software lifecycle
APOLLO: Automatic speculative POLyhedral Loop Optimizer
International audienceA few weeks ago, we were glad to announce the first release of Apollo, the Automatic speculative POLyhedral Loop Opti-mizer. Apollo applies polyhedral optimizations on-the-fly to loop nests, whose control flow and memory access patterns cannot be determined at compile-time. In contrast to existing tools, Apollo can handle any kind of loop nest, whose memory accesses can be performed through pointers and in-directions. At runtime, Apollo builds a predictive polyhedral model, which is used for speculative optimization including parallelization. Being a dynamic system, Apollo can even apply the polyhedral model to nonlinear loops. This paper describes Apollo from the perspective of a user, as well as some of its main contributions and mechanisms, including the just-in-time polyhedral compilation, that significantly extends the scope of polyhedral techniques
- …