C++ Design Patterns for Low-latency Applications Including High-frequency Trading
This work aims to bridge the existing knowledge gap in the optimisation of
latency-critical code, specifically focusing on high-frequency trading (HFT)
systems. The research culminates in three main contributions: the creation of a
Low-Latency Programming Repository, the optimisation of a market-neutral
statistical arbitrage pairs trading strategy, and the implementation of the
Disruptor pattern in C++. The repository serves as a practical guide and is
enriched with rigorous statistical benchmarking, while the trading strategy
optimisation led to substantial improvements in speed and profitability. The
Disruptor pattern showcased significant performance enhancement over
traditional queuing methods. Evaluation metrics include speed, cache
utilisation, and statistical significance, among others. Techniques like Cache
Warming and Constexpr showed the most significant gains in latency reduction.
Future directions involve expanding the repository, testing the optimised
trading algorithm in a live trading environment, and integrating the Disruptor
pattern with the trading algorithm for comprehensive system benchmarking. The
work is oriented towards academics and industry practitioners seeking to
improve performance in latency-sensitive applications.
Simple, safe, and efficient memory management using linear pointers
Efficient and safe memory management is a hard problem. Garbage collection promises automatic memory management, but comes at the cost of increased memory footprint, reduced parallelism in multi-threaded programs, unpredictable pause times, and intricate tuning parameters that balance the program's workload against its designated memory usage in order for an application to perform reasonably well. Existing research mitigates these problems to some extent, but programmer error can still cause memory leaks by erroneously keeping references to memory that is no longer needed. We need a methodology that makes programmers resource aware, so that efficient, scalable, predictable and high-performance programs may be written without the fear of resource leaks.
Linear logic has been recognized as the formalism of choice for resource tracking. It requires explicit introduction and elimination of resources and guarantees that a resource cannot be implicitly shared or abandoned, hence must be linear. Early languages based on linear logic focused on the Curry-Howard correspondence. They began by limiting the expressive power of the language and then reintroduced it by allowing the controlled sharing that is necessary for recursive functions. However, only by deviating from the Curry-Howard correspondence could later developments actually address programming errors in resource usage.
The contribution of this dissertation is a simple, safe, and efficient approach that introduces linear resource ownership semantics into C++ (still a widely used language 30 years after its inception) through the linear pointer, a smart pointer inspired by linear logic. By implementing various linear data structures and a parallel, multi-threaded memory allocator based on them, this work shows that the linear pointer is practical and efficient in the real world, and that it is possible to build a memory management stack that is entirely leak-free. The dissertation offers some closing remarks on the difficulties a formal system would encounter when reasoning about a concurrent linear data algorithm, and what might be done to solve these problems.
Efficient Code Generation from SHIM Models
Programming concurrent systems is substantially more difficult than programming sequential systems, yet most embedded systems need concurrency. We believe this should be addressed through higher-level models of concurrency that eliminate many of the usual challenges, such as nondeterminism arising from races. The SHIM model of computation provides deterministic concurrency, and there already exist ways of implementing it in hardware and software. In this work, we describe how to produce more efficient C code from SHIM systems. We propose two techniques: a largely mechanical one that produces tail-recursive code for simulating concurrency, and a more clever one that statically analyzes the communication pattern of multiple processes to produce code with far less overhead. Experimentally, we find our tail-recursive technique produces code that runs roughly twice as fast as a baseline; our statically-scheduled code can run up to twelve times faster.
The construction of high-performance virtual machines for dynamic languages
Dynamic languages, such as Python and Ruby, have become more widely used over the past decade. Despite this, the standard virtual machines for these languages have disappointing performance. These virtual machines are slow, not because methods for achieving better performance are unknown, but because their implementation is hard. What makes the implementation of high-performance virtual machines difficult is not that they are large pieces of software, but that there are fundamental and complex interdependencies between their components. In order to work together correctly, the interpreter, just-in-time compiler, garbage collector and library must all conform to the same precise low-level protocols.
In this dissertation I describe a method for constructing virtual machines for dynamic languages, and explain how to design a virtual machine toolkit by building it around an abstract machine. The design and implementation of such a toolkit, the Glasgow Virtual Machine Toolkit, is described. The Glasgow Virtual Machine Toolkit automatically generates a just-in-time compiler, integrates precise garbage collection into the virtual machine, and automatically manages the complex inter-dependencies between all the virtual machine components.
Two different virtual machines have been constructed using the GVMT. One is a minimal implementation of Scheme, implemented in under three weeks to demonstrate that toolkits like the GVMT can enable the easy construction of virtual machines. The second, the HotPy VM for Python, is a high-performance virtual machine; it demonstrates that a virtual machine built with a toolkit can be fast and that the use of a toolkit does not overly constrain the high-level design. Evaluation shows that HotPy outperforms the standard Python interpreter, CPython, by a large margin, and has performance on a par with PyPy, the fastest Python VM currently available.
Program Synthesis for Software-Defined Networking
Software-defined networking (SDN) is revolutionizing the networking industry, but even the most advanced SDN programming platforms lack mechanisms for changing the global configuration (the set of all forwarding rules on the switches) correctly and automatically. This seemingly simple notion of global configuration change (known as a network update) can be quite challenging for SDN programmers to implement by hand, because networks are distributed systems with hundreds or thousands of interacting nodes: even if the initial and final configurations are correct, naïvely updating individual nodes can lead to bugs in the intermediate configurations. Additionally, SDN programs must simultaneously describe both static forwarding behavior and dynamic updates in response to events. These event-driven updates are critical to get right, but even more difficult to implement correctly due to interleavings of data packets and control messages. Existing SDN platforms offer only weak guarantees in this regard, also opening the door for incorrect behavior. As an added wrinkle, event-driven network programs are often physically distributed, running on several nodes of the network, and this distributed setting makes programming and debugging even more difficult. Bugs arising from any of these issues can cause serious incorrect transient behaviors, including loops, black holes, and access-control violations.

This thesis presents a synthesis-based approach for solving these issues. First, I show how to automatically synthesize network updates that are guaranteed to preserve specified properties. I formalize the network update problem and develop a synthesis algorithm based on counterexample-guided search and incremental model checking. Second, I add the ability to reason about transitions between configurations in response to events, by introducing event-driven consistent updates that are guaranteed to preserve well-defined behaviors in this context.
I propose network event structures (NESs) to model constraints on updates, such as which events can be enabled simultaneously and causal dependencies between events. I define an extension of the NetKAT language with mutable state, give semantics to stateful programs using NESs, and discuss provably-correct strategies for implementing NESs in SDNs. Third, I propose a synchronization synthesis approach that allows correct "parallel composition" of several event-driven programs (processes): the programmer can specify each sequential process and add a declarative specification of the paths that packets are allowed to take. The synthesizer then inserts synchronization among the distributed controller processes such that the declarative specification is satisfied by all packets traversing the network. The key technical contribution here is a counterexample-guided synthesis algorithm that furnishes network processes with the synchronization required to prevent any races causing specification violations. An important component of this is an extension of network event structures to a more general programming model called event nets, based on Petri nets. Finally, I describe an approach for implementing event nets in an efficient, distributed way on modern SDN hardware. For each of the core components, I describe a prototype implementation and present results from experiments on realistic topologies and properties, demonstrating that the tools handle real network programs and scale to networks of 1000+ nodes.
Doctor of Philosophy
High Performance Computing (HPC) on-node parallelism is of extreme importance to guarantee and maintain scalability across large clusters of hundreds of thousands of multicore nodes. HPC programming is dominated by the hybrid model "MPI + X", with MPI exploiting parallelism across nodes and "X" being some shared-memory parallel programming model for multicore parallelism across CPUs or GPUs. OpenMP has become the de facto "X" standard in HPC for exploiting the multicore architectures of modern CPUs. Data races are among the most common and insidious concurrency errors in shared-memory programming models, and OpenMP programs are not immune to them. The ease with which OpenMP lets programmers parallelise code often makes programs prone to data races, which become hard to find in large applications with thousands of lines of code. Unfortunately, prior tools have been unable to impact practice owing to their poor coverage or poor scalability. In this work, we develop several new approaches for low-overhead data race detection. Our approaches aim to guarantee high precision and accuracy of race checking while maintaining low runtime and memory overhead. We present two race checkers for C/C++ OpenMP programs that target two different classes of programs. The first, ARCHER, is fast but requires a large amount of memory, so it ideally targets applications that require only a small portion of the available on-node memory. SWORD, on the other hand, combines fast, zero-memory-overhead data collection with an offline analysis that can take a long time, though it often reports most races quickly. Given that race checking was previously impractical for large OpenMP applications, our contributions are the best available advances in what is known to be a difficult NP-complete problem. We performed an extensive evaluation of the tools on existing OpenMP programs and HPC benchmarks.
Results show that both tools guarantee to identify all the races of a program in a given run without reporting any false alarms. The tools are user-friendly and hence serve as an important instrument in the daily work of programmers, helping them identify data races early during development and production testing. Furthermore, our demonstrated success on real-world applications places these tools among the top debugging tools for scientists at large.
Galois: a system for parallel execution of irregular algorithms
A programming model that allows users to program with high productivity while producing high-performance executions has been a goal for decades. This dissertation makes progress towards this elusive goal by describing the design and implementation of the Galois system, a parallel programming model for shared-memory, multicore machines. Central to the design is the idea that the scheduling of a program can be decoupled from its core computational operator and data structures. However, efficient programs often require application-specific scheduling to achieve the best performance. To bridge this gap, an extensible and abstract scheduling policy language is proposed, which allows programmers to focus on selecting high-level scheduling policies while delegating the tedious task of implementing the policy to a scheduler synthesizer and runtime system. Implementations of deterministic and prioritized scheduling are also described. An evaluation on a well-studied benchmark suite reveals that factoring programs into operators, schedulers and data structures can produce significant performance improvements over unfactored approaches. Comparison of the Galois system with existing programming models for graph analytics shows significant performance improvements, often by orders of magnitude, due to (1) better support for the restrictive programming models of existing systems and (2) better support for more sophisticated algorithms and scheduling, which cannot be expressed in other systems.
Weak persistency semantics from the ground up: formalising the persistency semantics of ARMv8 and transactional models
Emerging non-volatile memory (NVM) technologies promise the durability of disks with the performance of volatile memory (RAM). To describe the persistency guarantees of NVM, several memory persistency models have been proposed in the literature. However, the formal persistency semantics of mainstream hardware has remained unexplored to date. To close this gap, we present a formal declarative framework for describing concurrency models in the NVM context, and then develop the PARMv8 persistency model as an instance of our framework, formalising the persistency semantics of the ARMv8 architecture for the first time. To facilitate correct persistent programming, we study transactions as a simple abstraction for concurrency and persistency control. We thus develop the PSER (persistent serialisability) persistency model, formalising transactional semantics in the NVM context for the first time, and demonstrate that PSER correctly compiles to PARMv8. This enables programmers to write correct, concurrent and persistent programs without having to understand the low-level, architecture-specific persistency semantics of the underlying hardware.