C++ Design Patterns for Low-latency Applications Including High-frequency Trading

Authors: Bilokon Paul; Gunduz Burak
Publication date: 08/09/2023

This work aims to bridge the existing knowledge gap in the optimisation of latency-critical code, specifically focusing on high-frequency trading (HFT) systems. The research culminates in three main contributions: the creation of a Low-Latency Programming Repository, the optimisation of a market-neutral statistical arbitrage pairs trading strategy, and the implementation of the Disruptor pattern in C++. The repository serves as a practical guide and is enriched with rigorous statistical benchmarking, while the trading strategy optimisation led to substantial improvements in speed and profitability. The Disruptor pattern showed a significant performance improvement over traditional queuing methods. Evaluation metrics include speed, cache utilisation, and statistical significance, among others. Techniques like Cache Warming and Constexpr showed the most significant gains in latency reduction. Future directions involve expanding the repository, testing the optimised trading algorithm in a live trading environment, and integrating the Disruptor pattern with the trading algorithm for comprehensive system benchmarking. The work is oriented towards academics and industry practitioners seeking to improve performance in latency-sensitive applications.
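
The abstract above names constexpr among the techniques with the largest latency gains. As a rough illustration of the general idea (not code from the paper's repository), the sketch below moves a small computation to compile time so that the hot path only performs a table lookup and a multiply. The names `ipow10`, `kPow10` and `scale_price` are hypothetical, and C++17 is assumed.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Integer power of ten, usable in constant expressions (C++14 and later).
constexpr std::uint64_t ipow10(unsigned n) {
    std::uint64_t result = 1;
    for (unsigned i = 0; i < n; ++i) {
        result *= 10;
    }
    return result;
}

// Builds the table 10^0 .. 10^7 entirely at compile time.
constexpr std::array<std::uint64_t, 8> make_pow10_table() {
    std::array<std::uint64_t, 8> table{};
    for (std::size_t i = 0; i < table.size(); ++i) {
        table[i] = ipow10(static_cast<unsigned>(i));
    }
    return table;
}

// Hypothetical lookup table; the compiler embeds it as constant data.
inline constexpr auto kPow10 = make_pow10_table();
static_assert(kPow10[3] == 1000, "table is evaluated at compile time");

// Scales a whole-unit price to integer ticks.
// Assumption: decimals is in the range 0..7 (no bounds check in this sketch).
constexpr std::uint64_t scale_price(std::uint64_t units, unsigned decimals) {
    return units * kPow10[decimals];
}
```

Because `kPow10` is a constant expression, no initialisation work is left for program start-up or for the first call on the latency-critical path.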

arXiv.org e-Print Archive

Simple, safe, and efficient memory management using linear pointers

Author: Liu Likai
Publication date: 22/01/2016

Efficient and safe memory management is a hard problem. Garbage collection promises automatic memory management but comes with the cost of increased memory footprint, reduced parallelism in multi-threaded programs, unpredictable pause time, and intricate tuning parameters balancing the program's workload and designated memory usage in order for an application to perform reasonably well. Existing research mitigates the above problems to some extent, but programmer error could still cause memory leaks by erroneously keeping memory references when they are no longer needed. We need a methodology for programmers to become resource aware, so that efficient, scalable, predictable and high performance programs may be written without the fear of resource leaks. Linear logic has been recognized as the formalism of choice for resource tracking. It requires explicit introduction and elimination of resources and guarantees that a resource cannot be implicitly shared or abandoned, hence must be linear. Early languages based on linear logic focused on the Curry-Howard correspondence. They began by limiting the expressive powers of the language and then reintroduced them by allowing controlled sharing, which is necessary for recursive functions. However, only by deviating from the Curry-Howard correspondence could later development actually address programming errors in resource usage. The contribution of this dissertation is a simple, safe, and efficient approach introducing linear resource ownership semantics into C++ (which is still a widely used language 30 years after its inception) through the linear pointer, a smart pointer inspired by linear logic. By implementing various linear data structures and a parallel, multi-threaded memory allocator based on these data structures, this work shows that the linear pointer is practical and efficient in the real world, and that it is possible to build a memory management stack that is entirely leak free.
The dissertation offers some closing remarks on the difficulties a formal system would encounter when reasoning about a concurrent linear data algorithm, and what might be done to solve these problems.
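
To make the ownership idea in the abstract above concrete, here is a minimal sketch of a move-only owning pointer in C++. It is an assumption-laden illustration, not the dissertation's linear pointer: the class name `LinearPtr` and the helper `transfer_demo` are hypothetical, and because C++ cannot statically enforce "consumed exactly once", the sketch falls back to freeing the resource in the destructor, which only approximates linearity.

```cpp
#include <cassert>
#include <type_traits>
#include <utility>

// Hypothetical move-only owner: the resource can be transferred, never shared.
template <typename T>
class LinearPtr {
public:
    LinearPtr() noexcept = default;
    explicit LinearPtr(T* raw) noexcept : ptr_(raw) {}

    // Copying is forbidden, so two owners can never alias the same resource.
    LinearPtr(const LinearPtr&) = delete;
    LinearPtr& operator=(const LinearPtr&) = delete;

    // Moving transfers ownership and leaves the source empty.
    LinearPtr(LinearPtr&& other) noexcept
        : ptr_(std::exchange(other.ptr_, nullptr)) {}

    LinearPtr& operator=(LinearPtr&& other) noexcept {
        if (this != &other) {
            delete ptr_;  // release what we currently own
            ptr_ = std::exchange(other.ptr_, nullptr);
        }
        return *this;
    }

    // Assumption of this sketch: an unconsumed resource is freed here
    // rather than rejected at compile time.
    ~LinearPtr() { delete ptr_; }

    T& operator*() const { return *ptr_; }
    explicit operator bool() const noexcept { return ptr_ != nullptr; }

private:
    T* ptr_ = nullptr;
};

static_assert(!std::is_copy_constructible<LinearPtr<int>>::value,
              "ownership cannot be duplicated");

// Moves ownership from one handle to another and reads through the new owner.
inline int transfer_demo() {
    LinearPtr<int> first(new int(42));
    LinearPtr<int> second = std::move(first);
    assert(!first);  // the source handle no longer owns anything
    return *second;
}
```

The deleted copy operations are what rule out implicit sharing; the standard library's `std::unique_ptr` offers the same move-only behaviour.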

Boston University Institutional Repository (OpenBU)

Efficient Code Generation from SHIM Models

Authors: Kahn Gilles; Olivier Tardieu; Stephen A. Edwards; Zhu Xiaohan
Publication venue: Columbia University Libraries/Information Services
Publication date: 01/01/2006

Programming concurrent systems is substantially more difficult than programming sequential systems, yet most embedded systems need concurrency. We believe this should be addressed through higher-level models of concurrency that eliminate many of the usual challenges, such as nondeterminism arising from races. The SHIM model of computation provides deterministic concurrency, and there already exist ways of implementing it in hardware and software. In this work, we describe how to produce more efficient C code from SHIM systems. We propose two techniques: a largely mechanical one that produces tail-recursive code for simulating concurrency, and a more clever one that statically analyzes the communication pattern of multiple processes to produce code with far less overhead. Experimentally, we find our tail-recursive technique produces code that runs roughly twice as fast as a baseline; our statically-scheduled code can run up to twelve times faster.

Crossref

Columbia University Academic Commons

The construction of high-performance virtual machines for dynamic languages

Author: Shannon Mark
Publication date: 01/01/2011

Dynamic languages, such as Python and Ruby, have become more widely used over the past decade. Despite this, the standard virtual machines for these languages have disappointing performance. These virtual machines are slow, not because methods for achieving better performance are unknown, but because their implementation is hard. What makes the implementation of high-performance virtual machines difficult is not that they are large pieces of software, but that there are fundamental and complex interdependencies between their components. In order to work together correctly, the interpreter, just-in-time compiler, garbage collector and library must all conform to the same precise low-level protocols. In this dissertation I describe a method for constructing virtual machines for dynamic languages, and explain how to design a virtual machine toolkit by building it around an abstract machine. The design and implementation of such a toolkit, the Glasgow Virtual Machine Toolkit (GVMT), is described. The GVMT automatically generates a just-in-time compiler, integrates precise garbage collection into the virtual machine, and automatically manages the complex interdependencies between all the virtual machine components. Two different virtual machines have been constructed using the GVMT. One is a minimal implementation of Scheme, which was implemented in under three weeks to demonstrate that toolkits like the GVMT can enable the easy construction of virtual machines. The second, the HotPy VM for Python, is a high-performance virtual machine; it demonstrates that a virtual machine built with a toolkit can be fast and that the use of a toolkit does not overly constrain the high-level design. Evaluation shows that HotPy outperforms the standard Python interpreter, CPython, by a large margin, and has performance on a par with PyPy, the fastest Python VM currently available.

Glasgow Theses Service

OpenGrey Repository

Garbage collection and the case for high-level low-level programming

Author: Frampton Daniel
Publication date: 17/09/2018

The Australian National University

Program Synthesis for Software-Defined Networking

Author: McClurg Jedidiah
Publication venue: University of Colorado Boulder
Publication date: 01/01/2018

Software-defined networking (SDN) is revolutionizing the networking industry, but even the most advanced SDN programming platforms lack mechanisms for changing the global configuration (the set of all forwarding rules on the switches) correctly and automatically. This seemingly simple notion of global configuration change (known as a network update) can be quite challenging for SDN programmers to implement by hand, because networks are distributed systems with hundreds or thousands of interacting nodes: even if the initial and final configurations are correct, naïvely updating individual nodes can lead to bugs in the intermediate configurations. Additionally, SDN programs must simultaneously describe both static forwarding behavior and dynamic updates in response to events. These event-driven updates are critical to get right, but even more difficult to implement correctly due to interleavings of data packets and control messages. Existing SDN platforms offer only weak guarantees in this regard, also opening the door for incorrect behavior. As an added wrinkle, event-driven network programs are often physically distributed, running on several nodes of the network, and this distributed setting makes programming and debugging even more difficult. Bugs arising from any of these issues can cause serious incorrect transient behaviors, including loops, black holes, and access-control violations. This thesis presents a synthesis-based approach for solving these issues. First, I show how to automatically synthesize network updates that are guaranteed to preserve specified properties. I formalize the network updates problem and develop a synthesis algorithm based on counterexample-guided search and incremental model checking. Second, I add the ability to reason about transitions between configurations in response to events, by introducing event-driven consistent updates that are guaranteed to preserve well-defined behaviors in this context.
I propose network event structures (NESs) to model constraints on updates, such as which events can be enabled simultaneously and causal dependencies between events. I define an extension of the NetKAT language with mutable state, give semantics to stateful programs using NESs, and discuss provably correct strategies for implementing NESs in SDNs. Third, I propose a synchronization synthesis approach that allows correct "parallel composition" of several event-driven programs (processes): the programmer can specify each sequential process, and add a declarative specification of paths that packets are allowed to take. The synthesizer then inserts synchronization among the distributed controller processes such that the declarative specification will be satisfied by all packets traversing the network. The key technical contribution here is a counterexample-guided synthesis algorithm that furnishes network processes with the synchronization required to prevent any races causing specification violations. An important component of this is an extension of network event structures to a more general programming model called event nets, based on Petri nets. Finally, I describe an approach for implementing event nets in an efficient distributed way on modern SDN hardware. For each of the core components, I describe a prototype implementation, and present results from experiments on realistic topologies and properties, demonstrating that the tools handle real network programs, and scale to networks of 1000+ nodes.

CU Scholar Institutional Repository

Doctor of Philosophy

Author: Atzeni Simone
Publication venue: University of Utah
Publication date: 01/01/2017

High Performance Computing (HPC) on-node parallelism is of extreme importance to guarantee and maintain scalability across large clusters of hundreds of thousands of multicore nodes. HPC programming is dominated by the hybrid model "MPI + X", with MPI to exploit the parallelism across the nodes, and "X" as some shared memory parallel programming model to accomplish multicore parallelism across CPUs or GPUs. OpenMP has become the de facto standard "X" in HPC to exploit the multicore architectures of modern CPUs. Data races are one of the most common and insidious concurrency errors in shared memory programming models, and OpenMP programs are not immune to them. The ease with which OpenMP parallelizes programs can often make them prone to data races, which become hard to find in large applications with thousands of lines of code. Unfortunately, prior tools are unable to impact practice owing to their poor coverage or poor scalability. In this work, we develop several new approaches for low overhead data race detection. Our approaches aim to guarantee high precision and accuracy of race checking while maintaining a low runtime and memory overhead. We present two race checkers for C/C++ OpenMP programs that target two different classes of programs. The first, ARCHER, is fast but requires a large amount of memory, so it ideally targets applications that require only a small portion of the available on-node memory. On the other hand, SWORD strikes a balance between fast zero memory overhead data collection followed by offline analysis that can take a long time, but it often reports most races quickly. Given that race checking was impossible for large OpenMP applications, our contributions are the best available advances in what is known to be a difficult NP-complete problem. We performed an extensive evaluation of the tools on existing OpenMP programs and HPC benchmarks.
Results show that both tools guarantee identification of all the races of a program in a given run without reporting any false alarms. The tools are user-friendly, and hence serve as an important instrument for the daily work of programmers, helping them identify data races early during development and production testing. Furthermore, our demonstrated success on real-world applications places these tools among the top debugging tools for scientists at large.

The University of Utah: J. Willard Marriott Digital Library

Galois : a system for parallel execution of irregular algorithms

Author: Nguyen Donald Do
Publication date: 04/09/2015

A programming model which allows users to program with high productivity and which produces high performance executions has been a goal for decades. This dissertation makes progress towards this elusive goal by describing the design and implementation of the Galois system, a parallel programming model for shared-memory, multicore machines. Central to the design is the idea that scheduling of a program can be decoupled from the core computational operator and data structures. However, efficient programs often require application-specific scheduling to achieve best performance. To bridge this gap, an extensible and abstract scheduling policy language is proposed, which allows programmers to focus on selecting high-level scheduling policies while delegating the tedious task of implementing the policy to a scheduler synthesizer and runtime system. Implementations of deterministic and prioritized scheduling also are described. An evaluation of a well-studied benchmark suite reveals that factoring programs into operators, schedulers and data structures can produce significant performance improvements over unfactored approaches. Comparison of the Galois system with existing programming models for graph analytics shows significant performance improvements, often orders of magnitude more, due to (1) better support for the restrictive programming models of existing systems and (2) better support for more sophisticated algorithms and scheduling, which cannot be expressed in other systems.

Texas ScholarWorks

Benchmarking and Extending SYCL Hierarchical Parallelism

Authors: Alpay Aksel; Deakin Tom; Heuveline Vincent; McIntosh-Smith Simon N
Publication venue: Institute of Electrical and Electronics Engineers (IEEE)
Publication date: 24/12/2021

Explore Bristol Research

Weak persistency semantics from the ground up: formalising the persistency semantics of ARMv8 and transactional models

Authors: Raad A; Vafeiadis V; Wickerson J
Publication venue: Association for Computing Machinery (ACM)
Publication date: 01/01/2019

Emerging non-volatile memory (NVM) technologies promise the durability of disks with the performance of volatile memory (RAM). To describe the persistency guarantees of NVM, several memory persistency models have been proposed in the literature. However, the formal persistency semantics of mainstream hardware is unexplored to date. To close this gap, we present a formal declarative framework for describing concurrency models in the NVM context, and then develop the PARMv8 persistency model as an instance of our framework, formalising the persistency semantics of the ARMv8 architecture for the first time. To facilitate correct persistent programming, we study transactions as a simple abstraction for concurrency and persistency control. We thus develop the PSER (persistent serialisability) persistency model, formalising transactional semantics in the NVM context for the first time, and demonstrate that PSER correctly compiles to PARMv8. This then enables programmers to write correct, concurrent and persistent programs, without having to understand the low-level architecture-specific persistency semantics of the underlying hardware.

Spiral - Imperial College Digital Repository

MPG.PuRe