
    ROLLED: Racetrack Memory Optimized Linear Layout and Efficient Decomposition of Decision Trees

    Modern low-power distributed systems increasingly integrate machine learning algorithms. In resource-constrained setups, model execution has to be optimized for performance and energy consumption. Racetrack memory (RTM) promises to achieve these goals by offering unprecedented integration density, small access latency, and reduced energy consumption. However, to access data in RTM, it first needs to be shifted to an access port. We investigate decision trees and develop placement strategies that reduce the total number of shifts in RTM. Decision trees can be profiled during training, yielding access probabilities for the tree paths. We map tree nodes to RTM such that the total number of shifts is minimal. Concretely, we present two placement approaches: 1) a unified organization, where tree nodes are closely packed and placed in a single RTM location, and 2) a decomposed organization, where decision tree nodes are distributed across separate RTM blocks. We discuss theoretical cost models for both approaches and formally prove upper bounds of 4× for the unified and 12× for the decomposed organization relative to the optimal placement. A thorough experimental evaluation compares our algorithms to state-of-the-art placement strategies: the unified and decomposed solutions reduce the number of shifts by 58.1% and 80.1%, respectively, leading to 53.8% and 46.3% reductions in overall runtime and 52.6% and 61.7% reductions in energy consumption, compared to a naive baseline.
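
    To make the shift cost model concrete, here is a minimal sketch in Python. It assumes a simplified setting, a single tape with one access port, where accessing an offset costs as many shifts as its distance from the previously accessed offset, and it uses an illustrative greedy rule (hot nodes near the port) rather than the paper's exact ROLLED placement; all names and probabilities are made up for illustration.

```python
# Minimal sketch: shift-aware node placement in racetrack memory (RTM).
# Simplified model: one tape, one access port, shift cost = distance
# between consecutively accessed offsets. Illustrative only.

def greedy_placement(access_prob):
    """Place nodes on the tape in descending order of access
    probability, so frequently visited nodes sit near offset 0
    (the initial port position)."""
    order = sorted(access_prob, key=access_prob.get, reverse=True)
    return {node: offset for offset, node in enumerate(order)}

def count_shifts(placement, traversal):
    """Total shifts for one inference: the port moves along the tape
    between the offsets of consecutively accessed nodes."""
    pos, shifts = 0, 0
    for node in traversal:
        shifts += abs(placement[node] - pos)
        pos = placement[node]
    return shifts

# Toy profile: the root is on every path, the left subtree is hotter.
prob = {"root": 1.0, "L": 0.7, "R": 0.3, "LL": 0.5, "LR": 0.2}
layout = greedy_placement(prob)
print(layout)
print(count_shifts(layout, ["root", "L", "LL"]))  # the hot path stays cheap
```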

    Memory Carousel: LLVM-Based Bitwise Wear-Leveling for Non-Volatile Main Memory

    Emerging non-volatile memories offer many advantages but also technical shortcomings, such as reduced cell lifetime. Although many wear-leveling approaches exist to extend the lifetime of such memories, a trade-off usually has to be made for the wear-leveling granularity. Due to iterative write schemes (repeatedly sense and write), memory wear-out in certain systems depends directly on the written bit value and thus can be highly imbalanced, requiring dedicated bit-wise wear-leveling. Such bit-wise wear-leveling has so far only been proposed together with special hardware support. However, when no dedicated hardware solution is available, especially on commercial off-the-shelf systems with non-volatile memories, a software solution can be crucial for the system lifetime. In this work, we propose entirely software-based bit-wise wear-leveling, where the position of bits within CPU words in main memory is rotated on a regular basis. We leverage the LLVM intermediate representation to adjust the load and store operations of the application with a custom compiler pass. Experimental evaluation shows that local rotation within the CPU word can extend the lifetime by a factor of up to 21×. We also show that our method can be combined with coarser-grained wear-leveling, e.g., at block granularity, to achieve even higher lifetime improvements.
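
    The core rotation idea can be sketched in a few lines of Python. This is a simplified model, assuming 64-bit words and a per-epoch rotation offset; in the paper, the equivalent rotations are inserted around load and store instructions by an LLVM pass, and advancing the offset in practice requires re-rotating the affected memory.

```python
# Minimal sketch of bitwise rotation for wear-leveling. A bit that is
# logically always 0 (or 1) lands on a different physical cell in each
# epoch, spreading the per-cell write wear.

WORD_BITS = 64
MASK = (1 << WORD_BITS) - 1

def rotl(value, k):
    k %= WORD_BITS
    return ((value << k) | (value >> (WORD_BITS - k))) & MASK

def rotr(value, k):
    return rotl(value, WORD_BITS - (k % WORD_BITS))

def store(memory, addr, value, epoch):
    # Rotate before writing; the rotation offset advances with the epoch.
    memory[addr] = rotl(value, epoch)

def load(memory, addr, epoch):
    # Rotate back so the program observes the original value.
    return rotr(memory[addr], epoch)

mem = {}
store(mem, 0x1000, 0x00000000DEADBEEF, epoch=3)
assert load(mem, 0x1000, epoch=3) == 0x00000000DEADBEEF
```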

    Efficient Realization of Decision Trees for Real-Time Inference

    For timing-sensitive edge applications, the demand for efficient, lightweight machine learning solutions has increased recently. Tree ensembles are among the state of the art in many machine learning applications. While a single decision tree is comparably small, an ensemble of trees can have a significant memory footprint, leading to cache locality issues, which are crucial to execution-time performance. In this work, we analyze the memory-locality issues of the two most common realizations of decision trees, i.e., native trees and if-else trees. We highlight that both realizations demand a more careful memory layout to improve caching behavior and maximize performance. We adopt a probabilistic model of decision tree inference to find the best memory layout for each tree at the application layer. Further, we present an efficient heuristic that takes architecture-dependent information into account, thereby optimizing the given ensemble for a target computer architecture. Our code-generation framework, which is freely available in an open-source repository, produces optimized code while preserving the structure and accuracy of the trees. Using several real-world data sets, we evaluate the elapsed time of various tree realizations on server hardware as well as embedded systems, for Intel and ARM processors. Our optimized memory layout reduces execution time by up to 75% on server-class systems and up to 70% on embedded systems.
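
    As an illustration of a probability-guided layout, the following Python sketch lays out a native (array-based) tree depth-first, always following the more probable child first, so that the likeliest root-to-leaf path occupies consecutive memory. The node structure and the greedy rule are illustrative assumptions, not the paper's exact cost model or heuristic.

```python
# Minimal sketch: hot-path-first layout for a native decision tree,
# assuming per-node branch probabilities obtained from profiling.

class Node:
    def __init__(self, left=None, right=None, p_left=0.5):
        self.left, self.right = left, right
        self.p_left = p_left  # profiled probability of taking the left child

def hot_path_layout(root):
    """Emit nodes depth-first, visiting the more probable child first,
    so the likeliest root-to-leaf path lands in consecutive memory
    and shares cache lines."""
    layout, stack = [], [root]
    while stack:
        node = stack.pop()
        if node is None:
            continue
        layout.append(node)
        hot, cold = ((node.left, node.right) if node.p_left >= 0.5
                     else (node.right, node.left))
        stack.append(cold)  # the cold child is laid out later
        stack.append(hot)   # the hot child comes right after this node
    return layout

tree = Node(Node(Node(), Node(), 0.9), Node(), p_left=0.8)
print(len(hot_path_layout(tree)))  # 5 nodes, hot path emitted first
```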

    Immediate Split Trees: Immediate Encoding of Floating Point Split Values in Random Forests

    Random forests and decision trees are increasingly interesting candidates for resource-constrained machine learning models. To make the execution of these models efficient under resource limitations, various optimized implementations have been proposed in the literature, usually implementing either native trees or if-else trees. While a key motivation for optimizing if-else trees is to benefit from dedicated instruction caches, in this work we highlight that if-else trees may also depend strongly on data caches. We identify one crucial issue of if-else tree implementations and propose an optimized implementation that keeps the logical tree structure untouched, and thus does not influence accuracy, but eliminates the need to load comparison values from the data caches. Experimental evaluation of this implementation shows that we can reduce the number of data cache misses by up to 99%, while not increasing the number of instruction cache misses compared to the state of the art. We additionally highlight various scenarios where the reduction of data cache misses yields an important benefit for the overall execution time.
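
    The following Python sketch illustrates the underlying contrast: generated if-else code in which each split value appears as a literal constant is materialized from the instruction stream (instruction cache), whereas a conventional native tree loads its thresholds from a data array (data cache). The generator and node format are hypothetical, simplified for illustration.

```python
# Minimal sketch: emit C if-else code with split values inlined as
# literals, so no thresholds array has to be loaded from data memory.

def gen_if_else(node, indent="    "):
    """Recursively emit C code where each split value is a literal
    constant rather than an element of a thresholds array."""
    if node["leaf"] is not None:
        return f"{indent}return {node['leaf']};\n"
    code = f"{indent}if (x[{node['feature']}] <= {node['split']}f) {{\n"
    code += gen_if_else(node["left"], indent + "    ")
    code += f"{indent}}} else {{\n"
    code += gen_if_else(node["right"], indent + "    ")
    code += f"{indent}}}\n"
    return code

tree = {"leaf": None, "feature": 2, "split": 0.75,
        "left": {"leaf": 0}, "right": {"leaf": 1}}
print("int predict(const float *x) {\n" + gen_if_else(tree) + "}")
```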