1,895 research outputs found

    Learning programs by learning from failures

    Full text link
    We describe an inductive logic programming (ILP) approach called learning from failures. In this approach, an ILP system (the learner) decomposes the learning problem into three separate stages: generate, test, and constrain. In the generate stage, the learner generates a hypothesis (a logic program) that satisfies a set of hypothesis constraints (constraints on the syntactic form of hypotheses). In the test stage, the learner tests the hypothesis against training examples. A hypothesis fails when it does not entail all the positive examples or entails a negative example. If a hypothesis fails, then, in the constrain stage, the learner learns constraints from the failed hypothesis to prune the hypothesis space, i.e. to constrain subsequent hypothesis generation. For instance, if a hypothesis is too general (entails a negative example), the constraints prune generalisations of the hypothesis. If a hypothesis is too specific (does not entail all the positive examples), the constraints prune specialisations of the hypothesis. This loop repeats until either (i) the learner finds a hypothesis that entails all the positive and none of the negative examples, or (ii) there are no more hypotheses to test. We introduce Popper, an ILP system that implements this approach by combining answer set programming and Prolog. Popper supports infinite problem domains, reasoning about lists and numbers, learning textually minimal programs, and learning recursive programs. Our experimental results on three domains (toy game problems, robot strategies, and list transformations) show that (i) constraints drastically improve learning performance, and (ii) Popper can outperform existing ILP systems, both in terms of predictive accuracies and learning times.Comment: Accepted for the machine learning journa

    Loop transformations for clustered VLIW architectures

    Get PDF
    With increasing demands for performance by embedded systems, especially by digital signal processing (DSP) applications, embedded processors must increase available instructionlevel parallelism (ILP) within significant constraints on power consumption and chip cost. Unfortunately, supporting a large amount of ILP on a processor while maintaining a single register file increases chip cost and potentially decreases overall performance due to increased cycle time. To address this problem, some modern embedded processors partition the register file into multiple low-ported register files, each directly connected with one or more functional units. These functional unit/register file groups are called clusters. Clustered VLIW (very long instruction word) architectures need extra copy operations or delays to transfer values among clusters. To take advantage of clustered architectures, the compiler must expose parallelism for maximal functional-unit utilization, and schedule instructions to reduce intercluster communication overhead. High-level loop transformations offer an excellent opportunity to enhance the abilities of low-level optimizers to generate code for clustered architectures. This dissertation investigates the effects of three loop transformations, i.e., loop fusion, loop unrolling, and unroll-and-jam, on clustered VLIW architectures. The objective is to achieve high performance with low communication overhead. This dissertation discusses the following techniques: Loop Fusion This research examines the impact of loop fusion on clustered architectures. A metric based upon communication costs for guiding loop fusion is developed and tested on DSP benchmarks. Unroll-and-jam and Loop Unrolling A new method that integrates a communication cost model with an integer-optimization problem is developed to determine unroll amounts for loop unrolling and unroll-and-jam automatically for a specific loop on a specific architecture. These techniques have been implemented and tested using DSP benchmarks on simulated, clustered VLIW architectures and a real clustered, embedded processor, the TI TMS320C64X. The results show that the new techniques achieve an average speedup of 1.72-1.89 on five different clustered architectures. These techniques have been implemented and tested using DSP benchmarks on simulated, clustered VLIW architectures and a real clustered, embedded processor, the TI TMS320C64X. The results show that the new techniques achieve an average speedup of 1.72-1.89 on five different clustered architectures

    Enhanced sharing analysis techniques: a comprehensive evaluation

    Get PDF
    Sharing, an abstract domain developed by D. Jacobs and A. Langen for the analysis of logic programs, derives useful aliasing information. It is well-known that a commonly used core of techniques, such as the integration of Sharing with freeness and linearity information, can significantly improve the precision of the analysis. However, a number of other proposals for refined domain combinations have been circulating for years. One feature that is common to these proposals is that they do not seem to have undergone a thorough experimental evaluation even with respect to the expected precision gains. In this paper we experimentally evaluate: helping Sharing with the definitely ground variables found using Pos, the domain of positive Boolean formulas; the incorporation of explicit structural information; a full implementation of the reduced product of Sharing and Pos; the issue of reordering the bindings in the computation of the abstract mgu; an original proposal for the addition of a new mode recording the set of variables that are deemed to be ground or free; a refined way of using linearity to improve the analysis; the recovery of hidden information in the combination of Sharing with freeness information. Finally, we discuss the issue of whether tracking compoundness allows the computation of more sharing information

    An OpenCL software compilation framework targeting an SoC-FPGA VLIW chip multiprocessor

    Get PDF
    Modern systems-on-chip augment their baseline CPU with coprocessors and accelerators to increase overall computational capability and power efficiency, and thus have evolved into heterogeneous multi-core systems. Several languages have been developed to enable this paradigm shift, including CUDA and OpenCL. This paper discusses a unified compilation environment to enable heterogeneous system design through the use of OpenCL and a highly configurable VLIW Chip Multiprocessor architecture known as the LE1. An LLVM compilation framework was researched and a prototype developed to enable the execution of OpenCL applications on a number of hardware configurations of the LE1 CMP. The presented OpenCL framework fully automates the compilation flow and supports work-item coalescing which better maps onto the ILP processor cores of the LE1 architecture. This paper discusses in detail both the software stack and target hardware architecture and evaluates the scalability of the proposed framework by running 12 industry-standard OpenCL benchmarks drawn from the AMD SDK and the Rodinia suites. The benchmarks are executed on 40 LE1 configurations with 10 implemented on an SoC-FPGA and the remaining on a cycle-accurate simulator. Across 12 OpenCL benchmarks results demonstrate near-linear wall-clock performance improvement of 1.8x (using 2 dual-issue cores), up to 5.2x (using 8 dual-issue cores) and on one case, super-linear improvement of 8.4x (FixOffset kernel, 8 dual-issue cores). The number of OpenCL benchmarks evaluated makes this study one of the most complete in the literature

    Reducing the WCET and analysis time of systems with simple lockable instruction caches

    Get PDF
    One of the key challenges in real-time systems is the analysis of the memory hierarchy. Many Worst-Case Execution Time (WCET) analysis methods supporting an instruction cache are based on iterative or convergence algorithms, which are rather slow. Our goal in this paper is to reduce the WCET analysis time on systems with a simple lockable instruction cache, focusing on the Lock-MS method. First, we propose an algorithm to obtain a structure-based representation of the Control Flow Graph (CFG). It organizes the whole WCET problem as nested subproblems, which takes advantage of common branch-and-bound algorithms of Integer Linear Programming (ILP) solvers. Second, we add support for multiple locking points per task, each one with specific cache contents, instead of a given locked content for the whole task execution. Locking points are set heuristically before outer loops. Such simple heuristics adds no complexity, and reduces the WCET by taking profit of the temporal reuse found in loops. Since loops can be processed as isolated regions, the optimal contents to lock into cache for each region can be obtained, and the WCET analysis time is further reduced. With these two improvements, our WCET analysis is around 10 times faster than other approaches. Also, our results show that the WCET is reduced, and the hit ratio achieved for the lockable instruction cache is similar to that of a real execution with an LRU instruction cache. Finally, we analyze the WCET sensitivity to compiler optimization, showing for each benchmark the right choices and pointing out that O0 is always the worst option

    Avoiding Unnecessary Information Loss: Correct and Efficient Model Synchronization Based on Triple Graph Grammars

    Full text link
    Model synchronization, i.e., the task of restoring consistency between two interrelated models after a model change, is a challenging task. Triple Graph Grammars (TGGs) specify model consistency by means of rules that describe how to create consistent pairs of models. These rules can be used to automatically derive further rules, which describe how to propagate changes from one model to the other or how to change one model in such a way that propagation is guaranteed to be possible. Restricting model synchronization to these derived rules, however, may lead to unnecessary deletion and recreation of model elements during change propagation. This is inefficient and may cause unnecessary information loss, i.e., when deleted elements contain information that is not represented in the second model, this information cannot be recovered easily. Short-cut rules have recently been developed to avoid unnecessary information loss by reusing existing model elements. In this paper, we show how to automatically derive (short-cut) repair rules from short-cut rules to propagate changes such that information loss is avoided and model synchronization is accelerated. The key ingredients of our rule-based model synchronization process are these repair rules and an incremental pattern matcher informing about suitable applications of them. We prove the termination and the correctness of this synchronization process and discuss its completeness. As a proof of concept, we have implemented this synchronization process in eMoflon, a state-of-the-art model transformation tool with inherent support of bidirectionality. Our evaluation shows that repair processes based on (short-cut) repair rules have considerably decreased information loss and improved performance compared to former model synchronization processes based on TGGs.Comment: 33 pages, 20 figures, 3 table
    corecore