1,895 research outputs found
Learning programs by learning from failures
We describe an inductive logic programming (ILP) approach called learning
from failures. In this approach, an ILP system (the learner) decomposes the
learning problem into three separate stages: generate, test, and constrain. In
the generate stage, the learner generates a hypothesis (a logic program) that
satisfies a set of hypothesis constraints (constraints on the syntactic form of
hypotheses). In the test stage, the learner tests the hypothesis against
training examples. A hypothesis fails when it does not entail all the positive
examples or entails a negative example. If a hypothesis fails, then, in the
constrain stage, the learner learns constraints from the failed hypothesis to
prune the hypothesis space, i.e. to constrain subsequent hypothesis generation.
For instance, if a hypothesis is too general (entails a negative example), the
constraints prune generalisations of the hypothesis. If a hypothesis is too
specific (does not entail all the positive examples), the constraints prune
specialisations of the hypothesis. This loop repeats until either (i) the
learner finds a hypothesis that entails all the positive and none of the
negative examples, or (ii) there are no more hypotheses to test. We introduce
Popper, an ILP system that implements this approach by combining answer set
programming and Prolog. Popper supports infinite problem domains, reasoning
about lists and numbers, learning textually minimal programs, and learning
recursive programs. Our experimental results on three domains (toy game
problems, robot strategies, and list transformations) show that (i) constraints
drastically improve learning performance, and (ii) Popper can outperform
existing ILP systems, both in terms of predictive accuracies and learning
times.Comment: Accepted for the machine learning journa
Loop transformations for clustered VLIW architectures
With increasing demands for performance by embedded systems, especially by digital signal processing (DSP) applications, embedded processors must increase available instructionlevel parallelism (ILP) within significant constraints on power consumption and chip cost. Unfortunately, supporting a large amount of ILP on a processor while maintaining a single register file increases chip cost and potentially decreases overall performance due to increased cycle time. To address this problem, some modern embedded processors partition the register file into multiple low-ported register files, each directly connected with one or more functional units. These functional unit/register file groups are called clusters.
Clustered VLIW (very long instruction word) architectures need extra copy operations or delays to transfer values among clusters. To take advantage of clustered architectures, the compiler must expose parallelism for maximal functional-unit utilization, and schedule instructions to reduce intercluster communication overhead.
High-level loop transformations offer an excellent opportunity to enhance the abilities of low-level optimizers to generate code for clustered architectures. This dissertation investigates the effects of three loop transformations, i.e., loop fusion, loop unrolling, and unroll-and-jam, on clustered VLIW architectures. The objective is to achieve high performance with low communication overhead. This dissertation discusses the following techniques:
Loop Fusion This research examines the impact of loop fusion on clustered architectures. A metric based upon communication costs for guiding loop fusion is developed and tested on DSP benchmarks.
Unroll-and-jam and Loop Unrolling A new method that integrates a communication cost model with an integer-optimization problem is developed to determine unroll amounts for loop unrolling and unroll-and-jam automatically for a specific loop on a specific architecture. These techniques have been implemented and tested using DSP benchmarks on simulated, clustered VLIW architectures and a real clustered, embedded processor, the TI TMS320C64X. The results show that the new techniques achieve an average speedup of 1.72-1.89 on five different clustered architectures.
These techniques have been implemented and tested using DSP benchmarks on simulated, clustered VLIW architectures and a real clustered, embedded processor, the TI TMS320C64X. The results show that the new techniques achieve an average speedup of 1.72-1.89 on five different clustered architectures
Enhanced sharing analysis techniques: a comprehensive evaluation
Sharing, an abstract domain developed by D. Jacobs and A. Langen for the analysis of logic
programs, derives useful aliasing information. It is well-known that a commonly used core
of techniques, such as the integration of Sharing with freeness and linearity information, can
significantly improve the precision of the analysis. However, a number of other proposals for
refined domain combinations have been circulating for years. One feature that is common
to these proposals is that they do not seem to have undergone a thorough experimental
evaluation even with respect to the expected precision gains.
In this paper we experimentally
evaluate: helping Sharing with the definitely ground variables found using Pos, the domain
of positive Boolean formulas; the incorporation of explicit structural information; a full
implementation of the reduced product of Sharing and Pos; the issue of reordering the
bindings in the computation of the abstract mgu; an original proposal for the addition of
a new mode recording the set of variables that are deemed to be ground or free; a refined
way of using linearity to improve the analysis; the recovery of hidden information in the
combination of Sharing with freeness information. Finally, we discuss the issue of whether
tracking compoundness allows the computation of more sharing information
An OpenCL software compilation framework targeting an SoC-FPGA VLIW chip multiprocessor
Modern systems-on-chip augment their baseline CPU with coprocessors and accelerators to increase overall computational capability and power efficiency, and thus have evolved into heterogeneous multi-core systems. Several languages have been developed to enable this paradigm shift, including CUDA and OpenCL. This paper discusses a unified compilation environment to enable heterogeneous system design through the use of OpenCL and a highly configurable VLIW Chip Multiprocessor architecture known as the LE1. An LLVM compilation framework was researched and a prototype developed to enable the execution of OpenCL applications on a number of hardware configurations of the LE1 CMP. The presented OpenCL framework fully automates the compilation flow and supports work-item coalescing which better maps onto the ILP processor cores of the LE1 architecture. This paper discusses in detail both the software stack and target hardware architecture and evaluates the scalability of the proposed framework by running 12 industry-standard OpenCL benchmarks drawn from the AMD SDK and the Rodinia suites. The benchmarks are executed on 40 LE1 configurations with 10 implemented on an SoC-FPGA and the remaining on a cycle-accurate simulator. Across 12 OpenCL benchmarks results demonstrate near-linear wall-clock performance improvement of 1.8x (using 2 dual-issue cores), up to 5.2x (using 8 dual-issue cores) and on one case, super-linear improvement of 8.4x (FixOffset kernel, 8 dual-issue cores). The number of OpenCL benchmarks evaluated makes this study one of the most complete in the literature
Reducing the WCET and analysis time of systems with simple lockable instruction caches
One of the key challenges in real-time systems is the analysis of the memory hierarchy. Many Worst-Case Execution Time (WCET) analysis methods supporting an instruction cache are based on iterative or convergence algorithms, which are rather slow. Our goal in this paper is to reduce the WCET analysis time on systems with a simple lockable instruction cache, focusing on the Lock-MS method. First, we propose an algorithm to obtain a structure-based representation of the Control Flow Graph (CFG). It organizes the whole WCET problem as nested subproblems, which takes advantage of common branch-and-bound algorithms of Integer Linear Programming (ILP) solvers. Second, we add support for multiple locking points per task, each one with specific cache contents, instead of a given locked content for the whole task execution. Locking points are set heuristically before outer loops. Such simple heuristics adds no complexity, and reduces the WCET by taking profit of the temporal reuse found in loops. Since loops can be processed as isolated regions, the optimal contents to lock into cache for each region can be obtained, and the WCET analysis time is further reduced. With these two improvements, our WCET analysis is around 10 times faster than other approaches. Also, our results show that the WCET is reduced, and the hit ratio achieved for the lockable instruction cache is similar to that of a real execution with an LRU instruction cache. Finally, we analyze the WCET sensitivity to compiler optimization, showing for each benchmark the right choices and pointing out that O0 is always the worst option
Avoiding Unnecessary Information Loss: Correct and Efficient Model Synchronization Based on Triple Graph Grammars
Model synchronization, i.e., the task of restoring consistency between two
interrelated models after a model change, is a challenging task. Triple Graph
Grammars (TGGs) specify model consistency by means of rules that describe how
to create consistent pairs of models. These rules can be used to automatically
derive further rules, which describe how to propagate changes from one model to
the other or how to change one model in such a way that propagation is
guaranteed to be possible. Restricting model synchronization to these derived
rules, however, may lead to unnecessary deletion and recreation of model
elements during change propagation. This is inefficient and may cause
unnecessary information loss, i.e., when deleted elements contain information
that is not represented in the second model, this information cannot be
recovered easily. Short-cut rules have recently been developed to avoid
unnecessary information loss by reusing existing model elements. In this paper,
we show how to automatically derive (short-cut) repair rules from short-cut
rules to propagate changes such that information loss is avoided and model
synchronization is accelerated. The key ingredients of our rule-based model
synchronization process are these repair rules and an incremental pattern
matcher informing about suitable applications of them. We prove the termination
and the correctness of this synchronization process and discuss its
completeness. As a proof of concept, we have implemented this synchronization
process in eMoflon, a state-of-the-art model transformation tool with inherent
support of bidirectionality. Our evaluation shows that repair processes based
on (short-cut) repair rules have considerably decreased information loss and
improved performance compared to former model synchronization processes based
on TGGs.Comment: 33 pages, 20 figures, 3 table
- …