1,984 research outputs found
ILP Modulo Data
The vast quantity of data generated and captured every day has led to a
pressing need for tools and processes to organize, analyze and interrelate this
data. Automated reasoning and optimization tools with inherent support for data
could enable advancements in a variety of contexts, from data-backed decision
making to data-intensive scientific research. To this end, we introduce a
decidable logic aimed at database analysis. Our logic extends quantifier-free
Linear Integer Arithmetic with operators from Relational Algebra, like
selection and cross product. We provide a scalable decision procedure that is
based on the BC(T) architecture for ILP Modulo Theories. Our decision procedure
makes use of database techniques. We also experimentally evaluate our approach,
and discuss potential applications. Comment: FMCAD 2014 final version plus proof
A mathematical formulation of the loop pipelining problem
This paper presents a mathematical model for the loop pipelining problem that considers several parameters for optimization and supports any combination of resource and timing constraints. The unrolling degree of the loop is one of the variables explored by the model. By using Farey's series, an optimal exploration of the unrolling degree is performed and optimal solutions not considered by other methods are obtained. The problem of finding an optimal schedule that minimizes resource and register requirements is solved using an integer linear programming (ILP) model. A novel paradigm called branch and prune is proposed to efficiently converge towards the optimal schedule and prune the search tree for integer solutions, thus drastically reducing the running time. This is the first formulation that combines the unrolling degree of the loop with timing and resource constraints in a mathematical model that guarantees optimal solutions. Peer Reviewed. Postprint (author's final draft)
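As a point of reference for the abstract above, the following minimal sketch generates the Farey sequence of order n, that is, every reduced fraction between 0 and 1 whose denominator is at most n, in ascending order. How the paper maps these fractions onto unrolling degrees is not stated in the abstract; the function name `farey` and the reading of each denominator as a candidate unrolling degree are assumptions made only for illustration.

```python
from fractions import Fraction


def farey(n):
    """Return the Farey sequence of order n as an ascending list of Fractions.

    Hypothetical helper: uses the standard next-term recurrence, so no
    sorting or gcd filtering is needed.
    """
    a, b, c, d = 0, 1, 1, n          # first two terms: 0/1 and 1/n
    seq = [Fraction(a, b)]
    while c <= n:
        k = (n + b) // d
        a, b, c, d = c, d, k * c - a, k * d - b
        seq.append(Fraction(a, b))
    return seq


if __name__ == "__main__":
    # Assumed interpretation: each denominator is a candidate unrolling degree.
    for f in farey(4):
        print(f.numerator, "/", f.denominator)
```

Enumerating the fractions in order means every ratio with a bounded denominator is visited exactly once, which is one plausible reason such a series suits a systematic search over unrolling degrees.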
A Comparative Study of Scheduling Techniques for Multimedia Applications on SIMD Pipelines
Parallel architectures are essential in order to take advantage of the
parallelism inherent in streaming applications. One particular branch of these
employs hardware SIMD pipelines. In this paper, we analyse several scheduling
techniques, namely ad hoc overlapped execution, modulo scheduling and modulo
scheduling with unrolling, all of which aim to efficiently utilize the special
architecture design. Our investigation focuses on improving throughput while
analysing other metrics that are important for streaming applications, such as
register pressure, buffer sizes and code size. Through experiments conducted on
several media benchmarks, we present and discuss trade-offs involved when
selecting any one of these scheduling techniques. Comment: Presented at DATE Friday Workshop on Heterogeneous Architectures and
Design Methods for Embedded Image Systems (HIS 2015) (arXiv:1502.07241)
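To illustrate why modulo scheduling with unrolling can improve throughput over plain modulo scheduling, the sketch below computes the textbook resource-constrained lower bound on the initiation interval (ResMII) and the per-iteration interval after unrolling. It is not taken from the paper: the function names, the resource dictionaries, and the decision to ignore recurrence constraints are all assumptions made for illustration.

```python
from fractions import Fraction


def ceil_div(a, b):
    """Integer ceiling of a / b for positive integers."""
    return -(-a // b)


def res_mii(uses, units):
    """Resource-constrained minimum initiation interval of one loop body.

    uses[r]  = operations per iteration that need resource r (assumed input)
    units[r] = number of functional units of kind r          (assumed input)
    """
    return max(ceil_div(uses[r], units[r]) for r in uses)


def per_iteration_ii(uses, units, unroll):
    """Lower bound on cycles per original iteration after unrolling.

    Unrolling by `unroll` multiplies resource usage; the bound for the
    unrolled body is then divided by the number of iterations it contains.
    Recurrence (loop-carried dependence) constraints are ignored here.
    """
    unrolled = {r: uses[r] * unroll for r in uses}
    return Fraction(res_mii(unrolled, units), unroll)


if __name__ == "__main__":
    uses = {"alu": 3}    # hypothetical loop: 3 ALU operations per iteration
    units = {"alu": 2}   # hypothetical machine: 2 ALUs
    print(res_mii(uses, units))              # bound without unrolling
    print(per_iteration_ii(uses, units, 2))  # bound when unrolled twice
```

With 3 operations on 2 units, the bound without unrolling rounds up to 2 cycles per iteration, while unrolling twice gives 3 cycles for 2 iterations. The price is a larger loop body, which relates to the code size and register pressure trade-offs the abstract mentions.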
A unified modulo scheduling and register allocation technique for clustered processors
This work presents a modulo scheduling framework for clustered ILP processors that integrates the cluster assignment, instruction scheduling and register allocation steps in a single phase. This unified approach is more effective than traditional approaches based on sequentially performing some (or all) of the three steps, since it allows optimizing the global code generation problem instead of searching for optimal solutions to each individual step. Besides, it avoids the iterative nature of traditional approaches, which require repeated applications of the three steps until a valid solution is found. The proposed framework includes a mechanism to insert spill code on-the-fly and heuristics to evaluate the quality of partial schedules considering simultaneously inter-cluster communications, memory pressure and register pressure. Transformations that allow trading pressure on a type of resource for another resource are also included. We show that the proposed technique outperforms previously proposed techniques. For instance, the average speed-up for the SPECfp95 is 36% for a 4-cluster configuration. Peer Reviewed. Postprint (published version)
Hierarchical clustered register file organization for VLIW processors
Technology projections indicate that wire delays will become one of the biggest constraints in future microprocessor designs. To avoid long wire delays and therefore long cycle times, processor cores must be partitioned into components so that most of the communication is done locally. In this paper, we propose a novel register file organization for VLIW cores that combines clustering with a hierarchical register file organization. Functional units are organized in clusters, each one with a local first level register file. The local register files are connected to a global second level register file, which provides access to memory. All intercluster communications are done through the second level register file. This paper also proposes MIRS-HC, a novel modulo scheduling technique that simultaneously performs instruction scheduling, cluster selection, insertion of communication operations, register allocation and spill insertion for the proposed organization. The results show that although more cycles are required to execute applications, the execution time is reduced due to a shorter cycle time. In addition, the combination of clustering and hierarchy provides a larger design exploration space that trades off performance and technology requirements. Peer Reviewed. Postprint (published version)
A cost-effective clustered architecture
In current superscalar processors, all floating-point resources are idle during the execution of integer programs. As previous works show, this problem can be alleviated if the floating-point cluster is extended to execute simple integer instructions. With minor hardware modifications to a conventional superscalar processor, the issue width can potentially be doubled without increasing the hardware complexity. In fact, the result is a clustered architecture with two heterogeneous clusters. We propose to extend this architecture with a dynamic steering logic that sends the instructions to either cluster. The performance of clustered architectures depends on the inter-cluster communication overhead and the workload balance. We present a scheme that uses run-time information to optimise the trade-off between these figures. The evaluation shows that this scheme can achieve an average speed-up of 35% over a conventional 8-way issue (4 int+4 fp) machine and that it outperforms the previously proposed one. Peer Reviewed. Postprint (published version)
Software prefetching for software pipelined loops
The paper investigates the interaction between software pipelining and different software prefetching techniques for VLIW machines. It is shown that processor stalls due to memory dependencies have a great impact on execution time. A novel heuristic is proposed and it is shown to outperform previous proposals. Peer Reviewed. Postprint (published version)
Pushing the envelope of Optimization Modulo Theories with Linear-Arithmetic Cost Functions
In the last decade we have witnessed impressive progress in the
expressiveness and efficiency of Satisfiability Modulo Theories (SMT) solving
techniques. This has brought previously-intractable problems within the reach of
state-of-the-art SMT solvers, in particular in the domain of SW and HW
verification. Many SMT-encodable problems of interest, however, also require
the capability of finding models that are optimal with respect to some cost functions. In
previous work, namely "Optimization Modulo Theory with Linear Rational Cost
Functions -- OMT(LAR U T )", we have leveraged SMT solving to handle the
minimization of cost functions on linear arithmetic over the rationals, by
means of a combination of SMT and LP minimization techniques. In this paper we
push the envelope of our OMT approach along three directions: first, we extend
it to work also with linear arithmetic on the mixed integer/rational domain, by
means of a combination of SMT, LP and ILP minimization techniques; second, we
develop a multi-objective version of OMT, so as to handle many cost functions
simultaneously; third, we develop an incremental version of OMT, so as to
exploit the incrementality of some OMT-encodable problems. An empirical
evaluation performed on OMT-encoded verification problems demonstrates the
usefulness and efficiency of these extensions. Comment: A slightly-shorter version of this paper is published at TACAS 2015
conference
Time-Triggered Co-Scheduling of Computation and Communication with Jitter Requirements
The complexity of embedded application design is increasing with growing user
demands. In particular, automotive embedded systems are highly complex in
nature, and their functionality is realized by a set of periodic tasks. These
tasks may have hard real-time requirements and communicate over an
interconnect. The problem is to efficiently co-schedule task execution on cores
and message transmission on the interconnect so that timing constraints are
satisfied. Contemporary works typically deal with zero-jitter scheduling, which
results in lower resource utilization, but has lower memory requirements. This
article focuses on jitter-constrained scheduling that puts constraints on the
tasks' jitter, increasing schedulability over zero-jitter scheduling. The
contributions of this article are: 1) Integer Linear Programming and
Satisfiability Modulo Theory models exploiting problem-specific information to
reduce the formulations' complexity to schedule small applications. 2) A
heuristic approach, employing three levels of scheduling, that scales to real-world
use-cases with 10000 tasks and messages. 3) An experimental evaluation of the
proposed approaches on a case-study and on synthetic data sets showing the
efficiency of both zero-jitter and jitter-constrained scheduling. It shows
that up to 28% higher resource utilization can be achieved by having up to 10
times longer computation time with relaxed jitter requirements. Comment: IEEE Transactions on Computers (2017)
