For a sampled-data control system, we propose to choose the time between samples to be shorter than the computational delay involved in computing the control signal, an approach we call intra-delay sampling. It is shown that, utilising a parallel computing architecture, this is indeed feasible and that intra-delay sampling schemes yield better performance than their slower sampling counterparts.
Consider Figure 1 . Here a continuous-time plant P is controlled by a digital controller C, where the computation of the control signal requires T c seconds to terminate. Furthermore, the plant output is sampled and the control input is implemented with a (sampling) period of T s seconds. Typically the digital controller is then realized as follows: the plant output is sampled and the computation of the control signal is initiated. Once the computation of the control signal has terminated (after T c seconds) the control signal is implemented. At the same time, a new sample is taken and a further computation of the control signal is initiated, as depicted in Figure 2 (a). We observe that T s is bounded by T c from below (in fact in this case T c = T s ), since one has to 'wait' for the computation to terminate before one can sample again and initiate a further computation.
If parallel hardware is available, a standard approach to speed up the computation of the control signal is to perform parallelisable operations in parallel. For example, all rows of a state feedback controller u(t) = Kx(t) can be implemented in parallel. Also, the underlying dot product can be accelerated utilising parallel hardware, e.g. see [8] where the authors make exhaustive use of parallel structures to accelerate the computation of a dot product. Such obvious exploitation of parallelism within the control algorithm leads to smaller computational delays T c and therefore allows a higher sampling frequency 1/T s ; however, we still face the potential limitation that T c ≤ T s . Hence, the smallest achievable T c , and therefore T s , is a function of the exploitable parallelism in the control algorithm, which may be limited.
The core message of this paper is that one can find controller implementations that allow T s < T c , i.e. one samples faster than the computational delay, where T s is only bounded from below by the amount of available parallel resource and a constant that depends on the geometry of the system. More precisely, we will show that the sampling time T s is a function of the number of inputs, whereas the computational delay T c is a function of the number of states.
How this is conducted can be observed from Figure 2 (b): the plant output is sampled and the computation of the control signal is initiated. Then, after T s seconds, the plant output is sampled again and a new computation of the control signal is initiated before the previous computation has terminated, hence T s < T c . We term this intuitive principle intra-delay sampling. Observe that intra-delay sampling fundamentally depends on the availability of parallel computational units, as the number of diagonal lines that intersect any given vertical line in Figure 2 (b) is greater than one. Also note that intra-delay sampling is a single-rate control technique and therefore distinct from multi-rate control.
By utilising the proposed technique we are able to speed up the sampling and control signal update rate; spare parallel computational resources may be transformed directly into sampling speed-ups. The expected gains from the intra-delay sampling scheme are:
1. By increasing the sampling frequency, the controller may cope with higher bandwidth disturbances.
2. It reduces the controller reaction time to disturbances, promising better disturbance rejection.
3. We expect a smoother overall control signal.
Note that increasing the sampling frequency of the intra-delay sampling scheme alone does not necessarily translate into better disturbance rejection properties. For many implementations of the algorithm the computational delay will scale with the inverse of the sampling period, leading to increasingly poor disturbance rejection properties for increasingly small sampling periods, e.g. see Section 3.1. To prove existence of implementations that achieve a decoupling of computational delay and sampling period is one of the key results of the paper.
We stress that the theoretical bounds derived here are conceptual and proof of principle only: Linear state feedback is only considered for its simplicity and to illustrate the point rather than lending itself to actual implementations. The conducted analysis is a necessary first step towards the handling of more complicated control laws that require far more computational power than simple linear controllers. This is the first control paper that clearly shows how one may be able to decouple the sampling period from the computational delay while obtaining explicit formulas for computing T s and T c .
The paper is organized as follows. We will introduce the principle of intra-delay sampling with the help of a concrete control algorithm (linear state feedback) and discuss two possible implementations of a controller of this type: one where the computational delay T c is a function of the ratio T c /T s , i.e. the sampling speed-up, and one where T c is a constant. While the former does not allow for a straightforward analysis, the latter allows a clear reasoning why intra-delay sampling is useful in practice. We then give bounds on achievable sampling rates and computational delays for a specific hardware architecture (a top-of-the-line FPGA). Finally, we will show by simulation that intra-delay sampling has the potential to outperform standard sampling schemes.
Intra-delay sampling controller design
Suppose the time required to compute the control signal, i.e. the computational delay T c , is known; we will see in Section 3 how T c may be determined explicitly. As a measure of the sampling speed-up define
h := T c /T s ,
where we assume w.l.o.g. that h ∈ N, h ≥ 1. To ensure that for h → ∞, T s → 0 the closed-loop system remains meaningful and tends towards its continuous-time counterpart, we employ the theory of sampled-data systems (e.g. see [3] ).
Throughout this section we will analyse three different plants, depicted in Figure 3 : the continuous-time plant P , the sampled-data discrete-time plantP and the sampled-data discrete-time plant incorporating the computational delayP h . Here z −h represents a discrete-time delay of h samples, hence hT s = T c seconds. Consequently we will design a sampled-data controller for the plantP h , i.e. a controller that is aware of its own computational delay.
Example: LQR control
The design challenge for a sampled-data LQR controller with a computational delay of T c = hT s seconds can be formulated as:
where u : [0, hT s ) → R m is given, subject to
and the zero-order-hold constraint
where
(A, B) stabilizable. Similar design challenges can be posed, e.g. for H ∞ , etc. A digital controller that solves such a control problem can be derived in a number of ways, e.g. moving the controller delay into the plant, followed by Artstein's model reduction approach [1] to obtain a plant without input delay in continuous-time, followed by discretization and controller design. Here we give an exemplar derivation that is based on state augmentation in discrete-time to deal with the delay. We would like to stress that although these derivations are more or less standard, e.g. see [3] , they are nevertheless central to the argument, since they motivate the intra-delay sampling controller structure that will be exploited subsequently.
From (1b) and (1c) we have that We have withũ
We can now use standard design tools to derive a corresponding discrete-time LQR controller that generates a control signal u that minimizes the continuous-time cost function in (1a) (the sampled-data control problem) subject to the computational delay T c . For that purpose we have to determine the equivalent discrete-time weighting matricesQ,R andS, which we will do next. Since
where the discrete-time sampled-data weighting matrices can be computed as
From [5] we then have that the optimal (sampled-data) controller gain matrixK for the augmented system is given byK = (R +B
, where the discrete-time algebraic Riccati equation
The minimizing control signal can be computed as
Note that there are more efficient ways to derive Σ than the direct computation via the given Riccati equation, e.g. [11] , however, since these computations are performed off-line, we merely note that computable solutions exist for large h.
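The state augmentation used in the derivation above can be illustrated with a small sketch. The following Python fragment is an illustration under stated assumptions rather than the construction used in the paper: it takes an already discretized plant x[k+1] = A x[k] + B u[k−h], stacks the h pending inputs below the state, and checks that the resulting delay-free system of dimension n + hm reproduces the delayed one. The ordering of the augmented state, the example matrices and all function names are hypothetical.

```python
def augment(A, B, h):
    """Return (A_aug, B_aug) for x[k+1] = A x[k] + B u[k-h], with h >= 1.

    Assumed augmented state: [x[k], u[k-h], ..., u[k-1]] (dimension n + h*m).
    """
    n, m = len(A), len(B[0])
    N = n + h * m
    A_aug = [[0.0] * N for _ in range(N)]
    B_aug = [[0.0] * m for _ in range(N)]
    for i in range(n):
        for j in range(n):
            A_aug[i][j] = A[i][j]
        for j in range(m):
            A_aug[i][n + j] = B[i][j]      # the oldest pending input drives the plant
    for i in range(n, N - m):
        A_aug[i][i + m] = 1.0              # each pending slot takes the next slot's content
    for j in range(m):
        B_aug[N - m + j][j] = 1.0          # the newly computed input enters the last slot
    return A_aug, B_aug


def matvec(M, v):
    return [sum(a * b for a, b in zip(row, v)) for row in M]


def simulate_delayed(A, B, h, x0, inputs):
    """Direct simulation with an explicit h-step input delay (inputs before k = 0 are zero)."""
    m = len(B[0])
    x, xs = list(x0), []
    for k in range(len(inputs)):
        u_old = inputs[k - h] if k >= h else [0.0] * m
        x = [a + b for a, b in zip(matvec(A, x), matvec(B, u_old))]
        xs.append(x)
    return xs


def simulate_augmented(A, B, h, x0, inputs):
    """Simulation of the delay-free augmented system; returns the plant part of the state."""
    A_aug, B_aug = augment(A, B, h)
    n, m = len(A), len(B[0])
    z, xs = list(x0) + [0.0] * (h * m), []
    for u in inputs:
        z = [a + b for a, b in zip(matvec(A_aug, z), matvec(B_aug, u))]
        xs.append(z[:n])
    return xs


# Assumed example data: n = 2, m = 1, h = 3.
A = [[1.0, 0.1], [0.0, 1.0]]
B = [[0.005], [0.1]]
inputs = [[1.0], [-0.5], [0.25], [0.0], [2.0], [1.0]]
direct = simulate_delayed(A, B, 3, [1.0, 0.0], inputs)
lifted = simulate_augmented(A, B, 3, [1.0, 0.0], inputs)
max_gap = max(abs(p - q) for r, s in zip(direct, lifted) for p, q in zip(r, s))
print(len(augment(A, B, 3)[0]), max_gap)   # augmented dimension and largest mismatch
```

The augmented dimension grows linearly with h, which is the reason why a direct evaluation of the feedback law becomes more expensive as the sampling speed-up increases.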
Controller implementation
The intra-delay sampling controller is now implemented as follows: at every sampling time kT s , k ∈ N, we compute (2) and implement the corresponding control signal u(t), ∀t ∈ [kT s , (k + 1)T s ). Note that since we assume that each such computation requires T c seconds to terminate and since T s < T c this implies that we have to initiate h parallel computations, i.e. the number of diagonal lines that intersect every given vertical line in Figure 2 (b). The first resource-related challenge is therefore that the utilised hardware architecture is required to deal with at least h parallel processes. Furthermore, since for a fixed T c the sampling speed-up h = T c /T s , or the minimum number of parallel processes, is an increasing function of the sampling frequency 1/T s , the amount of required (parallel) computational resources scales with h.
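The resource requirement described above can be checked with a few lines of Python. The sketch assumes an idealised schedule in which a computation starts at every kT s and occupies one computational unit for exactly T c = hT s ; time is measured in integer ticks to avoid rounding, and the function name and example values are hypothetical.

```python
def computations_in_flight(t, Ts, h):
    """Count computations active at tick t, if one starts at every k*Ts (k >= 0)
    and each one runs over the half-open interval [k*Ts, k*Ts + h*Ts)."""
    Tc = h * Ts
    count = 0
    k = 0
    while k * Ts <= t:
        if t < k * Ts + Tc:
            count += 1
        k += 1
    return count


Ts, h = 5, 3  # assumed example values, giving Tc = 15 ticks
steady_state = [computations_in_flight(t, Ts, h) for t in range((h - 1) * Ts, 10 * Ts)]
print(set(steady_state))
```

Once the first h computations have been started, every time instant is covered by exactly h running computations, which corresponds to the number of diagonal lines crossing a vertical line in Figure 2 (b).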
The second challenge that we face is posed by the computation of (2) itself. The main concern here is that the computational complexity scales with the sampling speed-up h, sincex[k] ∈ R n+hm , and therefore the number of operations to compute eachKx scales with h. This implies that, for a direct implementation of the matrix-by-vector product, the computational delay T c is in fact an increasing function of h and not a constant, which is inconvenient and potentially problematic. We will now discuss these issues in detail. Also observe that since we can writeũ
A direct implementation
.., m}, is a row ofK, we can compute the matrix-by-vector productKx by computing m dot products in parallel. One possible implementation of a dot product is given in Figure 4 , where every box represents a real, physical implementation of either a scalar multiplication or a scalar addition.
We can then computeKx by letting v :=K q , q ∈ {1, 2, ..., m}, w :=x(kT s ), l := n + mh in Figure 4 .
Custom implementation:
A typical custom hardware implementation of the dot product would utilize parallel hardware to compute every column in Figure 4 . This is usually achieved by employing a standard technique in digital hardware design called 'pipelining' [6] .
A pipelined implementation of the dot product can be visualized as in Figure 5 , where each block represents the parallel hardware that implements one column of 
where the ceiling function ⌈a⌉ := min {b ∈ Z | b ≥ a}, a ∈ R and T m and T a are the times that the hardware requires to compute a scalar multiplication and addition, respectively. The computational delay T c includes a log 2 since the number of required addition stages scales logarithmically with the number of inputs, whereas the ceiling function ensures that the number of addition stages is an integer.
Note that since we need to implement m such pipelines in parallel to computeKx(kT s ), the utilisation,
i.e. the required number of computational units, is given by
where MUL and ADD represent a single scalar multiplier and adder, respectively. Pipelining has the effect that the sampling time T s is now bounded from below by the slowest element in the pipeline, i.e. in our case T s ≥ max{T m , T a }. However, we note that deeper pipelining or a fully parallel implementation of the dot product allows a further reduction of T s . For example, for a fully parallel implementation, T s can be chosen as small as we can physically sample x, typically in the order of the clock-period of the device. In the interest of simplicity, we only consider one likely implementation: the scalar product is pipelined, adders and multipliers are not pipelined, hence T s ≥ max{T m , T a }.
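To make the logarithmic term in the computational delay of the pipelined dot product concrete, the following Python sketch evaluates a dot product as one multiplication stage followed by a binary adder tree and counts the addition stages. It is a software illustration only, assuming the tree structure suggested by Figure 4. The latency expression T m + T a ⌈log 2 (l)⌉ used in the helper is the form implied by the description above and is an assumption, not a quotation of the paper's equation; all names are hypothetical.

```python
def tree_dot(v, w):
    """Dot product via one multiplication stage and a binary adder tree.

    Returns (value, number_of_addition_stages).
    """
    level = [a * b for a, b in zip(v, w)]      # all products formed in one stage
    stages = 0
    while len(level) > 1:
        nxt = [level[i] + level[i + 1] for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:                     # an odd element passes through unchanged
            nxt.append(level[-1])
        level = nxt
        stages += 1
    return (level[0] if level else 0), stages


def tree_delay(l, Tm, Ta):
    """Assumed latency of the tree: one multiplication plus ceil(log2(l)) additions."""
    return Tm + Ta * (l - 1).bit_length()      # (l - 1).bit_length() == ceil(log2(l)) for l >= 1


value, stages = tree_dot([1, 2, 3, 4, 5], [5, 4, 3, 2, 1])
print(value, stages, tree_delay(5, 1, 1))
```

For a vector of length l = n + mh the number of addition stages, and with it the delay of a direct implementation, grows with h, albeit only logarithmically.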
GPP implementation:
For single-core/single-thread GPPs the exploitation of parallelism on the level demonstrated for the custom implementation is usually not feasible, since inter-GPP communication delays are likely to negate the expected speed-up. This is one typical issue of multi-core GPP processors which often complicates a straightforward parallel implementation of algorithms. To take this effect into account, we will only compute allK qx , q ∈ {1, 2, ..., m}, in parallel utilizing multiple GPPs, whereas the dot products for v :=K q , w :=x(kT s ), l := (n + mh) in Figure 4 are computed sequentially. This reduces the inter-GPP communication considerably while posing a reasonably large atomic problem to every GPP. Please note that there naturally exist infinitely many possible GPP implementations where we only consider one likely one. The computational delay for such an implementation is given by
Assuming the availability of a sufficient number of parallel GPPs we can initiate a new computation whenever we like. Therefore, T s can be chosen as small as we can physically sample x. The utilisation is then given by
where GPP represents a computational unit that is able to perform scalar multiplications and additions.
Note that in (3) and (5) T c is an increasing function of h. Although the intra-delay sampling controller may still show better disturbance rejection properties than the standard sampling controller, e.g. when the system dynamics are slow, the dependence of T c on h complicates the analysis significantly. For example, we may substitute hT s for T c in (3) and (5). Together with the lower bounds on T s , this gives the set of allowable pairs (h, T s ) as functions of T m and T a . We would then have to analyse whether there exist allowable pairs (h, T s ) that promise better (disturbance rejection) properties of the intra-delay sampling controller over the standard sampling controller (T c = T s and h = 1). We will now highlight the relationship between intra-delay sampling and IIR filtering and give a controller implementation that decouples T c and T s .
A filter implementation of the intra-delay sampling controller where T c is independent of h
We observed in the previous section that the computational delay T c for a direct implementation ofKx is a function of h, since at every sampling time kT s we have to compute a matrix-by-vector product for which the computational delay T c scales with h. Now observe that we can rewrite (2) as follows.
For k ∈ N:
Replacing k with k − h throughout leads to
Note that the controller described by (8) has the structure of an Infinite Impulse Response (IIR) filter.
Furthermore, (8) with u[k] = u(kT s ) leads to:
The key observation now is that, in order to compute u(kT s ), k ∈ N, it is not necessary to evaluate the complete sum in (9) at once, but one can compute every step i whenever a new u becomes available.
This implies that at every sampling time kT s we initiate a new computation of the type Lu, as depicted in Figure 6 for the case where h = 2. We denote the time these computations require to terminate T l .
We then define the smallest achievable sampling time T s := T l + 2T a .
In parallel to the computation of the sum, we also initiate the computation of Kx. Finally, when the computation of the sum is complete, we add the result of the computation of Kx. This gives u and prevents a build-up of addition latencies. Note that we will assume n > m throughout, which is often the case. For n ≤ m, although possible implementations exist, the computational timing will be different.
Please note that we also assumed w.l.o.g. that T c is an integer multiple of T s , i.e. h ∈ N.
One possible implementation of the algorithm is now given in Figure 7 , where × represents an implementation of the dot product and + represents a scalar addition. This is a so-called transpose implementation of the IIR filter, which functions as follows: All computations of the type Lu are initiated synchronously, where the required delay is implemented at the output, i.e. the result from the computation L h u[k] needs to propagate through h delay stages where the summation is performed along the way. Finally, when all summations of the type Lu are complete, the result from the earlier initiated computation Kx is added. This completes the construction of the control signal.
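The transpose structure can be illustrated in software. The Python sketch below is a scalar (m = 1) illustration under the assumption that the filter has the form u[k] = s[k] + Σ_{i=1..h} L_i u[k−i], where s[k] stands for the separately computed state term of the type Kx; the exact indexing of the paper's equations (8) and (9) is not reproduced here, and the function names and example gains are hypothetical. It compares a direct evaluation of the sum with a register chain in which each product L_i u[k] is formed as soon as u[k] is available and is accumulated on its way to the output.

```python
def direct_form(L, s):
    """u[k] = s[k] + sum_{i=1..h} L[i-1] * u[k-i], evaluated as one sum per step."""
    u = []
    for k, sk in enumerate(s):
        acc = sk
        for i, Li in enumerate(L, start=1):
            if k - i >= 0:
                acc += Li * u[k - i]
        u.append(acc)
    return u


def transposed_form(L, s):
    """Same recursion for h = len(L) >= 1, but every product L[i] * u[k] is started
    when u[k] appears and is summed while propagating through a chain of h registers."""
    h = len(L)
    reg = [0.0] * h                      # reg[0] feeds the output adder
    u = []
    for sk in s:
        uk = sk + reg[0]                 # a single addition on the output path
        # Ascending order, so that reg[i + 1] still holds the value of the previous step.
        for i in range(h - 1):
            reg[i] = reg[i + 1] + L[i] * uk
        reg[h - 1] = L[h - 1] * uk
        u.append(uk)
    return u


L = [0.5, -0.25, 0.125]                  # assumed example gains, h = 3
s = [1.0, 0.0, 2.0, -1.0, 0.5, 0.0, 0.0, 3.0]
gap = max(abs(a - b) for a, b in zip(direct_form(L, s), transposed_form(L, s)))
print(gap)
```

In the register chain only one addition lies between the arrival of the state term and the output, independently of h, which is the property the transpose implementation exploits.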
Custom implementation: The lowest achievable sampling time for a custom implementation of the IIR filter in Figure 7 with pipelined dot products is given by
The computational delay for the proposed implementation is given by
We note that the sampling time T s is a function of the input dimension m, whereas the computational delay T c is a function of the state dimension n. The maximum achievable sampling speed-up is therefore given by
The ceiling function is a result of the possibility that initially T c /T s ∉ N, where we then simply delay T c sufficiently long such that T c /T s ∈ N, i.e. we let T c = hT s where T s is the lower bound from (10) and h is the largest h ∈ N that satisfies (12).
From Figure 2(b) we have that the intra-delay sampling scheme requires that a new control input is available every T s seconds. Since we assume that all dot products are pipelined, the implementation in Figure 7 has this property. However, recall that Figure 7 only computes one row of u, hence we require m parallel implementations. This leads for a pipelined implementation of the dot product to the utilization
where registers are assumed to require a negligible amount of resources.
GPP implementation: As before, the algorithm would be implemented differently on single-core/single-thread GPPs due to the specifics of the hardware. In this case we would compute all dot products sequentially on a single GPP. The minimum achievable sampling time is then given by
and the computation delay is given by
hence the maximum achievable sampling speed-up is given by
For the utilisation we have that each computation of the type Lu requires one GPP, where the computation of the type Kx requires h GPPs, e.g. for h = 2 every vertical line in Figure 6 intersects h blocks of the type Lu and h blocks of the type Kx. The utilization is therefore given by
We can now observe that the lower bound on the computational delay T c , as given in (11) and (15) is in fact independent of h and only depends on the number of states n, whereas the lowest achievable sampling time T s depends on the number of inputs m. However, note that the achievable sampling speedup h is not determined by T c and T s alone. The sampling speed-up is also bounded from above by the availability of parallel resources, i.e. h in (13) and (17) is bounded from above by the availability of adders and multipliers or GPPs, respectively. However, assuming a sufficient amount of parallel resources, the above analysis shows that the implementation of intra-delay sampling, i.e. T s < T c , is feasible. We will discuss the implications of analytical and resource constraints in the following section with the help of a concrete example. LUTs and a multiplier U m = 0.7 × 10 3 LUTs. The time to perform a single precision floating point addition and multiplication are given by T m ≈ T a ≈ 2/f = 18.1ns [9] . For simplicity assume that m = 1.
We then have from (12) that the achievable sampling speed-up h is bounded from above by
Furthermore, from (13) and the fact that the required resource U for a particular design is bounded from above by the available resource L, i.e.
we obtain a further hardware-specific constraint on the sampling speed-up h as a function of the state dimension n. Both these upper bounds are now plotted in Figure 8 , where we observe that the largest implementable speed-up h = 3, for L = 150 × 10 3 Look-Up Tables, is achieved for a system with up to n = 132 states.
We then have from (10) and (11) that for n = 132 the smallest sampling time is given by T s = T m +2T a = 54.3ns and the computational delay is given by T c = hT s = 217.2ns ≥ T m +T a ⌈log 2 (n)⌉+T a = 181ns.
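The arithmetic of this example can be reproduced as follows. The sketch assumes the timing expressions suggested by the surrounding text, namely T s = T l + 2T a with T l = T m + T a ⌈log 2 (m)⌉ for a pipelined dot product of length m, and T c ≥ T m + T a ⌈log 2 (n)⌉ + T a ; the expression for T l is an assumption that agrees with the quoted value for m = 1. The function names are hypothetical and all times are in nanoseconds.

```python
import math


def ceil_log2(l):
    """ceil(log2(l)) for an integer l >= 1, computed without floating point."""
    return (l - 1).bit_length()


def sampling_time(m, Tm, Ta):
    """Assumed lower bound on Ts: dot product of length m plus two additions."""
    return Tm + Ta * ceil_log2(m) + 2 * Ta


def delay_bound(n, Tm, Ta):
    """Assumed lower bound on Tc: dot product of length n plus one addition."""
    return Tm + Ta * ceil_log2(n) + Ta


Tm = Ta = 18.1                         # ns, values quoted in the text
Ts = sampling_time(1, Tm, Ta)          # m = 1
Tc_min = delay_bound(132, Tm, Ta)      # n = 132
multiple = math.ceil(Tc_min / Ts)      # smallest integer multiple of Ts covering Tc_min
print(round(Ts, 1), round(Tc_min, 1), multiple, round(multiple * Ts, 1))
```

Under these assumptions the sketch returns the 54.3 ns, 181 ns and 217.2 ns quoted above, the last value being the delay bound padded to the next integer multiple of T s .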
Note that T c is a function of the number of states n and the clock frequency f . For example, in low power applications it may be necessary to reduce f significantly. This will lead to a larger T c and will make the intra-delay sampling scheme interesting for applications with slower dynamics. We also expect the availability of much larger devices in the future. For example, the next version of Xilinx FPGAs will incorporate approximately L = 500 × 10^3 Look-Up Tables (Virtex 6, XC6VLX760) [10] . By Figure 8 , larger devices will allow the implementation of larger speed-ups for larger systems in the future.
Finally, note that this simple example has been chosen to demonstrate the fundamental mechanics and
Example: Performance of the closed-loop system
In this section we discuss the closed-loop behaviour of the intra-delay sampling state feedback controller with the help of a simple example. We stress that such simple control problems will in most cases not require an intra-delay sampling implementation of the linear state feedback controller since standard sampling schemes are more than sufficient to provide reasonable sampling rates, even on very weak hardware.
The purpose of the example is rather to highlight two key points: 1) the effect of the system geometry on the achievable sampling speed-ups and 2) better disturbance rejection properties if implemented on identical hardware (with identical computational delays).
Let P be given by P :ẋ(t) = Ax(t) + B(u(t) + w(t)), ∀t ≥ 0, where 
and w is defined below. This is the model of an inverted pendulum on a cart, where x is the position of the cart, φ is the angle between the pendulum and the vertical and the control objective is to apply a force u(t) to the cart in order to balance the pendulum in an upright position, i.e. φ → 0.
In the standard sampling scheme (h = 1), a sample is only taken every 0.20s, where the computational delay is also 0.20s, i.e. T c = T s = 0.20s. Since the disturbance enters at t = 1.0001s, a counteracting control signal will not be applied until t = 1 + T s + T c = 1.4s, as depicted in Figure 9 (b). In contrast, the intra-delay sampling controller for h = 3 has a sampling time of T s = 0.0667s and a computation delay of T c = 0.20s. It will therefore be able to react at time t = 1 + T s + T c = 1.2667s < 1.4s. Hence, intra-delay sampling effectively reduces the (maximum) total time to react to a disturbance, i.e. T c + T s , and therefore promises better performance in the presence of disturbances. In fact, the disturbance w(t)
has been chosen to illustrate this behaviour.
The total maximum time to react to a disturbance T c + T s is now plotted in Figure 10 for different sampling speed-ups h. Note that for increasing h, T s + T c must tend towards T c since T s = T c /h → 0 for h → ∞ and T c is constant. In the limit, the corresponding closed-loop system then consists of a continuous-time plant with a pure input delay T c , and a corresponding continuous-time controller. The second curve in Figure 10 is the root mean square error of the output φ (closed-loop simulated for 15s) where now w is a Gaussian process with mean µ = 0, variance σ^2 = 1 and a sample period of 0.001s. We observe that the curves are qualitatively very similar. This shows that the maximum time to react T c + T s provides a good indication for the disturbance rejection properties of the closed-loop system, where the intra-delay sampling controller performs increasingly well for increasingly large sampling speed-ups h.
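The worst-case reaction time discussed above is simple to tabulate. The sketch below assumes a constant T c and T s = T c /h, as in the text; the function name is hypothetical.

```python
def max_reaction_time(Tc, h):
    """Worst-case time to react to a disturbance, Ts + Tc, with Ts = Tc / h."""
    return Tc / h + Tc


Tc = 0.20  # seconds, as in the pendulum example
for h in (1, 2, 3, 5, 10, 100):
    print(h, round(max_reaction_time(Tc, h), 4))
```

For h = 1 this gives 0.4s and for h = 3 approximately 0.2667s, in agreement with the values quoted for the example, and the result approaches T c as h grows.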
Finally, we observe that, apart from where the disturbance forces discontinuities, the control signal in Figure 9 (b) is smoother for h > 1 than for h = 1, reducing the burden on the actuation hardware.
The shift in technology, away from single-core/single-thread processors to massively parallel architectures, opens up the opportunity for new control techniques that were previously not thought feasible.
In this paper we present a novel digital control technique, which we call intra-delay sampling, that enables a digital controller to sample faster than the computational delay involved in computing the control signal. By using a parallel computing infrastructure, we analytically showed 1) that intra-delay sampling is indeed feasible from an implementation point of view and that 2) the sampling time T s is a function of the number of inputs, whereas the computational delay T c is a function of the number of states. Furthermore, we demonstrated the advantages over standard sampling techniques in terms of disturbance rejection and highlighted the basic trade-offs that come with the approach. Although we used specific examples and controller architectures to illustrate the point, the principle of intra-delay sampling applies in a more general context, i.e. it is potentially applicable to other (digital) control techniques than the state feedback LQR case considered here. Especially computationally intensive algorithms, such as model predictive and multiple model control, may benefit from intra-delay sampling type schemes. The derivation of suitable intra-delay sampling implementations of these algorithms, however, will require significantly more research.
Although we did not consider actual hardware implementations of the analysed intra-delay sampling state feedback controller in this paper, we can expect similar qualitative results in practice since the resulting controller structure (an IIR filter) is well studied and there exist efficient hardware implementations (at least on FPGAs) that preserve the discussed timing properties.
Figure 10: RMS error in angle and total time to react (T s + T c ) to a random disturbance.