Abstract-This paper presents a new parallel approach to identification of linear repetitive processes based on subspace algorithms. Parallel realizations of these algorithms are tested on various graphic cards that use NVIDIA CUDA technology. The paper describes implementation of subspace identification algorithms and their parallel speedup, efficiency, throughput, and delay. The parallel approach to the identification of linear repetitive processes based on subspace methods, presented in this paper, is very useful not only for time invariant LRPs but also for processes with dynamics that changes rapidly from pass to pass. A simulation example is included to illustrate the effectiveness of the proposed approach.
Abstract-This paper presents a new parallel approach to identification of linear repetitive processes based on subspace algorithms. Parallel realizations of these algorithms are tested on various graphic cards that use NVIDIA CUDA technology. The paper describes implementation of subspace identification algorithms and their parallel speedup, efficiency, throughput, and delay. The parallel approach to the identification of linear repetitive processes based on subspace methods, presented in this paper, is very useful not only for time invariant LRPs but also for processes with dynamics that changes rapidly from pass to pass. A simulation example is included to illustrate the effectiveness of the proposed approach. Index Terms-state space models, subspace methods, identification algorithms, parameter estimation.state space models, subspace methods, identification algorithms, parameter estimation.s
I. INTRODUCTION
F OR SEVERAL years, the growth of maximum clock frequency of digital implementations has stopped, particularly in general purpose processors. This phenomenon is presented in [1] . It seems that further progress in computer system performance is still possible despite reaching the limit of switching frequency of digital integrated circuits. The further increase in computing power can be achieved by multiplying the number of computational units in multi-core general purpose Digital Signal Processors (DSP), Grapihics Processing Units (GPU), and reconfiguration Field Programmable Gate Array (FPGA) chips [2] .
Subspace identification methods are an attractive alternative to the well-known prediction error methods due to simple and general parameterizations in the MIMO case. They do not need any canonical parameterizations as well. Moreover, no nonlinear optimization is performed and reliable statespace models for complex multi-input multi-output dynamical systems are derived directly from the input-output data. Also, computational complexity of subspace identification methods is modest in comparison with the well-known prediction error methods [3] , [4] , [5] , [10] . Subspace identification algorithms consist of two steps. In the first step, based on the inputoutput data, the repetitive process order is determined and the repetitive process sequence of states (N4SID algorithm) or the extended observability matrix of system (MOESP algorithm) are computed [7] . In the second step, the unknown system matrices are determined.
The repetitive control theory has been an area of intense research since the beginning of 1990's and considerable results have been achieved both in the analysis and synthesis problems, (see: [6] ) for the actual state of art. Contrary to the LRP control theory, the identification of LRPs has attracted very limited attention. The aim of this paper is to propose a new approach to the identification of the LRPs based on subspace algorithms. In this approach, the order of a LRP and the unknown process matrices are determined based on the input and output sequences of the actual pass and the output sequence of the previous pass using parallel implementations of N4SID and MOESP identification algorithms.
Parallelization of algorithm for repetitive processes subspace identification requires creating and processing Hankel matrices. The number of processors that is simultaneously busy depends on the size of matrices storing information defining the spatial variable, and the time variable determining the position on the pass, while the length of each pass is finite [6] .
This paper considers the number of processors used, speedup, efficiency, throughput and computing time of parallel implementations of subspace identification algorithms and compares them to the modified sequential versions of MOESP N4SID algorithms. They were both quantitative and qualitative indicators describing the identification repetitive processes. These algorithms are examined using the input-output data generated by the repetitive process simulator. The paper is organized as follows. Section 2 presents an introduction to the identification of discrete deterministic repetitive processes. Parallel problem identification was formulated, and its solution based on subspace algorithms is given in Sections 3 and 4. Section 5 presents the simulation results. The conclusions are given in Section 6.
II. IDENTIFICATION ALGORITHMS

A. Discrete linear repetitive process model
Consider the state-space model of a discrete linear repetitive process of the following form:
where:
m is the input vector, A, B, B 0 , C, D, D 0 are matrices of appropriate dimensions.
To complete process description, it is necessary to specify the boundary conditions:
n is a vector with known constant entries and f (p) ∈ R l are known functions of p.
Define the following input Hankel block matrix
Define also the output block matrix Y 0|2i−1
The number of block rows i should be lager than the maximum order of the LRP. 
The state-sequence matrix X i is defined as
Define the extended observability matrix Γ i and the reversed extended controllability matrix ∆ i :
Assume also that the pair {A, C} is observable and the pair {A, [BB 0 ]} is controllable. Define the lower block triangular Toeplitz matrix H i
The above block Hankel matrices along with the extended observability matrix, the reversed extended controllability matrix and the lower block triangular Toeplitz matrix methods. Following Theorem 1 (see: [8] ) , the state-space model (1) − (2) can be reformulated in a matrix form:
where X f = X i and X p = X 0 .
B. Identification Problem
Given α measurements of the input u k+1 (p) and the outputs y k (p) and y k+1 (p) measurements generated by the LRP (1) − (2) determine its order and the matrices A, B, B 0 , C, D and D 0 up to within a similarity transformation.
Assuming that the augmented input [u k+1 (p) y k (p)] is persistently exciting of order 2i and the intersection of the row space of U f and the row space of X p is empty, the unknown LRP matrices A, B, B 0 , C, D and D 0 can be computed based on the results of Theorem 2 (see: [8] ) . It can be done in two different ways using N4SID or MOESP. In both these algorithms, the order of the process (1)- (2) can be find based on inspection of the singular value decomposition of the matrix
where W 1 ∈ R lixli and W 2 ∈ R jxj are the user defined weighting matrices and θi is the oblique projection
The order of LRP is equal to the number of non-zero singular values in S 1 . The extended observability matrix Γ i is computed from the the following equation
where T ∈ R n×n is an arbitrary non-singular similarity transformation matrix. N4SID calculates the state sequence X i from the equation
where Γ † i denotes the Moore-Penrose pseudo-inverse of the matrix Γ i . Finally, the matrices A, B, B 0 , C, D and D 0 are determined solving the following set of equations
The state sequence X i+1 is calculated from the equation
where Γ i−1 is the extended observability matrix Γ i without the last l rows and θ i−1 is the oblique projection
In MOESP, the unknown LRP matrices are determined in two steps. In the first step, A and C are calculated from the extended observability matrix Γ i . In the other step, B, B 0 , D, and D 0 are calculated solving a set of equations.
III. PARALLEL IMPLEMENTATION OF MODIFIED N4SID
AND MOESP ALGORITHMS The modified N4SID and MOESP algorithms can be partitioned into the following three modules:
• Input module • Matrix calculation modul • Model testing modul The data input module performs pre-processing operations in spatial context in fine-grained pipelined architecture. The detailed time analysis including technological factors can be found in [9] . The delay in the operations of fine-grained data (in cycles) can be expressed as
where r is the delay of additional variables in the passes. The transport delay can be determined from (19), on basis of the pass length. The component r corresponds to the delay resulting from the use of additional variables, synchronizing data. They are used to distribute evenly the propagation time between the elements in the matrices. The other modules are parallelized in a coarse-grained architecture. The transport delay for operation with coarse-grained data (in cycles) is
Modules of subspace identification algorithms can been partitioned into the following classes performing parallel operations:
• Class Hankel • Class oblique projection • Class SVD • Class observability • Class controllability • Class ABCD The UML activity diagram (Fig. 1) presents parallelization of block Hankel matrices for subspace identification of repetitive processes.
IV. PARALLELIZATION LEVEL INDICATORS
System architecture can be adapted to a particular computational task in itself is not a determinant of performance or efficiency.
||− executing operations simultaneity F σ − operator, σ ∈ (1, .., l). Many operations can be performed simultaneously at the inputs (21). For the evaluation of computational tasks performance in multi-core systems speedup formula was used, (22).
In multiprocessor systems, the actual value of the speedup is less than the theoretical (based on the number of processors) due to communication overheads and the need to share resources such as memory and buses. This property describes a model of efficiency, which is a measure of concurrent use of resources
where: E m − the speedup efficiency with m processors E m ∈ (0; 1).
In the ideal case, the efficiency (23) is 1.0 which means that the speedup is proportional to the number of processor or computing elements.
In the assessment of computing activity is often used as a criterion of time needed to process a specific task or quantum computing. The concept of throughput was introduced, which corresponds to the number of data processed per time unit.
where: P − throughput, D n − the number of computed data, t− measurement time.
Relative to the computing system has the ability to compute the limit, usually in the long term. Throughput (24) relative to the computing system has the ability to compute the limit, usually in the long term. The algorithms were tested on a PC computer with Genuine Intel CPU T2300 @ 1,66 GHz, and 1536 MB RAM NVIDIA GTX 460 graphic card. Figure 2 shows the speedup of a parallelized algorithm N4SID and in Figure 3 Based on these measurements, the speed up was evaluated. Speedups, shown in Fig. 2 and 3 , are equal to formally calculated speedups of parallel algorithms. Parallel implementations run about 5-6 times faster than their sequential versions.
VI. CONCLUSIONS
For Stages 1 and 3, the following indicators: speedup S m , throughput P and computational efficiency E m increase. At the same time, the transport delay L decreases. For Module 2, speedup, throughput P , and efficiency decrease while the delay transport increases. This comes from the calculation of the system order using SVD decomposition. Speedups, shown in Fig. 2 and 3 , are equal to formally calculated speedups of parallel algorithms. Parallel implementations run about 5-6 times faster than their sequential versions. Transport delay is interpreted as the time of calculation. The speedup and throughput efficiency are good candidates for indices to evaluate the degree of parallelization. A parallel algorithm may be useful in selecting fast and using less resources algorithm for identification of stationary linear repetitive processes.
