ABSTRACT Big data is a challenging domain for high-throughput digital signal processing (DSP), especially in global projects like the Square Kilometer Array. The composite input data rate for the correlator in this system is more than 11 Tb/s, which immensely increases the memory requirement, the complexity of the correlation implementation and the overall power dissipation. This paper focuses on computational minimization as well as the improvement of energy efficiency in the architecturally complex X-part of an FX correlator employed in large array radio telescopes. A dedicated correlator-multiplier block, termed the correlator-system-multiplier-and-accumulator (CoSMAC) cell architecture, is proposed, which produces two 16-b integer complex multiplications within the same clock period. The novel hardware optimization is achieved by utilizing the flipped mirror relationship (complex conjugate symmetry) between discrete Fourier transform (DFT) samples, owing to the symmetry and periodicity of the DFT coefficient vectors (twiddle factors). In addition, using the proposed CoSMAC architecture, a new processing element (PE) is designed to calculate both the cross- and auto-correlation functions within the same clock period. This paper describes how the arithmetic processing of three baseline calculations is minimized in the X-part using the proposed novel algorithm and hardware design (CoSMAC and PE). In addition to optimizing the core processing elements, it is possible to eliminate nearly 50% of the usual memory requirement. The design has been synthesized using GlobalFoundries 28-nm HPP CMOS standard cells.
I. INTRODUCTION
Radio interferometers, such as the very-long-baseline interferometer (VLBI) [1], are arrays of telescopes which monitor cosmic and astrophysical occurrences in space. Many ongoing and future interferometers like the Square Kilometer Array (SKA) [1] have a very high input data rate [2] of at least 11 Tbps, which creates a ''BIG DATA'' [3] computational problem for astronomers and design engineers. It significantly raises the order of complexity in the signal processing electronics for storing, processing and transmitting the tera-scale data in an efficient manner. There is thus an enormous need for faster, cheaper, miniaturized low-power electronics to overcome this Big-data challenge in astronomical digital signal processing (ADSP). The correlator plays a major role in image formation in the ADSP architectures of all interferometer types, and the FX correlator has been widely used [4]-[8] in this regard. Unlike the lag correlator [9], [10] (XF type), the FX correlator [7], [11] converts each time-domain signal to the frequency domain in the F-section, followed by the X-section, which performs the multiplication and accumulation over each frequency sample for all the signals and thus directly measures the cross-power spectrum. The correlator size grows as the square of the number of antennas (Na) in the interferometer, multiplied by the total bandwidth (Tb), which explains why the correlator is the most power-consuming unit in very large array telescope structures.
A correlator performs cross-correlation among signal pairs and auto-correlation of each signal with itself to form baselines using complex multiplication and accumulation (CMAC) units [4]-[8], [12]. The correlation product elements of baselines are called ''visibilities'' [13]. A comparison of the matrix correlator architecture [14] used in the Atacama Large Millimeter/submillimeter Array (ALMA) [15], the Expanded Very Large Array (EVLA) [16] and the SKA [6], with the pipeline correlator architecture [17] used in the Allen Telescope Array (ATA), indicates that power consumption can be minimized mostly by using an architecture with minimum memory operations. Hence, the matrix architecture, which requires fewer memory accesses than the pipeline architecture, can be considered superior in this regard.
This work reports a novel computationally minimized architectural approach to correlation multiplication for a three-baseline formation of the matrix architecture, along with reduced memory usage. A dedicated CoSMAC cell architecture is proposed which produces two 16-bit integer complex multiplications (two visibilities) of a single cross-correlation baseline within the same clock. Also, a new processing element (PE) design based on the CoSMAC is proposed which uses two input registers and four accumulators to produce six visibilities for three baselines. For an N-point (N = 2^r, r an integer) F-section transform, the baseline cross-correlations for (N/2 − 1) frequency samples can be utilized to produce the baseline cross-correlations for the other (N/2 − 1) conjugate frequency samples at almost zero cost. This paper describes a novel algorithm to perform this computational minimization of baseline calculations in the X-part of the FX correlator. Hence, the core idea is to reduce the hardware requirement by around 50% without degrading the speed and performance of a real-time correlator. Only (N/2 + 1) frequency samples are required for computing the full visibility spectrum of the cross- and auto-correlation baselines. As a result, both high speed and computational efficiency will be achieved for the X-section by using this algorithm and cell architecture for calculating all the baselines. The proposed computational minimization thus renders a suitable mechanism to handle some ''BIG DATA'' problems.
Large radio interferometers often do not transmit or process the (N/2 − 1) conjugate frequency samples in order to save memory and hardware cost, but in this way almost half of the cross-correlation visibilities will not be available for image formation unless the conjugate part is processed in software (at some extra cost) at the back-end after the X-part hardware. The described technique thus complements the overall FX correlator by producing the complete visibility spectrum in real-time using only (N/2 + 1) transmitted channels from the F-section. This constitutes a major contribution of this paper.
The paper illustrates the novel architecture through the implementation of three baseline calculations, which epitomize the core idea for minimized implementation of the entire X-section, consisting of a large number of baseline calculations. The data-flow within the CoSMAC and PE cells is quite simple, and, since only half the channels are needed, there is a consequent reduction of the data-flow management overhead for the overall X-section. The matrix architecture in [8] can employ the proposed new cells and benefit from the drastically reduced computational requirements. It is assumed here that the samples from the F-section are corner turned [18], [19] and that each input signal is pre-processed using an individual filter bank. In the correlator section, each input signal is assumed to be a quantized, discrete, frequency-channel sequenced complex sample (with real and imaginary components). The input registers are designed to store the complex sample as a 16-bit two's complement number (8-bit real + 8-bit imaginary). The use of an 8-bit word length for the real and imaginary components yields a high dynamic range and low quantization noise. The accumulation time period (the ''dump rate'') of the correlated data is fixed [20] for all the baselines. Both ASIC (VLSI) and FPGA design flows are considered in this work in order to emphasize the efficiency of the proposed architecture.
II. A GENERIC FX CORRELATOR AND CMAC
In this section, an FX correlator architecture based on [13] and the conventional (existing) CMAC used to perform cross- and auto-correlations are discussed, along with a general introduction to the associated computational matrices. Let us consider Na dual polarized antennas [21]-[23] producing 2Na input signals for the correlator, which are oversampled at the rate Ns and produce outputs defined as x_{i,n}[m]. Fig. 1 illustrates the generic architecture of such an FX correlator with Na dual polarized antennas. Here i ranges from 0 to (2Na−1), n varies from 0 to (N−1) (with N being the number of discrete time samples), and m ranges from 0 to (T−1) (with m being the time-slice index for the dump period T). The F-section converts these time sampled signals x_{i,n}[m] into frequency sampled (DFT) signals X_{i,k}[m] using FFT/filter banks, where k is the index of the discrete frequency samples, varying from 0 to (N−1). Thus, in general, the maximum possible number of frequency channels generated by the F-section is Nfc = N × 2Na. These channels are corner turned and fed as input signals to the X-section, forming 2Na² − Na cross-correlation and 2Na auto-correlation baselines for a total of 2Na² + Na baselines [4]. As mentioned in the introduction, the actual number of frequency channels transmitted by the F-section depends on its specific implementation to save memory and logic. Each pair of distinct input signals forms a cross-correlation baseline, while each input signal correlated with itself forms an auto-correlation baseline. Considering all N channels [13], each baseline has N complex multiplications, which are the visibilities, and these are accumulated individually over a certain dump period [24] of T before being read out. Thus, for Na antennas, the full visibility spectrum consists of (2Na² + Na)N visibilities.
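The baseline and visibility counts above follow directly from the number of input signals. The short sketch below reproduces that arithmetic; the function name and the example values (4 antennas, 64 channels) are illustrative assumptions, not figures from this paper.

```python
def baseline_counts(num_antennas: int, num_channels: int) -> dict:
    """Count baselines and visibilities for Na dual-polarized antennas."""
    inputs = 2 * num_antennas                # 2Na input signals
    cross = inputs * (inputs - 1) // 2       # distinct signal pairs = 2Na^2 - Na
    auto = inputs                            # each signal with itself = 2Na
    total = cross + auto                     # 2Na^2 + Na baselines
    return {
        "inputs": inputs,
        "cross": cross,
        "auto": auto,
        "total": total,
        "visibilities": total * num_channels,  # (2Na^2 + Na) * N
    }


if __name__ == "__main__":
    # Hypothetical example: 4 antennas and 64 frequency channels.
    print(baseline_counts(4, 64))
```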
The cross-correlation of two frequency samples (the k-th discrete frequency samples) of the input signals labelled i and j, X_{i,k}[m] and X_{j,k}[m], is represented as Co_q, where q is the index of the cross-correlated baseline, ranging from 1 to 2Na² − Na, so that,
Next, as a computational example, we consider two frequency sampled input signals X_{0,k}[m] and
where,
And,
Here, m = 0, 1, ..., (T−1) for the T temporal slices. The channel-wise cross-correlation products between (2) and (3) produce a baseline Co_1, which can be considered algebraically as the extracted diagonal elements of the product of an N×T matrix (N DFT samples of the first signal for T time slices) with the conjugate transpose of another N×T matrix (N DFT samples of the second signal for T time slices), and can be given by the MATLAB diag function as in (4). The N-tuple shown in (5) is the result of the diagonal element extraction operation in (4). The rows in (5) are the ''visibilities'' for this cross-correlation baseline. They are thus the result of the channel-wise multiplication of ''time-stamped'' DFT samples of two signals at a particular temporal instance m, and their accumulation over the temporal dump period T. Based on (5), N CMACs are required to calculate a single cross-correlation baseline.
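The equivalence between the diagonal of the matrix product and a per-channel multiply-accumulate can be checked numerically. The following is a minimal sketch in plain Python, assuming N×T sample matrices stored as lists of rows; the helper names and the small 2×2 test data are hypothetical and chosen only for illustration.

```python
def conj_transpose(matrix):
    """Return the conjugate transpose of a list-of-rows complex matrix."""
    rows, cols = len(matrix), len(matrix[0])
    return [[matrix[r][c].conjugate() for r in range(rows)] for c in range(cols)]


def matmul(a, b):
    """Plain matrix product of two list-of-rows matrices."""
    return [[sum(a[i][t] * b[t][j] for t in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def baseline_by_diag(x0, x1):
    """Diagonal of x0 times the conjugate transpose of x1 (both N x T)."""
    product = matmul(x0, conj_transpose(x1))
    return [product[k][k] for k in range(len(product))]


def baseline_by_cmac(x0, x1):
    """One complex multiply-accumulate per channel, as N CMACs would compute."""
    return [sum(a * b.conjugate() for a, b in zip(row0, row1))
            for row0, row1 in zip(x0, x1)]


if __name__ == "__main__":
    # Hypothetical data: N = 2 channels (rows), T = 2 time slices (columns).
    x0 = [[1 + 2j, 3 - 1j], [0 + 1j, 2 + 2j]]
    x1 = [[2 - 1j, 1 + 1j], [1 + 0j, -1 + 3j]]
    print(baseline_by_diag(x0, x1))
    print(baseline_by_cmac(x0, x1))
```

The diagonal route computes N×N products and discards all but N of them, which is why a hardware X-section implements only the per-channel accumulation.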
Thus for one baseline, in general, if N = 64, it requires 128 16-bit (8-bit real+8-bit imaginary) input registers to store 128 samples along-with 64 CMACs in a generic X-section implementation producing all the 64 visibilities. Each CMAC clusters four real multipliers, an adder and a subtractor to produce a complex product as well as two accumulators (a complex accumulator) to integrate the products over the temporal dump period T as shown in the dual tree CMAC computation flow in Fig. 2 . Hence 256 real multipliers and 128 adders/subtractors are required to calculate full visibility spectrum of one cross-correlation baseline in real-time. Other schemes such as Gauss multiplication [25] which needs 3 multipliers and 4 to 5 adders per complex multiplication can also be employed, but the generic complex multiplier is utilized in this work for various comparisons. In addition, N implementations of conjugate operation (in this case 64 sign conversions) as shown in Fig. 2 are also required in calculating one baseline regardless of the complex multiplication algorithm employed.
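The resource figures quoted above scale linearly with N. The sketch below tabulates them for a generic CMAC-based X-section computing one cross-correlation baseline; the function and key names are illustrative assumptions.

```python
def generic_x_section_resources(num_channels: int) -> dict:
    """Hardware count for one cross-correlation baseline using N generic CMACs."""
    cmacs = num_channels                      # one CMAC per frequency channel
    return {
        "input_registers": 2 * num_channels,  # one 16-bit register per sample, two signals
        "cmacs": cmacs,
        "real_multipliers": 4 * cmacs,        # four real multipliers per complex product
        "adders_subtractors": 2 * cmacs,      # one adder and one subtractor per CMAC
        "accumulators": 2 * cmacs,            # real and imaginary accumulators
        "conjugations": num_channels,         # one sign conversion per channel
    }


if __name__ == "__main__":
    print(generic_x_section_resources(64))
```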
III. PROPOSED NEW COMPUTATIONALLY MINIMIZED X-PART AND EFFICIENT PROCESSING ELEMENT (PE)
In this section a novel algorithm is proposed for producing the cross-correlations for the (N/2 − 1) conjugate frequency samples without the conjugate channels being transmitted by the F-part. Hence, the full visibility spectrum can be produced in real-time for image formation without the cost of storage and computations for the conjugate channels. A minimized correlation computation example for three baselines formed by two signals (one cross-correlation baseline and two auto-correlation baselines) is explained in detail.
FIGURE 2. CMAC cell with complex conjugation block at the input to perform cross-correlation between two complex samples.
(4), the input matrix transforms to one as shown in (6) below:
Or,
. . .
The application of mirror image symmetry therefore eliminates the requirement of N conjugate operations per baseline, as shown in (7) as compared to (5), and hence, in total, eliminates (2Na² + Na)N conjugate operations for (2Na² + Na) baselines. An alternative computationally minimized correlation algorithm is developed next, using (7), in the following sub-section. This leads to equations for a new dedicated multiplier-accumulator specifically for performing optimized correlation in a correlator system implementation. Based on these equations, a new VLSI architecture is proposed to produce six visibilities using a single processing element (PE).
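The mirror image symmetry used here can be sketched in two lines. The derivation below assumes real-valued time samples x_{i,n}[m] and the common negative-exponent DFT sign convention; both are assumptions made for this illustration, and the notation may differ from the paper's own equations (6) and (7).

```latex
\begin{align*}
X_{i,N-k}[m] &= \sum_{n=0}^{N-1} x_{i,n}[m]\, e^{-j 2\pi n (N-k)/N}
             = \sum_{n=0}^{N-1} x_{i,n}[m]\, e^{+j 2\pi n k/N}
             = X_{i,k}^{*}[m], \\
X_{0,N-k}[m]\, X_{1,N-k}^{*}[m] &= \bigl( X_{0,k}[m]\, X_{1,k}^{*}[m] \bigr)^{*}.
\end{align*}
```

The first line shows that a conjugated sample is already available as the mirrored channel, and the second shows that the correlation product at channel N−k is the conjugate of the product at channel k, so it need not be computed separately.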
B. X-PART MINIMIZATION AND CORRELATOR-SYSTEM-MULTIPLIER-AND-ACCUMULATOR (CoSMAC)
Now, we consider the frequency samples X_{0,1}[0] = Ry + jIy (say) and
respectively. We then have the cross-correlations (8) and (9), which also constitute the 2nd and the last row elements in (7), respectively. It is evident that the correlation magnitudes in both cases are the same and that the outputs are the conjugates of each other. Hence, considering this inherent redundancy for a pair of cross-correlations, it is sufficient to calculate just one cross-correlation, while the other can be obtained by applying the conjugation property. Exploiting this property, the computation of (7) can be partitioned into two redundant column-matrix parts: a 1st part with (N/2 + 1) rows and a 2nd part with (N/2 − 1) rows. The 1st part is given in (11), with the algebraically corresponding input matrix product (of (N/2 + 1) DFT samples of two signals for T time slices) containing this part as the diagonal trace being given by (10), which can be derived by backtracking from (11). The DFT samples in the first and the last rows of the 1st part, and their cross-correlation products, form a unique set that is calculated independently. As per the discussion above, the lower (N/2 − 1) rows of (7) (the 2nd part) are obtained from the conjugate of the column matrix formed by the 2nd to the 2nd-last rows of the 1st part in (11), in flipped (mirror image) order, whose equivalent forms are provided in (12) and (13) as the (N/2 − 1)-tuple. Thus, exploiting the complex conjugate symmetry of the cross-correlation products, the number of required arithmetic operations is reduced to only (N/2 + 1) to determine all the N visibilities per baseline. The (N/2 − 1) conjugate frequency samples (conjugate channels) are not needed to account for their cross-correlation visibility contribution, which is thus obtained at almost zero cost.
In a similar fashion, the channel-wise auto-correlations (auto-correlation baselines) Ao_1 and Ao_2 for these two signals can also be obtained without redundant arithmetic. For example, the generation of Ao_1 is shown in (14) to (17). Here (14) shows the diagonal trace extraction from the auto-correlation matrix product. As before, redundancy is exploited in the generation of Ao_1, since the auto-correlation of a frequency sample X_{0,k}[m] is exactly equal to the auto-correlation of its mirror-image sample X_{0,N−k}[m], as the samples are the complex conjugates of each other. Next, (16) shows the 1st part of Ao_1, with the corresponding input matrix product containing this part as the diagonal trace being given by (15), while (17) shows the 2nd part of Ao_1, generated by directly utilizing the 2nd to the 2nd-last rows of (16) in flipped (mirrored) order, as all the product accumulations are real. The generation of Ao_2 follows exactly the same procedure as for Ao_1.
Accordingly, the above derivations based on (8) and (9) indicate that the requirement of 2N input registers reduces to just (N+2) for calculating one baseline. Also, the presented computational minimization algorithm results in a nearly 50% reduction of the arithmetic and storage memory requirement in calculating a single baseline, since the N arithmetic operations are reduced to just (N/2 + 1). Hence, using only (N/2 + 1) DFT samples from the F-section, the entire visibility spectrum can be produced. In addition, a correspondingly large energy saving is obtained.
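The minimization can be illustrated with a small numerical experiment. The sketch below assumes real-valued time-domain inputs and the standard visibility definition (sample of the first signal times the conjugate of the sample of the second signal); it accumulates only channels 0 to N/2 and derives the remaining N/2 − 1 visibilities by conjugation in mirrored order. The function names and test signals are hypothetical and this is not the authors' implementation.

```python
import cmath


def dft(samples):
    """Direct N-point DFT of a list of (real) time samples."""
    n = len(samples)
    return [sum(x * cmath.exp(-2j * cmath.pi * i * k / n)
                for i, x in enumerate(samples)) for k in range(n)]


def full_baseline(spec0, spec1):
    """Reference: accumulate all N channels over the T spectra in spec0/spec1."""
    n = len(spec0[0])
    return [sum(s0[k] * s1[k].conjugate() for s0, s1 in zip(spec0, spec1))
            for k in range(n)]


def minimized_baseline(spec0, spec1):
    """Accumulate channels 0..N/2 only, then mirror-conjugate the rest."""
    n = len(spec0[0])
    half = n // 2
    first = [sum(s0[k] * s1[k].conjugate() for s0, s1 in zip(spec0, spec1))
             for k in range(half + 1)]                         # N/2 + 1 products
    second = [v.conjugate() for v in reversed(first[1:half])]  # N/2 - 1, no multiplies
    return first + second


if __name__ == "__main__":
    # Hypothetical real-valued inputs: T = 3 time slices of N = 8 samples each.
    sig0 = [[1, -2, 3, 0, 4, -1, 2, 5], [0, 1, 1, -3, 2, 2, -4, 1], [3, 3, -1, 0, 0, 2, 1, -2]]
    sig1 = [[2, 0, -1, 4, 1, 1, -3, 2], [5, -2, 0, 1, 3, -1, 2, 2], [-1, 4, 2, 2, 0, -3, 1, 1]]
    spec0 = [dft(s) for s in sig0]
    spec1 = [dft(s) for s in sig1]
    ref = full_baseline(spec0, spec1)
    mini = minimized_baseline(spec0, spec1)
    print(max(abs(a - b) for a, b in zip(ref, mini)))
```

The printed maximum difference should be at floating-point rounding level, which is the sense in which the second part is obtained at almost zero cost.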
The 1st part of Co_1 in (7) is defined in (10) or, equivalently, in (11). The 2nd part of Co_1 is defined in (12). Also, equivalently, using the mirror image symmetry, the 2nd part of Co_1 is defined in (13) and (14). The 1st part of Ao_1 is defined in (15) and (16), and the 2nd part of Ao_1 is defined in (17).
Utilizing the above minimized arithmetic baseline computations, Fig. 5 shows the new dedicated correlator-system-multiplier-and-accumulator (CoSMAC), which provides the product components for two mirror-image symmetric (complex conjugate) cross-correlations in a single clock cycle. Unlike the generic CMAC, the CoSMAC performs the conjugation in a single final step (a single sign conversion) after the product accumulation over the dump time T (T time slices), thus saving power.
Now let us explore a bit further the equivalent forms of the auto-correlations of the same signal samples,
As evident from (18) and (19) the auto-correlation of a complex sample (in any of the equivalent forms using complex conjugate symmetry) involves only two real multipliers and the output has no imaginary part, which is also equal to the auto-correlation of its conjugate sample. Thus, a single CoSMAC can be used to calculate four auto-correlations (dual co-incident auto-correlation pairs). To summarise, one CoSMAC can calculate two cross-correlation visibilities or four auto-correlation visibilities for a baseline in one clock cycle. The design of a single hardware unit to calculate all of these six visibilities for three baselines in one clock cycle helps to reduce the area and power utilization and is discussed in the following sub-section.
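A behavioral sketch of the CoSMAC idea may help fix the data-flow. The model below assumes that the two operands are multiplied without any input conjugation, that the real and imaginary product components are accumulated separately, and that the second visibility is obtained by a single sign conversion of the accumulated imaginary part at dump time. The class and method names are hypothetical, and the model is not the authors' RTL.

```python
class CoSMACModel:
    """Behavioral sketch of one CoSMAC cell (illustrative, not the authors' RTL)."""

    def __init__(self):
        self.acc_re = 0   # accumulated real part, common to both visibilities
        self.acc_im = 0   # accumulated imaginary part of the first visibility

    def clock(self, ry, iy, rz, iz):
        """One clock: four real multiplications on (Ry + jIy) and (Rz + jIz)."""
        self.acc_re += ry * rz - iy * iz
        self.acc_im += ry * iz + iy * rz

    def dump(self):
        """Read out both visibilities after the dump period and clear."""
        first = (self.acc_re, self.acc_im)
        second = (self.acc_re, -self.acc_im)   # single sign conversion at the end
        self.acc_re = self.acc_im = 0
        return first, second


if __name__ == "__main__":
    cell = CoSMACModel()
    # Hypothetical 8-bit integer samples for two time slices.
    for ry, iy, rz, iz in [(3, -2, 1, 4), (-5, 7, 2, -1)]:
        cell.clock(ry, iy, rz, iz)
    print(cell.dump())
```

Deferring the sign conversion means one conjugation per dump period instead of one per clock, which is the power saving the text attributes to the CoSMAC.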
C. ARCHITECTURE AND OPERATION OF A COMPUTATIONALLY MINIMIZED PROCESSING ELEMENT (PE) FOR THE X-PART
Expanding on the CoSMAC arithmetic processor of Fig. 5, the PE shown in Fig. 6 adds input multiplexers, output demultiplexers and two additional dedicated accumulators to the CoSMAC block in order to perform both auto- and cross-correlations. The two additional accumulators enable separate storage of the final auto- and cross-correlation outputs after the dump period T. The two 16-bit registers are loaded at every clock interval with two new signal samples at a certain frequency, Ry+jIy and Rz+jIz, each of which is an 8-bit + 8-bit two's complement complex number. When SEL is 'HIGH', the multiplexers in the input section select the operands for cross-correlation, and the PE produces the real and imaginary parts of a product element in (11) and (12) (similar to (8) and (9)), along with the sign conversion of the imaginary part for the complex conjugate cross-correlation product (for the conjugate channel) in (12). At the same time, the demultiplexers select the dedicated accumulators for cross-correlation and enable integration of the products over the dump time period T. Similarly, when SEL is 'LOW', the multiplexers choose the auto-correlation operands and the demultiplexers select the dedicated accumulators for auto-correlation in order to integrate the auto-correlation outputs (18) and (19). After the integration (accumulation) over the dump time period T, the data from the accumulators are read out for the next process step. Based on the above description, Table 1 summarizes the visibilities computed by a single PE after the integration time period T. Accumulators 1 and 3 yield the two cross-correlation visibilities for the first baseline, with accumulator 1 yielding the real coordinate value that is common to both visibilities and accumulator 3 generating the imaginary coordinate value for the first visibility. The output of accumulator 3 is then fed into the conjugation unit (two's complement sign alteration unit) to produce the imaginary coordinate for the second visibility (as per the deductions in sub-section B) after the integration time period T.
Both of these visibilities belong to the 1st baseline (the cross-correlation baseline). Similarly, accumulators 2 and 4 produce the first and the third auto-correlation visibilities, which are identical to the second and the fourth auto-correlation visibilities respectively because of the co-incident conjugate auto-correlations. These visibilities belong to the 2nd and the 3rd baselines respectively. On the other hand, without the above computational minimization, four CMACs (one CMAC for each cross-correlation and one CMAC for each co-incident auto-correlation pair) would be required to calculate all six visibilities for the same three baselines. In other words, the existing technique (present state-of-the-art) involves sixteen multipliers, eight adder/subtractor blocks and eight accumulators, whereas the proposed computationally minimized PE uses only four multipliers, two adder/subtractor blocks and four accumulators for the operation. The extra multiplexers and demultiplexers in the proposed new PE design are only a small overhead compared to the larger number of multipliers and adder/subtractor blocks needed by the generic method. Overall, in the proposed architecture implementing the computationally minimized algorithm, all six visibilities are obtained employing just one PE, which results in high efficiency in terms of silicon area and energy utilization. The data-flow within the CoSMAC (Fig. 5) and PE (Fig. 6) cells is quite simple and regular when compared with the existing CMAC (Fig. 2) cells. Hence these proposed new cells for minimized baseline calculations do not impose any extra data-flow management overhead compared to the existing state-of-the-art.
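The PE operation described above can be sketched as a small behavioral model with four accumulators and a SEL input. The accumulator roles follow the description in the text; how SEL is sequenced relative to the sample clock, and all class and method names, are assumptions made for this illustration.

```python
class PEModel:
    """Behavioral sketch of the PE: four multipliers, two add/sub blocks, four accumulators."""

    def __init__(self):
        self.acc = [0, 0, 0, 0]   # acc1..acc4 as named in the text

    def clock(self, ry, iy, rz, iz, sel):
        """Process the registered samples Ry + jIy and Rz + jIz for one SEL level."""
        if sel:   # SEL HIGH: cross-correlation operands and accumulators
            self.acc[0] += ry * rz - iy * iz   # acc1: real part, common to both visibilities
            self.acc[2] += ry * iz + iy * rz   # acc3: imaginary part of the first visibility
        else:     # SEL LOW: auto-correlation operands and accumulators
            self.acc[1] += ry * ry + iy * iy   # acc2: auto-correlation of the first signal
            self.acc[3] += rz * rz + iz * iz   # acc4: auto-correlation of the second signal

    def dump(self):
        """Return the six visibilities for the three baselines and clear."""
        acc1, acc2, acc3, acc4 = self.acc
        result = {
            "cross": [(acc1, acc3), (acc1, -acc3)],   # conjugation unit on acc3
            "auto_first": [acc2, acc2],               # co-incident pair, purely real
            "auto_second": [acc4, acc4],
        }
        self.acc = [0, 0, 0, 0]
        return result


if __name__ == "__main__":
    pe = PEModel()
    for sample in [(3, -2, 1, 4), (-5, 7, 2, -1)]:   # hypothetical 8-bit samples
        pe.clock(*sample, sel=True)
        pe.clock(*sample, sel=False)
    print(pe.dump())
```

Both SEL levels use exactly four real products and two add/subtract operations, which is why the same arithmetic core can serve the cross- and auto-correlation paths.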
IV. DESIGN AND PERFORMANCE COMPARISON OF THE PROPOSED CoSMAC AND PE CELLS
In this section, the design of the new CoSMAC and PE processing cells for implementing the computationally minimized X-part is discussed and compared with the conventional (existing) CMAC. An ASIC (VLSI) implementation in 28 nm CMOS technology as well as an FPGA-based design are considered to emphasize the efficiency of the proposed correlator cells over the existing state-of-the-art CMAC [8], [15], [16] with respect to power (energy) and area utilization. However, the existing (present state-of-the-art) CMAC cell designs are in older process technologies [15], [16], whose area and power efficiency is not directly comparable with the proposed CoSMAC and PE implementations in 28 nm CMOS technology. Nevertheless, in general, since the supply voltage is scaled down to 0.85 V for the 28 nm CMOS process, the proposed designs will be inherently more power efficient than previous CMAC cells implemented in older technologies (larger channel lengths) with higher supply voltages. Also, the chip size will be much smaller for the proposed new cells in the advanced CMOS technology. For a better comparison, the existing CMAC cell was re-designed in the 28 nm CMOS process technology and then compared with the proposed PE and CoSMAC cells in the same 28 nm CMOS technology. This comparison, provided below, also indicates the improvement over the present state-of-the-art CMAC cells when the same technology is employed.
The CMAC-based architecture proposed in [8] for baseline calculations over 1024 frequency channels uses a 128 MHz global clock. This means that the CMAC cells in that implementation are also operated in the range of 128 MHz. The operating frequency of the proposed ASIC CoSMAC and PE cells is 400 MHz, with a corresponding throughput rate of 25.6 Gbps per CoSMAC cell and 51.2 Gbps per PE cell. Considering that only half the channels are needed in this novel implementation, the virtual throughput rates per cell will double, and in a parallel stream implementation the proposed cells can comfortably handle the 11 Tbps input data rate using an array of a few hundred cells. The proposed implementation would thus be more capable of processing the 11 Tbps overall system input data rate than the various existing state-of-the-art FX correlator based interferometry schemes.
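The "few hundred cells" estimate can be reproduced with a back-of-the-envelope calculation from the quoted rates. The sketch below simply divides the composite input rate by the per-cell throughput; the function name and the `channel_fraction` parameter are assumptions for illustration, and the calculation ignores how the input stream is actually distributed over baselines.

```python
import math


def cells_needed(input_rate_bps, cell_rate_bps, channel_fraction=1.0):
    """Number of parallel cells needed to absorb the given input rate."""
    return math.ceil(input_rate_bps * channel_fraction / cell_rate_bps)


if __name__ == "__main__":
    input_rate = 11e12    # 11 Tbps composite input rate
    cosmac_rate = 25.6e9  # per CoSMAC cell at 400 MHz
    pe_rate = 51.2e9      # per PE cell at 400 MHz
    for name, rate in (("CoSMAC", cosmac_rate), ("PE", pe_rate)):
        print(name,
              cells_needed(input_rate, rate),        # all channels processed
              cells_needed(input_rate, rate, 0.5))   # only half the channels needed
```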
The Verilog hardware description language (HDL) is utilized to create the register transfer level (RTL) design for both the proposed and the conventional (existing) cells. For the FPGA design flow, the Xilinx compilation tool is employed to optimize the design for speed and area based on the Virtex-6 FPGA [28]. In the ASIC implementation, the 28 nm HPP (high performance process) GF CMOS standard cell library is used to synthesize the design, along with timing constraints based on the clock frequency f_ck = 400 MHz and dump time T = 64 (64 time slices). The physical layer standard cells in this deep nanometric CMOS library are available with various process-voltage-temperature (PVT) corners [29]. In order to evaluate the worst-case situation, CMOS standard cells with the slow corner specification, along with a 0.765 V drain-to-source voltage and a −40 °C operating temperature, were instantiated. The synthesis was carried out using the Cadence Encounter RTL Compiler. Fig. 7 depicts the Verilog simulation results for a single CoSMAC and a single PE cell considering spectral samples of two input signals, based on the above timing constraints. The input data are 8-bit integers and the output data are 32-bit integers. The processing cells calculate the correlation for the dump period of T = 64 (64 time slices), after which the data is read out from the accumulator registers. The registers are then cleared to start the next iteration. Table 2 summarizes the power and area utilization for a conventional (existing) CMAC cell [8], [15], [16] re-synthesized in 28 nm HPP CMOS, and for the proposed new CoSMAC and PE cells obtained through the same 28 nm HPP CMOS ASIC synthesis. Although the power dissipation and area of the conventional (existing) CMAC cell appear to be just slightly lower, it calculates only one cross-correlation visibility per baseline [8]. On the other hand, the CoSMAC cell, which is designed to exploit the computationally minimized X-part algorithm, calculates two cross-correlation visibilities per baseline, utilising just slightly more area and dissipating almost the same power. The slightly higher area is due to the 32-bit conjugation operation at the end (at the output) of the CoSMAC, compared to the 8-bit conjugation at the beginning (at the input) of the CMAC cell. Calculation of two cross-correlation visibilities using existing state-of-the-art CMACs in the conventional architecture [8], [13] would require two of these cells and hence incur twice the power dissipation of a single CMAC cell, which is much higher than that of one CoSMAC cell. On the other hand, as discussed before, a PE cell (a progression from the CoSMAC cell) can calculate six visibilities (two cross-correlations and four auto-correlations) within the same clock interval, so that the overall area and power consumption per visibility, as shown in Table 2, is lower than for the stand-alone CoSMAC and far lower than for the conventional (existing) CMAC.
FIGURE 7. Verilog simulation outputs for the new CoSMAC and PE cells for the timing constraints of f_ck = 400 MHz and dump time T = 64. For the CoSMAC, real_co is the common real part of the two cross-correlations, and img1_co and img2_co are the imaginary parts of the two cross-correlations. For the PE, pe_re_co is the common real part of the two cross-correlations, pe_im1_co and pe_im2_co are the imaginary parts of the two cross-correlations, and pe_A1_co and pe_A2_co are the two products for the two co-incident auto-correlation pairs. The PE employs the internally wired level-triggering mux-control signal SEL.
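The stated word lengths (8-bit inputs, 32-bit outputs, T = 64) can be sanity-checked with a worst-case bit-growth estimate. The sketch below assumes two's complement samples and that each accumulated term is a sum or difference of two 8×8 products; the function name is hypothetical and the bound is a simple arithmetic check, not a statement about the synthesized design.

```python
def accumulator_bits_needed(sample_bits: int = 8, dump_slices: int = 64) -> int:
    """Signed bits needed to hold the worst-case accumulated correlation term."""
    most_negative = -(1 << (sample_bits - 1))      # -128 for 8-bit two's complement
    worst_term = 2 * most_negative * most_negative # e.g. Ry*Ry + Iy*Iy with both at -128
    worst_sum = worst_term * dump_slices           # accumulated over the dump period
    return worst_sum.bit_length() + 1              # one extra bit for the sign


if __name__ == "__main__":
    bits = accumulator_bits_needed()
    print(bits, bits <= 32)
```

Under these assumptions the 32-bit accumulators leave headroom, so longer dump periods would fit without widening the registers.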
A. ASIC DESIGN RESULTS AND DISCUSSIONS
Meanwhile, the calculation of six visibilities using conventional (existing) CMAC cells can be implemented in two ways: either by allocating an individual CMAC cell to each visibility or by using the concept of CMAC reuse [9]. The former method results in a large increase in both area and energy consumption, whereas the latter approach utilises less area but at the expense of three times the dissipation of a single conventional (existing) CMAC cell. Overall, the PE cell, with the lowest area and energy per visibility, proves to be an efficient alternative to the conventional (existing) CMAC. Considering the calculations for one cross-correlation baseline for varying DFT sample length N (= 2^r, r = 3, 4, 5, 6, 7), N CMACs are required using a traditional (existing) CMAC array, and (N/2 + 1) CoSMACs using a CoSMAC array.
It is evident that as N increases, the power consumed also increases linearly. The power consumed by a CoSMAC array is almost half of that consumed by a CMAC array, as shown in Fig. 8(a), using the same 28 nm HPP CMOS process technology. Next, the calculation of one cross-correlation and two auto-correlation baselines is considered. For this comparison the CMAC and the CoSMAC arrays are assumed to implement a reuse scheme [9] for calculating the auto-correlations in a second clock cycle. On the other hand, (N/2 + 1) PEs with four accumulators each compute the auto-correlation baselines in the same clock interval as the cross-correlation baseline by employing the SEL mux-control.
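The array sizes for the values of N considered here are easy to tabulate. The sketch below lists N, the number of CMACs, the number of CoSMACs and their ratio; the function name is an illustrative assumption.

```python
def array_sizes(r_values=(3, 4, 5, 6, 7)):
    """Cells per cross-correlation baseline for N = 2**r DFT samples."""
    rows = []
    for r in r_values:
        n = 2 ** r
        cmacs = n                 # one CMAC per channel
        cosmacs = n // 2 + 1      # one CoSMAC per non-redundant channel
        rows.append((n, cmacs, cosmacs, cosmacs / cmacs))
    return rows


if __name__ == "__main__":
    for n, cmacs, cosmacs, ratio in array_sizes():
        print(f"N={n:4d}  CMACs={cmacs:4d}  CoSMACs={cosmacs:4d}  ratio={ratio:.3f}")
```

The ratio approaches one half as N grows, which matches the "almost half" trend described for the array comparison.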
The CMAC and CoSMAC arrays will incur additional power dissipation due to reuse, and the resulting comparison is shown in Fig. 8(b). The plot indicates that the use of a PE array to calculate the auto-correlation baselines along with the cross-correlation baseline of a pair of signals results in lower power consumption than a CMAC or a CoSMAC array. Additionally, the performance of the PE array and the CoSMAC array is a ''constellation'' apart from that of the CMAC array in energy efficiency. In addition to power estimates, the indicators of area utilization in each case can also be determined. As the CoSMAC-based implementation uses fewer cells than the traditional (existing) CMAC-based architecture in calculating cross-correlation baselines, the area efficiency gap widens with N, as shown in Fig. 9(a). In the case of one cross- and two auto-correlation baselines, the reuse CoSMAC array appears to have the highest area efficiency, as shown in Fig. 9(b). The PE array, due to its larger cell size, is found to be only slightly better than the CMAC array, but with increasing area efficiency for higher values of N. However, since the PE computes auto- and cross-correlation visibilities in the same clock cycle, it is overall the most efficient considering power consumption. The proposed CoSMAC and PE cells can be combined to provide more area- and power-efficient solutions than the current (existing state-of-the-art) CMAC-based architectures in constructing correlators for large array radio interferometers. In Figs. 8 and 9 the comparisons with the PE are shown separately in Fig. 8(b) and Fig. 9(b)
Logic components m1, m2, m3 and m4 are 8×8 multipliers; t1, t2, t3 and t4 are 8-bit input registers to store Ry, Rz, Iy and Iz; acc1, acc2, acc3 and acc4 are 32-bit accumulators; c1 is a complex conjugate block; s1 in (c) is a 16-bit adder/subtractor whereas s1 in (a) and (b) is a 16-bit subtractor; aa1 is a 16-bit adder; d1 and d2 are the output register select blocks; and mu1, mu2, mu3 and mu4 are the input select blocks. The SEL signal for PE is derived from the clock, it is internally wired and hence not shown at the cell edge in (c).
single CoSMAC will be needed to just perform the cross-correlation baseline for the pair [X2, X3].
The overall power and area usage of the re-designed CMAC and the new CoSMAC and PE arrays, obtained from the 28 nm CMOS standard-cell ASIC synthesis for different values of N, are summarized in Table 3. It can be noticed that for calculating three baselines (one cross-correlation and two auto-correlation baselines) the area required by the CMAC and CoSMAC cells increases slightly. This is due to the need to switch (multiplex) connections to the four multipliers for calculating the auto-correlation baselines during reuse.
The mask layouts of the re-designed conventional (existing) CMAC cell and the proposed new CoSMAC and PE cells, using the 28 nm GF CMOS HPP process with 0.85 V supply voltage, are shown in Fig. 10 (a), (b) and (c) respectively. The synthesis of all the three cells was performed using the SC12MC ARM standard cells available for this GF process. The routing was implemented using a five metal BEOL (back-end-of-line) stack.
B. FPGA DESIGN
The same gate-level designs of the CMAC, CoSMAC and PE cells were also synthesized using the Xilinx compiler to check their power and area utilization and their performance on the FPGA platform. The FPGA hardware and energy requirements are summarised in Table 4. It is evident that the logic and power consumed per visibility are much lower for the new CoSMAC and PE cells than for the existing CMAC cell. Additionally, the PE cell provides the best efficiency among the three under the critical power utilization constraint in large array correlators, similar to the case of the ASIC implementations discussed in sub-section A.
V. CONCLUSION
The X-part of an FX correlator implementation for very long baseline radio interferometry, e.g., the SKA, requires enormous computational and storage hardware and sustained power consumption. This is necessary for generating the enormous number of channel-wise and baseline-wise visibilities through cross- and auto-correlation among the signals at over 10 Tbps composite throughput rate. The proposed computationally minimized algorithm for the baseline calculations in the X-part, and the corresponding optimized hardware architecture which exploits the redundancies in the correlation products, reduce the computational requirement by nearly 50% without degrading the speed and performance of the correlator. The X-part can generate the complete visibility spectrum (including the visibilities for the conjugate channels) for image formation using only (N/2 + 1) (just about 50%) frequency channel samples from the F-section. Also, using the new CoSMAC and PE cells in the ASIC (VLSI) implementation, the energy efficiency is enhanced manyfold, in addition to enabling the processing of a tera-scale input data rate by the X-section. This work thus provides a computational technique for some emerging Big-data problems.
