With increasing photovoltaic (PV) installations, large amounts of time series data, such as meteorological data and string-level measurements, are collected from utility-scale PV systems [1, 2]. Due to fluctuations in irradiance and temperature, PV data are highly stochastic. They also exhibit spatio-temporal differences, with potentially time-lagged correlations, because wind direction affects cloud movement [3]. When these variations are coupled with different types of PV systems in terms of power output and wiring configuration, as well as localised PV effects like partial shading and module mismatches, the lengthy time series data from solar systems become highly multi-dimensional and challenging to process. In addition, these raw datasets can rarely be used directly because of the noise and irrelevant information they may contain, and they are difficult to visualize and analyze in raw form. On this point, the Pareto principle, better known as the 80/20 rule, commonly applies: researchers and solar engineers often spend most of their time collecting, cleaning, filtering, reducing and formatting the data.
In this work, a data analytics algorithm is applied to mitigate some of these complexities and make sense of the large time series data in PV systems. Each time series is treated as an individual entity that can be characterized by a set of generic or application-specific features. This reduces the dimension of the data, i.e., from hundreds of samples in a time series to a few descriptive features. It is also easier to visualize big time series data in the feature space than with traditional time series visualization methods, such as the spaghetti plot and horizon plot, which are informative but not very scalable. The time series data are processed to extract features and, through clustering, to identify the correspondence between specific measurements and the geographical location of the PV systems. This characterisation of the time series data can be used for several PV applications, namely, (1) PV fault identification, (2) PV network design and (3) PV type pre-design for PV installation in locations with different geographical attributes.
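The feature-based representation described above can be illustrated with a minimal sketch. The features chosen here (mean, standard deviation, peak and lag-1 autocorrelation), the function names and the synthetic half-sine profiles are assumptions made for illustration only; they are not the feature set or the data used in this work.

```python
import math
import random
from statistics import mean, pstdev


def extract_features(series):
    """Summarize one time series with a few generic descriptive features.

    The feature set is illustrative, not the one used in the paper.
    """
    mu = mean(series)
    sigma = pstdev(series)
    if sigma == 0:
        # A constant series has no variation to correlate.
        lag1 = 0.0
    else:
        n = len(series)
        # Lag-1 autocovariance, normalized by the population variance.
        cov = sum((series[i] - mu) * (series[i + 1] - mu) for i in range(n - 1)) / n
        lag1 = cov / sigma**2
    return {"mean": mu, "std": sigma, "peak": max(series), "lag1_autocorr": lag1}


def synthetic_day(n_samples, noise, rng):
    """Hypothetical daily profile: a half sine plus Gaussian noise, clipped at zero."""
    return [
        max(0.0, math.sin(math.pi * i / (n_samples - 1)) + rng.gauss(0.0, noise))
        for i in range(n_samples)
    ]


rng = random.Random(0)  # fixed seed so the sketch is reproducible
clear_day = synthetic_day(288, 0.02, rng)   # 288 samples, e.g. 5-minute data over a day
cloudy_day = synthetic_day(288, 0.30, rng)  # same shape, much noisier

# Each 288-sample series is reduced to a 4-element feature vector.
feature_table = {
    "clear": extract_features(clear_day),
    "cloudy": extract_features(cloudy_day),
}
for name, feats in feature_table.items():
    print(name, {k: round(v, 3) for k, v in feats.items()})
```

In this sketch, each series of 288 samples collapses to four numbers, so many series can be compared as points in a low-dimensional feature space. Such feature vectors could then be passed to a clustering step, which is not shown here.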