Simulations of complex scientific phenomena involve the execution of massively parallel computer programs. These simulation programs generate large-scale data sets over the spatio-temporal space. Modeling such massive data sets is an essential step in helping scientists discover new information from their computer simulations. In this paper, we present a simple but effective multivariate clustering algorithm for large-scale scientific simulation data sets. Our algorithm utilizes the cosine similarity measure to cluster the field variables in a data set. Field variables include all variables except the spatial (x, y, z) and temporal (time) variables. The exclusion of the spatial dimensions is important since “similar ” characteristics could be located (spatially) far from each other. To scale our multivariate clustering algorithm for large-scale data sets, we take advantage of the geometrical properties of the cosine similarity measure. This allows us to reduce the modeling time from O(n 2) to O(n × g(f(u))), where n is the number of data points, f(u) is a function of the user-defined clustering threshold, and g(f(u)) is the number of data points satisfying f(u). We show that on average g(f(u)) is much less than n. Finally, even though spatial variables do not play a role in building clusters, it is desirable to associate each cluster with its correct spatial region. To achieve this, we present a linking algorithm for connecting each cluster to the appropriate nodes of the data set’s topology tree (where the spatial information of the data set is stored). Our experimental evaluations on two largescale simulation data sets illustrate the value of our multivariate clustering and linking algorithms. 1
To submit an update or takedown request for this paper, please submit an Update/Correction/Removal Request.