Management of large geospatial datasets

Abstract

In large simulation studies, such as predicting the movement of ocean particles, simulation runs are often related because they share one or more inputs. As the number of simulations grows, it becomes harder for the users who run them to keep track of them all, and storage space is wasted when multiple copies of the same input files are kept. This thesis describes a system that collects data from previous simulations and lets users search for the data they need to run the next simulation. The system also identifies identical files that were used in previous simulations, so users can refer to an existing file instead of copying it into a new simulation folder. Among the simulations executed in our current environment, the system identifies around 11% of input files as shared between simulations. We conclude that the system reduces the time and effort users spend finding input files from previous simulations when they set up a new one, and that it saves storage space on the computing cluster where the simulations run by identifying duplicated data.
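
The abstract does not state how identical input files are detected. As an illustration only, one common approach is to compare content hashes across simulation folders. The sketch below assumes SHA-256 hashing and a one-folder-per-simulation layout; the function names `file_digest` and `find_shared_inputs` are hypothetical and are not taken from the thesis.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def find_shared_inputs(simulation_dirs):
    """Group files by content digest and keep digests that occur more than once.

    Assumption: each entry in simulation_dirs is a folder holding the
    input files of one simulation.
    """
    by_digest = defaultdict(list)
    for sim_dir in simulation_dirs:
        for path in sorted(Path(sim_dir).rglob("*")):
            if path.is_file():
                by_digest[file_digest(path)].append(path)
    # Only digests shared by two or more files indicate duplicated content.
    return {d: paths for d, paths in by_digest.items() if len(paths) > 1}
```

Each group returned by `find_shared_inputs` lists paths with byte-identical content, so a new simulation could reference one of them instead of storing another copy. Hashing large geospatial files is expensive, so a practical system would likely store the digests rather than recompute them for every search.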
