– Even with small per-process data volumes, aggregate data volumes are very large (tens of TB per output).
– Communication during IO can negatively impact performance.

Large Storage Systems
– Hundreds of storage targets must be managed to get performance.
– Shared use by analysis and data preparation impacts other users.

Multi-user Systems
– Multiple large jobs run concurrently (internal).
– File system may be shared across systems (external).
– Prep data in transit to aid downstream usage.

Platform Concerns

API performance on platform
– The best-performing IO API varies by platform.
– Some platforms do not have a working implementation of an API, requiring selection of a different one (e.g., HDF5).

File system characteristics vary
– Adjust the IO organization to meet system characteristics (stripe size/count, storage targets).
– Respond dynamically to variations in file system performance (adaptive IO techniques).

Annotate data to aid in analysis
– Generate characteristics for locating data (min, max).
– Index data with characteristics to aid in finding it.
– Use resilient formats to protect output data.
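
The annotation idea (generate min/max characteristics, then index data by them) can be illustrated with a small sketch. This is not any particular IO library's implementation; the names `BlockCharacteristics`, `build_index`, and `blocks_in_range` are hypothetical, and each inner list stands in for the data written by one process.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class BlockCharacteristics:
    """Per-block annotation stored alongside the output (hypothetical layout)."""
    writer_rank: int   # which process wrote the block
    offset: int        # element offset of the block in the combined output
    count: int         # number of elements in the block
    minimum: float
    maximum: float


def build_index(blocks: Sequence[Sequence[float]]) -> List[BlockCharacteristics]:
    """Compute min/max characteristics for each non-empty block, in write order."""
    index = []
    offset = 0
    for rank, values in enumerate(blocks):
        index.append(
            BlockCharacteristics(rank, offset, len(values), min(values), max(values))
        )
        offset += len(values)
    return index


def blocks_in_range(index: Sequence[BlockCharacteristics],
                    low: float, high: float) -> List[BlockCharacteristics]:
    """Return only the blocks whose [min, max] interval overlaps [low, high].

    Blocks that cannot contain a matching value are skipped without reading
    their data, which is the point of keeping the characteristics.
    """
    return [c for c in index if c.maximum >= low and c.minimum <= high]


if __name__ == "__main__":
    # Three illustrative blocks, one per writer process (made-up values).
    data = [[0.1, 0.4, 0.3], [2.0, 2.5, 1.9], [0.9, 1.2, 1.0]]
    idx = build_index(data)
    for hit in blocks_in_range(idx, 1.0, 2.0):
        print(hit.writer_rank, hit.offset, hit.count)
```

An analysis tool that searches for values between 1.0 and 2.0 would consult only the index and then read the blocks from writers 1 and 2, skipping writer 0 entirely.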