Evaluation of object storage technologies for climate data storage and analysis
ABSTRACT: Data analysis in the earth sciences has been dominated by the download-analyze model, in which an analyst first downloads the desired dataset from a remote server to a local workstation or to their institution's HPC infrastructure, and then performs the analysis. Over time, the size and variety of datasets have grown exponentially and new data-analysis techniques have appeared, along with new requirements on how datasets are stored and analyzed. In the climate community, the dominant dataset format is netCDF, which has incorporated new functionality for more efficient data storage and access, such as the HDF5 format and its chunking technique, which enables access through parallel file systems. Data access has also benefited from protocols such as DAP, which allow only the required subset of a remote dataset to be retrieved. In recent years, cloud computing, and object storage in particular, has emerged as an alternative for both storing and analyzing climate data, encouraging the development of new storage and access specifications and libraries, such as Zarr. Object storage assigns an alphanumeric identifier (hash id) to an arbitrary block of bytes (blob), combined with REST APIs. The purpose of this work is to evaluate the benefits and efficiency of these new technologies and specifications against the existing stack, both for data storage and for data access and analysis.
Máster en Ciencia de Datos
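The storage model described in this abstract, an identifier assigned to an arbitrary blob behind a REST-style API with partial reads, can be sketched in a few lines of Python. The class and names below are illustrative only, not part of the thesis:

```python
import hashlib

class ObjectStore:
    """Toy object store: maps a content-derived id to an immutable blob.

    Real systems (e.g. S3-compatible stores) expose this via REST
    (PUT/GET) and support HTTP Range requests for partial reads.
    """

    def __init__(self):
        self._blobs = {}

    def put(self, blob: bytes) -> str:
        # Content-addressed: the id is a hash of the bytes.
        oid = hashlib.sha256(blob).hexdigest()
        self._blobs[oid] = blob
        return oid

    def get(self, oid: str, byte_range=None) -> bytes:
        # byte_range=(start, stop) mimics an HTTP Range read, which is
        # how chunked formats such as Zarr fetch a single chunk.
        blob = self._blobs[oid]
        if byte_range is not None:
            start, stop = byte_range
            return blob[start:stop]
        return blob

store = ObjectStore()
oid = store.put(b"0123456789")
chunk = store.get(oid, byte_range=(2, 5))  # partial read of the blob
```

Chunked cloud formats such as Zarr exploit exactly this kind of partial read, fetching one chunk of a large array per request instead of the whole file.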
TensorBank: Tensor Lakehouse for Foundation Model Training
Storing and streaming high-dimensional data for foundation model training
has become a critical requirement with the rise of foundation models beyond
natural language. In this paper we introduce TensorBank, a petabyte-scale
tensor lakehouse capable of streaming tensors from Cloud Object Store (COS)
to GPU memory at wire speed based on complex relational queries. We use
Hierarchical Statistical Indices (HSI) for query acceleration. Our
architecture allows tensors to be addressed directly at the block level
using HTTP range reads. Once in GPU memory, data can be transformed using
PyTorch transforms. We provide a generic PyTorch dataset type with a
corresponding dataset factory that translates relational queries and
requested transformations into dataset instances. By making use of the HSI,
irrelevant blocks can be skipped without reading them, as the indices
contain statistics on their content at different hierarchical resolution
levels. This is an opinionated architecture powered by open standards and
making heavy use of open-source technology. Although hardened for
production use with geospatial-temporal data, the architecture generalizes
to other use cases such as computer vision, computational neuroscience, and
biological sequence analysis.
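The block-skipping idea behind statistical indices such as the HSI can be illustrated with per-block min/max statistics: a range query only reads blocks whose statistics overlap the query interval. Everything below is an illustrative sketch under that assumption, not the TensorBank implementation:

```python
class Block:
    """One stored block plus the index statistics kept about it."""
    def __init__(self, offset, length, vmin, vmax):
        self.offset, self.length = offset, length   # HTTP range to read
        self.vmin, self.vmax = vmin, vmax           # per-block statistics

def blocks_to_read(blocks, lo, hi):
    """Return byte ranges of blocks that may contain values in [lo, hi].

    Blocks whose [vmin, vmax] interval misses the query interval are
    skipped without being read -- only the index statistics are consulted.
    """
    return [(b.offset, b.offset + b.length)
            for b in blocks
            if not (b.vmax < lo or b.vmin > hi)]

blocks = [Block(0, 100, vmin=0.0, vmax=1.0),
          Block(100, 100, vmin=5.0, vmax=9.0),
          Block(200, 100, vmin=2.0, vmax=6.0)]
ranges = blocks_to_read(blocks, lo=4.0, hi=7.0)  # first block is skipped
```

A hierarchical index applies the same pruning test at coarser resolution levels first, so whole subtrees of blocks can be discarded with a single comparison.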
Deep Learning for Rapid Landslide Detection using Synthetic Aperture Radar (SAR) Datacubes
With climate change predicted to increase the likelihood of landslide events,
there is a growing need for rapid landslide detection technologies that help
inform emergency responses. Synthetic Aperture Radar (SAR) is a remote sensing
technique that can provide measurements of affected areas independent of
weather or lighting conditions. The use of SAR, however, is hindered by the
domain knowledge needed for its pre-processing steps, and its interpretation
requires expert knowledge. We provide simplified, pre-processed,
machine-learning ready SAR datacubes for four globally located landslide events
obtained from several Sentinel-1 satellite passes before and after a landslide
triggering event together with segmentation maps of the landslides. From this
dataset, using the Hokkaido, Japan datacube, we study the feasibility of
SAR-based landslide detection with supervised deep learning (DL). Our results
demonstrate that DL models can be used to detect landslides from SAR data,
achieving an Area under the Precision-Recall curve exceeding 0.7. We find that
additional satellite visits enhance detection performance, but that early
detection is possible when SAR data is combined with terrain information from a
digital elevation model. This can be especially useful for time-critical
emergency interventions. Code is made publicly available at
https://github.com/iprapas/landslide-sar-unet.
Comment: Accepted in the NeurIPS 2022 workshop on Tackling Climate Change with
Machine Learning. Authors Vanessa Boehm, Wei Ji Leong, Ragini Bal Mahesh,
Ioannis Prapas contributed equally as researchers for the Frontier
Development Lab (FDL) 2022.
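The headline metric above, area under the precision-recall curve, can be computed as average precision: the mean of the precision values at each true-positive rank. A minimal sketch on toy labels and scores (not the paper's data or model):

```python
def average_precision(labels, scores):
    """Area under the precision-recall curve via average precision.

    labels: 0/1 ground truth (e.g. flattened landslide mask pixels);
    scores: model outputs, higher means more likely landslide.
    """
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    total_pos = sum(labels)
    tp, ap = 0, 0.0
    for rank, i in enumerate(order, start=1):
        if labels[i] == 1:
            tp += 1
            ap += tp / rank          # precision at this recall step
    return ap / total_pos

labels = [1, 0, 1, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.3, 0.2, 0.1]
ap = average_precision(labels, scores)
```

Unlike accuracy, this metric is informative for heavily imbalanced masks, where landslide pixels are rare, which is why it is the natural choice here.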
Verification against in-situ observations for Data-Driven Weather Prediction
Data-driven weather prediction models (DDWPs) have made rapid strides in
recent years, demonstrating an ability to approximate Numerical Weather
Prediction (NWP) models to a high degree of accuracy. The fast, accurate, and
low-cost DDWP forecasts make their use in operational forecasting an attractive
proposition; however, there remains work to be done in rigorously evaluating
DDWPs in a true operational setting. Typically trained and evaluated using ERA5
reanalysis data, DDWPs have been tested only in a simulation, which cannot
represent the real world with complete accuracy even if it is of a very high
quality. The safe use of DDWPs in operational forecasting requires more
thorough "real-world" verification, as well as a careful examination of how
DDWPs are currently trained and evaluated. It is worth asking, for instance,
how well do the reanalysis datasets, used for training, simulate the real
world? With an eye towards climate justice and the uneven availability of
weather data: is the simulation equally good for all regions of the world, and
would DDWPs exacerbate biases present in the training data? Does a good
performance in simulation correspond to good performance in operational
settings? In addition to approximating the physics of NWP models, how can ML be
uniquely deployed to provide more accurate weather forecasts? As a first step
towards answering such questions, we present a robust dataset of in-situ
observations derived from the NOAA MADIS program to serve as a benchmark to
validate DDWPs in an operational setting. By providing a large corpus of
quality-controlled, in-situ observations, this dataset provides a meaningful
real-world task that all NWPs and DDWPs can be tested against. We hope that
this data can be used not only to rigorously and fairly compare operational
weather models but also to spur future research in new directions.
Comment: 10 pages, 6 figures, under review at the NeurIPS main conference.
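Verification against in-situ observations of the kind proposed here reduces, in its simplest form, to extracting forecast values at station locations and computing error statistics. A minimal nearest-gridpoint sketch with made-up coordinates and values (real verification would also handle quality control, vertical adjustment, and interpolation choices):

```python
import math

def nearest_index(grid, value):
    """Index of the grid coordinate closest to a station coordinate."""
    return min(range(len(grid)), key=lambda i: abs(grid[i] - value))

def verify(forecast, lats, lons, stations):
    """Compare a gridded forecast against point observations.

    forecast: 2-D list indexed [lat][lon];
    stations: list of (lat, lon, observed_value).
    Returns (bias, rmse) of forecast minus observation.
    """
    errors = []
    for lat, lon, obs in stations:
        i = nearest_index(lats, lat)
        j = nearest_index(lons, lon)
        errors.append(forecast[i][j] - obs)
    bias = sum(errors) / len(errors)
    rmse = math.sqrt(sum(e * e for e in errors) / len(errors))
    return bias, rmse

lats = [40.0, 41.0]
lons = [-105.0, -104.0]
forecast = [[15.0, 16.0], [14.0, 13.0]]           # e.g. 2 m temperature
stations = [(40.1, -104.9, 14.5), (40.9, -104.1, 13.5)]
bias, rmse = verify(forecast, lats, lons, stations)
```

The point of an observation benchmark is that the same `verify` step can be applied identically to NWP and DDWP output, making the comparison fair across model families.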
Python: An alternative for processing climate change scenarios
The objective of this research work is to propose the Python programming language as an alternative for processing climate change scenarios, taking the maximum air temperature as the climatic variable of study. The future climate change scenarios are based on the shared socioeconomic pathways used in the Sixth Assessment Report of the Intergovernmental Panel on Climate Change, of which four scenarios are considered: SSP126, SSP245, SSP370 and SSP585. These are extracted from the public data of Google Cloud, as part of the Pangeo Project, and are in turn derived from the original files of the Coupled Model Intercomparison Project Phase 6. The selected general circulation model is MRI-ESM2.0, developed by the Meteorological Research Institute of the Japan Meteorological Agency. Data processing yields four global graphical representations of the variation of the study variable over the period 2023-2100 compared with the period 1937-2014 for each of the four future scenarios, showing an imminent increase in maximum temperature worldwide, mainly in the Arctic region. Finally, the Python programming language, being able to read files containing climate information, can be considered another tool for processing them.
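The core computation, the change in mean maximum temperature between the future and historical periods, can be sketched as follows. The series here is a tiny synthetic stand-in for one grid point; the study itself works on gridded MRI-ESM2.0 output:

```python
def period_mean(series, years, start, end):
    """Mean of a yearly series over [start, end] inclusive."""
    vals = [v for y, v in zip(years, series) if start <= y <= end]
    return sum(vals) / len(vals)

def scenario_anomaly(years, tasmax, hist=(1937, 2014), future=(2023, 2100)):
    """Change in mean maximum temperature: future period minus the
    historical baseline -- the comparison performed in the study."""
    return (period_mean(tasmax, years, *future)
            - period_mean(tasmax, years, *hist))

# Synthetic yearly series: 20 degC in 1937, warming 0.01 degC per year.
years = list(range(1937, 2101))
tasmax = [20.0 + 0.01 * (y - 1937) for y in years]
delta = scenario_anomaly(years, tasmax)
```

In practice the same anomaly would be computed per grid cell and per scenario (SSP126 through SSP585) and then mapped, which is what produces the four global graphical representations described above.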
From the oceans to the cloud: Opportunities and challenges for data, models, computation and workflows.
© The Author(s), 2019. This article is distributed under the terms of the Creative Commons Attribution License. The definitive version was published in Vance, T. C., Wengren, M., Burger, E., Hernandez, D., Kearns, T., Medina-Lopez, E., Merati, N., O'Brien, K., O'Neil, J., Potemrag, J. T., Signell, R. P., & Wilcox, K. From the oceans to the cloud: Opportunities and challenges for data, models, computation and workflows. Frontiers in Marine Science, 6(211), (2019), doi:10.3389/fmars.2019.00211.
Advances in ocean observations and models mean increasing flows of data. Integrating observations between disciplines over spatial scales from regional to global presents challenges. Running ocean models and managing the results is computationally demanding. The rise of cloud computing presents an opportunity to rethink traditional approaches. This includes developing shared data processing workflows utilizing common, adaptable software to handle data ingest and storage, and an associated framework to manage and execute downstream modeling. Working in the cloud presents challenges: migration of legacy technologies and processes, cloud-to-cloud interoperability, and the translation of legislative and bureaucratic requirements for “on-premises” systems to the cloud. To respond to the scientific and societal needs of a fit-for-purpose ocean observing system, and to maximize the benefits of more integrated observing, research on utilizing cloud infrastructures for sharing data and models is underway. Cloud platforms and the services/APIs they provide offer new ways for scientists to observe and predict the ocean’s state. High-performance mass storage of observational data, coupled with on-demand computing to run model simulations in close proximity to the data, tools to manage workflows, and a framework to share and collaborate, enables a more flexible and adaptable observation and prediction computing architecture.
Model outputs are stored in the cloud and researchers either download subsets for their interest/area or feed them into their own simulations without leaving the cloud. Expanded storage and computing capabilities make it easier to create, analyze, and distribute products derived from long-term datasets. In this paper, we provide an introduction to cloud computing, describe current uses of the cloud for management and analysis of observational data and model results, and describe workflows for running models and streaming observational data. We discuss topics that must be considered when moving to the cloud: costs, security, and organizational limitations on cloud use. Future uses of the cloud via computational sandboxes and the practicalities and considerations of using the cloud to archive data are explored. We also consider the ways in which the human elements of ocean observations are changing – the rise of a generation of researchers whose observations are likely to be made remotely rather than hands on – and how their expectations and needs drive research towards the cloud. In conclusion, visions of a future where cloud computing is ubiquitous are discussed.
This is PMEL contribution 4873.