17 research outputs found

    Evaluation of object storage technologies for climate data storage and analysis

    ABSTRACT: Data analysis in the earth sciences has been dominated by the download-analyze model, in which a scientist first downloads the desired dataset from a remote server to their local workstation or their institution's HPC infrastructure and then performs the analysis. Over time, the size and variety of datasets have grown exponentially and new data science methodologies have appeared, introducing new requirements for how datasets are stored and analyzed. In the climate community, the dominant dataset format is netCDF, which has incorporated new functionality for more efficient storage and access, such as the HDF5 format and its chunking technique, which allow netCDF files to be accessed through parallel file systems. Data access has also been improved by protocols such as DAP, which let a client retrieve only the required subset of a remote dataset. In recent years, cloud computing, and object storage in particular, has emerged as an alternative for both storing and analyzing climate data, encouraging the development of new storage specifications and access libraries, such as Zarr. Object storage works by assigning an alphanumeric identifier (hash id) to an arbitrary block of bytes (blob), combined with REST APIs. The purpose of this work is to evaluate the benefits and efficiency of these new technologies and specifications against the existing stack, both for data storage and for data access and analysis.
    Máster en Ciencia de Datos
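    The access pattern this abstract describes can be illustrated with a minimal sketch (not code from the thesis): a Zarr dataset whose chunks are stored as blobs under string keys in an S3-compatible object store, read over the store's REST API. The bucket name and variable name here are hypothetical.

```python
# Minimal sketch of reading a subset of a Zarr dataset held in object storage.
# Bucket and variable names are hypothetical; only the chunks (blobs) covering
# the requested slice are fetched over the REST API.
import s3fs
import zarr

fs = s3fs.S3FileSystem(anon=True)                       # REST calls to the object store
store = s3fs.S3Map("some-bucket/climate.zarr", s3=fs)   # maps string keys -> blobs
group = zarr.open(store, mode="r")

tasmax = group["tasmax"]                  # chunked array; each chunk is one object
subset = tasmax[0:10, 100:200, 100:200]   # fetches only the chunks this slice touches
print(subset.mean())
```

    This is the key contrast with the download-analyze model: the client addresses individual chunk objects by key instead of transferring the whole file.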

    TensorBank: Tensor Lakehouse for Foundation Model Training

    Storing and streaming high-dimensional data for foundation model training has become a critical requirement with the rise of foundation models beyond natural language. In this paper we introduce TensorBank, a petabyte-scale tensor lakehouse capable of streaming tensors from Cloud Object Store (COS) to GPU memory at wire speed based on complex relational queries. We use Hierarchical Statistical Indices (HSI) for query acceleration. Our architecture allows tensors to be addressed directly at the block level using HTTP range reads. Once in GPU memory, data can be transformed using PyTorch transforms. We provide a generic PyTorch dataset type with a corresponding dataset factory that translates relational queries and requested transformations into dataset instances. By making use of the HSI, irrelevant blocks can be skipped without reading them, as the indices contain statistics on their content at different hierarchical resolution levels. This is an opinionated architecture powered by open standards and making heavy use of open-source technology. Although hardened for production use with geospatial-temporal data, the architecture generalizes to other use cases such as computer vision, computational neuroscience, biological sequence analysis, and more.
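    The block-level addressing mentioned above can be sketched with a plain HTTP range read (this is not TensorBank's actual API): the client asks the object store for one byte range of a stored tensor and reinterprets the bytes as an array. The URL, offset, dtype, and shape are hypothetical.

```python
# Minimal sketch: fetch one tensor block from object storage via an HTTP range
# read, then hand it to PyTorch. URL, offset, dtype, and shape are hypothetical.
import numpy as np
import requests
import torch

url = "https://cos.example.com/bucket/tensor.bin"   # hypothetical object URL
offset, length = 4096, 64 * 64 * 4                  # block start and byte count (float32)

resp = requests.get(url, headers={"Range": f"bytes={offset}-{offset + length - 1}"})
resp.raise_for_status()                             # expect 206 Partial Content

block = np.frombuffer(resp.content, dtype=np.float32).reshape(64, 64)
tensor = torch.from_numpy(block.copy())             # ready for PyTorch transforms
```

    Skipping a block then simply means never issuing its range request, which is what an index with per-block statistics makes possible.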

    Deep Learning for Rapid Landslide Detection using Synthetic Aperture Radar (SAR) Datacubes

    With climate change predicted to increase the likelihood of landslide events, there is a growing need for rapid landslide detection technologies that help inform emergency responses. Synthetic Aperture Radar (SAR) is a remote sensing technique that can provide measurements of affected areas independent of weather or lighting conditions. Usage of SAR, however, is hindered by the domain knowledge required for its pre-processing steps, and its interpretation requires expert knowledge. We provide simplified, pre-processed, machine-learning-ready SAR datacubes for four landslide events around the globe, obtained from several Sentinel-1 satellite passes before and after each landslide-triggering event, together with segmentation maps of the landslides. Using the Hokkaido, Japan datacube from this dataset, we study the feasibility of SAR-based landslide detection with supervised deep learning (DL). Our results demonstrate that DL models can detect landslides from SAR data, achieving an area under the precision-recall curve exceeding 0.7. We find that additional satellite visits enhance detection performance, but that early detection is possible when SAR data is combined with terrain information from a digital elevation model, which can be especially useful for time-critical emergency interventions. Code is made publicly available at https://github.com/iprapas/landslide-sar-unet.
    Comment: Accepted at the NeurIPS 2022 workshop on Tackling Climate Change with Machine Learning. Authors Vanessa Boehm, Wei Ji Leong, Ragini Bal Mahesh, and Ioannis Prapas contributed equally as researchers for the Frontier Development Lab (FDL) 2022.
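    The reported metric, area under the precision-recall curve for pixel-wise segmentation, can be computed as in the minimal sketch below (not the paper's code); the label and score arrays here are random stand-ins for a ground-truth segmentation map and per-pixel model probabilities.

```python
# Minimal sketch of the evaluation metric: area under the precision-recall
# curve over flattened segmentation maps. Arrays are random placeholders.
import numpy as np
from sklearn.metrics import average_precision_score

rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=(128, 128))   # stand-in ground-truth mask
scores = rng.random(size=(128, 128))           # stand-in per-pixel probabilities

auprc = average_precision_score(labels.ravel(), scores.ravel())
print(f"AUPRC: {auprc:.3f}")                   # the paper reports > 0.7 on Hokkaido
```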

    Verification against in-situ observations for Data-Driven Weather Prediction

    Data-driven weather prediction models (DDWPs) have made rapid strides in recent years, demonstrating an ability to approximate Numerical Weather Prediction (NWP) models to a high degree of accuracy. Fast, accurate, and low-cost DDWP forecasts make their use in operational forecasting an attractive proposition; however, work remains to be done in rigorously evaluating DDWPs in a true operational setting. Typically trained and evaluated on ERA5 reanalysis data, DDWPs have been tested only in simulation, which cannot represent the real world with complete accuracy even when it is of very high quality. The safe use of DDWPs in operational forecasting requires more thorough "real-world" verification, as well as a careful examination of how DDWPs are currently trained and evaluated. It is worth asking, for instance: how well do the reanalysis datasets used for training simulate the real world? With an eye towards climate justice and the uneven availability of weather data: is the simulation equally good for all regions of the world, and would DDWPs exacerbate biases present in the training data? Does good performance in simulation correspond to good performance in operational settings? Beyond approximating the physics of NWP models, how can ML be uniquely deployed to provide more accurate weather forecasts? As a first step towards answering such questions, we present a robust dataset of in-situ observations derived from the NOAA MADIS program to serve as a benchmark for validating DDWPs in an operational setting. By providing a large corpus of quality-controlled in-situ observations, this dataset offers a meaningful real-world task against which all NWPs and DDWPs can be tested. We hope that this data can be used not only to rigorously and fairly compare operational weather models but also to spur future research in new directions.
    Comment: 10 pages, 6 figures, under review at the NeurIPS main conference.
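    Verification against in-situ observations typically means sampling a gridded forecast at each station's location and time and scoring the differences. The sketch below is not the paper's pipeline; the file names, variable names, and column names are hypothetical.

```python
# Minimal sketch: sample a gridded forecast at station locations (nearest grid
# point) and compute bias and RMSE. File, variable, and column names are
# hypothetical stand-ins for a forecast file and a MADIS-style observation table.
import numpy as np
import pandas as pd
import xarray as xr

forecast = xr.open_dataset("forecast.nc")["t2m"]    # dims: time, lat, lon
stations = pd.read_csv("madis_obs.csv")             # columns: time, lat, lon, t2m_obs

# Vectorized pointwise selection: one forecast value per observation row.
matched = forecast.sel(
    time=xr.DataArray(pd.to_datetime(stations["time"]), dims="obs"),
    lat=xr.DataArray(stations["lat"], dims="obs"),
    lon=xr.DataArray(stations["lon"], dims="obs"),
    method="nearest",
)

error = matched.values - stations["t2m_obs"].values
print("bias:", error.mean(), "RMSE:", np.sqrt((error ** 2).mean()))
```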

    Python: Una alternativa para el procesamiento de escenarios de cambio climático

    The objective of this research is to propose the Python programming language as an alternative for processing climate change scenarios, using maximum air temperature as the climate variable of study. The future climate change scenarios are based on the shared socioeconomic pathways used in the Sixth Assessment Report of the Intergovernmental Panel on Climate Change, of which four (4) scenarios are considered: SSP126, SSP245, SSP370, and SSP585. These are extracted from the public data on Google Cloud published as part of the Pangeo Project, which are in turn derived from the original files of the Coupled Model Intercomparison Project Phase 6 (CMIP6). The selected general circulation model is MRI-ESM2.0, developed by the Meteorological Research Institute of the Japan Meteorological Agency. The result of the data processing is four (4) global graphical representations of the change in the study variable over the period 2023-2100 relative to the period 1937-2014, one for each of the four (4) future scenarios, showing a clear increase in maximum temperature worldwide, most pronounced in the Arctic region. Finally, the Python programming language, given its ability to read files containing climate information, can be considered one more tool for processing such data.
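    The workflow the abstract describes can be sketched as follows (this is not the paper's code): open CMIP6 maximum-temperature output for one SSP scenario from the public Pangeo store on Google Cloud and compute a time-mean field for the future period. The zstore path is illustrative; real paths come from the pangeo-cmip6 catalog.

```python
# Minimal sketch: read CMIP6 tasmax (MRI-ESM2-0, SSP585) from the public Pangeo
# store on Google Cloud and average over 2023-2100. The zstore path below is
# illustrative, not an exact catalog entry.
import gcsfs
import xarray as xr

fs = gcsfs.GCSFileSystem(token="anon")
zstore = "gs://cmip6/CMIP6/ScenarioMIP/MRI/MRI-ESM2-0/ssp585/r1i1p1f1/Amon/tasmax/gn/"

ds = xr.open_zarr(fs.get_mapper(zstore), consolidated=True)
future = ds["tasmax"].sel(time=slice("2023", "2100")).mean("time")
# The matching historical run would be opened the same way, averaged over the
# reference period, and subtracted from `future` to map the projected change.
print(future)
```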

    From the oceans to the cloud: Opportunities and challenges for data, models, computation and workflows.

    © The Author(s), 2019. This article is distributed under the terms of the Creative Commons Attribution License. The definitive version was published as: Vance, T. C., Wengren, M., Burger, E., Hernandez, D., Kearns, T., Medina-Lopez, E., Merati, N., O'Brien, K., O'Neil, J., Potemra, J. T., Signell, R. P., & Wilcox, K. From the oceans to the cloud: Opportunities and challenges for data, models, computation and workflows. Frontiers in Marine Science, 6(211), (2019), doi:10.3389/fmars.2019.00211.
    Advances in ocean observations and models mean increasing flows of data. Integrating observations between disciplines over spatial scales from regional to global presents challenges. Running ocean models and managing the results is computationally demanding. The rise of cloud computing presents an opportunity to rethink traditional approaches. This includes developing shared data processing workflows utilizing common, adaptable software to handle data ingest and storage, and an associated framework to manage and execute downstream modeling. Working in the cloud presents challenges: migration of legacy technologies and processes, cloud-to-cloud interoperability, and the translation of legislative and bureaucratic requirements for "on-premises" systems to the cloud. To respond to the scientific and societal needs of a fit-for-purpose ocean observing system, and to maximize the benefits of more integrated observing, research on utilizing cloud infrastructures for sharing data and models is underway. Cloud platforms and the services/APIs they provide offer new ways for scientists to observe and predict the ocean's state. High-performance mass storage of observational data, coupled with on-demand computing to run model simulations in close proximity to the data, tools to manage workflows, and a framework to share and collaborate, enables a more flexible and adaptable observation and prediction computing architecture. Model outputs are stored in the cloud, and researchers either download subsets for their area of interest or feed them into their own simulations without leaving the cloud. Expanded storage and computing capabilities make it easier to create, analyze, and distribute products derived from long-term datasets. In this paper, we provide an introduction to cloud computing, describe current uses of the cloud for management and analysis of observational data and model results, and describe workflows for running models and streaming observational data. We discuss topics that must be considered when moving to the cloud: costs, security, and organizational limitations on cloud use. Future uses of the cloud via computational sandboxes and the practicalities and considerations of using the cloud to archive data are explored. We also consider the ways in which the human elements of ocean observations are changing, with the rise of a generation of researchers whose observations are likely to be made remotely rather than hands-on, and how their expectations and needs drive research towards the cloud. In conclusion, visions of a future where cloud computing is ubiquitous are discussed.
    This is PMEL contribution 4873
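    The "download subsets for their area of interest" pattern the paper describes is commonly served over OPeNDAP; the minimal sketch below (not from the paper) opens a remotely hosted model output lazily and pulls only one region and time span. The endpoint and variable name are hypothetical.

```python
# Minimal sketch: subset a remotely hosted ocean model output over OPeNDAP.
# Endpoint and variable name are hypothetical.
import xarray as xr

url = "https://example.org/thredds/dodsC/ocean_model/output.nc"
ds = xr.open_dataset(url)   # lazy: only metadata is read at this point

subset = ds["sea_surface_temperature"].sel(
    lat=slice(30, 45), lon=slice(-80, -60), time="2019-06"
)
subset.load()   # only the bytes for this region/month cross the network
```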