Rigid Transformations for Stabilized Lower Dimensional Space to Support Subsurface Uncertainty Quantification and Interpretation
Subsurface datasets inherently possess big data characteristics such as vast
volume, diverse features, and high sampling speeds, further compounded by the
curse of dimensionality from various physical, engineering, and geological
inputs. Among existing dimensionality reduction (DR) methods, nonlinear dimensionality reduction (NDR) methods, especially metric multidimensional scaling (MDS), are preferred for subsurface datasets due to their inherent complexity. While MDS retains the intrinsic data structure and quantifies uncertainty, it has two limitations: its solutions are neither stabilized nor unique, being determined only up to Euclidean transformations, and it lacks an out-of-sample point (OOSP) extension. To enhance subsurface inferential and machine learning workflows, datasets must be transformed into stable, reduced-dimension representations that accommodate OOSP.
Our solution employs rigid transformations to produce a stabilized, Euclidean-transformation-invariant representation of the lower dimensional space (LDS). By computing an MDS input dissimilarity matrix and applying rigid transformations to multiple realizations, we ensure transformation invariance and integrate OOSP. The process leverages a convex hull algorithm and uses a loss function and normalized stress to quantify distortion. We validate our approach with synthetic data, varying distance metrics, and real-world wells from the Duvernay Formation.
Results confirm our method's efficacy in achieving consistent LDS
representations. Furthermore, our proposed "stress ratio" (SR) metric provides
insight into uncertainty, beneficial for model adjustments and inferential
analysis. Consequently, our workflow promises enhanced repeatability and
comparability in NDR for subsurface energy resource engineering and associated
big data workflows.
Comment: 30 pages, 17 figures. Submitted to Computational Geosciences Journal.
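As a hedged illustration of the alignment idea only (not the authors' exact workflow; the convex hull step, OOSP handling, and the stress-ratio metric are omitted), independent MDS realizations can be brought into a common pose with an orthogonal Procrustes fit:

```python
# Hedged sketch: stabilizing MDS realizations with rigid (Procrustes) alignment.
# Function names and settings are illustrative, not the authors' exact workflow.
import numpy as np
from scipy.linalg import orthogonal_procrustes
from sklearn.manifold import MDS

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 6))   # stand-in for a high-dimensional subsurface dataset

def mds_realization(X, seed):
    """One 2-D metric-MDS embedding; each seed yields a different rigid pose."""
    return MDS(n_components=2, random_state=seed).fit_transform(X)

def rigid_align(Y, ref):
    """Align Y to ref by translation plus the optimal rotation/reflection."""
    Yc, refc = Y - Y.mean(0), ref - ref.mean(0)
    R, _ = orthogonal_procrustes(Yc, refc)   # orthogonal R minimizing ||Yc @ R - refc||
    return Yc @ R

ref = mds_realization(X, seed=0)
aligned = [rigid_align(mds_realization(X, seed=s), ref) for s in range(1, 5)]

# After alignment the realizations agree up to residual stress, giving a
# repeatable LDS on which uncertainty can be quantified.
spread = np.mean([np.linalg.norm(A - (ref - ref.mean(0))) for A in aligned])
print(f"mean residual across aligned realizations: {spread:.3f}")
```

Because the optimal orthogonal map can do no worse than the identity, alignment never increases the disagreement between realizations; it removes exactly the rotation, reflection, and translation ambiguity the abstract describes.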
Improved integration of information to reduce subsurface model bias
Subsurface modeling deals with data-related issues like cognitive and sampling biases, and model-related challenges including statistical assumptions, misspecification, and algorithmic biases. These challenges introduce four critical implications during subsurface modeling. Firstly, subsurface sampling is subject to sampling bias, which compromises statistical representativeness. Secondly, analog selection methodologies rely on multivariate statistics and expert judgment that overlook spatial information and data dimensionality. Thirdly, subsurface inferential workflows that utilize dimensionality reduction seldom provide repeatable frameworks that maintain model stability and are invariant to Euclidean transformations. Lastly, deep learning methods for dimensionality reduction, characterized as black-box models, lack interpretability and robust evaluation metrics, increasing susceptibility to algorithmic bias. Consequently, neglecting these challenges in subsurface modeling could lead to erroneous predictions, inconsistent inferences, diminished model reliability, and suboptimal decision-making that impacts project economics.
This dissertation integrates information within subsurface models to reduce model bias and significantly improve their accuracy, robustness, and generalizability. First, I create spatial declustering methods to debias spatial datasets with single and multiscale preferential sampling in stationary populations. Second, I introduce a novel geostatistics-based machine learning method for identifying subsurface resource analogs that integrate spatial information in subsurface datasets with high dimensionality. Next, I efficiently combine machine learning and computational geometry methods to stabilize lower dimensional spaces for uncertainty quantification and interpretation. Finally, I create a methodology to assess, evaluate, and interpret the stability of deep learning latent feature spaces.
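The first contribution above is spatial declustering. A minimal cell-declustering sketch (a standard geostatistical debiasing tool, not necessarily the dissertation's exact method; all names and the grid size are illustrative) shows the core idea, weighting each sample inversely to how crowded its grid cell is so that preferentially clustered samples do not dominate statistics:

```python
# Hypothetical cell-declustering sketch: downweight samples in crowded grid cells.
# Grid size and data are illustrative; the dissertation's methods are more general.
import numpy as np

rng = np.random.default_rng(3)
# Preferential sampling: many samples near a "sweet spot", few elsewhere.
cluster = rng.normal(loc=[0.2, 0.2], scale=0.05, size=(80, 2))
background = rng.uniform(size=(20, 2))
xy = np.vstack([cluster, background])
values = xy.sum(1) + 0.1 * rng.normal(size=len(xy))   # property correlated with location

def cell_decluster_weights(xy, cell_size):
    """weight_i = 1 / (n_occupied_cells * count in i's cell); weights sum to 1."""
    cells = np.floor(xy / cell_size).astype(int)
    _, inv, counts = np.unique(cells, axis=0, return_inverse=True, return_counts=True)
    inv = inv.ravel()
    return 1.0 / (counts.size * counts[inv])

w = cell_decluster_weights(xy, cell_size=0.25)
print("naive mean      :", values.mean())
print("declustered mean:", values @ w)   # each occupied cell contributes equally
```

With this weighting, the heavily sampled sweet-spot cell and each sparsely sampled background cell contribute equally to the mean, which is the sense in which representativeness is restored.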
These novel methodologies demonstrate the importance of improved techniques for information integration in subsurface modeling and show better results than naïve methods. This results in objective sampling debiasing in spatially stationary populations with single or multiple data scales, improving statistical representativeness. The results also show better generalization and accurate identification of spatial analogs in high-dimensional datasets. Moreover, the methods yield Euclidean-transformation-invariant lower-dimensional spaces, ensuring unique and repeatable solutions that improve model reliability and interpretability and enable rational comparisons. Finally, the results indicate that deep learning models for dimensionality reduction exhibit algorithmic biases and instabilities, including sample, structural, and inferential instability, affecting their reliability and interpretability. Together, these innovations reduce model bias and significantly improve subsurface modeling.
Petroleum and Geosystems Engineering
Data visualization methodology using spectral and divergence-based methods for interactive dimensionality reduction
Pattern recognition tasks apply methods that evolve alongside the growth of data, achieving efficient metrics in terms of optimization and computational performance applied to data exploration, selection, and representation. Nevertheless, the results produced by such methods and tools can be ambiguous and/or abstract to the user, making their application complex, all the more so when the user has no prior knowledge of the data. A priori knowledge generally ensures the correct choice of model, as well as of suitable algorithms and methods. However, for massive data, where such knowledge is scarce and hard to obtain, interpretation can be arduous for users, especially non-expert users. Consequently, pattern recognition faces several problems, the most important of which are dimensionality reduction, interaction with large volumes of information, and the interpretation and visualization of data. These issues involve notions of controllability and interaction, properties that are largely absent from typical research in the field of dimensionality reduction. This thesis presents a new approach to data visualization based on the interactive mixture of the results of dimensionality reduction (DR) methods. The mixture is a weighted sum whose weighting factors are defined by the user through an intuitive visual interface. In addition, the low-dimensional representation spaces produced by DR methods are displayed as scatter plots driven by a controlled, interactive data visualization. To this end, pairwise similarity distances are computed and used to define the graph represented in the scatter plot.
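A minimal sketch of the mixture idea, under the assumption that the user-weighted blend operates on pairwise-similarity kernels before a spectral embedding (the thesis's exact formulation may differ; the kernels and the weight below are illustrative):

```python
# Illustrative sketch, not the thesis's pipeline: blend two DR "views" by a
# user-chosen weighted sum of similarity kernels, then embed spectrally.
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(60, 5))

def rbf_kernel(X, gamma):
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)   # pairwise squared distances
    return np.exp(-gamma * sq)

def linear_kernel(X):
    return X @ X.T

def blended_embedding(X, w, n_components=2):
    """w in [0, 1] interpolates between a linear (PCA-like) and an RBF kernel view."""
    K = w * linear_kernel(X) + (1 - w) * rbf_kernel(X, gamma=0.5)
    n = len(K)
    H = np.eye(n) - np.ones((n, n)) / n        # centering matrix
    Kc = H @ K @ H                             # double-centered blended kernel
    vals, vecs = np.linalg.eigh(Kc)
    idx = np.argsort(vals)[::-1][:n_components]
    return vecs[:, idx] * np.sqrt(np.maximum(vals[idx], 0))

Y = blended_embedding(X, w=0.7)   # scatter-plot-ready 2-D coordinates
print(Y.shape)
```

Sliding `w` in a visual interface and redrawing the scatter plot is one plausible realization of the interactive, user-weighted mixture the abstract describes.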
Large-Scale Indexing, Discovery, and Ranking for the Internet of Things (IoT)
Network-enabled sensing and actuation devices are key enablers to connect real-world objects to the cyber world. The Internet of Things (IoT) consists of the network-enabled devices and communication technologies that allow connectivity and integration of physical objects (Things) into the digital world (Internet). Enormous amounts of dynamic IoT data are collected from Internet-connected devices. IoT data are usually multi-variant streams that are heterogeneous, sporadic, multi-modal, and spatio-temporal. IoT data can be disseminated with different granularities and have diverse structures, types, and qualities. Dealing with the data deluge from heterogeneous IoT resources and services imposes new challenges on indexing, discovery, and ranking mechanisms that will allow building applications that require on-line access and retrieval of ad-hoc IoT data. However, the existing IoT data indexing and discovery approaches are complex or centralised, which hinders their scalability. The primary objective of this article is to provide a holistic overview of the state of the art on indexing, discovery, and ranking of IoT data. The article aims to pave the way for researchers to design, develop, implement, and evaluate techniques and approaches for on-line large-scale distributed IoT applications and services.
Low dimension hierarchical subspace modelling of high dimensional data
Building models of high-dimensional data in a low dimensional space has become extremely popular in recent years. Motion tracking, facial animation, stock market tracking, digital libraries and many other models have been built and tuned to specific application domains. However, when the underlying structure of the original data is unknown, the modelling of such data is still an open question. The problem is of interest because capturing and storing large amounts of high dimensional data has become trivial, yet the capability to process, interpret, and use this data is limited. In this thesis, we introduce novel algorithms for modelling high dimensional data with an unknown structure, which allow us to represent the data accurately and compactly. This work presents a novel fully automated dynamic hierarchical algorithm, together with a novel automatic data partitioning method that works alongside existing specific models (talking head, human motion). Our algorithm is applicable to hierarchical data visualisation and classification, meaningful pattern extraction and recognition, and new data sequence generation. During this work we also investigated problems related to low dimensional data representation: automatic optimal input parameter estimation, and robustness against noise and outliers. We demonstrate our modelling on many data domains (talking head, motion, audio, etc.) and believe it has good potential to adapt to other domains.
Adaptive Regression Methods with Application to Streaming Financial Data
This thesis is concerned with the analysis of adaptive incremental regression algorithms for data streams. The development of these algorithms is motivated by issues pertaining to financial data streams: data that are very noisy, non-stationary, and exhibit high degrees of dependence. These incremental regression techniques are subsequently used to develop efficient and adaptive algorithms for portfolio allocation.
We develop a number of temporally incremental regression algorithms with the following attributes: efficiency, in that the algorithms are iterative; robustness, in that they have a built-in safeguard against outliers and/or use regularisation techniques to mitigate estimation error; and adaptiveness, in that their estimation adapts to the underlying streaming data. These algorithms build on known regression techniques: EWRLS (Exponentially Weighted Recursive Least Squares), TSVD (Truncated Singular Value Decomposition), and FLS (Flexible Least Squares). We focus particular attention on a proposed robust version of the EWRLS algorithm, denoted R-EWRLS, and assess its robustness using a purpose-built simulation engine. This simulation engine generates correlated data streams whose drift and correlation change over time, and can subject them to randomly generated outliers whose magnitudes and directions vary.
The R-EWRLS algorithm is developed further to allow for a self-tuned forgetting factor in the formulation. The forgetting factor is an important tool for handling non-stationarity in the data: through an exponential decay profile, it assigns more weight to the more recent data. The new algorithm is assessed against the R-EWRLS algorithm using various performance measures.
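For context, here is a minimal sketch of plain EWRLS with a fixed forgetting factor; the thesis's R-EWRLS adds robustness safeguards and the self-tuned factor on top of updates like these (the simulated data and all names are illustrative):

```python
# Minimal EWRLS sketch with a fixed forgetting factor lam (illustrative only;
# the thesis's R-EWRLS extends updates like these with robustness safeguards).
import numpy as np

def ewrls_step(theta, P, x, y, lam=0.99):
    """One recursive update; lam < 1 exponentially downweights older observations."""
    Px = P @ x
    k = Px / (lam + x @ Px)              # gain vector
    err = y - x @ theta                  # a-priori prediction error
    theta = theta + k * err
    P = (P - np.outer(k, Px)) / lam      # covariance update with exponential decay
    return theta, P

rng = np.random.default_rng(2)
d, n = 3, 500
true_beta = np.array([1.0, -2.0, 0.5])
theta, P = np.zeros(d), np.eye(d) * 100.0
for _ in range(n):
    x = rng.normal(size=d)
    y = x @ true_beta + 0.1 * rng.normal()
    theta, P = ewrls_step(theta, P, x, y)
print(theta)   # converges toward true_beta on this stationary stream
```

The cost per observation is O(d^2) with no matrix inversion, which is the iterative efficiency the thesis requires; choosing (or self-tuning) `lam` trades tracking speed against estimation variance under non-stationarity.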
A number of applications with real data from equities and foreign exchange are presented. Various measures are computed to compare our algorithms to established portfolio allocation techniques. The results are promising and in many cases outperform benchmark allocation techniques.
Nonlinear Manifold Learning for Data Stream
There has been a renewed interest in understanding the structure of high dimensional data sets based on manifold learning. Examples include the ISOMAP [25], LLE [20], and Laplacian Eigenmap [2] algorithms. Most of these algorithms operate in a "batch" mode and cannot be applied efficiently to a data stream. We propose an incremental version of ISOMAP. Our experiments not only demonstrate the accuracy and efficiency of the proposed algorithm, but also reveal interesting behavior of ISOMAP as the size of the available data increases.
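For contrast with the proposed incremental variant, a batch ISOMAP baseline can be run with scikit-learn's `Isomap` (the dataset and parameters below are illustrative); re-fitting a model like this from scratch on every new sample is exactly the cost an incremental version avoids:

```python
# Batch ISOMAP baseline via scikit-learn (illustrative; the paper's contribution
# is an incremental algorithm that avoids re-running this on each new sample).
from sklearn.datasets import make_swiss_roll
from sklearn.manifold import Isomap

X, _ = make_swiss_roll(n_samples=400, random_state=0)   # classic nonlinear manifold
emb = Isomap(n_neighbors=10, n_components=2).fit(X)     # geodesic distances + MDS step
Y_new = emb.transform(X[:5])                            # map further points into the LDS
print(emb.embedding_.shape, Y_new.shape)
```

Batch ISOMAP recomputes the neighborhood graph, all-pairs geodesic distances, and the eigendecomposition each time, so its cost grows quickly with the stream length, which is the inefficiency the abstract highlights.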