1,046 research outputs found

    Unsupervised Learning Techniques for Microseismic and Crosswell Geophysical Data

    Machine learning has served to develop and explore a wide range of applications for geoscientists and petroleum engineers. Fundamental limitations of conventional methodologies include the mathematical formulation of physical systems, multi-scale heterogeneity, the processing of large datasets, and computational time. These new technologies have attracted the interest of multiple energy sectors such as renewables, oil and gas, carbon sequestration, and geothermal. The acquisition of subsurface measurements has been a key factor in characterizing reservoir properties. Hence, the integration of machine learning could provide essential information and new knowledge about subsurface monitoring signals. In this work, we focus on the use of unsupervised learning to derive new insights from geophysical tools and subsurface physical properties. We propose three methodologies using microseismic, distributed acoustic sensing (DAS), crosswell seismic, and electrical resistivity tomography (ERT) data. A critical aspect of these monitoring tools is the high computational demand of big data. We applied unsupervised dimensionality reduction to compress, denoise, and retrieve vital information from microseismic and DAS data. To achieve this, we implemented higher-order SVD (HOSVD) for high-dimensional arrays in 3D and 4D space. For the 3D microseismic data, we achieved a compression of approximately 75% and a reduction of samples from 1,728,000 to 431,303. We also tested the model on the 3D DAS data, where we obtained a compression of 70.2% for a data size of 3.5 GB. Lastly, a 4D HOSVD model was established using a synthetic microseismic tensor, accomplishing a reduction of 83%. Another major application of unsupervised learning is clustering algorithms, which group observations with similar characteristics. We applied spatial-temporal clustering to identify hidden patterns in subsurface mapping for a geological carbon storage field. 
The studies were divided according to the geophysical method (crosswell seismic and ERT) and the temporal component (single time or time series). Using crosswell seismic, we developed a multi-level clustering approach to visualize the CO2 plume behavior. For the first level, we obtained a silhouette score of 0.85, a Calinski-Harabasz score of 160666.50, and a Davies-Bouldin value of 0.43. The second level achieved silhouette, Calinski-Harabasz, and Davies-Bouldin scores of 0.74, 59656.01, and 0.32, respectively. We established a total of four clusters of non-, low-, medium-, and high-SCO2 (CO2 saturation). Finally, we elaborated a spatial-temporal clustering using SCO2 derived from daily ERT images. A novel feature extraction methodology was designed to retrieve the spatial and temporal changes of the moving CO2. Four clusters were determined and linked to the saturation levels. The internal validation of the clusters was 0.58 for the DTW-silhouette score, 262791.45 for the Calinski-Harabasz score, and 0.71 for the Davies-Bouldin index. To evaluate the dynamics of CO2 flow regimes, we performed a second clustering in which 6 distinctive plume patterns were observed. Therefore, machine learning, and in particular unsupervised learning, can be used to describe complex systems and optimize data processing.
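The truncated HOSVD compression described above can be sketched in a few lines of NumPy: unfold the tensor along each mode, keep the leading left singular vectors, and project the tensor onto them to obtain a small core. The shapes and ranks below are toy values for illustration, not the thesis's actual microseismic dimensions.

```python
import numpy as np

def unfold(T, mode):
    """Mode-n matricization: move `mode` to the front and flatten the rest."""
    return np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)

def hosvd_compress(T, ranks):
    """Truncated HOSVD: keep ranks[m] leading left singular vectors per mode."""
    factors = [np.linalg.svd(unfold(T, m), full_matrices=False)[0][:, :r]
               for m, r in enumerate(ranks)]
    core = T
    for m, U in enumerate(factors):
        # contract mode m of the core with U^T
        core = np.moveaxis(np.tensordot(U.T, np.moveaxis(core, m, 0), axes=1), 0, m)
    return core, factors

def hosvd_reconstruct(core, factors):
    T = core
    for m, U in enumerate(factors):
        T = np.moveaxis(np.tensordot(U, np.moveaxis(T, m, 0), axes=1), 0, m)
    return T

# Toy 3D tensor of multilinear rank (1, 1, 1): compresses essentially losslessly.
rng = np.random.default_rng(0)
T = np.einsum('i,j,k->ijk', rng.normal(size=40), rng.normal(size=30), rng.normal(size=20))
core, factors = hosvd_compress(T, ranks=(1, 1, 1))
stored = core.size + sum(U.size for U in factors)   # values kept: 1 + 40 + 30 + 20
rel_err = np.linalg.norm(T - hosvd_reconstruct(core, factors)) / np.linalg.norm(T)
```

The 4D case uses the same loop with a fourth mode; on real data the per-mode ranks would be chosen from the singular-value decay rather than known in advance.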

    Rigid Transformations for Stabilized Lower Dimensional Space to Support Subsurface Uncertainty Quantification and Interpretation

    Subsurface datasets inherently possess big data characteristics such as vast volume, diverse features, and high sampling speeds, further compounded by the curse of dimensionality from various physical, engineering, and geological inputs. Among the existing dimensionality reduction (DR) methods, nonlinear dimensionality reduction (NDR) methods, especially metric multidimensional scaling (MDS), are preferred for subsurface datasets due to their inherent complexity. While MDS retains the intrinsic data structure and quantifies uncertainty, its limitations include solutions that are not unique, being determined only up to Euclidean transformations, and the absence of an out-of-sample-point (OOSP) extension. To enhance subsurface inferential and machine learning workflows, datasets must be transformed into stable, reduced-dimension representations that accommodate OOSP. Our solution employs rigid transformations to produce a stabilized, Euclidean-invariant representation of the lower-dimensional space (LDS). By computing an MDS input dissimilarity matrix and applying rigid transformations to multiple realizations, we ensure transformation invariance and integrate OOSP. This process leverages a convex hull algorithm and incorporates a loss function and normalized stress for distortion quantification. We validate our approach with synthetic data, varying distance metrics, and real-world wells from the Duvernay Formation. The results confirm our method's efficacy in achieving consistent LDS representations. Furthermore, our proposed "stress ratio" (SR) metric provides insight into uncertainty, beneficial for model adjustments and inferential analysis. Consequently, our workflow promises enhanced repeatability and comparability in NDR for subsurface energy resource engineering and associated big data workflows. Comment: 30 pages, 17 figures, submitted to the Computational Geosciences Journal
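The rigid-transformation idea can be illustrated with an orthogonal Procrustes alignment: two MDS realizations that differ only by rotation and translation are mapped onto a common frame, removing the Euclidean ambiguity. This is a minimal sketch of the stabilization step only, not the authors' full workflow (which also involves the convex hull algorithm, OOSP integration, and the stress-ratio metric).

```python
import numpy as np

def rigid_align(A, B):
    """Align point set B to A with a rigid transformation (rotation/reflection
    plus translation), via the orthogonal Procrustes solution."""
    ca, cb = A.mean(axis=0), B.mean(axis=0)
    A0, B0 = A - ca, B - cb                 # remove translation
    U, _, Vt = np.linalg.svd(B0.T @ A0)
    R = U @ Vt                              # optimal orthogonal map
    return B0 @ R + ca

rng = np.random.default_rng(1)
A = rng.normal(size=(50, 2))                # one 2D MDS realization
theta = 0.7
rot = np.array([[np.cos(theta), -np.sin(theta)],
                [np.sin(theta),  np.cos(theta)]])
B = A @ rot + np.array([3.0, -1.5])         # same embedding, rotated and shifted
aligned = rigid_align(A, B)
err = np.linalg.norm(aligned - A)           # should vanish up to round-off
```

Applied to every realization, this yields LDS coordinates that are comparable across runs, which is the repeatability the abstract targets.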

    The 11th Conference of PhD Students in Computer Science


    An overview of clustering methods with guidelines for application in mental health research

    Cluster analyses have been widely used in mental health research to decompose inter-individual heterogeneity by identifying more homogeneous subgroups of individuals. However, despite advances in new algorithms and their increasing popularity, there is little guidance on model choice, the analytical framework, and reporting requirements. In this paper, we aimed to address this gap by introducing the philosophy, design, advantages/disadvantages, and implementation of major algorithms that are particularly relevant to mental health research. Extensions of the basic models, such as kernel methods, deep learning, semi-supervised clustering, and clustering ensembles, are subsequently introduced. How to choose algorithms to address common issues, as well as methods for pre-clustering data processing, clustering evaluation, and validation, are then discussed. Importantly, we also provide general guidance on the clustering workflow and reporting requirements. To facilitate the implementation of different algorithms, we provide information on R functions and libraries.
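As a concrete instance of the internal-validation measures such guidelines discuss, the silhouette score can be computed in a few lines. The paper itself points to R implementations; this plain-Python sketch is only meant to show what the index measures (within-cluster cohesion versus nearest-cluster separation).

```python
import math

def silhouette(points, labels):
    """Mean silhouette score: (b - a) / max(a, b) per point, where a is the
    mean distance to the point's own cluster and b is the smallest mean
    distance to any other cluster. Assumes every cluster has >= 2 points."""
    scores = []
    for i, p in enumerate(points):
        by_cluster = {}
        for j, q in enumerate(points):
            if i == j:
                continue
            by_cluster.setdefault(labels[j], []).append(math.dist(p, q))
        a = sum(by_cluster[labels[i]]) / len(by_cluster[labels[i]])
        b = min(sum(d) / len(d) for c, d in by_cluster.items() if c != labels[i])
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

# Two well-separated toy clusters: the score should approach 1.
pts = [(0, 0), (0, 1), (1, 0), (9, 9), (9, 10), (10, 9)]
labels = [0, 0, 0, 1, 1, 1]
score = silhouette(pts, labels)
```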

    Acta Polytechnica Hungarica 2006


    Numerical Pattern Mining Through Compression

    Pattern Mining (PM) has a prominent place in Data Science and finds application in a wide range of domains. To avoid the exponential explosion of patterns, different methods have been proposed. They are based on assumptions about interestingness and usually return very different pattern sets. In this paper, we propose to use a compression-based objective as a well-justified and robust interestingness measure. We define description lengths for datasets and use the Minimum Description Length (MDL) principle to find the patterns that ensure the best compression. Our experiments show that applying MDL to numerical data yields small and characteristic subsets of patterns that describe the data in a compact way.
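The two-part MDL score behind this approach can be sketched as follows: a candidate pattern pays for its own description, but rows it covers can be encoded within narrower intervals. The concrete encoding below (uniform codes over an integer domain, a one-bit covered/uncovered flag, interval patterns) is an illustrative assumption, not the paper's actual description lengths.

```python
import math

def bits(n):
    """Cost in bits of a uniform choice among n possibilities."""
    return math.log2(n)

def dl_literal(rows, domain):
    """Description length when every value is encoded literally."""
    return sum(bits(domain) for row in rows for _ in row)

def dl_with_pattern(rows, pattern, domain):
    """Two-part MDL: cost of the pattern (an interval per attribute) plus
    the data cost, which is cheaper for rows the pattern covers."""
    lo, hi = pattern
    model_cost = 2 * len(lo) * bits(domain)          # encode the box bounds
    data_cost = 0.0
    for row in rows:
        if all(l <= v <= h for v, l, h in zip(row, lo, hi)):
            # covered row: 1 flag bit, then each value within its interval
            data_cost += 1 + sum(bits(h - l + 1) for l, h in zip(lo, hi))
        else:
            data_cost += 1 + dl_literal([row], domain)
    return model_cost + data_cost

rows = [(3, 4), (3, 5), (4, 4), (4, 5), (90, 12)]    # four similar rows, one outlier
domain = 100                                          # values drawn from 0..99
baseline = dl_literal(rows, domain)
pattern = ((3, 4), (4, 5))                            # interval box [3,4] x [4,5]
scored = dl_with_pattern(rows, pattern, domain)       # compresses below baseline
```

A mining loop would search over candidate boxes and keep those that lower the total description length, which is exactly the "best compression" criterion the abstract describes.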

    Reconstruction and classification of unknown DNA sequences

    The continuous advances in DNA sequencing technologies and techniques in metagenomics require reliable reconstruction and accurate classification methodologies in order to increase the diversity of the natural repository while contributing to the description and organization of organisms. However, after sequencing and de-novo assembly, one of the most complex challenges comes from DNA sequences that do not match or resemble any biological sequence in the literature. Three main reasons contribute to this exception: the organism's sequence is highly divergent from the known organisms in the literature, an irregularity was created in the reconstruction process, or a new organism has been sequenced. The inability to efficiently classify these unknown sequences increases the uncertainty about the sample's constitution and becomes a wasted opportunity to discover new species, since such sequences are often discarded. In this context, the main objective of this thesis is the development and validation of a tool that provides an efficient computational solution to these three challenges, based on an ensemble of experts, namely compression-based predictors, the distribution of sequence content, and normalized sequence lengths. The method uses both DNA and amino acid sequences and provides efficient classification beyond standard referential comparisons. Unusually, it classifies DNA sequences without resorting directly to the reference genomes, but rather to features that the species' biological sequences share. Specifically, it only makes use of features extracted individually from each genome, without using sequence comparisons. RFSC was then created as a machine learning classification pipeline that relies on an ensemble of experts to provide efficient classification in metagenomic contexts. 
This pipeline was tested on synthetic and real data, both achieving precise and accurate results that, at the time of the development of this thesis, had not been reported in the state of the art. Specifically, it achieved an accuracy of approximately 97% in domain/type classification. Additionally, the pipeline is fully automatic and allows reference-free reconstruction of genomes from FASTQ reads, with the added guarantee of secure storage of sensitive information.
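One standard compression-based predictor of the kind such an ensemble can build on is the normalized compression distance (NCD). The zlib-based sketch below is illustrative only; the thesis's RFSC pipeline relies on features extracted individually per genome rather than pairwise sequence comparison.

```python
import random
import zlib

def clen(x: bytes) -> int:
    """Compressed length of x under zlib at maximum effort."""
    return len(zlib.compress(x, 9))

def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance: near 0 when the sequences share
    structure the compressor can exploit, near 1 when they are unrelated."""
    cx, cy, cxy = clen(x), clen(y), clen(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)

seq_a = b"ACGTACGTACGT" * 50                              # repetitive toy "genome"
seq_b = b"ACGTACGTACGT" * 50                              # near-identical relative
seq_c = bytes(random.Random(0).choices(b"ACGT", k=600))   # unrelated random sequence
```

In a classification setting, an unknown sequence would be assigned toward the class whose representatives yield the smallest distances; here `ncd(seq_a, seq_b)` is much smaller than `ncd(seq_a, seq_c)`.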

    Learning regions of interest from low level maps in virtual microscopy

    Virtual microscopy can improve the workflow of modern pathology laboratories, a goal limited by the large size of the virtual slides (VS). Lately, determining regions of interest (RoIs) has proved useful for navigation and compression tasks. This work presents a novel method for establishing RoIs in VS, based on a relevance score calculated from example images selected by a pathologist. The process starts by splitting the VS into a grid of blocks, each represented by a set of low-level features that aim to capture very basic visual properties, namely color, intensity, orientation, and texture. The expert then selects two blocks, i.e., a typical relevant and a typical irrelevant instance. Different similarity (dissimilarity) maps are then constructed using these positive (negative) examples. The obtained maps are integrated by a normalization process that promotes maps whose global similarity maximum largely exceeds the average of the local maxima. Each block is thus assigned a relevance score, established by the number of closest positive (negative) examples. Evaluation was carried out using 8 VS from different tissues, upon which a group of three pathologists had navigated. Precision-recall measurements were calculated at each step of every actual navigation, obtaining an average precision of 55% and a recall of about 38% when using the available set of navigations.
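The map-normalization step (promoting maps whose global maximum stands well above the average local maximum) can be sketched as follows. The 4-neighbour peak detection and the squared weighting are illustrative assumptions, not the paper's exact operator.

```python
import numpy as np

def promote(m):
    """Weight a similarity map by (global max - mean of local maxima)^2, so
    maps with one dominant peak outweigh maps with many comparable peaks."""
    m = (m - m.min()) / (m.max() - m.min() + 1e-12)   # rescale to [0, 1]
    interior = m[1:-1, 1:-1]
    # local maxima: interior cells strictly greater than their 4 neighbours
    is_peak = ((interior > m[:-2, 1:-1]) & (interior > m[2:, 1:-1]) &
               (interior > m[1:-1, :-2]) & (interior > m[1:-1, 2:]))
    peaks = interior[is_peak]
    mean_local = peaks.mean() if peaks.size else 0.0
    return m * (m.max() - mean_local) ** 2

rng = np.random.default_rng(2)
single = rng.random((8, 8)) * 0.1     # weak background...
single[4, 4] = 1.0                    # ...with one dominant peak
multi = rng.random((8, 8))            # many comparable peaks
weighted_single = promote(single)     # kept strong after weighting
weighted_multi = promote(multi)       # suppressed after weighting
```

After this weighting, summing the per-feature maps yields the integrated relevance map from which block scores are derived.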

    Machine Learning in Wireless Sensor Networks: Algorithms, Strategies, and Applications

    Wireless sensor networks monitor dynamic environments that change rapidly over time. This dynamic behavior is either caused by external factors or initiated by the system designers themselves. To adapt to such conditions, sensor networks often adopt machine learning techniques to eliminate the need for unnecessary redesign. Machine learning also inspires many practical solutions that maximize resource utilization and prolong the lifespan of the network. In this paper, we present an extensive literature review, covering the period 2002-2013, of machine learning methods that have been used to address common issues in wireless sensor networks (WSNs). The advantages and disadvantages of each proposed algorithm are evaluated against the corresponding problem. We also provide a comparative guide to aid WSN designers in developing suitable machine learning solutions for their specific application challenges. Comment: Accepted for publication in IEEE Communications Surveys and Tutorials