10 research outputs found

    Deep Learning with Tabular Data: A Self-supervised Approach

    We describe a novel approach to training on tabular data using the TabTransformer model with self-supervised learning. Traditional machine learning models for tabular data, such as gradient-boosted decision trees (GBDT), are widely used; our paper examines the effectiveness of the TabTransformer, a Transformer-based model optimised specifically for tabular data. The TabTransformer captures intricate relationships and dependencies among features in tabular data by leveraging the self-attention mechanism of Transformers. In this study we use a self-supervised learning approach, in which the TabTransformer learns from unlabelled data by creating surrogate supervised tasks, eliminating the need for labelled data. The aim is to find the most effective TabTransformer representation of categorical and numerical features, and to address the challenges faced when constructing various input settings for the Transformer. Furthermore, a comparative analysis is conducted to examine the performance of the TabTransformer model against baseline models such as an MLP and a supervised TabTransformer. The research presents a novel approach by creating several variants of the TabTransformer model, namely Binned-TT, Vanilla-MLP-TT, and MLP-based-TT, which help capture the underlying relationships between features of the tabular dataset more effectively by constructing optimal inputs. We further employ self-supervised learning in the form of a masking-based unsupervised setting for tabular data. The findings shed light on the best way to represent categorical and numerical features, highlighting the TabTransformer's performance compared with established machine learning models and other self-supervised learning methods.
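The masking-based surrogate task described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the mask fraction, sentinel value, and function names are assumptions, and a real setup would feed the corrupted rows to the TabTransformer and train it to reconstruct the masked entries.

```python
import numpy as np

def mask_features(X, mask_frac=0.15, mask_value=0.0, rng=None):
    """Surrogate supervised task for self-supervised pretraining:
    randomly mask a fraction of feature cells; the model is then
    trained to reconstruct the original values at masked positions."""
    if rng is None:
        rng = np.random.default_rng(0)
    mask = rng.random(X.shape) < mask_frac  # True where a cell is masked
    X_corrupted = np.where(mask, mask_value, X)
    return X_corrupted, mask

# A toy batch of 4 rows with 5 numerical features (values 1..20).
X = np.arange(20, dtype=float).reshape(4, 5) + 1.0
X_corrupted, mask = mask_features(X, mask_frac=0.4)
```

The pair (X_corrupted, mask) defines the pretext task: predict X[mask] from X_corrupted, which requires no labels at all.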

    A Multiscale Spatiotemporal Approach for Smallholder Irrigation Detection

    This paper presents an irrigation detection methodology that leverages multiscale satellite imagery of vegetation abundance, introducing a process to supplement limited ground-collected labels and ensure classifier applicability in an area of interest. Spatiotemporal analysis of MODIS 250 m enhanced vegetation index (EVI) timeseries characterizes native vegetation phenologies at regional scale to provide the basis for a continuous phenology map that guides supplementary label collection over irrigated and non-irrigated agriculture. Subsequently, validated dry-season greening and senescence cycles observed in 10 m Sentinel-2 imagery are used to train a suite of classifiers for automated detection of potential smallholder irrigation. Strategies to improve model robustness are demonstrated, including a method of data augmentation that randomly shifts training samples, and an assessment of the classifier types that produce the best performance in withheld target regions. The methodology is applied to detect smallholder irrigation in two states in the Ethiopian Highlands, Tigray and Amhara, where detection of irrigated smallholder farm plots is crucial for energy infrastructure planning. Results show that a transformer-based neural network architecture allows for the most robust prediction performance in withheld regions, followed closely by a CatBoost model. Over withheld ground-collection survey labels, the transformer-based model achieves 96.7% accuracy on non-irrigated samples and 95.9% accuracy on irrigated samples. Over a larger set of samples independently collected via the introduced method of label supplementation, non-irrigated and irrigated labels are predicted with 98.3% and 95.5% accuracy, respectively. The detection model is then deployed over Tigray and Amhara, revealing crop rotation patterns and year-over-year irrigated area change. Predictions suggest that irrigated area in these two states has decreased by approximately 40% from 2020 to 2021.
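The augmentation that "randomly shifts training samples" can be sketched as a random circular shift of each EVI timeseries. This is a minimal version under stated assumptions: the shift range and the wrap-around (rather than padded) boundary handling are choices of this sketch, not details given in the abstract.

```python
import numpy as np

def shift_augment(evi_series, max_shift=4, rng=None):
    """Randomly shift a vegetation-index timeseries along the time axis
    (circularly), so the classifier sees greening/senescence cycles that
    start a few observations earlier or later than in the field data."""
    if rng is None:
        rng = np.random.default_rng(0)
    shift = int(rng.integers(-max_shift, max_shift + 1))
    return np.roll(evi_series, shift)

# One synthetic seasonal cycle sampled at 24 points.
series = np.sin(np.linspace(0.0, 2.0 * np.pi, 24))
augmented = shift_augment(series, max_shift=4)
```

Because the shift is circular, the augmented sample contains exactly the same observations as the original, just displaced in time.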

    Individualized survival prediction and surgery recommendation for patients with glioblastoma

    Background: There is a lack of individualized evidence on surgical choices for glioblastoma (GBM) patients. Aim: This study aimed to make individualized treatment recommendations for patients with GBM and to determine the importance of demographic and tumor characteristic variables in the selection of the extent of resection. Methods: We proposed Balanced Decision Ensembles (BDE) to make survival predictions and individualized treatment recommendations. We developed several deep learning (DL) models to counterfactually predict the individual treatment effect (ITE) of patients with GBM. We divided the patients into the recommended (Rec.) and anti-recommended groups based on whether their actual treatment was consistent with the model recommendation. Results: The BDE achieved the best recommendation effects (difference in restricted mean survival time (dRMST): 5.90; 95% confidence interval (CI), 4.40–7.39; hazard ratio (HR): 0.71; 95% CI, 0.65–0.77), followed by BITES and DeepSurv. The inverse probability of treatment weighting (IPTW)-adjusted HR, IPTW-adjusted OR, natural direct effect, and controlled direct effect all demonstrated better survival outcomes in the Rec. group. Conclusion: The ITE calculation method is crucial, as it may result in better or worse recommendations. Furthermore, the significant protective effects of machine recommendations on survival time and mortality indicate the model's suitability for application in patients with GBM. Overall, the model identifies patients with tumors located in the right and left frontal and middle temporal lobes, as well as those with larger tumor sizes, as optimal candidates for SpTR.
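The headline metric above, dRMST, is the difference in area under two survival curves up to a horizon. A minimal sketch (the step-function convention and all numbers below are illustrative, not the study's data):

```python
def rmst(event_times, surv_probs, tau):
    """Restricted mean survival time: area under the step survival curve
    S(t) up to horizon tau, with S(0) = 1 and S constant between events.
    event_times must be sorted; surv_probs[i] is S just after event i."""
    grid = [0.0] + [t for t in event_times if t < tau] + [tau]
    heights = [1.0] + [s for t, s in zip(event_times, surv_probs) if t < tau]
    # Sum of rectangle areas: width of each inter-event interval x S there.
    return sum((grid[i + 1] - grid[i]) * heights[i]
               for i in range(len(heights)))

# dRMST between a recommended and an anti-recommended group, on toy curves
# over a 24-month horizon.
drmst = rmst([6, 18], [0.8, 0.5], tau=24) - rmst([6, 12], [0.6, 0.3], tau=24)
```

A positive dRMST means the recommended group survives longer on average within the horizon, which is the quantity the study reports (5.90 with CI 4.40–7.39).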

    Data Mining Applied to Decision Support Systems for Power Transformers’ Health Diagnostics

    This manuscript addresses the problem of technical state assessment of power transformers based on data preprocessing and machine learning. The initial dataset contains diagnostic results for the power transformers, collected from a variety of different data sources. This leads to dramatic degradation of the quality of the initial dataset due to a substantial number of missing values. The problems of such real-life datasets are considered together with the efforts made to find a balance between data quality and quantity. A data preprocessing method is proposed as a two-iteration data mining technology with simultaneous visualization of the objects’ observability in the form of an image of the dataset, represented by a data area diagram. The visualization improves decision-making quality during the data preprocessing procedure. On the dataset collected by the authors, the two-iteration data preprocessing technology increased the dataset filling degree from 75% to 94%; thus, the number of gaps that had to be filled in with synthetic values was reduced by a factor of 2.5. The processed dataset was used to build machine learning models for classifying the technical state of power transformers. A comparative analysis of different machine learning models was carried out, and the superior efficiency of ensembles of decision trees was validated for the fleet of high-voltage power equipment under consideration. The resulting classification-quality metric, the F1-score, was estimated at 83%. © 2022 by the authors. The research funding from the Ministry of Science and Higher Education of the Russian Federation (Ural Federal University Program of Development within the Priority-2030 Program) is gratefully acknowledged.
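The "filling degree" tracked by the preprocessing technology is simply the share of non-missing cells in the dataset. A minimal sketch (the toy data and the NaN-as-gap convention are assumptions of this illustration):

```python
import numpy as np

def filling_degree(X):
    """Share of non-missing cells in a diagnostics dataset (NaN = gap)."""
    return 1.0 - float(np.mean(np.isnan(X)))

# A toy 4x5 dataset of diagnostic measurements with 5 missing values,
# i.e. a filling degree of 15/20 = 75%, like the authors' raw dataset.
X = np.ones((4, 5))
X[0, 1] = X[1, 3] = X[2, 0] = X[2, 4] = X[3, 2] = np.nan
```

Raising this figure before imputation means fewer synthetic values are needed, which is the point of the two-iteration procedure.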

    Machine learning for particle identification in the LHCb detector

    The LHCb experiment is a specialised b-physics experiment at the Large Hadron Collider at CERN. It has a broad physics programme, with the primary objective being the search for CP-violation effects that would explain the matter-antimatter asymmetry of the Universe. LHCb studies very rare phenomena, making it necessary to process millions of collision events per second to gather enough data in a reasonable time frame; software and data analysis tools are therefore essential for the success of the experiment. Particle identification (PID) is a crucial ingredient of most LHCb results, and its quality depends heavily on the data processing algorithms. This dissertation aims to leverage recent advances in the machine learning field to improve PID at LHCb. The thesis contribution consists of four essential parts related to LHCb internal projects. Muon identification aims to quickly separate muons from the other charged particles using only information from the Muon subsystem. The second contribution is a method that takes a priori information on label noise into account and improves the accuracy of a machine learning model trained to classify such data; such data are common in high-energy physics and are used, in particular, to develop data-driven muon identification methods. Global PID combines information from different subdetectors into a single set of PID variables. Cherenkov detector fast simulation aims to improve the speed of simulating the PID variables in Monte Carlo.
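One standard way to use a priori label-noise information is forward correction: relate the model's clean-label probability to the probability of the *observed* (noisy) label via the known flip rates. This sketch shows only that generic idea, not the thesis's actual method, and the flip-rate values below are hypothetical.

```python
def noisy_positive_prob(p_clean, rho_pos, rho_neg):
    """Forward correction for class-conditional label noise: probability
    of observing a positive label, given the model's clean-label
    probability p_clean, when true positives are flipped to negative
    with rate rho_pos and true negatives to positive with rate rho_neg."""
    return (1.0 - rho_pos) * p_clean + rho_neg * (1.0 - p_clean)

# With 10% of true muons mislabelled and 20% of background mislabelled,
# a certain muon (p_clean = 1.0) is observed as "muon" only 90% of the time.
p_obs = noisy_positive_prob(1.0, rho_pos=0.1, rho_neg=0.2)
```

Training against the corrected probability keeps the loss consistent with the clean labels even though only noisy labels are available.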