214,843 research outputs found

    DropIn: Making Reservoir Computing Neural Networks Robust to Missing Inputs by Dropout

    Full text link
    The paper presents a novel, principled approach to training recurrent neural networks from the Reservoir Computing family that are robust to missing input features at prediction time. Building on the ensembling properties of Dropout regularization, we propose a methodology, named DropIn, which efficiently trains a neural model as a committee machine of subnetworks, each capable of predicting with a subset of the original input features. We discuss the application of the DropIn methodology to Reservoir Computing models and to applications characterized by input sources that are unreliable or prone to disconnection, such as pervasive wireless sensor networks and ambient intelligence. We provide an experimental assessment using real-world data from such application domains, showing how the DropIn methodology maintains predictive performance comparable to that of a model with no missing features, even when 20%-50% of the inputs are unavailable.
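    A minimal sketch of the DropIn idea on a toy echo state network (a common Reservoir Computing model): random subsets of the input features are zeroed while fitting the readout, so the trained model tolerates disconnected inputs at prediction time. This is an illustrative reconstruction rather than the authors' implementation; the reservoir size, drop rate, and ridge penalty below are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_res, T = 8, 200, 1000

# Fixed random reservoir, rescaled to a spectral radius below 1 for stability.
W_in = rng.uniform(-0.5, 0.5, size=(n_res, n_in))
W = rng.normal(size=(n_res, n_res))
W *= 0.9 / np.max(np.abs(np.linalg.eigvals(W)))

def run_reservoir(U):
    """Collect reservoir states for an input sequence U of shape (T, n_in)."""
    x = np.zeros(n_res)
    states = np.empty((len(U), n_res))
    for t, u in enumerate(U):
        x = np.tanh(W_in @ u + W @ x)
        states[t] = x
    return states

# Toy task: predict the sum of the input channels one step ahead.
U = rng.normal(size=(T, n_in))
y = np.roll(U.sum(axis=1), -1)

# DropIn-style training: zero out inputs at random so the readout is fit on
# many different subsets of the input features.
drop_rate = 0.3
U_train = U * (rng.random(U.shape) > drop_rate)
S = run_reservoir(U_train)

# Ridge-regression readout.
lam = 1e-2
W_out = np.linalg.solve(S.T @ S + lam * np.eye(n_res), S.T @ y)

# At prediction time, simulate three of the eight input sources being disconnected.
U_test = U.copy()
U_test[:, :3] = 0.0
pred = run_reservoir(U_test) @ W_out
print("MSE with 3 of 8 inputs missing:", np.mean((pred - y) ** 2))
```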

    On methods for prediction based on complex data with missing values and robust principal component analysis

    Get PDF
    Massive volumes of data are currently being generated, and at astonishing speed. Technological advances are making it cheaper and more accessible for companies and institutions to obtain or generate large flows of data. These data can contain different types of complexities such as unobserved values, illogical values, and extreme observations, among many others. On the other hand, researchers sometimes face limitations in obtaining samples. For instance, it can be costly to grow an organism in a lab, so a researcher may prefer to grow just a few of them at the expense of lower-quality results. This type of data often has a large number of features measured on only a small number of observations, so that the dimension of the data is much larger than its size. Think, for example, of microarray data. Very often, practitioners are more concerned with the proper collection of the data than with performing a correct data analysis. In this work we discuss methods for two relevant steps in data analysis. We first look at methods for the exploratory step, where the practitioner wants to dig through the big flow of information to start understanding its structure and features. Next, we discuss methods for the statistical data analysis and focus on one of the most important tasks in this step: predicting an outcome. We also want to address common complexities of real applications such as high-dimensional data, atypical data, and missing values.

    More specifically, this thesis starts by discussing methods for principal component analysis, one of the most popular exploratory tools. These methods are extensions of the classical principal components approach that are resistant to atypical data. Chapter 1 describes the Multivariate S- and the Multivariate least trimmed squares estimators for principal components and proposes an algorithm that can yield more robust results and be computationally faster for high-dimensional problems than existing algorithms for these and other robust methods. We show that the corresponding functionals are Fisher-consistent at elliptical distributions. Moreover, we study the robustness properties of the Multivariate S-estimator by deriving its influence function. However, the Multivariate S- and the Multivariate least trimmed squares estimators only target casewise outliers, i.e., observations that are either regular or outlying as a whole. Chapter 2 introduces a new method for principal components that is shown to be more powerful against outliers: the coordinatewise least trimmed squares estimator. In particular, our proposal can handle cellwise outliers, which are very common in modern high-dimensional datasets. We adapt our algorithm for the multivariate methods to fit the coordinatewise least trimmed squares estimator, so that it can also be computed quickly in higher dimensions. In addition, we introduce the functional of the estimator, which can be shown to be Fisher-consistent at elliptical distributions. Chapter 3 extends these three methods to the functional data setting and shows that these extensions preserve the robust characteristics of the methods in the multivariate setting. In Chapter 4 we give some concluding remarks on the robust principal components procedures discussed in Chapters 1, 2, and 3.

    The last chapter of the thesis covers the topic of prediction with missing values. To make predictions we consider tree-based methods. Trees are a popular data mining technique that allows one to make predictions on data of different types and with missing values. We compare the prediction performance of tree-based techniques when the available training data contain features with missing values. The missing values are handled either by using surrogate decisions within the trees or by combining an imputation method with a tree-based method. Both classification and regression problems are considered. Overall, our results show that for smaller fractions of missing data, an ensemble method combined with surrogates or single imputation suffices. For moderate to large fractions of missing values, ensemble methods based on conditional inference trees combined with multiple imputation show the best performance, while conditional bagging using surrogates is a good alternative for high-dimensional prediction problems. Theoretical results confirm the potentially better prediction performance of multiple imputation ensembles.
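    A minimal sketch of one of the comparisons in the abstract above: predicting with tree ensembles when the training features contain missing values, handled either by single imputation or by a simple stand-in for multiple imputation whose per-dataset predictions are averaged. scikit-learn trees do not offer surrogate splits, so that variant is omitted; the dataset, missingness fraction, and hyperparameters are illustrative choices.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor
from sklearn.experimental import enable_iterative_imputer  # noqa: F401  (required before importing IterativeImputer)
from sklearn.impute import IterativeImputer, SimpleImputer
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X, y = load_diabetes(return_X_y=True)
X[rng.random(X.shape) < 0.3] = np.nan          # make 30% of the cells missing

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Single imputation + random forest.
imp = SimpleImputer(strategy="mean")
rf = RandomForestRegressor(n_estimators=200, random_state=0)
rf.fit(imp.fit_transform(X_tr), y_tr)
pred_single = rf.predict(imp.transform(X_te))

# A crude multiple-imputation ensemble: impute several times with different
# posterior draws, fit a forest on each completed dataset, average predictions.
preds = []
for m in range(5):
    imp_m = IterativeImputer(sample_posterior=True, random_state=m)
    rf_m = RandomForestRegressor(n_estimators=200, random_state=m)
    rf_m.fit(imp_m.fit_transform(X_tr), y_tr)
    preds.append(rf_m.predict(imp_m.transform(X_te)))
pred_multi = np.mean(preds, axis=0)

print("single imputation MSE:  ", mean_squared_error(y_te, pred_single))
print("multiple imputation MSE:", mean_squared_error(y_te, pred_multi))
```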
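    The earlier part of the same abstract concerns trimmed estimators for principal components. Below is a toy, casewise least-trimmed-squares flavour of robust PCA that repeatedly fits a rank-k subspace to the best-fitting fraction of observations; it is a simplified illustration, not the multivariate or coordinatewise estimators studied in the thesis, and the rank and trimming fraction are arbitrary.

```python
import numpy as np

def trimmed_pca(X, k=2, alpha=0.75, n_iter=50, seed=0):
    """Fit a rank-k PCA on the alpha-fraction of rows with the smallest residuals."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    h = int(alpha * n)
    subset = rng.choice(n, size=h, replace=False)          # random initial subset
    for _ in range(n_iter):
        mu = X[subset].mean(axis=0)
        # Principal directions from an SVD of the centred subset.
        _, _, Vt = np.linalg.svd(X[subset] - mu, full_matrices=False)
        V = Vt[:k].T
        # Squared reconstruction error of every observation under the current fit.
        resid = (X - mu) - (X - mu) @ V @ V.T
        err = np.sum(resid ** 2, axis=1)
        new_subset = np.argsort(err)[:h]                    # keep the h best-fitting rows
        if set(new_subset) == set(subset):
            break
        subset = new_subset
    return mu, V

# Toy data: a rank-2 signal in 10 dimensions plus 10% gross casewise outliers.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2)) @ rng.normal(size=(2, 10)) + 0.1 * rng.normal(size=(200, 10))
X[:20] += 15 * rng.normal(size=(20, 10))

mu, V = trimmed_pca(X, k=2)
resid = (X - mu) - (X - mu) @ V @ V.T
worst = np.argsort(np.sum(resid ** 2, axis=1))[-20:]        # 20 worst-fitting rows
print("injected outliers among the 20 worst-fitting rows:", int(np.sum(worst < 20)))
```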

    A-SFS: Semi-supervised Feature Selection based on Multi-task Self-supervision

    Full text link
    Feature selection is an important process in machine learning. It builds an interpretable and robust model by selecting the features that contribute the most to the prediction target. However, most mature feature selection algorithms, both supervised and semi-supervised, fail to fully exploit the complex potential structure between features. We believe that these structures are very important for the feature selection process, especially when labels are lacking and data are noisy. To this end, we introduce a deep learning-based self-supervised mechanism into feature selection, namely batch-Attention-based Self-supervision Feature Selection (A-SFS). First, a multi-task self-supervised autoencoder is designed to uncover the hidden structure among features with the support of two pretext tasks. Guided by the integrated information from the multi-task self-supervised learning model, a batch-attention mechanism is designed to generate feature weights according to batch-based feature selection patterns, alleviating the impact of a handful of noisy samples. The method is compared against 14 strong benchmarks, including LightGBM and XGBoost. Experimental results show that A-SFS achieves the highest accuracy on most datasets. Furthermore, the design significantly reduces the reliance on labels: only 1/10 of the labeled data is needed to match the performance of the state-of-the-art baselines. Results also show that A-SFS is the most robust to noisy and missing data.
    Comment: 18 pages, 7 figures; accepted by Knowledge-Based Systems
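    A toy sketch of the general idea of self-supervised feature selection: an autoencoder is trained on a feature-masking pretext task together with a learnable per-feature gate, and the learned gate weights are used to rank features. This only illustrates the concept; it is not the A-SFS architecture, its two pretext tasks, or its batch-attention mechanism, and all layer sizes and hyperparameters are arbitrary.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
n, p, k = 512, 20, 5
X = torch.randn(n, p)
X[:, :k] = X[:, :k] @ torch.randn(k, k)        # give the first k features shared structure

gate = nn.Parameter(torch.zeros(p))            # per-feature importance logits
encoder = nn.Sequential(nn.Linear(p, 32), nn.ReLU(), nn.Linear(32, 8))
decoder = nn.Sequential(nn.Linear(8, 32), nn.ReLU(), nn.Linear(32, p))
opt = torch.optim.Adam([gate, *encoder.parameters(), *decoder.parameters()], lr=1e-2)

for epoch in range(200):
    mask = (torch.rand_like(X) > 0.3).float()  # pretext task: hide 30% of the cells
    x_in = X * mask * torch.sigmoid(gate)      # gated, partially masked input
    loss = ((decoder(encoder(x_in)) - X) ** 2).mean()
    loss = loss + 1e-3 * torch.sigmoid(gate).sum()   # sparsity pressure on the gates
    opt.zero_grad()
    loss.backward()
    opt.step()

importance = torch.sigmoid(gate).detach()
print("feature ranking (most to least important):",
      torch.argsort(importance, descending=True).tolist())
```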

    Predicting the outcome of renal transplantation

    Get PDF
    Objective: Renal transplantation has dramatically improved the survival rate of hemodialysis patients. However, with a growing proportion of marginal organs and improved immunosuppression, it is necessary to verify that the established allocation system, mostly based on human leukocyte antigen matching, still meets today's needs. The authors turn to machine-learning techniques to predict, from donor-recipient data, the estimated glomerular filtration rate (eGFR) of the recipient 1 year after transplantation.
    Design: The patient's eGFR was predicted using donor-recipient characteristics available at the time of transplantation. Donors' data were obtained from Eurotransplant's database, while recipients' details were retrieved from Charite Campus Virchow-Klinikum's database. A total of 707 renal transplantations from cadaveric donors were included.
    Measurements: Two separate datasets were created, one taking features with <10% missing values and the other features with <50% missing values. Four established regressors were run on both datasets, with and without feature selection.
    Results: The authors obtained a Pearson correlation coefficient between predicted and real eGFR (COR) of 0.48. The best model was a Gaussian support vector machine with recursive feature elimination on the more inclusive dataset. All results are available at http://transplant.molgen.mpg.de/.
    Limitations: For now, missing values in the data must be predicted and filled in. The performance is not as high as hoped, but the dataset seems to be the main cause.
    Conclusions: Predicting the outcome is possible with the dataset at hand (COR = 0.48). Valuable features include the age and creatinine levels of the donor, as well as the sex and weight of the recipient.
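    A minimal sketch of the pipeline outlined in the abstract, under several assumptions: missing values are filled by simple imputation, features are ranked by recursive feature elimination (with a linear SVR, since scikit-learn's RFE needs coefficient-based rankings), a Gaussian (RBF-kernel) support vector regressor is then fit, and the Pearson correlation between predicted and observed eGFR is reported. The data are synthetic placeholders, not the Eurotransplant/Charite cohort.

```python
import numpy as np
from scipy.stats import pearsonr
from sklearn.feature_selection import RFE
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

rng = np.random.default_rng(0)
n, p = 707, 30                       # 707 transplantations as in the abstract; placeholder feature count
X = rng.normal(size=(n, p))
y = X[:, :5] @ rng.normal(size=5) + rng.normal(scale=2.0, size=n)   # stand-in for 1-year eGFR
X[rng.random((n, p)) < 0.2] = np.nan                                # simulate missing values

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

model = Pipeline([
    ("impute", SimpleImputer(strategy="median")),     # missing values must be filled in first
    ("scale", StandardScaler()),
    # Rank features with RFE on a linear SVR (an RBF SVR exposes no coefficients) ...
    ("rfe", RFE(SVR(kernel="linear"), n_features_to_select=10)),
    # ... then fit the Gaussian (RBF) support vector regressor on the kept features.
    ("svr", SVR(kernel="rbf", C=10.0)),
])
model.fit(X_tr, y_tr)

cor, _ = pearsonr(model.predict(X_te), y_te)
print(f"Pearson correlation between predicted and observed eGFR: {cor:.2f}")
```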