
    Autoencoder-Based Generator for Credit Information Recovery of Rural Banks

    By using machine learning algorithms, banks and other lending institutions can construct intelligent risk control models for loan businesses, which helps to overcome the disadvantages of traditional evaluation methods, such as low efficiency and excessive reliance on the subjective judgment of auditors. However, in the practical evaluation process, it is inevitable to encounter data with missing credit characteristics. Filling in these missing characteristics is therefore crucial for training such machine learning algorithms, especially when they are applied to rural banks with little credit data. In this work, we propose an autoencoder-based algorithm that uses the correlation between data to restore missing items in the features. We also selected several open-source datasets (German Credit Data, Give Me Some Credit on the Kaggle platform, etc.) as training and test data to verify the algorithm. The comparison results show that our model outperforms the others, although the performance of the autoencoder-based feature restorer decreases significantly when the feature missing ratio exceeds 70%.
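
    The abstract does not spell out the architecture, so the following is only a minimal sketch of the general idea, a denoising-autoencoder imputer in PyTorch; the layer sizes, loss, and training loop are assumptions, not the paper's design.

    import torch
    import torch.nn as nn

    class ImputerAE(nn.Module):
        """One-hidden-layer autoencoder that reconstructs all features."""
        def __init__(self, n_features, hidden=32):
            super().__init__()
            self.encoder = nn.Sequential(nn.Linear(n_features, hidden), nn.ReLU())
            self.decoder = nn.Linear(hidden, n_features)

        def forward(self, x):
            return self.decoder(self.encoder(x))

    def train_imputer(x_complete, missing_ratio=0.3, epochs=200, lr=1e-3):
        # x_complete: (n_samples, n_features) float tensor with no missing values.
        model = ImputerAE(x_complete.shape[1])
        opt = torch.optim.Adam(model.parameters(), lr=lr)
        for _ in range(epochs):
            # Corrupt a random subset of cells, then try to reconstruct them.
            mask = (torch.rand_like(x_complete) > missing_ratio).float()
            recon = model(x_complete * mask)
            # Score the reconstruction only on the artificially masked cells.
            loss = ((recon - x_complete) ** 2 * (1 - mask)).mean()
            opt.zero_grad(); loss.backward(); opt.step()
        return model

    def impute(model, x, observed_mask):
        # Fill cells where observed_mask == 0 with the autoencoder's output.
        x0 = torch.nan_to_num(x, nan=0.0) * observed_mask
        with torch.no_grad():
            recon = model(x0)
        return x0 + recon * (1 - observed_mask)

    Training on artificially masked cells forces the network to exploit correlations between features, which is what lets it fill in genuinely missing cells at imputation time.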

    Discovering linear causal model from incomplete data

    A common drawback of algorithms for learning linear causal models is that they cannot deal with incomplete data sets. This is unfortunate, since many real problems involve missing data or even hidden variables. In this paper, based on multiple imputation, we propose a three-step process to learn linear causal models from incomplete data sets. Experimental results indicate that this algorithm is better than the single-imputation method (the EM algorithm) and simple list deletion, and that for lower missing rates it can even find better models than those produced by the greedy learning algorithm MLGS working on a complete data set. In addition, the method is amenable to parallel or distributed processing, which is an important characteristic for data mining in large data sets.
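
    The paper's three-step procedure is not reproduced here; the sketch below only illustrates the core multiple-imputation loop. A plain linear regression stands in for the causal-model learner, and scikit-learn's IterativeImputer with posterior sampling stands in for the imputation step, so all of those choices are assumptions.

    import numpy as np
    from sklearn.experimental import enable_iterative_imputer  # noqa: F401
    from sklearn.impute import IterativeImputer
    from sklearn.linear_model import LinearRegression

    def multiple_imputation_fits(X, y, m=5):
        # Draw m completed copies of X, fit one model per copy, then pool.
        coefs = []
        for seed in range(m):
            imputer = IterativeImputer(sample_posterior=True, random_state=seed)
            X_filled = imputer.fit_transform(X)
            coefs.append(LinearRegression().fit(X_filled, y).coef_)
        return np.mean(coefs, axis=0)  # Rubin-style pooling of point estimates

    Because each of the m imputed data sets is processed independently, the per-copy fits can run in parallel, which is the property the abstract highlights for large data sets.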

    Imputation Techniques in Machine Learning – A Survey

    Machine learning plays a pivotal role in data analysis and information extraction. However, one common challenge encountered in this process is dealing with missing values. Missing data can find its way into datasets for a variety of reasons: errors during data collection and management, intentional omissions, or even human error. It is important to note that most machine learning models are not designed to handle missing values directly. Consequently, it becomes essential to perform data imputation before feeding the data into a machine learning model. Multiple techniques are available for imputing missing values, and the choice of technique should be made judiciously, considering various parameters; an inappropriate choice can disrupt the overall distribution of data values and subsequently impact the model's performance. In this paper, various imputation methods, including mean, median, K-nearest neighbors (KNN)-based imputation, linear regression, MissForest, and MICE, are examined.
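
    Most of the surveyed techniques have readily available implementations; as a minimal illustration (the toy matrix below is made up), scikit-learn covers mean, median, KNN, and MICE-style imputation directly:

    import numpy as np
    from sklearn.experimental import enable_iterative_imputer  # noqa: F401
    from sklearn.impute import SimpleImputer, KNNImputer, IterativeImputer

    X = np.array([[1.0, 2.0], [np.nan, 3.0], [7.0, np.nan], [4.0, 5.0]])

    imputers = {
        "mean":   SimpleImputer(strategy="mean"),
        "median": SimpleImputer(strategy="median"),
        "knn":    KNNImputer(n_neighbors=2),
        "mice":   IterativeImputer(max_iter=10, random_state=0),
    }
    for name, imp in imputers.items():
        print(name, imp.fit_transform(X), sep="\n")
    # A MissForest-style imputer can be approximated by passing a
    # RandomForestRegressor as IterativeImputer's estimator.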

    Joints in Random Forests

    Decision Trees (DTs) and Random Forests (RFs) are powerful discriminative learners and tools of central importance to the everyday machine learning practitioner and data scientist. Due to their discriminative nature, however, they lack principled methods to process inputs with missing features or to detect outliers, which requires pairing them with imputation techniques or a separate generative model. In this paper, we demonstrate that DTs and RFs can naturally be interpreted as generative models, by drawing a connection to Probabilistic Circuits, a prominent class of tractable probabilistic models. This reinterpretation equips them with a full joint distribution over the feature space and leads to Generative Decision Trees (GeDTs) and Generative Forests (GeFs), a family of novel hybrid generative-discriminative models. This family of models retains the overall characteristics of DTs and RFs while additionally being able to handle missing features by means of marginalisation. Under certain assumptions, frequently made for Bayes consistency results, we show that consistency in GeDTs and GeFs extends to any pattern of input features missing at random. Empirically, we show that our models often outperform common routines to treat missing data, such as K-nearest neighbour imputation, and moreover that our models can naturally detect outliers by monitoring the marginal probability of input features.
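
    The GeF/GeDT models themselves are probabilistic circuits, which the sketch below does not attempt to reproduce; it only illustrates the underlying marginalisation idea on a toy decision tree: when the split feature is missing, average both children's predictions, weighted by the fraction of training data routed each way. The dict-based tree format is invented for the example.

    def predict_marginal(node, x):
        # node: {"feature", "threshold", "p_left", "left", "right"}, or a
        # leaf {"value": v}; x: dict of feature -> value, absent key = missing.
        if "value" in node:                  # leaf
            return node["value"]
        v = x.get(node["feature"])
        if v is None:                        # missing: marginalise over the split
            return (node["p_left"] * predict_marginal(node["left"], x)
                    + (1 - node["p_left"]) * predict_marginal(node["right"], x))
        child = node["left"] if v <= node["threshold"] else node["right"]
        return predict_marginal(child, x)

    tree = {"feature": "income", "threshold": 3.0, "p_left": 0.6,
            "left": {"value": 0.2}, "right": {"value": 0.9}}
    print(predict_marginal(tree, {"income": 5.0}))  # 0.9
    print(predict_marginal(tree, {}))               # 0.6*0.2 + 0.4*0.9 = 0.48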