Missing Value Imputation With Unsupervised Backpropagation
Many data mining and data analysis techniques operate on dense matrices or
complete tables of data. Real-world data sets, however, often contain unknown
values. Even many classification algorithms that are designed to operate with
missing values still exhibit deteriorated accuracy. One approach to handling
missing values is to fill in (impute) the missing values. In this paper, we
present a technique for unsupervised learning called Unsupervised
Backpropagation (UBP), which trains a multi-layer perceptron to fit to the
manifold sampled by a set of observed point-vectors. We evaluate UBP with the
task of imputing missing values in datasets, and show that UBP is able to
predict missing values with significantly lower sum-squared error than other
collaborative filtering and imputation techniques. We also demonstrate with 24
datasets and 9 supervised learning algorithms that classification accuracy is
usually higher when randomly-withheld values are imputed using UBP, rather than
with other methods.
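The UBP idea above can be sketched as follows: per-row latent vectors and the network weights are trained jointly by backpropagating the error on observed entries, and missing entries are then read off the trained network. This is a minimal single-hidden-layer variant, not the paper's full method (which uses a staged training schedule); all names and hyperparameters here are illustrative.

```python
import numpy as np

def ubp_impute(X, mask, n_latent=2, n_hidden=8, lr=0.1, epochs=300, seed=0):
    """Simplified UBP sketch: jointly learn per-row latent vectors V and
    the weights of a one-hidden-layer MLP so that mlp(V[i]) fits the
    observed entries of row i; missing entries come from the trained net.
    (Illustrative only; the paper trains in three stages.)"""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    V = rng.normal(0.0, 0.1, (n, n_latent))        # learned latent inputs
    W1 = rng.normal(0.0, 0.1, (n_latent, n_hidden))
    b1 = np.zeros(n_hidden)
    W2 = rng.normal(0.0, 0.1, (n_hidden, d))
    b2 = np.zeros(d)
    Xz = np.where(mask, X, 0.0)                    # zero-fill unobserved cells
    for _ in range(epochs):
        H = np.tanh(V @ W1 + b1)                   # hidden activations
        E = (H @ W2 + b2 - Xz) * mask              # error on observed cells only
        dZ = (E @ W2.T) * (1.0 - H**2)             # backprop through tanh
        W2 -= lr * (H.T @ E) / n
        b2 -= lr * E.sum(0) / n
        W1 -= lr * (V.T @ dZ) / n
        b1 -= lr * dZ.sum(0) / n
        V -= lr * (dZ @ W1.T)                      # update each row's latent input
    P = np.tanh(V @ W1 + b1) @ W2 + b2
    return np.where(mask, X, P)                    # keep observed values intact
```

Observed entries are passed through unchanged; only the cells where `mask` is false receive predicted values.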
Solving Incomplete Datasets in Soft Set Using Supported Sets and Aggregate Values
The theory of soft sets proposed by Molodtsov in 1999 [1] is a new method for handling uncertain data and can be defined as a Boolean-valued information system. It has been applied to data analysis and decision support systems based on large datasets. In this paper, it is shown that a calculated support value can be used to determine the missing attribute value of an object. However, when more than one value is missing, aggregate values together with the calculated support values are used to determine the missing values. By successfully recovering missing attribute values, the integrity of a dataset can still be maintained.
Bayesian network classification of gastrointestinal bleeding
The source of gastrointestinal bleeding (GIB) remains uncertain in patients presenting without hematemesis. This paper studies the accuracy, specificity and sensitivity of the Naive Bayesian Classifier (NBC) in identifying the source of GIB in the absence of hematemesis. Data of 325 patients admitted via the emergency department (ED) for GIB without hematemesis and who underwent confirmatory testing were analysed. Six attributes related to demography and their presenting signs were chosen. NBC was used to calculate the conditional probability of an individual being assigned to Upper Gastrointestinal Bleeding (UGIB) or Lower Gastrointestinal Bleeding (LGIB). High classification accuracy (87.3%), specificity (0.85) and sensitivity (0.88) were achieved. NBC is a useful tool to support the identification of the source of gastrointestinal bleeding in patients without hematemesis.
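The conditional-probability calculation the paper describes is the standard categorical Naive Bayes score: class prior times per-attribute likelihoods. A minimal sketch, with Laplace smoothing; the attribute names and toy records below are invented for illustration, not the paper's six clinical attributes.

```python
import math
from collections import Counter, defaultdict

def train_nbc(X, y):
    """Categorical Naive Bayes sketch in the spirit of the paper's NBC:
    score each class (e.g. "UGIB" vs "LGIB") by its log prior plus
    Laplace-smoothed per-attribute log likelihoods."""
    priors = Counter(y)
    n = len(y)
    cond = defaultdict(Counter)          # (attr index, class) -> value counts
    vals = defaultdict(set)              # attr index -> distinct values seen
    for row, c in zip(X, y):
        for j, v in enumerate(row):
            cond[(j, c)][v] += 1
            vals[j].add(v)

    def predict(row):
        best, best_lp = None, -math.inf
        for c, nc in priors.items():
            lp = math.log(nc / n)        # log prior
            for j, v in enumerate(row):  # smoothed log likelihood per attribute
                lp += math.log((cond[(j, c)][v] + 1) / (nc + len(vals[j])))
            if lp > best_lp:
                best, best_lp = c, lp
        return best

    return predict

# Hypothetical two-attribute example (age group, presenting sign):
predict = train_nbc(
    [("elderly", "melena"), ("elderly", "melena"),
     ("young", "hematochezia"), ("young", "hematochezia")],
    ["UGIB", "UGIB", "LGIB", "LGIB"])
```

With these toy records, `predict(("elderly", "melena"))` returns `"UGIB"`, since both attributes pull the posterior toward the upper-GI class.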
Improved Heterogeneous Distance Functions
Instance-based learning techniques typically handle continuous and linear
input values well, but often do not handle nominal input attributes
appropriately. The Value Difference Metric (VDM) was designed to find
reasonable distance values between nominal attribute values, but it largely
ignores continuous attributes, requiring discretization to map continuous
values into nominal values. This paper proposes three new heterogeneous
distance functions, called the Heterogeneous Value Difference Metric (HVDM),
the Interpolated Value Difference Metric (IVDM), and the Windowed Value
Difference Metric (WVDM). These new distance functions are designed to handle
applications with nominal attributes, continuous attributes, or both. In
experiments on 48 applications the new distance metrics achieve higher
classification accuracy on average than three previous distance functions on
those datasets that have both nominal and continuous attributes. Comment: See
http://www.jair.org/ for an online appendix and other files accompanying this
article.
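The HVDM described above can be sketched directly: continuous attributes are normalised by four standard deviations, nominal attributes use the class-conditional normalised VDM, and the per-attribute distances combine Euclidean-style. This is a compact reading of the metric, not the paper's reference implementation (it omits the paper's missing-value handling, for instance).

```python
import math
from collections import Counter, defaultdict

class HVDM:
    """Sketch of the Heterogeneous Value Difference Metric: nominal
    attributes use the normalised VDM over class-conditional value
    frequencies, continuous attributes use |x - y| / (4 * sigma_a),
    and the overall distance is the Euclidean combination."""
    def __init__(self, X, y, nominal):
        self.nominal = set(nominal)              # indices of nominal attributes
        self.classes = sorted(set(y))
        self.sigma = {}                          # std dev per continuous attr
        self.pvc = defaultdict(Counter)          # (attr, value) -> class counts
        self.nv = Counter()                      # (attr, value) -> total count
        d = len(X[0])
        for a in range(d):
            col = [row[a] for row in X]
            if a in self.nominal:
                for v, c in zip(col, y):
                    self.pvc[(a, v)][c] += 1
                    self.nv[(a, v)] += 1
            else:
                m = sum(col) / len(col)
                self.sigma[a] = math.sqrt(
                    sum((x - m) ** 2 for x in col) / len(col)) or 1.0

    def _da(self, a, x, y):
        if a in self.nominal:                    # normalised VDM on this attr
            return math.sqrt(sum(
                (self.pvc[(a, x)][c] / (self.nv[(a, x)] or 1)
                 - self.pvc[(a, y)][c] / (self.nv[(a, y)] or 1)) ** 2
                for c in self.classes))
        return abs(x - y) / (4.0 * self.sigma[a])  # continuous attribute

    def dist(self, u, v):
        return math.sqrt(sum(self._da(a, u[a], v[a]) ** 2
                             for a in range(len(u))))
```

For identical instances the distance is zero; instances that differ only in a class-discriminating nominal value get a strictly positive, symmetric distance.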
Learning differential diagnosis of erythemato-squamous diseases using voting feature intervals
A new classification algorithm, called VFI5 (for Voting Feature Intervals), is developed and applied to the problem of differential diagnosis of erythemato-squamous diseases. The domain contains records of patients with known diagnoses. Given a training set of such records, the VFI5 classifier learns how to differentiate a new case in the domain. VFI5 represents a concept in the form of feature intervals on each feature dimension separately. Classification in the VFI5 algorithm is based on real-valued voting: each feature participates equally in the voting process, and the class that receives the maximum amount of votes is declared the predicted class. The performance of the VFI5 classifier is evaluated empirically in terms of classification accuracy and running time. (C) 1998 Elsevier Science B.V. All rights reserved.
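The real-valued voting scheme can be illustrated with a deliberately simplified sketch: each feature casts, for every class, a vote proportional to that class's relative frequency for the observed feature value, normalised so each feature contributes one unit of vote. VFI5 proper votes over learned feature *intervals*; exact point values stand in for them here, so this is a toy analogue, not the published algorithm.

```python
from collections import Counter, defaultdict

def vfi_train(X, y):
    """Toy sketch of VFI5-style real-valued voting: per feature, each
    class receives a vote equal to its class-size-normalised frequency
    for the observed value; votes sum across features and the class
    with the maximum total wins."""
    class_sizes = Counter(y)
    counts = defaultdict(Counter)             # (feature, value) -> class counts
    for row, c in zip(X, y):
        for j, v in enumerate(row):
            counts[(j, v)][c] += 1

    def predict(row):
        votes = Counter()
        for j, v in enumerate(row):
            raw = {c: counts[(j, v)][c] / class_sizes[c] for c in class_sizes}
            s = sum(raw.values())
            for c, w in raw.items():          # each feature's votes sum to 1
                votes[c] += w / s if s else 0.0
        return votes.most_common(1)[0][0]

    return predict
```

Because every feature's votes are normalised before summing, each feature participates equally, mirroring the equal-participation property the abstract describes.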
Iterative missing value imputation based on feature importance
Many datasets suffer from missing values due to various reasons, which not
only increases the processing difficulty of related tasks but also reduces the
accuracy of classification. To address this problem, the mainstream approach is
to use missing value imputation to complete the dataset. Existing imputation
methods estimate the missing parts based on the observed values in the original
feature space, and they treat all features as equally important during data
completion, while in fact different features have different importance.
Therefore, we have designed an imputation method that considers feature
importance. This algorithm iteratively performs matrix completion and feature
importance learning, and specifically, matrix completion is based on a filling
loss that incorporates feature importance. Our experimental analysis involves
three types of datasets: synthetic datasets with different noisy features and
missing values, real-world datasets with artificially generated missing values,
and real-world datasets originally containing missing values. The results on
these datasets consistently show that the proposed method outperforms five
existing imputation algorithms. To the best of our knowledge, this is the
first work that considers feature importance in the imputation model.
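The alternation the abstract describes, between completing the matrix and re-estimating feature importance, can be sketched as follows. Note the importance estimate here is a simple variance-based proxy and the fill rule is an importance-weighted k-nearest-rows average; the paper's learned importance and completion loss are not reproduced, so treat every choice below as an assumption.

```python
import numpy as np

def iterative_impute(X, mask, k=3, iters=5):
    """Hedged sketch of iterative imputation with feature importance:
    missing cells start at column means; each round (a) re-estimates a
    per-feature importance (a variance proxy, not the paper's learned
    weights) and (b) re-fills each missing cell from the k rows nearest
    under an importance-weighted Euclidean distance."""
    Xc = X.copy()
    col_mean = np.nanmean(np.where(mask, X, np.nan), axis=0)
    Xc[~mask] = np.take(col_mean, np.where(~mask)[1])   # mean-fill start
    for _ in range(iters):
        w = Xc.var(axis=0)                               # importance proxy
        w = w / w.sum() if w.sum() > 0 else np.full(X.shape[1],
                                                    1.0 / X.shape[1])
        for i, j in zip(*np.where(~mask)):
            diff = Xc - Xc[i]                            # offsets to row i
            dist = np.sqrt((w * diff**2).sum(axis=1))    # weighted distance
            dist[i] = np.inf                             # exclude row itself
            nn = np.argsort(dist)[:k]                    # k nearest rows
            Xc[i, j] = Xc[nn, j].mean()                  # re-fill the cell
    return Xc
```

Observed cells are never modified; only the masked-out cells are iteratively re-estimated as the importance weights settle.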
- …