
    Data clustering using proximity matrices with missing values

    In most applications of data clustering, the input data include vectors describing the location of each data point, from which distances between data points can be calculated and a proximity matrix constructed. In some applications, however, the only available input is the proximity matrix, that is, the distances between each pair of data points. Several clustering algorithms can still be applied, but if the proximity matrix has missing values, no standard method is directly applicable. Imputation can be used to replace missing values, but most imputation methods do not apply when only the proximity matrix is available. As a partial solution to fill this gap, we propose the Proximity Matrix Completion (PMC) algorithm. This algorithm assumes that data are missing for one of two reasons, complete dissimilarity or incomplete observations, and imputes values accordingly. To determine which case applies, the data are modelled as a graph and a set of maximum cliques in the graph is found. Overlap between cliques then determines the case, and hence the method of imputation, for each missing value. This approach is motivated by an application in plant breeding, where new experimental seed varieties must be clustered into sets of varieties that interact similarly with the environment, and this application is presented as a case study in the paper. The applicability, limitations and performance of the new algorithm versus other methods of imputation are further studied by applying it to datasets derived from three well-known test datasets.
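    The abstract outlines the clique-based logic of PMC without implementation details, so the following Python sketch only illustrates that idea under stated assumptions: the graph construction, the use of networkx maximal cliques, and the two imputation rules (averaging through shared clique members versus filling with the maximum observed proximity) are assumptions made for this example, not the paper's exact algorithm.

        # Illustrative sketch of the clique-based imputation idea described above.
        # The clique handling and the exact imputation formulas are assumptions,
        # not the published PMC algorithm.
        import numpy as np
        import networkx as nx

        def impute_proximity_matrix(D):
            """D: symmetric (n x n) proximity matrix with np.nan for missing entries."""
            n = D.shape[0]
            # Model the observed proximities as a graph: an edge means the distance is known.
            G = nx.Graph()
            G.add_nodes_from(range(n))
            for i in range(n):
                for j in range(i + 1, n):
                    if not np.isnan(D[i, j]):
                        G.add_edge(i, j)
            # Points that share a maximal clique are mutually observed.
            cliques = [set(c) for c in nx.find_cliques(G)]
            max_obs = np.nanmax(D)
            D_imp = D.copy()
            for i in range(n):
                for j in range(i + 1, n):
                    if np.isnan(D[i, j]):
                        members_i = set().union(*[c for c in cliques if i in c])
                        members_j = set().union(*[c for c in cliques if j in c])
                        shared = (members_i & members_j) - {i, j}
                        if shared:
                            # Overlapping cliques: treat the entry as an incomplete
                            # observation and impute it through shared neighbours.
                            D_imp[i, j] = np.mean([D[i, k] + D[k, j] for k in shared])
                        else:
                            # No overlap: assume complete dissimilarity and fill in
                            # the largest observed proximity.
                            D_imp[i, j] = max_obs
                        D_imp[j, i] = D_imp[i, j]
            return D_imp

    The completed matrix could then be passed to any proximity-based clustering method, such as hierarchical or spectral clustering.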

    Evidential reasoning for preprocessing uncertain categorical data for trustworthy decisions: An application on healthcare and finance

    The uncertainty introduced by discrepant data in AI-enabled decisions is a critical challenge in highly regulated domains such as healthcare and finance. Ambiguity and incompleteness due to missing values in output and input attributes, respectively, are ubiquitous in these domains. They can adversely affect groups of people who are underrepresented in the training data, even without any intention by the developer to discriminate. The inherently non-numerical nature of categorical attributes, compared with numerical attributes, and the presence of incomplete and ambiguous categorical attributes in a dataset increase the uncertainty in decision-making. This paper addresses the challenges of handling categorical attributes, which have not been addressed comprehensively in previous research. Three sources of uncertainty in categorical attributes are recognised in this research. Informational uncertainty, unforeseeable uncertainty in the decision-task environment, and uncertainty due to a lack of pre-modelling explainability in categorical attributes are addressed by the proposed maximum likelihood evidential reasoning (MAKER) methodology. It can transform and impute incomplete and ambiguous categorical attributes into interpretable numerical features. It uses notions of weight and reliability to capture, respectively, subjective expert preference over a piece of evidence and the quality of the evidence in a categorical attribute. The MAKER framework integrates the recognised uncertainties into the transformed input data, allowing a model to perceive data limitations during training and to acknowledge doubtful predictions, thereby supporting trustworthy pre-modelling and post-modelling explainability. The ability to handle uncertainty, and its impact on explainability, is demonstrated on real-world healthcare and finance data for different missing-data scenarios with three types of AI algorithms: deep-learning, tree-based, and rule-based models.
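    The abstract describes the MAKER transformation only at a high level, so the sketch below is a simplified, evidence-style encoding written to illustrate the general idea: class likelihoods per category play the role of evidence, a single reliability factor discounts that evidence, and missing or unseen categories contribute only "unknown" mass. The likelihood estimate, the discounting rule, and the chosen reliability value are assumptions for this example, not the paper's actual MAKER formulation.

        # Simplified, illustrative evidence-style encoding of one categorical attribute.
        # The weights, reliability, and combination rule are assumptions, not the
        # published MAKER framework.
        import numpy as np
        import pandas as pd

        def evidential_encode(values, labels, classes, reliability=0.9):
            """Map a categorical attribute to per-class belief degrees plus an
            'unknown' mass, using class likelihoods estimated from training data."""
            # Likelihood of each class given each observed category (co-occurrence counts).
            table = pd.crosstab(values, labels).reindex(columns=classes, fill_value=0)
            likelihood = table.div(table.sum(axis=1), axis=0)

            n_classes = len(classes)
            features = np.zeros((len(values), n_classes + 1))  # last column = unknown mass
            for row, v in enumerate(values):
                if pd.isna(v) or v not in likelihood.index:
                    # Missing or unseen category: all mass stays "unknown", so a
                    # downstream model can see the data limitation explicitly.
                    features[row, -1] = 1.0
                else:
                    # Discount the evidence by its reliability; the remainder is unknown.
                    features[row, :n_classes] = reliability * likelihood.loc[v].values
                    features[row, -1] = 1.0 - reliability
            return features

        # Toy usage with made-up values and labels:
        values = pd.Series(["low", "high", None, "medium", "high"])
        labels = pd.Series([0, 1, 1, 0, 1])
        print(evidential_encode(values, labels, classes=[0, 1]))

    The resulting numerical features retain an explicit uncertainty column, which is one way a downstream deep-learning, tree-based, or rule-based model could be made aware of missing or ambiguous inputs.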