20 research outputs found

    Explainable Machine Learning for Categorical and Mixed Data with Lossless Visualization

    Building accurate and interpretable Machine Learning (ML) models for heterogeneous/mixed data is a long-standing challenge for algorithms designed for numeric data. This work focuses on developing numeric coding schemes for non-numeric attributes so that ML algorithms can support accurate and explainable models, methods for lossless visualization of n-D non-numeric categorical data with visual rule discovery in these visualizations, and accurate and explainable ML models for categorical data. This study proposes a classification of mixed data types and analyzes their important role in Machine Learning. It presents a toolkit for enforcing interpretability of all internal operations of ML algorithms on mixed data, together with visual data exploration on mixed data. A new Sequential Rule Generation (SRG) algorithm for explainable rule generation with categorical data is proposed and successfully evaluated in multiple computational experiments. This work is one of the steps toward full-scope ML algorithms for mixed data supported by lossless visualization of n-D data in General Line Coordinates beyond Parallel Coordinates. Comment: 46 pages, 32 figures, 29 tables. arXiv admin note: substantial text overlap with arXiv:2206.0647

    Full High-Dimensional Intelligible Learning In 2-D Lossless Visualization Space

    This study explores a new methodology for machine learning classification tasks in 2-dimensional visualization space (2-D ML) using Visual Knowledge Discovery in lossless General Line Coordinates. It is shown that this is a full machine learning approach that does not require processing n-dimensional data in an abstract n-dimensional space. It enables discovering n-D patterns in 2-D space without loss of n-D information using graph representations of n-D data in 2-D. Specifically, this study shows that it can be done with static and dynamic In-line Based Coordinates in different modifications, which are a category of General Line Coordinates. Based on these in-line coordinates, classification and regression methods were developed. The viability of the strategy was shown by two case studies based on benchmark datasets (Wisconsin Breast Cancer and Page Block Classification datasets). The characteristics of the Page Block Classification data led to the development of an algorithm for imbalanced high-resolution data with multiple classes, which exploits decision trees as a model design facilitator, producing a model that is more general than a decision tree. This work accelerates the ongoing consolidation of an emerging field of full 2-D machine learning and its methodology. Within this methodology, end users can discover and justify models in a self-service manner. Providing interpretable ML models is another benefit of this approach. Comment: 30 pages, 17 figures, 14 tables. arXiv admin note: text overlap with arXiv:2106.0756

    Super-intelligence Challenges and Lossless Visual Representation of High-Dimensional Data

    A fundamental challenge and goal of cognitive algorithms is moving super-intelligent machines and super-intelligent humans from dreams to reality. This paper is devoted to a technical way to reach some specific aspects of super-intelligence that are beyond current human cognitive abilities. Specifically, the proposed technique aims to overcome the inability to analyze large amounts of abstract numeric high-dimensional data and to find complex patterns in these data with the naked eye. Discovering patterns in multidimensional data using visual means is a long-standing problem in multiple fields and in Data Science and Modeling in general. The major challenge is that we cannot see n-D data with the naked eye and need visualization tools to represent n-D data in 2-D losslessly. The number of available lossless methods is quite limited. The objective of this paper is to expand the class of such lossless methods by proposing a new concept of Generalized Shifted Collocated Paired Coordinates. The paper shows the advantages of the proposed lossless technique by proving mathematical properties and by demonstration on real data.

    Data Visualization and Classification of Artificially Created Images

    Visualization of multidimensional data is a long-standing challenge in machine learning and knowledge discovery. A problem arises as soon as a fourth dimension is introduced, since we live in a 3-dimensional world. Existing methods can visualize multidimensional data, but loss of information and clutter remain problems. General Line Coordinates (GLC) can losslessly project n-dimensional data into 2 dimensions. A new method based on GLC, called GLC-L, is introduced. This new method supports interactive visualization, dimension reduction, and supervised learning. One of the applications of GLC-L is the transformation of vector data into image data. This novel approach of transforming vector data into images using lossless visualization introduces a new method for classification of data in vector format. Having images in raster format instead of vector format allows them to be classified with a Convolutional Neural Network (CNN). Experiments conducted on datasets of different sizes show that these artificially created images provide useful information for the CNN. The CNN can classify these artificially created images with results competitive with other analytic machine learning algorithms for vector data. The artificially created images were also classified with a Support Vector Machine (SVM) and a Multilayer Perceptron (MLP).

    Visual Knowledge Discovery with General Line Coordinates

    Understanding black-box Machine Learning methods on multidimensional data is a key challenge in Machine Learning. While many powerful Machine Learning methods already exist, these methods are often unexplainable or perform poorly on complex data. This paper proposes visual knowledge discovery approaches based on several forms of lossless General Line Coordinates. These are an expansion of the previously introduced General Line Coordinates Linear and Dynamic Scaffolding Coordinates to produce, explain, and visualize non-linear classifiers with explanation rules. To ensure that these non-linear models and rules are accurate, new interactive visual knowledge discovery algorithms for finding worst-case validation splits were also developed for General Line Coordinates Linear. These expansions are General Line Coordinates non-linear, interactive rules linear, hyperblock rules linear, and worst-case linear. Experiments across multiple benchmark datasets show that this visual knowledge discovery method can compete with other visual and computational Machine Learning algorithms while improving both interpretability and accuracy in linear and non-linear classifications. Major benefits of these expansions are the ability to build accurate and highly interpretable models and rules from hyperblocks, the ability to analyze interpretability weaknesses in a model, and the input of expert knowledge through interactive and human-guided visual knowledge discovery methods. Comment: 44 pages, 26 figures, 3 tables

    Decreasing Occlusion and Increasing Explanation in Interactive Visual Knowledge Discovery

    Lack of explanation and occlusion are the major problems for interactive visual knowledge discovery, machine learning, and data mining in multidimensional data. This thesis proposes a hybrid method that combines visual and analytical means to deal with these problems. This method, denoted as FSP, uses visualization of n-D data in 2-D in a set of Shifted Paired Coordinates (SPC). SPC for n-D data consists of n/2 pairs of Cartesian coordinates that are shifted relative to each other to avoid their overlap. Each n-D point is represented as a directed graph in SPC. It is shown that the FSP method simplifies pattern discovery in n-D data, providing explainable rules in a visual form with a significant decrease in the cognitive load for analysis of n-D data. The computational experiments on real data have shown its efficiency on both training and validation data.
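
The SPC representation described in this abstract can be illustrated with a minimal sketch. It assumes an even number of attributes and a caller-supplied list of shift vectors, one per coordinate pair; the function names and the shift values are hypothetical and are not taken from the thesis.

```python
from typing import List, Sequence, Tuple

Point2D = Tuple[float, float]


def spc_encode(x: Sequence[float], shifts: Sequence[Point2D]) -> List[Point2D]:
    """Map an n-D point to n/2 graph nodes, one per shifted coordinate pair."""
    if len(x) % 2 != 0:
        raise ValueError("this sketch assumes an even number of attributes")
    if len(shifts) != len(x) // 2:
        raise ValueError("one shift vector is needed per coordinate pair")
    nodes = []
    for k, (dx, dy) in enumerate(shifts):
        # Pair k holds attributes (x[2k], x[2k+1]), drawn in a shifted plane.
        nodes.append((x[2 * k] + dx, x[2 * k + 1] + dy))
    return nodes


def spc_edges(nodes: Sequence[Point2D]) -> List[Tuple[Point2D, Point2D]]:
    """Directed edges linking consecutive nodes of the graph."""
    return list(zip(nodes, nodes[1:]))


def spc_decode(nodes: Sequence[Point2D], shifts: Sequence[Point2D]) -> List[float]:
    """Recover the n-D point from its nodes by undoing the shifts."""
    x: List[float] = []
    for (u, v), (dx, dy) in zip(nodes, shifts):
        x.extend([u - dx, v - dy])
    return x


# Hypothetical 6-D point and horizontal shifts chosen only for illustration.
point = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
shifts = [(0.0, 0.0), (10.0, 0.0), (20.0, 0.0)]
nodes = spc_encode(point, shifts)
edges = spc_edges(nodes)
restored = spc_decode(nodes, shifts)
```

Because decoding only subtracts the known shifts, every attribute value can be read back from the 2-D graph, which is the sense in which such a visualization is lossless.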

    Visual Data Mining

    Occlusion is one of the major problems for interactive visual knowledge discovery and data mining in the process of finding patterns in multidimensional data. This project proposes a hybrid method, called GLC-S, that combines visual and analytical means to deal with occlusion in visual knowledge discovery. GLC-S visualizes n-D data in 2-D in a set of Shifted Paired Coordinates (SPC). A set of Shifted Paired Coordinates for n-D data consists of n/2 pairs of common Cartesian coordinates that are shifted relative to each other to avoid their overlap. Each n-D point A is represented as a directed graph A* in SPC, where each node is the 2-D projection of A in a respective pair of the Cartesian coordinates. The proposed GLC-S method significantly decreases the cognitive load for analysis of n-D data and simplifies pattern discovery in n-D data. The GLC-S method iteratively splits n-D data into non-overlapping clusters (hyper-rectangles) around local centers and visualizes only data within these clusters at each iteration. Each cluster is required to contain cases of only one class and to be the largest cluster with this property in the SPC visualization. Such sequential splitting allows: (1) avoiding occlusion, (2) finding visually local classification patterns and rules, and (3) combining local sub-rules into a global rule that classifies all given data of two or more classes. The computational experiments with Wisconsin Breast Cancer data (9-D), User Knowledge Modeling data (6-D), and Letter Recognition data (17-D) from the UCI Machine Learning Repository confirm this capability. At each iteration, these data were split into training (70%) and validation (30%) data. It required 3 iterations for Wisconsin Breast Cancer data, 4 iterations for User Knowledge Modeling data, and 5 iterations for Letter Recognition data, yielding respectively 3, 4, and 5 local sub-rules that covered over 95% of all n-D data points with 100% accuracy in both training and validation experiments. After each iteration, the data that were used in this iteration are removed and the remaining data are used in the next iteration. This removal process also helps to decrease occlusion. The GLC-S algorithm refuses to classify remaining cases that are not covered by these rules, i.e., cases that do not belong to the found hyper-rectangles. The interactive visualization process in SPC allows adjusting the sides of the hyper-rectangles to maximize the size of each hyper-rectangle without overlap with the hyper-rectangles of the opposing classes. The GLC-S method uses a fixed split of the n coordinates into pairs. This hybrid visual and analytical approach avoids throwing all data of several classes into one visualization plot, which typically ends up as a messy, highly occluded picture that hides useful patterns. This approach allows revealing these hidden patterns. The visualization process in SPC is reversible (lossless), i.e., all n-D information is visualized in 2-D and can be restored from the 2-D visualization for each n-D case. This hybrid visual analytics method allowed classifying n-D data in a way that can be communicated to users in an understandable visual form.
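
The rule-based classification with refusal described in this abstract can be sketched as follows. This is only an illustration under stated assumptions: the hyper-rectangles are taken as already found (the interactive discovery step is not modeled), and the class `HyperRectangle`, the function `classify`, and all numeric bounds are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence


@dataclass
class HyperRectangle:
    """A single-class local sub-rule: lower[i] <= x[i] <= upper[i] for all i."""
    lower: Sequence[float]
    upper: Sequence[float]
    label: str

    def contains(self, x: Sequence[float]) -> bool:
        if len(x) != len(self.lower) or len(x) != len(self.upper):
            raise ValueError("dimension mismatch between point and rule")
        return all(lo <= v <= hi for lo, v, hi in zip(self.lower, x, self.upper))


def classify(x: Sequence[float], rules: List[HyperRectangle]) -> Optional[str]:
    """Return the label of the covering rule, or None to refuse classification."""
    for rule in rules:
        if rule.contains(x):
            return rule.label
    return None  # case not covered by any hyper-rectangle


# Hypothetical 2-D rules and cases, chosen only to show the three outcomes.
rules = [
    HyperRectangle(lower=[0.0, 0.0], upper=[1.0, 1.0], label="benign"),
    HyperRectangle(lower=[2.0, 2.0], upper=[3.0, 3.0], label="malignant"),
]
cases = [[0.5, 0.5], [2.5, 2.1], [1.5, 1.5]]
predictions = [classify(c, rules) for c in cases]
coverage = sum(p is not None for p in predictions) / len(cases)
```

Since the abstract states that the hyper-rectangles do not overlap, the order in which the rules are checked does not affect the result in this sketch.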

    Visualization for Solving Non-image Problems and Saliency Mapping

    High-dimensional data play an important role in knowledge discovery and data science. Integration of visualization, visual analytics, machine learning (ML), and data mining (DM) is a key aspect of data science research for high-dimensional data. This thesis explores the efficiency of a new algorithm that converts non-image data into raster images by visualizing data using a heatmap in the collocated paired coordinates (CPC). These images are called CPC-R images and the algorithm that produces them is called the CPC-R algorithm. Powerful deep learning methods open an opportunity to solve non-image ML/DM problems by transforming non-image ML problems into image recognition. The main idea behind CPC-R is splitting the attributes of an n-D point into consecutive pairs, locating the pairs in the same 2-D Cartesian space, and assigning greyscale intensities or colors to the pairs. Several parameters can be changed to produce several versions of CPC-R images, which allows the images to be optimized for classification. This thesis reports the results of computational experiments with the CPC-R algorithm for different Convolutional Neural Network classifiers, and the methods to optimize the several versions of CPC-R images for the same n-D point. These results show that the combined CPC-R and deep learning Convolutional Neural Network algorithms are able to solve non-image Machine Learning problems, reaching high accuracy on the benchmark datasets. The second part of this thesis reports the results of Saliency Mapping with the CPC-R algorithm. The saliency models take an image and generate a saliency map that predicts which regions of the image will most likely draw a human viewer’s attention. The saliency mappings with CPC-R are explored, and further optimization studies are outlined. This thesis reports the importance of features by estimating the change in prediction accuracy due to the exclusion of individual features. Large sets of pixels are used as features that can capture a large context. This approach views a cell as the most informative if covering it leads to the largest decrease in classification accuracy. This method is called the Informative Cell Covering (ICC) algorithm. Keywords: Knowledge Discovery, Deep Learning, Collocated Paired Coordinates, Convolutional Neural Networks, Raster Images, Machine Learning, Visualization, Non-image data, Data conversion
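
The main idea of CPC-R stated in this abstract (consecutive attribute pairs placed in one 2-D space and given intensities) can be sketched roughly as below. Several details are assumptions made only for illustration and are not confirmed by the abstract: attributes are normalized to [0, 1], the image is a square grid, intensity grows with the pair's position, and a later pair overwrites an earlier one that falls in the same cell. The function name `cpc_r_image` is hypothetical.

```python
from typing import List, Sequence


def cpc_r_image(x: Sequence[float], size: int = 8) -> List[List[float]]:
    """Build a size-by-size greyscale grid from consecutive attribute pairs."""
    if len(x) % 2 != 0:
        raise ValueError("this sketch assumes an even number of attributes")
    if any(not 0.0 <= v <= 1.0 for v in x):
        raise ValueError("this sketch assumes attributes normalized to [0, 1]")
    n_pairs = len(x) // 2
    image = [[0.0] * size for _ in range(size)]
    for k in range(n_pairs):
        # Pair k = (x[2k], x[2k+1]) selects a column and a row of the grid.
        col = min(int(x[2 * k] * size), size - 1)
        row = min(int(x[2 * k + 1] * size), size - 1)
        # Assumed intensity scheme: later pairs are brighter; collisions overwrite.
        image[row][col] = (k + 1) / n_pairs
    return image


# Hypothetical normalized 6-D point rendered on a 4x4 grid.
image = cpc_r_image([0.0, 0.0, 0.5, 0.25, 1.0, 1.0], size=4)
lit_cells = sum(1 for row in image for v in row if v > 0.0)
```

The resulting grid could be passed to any image classifier; choices such as grid size, pair order, and intensity scheme correspond to the kind of adjustable parameters the abstract mentions.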

    A Visual Analysis of EHR Flowsheet to Assist Clinicians’ Interpretation of Health-related Data

    This project designed and implemented a visualization interface for EHR systems based on the requirements of doctors. The visualizations combined both clinic-reported data and patient self-reported data to provide a better representation of patient health-related information for the doctor to make decisions. The visualizations highlighted the trends of values and the outliers, and made it possible to compare across time and measures. A user study with ten participants suggests that the visualization interface helped them find information efficiently. Master of Science in Information Science