20 research outputs found
Explainable Machine Learning for Categorical and Mixed Data with Lossless Visualization
Building accurate and interpretable Machine Learning (ML) models for
heterogeneous/mixed data is a long-standing challenge for algorithms designed
for numeric data. This work focuses on developing numeric coding schemes for
non-numeric attributes for ML algorithms to support accurate and explainable ML
models, methods for lossless visualization of n-D non-numeric categorical data
with visual rule discovery in these visualizations, and accurate and
explainable ML models for categorical data. This study proposes a
classification of mixed data types and analyzes their important role in Machine
Learning. It presents a toolkit for enforcing interpretability of all internal
operations of ML algorithms on mixed data with a visual data exploration on
mixed data. A new Sequential Rule Generation (SRG) algorithm for explainable
rule generation with categorical data is proposed and successfully evaluated in
multiple computational experiments. This work is one step toward full-scope
ML algorithms for mixed data supported by lossless visualization of n-D
data in General Line Coordinates beyond Parallel Coordinates.
Comment: 46 pages, 32 figures, 29 tables. arXiv admin note: substantial text
overlap with arXiv:2206.0647
Full High-Dimensional Intelligible Learning In 2-D Lossless Visualization Space
This study explores a new methodology for machine learning classification
tasks in 2-dimensional visualization space (2-D ML) using Visual knowledge
Discovery in lossless General Line Coordinates. It is shown that this is a full
machine learning approach that does not require processing n-dimensional data
in an abstract n-dimensional space. It enables discovering n-D patterns in 2-D
space without loss of n-D information using graph representations of n-D data
in 2-D. Specifically, this study shows that it can be done with static and
dynamic In-line Based Coordinates in different modifications, which are a
category of General Line Coordinates. Based on these inline coordinates,
classification and regression methods were developed. The viability of the
strategy was shown by two case studies based on benchmark datasets (Wisconsin
Breast Cancer and Page Block Classification datasets). The characteristics of
page block classification data led to the development of an algorithm for
imbalanced high-resolution data with multiple classes, which uses decision
trees as a model design facilitator and produces a model that is more
general than a decision tree. This work accelerates the ongoing consolidation
of an emerging field of full 2-D machine learning and its methodology. Within
this methodology, end users can discover models and justify them in a
self-service manner. Providing interpretable ML models is another benefit of this
approach.
Comment: 30 pages, 17 figures, 14 tables. arXiv admin note: text overlap with
arXiv:2106.0756
Super-intelligence Challenges and Lossless Visual Representation of High-Dimensional Data
Fundamental challenges and goals of cognitive algorithms are moving super-intelligent machines and super-intelligent humans from dreams to reality. This paper is devoted to a technical way to reach some specific aspects of super-intelligence that are beyond current human cognitive abilities. Specifically, the proposed technique aims to overcome the inability to analyze a large amount of abstract numeric high-dimensional data and to find complex patterns in these data with the naked eye. Discovering patterns in multidimensional data using visual means is a long-standing problem in multiple fields and in Data Science and Modeling in general. The major challenge is that we cannot see n-D data with the naked eye and need visualization tools to represent n-D data in 2-D losslessly. The number of available lossless methods is quite limited. The objective of this paper is to expand the class of such lossless methods by proposing a new concept of Generalized Shifted Collocated Paired Coordinates. The paper shows the advantages of the proposed lossless technique by proving mathematical properties and by demonstration on real data.
Data Visualization and Classification of Artificially Created Images
Visualization of multidimensional data is a long-standing challenge in machine learning and knowledge discovery. A problem arises as soon as a fourth dimension is introduced, since we live in a 3-dimensional world. Existing methods can visualize multidimensional data, but loss of information and clutter remain a problem. General Line Coordinates (GLC) can losslessly project n-dimensional data into 2 dimensions. A new method based on GLC, called GLC-L, is introduced. This new method supports interactive visualization, dimension reduction, and supervised learning. One of the applications of GLC-L is the transformation of vector data into image data. This novel approach of transforming vector data into images using lossless visualization introduces a new method for classification of data in vector format. Having images in raster format instead of vector format allows the data to be classified with a Convolutional Neural Network (CNN). Experiments conducted on datasets of different sizes show that these artificially created images provide useful information for the CNN. The CNN can classify these artificially created images with results competitive with other analytical machine learning algorithms for vector data. The artificially created images were also classified with a Support Vector Machine (SVM) and a Multilayer Perceptron (MLP).
Visual Knowledge Discovery with General Line Coordinates
Understanding black-box Machine Learning methods on multidimensional data is
a key challenge in Machine Learning. While many powerful Machine Learning
methods already exist, these methods are often unexplainable or perform poorly
on complex data. This paper proposes visual knowledge discovery approaches
based on several forms of lossless General Line Coordinates. These are an
expansion of the previously introduced General Line Coordinates Linear and
Dynamic Scaffolding Coordinates to produce, explain, and visualize non-linear
classifiers with explanation rules. To ensure these non-linear models and rules
are accurate, General Line Coordinates Linear was also extended with new
interactive visual knowledge discovery algorithms for finding worst-case
validation splits.
These expansions are General Line Coordinates non-linear, interactive rules
linear, hyperblock rules linear, and worst-case linear. Experiments across
multiple benchmark datasets show that this visual knowledge discovery method
can compete with other visual and computational Machine Learning algorithms
while improving both interpretability and accuracy in linear and non-linear
classifications. Major benefits from these expansions consist of the ability to
build accurate and highly interpretable models and rules from hyperblocks, the
ability to analyze interpretability weaknesses in a model, and the input of
expert knowledge through interactive and human-guided visual knowledge
discovery methods.
Comment: 44 pages, 26 figures, 3 table
Decreasing Occlusion and Increasing Explanation in Interactive Visual Knowledge Discovery
Lack of explanation and occlusion are the major problems for interactive visual knowledge discovery, machine learning, and data mining in multidimensional data. This thesis proposes a hybrid method that combines visual and analytical means to deal with these problems. This method, denoted as FSP, uses visualization of n-D data in 2-D in a set of Shifted Paired Coordinates (SPC). SPC for n-D data consists of n/2 pairs of Cartesian coordinates that are shifted relative to each other to avoid their overlap. Each n-D point is represented as a directed graph in SPC. It is shown that the FSP method simplifies pattern discovery in n-D data, providing explainable rules in a visual form with a significant decrease in the cognitive load for analysis of n-D data. The computational experiments on real data have shown its efficiency on both training and validation data.
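The SPC mapping described in this abstract can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the function names `spc_graph` and `spc_restore`, the horizontal shift layout, and the requirement that n be even are assumptions made only for this example.

```python
from typing import List, Sequence, Tuple

Point2D = Tuple[float, float]


def spc_graph(point: Sequence[float],
              shifts: Sequence[Point2D]) -> Tuple[List[Point2D], List[Tuple[Point2D, Point2D]]]:
    """Map an n-D point to the nodes and directed edges of its SPC graph.

    Assumption of this sketch: n is even and one (dx, dy) shift is given
    for each of the n/2 coordinate pairs.
    """
    if len(point) % 2 != 0:
        raise ValueError("this sketch assumes an even number of attributes")
    if len(shifts) != len(point) // 2:
        raise ValueError("one shift per coordinate pair is required")
    nodes = []
    for i, (dx, dy) in enumerate(shifts):
        # Pair i takes attributes (x_{2i+1}, x_{2i+2}) and is drawn in its own shifted axes.
        nodes.append((point[2 * i] + dx, point[2 * i + 1] + dy))
    # Directed edges connect consecutive pair nodes: node 1 -> node 2 -> ...
    edges = list(zip(nodes[:-1], nodes[1:]))
    return nodes, edges


def spc_restore(nodes: Sequence[Point2D], shifts: Sequence[Point2D]) -> List[float]:
    """Recover the n-D point from its SPC nodes, showing the mapping is reversible."""
    point: List[float] = []
    for (u, v), (dx, dy) in zip(nodes, shifts):
        point.extend([u - dx, v - dy])
    return point


# Hypothetical 4-D point and shifts: the second pair is moved 2 units to the right.
example_shifts = [(0.0, 0.0), (2.0, 0.0)]
example_nodes, example_edges = spc_graph([0.25, 0.5, 0.75, 1.0], example_shifts)
```

Because the shifts are known, subtracting them from the node positions returns the original attribute values, which is the sense in which such a visualization is lossless.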
Visual Data Mining
Occlusion is one of the major problems for interactive visual knowledge discovery and data mining in the process of finding patterns in multidimensional data. This project proposes a hybrid method, called GLC-S, that combines visual and analytical means to deal with occlusion in visual knowledge discovery. It uses visualization of n-D data in 2-D in a set of Shifted Paired Coordinates (SPC). A set of Shifted Paired Coordinates for n-D data consists of n/2 pairs of common Cartesian coordinates that are shifted relative to each other to avoid their overlap. Each n-D point A is represented as a directed graph A* in SPC, where each node is the 2-D projection of A in a respective pair of the Cartesian coordinates.
The proposed GLC-S method significantly decreases the cognitive load for analysis of n-D data and simplifies pattern discovery in n-D data. The GLC-S method iteratively splits n-D data into non-overlapping clusters (hyper-rectangles) around local centers and visualizes only data within these clusters at each iteration. Each cluster is required to contain cases of only one class and to be the largest cluster with this property in the SPC visualization.
Such sequential splitting allows: (1) avoiding occlusion, (2) finding visually local classification patterns and rules, and (3) combining local sub-rules into a global rule that classifies all given data of two or more classes. The computational experiments with Wisconsin Breast Cancer data (9-D), User Knowledge Modeling data (6-D), and Letter Recognition data (17-D) from the UCI Machine Learning Repository confirm this capability. At each iteration, these data were split into training (70%) and validation (30%) data. It required 3 iterations for Wisconsin Breast Cancer data, 4 iterations for User Knowledge Modeling data, and 5 iterations for Letter Recognition data, yielding 3, 4, and 5 local sub-rules, respectively, that covered over 95% of all n-D data points with 100% accuracy in both training and validation experiments. After each iteration, the data that were used in this iteration are removed and the remaining data are used in the next iteration. This removal process also helps to decrease occlusion. The GLC-S algorithm refuses to classify remaining cases that are not covered by these rules, i.e., cases that do not belong to the found hyper-rectangles. The interactive visualization process in SPC allows adjusting the sides of the hyper-rectangles to maximize the size of each hyper-rectangle without overlap with the hyper-rectangles of the opposing classes.
The GLC-S method splits data using the fixed split of n coordinates to pairs. This hybrid visual and analytical approach avoids throwing all data of several classes into a visualization plot that typically ends up in a messy highly occluded picture that hides useful patterns. This approach allows revealing these hidden patterns.
The visualization process in SPC is reversible (lossless), i.e., all n-D information is visualized in 2-D and can be restored from the 2-D visualization for each n-D case. This hybrid visual analytics method allowed classifying n-D data in a way that can be communicated to the user in an understandable and visual form.
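How a set of hyper-rectangle sub-rules can act as a classifier that refuses uncovered cases may be easier to see in code. The following Python sketch covers only the rule-application step; it does not reproduce the interactive GLC-S search for the hyper-rectangles. The names `inside`, `classify`, and `coverage`, the class labels, and the bounds are all hypothetical.

```python
from typing import List, Optional, Sequence, Tuple

# A sub-rule is (class label, lower bounds, upper bounds) of one hyper-rectangle.
Rule = Tuple[str, Sequence[float], Sequence[float]]


def inside(point: Sequence[float], lower: Sequence[float], upper: Sequence[float]) -> bool:
    """True if every attribute of the point lies within the hyper-rectangle bounds."""
    return all(lo <= x <= hi for x, lo, hi in zip(point, lower, upper))


def classify(point: Sequence[float], rules: Sequence[Rule]) -> Optional[str]:
    """Apply sub-rules in the order they were found; return None to refuse classification."""
    for label, lower, upper in rules:
        if inside(point, lower, upper):
            return label
    return None  # the point is not covered by any hyper-rectangle


def coverage(points: Sequence[Sequence[float]], rules: Sequence[Rule]) -> float:
    """Fraction of points that fall into at least one hyper-rectangle."""
    covered = sum(classify(p, rules) is not None for p in points)
    return covered / len(points)


# Hypothetical 2-D rules and data, for illustration only.
example_rules: List[Rule] = [
    ("class_a", [0.0, 0.0], [2.0, 2.0]),
    ("class_b", [5.0, 5.0], [9.0, 9.0]),
]
example_points = [[1.0, 1.0], [6.0, 6.0], [3.0, 3.0], [7.0, 8.0]]
```

With these made-up values, the point `[3.0, 3.0]` lies outside both hyper-rectangles and is refused, so three of the four points are covered.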
Visualization for Solving Non-image Problems and Saliency Mapping
High-dimensional data play an important role in knowledge discovery and data science. Integration of visualization, visual analytics, machine learning (ML), and data mining (DM) is a key aspect of data science research for high-dimensional data. This thesis explores the efficiency of a new algorithm that converts non-image data into raster images by visualizing the data as a heatmap in collocated paired coordinates (CPC). These images are called CPC-R images, and the algorithm that produces them is called the CPC-R algorithm. Powerful deep learning methods open an opportunity to solve non-image ML/DM problems by transforming them into image recognition problems. The main idea behind CPC-R is splitting the attributes of an n-D point into consecutive pairs, locating the pairs in the same 2-D Cartesian space, and assigning greyscale intensities or colors to the pairs. Several parameters can be changed to produce different versions of CPC-R images, which allows optimizing the images for classification. This thesis reports the results of computational experiments with the CPC-R algorithm for different Convolutional Neural Network classifiers, and the methods to optimize the several versions of CPC-R images for the same n-D point. These results show that the combined CPC-R and deep learning Convolutional Neural Network algorithms are able to solve non-image Machine Learning problems, reaching high accuracy on the benchmark datasets. The second part of this thesis reports the results of Saliency Mapping with the CPC-R algorithm. The saliency models take an image and generate a saliency map that predicts which regions of the image will most likely draw a human viewer’s attention. The saliency mappings with CPC-R are explored, and further optimization studies are outlined. This thesis estimates the importance of features by measuring the change in prediction accuracy when individual features are excluded.
The large sets of pixels are used as features that can capture a large context. This approach views a cell as the most informative if covering it leads to the largest decrease in classification accuracy. This method is called the Informative Cell Covering (ICC) algorithm.
Keywords: Knowledge Discovery, Deep Learning, Collocated Paired Coordinates, Convolutional Neural Networks, Raster Images, Machine Learning, Visualization, Nonimage data, Data conversion
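The main CPC-R idea stated in this abstract (consecutive attribute pairs placed in one shared 2-D grid and marked with greyscale intensities) can be sketched in Python as follows. This is a simplified illustration, not the thesis's algorithm: the function name `cpc_r_image`, the grid size, the assumption that attributes are already scaled to [0, 1], the padding of an odd-length point, the intensity schedule, and the overwrite rule for colliding pairs are all assumptions of the example.

```python
from typing import List, Sequence


def cpc_r_image(point: Sequence[float], size: int = 8) -> List[List[int]]:
    """Render an n-D point as a size x size greyscale image (values 0..255).

    Assumptions of this sketch: attributes are pre-scaled to [0, 1];
    an odd-length point is padded by repeating its last attribute;
    later pairs overwrite earlier ones that land in the same cell.
    """
    if not all(0.0 <= v <= 1.0 for v in point):
        raise ValueError("attributes must be scaled to [0, 1]")
    values = list(point)
    if len(values) % 2 == 1:
        values.append(values[-1])  # pad so every attribute belongs to a pair
    n_pairs = len(values) // 2
    image = [[0] * size for _ in range(size)]
    for i in range(n_pairs):
        x, y = values[2 * i], values[2 * i + 1]
        # All pairs share the same (collocated) axes: x selects the column, y the row.
        col = min(int(x * size), size - 1)
        row = min(int(y * size), size - 1)
        # Intensity encodes the order of the pair, so the sequence can be read back.
        image[row][col] = 255 * (i + 1) // n_pairs
    return image


# Hypothetical 4-D point rendered on a 4 x 4 grid.
example_image = cpc_r_image([0.0, 0.0, 1.0, 1.0], size=4)
```

An image produced this way could then be passed to any image classifier. The cell-covering idea from the abstract would correspond to masking a block of cells and measuring the change in classification accuracy.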
A Visual Analysis of EHR Flowsheet to Assist Clinicians’ Interpretation of Health-related Data
This project designed and implemented a visualization interface for EHR systems based on the requirements of the doctors. The visualizations combined both clinic-reported data and patient self-reported data to provide a better representation of patient health-related information for the doctor to make decisions. The visualizations highlighted the trends of values and the outliers, and made it possible to compare across time and measures. The user study of ten participants suggests that the visualization interface helped them find the information efficiently.
Master of Science in Information Scienc