65 research outputs found
VizRank: Data Visualization Guided by Machine Learning
Data visualization plays a crucial role in identifying interesting patterns in exploratory data analysis. Its use is, however, made difficult by the large number of possible data projections showing different attribute subsets that must be evaluated by the data analyst. In this paper, we introduce a method called VizRank, which is applied to classified data to automatically select the most useful data projections. VizRank can be used with any visualization method that maps attribute values to points in a two-dimensional visualization space. It assesses possible data projections and ranks them by their ability to visually discriminate between classes. The quality of class separation is estimated by computing the predictive accuracy of a k-nearest neighbor classifier on the data set consisting of the x and y positions of the projected data points and their class information. The paper introduces the method and presents experimental results which show that VizRank's ranking of projections agrees closely with subjective rankings by data analysts. The practical use of VizRank is also demonstrated by an application in the field of functional genomics.
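The scoring idea in this abstract can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the function names `knn_accuracy` and `rank_scatterplots` are hypothetical, the projections are restricted to plain two-attribute scatterplots, accuracy is estimated by leave-one-out (the abstract does not say how it is estimated), and every attribute pair is enumerated exhaustively.

```python
from itertools import combinations
from math import dist


def knn_accuracy(points, labels, k=3):
    """Leave-one-out accuracy of a k-nearest-neighbor classifier on 2D points."""
    correct = 0
    for i, p in enumerate(points):
        # Indices of the k points closest to p, excluding p itself.
        neighbors = sorted(
            (j for j in range(len(points)) if j != i),
            key=lambda j: dist(p, points[j]),
        )[:k]
        votes = [labels[j] for j in neighbors]
        # Majority vote; sorting the candidates makes tie-breaking deterministic.
        predicted = max(sorted(set(votes)), key=votes.count)
        correct += predicted == labels[i]
    return correct / len(points)


def rank_scatterplots(rows, labels, k=3):
    """Score every attribute pair as a scatterplot, best class separation first."""
    n_attrs = len(rows[0])
    scored = []
    for a, b in combinations(range(n_attrs), 2):
        projected = [(row[a], row[b]) for row in rows]
        scored.append((knn_accuracy(projected, labels, k), (a, b)))
    return sorted(scored, reverse=True)
```

Each entry of the returned list is an `(accuracy, (attribute_a, attribute_b))` pair, so the first entry names the scatterplot in which the classes are easiest to tell apart under this score.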
Simple and Effective Visual Models for Gene Expression Cancer Diagnostics
In the paper we show that diagnostic classes in cancer gene expression data sets, which most often include thousands of features (genes), may be effectively separated with simple two-dimensional plots such as the scatterplot and the radviz graph. The principal innovation proposed in the paper is a method called VizRank, which is able to score and identify the best among possibly millions of candidate projections for visualizations. Compared to techniques that have recently been widely applied in cancer genomics, including neural networks, support vector machines and various ensemble-based approaches, VizRank is fast and finds visualization models that can be easily examined and interpreted by domain experts. Our experiments on a number of gene expression data sets show that VizRank was always able to find data visualizations with a small number of (two to seven) genes and excellent class separation. In addition to providing grounds for gene expression cancer diagnosis, VizRank and its visualizations also identify small sets of relevant genes, uncover interesting gene interactions and point to outliers and potential misclassifications in cancer data sets.
Three-dimensional Radial Visualization of High-dimensional Datasets with Mixed Features
We develop methodology for 3D radial visualization (RadViz) of
high-dimensional datasets. Our display engine is called RadViz3D and extends
the classical 2D RadViz that visualizes multivariate data in the 2D plane by
mapping every record to a point inside the unit circle. We show that
distributing anchor points at least approximately uniformly on the 3D unit
sphere provides a better visualization with minimal artificial visual
correlation for data with uncorrelated variables. Our RadViz3D methodology
therefore places equi-spaced anchor points, one for every feature, exactly for
the five Platonic solids, and approximately via a Fibonacci grid for the other
cases. Our Max-Ratio Projection (MRP) method then utilizes the group
information in high dimensions to provide distinctive lower-dimensional
projections that are then displayed using RadViz3D. Our methodology is extended
to datasets with discrete and continuous features where a Gaussianized
distributional transform is used in conjunction with copula models before
applying MRP and visualizing the result using RadViz3D. An R package radviz3d
implementing our complete methodology is available. Comment: 12 pages, 10 figures, 1 table
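For readers unfamiliar with the Fibonacci grid mentioned in this abstract, the following Python sketch shows one common golden-angle construction that spreads n points approximately evenly over the unit sphere. It is an assumption that this variant resembles the one used in RadViz3D; the abstract does not give the formula, and the function name `fibonacci_sphere` is hypothetical.

```python
from math import cos, pi, sin, sqrt


def fibonacci_sphere(n):
    """Return n approximately evenly spaced points on the unit sphere."""
    golden_angle = pi * (3.0 - sqrt(5.0))  # about 2.39996 radians
    points = []
    for i in range(n):
        # Heights are evenly spaced and stay strictly inside (-1, 1).
        z = 1.0 - (2.0 * i + 1.0) / n
        r = sqrt(1.0 - z * z)  # radius of the horizontal circle at height z
        theta = golden_angle * i  # rotate by the golden angle at each step
        points.append((r * cos(theta), r * sin(theta), z))
    return points
```

In a RadViz3D-style display each returned point would serve as the anchor for one feature, so `fibonacci_sphere(p)` would be called with p equal to the number of features.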
Application of Data Visualization and Big Data Analysis in Intelligent Agriculture
Intelligent agriculture can renovate agricultural production and management, making agricultural production truly scientific and efficient. The existing data mining technology for agricultural information is powerful and professional, but it is not well adapted for intelligent agriculture. Therefore, this paper introduces data visualization and big data analysis into the application scenarios of intelligent agriculture. Firstly, an intelligent agriculture data visualization system was established, and the RadViz data visualization method was detailed for intelligent agriculture. Moreover, the intelligent agriculture data were processed using dimensionality reduction through principal component analysis (PCA) and further optimized through k-means clustering (KMC). Finally, the crop yield was predicted using the multiple regression algorithm and the residual principal component regression algorithm. The crop yield prediction model proved effective in experiments.
Three-dimensional Radial Visualization of High-dimensional Continuous or Discrete Data
This paper develops methodology for 3D radial visualization of high-dimensional datasets. Our display engine is called RadViz3D and extends the classic RadViz that visualizes multivariate data in the 2D plane by mapping every record to a point inside the unit circle. The classic RadViz display has equally-spaced anchor points on the unit circle, with each of them associated with an attribute or feature of the dataset. RadViz3D obtains equi-spaced anchor points exactly for the five Platonic solids and approximately for the other cases via a Fibonacci grid. We show that distributing anchor points at least approximately uniformly on the 3D unit sphere provides a better visualization than in 2D. We also propose a Max-Ratio Projection (MRP) method that utilizes the group information in high dimensions to provide distinctive lower-dimensional projections that are then displayed using RadViz3D. Our methodology is extended to datasets with discrete and mixed features where a generalized distributional transform is used in conjunction with copula models before applying MRP and RadViz3D visualization.
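The classic 2D RadViz mapping described in this abstract can be sketched as follows. This is a minimal illustration, assuming the usual formulation in which each record is placed at the average of the anchors weighted by its attribute values, with the values already scaled to [0, 1]. The function names and the choice to send an all-zero record to the center are assumptions, not taken from the abstract.

```python
from math import cos, pi, sin


def radviz_anchors(n_features):
    """Equally spaced anchor points on the unit circle, one per feature."""
    return [
        (cos(2.0 * pi * j / n_features), sin(2.0 * pi * j / n_features))
        for j in range(n_features)
    ]


def radviz_point(values, anchors):
    """Map one record (values scaled to [0, 1]) to a point inside the unit circle."""
    total = sum(values)
    if total == 0:
        return (0.0, 0.0)  # assumed convention: an all-zero record sits at the center
    # Weighted average of the anchors, with the record's values as weights.
    x = sum(v * ax for v, (ax, _) in zip(values, anchors)) / total
    y = sum(v * ay for v, (_, ay) in zip(values, anchors)) / total
    return (x, y)
```

Because the weights are non-negative and sum to one after normalization, every mapped point lies inside the polygon spanned by the anchors, and so inside the unit circle. A record dominated by a single feature is pulled toward that feature's anchor.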
ICE: An Interactive Configuration Explorer for High Dimensional Categorical Parameter Spaces
There are many applications where users seek to explore the impact of the
settings of several categorical variables with respect to one dependent
numerical variable. For example, a computer systems analyst might want to study
how the type of file system or storage device affects system performance. A
usual choice is the method of Parallel Sets designed to visualize multivariate
categorical variables. However, we found that the magnitude of the parameter
impacts on the numerical variable cannot be easily observed here. We also
attempted a dimension reduction approach based on Multiple Correspondence
Analysis but found that the SVD-generated 2D layout resulted in a loss of
information. We hence propose a novel approach, the Interactive Configuration
Explorer (ICE), which directly addresses the need of analysts to learn how the
dependent numerical variable is affected by the parameter settings given
multiple optimization objectives. No information is lost as ICE shows the
complete distribution and statistics of the dependent variable in context with
each categorical variable. Analysts can interactively filter the variables to
optimize for certain goals such as achieving a system with maximum performance,
low variance, etc. Our system was developed in tight collaboration with a group
of systems performance researchers and its final effectiveness was evaluated
with expert interviews, a comparative user study, and two case studies. Comment: 10 pages, Published by IEEE at VIS 2019 (Vancouver, BC, Canada)
Visualising Mutually Non-dominating Solution Sets in Many-objective Optimisation
Copyright © 2013 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works for resale or redistribution to servers or lists, or reuse of any copyrighted components of this work in other works. As many-objective optimization algorithms mature, the problem owner is faced with visualizing and understanding a set of mutually nondominating solutions in a high dimensional space. We review existing methods and present new techniques to address this problem. We address a common problem with the well-known heatmap visualization, since the often arbitrary ordering of rows and columns renders the heatmap unclear, by using spectral seriation to rearrange the solutions and objectives and thus enhance the clarity of the heatmap. A multiobjective evolutionary optimizer is used to further enhance the simultaneous visualization of solutions in objective and parameter space. Two methods for visualizing multiobjective solutions in the plane are introduced. First, we use RadViz and exploit interpretations of barycentric coordinates for convex polygons and simplices to map a mutually nondominating set to the interior of a regular convex polygon in the plane, providing an intuitive representation of the solutions and objectives. Second, we introduce a new measure of the similarity of solutions, the dominance distance, which captures the order relations between solutions. This metric provides an embedding in Euclidean space, which is shown to yield coherent visualizations in two dimensions. The methods are illustrated on standard test problems and data from a benchmark many-objective problem.
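As background for the term "mutually nondominating" used in this abstract, the sketch below checks the standard Pareto dominance relation. It assumes every objective is to be minimized, and the function names are hypothetical. It does not implement the dominance distance, which the abstract names but does not define.

```python
def dominates(a, b):
    """True if a is no worse than b in every objective and strictly better in one."""
    no_worse = all(x <= y for x, y in zip(a, b))
    strictly_better = any(x < y for x, y in zip(a, b))
    return no_worse and strictly_better


def mutually_nondominating(solutions):
    """True if no solution in the set dominates any other (minimization assumed)."""
    return not any(dominates(a, b) for a in solutions for b in solutions)
```

For example, the objective vectors (1, 3), (2, 2) and (3, 1) trade one objective against the other, so none dominates another. Adding (3, 3) breaks the property because (2, 2) is better in both objectives.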
High-dimensional Clustering onto Hamiltonian Cycle
Clustering aims to group unlabelled samples based on their similarities. It
has become a significant tool for the analysis of high-dimensional data.
However, most of the clustering methods merely generate pseudo labels and thus
are unable to simultaneously present the similarities between different
clusters and outliers. This paper proposes a new framework called
High-dimensional Clustering onto Hamiltonian Cycle (HCHC) to solve the above
problems. First, HCHC combines global structure with local structure in one
objective function for deep clustering, improving the labels as relative
probabilities, to mine the similarities between different clusters while
keeping the local structure in each cluster. Then, the anchors of different
clusters are sorted on the optimal Hamiltonian cycle generated by the cluster
similarities and mapped on the circumference of a circle. Finally, a sample
with a higher probability of a cluster will be mapped closer to the
corresponding anchor. In this way, our framework allows us to appreciate three
aspects visually and simultaneously - clusters (formed by samples with high
probabilities), cluster similarities (represented as circular distances), and
outliers (recognized as dots far away from all clusters). The experiments
illustrate the superiority of HCHC.
- …