1,164 research outputs found

    Visual Discovery in Multivariate Binary Data

    Get PDF
    This paper presents the concept of Monotone Boolean Function Visual Analytics (MBFVA) and its application to the medical domain. The medical application is concerned with discovering breast cancer diagnostic rules (i) interactively with a radiologist, (ii) analytically with data mining algorithms, and (iii) visually. The coordinated visualization of these rules opens an opportunity to coordinate the rules, and to come up with rules that are meaningful for the expert in the field, and are confirmed with the database. This paper shows how to represent and visualize binary multivariate data in 2-D and 3-D. This representation preserves the structural relations that exist in multivariate data. It creates a new opportunity to guide the visual discovery of unknown patterns in the data. In particular, the structural representation allows us to convert a complex border between the patterns in multidimensional space into visual 2-D and 3-D forms. This decreases the information overload on the user. The visualization shows not only the border between classes, but also shows a location of the case of interest relative to the border between the patterns. A user does not need to see the thousands of previous cases that have been used to build a border between the patterns. If the abnormal case is deeply inside in the abnormal area, far away from the border between normal and abnormal patterns, then this shows that this case is very abnormal and needs immediate attention. The paper concludes with the outline of the scaling of the algorithm for the large data sets

    Information visualization for DNA microarray data analysis: A critical review

    Get PDF
    Graphical representation may provide effective means of making sense of the complexity and sheer volume of data produced by DNA microarray experiments that monitor the expression patterns of thousands of genes simultaneously. The ability to use ldquoabstractrdquo graphical representation to draw attention to areas of interest, and more in-depth visualizations to answer focused questions, would enable biologists to move from a large amount of data to particular records they are interested in, and therefore, gain deeper insights in understanding the microarray experiment results. This paper starts by providing some background knowledge of microarray experiments, and then, explains how graphical representation can be applied in general to this problem domain, followed by exploring the role of visualization in gene expression data analysis. Having set the problem scene, the paper then examines various multivariate data visualization techniques that have been applied to microarray data analysis. These techniques are critically reviewed so that the strengths and weaknesses of each technique can be tabulated. Finally, several key problem areas as well as possible solutions to them are discussed as being a source for future work

    Novel Algorithms and Datamining for Clustering Massive Datasets

    Get PDF
    Clustering proteomics data is a challenging problem for any traditional clustering algorithm. Usually, the number of samples is much smaller than the number of protein peaks. The use of a clustering algorithm which does not take into consideration the number of feature of variables (here the number of peaks) is needed. An innovative hierarchical clustering algorithm may be a good approach. This work proposes a new dissimilarity measure for the hierarchical clustering combined with a functional data analysis. This work presents a specific application of functional data analysis (FDA) to a highthrouput proteomics study. The high performance of the proposed algorithm is compared to two popular dissimilarity measures in the clustering of normal and Human T Cell Leukemia Virus Type 1 (HTLV-1)-infected patients samples. The difficulty in clustering spatial data is that the data is multi - dimensional and massive. Sometimes, an automated clustering algorithm may not be sufficient to cluster this type of data. An iterative clustering algorithm along with the capability of visual steering may be a good approach. This case study proposes a new iterative algorithm which is the combination of automated clustering methods like the bayesian clustering, detection of multivariate outliers, and the visual clustering. Simulated data from a plasma experiment and real astronomical data are used to test the performance of the algorithm

    Distributional Random Forests: Heterogeneity Adjustment and Multivariate Distributional Regression

    Full text link
    Random Forests (Breiman, 2001) is a successful and widely used regression and classification algorithm. Part of its appeal and reason for its versatility is its (implicit) construction of a kernel-type weighting function on training data, which can also be used for targets other than the original mean estimation. We propose a novel forest construction for multivariate responses based on their joint conditional distribution, independent of the estimation target and the data model. It uses a new splitting criterion based on the MMD distributional metric, which is suitable for detecting heterogeneity in multivariate distributions. The induced weights define an estimate of the full conditional distribution, which in turn can be used for arbitrary and potentially complicated targets of interest. The method is very versatile and convenient to use, as we illustrate on a wide range of examples. The code is available as Python and R packages drf

    14-08 Big Data Analytics to Aid Developing Livable Communities

    Get PDF
    In transportation, ubiquitous deployment of low-cost sensors combined with powerful computer hardware and high-speed network makes big data available. USDOT defines big data research in transportation as a number of advanced techniques applied to the capture, management and analysis of very large and diverse volumes of data. Data in transportation are usually well organized into tables and are characterized by relatively low dimensionality and yet huge numbers of records. Therefore, big data research in transportation has unique challenges on how to effectively process huge amounts of data records and data streams. The purpose of this study is to conduct research on the problems caused by large data volume and data streams and to develop applications for data analysis in transportation. To process large number of records efficiently, we have proposed to aggregate the data at multiple resolutions and to explore the data at various resolutions to balance between accuracy and speed. Techniques and algorithms in statistical analysis and data visualization have been developed for efficient data analytics using multiresolution data aggregation. Results will be helpful in setting up a primitive stage towards a rigorous framework for general analytical processing of big data in transportation

    Multi-threaded Implementation of Association Rule Mining with Visualization of the Pattern Tree

    Get PDF
    Motor Vehicle fatalities per 100,000 population in the United States has been reported to be 10.69% in the year 2012 as per NHTSA (National Highway Traffic Safety Administration). The fatality rate has increased by 0.27% in 2012 compared to the rate in the year 2011. As per the reports, there are many factors involved in increasing the fatality rate drastically such as driving under influence, testing while driving, and various other weather phenomena. Decision makers need to analyze the factors attributing to the increase in an accident rate to take implied measures. Current methods used to perform the data analysis process has to be reformed and optimized to make policies for controlling the high traffic accident rates. This research work is an extension to the data-mining algorithm implementation Most Associated Sequential Pattern (MASP). MASP uses association rule mining approach to mine interesting traffic accident data using a modified version of FP-growth algorithm. Owing to the huge amounts of available traffic accident data, MASP algorithm needs to be further modified to make it more efficient with respect to both space and time. Therefore, we present a parallel implementation to the MASP algorithm. In addition to this, pattern tree and apriori-tid algorithm implementation has been done. The application is designed in C# using .NET Framework and C# Task Parallel Library

    Learning to Discover Sparse Graphical Models

    Get PDF
    We consider structure discovery of undirected graphical models from observational data. Inferring likely structures from few examples is a complex task often requiring the formulation of priors and sophisticated inference procedures. Popular methods rely on estimating a penalized maximum likelihood of the precision matrix. However, in these approaches structure recovery is an indirect consequence of the data-fit term, the penalty can be difficult to adapt for domain-specific knowledge, and the inference is computationally demanding. By contrast, it may be easier to generate training samples of data that arise from graphs with the desired structure properties. We propose here to leverage this latter source of information as training data to learn a function, parametrized by a neural network that maps empirical covariance matrices to estimated graph structures. Learning this function brings two benefits: it implicitly models the desired structure or sparsity properties to form suitable priors, and it can be tailored to the specific problem of edge structure discovery, rather than maximizing data likelihood. Applying this framework, we find our learnable graph-discovery method trained on synthetic data generalizes well: identifying relevant edges in both synthetic and real data, completely unknown at training time. We find that on genetics, brain imaging, and simulation data we obtain performance generally superior to analytical methods
    • 

    corecore