37 research outputs found

    Analyzing the Fine Structure of Distributions

    One aim of data mining is the identification of interesting structures in data. For better analytical results, the basic properties of an empirical distribution, such as skewness and possible clipping, i.e., hard limits in the value range, need to be assessed. Of particular interest is the question of whether the data originate from one process or contain subsets related to different states of the data-generating process. Data visualization tools should deliver a clear picture of the univariate probability density function (PDF) of each feature. Visualization tools for PDFs typically use kernel density estimates and include the classical histogram as well as modern tools such as ridgeline plots, bean plots, and violin plots. If the density estimation parameters remain at their default settings, conventional methods pose several problems when visualizing the PDF of uniform, multimodal, or skewed distributions and of distributions with clipped data. For that reason, a new visualization tool called the mirrored density plot (MD plot), which is specifically designed to discover interesting structures in continuous features, is proposed. The MD plot does not require adjusting any density estimation parameters, which may make it particularly compelling to non-experts. The visualization tools in question are evaluated against statistical tests with regard to typical challenges of exploratory distribution analysis. The results of the evaluation are presented using bimodal Gaussian distributions, skewed distributions, and several features with already published PDFs. In an exploratory data analysis of 12 features describing quarterly financial statements, where statistical testing poses great difficulty, only the MD plot can identify the structure of their PDFs. In sum, the MD plot outperforms the methods mentioned above. Comment: 66 pages, 81 figures, accepted in PLOS ONE
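The clipping problem this abstract describes can be reproduced in a few lines of standard-library Python: a kernel density estimate with a rule-of-thumb bandwidth assigns probability mass beyond a hard limit in the data, hiding exactly the structure the MD plot is designed to reveal. The clip limit, sample size, and Silverman bandwidth below are illustrative assumptions, not parameters from the paper.

```python
import math
import random
import statistics

random.seed(0)
# Data with a hard upper limit ("clipping") at 1.0
data = [min(random.gauss(0.8, 0.3), 1.0) for _ in range(5000)]

n = len(data)
h = 1.06 * statistics.pstdev(data) * n ** (-1 / 5)  # Silverman's rule-of-thumb bandwidth

def kde(x):
    """Gaussian kernel density estimate at point x with default-style bandwidth h."""
    return sum(math.exp(-0.5 * ((x - xi) / h) ** 2) for xi in data) / (n * h * math.sqrt(2 * math.pi))

# Integrate the estimated density beyond the hard limit (trapezoid rule on [1.0, 1.6]).
xs = [1.0 + i * 0.002 for i in range(301)]
ys = [kde(x) for x in xs]
mass_beyond_limit = sum((ys[i] + ys[i + 1]) / 2 * 0.002 for i in range(300))
print(f"probability mass leaked past the clip limit: {mass_beyond_limit:.3f}")
```

Because roughly a quarter of the samples pile up exactly at the clip limit, the smooth estimate leaks a visible fraction of the total probability mass into a region where no data exist.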

    Robotic Wireless Sensor Networks

    In this chapter, we present a literature survey of an emerging, cutting-edge, and multi-disciplinary field of research at the intersection of Robotics and Wireless Sensor Networks (WSN), which we refer to as Robotic Wireless Sensor Networks (RWSN). We define an RWSN as an autonomous networked multi-robot system that aims to achieve certain sensing goals while meeting and maintaining certain communication performance requirements through cooperative control, learning, and adaptation. While both of the component areas, i.e., Robotics and WSN, are very well known and well explored, there exists a whole set of new opportunities and research directions at the intersection of these two fields that are relatively or even completely unexplored. One such example is the use of a set of robotic routers to set up a temporary communication path between a sender and a receiver, exploiting controlled mobility to the advantage of packet routing. We find that only a limited number of articles can be directly categorized as RWSN-related work, whereas a range of articles in the robotics and WSN literature are also relevant to this new field of research. To connect the dots, we first identify the core problems and research trends related to RWSN, such as connectivity, localization, routing, and robust flow of information. Next, we classify the existing research on RWSN, as well as the relevant state of the art from the robotics and WSN communities, according to the problems and trends identified in the first step. Lastly, we analyze what is missing in the existing literature and identify topics that require more research attention in the future.

    A Bioinformatics View on Acute Myeloid Leukemia Surface Molecules by Combined Bayesian and ABC Analysis

    “Big omics data” provoke the challenge of extracting meaningful information with clinical benefit. Here, we propose a two-step approach, an initial unsupervised inspection of the structure of the high-dimensional data followed by a supervised analysis of gene expression levels, to reconstruct the surface patterns of different subtypes of acute myeloid leukemia (AML). First, Bayesian methodology was used, focusing on surface molecules encoded by cluster of differentiation (CD) genes, to assess whether AML is a homogeneous group or segregates into clusters. Gene expression levels of 390 patient samples measured using microarray technology and of 150 samples measured via RNA-Seq were compared. Beyond acute promyelocytic leukemia (APL), a well-known AML subentity, the remaining AML samples separated into two distinct subgroups. Next, we investigated which CD molecules would best distinguish each AML subgroup from APL, and validated the discriminative molecules of both datasets by searching the scientific literature. Surprisingly, a comparison of both omics analyses revealed that CD339 was the only overlapping gene differentially regulated between APL and the other AML subtypes. In summary, our two-step approach to gene expression analysis revealed two previously unknown subgroup distinctions in AML based on surface molecule expression, which may guide the differentiation of subentities in a given clinical-diagnostic context.
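The two-step idea, unsupervised structure inspection first, supervised gene ranking second, can be sketched on toy data. This is not the paper's Bayesian/ABC methodology; it substitutes a minimal k-means for the unsupervised step and a simple effect-size score for the supervised step, on a fabricated expression matrix in which two hypothetical subgroups differ only in gene 0.

```python
import random
import statistics

random.seed(1)
# Toy expression matrix: 60 samples x 5 "CD genes"; two hypothetical
# subgroups differ only in gene 0 (purely illustrative data).
true_group = [i % 2 for i in range(60)]
X = [[random.gauss(3.0 + 2.0 * (g if j == 0 else 0), 0.5) for j in range(5)]
     for g in true_group]

def sqdist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def two_means(points, iters=25):
    """Minimal k-means (k=2) standing in for the unsupervised inspection step."""
    centers = [list(points[0]), list(points[1])]
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [0 if sqdist(p, centers[0]) <= sqdist(p, centers[1]) else 1
                  for p in points]
        for k in (0, 1):
            members = [p for p, l in zip(points, labels) if l == k]
            if members:
                centers[k] = [statistics.fmean(col) for col in zip(*members)]
    return labels

labels = two_means(X)

# Step 2: supervised ranking -- absolute difference of cluster means per gene,
# normalised by the pooled spread (a crude effect-size score).
def score(j):
    a = [x[j] for x, l in zip(X, labels) if l == 0]
    b = [x[j] for x, l in zip(X, labels) if l == 1]
    pooled = (statistics.pstdev(a) + statistics.pstdev(b)) / 2
    return abs(statistics.fmean(a) - statistics.fmean(b)) / pooled

best_gene = max(range(5), key=score)
print("most discriminative gene:", best_gene)
```

The point of the sketch is the ordering of the steps: cluster membership is discovered without labels, and only then are individual features ranked for their power to explain the discovered split.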

    Flow cytometry datasets consisting of peripheral blood and bone marrow samples for the evaluation of explainable artificial intelligence methods

    Three flow cytometry datasets are provided, consisting of diagnostic samples of either peripheral blood (pB) or bone marrow (BM) from patients without any sign of bone marrow disease at two different health care centers. In flow cytometry, each cell rapidly passes through a laser beam one by one, and two light-scatter parameters and eight surface parameters of more than 100,000 cells are measured per patient sample. The technology swiftly characterizes cells of the immune system at the single-cell level based on antigens presented on the cell surface, which are targeted by a set of fluorochrome-conjugated antibodies. The first dataset consists of N=14 sample files measured in Marburg and the second dataset of N=44 data files measured in Dresden, of which half are BM samples and half are pB samples. The third dataset contains N=25 healthy bone marrow samples and N=25 leukemia bone marrow samples measured in Marburg. The data have been log-scaled to values between zero and six and used to identify cell populations that are simultaneously meaningful to the clinician and relevant to the distinction of pB vs. BM and BM vs. leukemia. Explainable artificial intelligence methods should distinguish these samples and provide meaningful explanations for the classification without taking more than several hours to compute their results. The data described in this article are available in Mendeley Data.
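The description "log-scaled to values between zero and six" admits more than one reading; a minimal sketch under the assumption of a log10 transform followed by min-max rescaling onto [0, 6], applied to one hypothetical fluorescence channel:

```python
import math
import random

random.seed(2)
# Hypothetical raw fluorescence intensities for one channel (illustrative only).
raw = [random.lognormvariate(4, 2) for _ in range(1000)]

# Assumed interpretation: log10 transform, then min-max rescaling onto [0, 6].
logged = [math.log10(v) for v in raw]
lo, hi = min(logged), max(logged)
scaled = [6 * (v - lo) / (hi - lo) for v in logged]
print(f"range after scaling: [{min(scaled):.1f}, {max(scaled):.1f}]")
```

In practice, cytometry pipelines often prefer an arcsinh-style transform over a plain log10 because it tolerates zero and slightly negative compensated values; the rescaling step onto a fixed [0, 6] range is unchanged either way.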

    Exploiting Distance-Based Structures in Data Using an Explainable AI for Stock Picking

    In principle, the fundamental data of companies may be used to select stocks with a high probability of either an increasing or a decreasing price. Many of the commonly known rules or explanations used in such a stock-picking process are too vague to be applied in concrete cases, and at the same time, it is challenging to analyze high-dimensional data with a low number of cases in order to derive data-driven and usable explanations. This work proposes an explainable AI (XAI) approach based on the quarterly available fundamental data of companies traded on the German stock market. In the XAI, distance-based structures in data (DSD) are identified that guide decision tree induction. The leaves of an appropriately selected decision tree contain subsets of stocks and provide viable explanations that can be rated by a human. The prediction of the future price trends of specific stocks is made possible using the explanations and a rating. In each quarter, stock picking by DSD-XAI is based on understanding the explanations and has a higher success rate than arbitrary stock picking, a hybrid AI system, and a recent unsupervised decision tree called eUD3.5.
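The core move, turning distance-based cluster structure into a human-readable rule, can be illustrated with the simplest possible tree: a one-level decision stump that finds the threshold on a fundamental feature that best reproduces given cluster labels. The cluster labels, the "equity ratio" feature, and the data are all fabricated stand-ins; the paper's actual DSD clustering and tree induction are more elaborate.

```python
import random

random.seed(3)
# Toy fundamentals for 40 stocks: two hypothetical distance-based clusters
# that happen to differ in "equity ratio" (feature name is illustrative).
labels = [0] * 20 + [1] * 20   # stand-in for cluster labels from a DSD step
equity_ratio = [random.gauss(0.25 if l == 0 else 0.55, 0.05) for l in labels]

def best_stump(values, labels):
    """Exhaustively pick the threshold that best separates the two clusters."""
    best = (None, -1.0)
    for t in sorted(values):
        correct = sum((v <= t) == (l == 0) for v, l in zip(values, labels))
        acc = max(correct, len(labels) - correct) / len(labels)
        if acc > best[1]:
            best = (t, acc)
    return best

threshold, accuracy = best_stump(equity_ratio, labels)
print(f"rule: equity_ratio <= {threshold:.2f} -> cluster 0 (accuracy {accuracy:.2f})")
```

The output of such a stump is exactly the kind of statement a human can rate ("stocks with an equity ratio below t belong to this group"), which is the property the abstract claims for the leaves of the selected tree.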

    Visualization and 3D printing of multivariate data of biomarkers

    Dimensionality reduction by feature extraction is commonly used to project high-dimensional data into a low-dimensional space. With the aim of creating a visualization of the data, only projections onto two dimensions are considered here. Self-organizing maps were chosen as the projection method, which enabled the use of the U*-Matrix as an established method to visualize data as landscapes. Owing to the availability of 3D printing, this allows presenting the structure of the data in an intuitive way. For this purpose, information about the height of the landscape is used to produce a three-dimensional landscape with a 3D color printer. Similarities between high-dimensional data appear as valleys and dissimilarities as mountains or ridges. These 3D prints provide topical experts with a haptic grasp of high-dimensional structures. The method is exemplarily demonstrated on multivariate data comprising pain-related bio-responses. In addition, a new R package, “Umatrix”, is introduced that allows the user to generate landscapes with hypsometric tints.
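The landscape heights behind such a print come from a U-Matrix-style computation: for each unit of a trained self-organizing map, the mean distance between its weight vector and those of its grid neighbours. A minimal sketch on a fabricated 6x6 map whose top and bottom halves represent two flat clusters (the grid size, dimensionality, and weights are assumptions for illustration, and the plain U-Matrix is used rather than the U*-Matrix, which additionally weights in data density):

```python
import math
import random

random.seed(4)
ROWS, COLS, DIM = 6, 6, 3

# Hypothetical trained SOM weights: two flat regions (clusters), with the
# border between the top and bottom halves of the map.
W = [[[(0.0 if r < 3 else 1.0) + random.gauss(0, 0.02) for _ in range(DIM)]
      for c in range(COLS)] for r in range(ROWS)]

def u_height(r, c):
    """U-Matrix height: mean distance from unit (r, c) to its grid neighbours."""
    dists = []
    for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
        rr, cc = r + dr, c + dc
        if 0 <= rr < ROWS and 0 <= cc < COLS:
            dists.append(math.dist(W[r][c], W[rr][cc]))
    return sum(dists) / len(dists)

U = [[u_height(r, c) for c in range(COLS)] for r in range(ROWS)]
print("valley height:", round(max(U[0]), 3), " ridge height:", round(max(U[2]), 3))
```

Within a cluster the heights stay near zero (a valley); along the cluster border they jump (the mountain ridge), and it is this height field that is handed to the 3D printer.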

    Explainable AI Framework for Multivariate Hydrochemical Time Series

    The understanding of water quality and its underlying processes is important for the protection of aquatic environments. With the rare opportunity of access to a domain expert, an explainable AI (XAI) framework is proposed that is applicable to multivariate time series. The XAI provides explanations that are interpretable by domain experts. In three steps, it combines a data-driven choice of a distance measure with supervised decision trees guided by projection-based clustering. The multivariate time series consist of water quality measurements, including nitrate, electrical conductivity, and twelve other environmental parameters. The relationships between water quality and the environmental parameters are investigated by identifying similar days within a cluster and dissimilar days between clusters. The framework, called DDS-XAI, does not depend on prior knowledge about the data structure, and its explanations tend to be contrastive. The relationships in the data can be visualized by a topographic map representing the high-dimensional structures. Two state-of-the-art XAIs, called eUD3.5 and iterative mistake minimization (IMM), were unable to provide meaningful and relevant explanations for the three multivariate time series. The DDS-XAI framework can be swiftly applied to new data. Open-source code in R for all steps of the XAI framework is provided, and the steps are structured in an application-oriented manner.
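The "similar days within a cluster, dissimilar days between clusters" comparison can be made concrete by contrasting mean within-cluster and between-cluster distances on toy daily water-quality vectors. The two regimes, the features (nitrate, conductivity), and the plain Euclidean distance are illustrative assumptions; the framework itself makes a data-driven choice of the distance measure instead.

```python
import math
import random
import statistics

random.seed(5)
# Toy daily measurement vectors (nitrate mg/l, conductivity uS/cm) from two
# hypothetical hydrochemical regimes (fabricated values for illustration).
days_a = [[random.gauss(10, 1), random.gauss(400, 20)] for _ in range(30)]
days_b = [[random.gauss(25, 1), random.gauss(700, 20)] for _ in range(30)]

def mean_pairwise(xs, ys):
    """Mean Euclidean distance over all pairs, skipping identical objects."""
    return statistics.fmean(math.dist(x, y) for x in xs for y in ys if x is not y)

within = (mean_pairwise(days_a, days_a) + mean_pairwise(days_b, days_b)) / 2
between = mean_pairwise(days_a, days_b)
print(f"mean within-cluster distance: {within:.1f}, between-cluster: {between:.1f}")
```

A clustering is informative for the domain expert precisely when this gap is large; note that without scaling, the conductivity axis dominates the Euclidean distance here, which is one reason the framework treats the choice of distance measure as its own data-driven step.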