42,406 research outputs found

    A Pseudo Nearest-Neighbor Approach for Missing Data Recovery on Gaussian Random Data Sets

    Missing data handling is an important preparation step for most data discrimination or mining tasks. Inappropriate treatment of missing data may cause large errors or false results. In this paper, we study the effect of a missing data recovery method, namely the pseudo-nearest neighbor substitution approach, on Gaussian distributed data sets that represent typical cases in data discrimination and data mining applications. The error rate of the proposed recovery method is evaluated by comparing the clustering results of the recovered data sets to the clustering results obtained on the originally complete data sets. The results are also compared with those obtained by applying two other missing data handling methods: constant default value substitution and missing data ignorance (non-substitution). The experimental results provide valuable insight into improving the accuracy of data discrimination and knowledge discovery on large data sets containing missing values.
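
    The substitution idea in this abstract can be illustrated with a minimal sketch. It is a plain nearest-neighbor substitution, not the paper's exact pseudo-nearest-neighbor rule, which the abstract does not specify. The function name `nn_substitute`, the use of NaN to mark missing values, and the squared Euclidean distance over observed features are all assumptions made for illustration.

```python
import math

def nn_substitute(rows):
    """Fill NaN entries in each row from the nearest complete row.

    Assumption: "nearest" means smallest squared Euclidean distance,
    measured only over the features the incomplete row actually has.
    """
    complete = [r for r in rows if not any(math.isnan(v) for v in r)]
    if not complete:
        raise ValueError("need at least one complete row to borrow from")

    filled = []
    for row in rows:
        missing = [j for j, v in enumerate(row) if math.isnan(v)]
        if not missing:
            filled.append(list(row))
            continue
        observed = [j for j in range(len(row)) if j not in missing]
        # Pick the complete row closest on the observed features.
        donor = min(
            complete,
            key=lambda c: sum((c[j] - row[j]) ** 2 for j in observed),
        )
        new_row = list(row)
        for j in missing:
            new_row[j] = donor[j]  # copy the donor's value into the gap
        filled.append(new_row)
    return filled
```

    A comparison like the one described above would then cluster both the filled data and the originally complete data and count disagreements; that evaluation step is not shown here.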

    Data analysis and navigation in high-dimensional chemical and biological spaces

    The goal of this master thesis is to develop and validate a visual data-mining approach suitable for the screening of chemicals in the context of REACH [Registration, Evaluation, Authorization and Restriction of Chemicals]. The proposed approach will facilitate the development and validation of non-testing methods via the exploration of environmental endpoints and their relationship with the chemical structure and physicochemical properties of chemicals. The use of an interactive chemical space data exploration tool using 3D visualization and navigation will enrich the information available with additional variables like size, texture and color of the objects of the scene (compounds). The features that distinguish this approach and make it unique are (i) the integration of multiple data sources allowing the recovery in real time of complementary information of the studied compounds, (ii) the integration of several algorithms for the data analysis (dimensional reduction, generation of composite variables and clustering) and (iii) direct user interaction with the data through the virtual navigation mechanism. All this is achieved without the need for specialized hardware or the use of specific devices and high-cost virtual reality and mixed reality

    A General Spatio-Temporal Clustering-Based Non-local Formulation for Multiscale Modeling of Compartmentalized Reservoirs

    Representing the reservoir as a network of discrete compartments with neighbor and non-neighbor connections is a fast, yet accurate method for analyzing oil and gas reservoirs. Automatic and rapid detection of coarse-scale compartments with distinct static and dynamic properties is an integral part of such high-level reservoir analysis. In this work, we present a hybrid framework specific to reservoir analysis for the automatic detection of clusters in space using spatial and temporal field data, coupled with a physics-based multiscale modeling approach. Specifically, we couple a physics-based non-local modeling framework with data-driven clustering techniques to provide fast and accurate multiscale modeling of compartmentalized reservoirs. This research also adds to the literature by presenting a comprehensive study of spatio-temporal clustering for reservoir applications that accounts for the clustering complexities, the intrinsically sparse and noisy nature of the data, and the interpretability of the outcome. Keywords: Artificial Intelligence; Machine Learning; Spatio-Temporal Clustering; Physics-Based Data-Driven Formulation; Multiscale Modeling

    Grouping Method Of Image Fragments Of Adjacent Dislocation Etch Pits Of The Semiconductor Wafer

    Increasing production volumes of gallium arsenide semiconductor devices call for better control of dislocations in semiconductor wafers. This article proposes a method for grouping image fragments of adjacent dislocation etch pits of a semiconductor wafer. After processing by the described method, adjacent fragments are identified in the pre-binarized image of the wafer surface, which contains adjacent fragments of etch pits of dislocation loops. An improved method for determining the loop line width determines the edge line width of etch pits of suspected dislocations, given the variability of their appearance in the binarized image. The current loop line width is compared to the reference line width of the dislocation loop. The grouping method handles the recovery of branching loop lines, takes into account the various ways lines can adjoin, and determines the direction of further recovery of the loop line of dislocation etch pits. A step-by-step description of the method is given.

    The application of univariate and distributional analyses to assess the impacts of diamond mining on marine macrofauna off the Namibian Coast

    Bibliography: pages 114-116. This study is one of three based on grab samples of macrobenthos obtained before and at different times after mining for diamonds off the coast of Namibia. The first study dealt with multivariate clustering analysis of the first samples before and after mining. The second study focused on recovery times after mining, and this study aims to estimate the amount of stress encountered by benthic communities, for comparison with the descriptive multivariate approach. Two research areas, classified as 'northern' and 'southern', were investigated. Data were aggregated and analysed at the genus level. Graphical and statistical analyses were conducted on the data, which were classified in three ways. First, all unmined sites from the two research areas were analysed together to test for natural site-to-site variability. Secondly and thirdly, each research area (north and south) was analysed separately to test for differences between unmined and mined sites in each area. Stress levels in the community were assessed by Caswell's neutral model (the V-statistic) and by interpretation of the value of the W-statistic (a summary statistic of the ABC curves). Correlation techniques were applied to assess whether there was any relationship between the diversity indices (as indicators of the influence of disturbance on community structure) on the one hand, and the environmental indicators of disturbance (percentage gravel, sand, mud) on the other.
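
    As an illustration of the W-statistic mentioned in this abstract, the sketch below assumes the commonly cited formulation W = sum(B_i - A_i) / (50 (S - 1)), where A_i and B_i are the cumulative percentage dominance of abundance and biomass at rank i and S is the number of taxa (genera here). The abstract does not give the formula, so this formulation, the function names, and the input layout are assumptions.

```python
def cumulative_dominance(values):
    """Cumulative percentage curve after ranking values from largest to smallest."""
    ranked = sorted(values, reverse=True)
    total = sum(ranked)
    running, curve = 0.0, []
    for v in ranked:
        running += v
        curve.append(100.0 * running / total)
    return curve

def w_statistic(abundance, biomass):
    """Summarise an abundance-biomass comparison (ABC) plot as a single number.

    Positive values mean the biomass curve lies above the abundance curve;
    negative values mean the reverse.
    """
    if len(abundance) != len(biomass):
        raise ValueError("abundance and biomass must cover the same taxa")
    s = len(abundance)
    if s < 2:
        raise ValueError("need at least two taxa")
    a = cumulative_dominance(abundance)
    b = cumulative_dominance(biomass)
    return sum(bi - ai for ai, bi in zip(a, b)) / (50.0 * (s - 1))
```

    Under this formulation the value lies between -1 and 1, and values near zero indicate that the two curves nearly coincide.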

    Toeplitz Inverse Covariance-Based Clustering of Multivariate Time Series Data

    Subsequence clustering of multivariate time series is a useful tool for discovering repeated patterns in temporal data. Once these patterns have been discovered, seemingly complicated datasets can be interpreted as a temporal sequence of only a small number of states, or clusters. For example, raw sensor data from a fitness-tracking application can be expressed as a timeline of a select few actions (i.e., walking, sitting, running). However, discovering these patterns is challenging because it requires simultaneous segmentation and clustering of the time series. Furthermore, interpreting the resulting clusters is difficult, especially when the data is high-dimensional. Here we propose a new method of model-based clustering, which we call Toeplitz Inverse Covariance-based Clustering (TICC). Each cluster in the TICC method is defined by a correlation network, or Markov random field (MRF), characterizing the interdependencies between different observations in a typical subsequence of that cluster. Based on this graphical representation, TICC simultaneously segments and clusters the time series data. We solve the TICC problem through alternating minimization, using a variation of the expectation maximization (EM) algorithm. We derive closed-form solutions to efficiently solve the two resulting subproblems in a scalable way, through dynamic programming and the alternating direction method of multipliers (ADMM), respectively. We validate our approach by comparing TICC to several state-of-the-art baselines in a series of synthetic experiments, and we then demonstrate on an automobile sensor dataset how TICC can be used to learn interpretable clusters in real-world scenarios. Comment: This revised version fixes two small typos in the published version
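
    The dynamic-programming subproblem mentioned in this abstract can be sketched generically as follows. Given a per-time-step cost for each cluster, the sketch finds the cheapest assignment sequence when every change of cluster between consecutive steps pays a fixed penalty. The penalty parameter `beta`, the function name, and the `costs[t][k]` layout are assumptions for illustration; the abstract does not define them, and the costs would have to come from the fitted cluster models.

```python
def assign_with_switch_penalty(costs, beta):
    """Return the minimum-cost cluster sequence.

    costs[t][k] is the cost of placing time step t in cluster k;
    beta >= 0 is added each time consecutive steps use different clusters.
    """
    n, k = len(costs), len(costs[0])
    best = list(costs[0])      # best[j]: cheapest path ending in cluster j
    back = []                  # back[t-1][j]: previous cluster on that path
    for t in range(1, n):
        cheapest_prev = min(range(k), key=lambda j: best[j])
        new_best, ptr = [], []
        for j in range(k):
            stay = best[j]
            switch = best[cheapest_prev] + beta
            if stay <= switch:
                new_best.append(stay + costs[t][j])
                ptr.append(j)
            else:
                new_best.append(switch + costs[t][j])
                ptr.append(cheapest_prev)
        best = new_best
        back.append(ptr)

    # Walk the back-pointers from the cheapest final cluster.
    j = min(range(k), key=lambda i: best[i])
    path = [j]
    for ptr in reversed(back):
        j = ptr[j]
        path.append(j)
    path.reverse()
    return path
```

    With `beta = 0` each step simply takes its own cheapest cluster, while a large `beta` yields long contiguous segments, which is how segmentation and clustering can be handled in one pass.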