53 research outputs found

    G-Tric: enhancing triclustering evaluation using three-way synthetic datasets with ground truth

    Master's thesis, Data Science, Universidade de Lisboa, Faculdade de Ciências, 2020. Three-dimensional datasets, or three-way data, have gained popularity due to their increasing capacity to describe inherently multivariate and temporal events, such as biological responses, social interactions over time, urban dynamics, or complex geophysical phenomena. Triclustering, the subspace clustering of three-way data, enables the discovery of patterns corresponding to data subspaces (triclusters) with values correlated across the three dimensions (observations × features × contexts). With an increasing number of algorithms being proposed, effectively comparing them with state-of-the-art algorithms is paramount. These comparisons are usually performed using real data without a known ground truth, which limits the assessments. In this context, we propose a synthetic data generator, G-Tric, which allows the creation of synthetic datasets with configurable properties and the possibility to plant triclusters. The generator is prepared to create datasets resembling real three-way data from biomedical and social data domains, with the additional advantage of providing the ground truth (triclustering solution) as output. G-Tric can replicate real-world datasets and create new ones that match researchers' needs across several properties, including data type (numeric or symbolic), dimension, and background distribution. Users can tune the patterns and structure that characterize the planted triclusters (subspaces) and how they interact (overlapping). Data quality can also be controlled by defining the amount of missing values, noise, and errors. Furthermore, a benchmark of datasets resembling real data is made available, together with the corresponding triclustering solutions (planted triclusters) and generating parameters.
Evaluating triclustering with G-Tric makes it possible to combine intrinsic and extrinsic metrics when comparing solutions, producing more reliable analyses. A set of predefined datasets, mimicking widely used three-way data and exploring crucial properties, was generated and made available, highlighting G-Tric's potential to advance the triclustering state of the art by easing the process of evaluating the quality of new triclustering approaches. Besides reviewing the current state of the art regarding triclustering approaches, comparison studies, and evaluation metrics, this work also analyzes how the lack of frameworks to generate synthetic data influences existing evaluation methodologies, limiting the scope of performance insights that can be extracted from each algorithm, and exemplifies how the decisions made in these evaluations can impact the quality and validity of the results. As an alternative, a different methodology that takes advantage of synthetic data with ground truth is presented. This approach, combined with the proposal of an extension to an existing extrinsic clustering measure, makes it possible to assess the quality of solutions from new perspectives.
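
    The idea of planting a tricluster and scoring a solution against the ground truth can be illustrated with a minimal sketch. This is not G-Tric's interface: the function names `plant_tricluster` and `jaccard_3d`, the uniform background, the constant planted pattern, and the cell-level Jaccard score are all assumptions chosen for illustration, and the Jaccard score is a generic extrinsic measure rather than the extended measure proposed in the thesis.

```python
import random


def plant_tricluster(n_obs, n_feat, n_ctx, rows, cols, ctxs, value=5.0, seed=0):
    """Build a 3D nested list with uniform background values in [0, 1] and
    plant a constant-valued tricluster (hypothetical helper, not G-Tric)."""
    rng = random.Random(seed)
    data = [[[rng.uniform(0.0, 1.0) for _ in range(n_ctx)]
             for _ in range(n_feat)]
            for _ in range(n_obs)]
    # Overwrite the chosen subspace with the planted pattern.
    for i in rows:
        for j in cols:
            for k in ctxs:
                data[i][j][k] = value
    # The ground truth records which indices form the planted tricluster.
    truth = {"observations": sorted(rows),
             "features": sorted(cols),
             "contexts": sorted(ctxs)}
    return data, truth


def jaccard_3d(a, b):
    """Extrinsic score: Jaccard overlap of two triclusters, counted per cell."""
    def cells(t):
        return {(i, j, k)
                for i in t["observations"]
                for j in t["features"]
                for k in t["contexts"]}
    ca, cb = cells(a), cells(b)
    return len(ca & cb) / len(ca | cb)


data, truth = plant_tricluster(6, 5, 4, rows=[0, 2], cols=[1, 3], ctxs=[0, 1])
# A hypothetical algorithm output that recovers only one of the two contexts.
found = {"observations": [0, 2], "features": [1, 3], "contexts": [0]}
score = jaccard_3d(truth, found)
```

    Because the planted indices are known, any candidate solution can be scored directly against `truth`, which is not possible with real data that lacks a ground truth.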

    Predicting Purchase Proneness of Anonymous User in Mobile Commerce

    In recent years, mobile commerce has developed rapidly because of the popularity of mobile devices. However, because input on mobile devices is difficult, users of e-commerce websites usually do not log in while browsing, so a large number of website visitors are anonymous. To increase sales revenue and expand market share, an effective prediction of anonymous users' purchase proneness helps a website build targeted marketing strategies that induce anonymous users to purchase. In the past, customer segmentation was mainly analyzed and modeled using customers' historical data. However, the historical data of anonymous users cannot be obtained on mobile commerce sites, which makes that approach difficult to put into management practice. To solve this problem, this paper proposes a method based on random forest that uses user clickstream data to forecast purchase proneness in real time. The method includes two stages: model training and user purchase proneness prediction. In the model training stage, a classifier based on the random forest algorithm is trained. In the prediction stage, the classifier is used to predict the user's purchase proneness in real time. The proposed method can be effectively applied to the real-time prediction of anonymous users' purchase proneness, and the prediction results will help enterprises implement marketing measures in real time.

    Data-based fault detection in chemical processes: Managing records with operator intervention and uncertain labels

    Developing data-driven fault detection systems for chemical plants requires managing uncertain data labels and dynamic attributes due to operator-process interactions. Mislabeled data is a known problem in computer science that has received scarce attention from the process systems community. This work introduces and examines the effects of operator actions on records and labels, and the consequences for the development of detection models. Using a state space model, this work proposes an iterative relabeling scheme for retraining classifiers that continuously refines dynamic attributes and labels. Three case studies are presented: a reactor as a motivating example, flooding in a simulated de-Butanizer column as a complex case, and foaming in an absorber as an industrial challenge. For the first case, detection accuracy is shown to increase by 14% while operating costs are reduced by 20%. Moreover, for the de-Butanizer column, the performance of the proposed strategy is shown to be 10% higher than that of the filtering strategy. Finally, promising results are reported regarding efficient strategies to deal with the presented problem. Peer Reviewed. Postprint (author's final draft)

    Applications of Diffusion Maps in Gene Expression Data-Based Cancer Diagnosis Analysis

    Early detection of a tumor's site of origin is particularly important for cancer diagnosis and treatment. The use of gene expression profiles for different cancer types or subtypes has already shown significant advantages over traditional cancer classification methods. One of the major problems in gene expression data analysis aimed at cancer type recognition is the overwhelming number of gene expression measurements relative to the small number of samples, which causes the curse of dimensionality. Here, for dimensionality reduction, we use diffusion maps, which interpret the eigenfunctions of Markov matrices as a system of coordinates on the original data set in order to obtain an efficient representation of the geometric description of the data. The derived data are then clustered with Fuzzy ART to partition the cancer samples. Experimental results on the small round blue-cell tumor data set demonstrate the effectiveness of our proposed method in addressing multidimensional gene expression data and identifying different types of tumors.

    Statistical Techniques for Exploratory Analysis of Structured Three-Way and Dynamic Network Data.

    In this thesis, I develop different techniques for the pattern extraction and visual exploration of a collection of data matrices. Specifically, I present methods to help home in on and visualize an underlying structure and its evolution over ordered (e.g., time) or unordered (e.g., experimental conditions) index sets. The first part of the thesis introduces a biclustering technique for such three-dimensional data arrays. This technique is capable of discovering potentially overlapping groups of samples and variables that evolve similarly with respect to a subset of conditions. To facilitate and enhance visual exploration, I introduce a framework that utilizes kernel smoothing to guide the estimation of bicluster responses over the array. In the second part of the thesis, I introduce two matrix factorization models. The first is a data integration model that decomposes the data into two factors: a basis common to all data matrices, and a coefficient matrix that varies for each data matrix. The second model is meant for visual clustering of nodes in dynamic network data, which often contains complex evolving structure. Hence, this approach is more flexible and additionally lets the basis evolve for each matrix in the array. Both models utilize a regularization within the framework of non-negative matrix factorization to encourage local smoothness of the basis and coefficient matrices, which improves interpretability and highlights the structural patterns underlying the data, while mitigating noise effects. I also address computational aspects of applying regularized non-negative matrix factorization models to large data arrays by presenting multiple algorithms, including an approximation algorithm based on alternating least squares. PhD, Statistics, University of Michigan, Horace H. Rackham School of Graduate Studies. http://deepblue.lib.umich.edu/bitstream/2027.42/99838/1/smankad_1.pd

    Optimizing Information Gathering for Environmental Monitoring Applications

    The goal of environmental monitoring is to collect information from the environment and to generate an accurate model of a specific phenomenon of interest. Environmental monitoring applications can be divided into two macro areas that use different strategies for acquiring data from the environment. On the one hand, fixed sensors deployed in the environment allow constant monitoring and a steady flow of information from a predetermined set of locations in space. On the other hand, mobile platforms make it possible to choose sensing locations adaptively and rapidly, based on need. For some applications (e.g., water monitoring), this can significantly reduce monitoring costs compared with classical analysis performed by human operators. However, both cases share a common problem. The data collection process must account for limited resources, and the key problem is to choose where to perform observations (measurements) in order to acquire information from the environment most effectively and decrease the uncertainty about the analyzed phenomenon. We can generalize this concept under the name of information gathering. In general, maximizing the information that can be obtained from the environment is an NP-hard problem. Hence, optimizing the selection of the sampling locations is crucial in this context. For example, in the case of mobile sensors, reducing uncertainty about a physical process requires computing sensing trajectories constrained by the limited resources available, such as the battery lifetime of the platform or the computation power available on board. This problem is usually referred to as Informative Path Planning (IPP). In the other case, observation with a network of fixed sensors requires deciding beforehand the specific locations where the sensors have to be deployed.
Usually, the process of selecting a limited set of informative locations is performed by solving a combinatorial optimization problem that models the information gathering process. This thesis focuses on the above-mentioned scenario. Specifically, we investigate diverse problems and propose innovative algorithms and heuristics related to the optimization of information gathering techniques for environmental monitoring applications, for both mobile and fixed sensor deployments. Moreover, we also investigate the possibility of using a quantum computation approach in the context of information gathering optimization.
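
    Selecting a limited set of informative locations can be illustrated with a greedy heuristic. The sketch below is an assumption made for illustration, not an algorithm taken from the thesis: it replaces the information measure with a simple coverage count, and the name `greedy_placement` and the candidate locations are hypothetical.

```python
def greedy_placement(candidates, budget):
    """Pick up to `budget` locations, each time choosing the one that covers
    the most cells not yet covered (a stand-in for information gain).

    candidates: dict mapping a location name to the set of cells it observes.
    Returns the chosen locations in selection order and the covered cells.
    """
    chosen, covered = [], set()
    for _ in range(budget):
        remaining = [c for c in candidates if c not in chosen]
        if not remaining:
            break
        # Marginal gain: number of newly covered cells.
        best = max(remaining, key=lambda c: len(candidates[c] - covered))
        if not candidates[best] - covered:
            break  # no remaining location adds anything new
        chosen.append(best)
        covered |= candidates[best]
    return chosen, covered


# Hypothetical candidate sensor sites and the grid cells each one observes.
sites = {"A": {1, 2, 3}, "B": {3, 4}, "C": {4, 5, 6, 7}, "D": {1}}
chosen, covered = greedy_placement(sites, budget=3)
```

    Coverage is monotone and submodular, so greedy selection under a cardinality budget is a standard choice for it; a real information measure, such as entropy reduction under a probabilistic model, would replace the set-difference gain.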

    Security and privacy of users' personal Information on smartphones

    This research investigated the proliferation of malicious applications on smartphones and proposed a framework that can efficiently detect and classify such applications based on behavioural patterns. Additionally, the causes and impact of unauthorised disclosure of personal information by clean applications were examined, and countermeasures to protect smartphone users' privacy were proposed.