
    A Survey on Particle Swarm Optimization for Association Rule Mining

    Association rule mining (ARM) is one of the core techniques of data mining, used to discover potentially valuable association relationships in mixed datasets. In current research, various heuristic algorithms have been introduced into ARM to address the high computation time of traditional ARM. Although reviews of heuristic ARM algorithms already exist, this paper differs in offering a more comprehensive and multi-faceted survey of emerging research, intended as a reference to help researchers in the field understand state-of-the-art PSO-based ARM algorithms. We review the existing results, dividing heuristic algorithms for ARM into three main groups: biologically inspired, physically inspired, and other algorithms. We also describe the different types of ARM and their evaluation metrics, and discuss current improvements to PSO algorithms stage by stage, covering swarm initialization, algorithm parameter optimization, optimal particle update, and velocity and position updates. Finally, we discuss applications of PSO-based ARM algorithms and propose further research directions based on the open problems.
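
    To make the PSO-for-ARM setting concrete, here is a minimal, illustrative Python sketch, not any specific algorithm from the survey: each particle carries one weight per item, a threshold decodes the weights into a candidate rule, and fitness combines the rule's support and confidence. The toy transactions, the decode/fitness definitions, and all parameter values are assumptions for illustration only.

    import random

    # Hypothetical encoding: each particle holds one weight per item, and
    # thresholding the weights decodes the particle into a candidate rule.
    N_ITEMS = 6
    transactions = [{0, 1, 3}, {0, 2, 3}, {1, 3, 4}, {0, 1, 3, 5}, {2, 4}]

    def decode(position, cut=0.5):
        """Items weighted above `cut` form the antecedent; the highest-weighted remaining item is the consequent."""
        antecedent = {i for i, w in enumerate(position) if w > cut}
        rest = [i for i in range(N_ITEMS) if i not in antecedent]
        return antecedent, (max(rest, key=lambda i: position[i]) if rest else None)

    def fitness(position):
        """Support x confidence of the decoded rule, a common ARM objective."""
        antecedent, consequent = decode(position)
        if not antecedent or consequent is None:
            return 0.0
        n_ant = sum(antecedent <= t for t in transactions)
        n_rule = sum(antecedent | {consequent} <= t for t in transactions)
        if n_ant == 0:
            return 0.0
        return (n_rule / len(transactions)) * (n_rule / n_ant)

    # Standard PSO velocity update: inertia plus cognitive and social pulls.
    W, C1, C2 = 0.7, 1.5, 1.5
    swarm = [[random.random() for _ in range(N_ITEMS)] for _ in range(20)]
    velocity = [[0.0] * N_ITEMS for _ in range(20)]
    pbest = [p[:] for p in swarm]
    gbest = max(swarm, key=fitness)[:]

    for _ in range(50):
        for k, p in enumerate(swarm):
            for d in range(N_ITEMS):
                r1, r2 = random.random(), random.random()
                velocity[k][d] = (W * velocity[k][d]
                                  + C1 * r1 * (pbest[k][d] - p[d])
                                  + C2 * r2 * (gbest[d] - p[d]))
                p[d] = min(1.0, max(0.0, p[d] + velocity[k][d]))
            if fitness(p) > fitness(pbest[k]):
                pbest[k] = p[:]
                if fitness(p) > fitness(gbest):
                    gbest = p[:]

    antecedent, consequent = decode(gbest)
    print(f"best rule: {sorted(antecedent)} -> {consequent}, fitness = {fitness(gbest):.3f}")

    The stages the survey discusses (swarm initialization, parameter optimization, optimal particle update, velocity and position updates) each correspond to one piece of this loop, which is where the surveyed variants differ.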

    GENERIC FRAMEWORKS FOR INTERACTIVE PERSONALIZED INTERESTING PATTERN DISCOVERY

    Traditional frequent pattern mining algorithms generate an exponentially large number of patterns, a substantial portion of which are of little significance for many data analysis endeavours. Consequently, discovering a small number of interesting patterns from the exponentially large set of frequent patterns according to a particular user's interest is an important task. Existing works on pattern…
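
    As a concrete illustration of the problem this abstract describes (the frameworks themselves are not reproduced here, and the abstract is truncated), the following toy Python sketch mines frequent itemsets and then ranks them by a hypothetical user-interest profile. The `interest` weights, the 0.5 support threshold, and the scoring function are all assumptions for illustration.

    from itertools import combinations

    # Toy transactions plus a hypothetical per-user interest profile.
    transactions = [{"a", "b", "c"}, {"a", "c"}, {"a", "d"}, {"b", "c", "d"}]
    interest = {"a": 0.9, "b": 0.1, "c": 0.8, "d": 0.2}

    def support(itemset):
        return sum(itemset <= t for t in transactions) / len(transactions)

    # Enumerate frequent itemsets; fine for toy data, while real miners
    # would use Apriori or FP-growth to avoid the exponential blow-up.
    items = sorted({i for t in transactions for i in t})
    frequent = [set(c) for n in range(1, len(items) + 1)
                for c in combinations(items, n) if support(set(c)) >= 0.5]

    # Personalized interestingness: support weighted by mean user interest,
    # so the small result set reflects this particular user's preferences.
    def interestingness(itemset):
        return support(itemset) * sum(interest[i] for i in itemset) / len(itemset)

    for s in sorted(frequent, key=interestingness, reverse=True)[:3]:
        print(sorted(s), round(interestingness(s), 3))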

    No Need to Know Physics: Resilience of Process-based Model-free Anomaly Detection for Industrial Control Systems

    In recent years, a number of process-based anomaly detection schemes for Industrial Control Systems have been proposed. In this work, we provide the first systematic analysis of such schemes and introduce a taxonomy of properties that those detection systems verify. We then present a novel general framework for generating adversarial spoofing signals that violate physical properties of the system, and use it to analyze four anomaly detectors published at top security conferences. We find that three of those detectors are susceptible to a number of adversarial manipulations (e.g., spoofing with precomputed patterns), which we call Synthetic Sensor Spoofing, while one is resilient against our attacks. We investigate the root of its resilience and demonstrate that it comes from the properties we introduced. Our attacks reduce the Recall (True Positive Rate) of the attacked schemes, rendering them unable to correctly detect anomalies. The vulnerabilities we discovered thus show that, despite good original detection performance, those detectors cannot reliably learn the physical properties of the system. Even attacks against which prior work was expected to be resilient (based on the verified properties) proved successful. We argue that our findings demonstrate the need both for datasets with more complete attacks and for more critical analysis of process-based anomaly detectors. We plan to release our implementation as open source, together with an extension of two public datasets with a set of Synthetic Sensor Spoofing attacks generated by our framework.
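
    A minimal sketch of the attack idea, under assumed data and a deliberately simple detector rather than any of the four detectors analyzed in the paper: a model-free check learned from clean statistics catches a naive frozen-value spoof but is evaded by replaying a precomputed clean segment, in the spirit of Synthetic Sensor Spoofing.

    import numpy as np

    rng = np.random.default_rng(0)

    # Toy stationary sensor stream: a constant set-point plus measurement noise.
    signal = 5.0 + rng.normal(0.0, 0.05, 1000)

    # Simple process-based, model-free check learned from clean data:
    # a window is anomalous if its mean or spread leaves the trained band.
    train = signal[:500]
    mu, sd = train.mean(), train.std()

    def is_anomalous(window, k=3.0):
        return (abs(window.mean() - mu) > k * sd / np.sqrt(window.size)
                or window.std() < 0.5 * sd or window.std() > 1.5 * sd)

    # Naive spoof: freeze the reading at one value. Variance collapses, so it is caught.
    print("frozen spoof flagged:", is_anomalous(np.full(100, signal[500])))

    # Spoof in the spirit of precomputed patterns: replay a clean training
    # segment while the real process is manipulated. The detector only sees
    # statistics it has already learned, so its recall on this attack drops.
    print("replay spoof flagged:", is_anomalous(signal[100:200]))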

    Business Intelligence Tools for Supporting Modern Corporations

    This thesis examines the importance of business intelligence in modern enterprises. First, specific Business Intelligence technologies and tools are described, along with their significance for the overall corporate image and for improving communication among the company's departments. For the evaluation, a survey is conducted among people connected with modern enterprises, asking how well they understand Business Intelligence tools and their importance for modern enterprises themselves. Based on this survey, a knowledge gap in modern enterprises can be identified, together with the extent to which it affects companies and their functioning. Finally, possible solutions, strengths, and limitations based on the analysis of the survey are presented.

    Exploring data mining for hydrological modelling

    Technological advances in computer science, namely cloud computing and data mining, are reshaping the way the world looks at data. Data are becoming the drivers of discoveries and strategic developments. In environmental sciences, for instance, large volumes of information are produced by monitoring networks, satellites and model simulations, and are processed to uncover hidden patterns, correlations and trends that ultimately support policy and decision making. Hydrologists, in particular, use models to simulate river discharges and to estimate the concentration of pollutants as well as the risk of floods and droughts. The very first step of any hydrological modelling exercise consists of selecting an appropriate model. However, the choice is often made by the modeller based on his/her expertise rather than on the model's suitability to reproduce the most important processes for the area under study. Since this approach defeats the "scientific method" through its lack of reproducibility and consistency across experts as well as locations, a shift towards a data-driven selection process is deemed necessary. This work presents the design, development and testing of a completely novel data mining algorithm, called AMCA, able to automatically identify the most suitable model configurations for a given catchment, using minimal data requirements and an inventory of model structures. In the design phase, a transdisciplinary approach was adopted, borrowing techniques from the fields of machine learning, signal processing and marketing. The algorithm was tested on the Severn at Plynlimon flume catchment, in the Plynlimon study area (Wales, UK). This area was selected for its reliable measurements and the homogeneity of its soils and vegetation. The Framework for Understanding Structural Errors (FUSE) was used as the sample model inventory, but the methodology can easily be adapted to others, including more sophisticated model structures. The model configuration problem that the AMCA attempts to solve can be categorised as "fully unsupervised" when there is no prior knowledge of the interactions and relationships among observed data at a given location and the available model structures and parameters. Therefore, the first set of tests was run on a synthetic dataset to evaluate the algorithm's performance against known outcomes. Most of the components of the synthetic model structure were clearly identified by the AMCA, which allowed us to proceed with further testing on observed data. Using real observations, the AMCA efficiently selected the most suitable model structures and, when coupled with association rule mining techniques, could also identify optimal parameter ranges. The performance of the ensemble suggested by the combination of AMCA and association rules was calibrated and validated against four widely used models (Topmodel, ARNOVIC, PRMS and Sacramento). The ensemble configuration always returned the best average efficiency, characterised by the narrowest spread and, therefore, the lowest uncertainty. As a final application, the full set of FUSE models was used to predict the effect of land use changes on catchment flows. The predictive uncertainty improved significantly when the prior distributions of model structures and parameters were conditioned using the AMCA approach. It was also noticed that this improvement is due to constraints applied to both the model space and the parameter space; however, the parameter space seems to contribute more.
    These results confirm that a considerable part of the predictive uncertainty stems from the prior choice of model configuration, and that more objective ways to constrain the prior using formal data-driven techniques are needed. AMCA is, however, a procedure that can only be applied to gauged catchments. Future experiments could test whether AMCA configurations can be regionalised or transferred to ungauged catchments on the basis of catchment characteristics.
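
    The AMCA algorithm itself is not reproduced here, but the following toy Python sketch illustrates the underlying idea of data-driven structure selection: score every structure in a model inventory against observations using Nash-Sutcliffe efficiency and keep the best performers, rather than choosing by expert habit. The single-reservoir structures, the rainfall and discharge data, and all parameters are invented stand-ins, not FUSE.

    import numpy as np

    rng = np.random.default_rng(1)

    # Stand-in for a model inventory: each structure is a single linear
    # reservoir mapping rainfall to discharge (the thesis uses FUSE instead).
    def linear_store(rain, k):
        q, out = 0.0, []
        for r in rain:
            q = (1 - k) * q + k * r
            out.append(q)
        return np.array(out)

    models = {
        "fast_store": lambda r: linear_store(r, 0.6),
        "medium_store": lambda r: linear_store(r, 0.3),
        "slow_store": lambda r: linear_store(r, 0.1),
    }

    rain = rng.gamma(2.0, 1.0, 365)
    observed = linear_store(rain, 0.28) + rng.normal(0.0, 0.05, 365)

    def nse(sim, obs):
        """Nash-Sutcliffe efficiency: 1 is perfect; <= 0 is no better than the mean."""
        return 1 - np.sum((obs - sim) ** 2) / np.sum((obs - obs.mean()) ** 2)

    # Data-driven selection: score every structure against observations and
    # keep the best performers for an ensemble, instead of picking by habit.
    scores = {name: nse(model(rain), observed) for name, model in models.items()}
    for name, score in sorted(scores.items(), key=lambda kv: -kv[1]):
        print(f"{name}: NSE = {score:.3f}")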

    MATRIX DECOMPOSITION FOR DATA DISCLOSURE CONTROL AND DATA MINING APPLICATIONS

    Access to huge amounts of data containing private information gives rise to a dual demand: preservation of data privacy and correctness of knowledge discovery, two apparently contradictory tasks. Low-rank approximations generated by matrix decompositions are the fundamental element of the privacy-preserving data mining (PPDM) applications in this dissertation. Two categories of PPDM are studied: data value hiding (DVH) and data pattern hiding (DPH). A matrix-decomposition-based framework is designed to incorporate matrix decomposition techniques into data preprocessing so as to distort original data sets. To address the central challenge of DVH, namely protecting sensitive/confidential attribute values without jeopardizing the underlying data patterns, we propose singular value decomposition (SVD)-based and nonnegative matrix factorization (NMF)-based models, together with a discussion of data distortion and data utility metrics. Our experimental results on benchmark data sets demonstrate that the proposed models have the potential to outperform standard data perturbation models with regard to the balance between data privacy and data utility. Based on an equivalence between NMF and K-means clustering, a simultaneous data value and pattern hiding strategy is developed for data mining activities that use K-means clustering. Three schemes are designed to make slight alterations to submatrices such that user-specified cluster properties of data subjects are hidden. Performance evaluation demonstrates the efficacy of the proposed strategy, since some optimal solutions can be computed with zero side effects on nonconfidential memberships. This dual protection thus simplifies privacy preservation: a single modified data set suffices, with enhanced performance. In addition, an improved incremental SVD-updating algorithm is applied to speed up the real-time performance of the SVD-based model under frequent data updates. The performance and effectiveness of the improved algorithm have been examined on synthetic and real data sets. Experimental results indicate that the incremental matrix decomposition produces a significant speedup, and they also suggest potential support for the use of the SVD technique in On-Line Analytical Processing for business data analysis.
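
    A minimal sketch of the SVD-based distortion idea, with assumed details rather than the dissertation's exact models: truncate the SVD to the top r singular triplets so individual values are perturbed while dominant patterns survive, then report rough privacy and utility proxies. The data matrix, rank choice, and metrics are illustrative assumptions.

    import numpy as np

    rng = np.random.default_rng(42)

    # Toy "confidential" data matrix: rows are records, columns are attributes.
    X = rng.normal(size=(100, 8))

    def svd_distort(X, r=3):
        """Keep only the top-r singular triplets: exact values are perturbed while the dominant patterns that mining relies on survive."""
        U, s, Vt = np.linalg.svd(X, full_matrices=False)
        return U[:, :r] @ np.diag(s[:r]) @ Vt[:r, :]

    X_priv = svd_distort(X, r=3)

    # Rough proxies: how far values moved (privacy) versus how well pairwise
    # column correlations are preserved (utility for pattern discovery).
    value_distortion = np.linalg.norm(X - X_priv) / np.linalg.norm(X)
    corr_error = np.abs(np.corrcoef(X.T) - np.corrcoef(X_priv.T)).mean()
    print(f"relative value distortion: {value_distortion:.3f}")
    print(f"mean correlation error:    {corr_error:.3f}")

    Smaller r distorts values more aggressively (better privacy) at the cost of pattern fidelity; the balance between the two proxies is the privacy/utility trade-off the dissertation evaluates.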