15 research outputs found

    Exploring the Existing and Unknown Side Effects of Privacy Preserving Data Mining Algorithms

    Get PDF
    The data mining sanitization process converts data by masking its sensitive parts before releasing it to the public domain. During sanitization, side effects such as hiding failure, missing cost, and artificial cost have been observed. Privacy Preserving Data Mining (PPDM) algorithms were developed for the sanitization process to overcome information loss while maintaining data integrity. Besides their benefits for privacy preservation, these algorithms also aim to resolve the side effects that occur during sanitization, and many PPDM algorithms have been developed to reduce them. Several PPDM algorithms exist, built on different PPDM techniques. However, previous studies have not explored, or justified why, non-traditional side effects were given little importance. This study reports the side effects found for a set of PPDM algorithms in a newly created web repository. The research methodology adopted was Design Science Research (DSR), and the work was conducted in four phases. The first phase addressed the characteristics, similarities, differences, and relationships of existing side effects. The second phase characterised non-traditional side effects. The third phase used the Privacy Preservation and Security Framework (PPSF) tool to test whether non-traditional side effects occur in PPDM algorithms; this phase also attempted to find additional, unknown side effects not reported in prior studies. The PPDM algorithms considered were Greedy, POS2DT, SIF_IDF, cpGA2DT, pGA2DT, and sGA2DT, and the associated PPDM techniques were anonymization, perturbation, randomization, condensation, heuristic, reconstruction, and cryptography. The final phase involved creating a new online web repository that reports all the side effects found for the PPDM algorithms; it was built as a full-stack web application using the AngularJS, Spring, Spring Boot, and Hibernate frameworks. The results of the study document the various PPDM algorithms and their side effects, as well as the relationships among hiding failure, missing cost, and artificial cost and the impact each has on the others. Interestingly, the side effects were also found to depend on the type of data involved (sensitive, non-sensitive, or new). The web repository acts as a quick reference for PPDM algorithms. Because developing, improving, inventing, and reporting PPDM algorithms remains necessary, this study should encourage researchers and organizations to report, use, reuse, or develop better PPDM algorithms.
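    The three traditional side effects named above (hiding failure, missing cost, artificial cost) have simple set-based formulations in the PPDM literature. The sketch below shows one common way to compute them, assuming the frequent patterns mined from the original and sanitized databases are available as Python sets of itemsets; the function and the toy pattern sets are illustrative and not taken from the study's PPSF experiments.

```python
# A minimal sketch of the three classic sanitization side effects
# (hiding failure, missing cost, artificial patterns), assuming frequent
# patterns have already been mined from the original database D and the
# sanitized database D'. The set-based definitions follow common PPDM
# usage; this is not code from the study's PPSF experiments.

def side_effects(patterns_d, patterns_d_prime, sensitive):
    """Compute hiding failure, missing cost and artificial patterns.

    patterns_d       -- frequent patterns mined from the original data D
    patterns_d_prime -- frequent patterns mined from the sanitized data D'
    sensitive        -- the patterns the sanitization was meant to hide
    """
    sensitive_in_d = patterns_d & sensitive
    nonsensitive_in_d = patterns_d - sensitive

    # Hiding failure: sensitive patterns that survive sanitization.
    hf = len(patterns_d_prime & sensitive) / len(sensitive_in_d) if sensitive_in_d else 0.0

    # Missing cost: legitimate (non-sensitive) patterns lost by sanitization.
    mc = (len(nonsensitive_in_d - patterns_d_prime) / len(nonsensitive_in_d)
          if nonsensitive_in_d else 0.0)

    # Artificial patterns: spurious patterns that appear only after sanitization.
    ap = patterns_d_prime - patterns_d

    return hf, mc, ap


# Illustrative pattern sets (frozensets of items), not real mining output.
D  = {frozenset({"a", "b"}), frozenset({"b", "c"}), frozenset({"c", "d"})}
Dp = {frozenset({"b", "c"}), frozenset({"d", "e"})}
S  = {frozenset({"a", "b"})}

print(side_effects(D, Dp, S))   # -> (0.0, 0.5, {frozenset({'d', 'e'})})
```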

    Monte Carlo Method with Heuristic Adjustment for Irregularly Shaped Food Product Volume Measurement

    Get PDF
    Volume measurement plays an important role in the production and processing of food products. Various methods based on 3D reconstruction have been proposed to measure the volume of irregularly shaped food products. However, 3D reconstruction carries a high computational cost, and some reconstruction-based volume measurement methods have low accuracy. Another approach measures the volume of an object with the Monte Carlo method, which estimates volume from random points: it only requires knowing whether each random point falls inside or outside the object and does not require a 3D reconstruction. This paper proposes volume measurement with a computer vision system for irregularly shaped food products, without 3D reconstruction, based on the Monte Carlo method with heuristic adjustment. Five images of each food product were captured by five cameras and processed into binary images. Monte Carlo integration with heuristic adjustment was then performed to measure the volume from the information extracted from the binary images. The experimental results show that the proposed method provides high accuracy and precision compared to the water displacement method, and is both more accurate and faster than the space carving method.
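    As a rough illustration of the inside/outside idea described above, the sketch below estimates a volume by sampling random points in a bounding box and keeping only those whose projections land on the object in every binary silhouette. It assumes ideal orthographic views of a synthetic sphere rather than the paper's calibrated five-camera setup, and it omits the heuristic adjustment, so the visual-hull bias the adjustment is meant to correct is visible in the output.

```python
# A minimal sketch of silhouette-based Monte Carlo volume estimation,
# assuming three ideal orthographic views of a synthetic sphere instead of
# the paper's calibrated five-camera setup; the heuristic adjustment is
# not reproduced. A point counts as "inside" only if its projection lands
# on an object pixel in every binary silhouette.
import numpy as np

RES = 200          # silhouette resolution (pixels per side)
SIDE = 1.0         # bounding cube side length
RADIUS = 0.4       # radius of the synthetic sphere centred at the origin

# Orthographic silhouette of the sphere (a filled circle); by symmetry the
# same image serves for the views along the x, y and z axes.
u = (np.arange(RES) + 0.5) / RES * SIDE - SIDE / 2
uu, vv = np.meshgrid(u, u, indexing="ij")
silhouette = (uu ** 2 + vv ** 2) <= RADIUS ** 2

def to_pixel(coord):
    """Map coordinates in [-SIDE/2, SIDE/2) to pixel indices."""
    return np.clip(((coord + SIDE / 2) / SIDE * RES).astype(int), 0, RES - 1)

def inside_all_views(pts):
    """True for points whose projections hit the object in every view."""
    x, y, z = pts[:, 0], pts[:, 1], pts[:, 2]
    ok_x = silhouette[to_pixel(y), to_pixel(z)]   # view along the x axis sees (y, z)
    ok_y = silhouette[to_pixel(x), to_pixel(z)]   # view along the y axis sees (x, z)
    ok_z = silhouette[to_pixel(x), to_pixel(y)]   # view along the z axis sees (x, y)
    return ok_x & ok_y & ok_z

rng = np.random.default_rng(0)
n = 1_000_000
points = rng.uniform(-SIDE / 2, SIDE / 2, size=(n, 3))   # random points in the cube
volume = inside_all_views(points).sum() / n * SIDE ** 3

# The estimate approximates the visual hull (three intersecting cylinders),
# which slightly overestimates the true sphere volume; correcting this kind
# of bias is the role of the heuristic adjustment in the paper.
print(f"Monte Carlo estimate : {volume:.4f}")
print(f"True sphere volume   : {4 / 3 * np.pi * RADIUS ** 3:.4f}")
```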

    New Fundamental Technologies in Data Mining

    Get PDF
    The progress of data mining technology and its broad public popularity establish a need for a comprehensive text on the subject. The series of books entitled "Data Mining" addresses this need by presenting in-depth descriptions of novel mining algorithms and many useful applications. Beyond explaining each topic in depth, the two books offer useful hints and strategies for solving the problems discussed in the subsequent chapters. The contributing authors have highlighted many future research directions that will foster multi-disciplinary collaborations and hence lead to significant developments in the field of data mining.

    Explainable, Security-Aware and Dependency-Aware Framework for Intelligent Software Refactoring

    Full text link
    As software systems continue to grow in size and complexity, their maintenance becomes more challenging and costly. Even for the most technologically sophisticated and competent organizations, building and maintaining high-performing software applications with high-quality code is an extremely challenging and expensive endeavor. Software refactoring is widely recognized as the key means of maintaining high-quality software by restructuring existing code and reducing technical debt. However, refactoring is difficult to achieve and often neglected because of several limitations of existing refactoring techniques that reduce their effectiveness. These limitations include, but are not limited to, detecting refactoring opportunities, recommending specific refactoring activities, and explaining the recommended changes. Existing techniques focus mainly on quality metrics such as coupling, cohesion, and the Quality Metrics for Object Oriented Design (QMOOD). However, many other factors are identified in this work to assist and facilitate different maintenance activities for developers:
    1. To structure the refactoring field and existing research results, this dissertation provides the most scalable and comprehensive systematic literature review on refactoring, analyzing the results of 3183 research papers covering the last three decades. Based on this survey, we created a taxonomy to classify the existing research, identified research trends, and highlighted gaps in the literature for further research.
    2. To draw attention to what the current refactoring research focus should be from the developers’ perspective, we carried out the first large-scale refactoring study on the most popular online Q&A forum for developers, Stack Overflow. We collected and analyzed posts to identify what developers ask about refactoring and the challenges practitioners face when refactoring software systems.
    3. To improve the detection of refactoring opportunities in terms of quality and security in the context of mobile apps, we designed a framework that recommends the files to be refactored based on user reviews. We also considered the detection of refactoring opportunities in the context of web services and proposed a machine learning-based approach that helps service providers and subscribers predict the quality of service at the lowest cost. Furthermore, to help developers accurately assess the quality of their software systems and decide whether the code should be refactored, we propose a clustering-based approach that automatically identifies the preferred benchmark to use for the quality assessment of a project.
    4. Regarding the refactoring generation process, we proposed techniques that enhance the change operators and seeding mechanism by using the history of applied refactorings and incorporating refactoring dependencies, in order to improve the quality of the refactoring solutions. We also introduced the security aspect into refactoring recommendations by investigating the possible impact of improving different quality attributes on a set of security metrics and finding the best trade-off between them. In another approach, we recommend refactorings that prioritize fixing quality issues in security-critical files, improve quality attributes, and remove code smells.
All the above contributions were validated at large scale on thousands of open-source and industry projects, in collaboration with industry partners and the open-source community. The contributions of this dissertation are integrated into a cloud-based refactoring framework that is currently used by practitioners. Ph.D. dissertation, College of Engineering & Computer Science, University of Michigan-Dearborn. http://deepblue.lib.umich.edu/bitstream/2027.42/171082/1/Chaima Abid Final Dissertation.pdf

    Autonomous Hypothesis Generation for Knowledge Discovery in Continuous Domains

    Full text link
    Advances in computational power and in data collection and storage techniques are making new data available every day. This situation has given rise to hypothesis generation research, which complements conventional hypothesis testing research. Hypothesis generation research adopts techniques from machine learning and data mining to autonomously uncover causal relations among variables in the form of previously unknown hidden patterns and models in data. Those patterns and models can take different forms (e.g. rules, classifiers, clusters, causal relations). In some situations, data are collected without prior supposition or the imposition of a specific research goal or hypothesis, and domain knowledge for this type of problem is often limited. For example, in sensor networks, sensors constantly record data in which not all forms of relationships can be described in advance, and the environment may change without prior warning. In such situations, hypothesis generation techniques can provide a paradigm for gaining new insights about the data and the underlying system. This thesis proposes a general hypothesis generation framework in which assumptions about the observational data and the system are not predefined. The problem is decomposed into two interrelated sub-problems: (1) the associative hypothesis generation problem and (2) the causal hypothesis generation problem. The former defines the task of finding evidence of potential causal relations in the data; the latter defines the refined task of identifying causal relations. A novel association rule algorithm for continuous domains, called functional association rule mining, is proposed to address the first problem. An agent-based causal search algorithm is then designed for the second problem: it systematically tests the potential causal relations by querying the system to generate specific data, thus allowing causality to be asserted. Empirical experiments show that the functional association rule mining algorithm can uncover associative relations from data; if the underlying relationships in the data overlap, the algorithm decomposes them into their constituent non-overlapping parts. Experiments with the causal search algorithm show a relatively low error rate on the retrieved hidden causal structures. In summary, the contributions of this thesis are: (1) a general framework for hypothesis generation in continuous domains, which relaxes a number of conditions assumed in existing automatic causal modelling algorithms and defines a more general hypothesis generation problem; (2) a new functional association rule mining algorithm, which serves as a probing step to identify associative relations in a given dataset and contributes a novel functional association rule definition and algorithms to the association rule mining literature; and (3) a new causal search algorithm, which identifies the hidden causal relations of an unknown system on the basis of functional association rule mining and relaxes a number of assumptions commonly used in automatic causal modelling.
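    To make the two-stage idea concrete, the sketch below performs a crude version of the probing step: it scans continuous observations for strongly associated variable pairs, which would then become candidate hypotheses for the causal search stage. Plain Pearson correlation is used as a stand-in measure; the thesis's functional association rule mining algorithm is far more general and is not reproduced here, and the variables and data are synthetic.

```python
# A minimal sketch of the associative "probing" stage described above:
# strongly associated variable pairs in continuous observations are
# collected as candidate hypotheses for a later causal stage. Plain Pearson
# correlation is used as a stand-in; this is NOT the thesis's functional
# association rule mining algorithm, and the data are synthetic.
import numpy as np

rng = np.random.default_rng(1)
n = 500
temperature = rng.normal(20, 5, n)
humidity = 80 - 1.5 * temperature + rng.normal(0, 3, n)   # hidden functional link
wind = rng.normal(10, 2, n)                               # unrelated variable

observations = {"temperature": temperature, "humidity": humidity, "wind": wind}

def associative_hypotheses(columns, threshold=0.5):
    """Return variable pairs whose absolute Pearson correlation exceeds the threshold."""
    names = list(columns)
    candidates = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            r = float(np.corrcoef(columns[names[i]], columns[names[j]])[0, 1])
            if abs(r) >= threshold:
                candidates.append((names[i], names[j], round(r, 3)))
    # A causal search stage would then query or perturb the system to test
    # the direction of each surviving candidate relation.
    return candidates

print(associative_hypotheses(observations))
# expected output: a single strong (negative) temperature-humidity association
```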

    Edge/Fog Computing Technologies for IoT Infrastructure

    Get PDF
    The prevalence of smart devices and cloud computing has led to an explosion in the amount of data generated by IoT devices. Moreover, emerging IoT applications, such as augmented and virtual reality (AR/VR), intelligent transportation systems, and smart factories, require ultra-low latency for data communication and processing. Fog/edge computing is a new computing paradigm in which fully distributed fog/edge nodes located near end devices provide computing resources. By analyzing, filtering, and processing data at local fog/edge resources instead of transferring tremendous amounts of data to centralized cloud servers, fog/edge computing can reduce processing delay and network traffic significantly. With these advantages, fog/edge computing is expected to be one of the key enabling technologies for building the IoT infrastructure. Aiming to explore recent research and development on fog/edge computing technologies for building an IoT infrastructure, this book collects 10 articles. The selected articles cover diverse topics such as resource management, service provisioning, task offloading and scheduling, container orchestration, and security on edge/fog computing infrastructure, and can help readers grasp recent trends as well as state-of-the-art algorithms in fog/edge computing technologies.

    Exploring data mining for hydrological modelling

    No full text
    Technological advances in computer science, namely cloud computing and data mining, are reshaping the way the world looks at data. Data are becoming the drivers of discoveries and strategic developments. In environmental sciences, for instance, large volumes of information produced by monitoring networks, satellites and model simulations are processed to uncover hidden patterns, correlations and trends that, ultimately, support policy and decision making. Hydrologists, in particular, use models to simulate river discharges and estimate the concentration of pollutants as well as the risk of floods and droughts. The very first step of any hydrological modelling exercise consists of selecting an appropriate model. However, the choice is often made by the modeller based on his/her expertise rather than on the model's suitability to reproduce the most important processes for the area under study. Since this approach defeats the "scientific method" through its lack of reproducibility and consistency across experts as well as locations, a shift towards a data-driven selection process is deemed necessary. This work presents the design, development and testing of a completely novel data mining algorithm, called AMCA, able to automatically identify the most suitable model configurations for a given catchment using minimum data requirements and an inventory of model structures. In the design phase a transdisciplinary approach was adopted, borrowing techniques from the fields of machine learning, signal processing and marketing. The algorithm was tested on the Severn at Plynlimon flume catchment, in the Plynlimon study area (Wales, UK). This area was selected because of its reliable measurements and the homogeneity of its soils and vegetation. The Framework for Understanding Structural Errors (FUSE) was used as the sample model inventory, but the methodology can easily be adapted to others, including more sophisticated model structures. The model configuration problem that the AMCA attempts to solve can be categorised as "fully unsupervised" when there is no prior knowledge of interactions and relationships between the observed data at a certain location and the available model structures and parameters. Therefore, the first set of tests was run on a synthetic dataset to evaluate the algorithm's performance against known outcomes. Most of the components of the synthetic model structure were clearly identified by the AMCA, which allowed testing to proceed with observed data. Using real observations, the AMCA efficiently selected the most suitable model structures and, when coupled with association rule mining techniques, could also identify optimal parameter ranges. The performance of the ensemble suggested by the combination of AMCA and association rules was calibrated and validated against four widely used models (Topmodel, ARNOVIC, PRMS and Sacramento). The ensemble configuration always returned the best average efficiency, characterised by the narrowest spread and, therefore, lowest uncertainty. As a final application, the full set of FUSE models was used to predict the effect of land use changes on catchment flows. The predictive uncertainty improved significantly when the prior distributions of model structures and parameters were conditioned using the AMCA approach. It was also noticed that this improvement comes from constraints applied to both the model and parameter spaces, although the parameter space seems to contribute more.
These results confirm that a considerable part of the predictive uncertainty stems from the prior choice of the model configuration and that more objective, formal data-driven techniques to constrain this prior are needed. AMCA is, however, a procedure that can only be applied to gauged catchments. Future experiments could test whether AMCA configurations can be regionalised or transferred to ungauged catchments on the basis of catchment characteristics.

    Bioinformatics Applications Based On Machine Learning

    Get PDF
    The great advances in information technology (IT) have implications for many sectors, such as bioinformatics, and have considerably increased their possibilities. This book presents a collection of 11 original research papers, all of them related to the application of IT-related techniques within the bioinformatics sector: from new applications created by adapting and applying existing techniques, to new methodologies developed to solve existing problems.

    AVATAR - Machine Learning Pipeline Evaluation Using Surrogate Model

    Get PDF
    © 2020, The Author(s). The evaluation of machine learning (ML) pipelines is essential during automatic ML pipeline composition and optimisation. Previous methods, such as the Bayesian-based and genetic-based optimisation implemented in Auto-Weka, Auto-sklearn and TPOT, evaluate pipelines by executing them. As a result, pipeline composition and optimisation with these methods requires a tremendous amount of time, which prevents them from exploring complex pipelines to find better predictive models. To explore this research challenge further, we conducted experiments showing that many of the generated pipelines are invalid and that it is unnecessary to execute them to find out whether they are good pipelines. To address this issue, we propose a novel method that evaluates the validity of ML pipelines using a surrogate model (AVATAR). AVATAR accelerates automatic ML pipeline composition and optimisation by quickly discarding invalid pipelines. Our experiments show that AVATAR is more efficient at evaluating complex pipelines than traditional evaluation approaches that require executing them.
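    The sketch below illustrates the general idea of judging pipeline validity without execution: a small surrogate description of the dataset is propagated through each component's declared input requirements and output effects, and the pipeline is rejected as soon as a requirement is violated. This is a simplified illustration of the concept only; the component model, the classes, and the property names are invented here and are not the published AVATAR surrogate.

```python
# A minimal sketch of surrogate-based validity checking: instead of executing
# a candidate pipeline, propagate a tiny description of the dataset through
# each component's declared requirements and effects. Components, properties
# and names are invented for illustration; this is not the AVATAR model.
from dataclasses import dataclass, field

@dataclass
class DataState:
    """A tiny surrogate for a dataset: only the properties needed for checks."""
    numeric_only: bool = False
    has_missing: bool = True

@dataclass
class Component:
    name: str
    requires: dict = field(default_factory=dict)   # property -> required value
    effects: dict = field(default_factory=dict)    # property -> value after running

    def check_and_apply(self, state):
        for prop, required in self.requires.items():
            if getattr(state, prop) != required:
                return None                         # requirement violated -> invalid
        for prop, value in self.effects.items():
            setattr(state, prop, value)
        return state

def pipeline_is_valid(components, state):
    """Return True if every component's requirements hold along the chain."""
    for component in components:
        state = component.check_and_apply(state)
        if state is None:
            return False
    return True

imputer = Component("imputer", effects={"has_missing": False})
encoder = Component("one_hot_encoder", effects={"numeric_only": True})
svm = Component("svm", requires={"numeric_only": True, "has_missing": False})

print(pipeline_is_valid([imputer, encoder, svm], DataState()))   # True: valid chain
print(pipeline_is_valid([svm], DataState()))                     # False: unmet requirements
```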

    Personality Identification from Social Media Using Deep Learning: A Review

    Get PDF
    Social media helps people share ideas and information across the world and thus creates communities, groups, and virtual networks. Identifying personality is important in many kinds of applications, such as detecting a person's mental state or character, predicting job satisfaction and the success of professional and personal relationships, and building recommendation systems. Personality is also an important factor in determining individual variation in thoughts, feelings, and conduct. According to the 2018 Global Social Media Research survey, there are approximately 3.196 billion social media users worldwide, and this number is expected to grow rapidly with the spread of mobile smart devices and advances in technology. Support vector machines (SVM), Naive Bayes (NB), multilayer perceptron neural networks, and convolutional neural networks (CNN) are some of the machine learning techniques used for personality identification in the reviewed literature. This paper surveys studies that identify the personality of social media users with the help of machine learning approaches, and reviews recent studies that aim to predict the personality of online social media (OSM) users.
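    As a concrete example of one of the classical techniques listed above, the sketch below trains a linear SVM over TF-IDF features to predict a single binary trait label from user posts. The tiny corpus and the "extraverted" labels are invented purely for illustration and do not come from any of the reviewed studies.

```python
# A minimal sketch of trait prediction from posts with one of the techniques
# named above: a linear SVM over TF-IDF features. The corpus and the binary
# "extraverted" labels are toy data invented for illustration only.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

posts = [
    "had an amazing night out with friends, love meeting new people",
    "big party this weekend, everyone is invited!",
    "stayed home reading all day, peace and quiet",
    "prefer a quiet evening alone with tea and a book",
]
extraverted = [1, 1, 0, 0]          # 1 = extraverted, 0 = introverted (toy labels)

# TF-IDF features (unigrams and bigrams) feeding a linear SVM classifier.
model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LinearSVC())
model.fit(posts, extraverted)

print(model.predict(["going out to a concert with the whole group tonight"]))
```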