8 research outputs found

    An Efficient Load Balancing Multi-core Frequent Patterns Mining Algorithm

    Mining frequent patterns from transactional databases is an important problem in data mining. Many methods have been proposed to solve this problem. However, the computation time still increases significantly as the data size grows. Therefore, parallel computing is a good strategy to solve this problem. Researchers have proposed various parallel and distributed algorithms for cluster and grid systems. However, the construction and maintenance costs of such systems are high. In this paper, a multi-core load-balancing frequent pattern mining algorithm is presented. The main goal of the proposed algorithm is to reduce the massive number of duplicated candidates generated by previous methods. To verify the performance, we implemented the proposed algorithm as well as previous methods for comparison. The experimental results showed that our method could reduce the computation time dramatically with more threads. Moreover, we observed that the workload was dispatched equally to each computing unit.
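
The abstract does not describe the algorithm itself, so the following is only a minimal sketch of the general idea of dividing support counting across workers. It is not the authors' method: the round-robin split, the function names, and the toy transactions are all assumptions made for illustration.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from itertools import combinations


def count_itemsets(transactions, size):
    """Count every itemset of the given size in one chunk of transactions."""
    counts = Counter()
    for transaction in transactions:
        # Sort and deduplicate so the same itemset always maps to the same key.
        for itemset in combinations(sorted(set(transaction)), size):
            counts[itemset] += 1
    return counts


def frequent_itemsets(transactions, size, min_support, workers=4):
    """Split transactions round-robin across workers and merge the counts."""
    # Round-robin slicing gives each worker a near-equal number of transactions
    # (a simple static balancing scheme, assumed here for illustration).
    chunks = [transactions[i::workers] for i in range(workers)]
    total = Counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for partial in pool.map(count_itemsets, chunks, [size] * workers):
            total.update(partial)
    return {itemset: n for itemset, n in total.items() if n >= min_support}


# Hypothetical toy data, not taken from the paper.
transactions = [
    ["bread", "milk"],
    ["bread", "butter", "milk"],
    ["butter", "milk"],
    ["bread", "butter"],
]
result = frequent_itemsets(transactions, size=2, min_support=2, workers=2)
print(result)
```

Because each chunk is counted independently and merged once, no candidate is counted twice for the same transaction. In standard CPython the global interpreter lock prevents threads from speeding up this CPU-bound loop, so a process pool would be the usual substitute when real parallel speed-up is needed.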

    Efficient and Effective Methodologies for Exploring and Prediction Movement Patterns in Large Networks

    In the era of Big Data, the prevalence of networks of all kinds has grown dramatically, and analysing (mining) such networks to support decision-making processes has become an extremely important subject for research, typically with a view to some social and/or economic gain. This thesis describes research work within the theme of Movement Pattern Mining (MPM) as applied to large network data. MPM is a type of frequent pattern mining that provides insight into how information is exchanged between objects in large networks. In the context of the work described in this thesis, the focus is on how Movement Patterns (MPs) can be extracted from large networks efficiently and effectively, and how such movement patterns can best be utilised to predict future movement. The work describes how, by utilising big data facilities such as Shared/Distributed Memory Systems and Hadoop/MapReduce, novel data mining based techniques can be used not only to extract MPs from large networks but also to apply them for prediction purposes. To this end, the work in this thesis is divided into two parts. The first part is concerned with an investigation of an efficient mechanism for MPM. The second part is concerned with the utilisation of the extracted MPs in the context of prediction. For evaluation purposes, two large network datasets were used: the Great Britain Cattle Tracking System database and the Jiayuan Social Network. The evaluation indicates that an efficient and effective mechanism for identifying and extracting MPs from large networks, and subsequently using these MPs for prediction purposes, has been established.

    Machine Learning Approaches for Breast Cancer Survivability Prediction

    Breast cancer is one of the leading causes of cancer death in women. If the disease is not diagnosed early, the 5-year survival rate of patients is only about 26%. Furthermore, patients with similar phenotypes can respond differently to the same therapies, which means the therapies might not work well for some of them. Identifying biomarkers that can help predict a cancer class with high accuracy is at the heart of breast cancer studies because they are targets of treatments and drug development. Genomics data have been shown to carry useful information for breast cancer diagnosis and prognosis, as well as for uncovering the disease's mechanism. Machine learning methods are powerful tools for finding such information. Feature selection methods are often utilized in supervised and unsupervised learning tasks to deal with data containing a large number of features, of which only a small portion is useful to the classification task. On the other hand, analyzing only one type of data, without reference to the existing knowledge about the disease and the therapies, can lead to misleading findings. Effective data integration approaches are necessary to understand this complex disease. In this thesis, we apply and develop machine learning methods to identify meaningful biomarkers for breast cancer survivability prediction after a certain treatment. They include applying feature selection methods to gene-expression data to derive gene signatures, where the initial genes are collected with regard to the mechanism of some drugs used in breast cancer therapies. We also propose a new feature selection method, named PAFS, and apply it to discover accurate biomarkers. In addition, there is increasing support for the view that sub-network biomarkers are more robust and accurate than gene biomarkers. We propose two network-based approaches to identify sub-network biomarkers for breast cancer survivability prediction after a treatment. 
They integrate gene-expression data with protein-protein interactions during the optimal sub-network search and use cancer-related genes and pathways to prioritize the extracted sub-networks. The sub-network search space is usually huge, and many proteins interact with thousands of other proteins. Thus, we apply some heuristics to avoid generating and evaluating redundant sub-networks.

    Pattern Mining and Sense-Making Support for Enhancing the User Experience

    While data mining techniques such as frequent itemset and sequence mining are well established as powerful pattern discovery tools in domains from science and medicine to business, a drawback is the lack of support for interactive exploration of the large numbers of patterns generated with diverse parameter settings and of the relationships among the mined patterns. To enhance the user experience, real-time query turnaround times and improved support for interactive mining are desired. There is also an increasing interest in applying data mining solutions to mobile data. Patterns mined over mobile data may enable context-aware applications ranging from automating frequently repeated tasks to providing personalized recommendations. Overall, this dissertation addresses three problems that limit the utility of data mining, namely, (a) lack of interactive exploration tools for mined patterns, (b) insufficient support for mining localized patterns, and (c) high computational mining requirements that prohibit mining of patterns on smaller compute units such as a smartphone. This dissertation develops interactive frameworks for the guided exploration of mined patterns and their relationships. Contributions include the PARAS preprocessing and indexing framework, which enables analysts to gain key insights into rule relationships in a parameter space view through compact storage of rules that supports query-time reconstruction of complete rulesets. Contributions also include the visual rule exploration framework FIRE, which presents an interactive dual view of the parameter space and the rule space that together enable enhanced sense-making of rule relationships. This dissertation also supports the online mining of localized association rules computed on data subsets by selectively deploying alternative execution strategies that leverage a multidimensional itemset-based data partitioning index. 
Finally, we designed OLAPH, an on-device context-aware service that learns phone usage patterns over mobile context data such as app usage, location, call and SMS logs to provide device intelligence. Concepts introduced for modeling mobile data as sequences include compressing context logs to intervaled context events, adding generalized time features, and identifying meaningful sequences via filter expressions.
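
The abstract does not define "intervaled context events", so the sketch below shows only one plausible reading: consecutive log records that share the same context value are merged into a single (start, end, value) interval. The record layout, function name, and sample log are assumptions for illustration, not OLAPH's actual data model.

```python
from itertools import groupby


def compress_to_intervals(log):
    """Merge consecutive (timestamp, value) records with equal values.

    Assumes the log is already sorted by timestamp. Returns a list of
    (start, end, value) tuples, one per run of identical values.
    """
    intervals = []
    for value, group in groupby(log, key=lambda record: record[1]):
        records = list(group)
        intervals.append((records[0][0], records[-1][0], value))
    return intervals


# Hypothetical location log: (minutes since midnight, place).
log = [
    (0, "home"),
    (5, "home"),
    (10, "work"),
    (15, "work"),
    (20, "work"),
    (25, "home"),
]
events = compress_to_intervals(log)
print(events)
```

Six raw records collapse to three interval events here, which illustrates why such compression can matter on a device with limited compute.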

    Mining localized co-expressed gene patterns from microarray data

    Ph.D (Doctor of Philosophy)

    Integrated scalable system for smart energy management

    The planet's reserves face critical challenges and suffer from inequitable consumption. The consequences of the depletion of natural reserves have started to affect every organism on the globe. Energy is a key factor in this respect because a considerable part of the destruction is triggered by utilising the planet's reserves to produce power in diverse forms. Increasing environmental awareness and the rapid development of smart concepts and home automation technologies, in both hardware and software, have played an essential role in speeding up the adoption of smart energy management, which is needed to put the situation back on track by focusing on two main divisions: firstly, producing clean and renewable energy, and secondly, reducing the loss of the total generated energy. This research concentrates on the second approach by proposing, implementing and evaluating a contemporary integrated, scalable, smart energy management framework that assists in reducing energy consumption in the household sector, covering a range from single households to large communities and big organisations with thousands of appliances. A number of corresponding strategies and policies, which utilise a set of observed and predicted system entities, are applied to keep meeting the most relevant quality attributes such as integrability, scalability, interoperability and availability. IoT concepts are applied in this context to connect conventional household appliances to a farm of microservices that implement predictive analytics techniques to reduce energy consumption by applying two main strategies: appliance substitution based on energy consumption, and the creation of automatic schedules to run appliances based on predictions. A case study is presented on two sample appliances within the household to illustrate the framework's validity and deliver percentage figures of the saved energy. 
Additionally, the framework offers a number of possibilities to provide relevant third parties, such as local energy providers, appliance manufacturers, or pertinent government offices, with the operational behaviours of various appliances under real-life conditions.

    Data mining in computational finance

    Computational finance is a relatively new discipline whose birth can be traced back to the early 1950s. Its major objective is to develop and study practical models focusing on techniques that apply directly to financial analyses. The large number of decisions and computationally intensive problems involved in this discipline make data mining and machine learning models an integral part of improving, automating, and expanding current processes. One of the objectives of this research is to present the state of the art of the data mining and machine learning techniques applied in the core areas of computational finance. Next, a detailed analysis of public and private finance datasets is performed in an attempt to find interesting facts in the data and draw conclusions regarding the usefulness of features within the datasets. Credit risk evaluation is one of the crucial modern concerns in this field. Credit scoring is essentially a classification problem where models are built using information about past applicants to categorise new applicants as ‘creditworthy’ or ‘non-creditworthy’. We appraise the performance of a few classical machine learning algorithms on the problem of credit scoring. Typically, credit scoring databases are large and characterised by redundant and irrelevant features, making the classification task more computationally demanding. Feature selection is the process of selecting an optimal subset of relevant features. We propose an improved information-gain directed wrapper feature selection method using genetic algorithms and successfully evaluate its effectiveness against baseline and generic wrapper methods using three benchmark datasets. One of the tasks of financial analysts is to estimate a company’s worth. In the last piece of work, this study predicts the growth rate of company earnings using three machine learning techniques. 
We employed the technique of lagged features, which allowed varying amounts of recent history to be brought into the prediction task and transformed the time series forecasting problem into a supervised learning problem. This work was applied to a private time series dataset.
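
The lagged-feature transformation mentioned above can be sketched as follows. This is a minimal illustration under stated assumptions, not the thesis's implementation: the function name, the window handling, and the sample growth rates are all hypothetical.

```python
def make_lagged_dataset(series, n_lags):
    """Turn a 1-D series into supervised-learning rows.

    Each feature row holds the n_lags values that precede the target,
    so a larger n_lags brings more recent history into the prediction.
    """
    if n_lags < 1 or n_lags >= len(series):
        raise ValueError("n_lags must be between 1 and len(series) - 1")
    features, targets = [], []
    for t in range(n_lags, len(series)):
        features.append(series[t - n_lags:t])  # the previous n_lags values
        targets.append(series[t])              # the value to predict
    return features, targets


# Hypothetical earnings growth rates, one per period.
growth = [0.10, 0.12, 0.08, 0.15, 0.11]
X, y = make_lagged_dataset(growth, n_lags=2)
for row, target in zip(X, y):
    print(row, "->", target)
```

Once the series is in this tabular form, any standard regression model can be fitted to the feature rows and targets, which is what turns forecasting into an ordinary supervised learning task.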