36 research outputs found

    Efficient itemset generator discovery over a stream sliding window

    Mining generator patterns has attracted great research interest in recent years. Itemset generators are mined mainly because they form equivalence classes together with closed itemsets, and because they can be used to generate simple classification rules according to the MDL principle. In this paper, we devise an efficient algorithm called StreamGen to mine frequent itemset generators over a stream sliding window. We adopt a novel enumeration tree structure that keeps the information of mined generators and the border between generators and non-generators, and propose some optimization techniques to speed up the mining process. We further extend the algorithm to directly mine a set of high quality classification rules over stream sliding windows while keeping high performance. The extensive performance study shows that our algorithm outperforms other state-of-the-art algorithms which perform similar tasks in terms of both runtime and memory usage efficiency, and has high utility in terms of classification
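
To make the notion of a frequent itemset generator over a sliding window concrete, here is a minimal brute-force sketch. It is not StreamGen and uses no enumeration tree or border maintenance; it simply recomputes the generators from scratch for each window. It assumes the usual definition (an itemset is a generator if every proper subset has strictly larger support), and the names `support`, `frequent_generators`, and `sliding_generators` are hypothetical.

```python
from collections import deque
from itertools import combinations


def support(itemset, window):
    """Number of transactions in the window that contain every item of itemset."""
    return sum(1 for t in window if itemset <= t)


def frequent_generators(window, min_sup):
    """Brute-force enumeration of frequent generators in the current window.

    Assumed definition: an itemset is a generator if every proper subset has
    strictly larger support. Support is anti-monotone, so it is enough to
    check the subsets obtained by dropping exactly one item.
    """
    items = sorted(set().union(*window))
    result = {}
    for size in range(1, len(items) + 1):
        for combo in combinations(items, size):
            x = frozenset(combo)
            sup = support(x, window)
            if sup < min_sup:
                continue  # not frequent
            if all(support(x - {i}, window) > sup for i in x):
                result[x] = sup
    return result


def sliding_generators(stream, window_size, min_sup):
    """Yield the frequent generators after each arrival once the window is full."""
    window = deque(maxlen=window_size)  # oldest transaction drops out automatically
    for transaction in stream:
        window.append(frozenset(transaction))
        if len(window) == window_size:
            yield frequent_generators(window, min_sup)


if __name__ == "__main__":
    stream = [{"a", "b"}, {"a", "b", "c"}, {"a", "c"}, {"b", "c"}]
    for gens in sliding_generators(stream, window_size=3, min_sup=2):
        print({tuple(sorted(g)): s for g, s in gens.items()})
```

The sketch is exponential in the number of distinct items, which is the cost an incremental algorithm of the kind the abstract describes is meant to avoid.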

    Mining Time-Changing Data Streams

    Streaming data have gained considerable attention in database and data mining communities because of the emergence of a class of applications, such as financial marketing, sensor networks, internet IP monitoring, and telecommunications that produce these data. Data streams have some unique characteristics that are not exhibited by traditional data: unbounded, fast-arriving, and time-changing. Traditional data mining techniques that make multiple passes over data or that ignore distribution changes are not applicable to dynamic data streams. Mining data streams has been an active research area to address requirements of the streaming applications. This thesis focuses on developing techniques for distribution change detection and mining time-changing data streams. Two techniques are proposed that can detect distribution changes in generic data streams. One approach for tackling one of the most popular stream mining tasks, frequent itemsets mining, is also presented in this thesis. All the proposed techniques are implemented and empirically studied. Experimental results show that the proposed techniques can achieve promising performance for detecting changes and mining dynamic data streams

    Mining semi-structured data, theoretical and experimental aspects of pattern evaluation

    This thesis investigates several ways of analysing semi-structured data, e.g. HTML files. HTML files have a structure, but where and how often text is made bold or italic varies from one HTML file to the next. Several ways of counting the occurrences of a pattern (for example, all molecules in our dataset contain a particular set of atoms and bonds) were examined in order to find interesting patterns. Presenting the results properly to the user is also important. This thesis addresses the visual presentation of the results of analysing (mining) semi-structured data, so that the user can find interesting patterns more easily. The conclusions are hard to summarise briefly. It turns out, however, that some patterns were more interesting when they occurred in very quick succession and others when they occurred, for example, weekly. To find still more interesting patterns, it is advisable to take this element of time into account. It also turns out that visualisations are needed to present the large number of patterns effectively; for example, the user sees at a glance which substructures of molecules occur. The research in this thesis is important for data analysis. Consider, for example, the analysis of customer behaviour. It is interesting for companies to know that customers buy certain products, for example, every Monday. This is novel because we discover subgroups of products, but we weight subgroups with the right temporal properties more heavily than subgroups that simply occur. The visualisation of co-occurring molecular substructures can speed up their analysis, and this way of plotting is new.

    Predictive trend mining for social network analysis

    This thesis describes research work within the theme of trend mining as applied to social network data. Trend mining is a type of temporal data mining that provides observation into how information changes over time. In the context of the work described in this thesis the focus is on how information contained in social networks changes with time. The work described proposes a number of data mining based techniques directed at mechanisms to not only detect change, but also support the analysis of change, with respect to social network data. To this end a trend mining framework is proposed to act as a vehicle for evaluating the ideas presented in this thesis. The framework is called the Predictive Trend Mining Framework (PTMF). It is designed to support "end-to-end" social network trend mining and analysis. The work described in this thesis is divided into two elements: Frequent Pattern Trend Analysis (FPTA) and Prediction Modeling (PM). For evaluation purposes three social network datasets have been considered: Great Britain Cattle Movement, Deeside Insurance and Malaysian Armed Forces Logistic Cargo. The evaluation indicates that a sound mechanism for identifying and analysing trends, and for using this trend knowledge for prediction purposes, has been established

    Unsupervised learning for anomaly detection in Australian medical payment data

    Fraudulent or wasteful medical insurance claims made by health care providers are costly for insurers. Typically, OECD healthcare organisations lose 3-8% of total expenditure due to fraud. As Australia’s universal public health insurer, Medicare Australia, spends approximately A$34 billion per annum on the Medicare Benefits Schedule (MBS) and Pharmaceutical Benefits Scheme, wasted spending of A$1–2.7 billion could be expected. However, fewer than 1% of claims to Medicare Australia are detected as fraudulent, below international benchmarks. Variation is common in medicine, and health conditions, along with their presentation and treatment, are heterogeneous by nature. Increasing volumes of data and rapidly changing patterns bring challenges which require novel solutions. Machine learning and data mining are becoming commonplace in this field, but no gold standard is yet available. In this project, requirements are developed for real-world application to compliance analytics at the Australian Government Department of Health and Aged Care (DoH), covering: unsupervised learning; problem generalisation; human interpretability; context discovery; and cost prediction. Three novel methods are presented which rank providers by potentially recoverable costs. These methods used association analysis, topic modelling, and sequential pattern mining to provide interpretable, expert-editable models of typical provider claims. Anomalous providers are identified through comparison to the typical models, using metrics based on costs of excess or upgraded services. Domain knowledge is incorporated in a machine-friendly way in two of the methods through the use of the MBS as an ontology. Validation by subject-matter experts and comparison to existing techniques shows that the methods perform well.
The methods are implemented in a software framework which enables rapid prototyping and quality assurance. The code is implemented at the DoH, and further applications as decision-support systems are in progress. The developed requirements will apply to future work in this field
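
The expected-waste range quoted in the abstract follows from applying the 3-8% loss share to the roughly A$34 billion annual spend. A small check of that arithmetic, with hypothetical variable names:

```python
TOTAL_SPEND_BN = 34.0        # approximate annual spend in A$ billion (from the abstract)
FRAUD_SHARE = (0.03, 0.08)   # 3-8% loss range cited for OECD healthcare organisations

# Apply the lower and upper loss shares to the total spend.
low, high = (round(TOTAL_SPEND_BN * share, 2) for share in FRAUD_SHARE)
print(f"Expected wasted spending: A${low}-{high} billion")
```

This gives roughly A$1.0 to 2.7 billion, in line with the figure stated in the abstract.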

    Space-Efficient String Mining under Frequency Constraints

    Let D1 and D2 be two databases (i.e. multisets) of d strings, over an alphabet S, with overall length n. We study the problem of mining discriminative patterns between D1 and D2, e.g., patterns that are frequent in one database but not in the other, emerging patterns, or patterns satisfying other frequency-related constraints. Using the algorithmic framework by Hui (CPM 1992), one can solve several variants of this problem in optimal linear time with the aid of suffix trees or suffix arrays. This stands in sharp contrast to other pattern domains such as itemsets or subgraphs, where super-linear lower bounds are known. However, the space requirement of existing solutions is O(n log n) bits, which is not optimal for |S| << n (in particular for constant |S|), as the databases themselves occupy only n log |S| bits. Because in many real-life applications space is a more critical resource than time, the aim of this article is to reduce the space, at the cost of an increased running time. In particular, we give a solution for the above problems that uses O(n log |S| + d log n) bits, while the time requirement is increased from the optimal linear time to O(n log n). Our new method is tested extensively on biologically relevant datasets and shown to be usable even on genome-scale data
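
The problem statement above can be illustrated with a deliberately naive sketch. It is not the suffix-tree framework by Hui nor the space-efficient method of the article: it materialises every substring, so it needs far more time and space than either. It assumes that the frequency of a pattern is the number of strings in a database that contain it; the function names and the two thresholds are hypothetical.

```python
def document_frequency(database):
    """Map each substring to the number of strings in the database containing it."""
    freq = {}
    for s in database:
        # A set, so a substring repeated inside one string is counted once.
        seen = {s[i:j] for i in range(len(s)) for j in range(i + 1, len(s) + 1)}
        for sub in seen:
            freq[sub] = freq.get(sub, 0) + 1
    return freq


def discriminative_patterns(d1, d2, min_freq_d1, max_freq_d2):
    """Substrings found in at least min_freq_d1 strings of d1
    and in at most max_freq_d2 strings of d2."""
    f1 = document_frequency(d1)
    f2 = document_frequency(d2)
    return sorted(
        p for p, count in f1.items()
        if count >= min_freq_d1 and f2.get(p, 0) <= max_freq_d2
    )


if __name__ == "__main__":
    d1 = ["abc", "abd"]
    d2 = ["bcd", "cd"]
    print(discriminative_patterns(d1, d2, min_freq_d1=2, max_freq_d2=0))
```

With the toy databases shown, only "a" and "ab" occur in both strings of `d1` and in no string of `d2`.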

    Advances in knowledge discovery and data mining Part II

    19th Pacific-Asia Conference, PAKDD 2015, Ho Chi Minh City, Vietnam, May 19-22, 2015, Proceedings, Part II

    Unsupervised monitoring of an elderly person's activities of daily living using Kinect sensors and a power meter

    The need for greater independence amongst the growing population of elderly people has made the concept of “ageing in place” an important area of research. Remote home monitoring strategies help the elderly deal with challenges involved in ageing in place and performing the activities of daily living (ADLs) independently. These monitoring approaches typically involve the use of several sensors, attached to the environment or person, in order to acquire data about the ADLs of the occupant being monitored. Some key drawbacks associated with many of the ADL monitoring approaches proposed for the elderly living alone need to be addressed. These include the need to label a training dataset of activities, use wearable devices or equip the house with many sensors. These approaches are also unable to concurrently monitor physical ADLs to detect emergency situations, such as falls, and instrumental ADLs to detect deviations from the daily routine. These are all indicative of deteriorating health in the elderly. To address these drawbacks, this research aimed to investigate the feasibility of unsupervised monitoring of both physical and instrumental ADLs of elderly people living alone via inexpensive minimally intrusive sensors. A hybrid framework was presented which combined two approaches for monitoring an elderly occupant’s physical and instrumental ADLs. Both approaches were trained based on unlabelled sensor data from the occupant’s normal behaviours. The data related to physical ADLs were captured from Kinect sensors and those related to instrumental ADLs were obtained using a combination of Kinect sensors and a power meter. Kinect sensors were employed in functional areas of the monitored environment to capture the occupant’s locations and 3D structures of their physical activities. The power meter measured the power consumption of home electrical appliances (HEAs) from the electricity panel. 
A novel unsupervised fuzzy approach was presented to monitor physical ADLs based on depth maps obtained from Kinect sensors. Epochs of activities associated with each monitored location were automatically identified, and the occupant’s behaviour patterns during each epoch were represented through the combinations of fuzzy attributes. A novel membership function generation technique was presented to elicit membership functions for attributes by analysing the data distribution of attributes while excluding noise and outliers in the data. The occupant’s behaviour patterns during each epoch of activity were then classified into frequent and infrequent categories using a data mining technique. Fuzzy rules were learned to model frequent behaviour patterns. An alarm was raised when the occupant’s behaviour in new data was recognised as frequent with a longer than usual duration or infrequent with a duration exceeding a data-driven value. Another novel unsupervised fuzzy approach to monitor instrumental ADLs took unlabelled training data from Kinect sensors and a power meter to model the key features of instrumental ADLs. Instrumental ADLs in the training dataset were identified based on associating the occupant’s locations with specific power signatures on the power line. A set of fuzzy rules was then developed to model the frequency and regularity of the instrumental activities tailored to the occupant. This set was subsequently used to monitor new data and to generate reports on deviations from normal behaviour patterns. As a proof of concept, the proposed monitoring approaches were evaluated using a dataset collected from a real-life setting. An evaluation of the results verified the high accuracy of the proposed technique to identify the epochs of activities over alternative techniques. The approach adopted for monitoring physical ADLs was found to improve elderly monitoring. 
It generated fuzzy rules that could represent the person’s physical ADLs and exclude noise and outliers in the data more efficiently than alternative approaches. The performance of different membership function generation techniques was compared. The fuzzy rule set obtained from the output of the proposed technique could accurately classify more scenarios of normal and abnormal behaviours. The approach for monitoring instrumental ADLs was also found to reliably distinguish power signatures generated automatically by self-regulated devices from those generated as a result of an elderly person’s instrumental ADLs. The evaluations also showed the effectiveness of the approach in correctly identifying elderly people’s interactions with specific HEAs and tracking simulated upward and downward deviations from normal behaviours. The fuzzy inference system in this approach was found to be robust to errors in identifying instrumental ADLs, as it could effectively classify normal and abnormal behaviour patterns despite errors in the list of the used HEAs