
    Implementation of an interactive pattern mining framework on electronic health record datasets

    Large collections of electronic patient records contain a broad range of clinical information highly relevant for data analysis. However, they are maintained primarily for patient administration, and automated methods are required to extract knowledge valuable for predictive, preventive, personalized and participatory medicine. Sequential pattern mining is a fundamental data mining task that can be used to find statistically relevant, non-trivial temporal dependencies among events, such as disease comorbidities. This work's objective is to use this mining technique to identify disease associations based on ICD-9-CM code data for the entire Taiwanese population, obtained from Taiwan's National Health Insurance Research Database. This thesis reports the design and implementation of the Disease Pattern Miner, a pattern mining framework for the medical domain. The framework is a Web application that can run several state-of-the-art sequence mining algorithms on electronic health records, collect and filter the results to reduce the number of patterns to a meaningful size, and visualize the disease associations in a specific population group as an interactive model. This may be crucial for discovering new disease associations and may offer novel insights into disease pathogenesis. A structured evaluation of the data and models is required before medical data scientists can use this application as a research tool to better understand disease comorbidities.
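    To make the technique concrete, here is a minimal sketch of sequential pattern mining in the comorbidity setting: counting ordered disease-code pairs supported by enough patients. It is an illustration only, not the Disease Pattern Miner itself; the example codes, the `frequent_pairs` helper, and the `min_support` threshold are all assumptions.

```python
# Minimal sketch (not the authors' framework): mine frequent ordered
# disease-code pairs from time-ordered patient histories, the simplest
# form of sequential pattern mining.
from collections import Counter
from itertools import combinations

def frequent_pairs(patients, min_support):
    """patients: list of time-ordered ICD-9 code sequences, one per patient.
    Returns ordered pairs (a before b) supported by >= min_support patients."""
    support = Counter()
    for seq in patients:
        seen = set()
        # combinations() preserves sequence order, so (a, b) means a occurred before b
        for a, b in combinations(seq, 2):
            if a != b and (a, b) not in seen:
                seen.add((a, b))        # count each pair once per patient
                support[(a, b)] += 1
    return {pair: n for pair, n in support.items() if n >= min_support}

# Hypothetical histories: '250' diabetes, '401' hypertension, '414' ischemic heart disease
patients = [["250", "401", "414"], ["250", "414"], ["401", "414"]]
print(frequent_pairs(patients, min_support=2))  # {('250', '414'): 2, ('401', '414'): 2}
```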

    Enhancing RFID data quality and reliability using approximate filtering techniques

    Radio Frequency Identification (RFID) is an emerging auto-identification technology that uses radio waves to identify and track physical objects without line of sight. While it delivers significant improvements in areas such as stock management and inventory accuracy, serious data management issues affect RFID data quality and hinder reliable solutions. The raw read rate in real-world RFID deployments is often in the 60-70% range and is inherently unreliable because of redundant and false readings. Redundant readings cause unnecessary storage and reduce the efficiency of data processing, while false positive readings generated by cloned tags can be mistakenly accepted as valid and distort final results and decisions. Therefore, two approaches to enhance RFID data quality and reliability are proposed. A redundant-reading filtering approach based on a modified Bloom filter is presented, as existing Bloom-filter-based approaches are quite intricate. Meanwhile, although tag cloning has been identified as one of the most serious RFID security issues, it has received little attention in the literature. We therefore developed a lightweight anti-cloning approach based on a modified Count-Min sketch vector and tag reading frequency from the e-pedigree, observing identical Electronic Product Codes (EPCs) of low-cost tags at local sites and across distributed regions in the supply chain. Experimental results showed that the first proposed approach, Duplicate Filtering Hash (DFH), achieved the lowest false positive rate (0.06%) and the highest true positive rate (89.94%) compared with other baseline approaches; DFH is 71.1% faster than the d-Left Time Bloom Filter (DLTBF), reduces the amount of hashing, and achieved a 100% true negative rate. The second proposed approach, Managing Counterfeit Hash (MCH), performs the fastest, 25.7% faster than the baseline protocol (BASE), and achieved 99% detection accuracy, compared with 64% for DeClone and 77% for BASE. This study thus proposes approaches that can enhance RFID data quality and reliability.
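    A rough idea of how a Bloom filter screens redundant RFID readings can be sketched as follows. This is not DFH itself; the filter size, the double-hashing scheme, and the EPC read format are illustrative assumptions.

```python
# Minimal sketch of Bloom-filter-based duplicate filtering for RFID reads,
# in the spirit of (but much simpler than) the DFH approach described above.
import hashlib

class BloomFilter:
    def __init__(self, size_bits=8192, num_hashes=3):
        self.size = size_bits
        self.k = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, item):
        # Derive k bit positions from one SHA-256 digest (double-hashing style)
        digest = hashlib.sha256(item.encode()).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big")
        return [(h1 + i * h2) % self.size for i in range(self.k)]

    def seen_before(self, item):
        """Return True if item was (probably) added already, then add it."""
        pos = self._positions(item)
        hit = all((self.bits[p // 8] >> (p % 8)) & 1 for p in pos)
        for p in pos:
            self.bits[p // 8] |= 1 << (p % 8)
        return hit

bf = BloomFilter()
reads = ["EPC:0001", "EPC:0002", "EPC:0001"]      # third read is redundant
fresh = [r for r in reads if not bf.seen_before(r)]
print(fresh)  # ['EPC:0001', 'EPC:0002']
```

    A Bloom filter never misses a true duplicate (no false negatives), at the cost of a small, tunable false positive rate, which is why the true negative rate and filter speed are the key metrics reported above.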

    Fast implementation of pattern mining algorithms with time stamp uncertainties and temporal constraints

    Pattern mining is a powerful tool for analysing big datasets. Temporal datasets include time as an additional parameter, which complicates algorithmic formulation and makes it challenging to process such data quickly and efficiently. In addition, the timestamps themselves may contain errors or uncertainty, for example in manually recorded health data. Sometimes we wish to find patterns only within a certain temporal range, and in some cases real-time processing and decision-making may be desirable. All these issues increase algorithmic complexity, processing times and storage requirements. Furthermore, it may not be possible to store or process confidential data on public clusters or the cloud, which can be accessed by many people, so it is desirable to optimise algorithms for standalone systems. In this paper we present an integrated approach for writing efficient code for pattern mining problems. The approach includes: (1) cleaning datasets by removing infrequent events, (2) a new scheme for time-series data storage, (3) exploiting prior information about a dataset when available, and (4) vectorisation and multicore parallelisation. We present two new algorithms, FARPAM (FAst Robust PAttern Mining) and FARPAMp (FARPAM with prior information about timestamp uncertainty, allowing faster searching), which are applicable to a wide range of temporal datasets. They implement a new formulation of the pattern searching function which reproduces and extends existing algorithms (such as SPAM and RobustSPAM) and allows significantly faster calculation. The algorithms also support temporal restrictions on patterns, available in neither SPAM nor RobustSPAM, and the searching function is designed to be flexible for further extensions. The algorithms are coded in C++ and are highly optimised and parallelised for a modern standalone multicore workstation, thus avoiding the security issues connected with transferring confidential data onto clusters. FARPAM has been successfully tested on a publicly available weather dataset and on a confidential adult social care dataset, reproducing results obtained by previous algorithms in both cases. Profiled against the widely used SPAM algorithm (for sequential pattern mining) and RobustSPAM (developed for datasets with errors in time points), FARPAM outperforms SPAM by up to 20 times and RobustSPAM by up to 6,000 times, with better scalability in both cases.
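    As a rough illustration of the kind of temporally constrained search described here (not FARPAM's actual C++ implementation), the following sketch checks whether one event precedes another within a gap window while allowing for timestamp uncertainty; the function name and parameters are assumptions.

```python
# Minimal sketch of a temporally constrained match with uncertain timestamps:
# does event a precede event b within [min_gap, max_gap], allowing each
# timestamp an uncertainty of +/- eps?
def matches(events, a, b, min_gap, max_gap, eps=0.0):
    """events: list of (timestamp, label) pairs, not necessarily sorted."""
    ts = sorted(t for t, lbl in events if lbl == a)
    us = sorted(t for t, lbl in events if lbl == b)
    for t in ts:
        for u in us:
            gap = u - t
            # widen the allowed gap by the combined timestamp uncertainty
            if min_gap - 2 * eps <= gap <= max_gap + 2 * eps:
                return True
    return False

events = [(0.0, "rain"), (2.1, "flood"), (9.0, "rain")]
print(matches(events, "rain", "flood", min_gap=1.0, max_gap=2.0, eps=0.1))  # True
```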

    Why (and How) Networks Should Run Themselves

    The proliferation of networked devices, systems, and applications that we depend on every day makes managing networks more important than ever. The increasing security, availability, and performance demands of these applications suggest that these increasingly difficult network management problems must be solved in real time, across a complex web of interacting protocols and systems. Alas, just as the importance of network management has increased, the network has grown so complex that it is seemingly unmanageable. In this new era, network management requires a fundamentally new approach. Instead of optimizations based on closed-form analysis of individual protocols, network operators need data-driven, machine-learning-based models of end-to-end and application performance, grounded in high-level policy goals and a holistic view of the underlying components. Instead of anomaly detection algorithms that operate on offline analysis of network traces, operators need classification and detection algorithms that can make real-time, closed-loop decisions. Networks should learn to drive themselves. This paper explores this concept, discussing how we might attain this ambitious goal by coupling measurement more closely with real-time control and by relying on learning for inference and prediction about a networked application or system, as opposed to closed-form analysis of individual protocols.
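    As a purely illustrative sketch of the closed measure-infer-act loop the paper advocates (no API from the paper is implied; the telemetry source, model, and actions below are placeholders):

```python
# Minimal sketch of a closed-loop controller: measure, let a learned model
# infer network state, then act. All three stages are stand-ins.
import random
import time

def measure():
    # Stand-in for real telemetry (e.g., a per-flow loss-rate measurement)
    return {"loss_rate": random.uniform(0.0, 0.05)}

def infer(sample, threshold=0.02):
    # Stand-in for a trained classifier; here a fixed threshold
    return "degraded" if sample["loss_rate"] > threshold else "healthy"

def act(state):
    # Stand-in for a control action (e.g., rerouting traffic)
    if state == "degraded":
        print("rerouting traffic around degraded path")

for _ in range(3):                 # closed loop: measure -> infer -> act
    act(infer(measure()))
    time.sleep(0.1)
```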