1,302 research outputs found

    A survey of temporal knowledge discovery paradigms and methods

    Get PDF
    With the increase in the size of data sets, data mining has recently become an important research topic and is receiving substantial interest from both academia and industry. At the same time, interest in temporal databases has been increasing and a growing number of both prototype and implemented systems are using an enhanced temporal understanding to explain aspects of behavior associated with the implicit time-varying nature of the universe. This paper investigates the confluence of these two areas, surveys the work to date, and explores the issues involved and the outstanding problems in temporal data mining

    Online Spectral Clustering on Network Streams

    Get PDF
    Graph is an extremely useful representation of a wide variety of practical systems in data analysis. Recently, with the fast accumulation of stream data from various type of networks, significant research interests have arisen on spectral clustering for network streams (or evolving networks). Compared with the general spectral clustering problem, the data analysis of this new type of problems may have additional requirements, such as short processing time, scalability in distributed computing environments, and temporal variation tracking. However, to design a spectral clustering method to satisfy these requirements certainly presents non-trivial efforts. There are three major challenges for the new algorithm design. The first challenge is online clustering computation. Most of the existing spectral methods on evolving networks are off-line methods, using standard eigensystem solvers such as the Lanczos method. It needs to recompute solutions from scratch at each time point. The second challenge is the parallelization of algorithms. To parallelize such algorithms is non-trivial since standard eigen solvers are iterative algorithms and the number of iterations can not be predetermined. The third challenge is the very limited existing work. In addition, there exists multiple limitations in the existing method, such as computational inefficiency on large similarity changes, the lack of sound theoretical basis, and the lack of effective way to handle accumulated approximate errors and large data variations over time. In this thesis, we proposed a new online spectral graph clustering approach with a family of three novel spectrum approximation algorithms. Our algorithms incrementally update the eigenpairs in an online manner to improve the computational performance. Our approaches outperformed the existing method in computational efficiency and scalability while retaining competitive or even better clustering accuracy. We derived our spectrum approximation techniques GEPT and EEPT through formal theoretical analysis. The well established matrix perturbation theory forms a solid theoretic foundation for our online clustering method. We facilitated our clustering method with a new metric to track accumulated approximation errors and measure the short-term temporal variation. The metric not only provides a balance between computational efficiency and clustering accuracy, but also offers a useful tool to adapt the online algorithm to the condition of unexpected drastic noise. In addition, we discussed our preliminary work on approximate graph mining with evolutionary process, non-stationary Bayesian Network structure learning from non-stationary time series data, and Bayesian Network structure learning with text priors imposed by non-parametric hierarchical topic modeling

    AI Solutions for MDS: Artificial Intelligence Techniques for Misuse Detection and Localisation in Telecommunication Environments

    Get PDF
    This report considers the application of Articial Intelligence (AI) techniques to the problem of misuse detection and misuse localisation within telecommunications environments. A broad survey of techniques is provided, that covers inter alia rule based systems, model-based systems, case based reasoning, pattern matching, clustering and feature extraction, articial neural networks, genetic algorithms, arti cial immune systems, agent based systems, data mining and a variety of hybrid approaches. The report then considers the central issue of event correlation, that is at the heart of many misuse detection and localisation systems. The notion of being able to infer misuse by the correlation of individual temporally distributed events within a multiple data stream environment is explored, and a range of techniques, covering model based approaches, `programmed' AI and machine learning paradigms. It is found that, in general, correlation is best achieved via rule based approaches, but that these suffer from a number of drawbacks, such as the difculty of developing and maintaining an appropriate knowledge base, and the lack of ability to generalise from known misuses to new unseen misuses. Two distinct approaches are evident. One attempts to encode knowledge of known misuses, typically within rules, and use this to screen events. This approach cannot generally detect misuses for which it has not been programmed, i.e. it is prone to issuing false negatives. The other attempts to `learn' the features of event patterns that constitute normal behaviour, and, by observing patterns that do not match expected behaviour, detect when a misuse has occurred. This approach is prone to issuing false positives, i.e. inferring misuse from innocent patterns of behaviour that the system was not trained to recognise. Contemporary approaches are seen to favour hybridisation, often combining detection or localisation mechanisms for both abnormal and normal behaviour, the former to capture known cases of misuse, the latter to capture unknown cases. In some systems, these mechanisms even work together to update each other to increase detection rates and lower false positive rates. It is concluded that hybridisation offers the most promising future direction, but that a rule or state based component is likely to remain, being the most natural approach to the correlation of complex events. The challenge, then, is to mitigate the weaknesses of canonical programmed systems such that learning, generalisation and adaptation are more readily facilitated

    Periodic Pattern Mining a Algorithms and Applications

    Get PDF
    Owing to a large number of applications periodic pattern mining has been extensively studied for over a decade Periodic pattern is a pattern that repeats itself with a specific period in a give sequence Periodic patterns can be mined from datasets like biological sequences continuous and discrete time series data spatiotemporal data and social networks Periodic patterns are classified based on different criteria Periodic patterns are categorized as frequent periodic patterns and statistically significant patterns based on the frequency of occurrence Frequent periodic patterns are in turn classified as perfect and imperfect periodic patterns full and partial periodic patterns synchronous and asynchronous periodic patterns dense periodic patterns approximate periodic patterns This paper presents a survey of the state of art research on periodic pattern mining algorithms and their application areas A discussion of merits and demerits of these algorithms was given The paper also presents a brief overview of algorithms that can be applied for specific types of datasets like spatiotemporal data and social network

    Efficient algorithms for analyzing large scale network dynamics: Centrality, community and predictability

    Get PDF
    Large scale networks are an indispensable part of our daily life; be it biological network, smart grids, academic collaboration networks, social networks, vehicular networks, or the networks as part of various smart environments, they are fast becoming ubiquitous. The successful realization of applications and services over them depend on efficient solution to their computational challenges that are compounded with network dynamics. The core challenges underlying large scale networks, for example: determining central (influential) nodes (and edges), interactions and contacts among nodes, are the basis behind the success of applications and services. Though at first glance these challenges seem to be trivial, the network characteristics affect their effective and efficient evaluation strategy. We thus propose to leverage large scale network structural characteristics and temporal dynamics in addressing these core conceptual challenges in this dissertation. We propose a divide and conquer based computationally efficient algorithm that leverages the underlying network community structure for deterministic computation of betweenness centrality indices for all nodes. As an integral part of it, we also propose a computationally efficient agglomerative hierarchical community detection algorithm. Next, we propose a network structure evolution based novel probabilistic link prediction algorithm that predicts set of links occurring over subsequent time periods with higher accuracy. To best capture the evolution process and have higher prediction accuracy we propose multiple time scales with the Markov prediction model. Finally, we propose to capture the multi-periodicity of human mobility pattern with sinusoidal intensity function of a cascaded nonhomogeneous Poisson process, to predict the future contacts over mobile networks. We use real data set and benchmarked approaches to validate the better performance of our proposed approaches --Abstract, page iii

    Fault diagnosis for IP-based network with real-time conditions

    Get PDF
    BACKGROUND: Fault diagnosis techniques have been based on many paradigms, which derive from diverse areas and have different purposes: obtaining a representation model of the network for fault localization, selecting optimal probe sets for monitoring network devices, reducing fault detection time, and detecting faulty components in the network. Although there are several solutions for diagnosing network faults, there are still challenges to be faced: a fault diagnosis solution needs to always be available and able enough to process data timely, because stale results inhibit the quality and speed of informed decision-making. Also, there is no non-invasive technique to continuously diagnose the network symptoms without leaving the system vulnerable to any failures, nor a resilient technique to the network's dynamic changes, which can cause new failures with different symptoms. AIMS: This thesis aims to propose a model for the continuous and timely diagnosis of IP-based networks faults, independent of the network structure, and based on data analytics techniques. METHOD(S): This research's point of departure was the hypothesis of a fault propagation phenomenon that allows the observation of failure symptoms at a higher network level than the fault origin. Thus, for the model's construction, monitoring data was collected from an extensive campus network in which impact link failures were induced at different instants of time and with different duration. These data correspond to widely used parameters in the actual management of a network. The collected data allowed us to understand the faults' behavior and how they are manifested at a peripheral level. Based on this understanding and a data analytics process, the first three modules of our model, named PALADIN, were proposed (Identify, Collection and Structuring), which define the data collection peripherally and the necessary data pre-processing to obtain the description of the network's state at a given moment. These modules give the model the ability to structure the data considering the delays of the multiple responses that the network delivers to a single monitoring probe and the multiple network interfaces that a peripheral device may have. Thus, a structured data stream is obtained, and it is ready to be analyzed. For this analysis, it was necessary to implement an incremental learning framework that respects networks' dynamic nature. It comprises three elements, an incremental learning algorithm, a data rebalancing strategy, and a concept drift detector. This framework is the fourth module of the PALADIN model named Diagnosis. In order to evaluate the PALADIN model, the Diagnosis module was implemented with 25 different incremental algorithms, ADWIN as concept-drift detector and SMOTE (adapted to streaming scenario) as the rebalancing strategy. On the other hand, a dataset was built through the first modules of the PALADIN model (SOFI dataset), which means that these data are the incoming data stream of the Diagnosis module used to evaluate its performance. The PALADIN Diagnosis module performs an online classification of network failures, so it is a learning model that must be evaluated in a stream context. Prequential evaluation is the most used method to perform this task, so we adopt this process to evaluate the model's performance over time through several stream evaluation metrics. RESULTS: This research first evidences the phenomenon of impact fault propagation, making it possible to detect fault symptoms at a monitored network's peripheral level. It translates into non-invasive monitoring of the network. Second, the PALADIN model is the major contribution in the fault detection context because it covers two aspects. An online learning model to continuously process the network symptoms and detect internal failures. Moreover, the concept-drift detection and rebalance data stream components which make resilience to dynamic network changes possible. Third, it is well known that the amount of available real-world datasets for imbalanced stream classification context is still too small. That number is further reduced for the networking context. The SOFI dataset obtained with the first modules of the PALADIN model contributes to that number and encourages works related to unbalanced data streams and those related to network fault diagnosis. CONCLUSIONS: The proposed model contains the necessary elements for the continuous and timely diagnosis of IPbased network faults; it introduces the idea of periodical monitorization of peripheral network elements and uses data analytics techniques to process it. Based on the analysis, processing, and classification of peripherally collected data, it can be concluded that PALADIN achieves the objective. The results indicate that the peripheral monitorization allows diagnosing faults in the internal network; besides, the diagnosis process needs an incremental learning process, conceptdrift detection elements, and rebalancing strategy. The results of the experiments showed that PALADIN makes it possible to learn from the network manifestations and diagnose internal network failures. The latter was verified with 25 different incremental algorithms, ADWIN as concept-drift detector and SMOTE (adapted to streaming scenario) as the rebalancing strategy. This research clearly illustrates that it is unnecessary to monitor all the internal network elements to detect a network's failures; instead, it is enough to choose the peripheral elements to be monitored. Furthermore, with proper processing of the collected status and traffic descriptors, it is possible to learn from the arriving data using incremental learning in cooperation with data rebalancing and concept drift approaches. This proposal continuously diagnoses the network symptoms without leaving the system vulnerable to failures while being resilient to the network's dynamic changes.Programa de Doctorado en Ciencia y TecnologĂ­a InformĂĄtica por la Universidad Carlos III de MadridPresidente: JosĂŠ Manuel Molina LĂłpez.- Secretario: Juan Carlos DueĂąas LĂłpez.- Vocal: Juan Manuel Corchado RodrĂ­gue
    • …
    corecore