
    Periodic subgraph mining in dynamic networks

    This thesis aims to discover frequent periodic interactions among the members of a population whose behaviour is studied over a certain span of time. Interactions between members of the population are represented as edges E between vertices V of a graph. A dynamic network consists of a series of T timesteps, each associated with a graph representing the interactions active at that instant. This thesis presents ListMiner, an algorithm for the periodic subgraph mining problem. Its computational complexity is O((V+E) T² ln(T/σ)), where σ is the minimum number of periodic repetitions required for an extracted subgraph to be reported in the output. This complexity improves by a factor of T on the only algorithm previously known in the literature, PSEMiner. The thesis also includes an analysis of the results obtained and presents a variant of the problem.
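The problem statement can be made concrete with a toy sketch. This is not the ListMiner algorithm itself (which achieves the stated O((V+E) T² ln(T/σ)) bound); it is a naive scan, under the assumption that the dynamic network is given as a list of per-timestep edge sets, reporting every edge that repeats with a fixed period at least σ times:

```python
def periodic_edges(snapshots, period, min_reps):
    """Return edges appearing in at least `min_reps` snapshots spaced
    `period` timesteps apart (a naive illustration of the problem,
    not the ListMiner algorithm from the thesis)."""
    T = len(snapshots)
    result = set()
    all_edges = set().union(*snapshots)
    for e in all_edges:
        for start in range(T):
            reps, t = 0, start
            # Count consecutive periodic occurrences starting at `start`.
            while t < T and e in snapshots[t]:
                reps += 1
                t += period
            if reps >= min_reps:
                result.add(e)
                break
    return result

# Edge (1, 2) is present at timesteps 0, 2 and 4, i.e. period 2, 3 repetitions.
snapshots = [{(1, 2), (3, 4)}, set(), {(1, 2)}, set(), {(1, 2)}]
found = periodic_edges(snapshots, period=2, min_reps=3)   # {(1, 2)}
```

Edge (3, 4) occurs only once, so it is filtered out by the σ (here `min_reps`) threshold.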

    Parallel and Distributed Data Mining


    Periodic pattern mining from spatio-temporal trajectory data

    Rapid development in GPS tracking techniques has produced large amounts of spatio-temporal trajectory data. The analysis of these data provides a new opportunity to discover useful behavioural patterns. Spatio-temporal periodic pattern mining is employed to find temporal regularities at interesting places. Mining periodic patterns from spatio-temporal trajectories can reveal useful, important and valuable information about people's regular and recurrent movements and behaviours. Previous studies have proposed ways to extract people's regular and repeating movement behaviour from spatio-temporal trajectories. These approaches target the following three issues: (1) long individual trajectories; (2) spatial fuzziness; and (3) temporal fuzziness. First, periodic pattern mining differs from other pattern mining tasks, such as association rule mining and sequential pattern mining, in that it requires a very long trajectory from an individual so that a regular period, for example one month or one year, can be extracted from this single long trajectory. Second, spatial fuzziness means that although a moving object may regularly move along a similar route, it will not appear at exactly the same location each time. For instance, Bob goes to work every day, and although he follows a similar path from home to his workplace, the same location is not repeated exactly across different days. Third, temporal fuzziness means that periodicity is complicated, involving partial time spans and multiple interleaving periods. In reality, a period is usually partial: it is highly unlikely to hold throughout the whole movement of the object. Rather, the moving object has only a few periods, such as a daily period for work or a yearly period for holidays. However, considering these three issues alone is insufficient to find effective periodic patterns.
This thesis aims to develop a new framework to extract more effective, understandable and meaningful periodic patterns by taking more features of spatio-temporal trajectories into account. The first feature is the trajectory sequence: GPS trajectory data is a temporally ordered sequence of geolocations that can be represented as consecutive trajectory segments, where each entry in a trajectory segment is closely related to the previous sampled point (trajectory node) and the next one, rather than being isolated. Existing approaches disregard this important sequential nature of trajectories. Furthermore, they introduce both unwanted false positive reference spots and false negative reference spots. The second feature covers the spatial and temporal aspects. GPS trajectory data can be represented as triples (x, y, t), where x and y are longitude and latitude respectively and t is the corresponding time at that location. The spatial and temporal aspects are clearly two key factors, yet existing methods do not consider them together in periodic pattern mining. Irregular time intervals are the third feature of spatio-temporal trajectories. In reality, due to weather conditions, device malfunctions or battery issues, trajectory data are not always regularly sampled. Existing algorithms cannot deal with this issue: they either require a computationally expensive trajectory interpolation process or assume that the trajectory has regular time intervals. The fourth feature is the hierarchy of space. Hierarchy is an inherent property of spatial data that can be expressed at different levels; for example, a country includes many states, and a shopping mall comprises many shops. Exploiting the hierarchy of space can reveal more hidden and valuable periodic patterns, but existing studies do not consider this inherent property of trajectories. Hidden background semantic information is the final feature.
Aspatial semantic information is one of the important features of spatio-temporal data, and it is embedded in the trajectory data. If this background semantic information is considered, more meaningful, understandable and useful periodic patterns can be extracted. However, existing methods do not consider the geographical information underlying trajectories. In addition, for some applications we are interested in finding periodic patterns among trajectory paths rather than trajectory nodes, which means periodic patterns should be identified and detected against trajectory paths. Existing approaches for periodic pattern mining focus on trajectory nodes rather than paths. To sum up, the aim of this thesis is to investigate solutions to these problems in periodic pattern mining in order to extract more meaningful, understandable periodic patterns. Each of the three chapters addresses a different problem and proposes solutions to problems not addressed by existing studies. Finally, the thesis proposes a new framework to address all of these problems. First, we investigated a path-based solution which targets the trajectory sequence and the spatio-temporal aspects. We proposed an algorithm called Traclus (spatio-temporal) which takes the spatial and temporal aspects into account at the same time, instead of considering only the spatial aspect. The results on two real-world trajectory datasets indicated that our method produced more effective periodic patterns based on trajectory paths than existing node-based methods. In order to consider the hierarchy of space, we investigated existing hierarchical clustering approaches to obtain hierarchical reference spots (trajectory paths) for periodic pattern mining.
HDBSCAN is a hierarchical extension of DBSCAN which can handle clusters with different densities: it generates a hierarchical clustering result using the single-linkage method and then automatically extracts clusters from the hierarchical tree. Thus, we replaced the traditional clustering method DBSCAN in Traclus (spatio-temporal) with HDBSCAN for the extraction of hierarchical reference spots. The result is convincing and reveals more periodic patterns than existing methods. Second, we introduced a stop/move method to annotate each spatio-temporal entry with a semantic label, such as restaurant, university or hospital. This method enriches a trajectory with background semantic information so that people's repeating behaviours can easily be inferred. In addition, existing methods use interpolation to make a trajectory regular and then apply the Fourier transform and autocorrelation to automatically detect the period for each reference spot; an increasing number of trajectory nodes leads to an exponential increase in running time. We therefore employed the Lomb-Scargle periodogram to detect the period for each reference spot directly on the raw trajectory, without requiring any interpolation. The results on two real datasets showed that our method outperformed existing approaches in both effectiveness and efficiency. For the hierarchical aspect, we extended the previous work to find hierarchical semantic periodic patterns by applying HDBSCAN, with promising results. Third, we applied our methodology to a case study, which revealed many interesting medical periodic patterns. These patterns can effectively explain human movement behaviours for positive medical outcomes. To sum up, this research proposes a new framework that progressively targets the problems existing methods cannot handle.
These include: how to consider the trajectory sequence, how to consider the spatial and temporal aspects together, how to deal with trajectories with irregular time intervals, how to consider the hierarchy of space, and how to extract the semantic information behind trajectories. After addressing all of these problems, the experimental results demonstrate that our method finds more understandable, meaningful and effective periodic patterns than existing approaches.
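The Lomb-Scargle periodogram mentioned above handles irregularly sampled trajectories directly, without interpolation. A minimal sketch using SciPy's `lombscargle`; the `detect_period` helper and the synthetic test signal are illustrative assumptions, not the thesis implementation:

```python
import numpy as np
from scipy.signal import lombscargle

def detect_period(times, values, candidate_periods):
    """Estimate the dominant period of an irregularly sampled signal
    with the Lomb-Scargle periodogram (no resampling required)."""
    candidate_periods = np.asarray(candidate_periods, dtype=float)
    angular_freqs = 2 * np.pi / candidate_periods
    power = lombscargle(times, values - values.mean(), angular_freqs)
    return float(candidate_periods[int(np.argmax(power))])

# Irregularly sampled sinusoid with a true period of 7 time units.
rng = np.random.default_rng(0)
times = np.sort(rng.uniform(0, 100, 400))
values = np.sin(2 * np.pi * times / 7.0)
best = detect_period(times, values, np.arange(2.0, 15.0, 0.5))
```

The same periodogram applied per reference spot avoids the interpolation step whose cost grows with the number of trajectory nodes.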

    Evaluation of a Simple, Scalable, Parallel Best-First Search Strategy

    Large-scale, parallel clusters composed of commodity processors are increasingly available, enabling the use of vast processing capabilities and distributed RAM to solve hard search problems. We investigate Hash-Distributed A* (HDA*), a simple approach to parallel best-first search that asynchronously distributes and schedules work among processors based on a hash function of the search state. We use this approach to parallelize the A* algorithm in an optimal sequential version of the Fast Downward planner, as well as a 24-puzzle solver. The scaling behavior of HDA* is evaluated experimentally on a shared-memory multicore machine with 8 cores, a cluster of commodity machines using up to 64 cores, and large-scale high-performance clusters using up to 2400 processors. We show that this approach scales well, allowing the effective utilization of large amounts of distributed memory to optimally solve problems which require terabytes of RAM. We also compare HDA* to Transposition-table Driven Scheduling (TDS), a hash-based parallelization of IDA*, and show that, in planning, HDA* significantly outperforms TDS. A simple hybrid which combines HDA* and TDS to exploit the strengths of both algorithms is proposed and evaluated. Comment: in press, to appear in Artificial Intelligence.
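The core idea of HDA*, assigning each search state to a processor via a hash function of the state, can be sketched as follows. The `owner` helper and the choice of MD5 are illustrative assumptions, not the paper's actual code:

```python
import hashlib

def owner(state, num_procs):
    """Deterministically map a search state to the processor that
    will expand it. Any stable hash works; here we hash the state's
    repr with MD5 and reduce modulo the processor count."""
    digest = hashlib.md5(repr(state).encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_procs

# Every processor computes the same owner for a given state, so a newly
# generated state can be sent asynchronously to its owner's work queue
# without global coordination. Duplicate detection is also localized:
# all copies of a state land on the same processor.
states = [tuple(range(i, i + 4)) for i in range(1000)]
counts = [0] * 8
for s in states:
    counts[owner(s, 8)] += 1   # counts approximate an even load split
```

A uniform hash spreads work roughly evenly across processors, which is what makes the asynchronous, coordination-free distribution effective.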

    A Comprehensive Survey on Distributed Training of Graph Neural Networks

    Graph neural networks (GNNs) have been demonstrated to be a powerful algorithmic model in broad application fields for their effectiveness in learning over graphs. To scale GNN training up for large-scale and ever-growing graphs, the most promising solution is distributed training, which spreads the workload of training across multiple computing nodes. At present, the volume of research on distributed GNN training is exceptionally vast, accompanied by an extraordinarily rapid pace of publication, and the approaches reported in these studies diverge significantly. This situation poses a considerable challenge for newcomers, hindering their ability to grasp a comprehensive understanding of the workflows, computational patterns, communication strategies, and optimization techniques employed in distributed GNN training. As a result, there is a pressing need for a survey to provide correct recognition, analysis, and comparisons in this field. In this paper, we provide a comprehensive survey of distributed GNN training by investigating the various optimization techniques it uses. First, distributed GNN training is classified into several categories according to workflow, and the computational patterns, communication patterns, and optimization techniques proposed by recent work are introduced. Second, the software frameworks and hardware platforms of distributed GNN training are introduced for a deeper understanding. Third, distributed GNN training is compared with distributed training of deep neural networks, emphasizing its uniqueness. Finally, interesting issues and opportunities in this field are discussed. Comment: To appear in Proceedings of the IEEE.

    Pattern Discovery from Event Data

    Events are ubiquitous in real life. With the rapid rise in popularity of social media channels, massive amounts of event data, such as information about festivals, concerts, or meetings, are increasingly created and shared by users on the Internet. Deriving insights or knowledge from such social media data provides a semantically rich basis for many applications, for instance social media marketing, service recommendation, sales promotion, or enrichment of existing data sources. In spite of substantial research on discovering valuable knowledge from various types of social media data such as microblog data, check-in data, or GPS trajectories, there has interestingly been little work on mining event data for useful patterns. In this thesis, we focus on the discovery of interesting, useful patterns from datasets of events, where information about these events is shared by and spread across social media platforms. To deal with the existence of heterogeneous event data sources, we propose a comprehensive framework to model events for pattern mining purposes, where each event is described by three components: context, time, and location. This framework allows one to easily define how events are related in terms of conceptual, temporal, and spatial (geographic) relationships. Moreover, we also take into account hierarchies for the contexts, times, and locations of events, which naturally exist as useful background knowledge for deriving patterns at different levels of abstraction and granularity. Based on this framework, we focus on the following problems: (i) mining interval-based event sequence patterns, (ii) mining periodic event patterns, and (iii) extracting semantic annotations for locations of events. Generally, the first two problems consider correlations of events, whereas the last one takes correlations of event components into account.
In particular, the first problem is a generalization of mining sequential patterns from traditional data, where patterns representing complex temporal relationships among events can be discovered at different levels of abstraction and granularity. The second problem is to find periodic event patterns, where a notion of relaxed periodicity is formulated for events as well as for groups of events that co-occur. The third problem is to extract semantic annotations for locations on the basis of exploiting correlations of contexts, time, and locations of events. For the three problems above, we respectively propose novel and efficient approaches. Our experiments clearly indicate that extracted patterns and knowledge can be well utilized in various useful tasks, such as event prediction, semantic search for locations, or topic-based clustering of locations.
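The three-component event model and its hierarchies can be sketched as a small data structure. The `Event` class, the example location hierarchy, and the `generalize_location` helper are hypothetical illustrations under the thesis's context/time/location framing, not its actual implementation:

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass(frozen=True)
class Event:
    """An event in the three-component model: context (what),
    time (when), location (where)."""
    context: str
    time: datetime
    location: str

# Hypothetical spatial hierarchy: child -> parent (venue -> city -> country).
LOCATION_PARENT = {
    "Olympiastadion": "Berlin",
    "Berlin": "Germany",
}

def generalize_location(event: Event) -> Event:
    """Lift an event one level up the spatial hierarchy, so patterns
    can be mined at a coarser granularity."""
    parent = LOCATION_PARENT.get(event.location, event.location)
    return Event(event.context, event.time, parent)

e = Event("concert", datetime(2024, 6, 1, 20, 0), "Olympiastadion")
coarser = generalize_location(e)   # same event, located at "Berlin"
```

Analogous parent maps for contexts and calendar units would give the conceptual and temporal hierarchies that enable mining at different abstraction levels.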

    Preserving the Quality of Architectural Tactics in Source Code

    In any complex software system, strong interdependencies exist between requirements and software architecture. Requirements drive architectural choices while also being constrained by the existing architecture and by what is economically feasible. This makes it advisable to concurrently specify the requirements, to devise and compare alternative architectural design solutions, and ultimately to make a series of design decisions in order to satisfy each of the quality concerns. Unfortunately, anecdotal evidence has shown that architectural knowledge tends to be tacit in nature, stored in the heads of people, and lost over time. Therefore, developers often lack comprehensive knowledge of underlying architectural design decisions and inadvertently degrade the quality of the architecture while performing maintenance activities. In practice, this problem can be addressed through preserving the relationships between the requirements, architectural design decisions and their implementations in the source code, and then using this information to keep developers aware of critical architectural aspects of the code. This dissertation presents a novel approach that utilizes machine learning techniques to recover and preserve the relationships between architecturally significant requirements, architectural decisions and their realizations in the implemented code. Our approach for recovering architectural decisions includes the two primary stages of training and classification. In the first stage, the classifier is trained using code snippets of different architectural decisions collected from various software systems. During this phase, the classifier learns the terms that developers typically use to implement each architectural decision. These "indicator terms" represent method names, variable names, comments, or the development APIs that developers inevitably use to implement various architectural decisions.
A probabilistic weight is then computed for each potential indicator term with respect to each type of architectural decision. The weight estimates how strongly an indicator term represents a specific architectural tactic or decision. For example, a term such as "pulse" is highly representative of the heartbeat tactic but occurs infrequently in the authentication tactic. After learning the indicator terms, the classifier can compute the likelihood that any given source file implements a specific architectural decision. The classifier was evaluated through several different experiments, including classical cross-validation over code snippets of 50 open source projects and on the entire source code of a large-scale software system. Results showed that the classifier can reliably recognize a wide range of architectural decisions. The technique introduced in this dissertation is used to develop the Archie tool suite. Archie is a plug-in for Eclipse designed to detect a wide range of architectural design decisions in the code and to protect them from potential degradation during maintenance activities. It has several features for performing change impact analysis of architectural concerns at both the code and design level, and it proactively keeps developers informed of underlying architectural decisions during maintenance activities. Archie is at the stage of technology transfer at the US Department of Homeland Security, where it is used solely to detect and monitor security choices. Furthermore, this outcome is integrated into the Department of Homeland Security's Software Assurance Market Place (SWAMP) to advance research and development of secure software systems.
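The indicator-term weighting described above resembles a smoothed per-class term model. The sketch below is a minimal illustration of that idea using Laplace-smoothed log-probabilities; the `train`/`classify` helpers and the toy snippets are assumptions for illustration, not the Archie classifier:

```python
import math
from collections import Counter

def train(snippets_by_tactic):
    """Learn per-tactic indicator-term weights with Laplace smoothing.
    snippets_by_tactic: {tactic: [token lists from code snippets]}"""
    vocab = {t for snips in snippets_by_tactic.values()
               for snip in snips for t in snip}
    model = {}
    for tactic, snips in snippets_by_tactic.items():
        counts = Counter(t for snip in snips for t in snip)
        total = sum(counts.values())
        # Higher weight = the term is more indicative of this tactic.
        model[tactic] = {t: math.log((counts[t] + 1) / (total + len(vocab)))
                         for t in vocab}
    return model

def classify(model, tokens):
    """Score a source file's tokens against each tactic and return
    the tactic whose indicator terms best explain the file."""
    def score(tactic):
        weights = model[tactic]
        default = min(weights.values())   # fallback for unseen tokens
        return sum(weights.get(t, default) for t in tokens)
    return max(model, key=score)

model = train({
    "heartbeat":      [["pulse", "beat", "timer"], ["pulse", "alive"]],
    "authentication": [["login", "password"], ["token", "login"]],
})
```

Here "pulse" gets a high weight under the heartbeat tactic and a low one under authentication, mirroring the example in the abstract.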

    AcDWH - A patented method for active data warehousing

    The traditional needs of data warehousing have evolved from monthly, weekly or nightly batch processing to near real-time refresh cycles of the data, called active data warehousing. While traditional data warehousing methods have been used to batch-load large sets of data, the business need for extremely fresh data in the data warehouse has grown. Previous studies have reviewed different aspects of the process along with different methods to process data in data warehouses in a near real-time fashion. To date, there has been little research on using partitioned staging tables within relational databases, combined with a crafted metadata-driven system and parallelized loading processes, for active data warehousing. This study provides a thorough description and suitability assessment of the patented AcDWH method for active data warehousing. In addition, it reviews and summarizes existing research on data warehousing from the field's beginnings in the 1990s to the year 2020. The review focuses on different parts of the data warehousing process and highlights the differences compared to the AcDWH method. For AcDWH itself, the use of partitioned staging tables within a relational database, in combination with the metadata structures used to manage the system, is discussed in detail. In addition, two real-life applications are disclosed and discussed at a high level. Potential future extensions to the methodology are briefly summarized. The results indicate that the AcDWH method, with its parallelized loading pipelines and partitioned staging tables, can provide enhanced throughput in data warehouse loading processes. This is a clear improvement in the field: previous studies have not considered using partitioned staging tables in conjunction with loading processes and pipeline parallelization.
A review of the existing literature against the AcDWH method, together with a trial-and-error approach, supports the validity of this study's results and conclusions. The results confirm that technical-level inventions within data warehousing processes also contribute significantly to the advance of methodologies. Compared to previous studies in the field, this study suggests a simple yet novel method for achieving near real-time capabilities in active data warehousing.
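The combination of partitioned staging tables and parallelized loading pipelines can be illustrated with a toy sketch. Plain Python lists stand in for the relational staging tables, and all names are hypothetical; this is not the patented AcDWH implementation:

```python
from concurrent.futures import ThreadPoolExecutor

NUM_PARTITIONS = 4

def route(rows):
    """Distribute incoming rows across partitioned staging tables by
    hashing the business key (lists stand in for the real tables)."""
    staging = [[] for _ in range(NUM_PARTITIONS)]
    for row in rows:
        staging[hash(row["key"]) % NUM_PARTITIONS].append(row)
    return staging

def load_partition(partition, warehouse):
    """One loading pipeline: transform and append a single staging
    partition. Partitions are independent, so pipelines can run in
    parallel without contending for the same staging table."""
    warehouse.extend([{**row, "loaded": True} for row in partition])

rows = [{"key": f"k{i}", "value": i} for i in range(100)]
staging = route(rows)
warehouse = []
with ThreadPoolExecutor(max_workers=NUM_PARTITIONS) as pool:
    for part in staging:
        pool.submit(load_partition, part, warehouse)
# All 100 rows reach the warehouse, loaded by independent pipelines.
```

In a real system the routing would be done by the database's partitioning scheme and the pipelines would be driven by the metadata layer; the sketch only shows why independent partitions remove contention between parallel loads.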