5,757 research outputs found

    Significance of Data Structures and Data Retrieval Techniques on Sequence Rule Mining Efficacy

    Sequence mining aims to discover rules from diverse datasets by implementing rule mining algorithms with efficient data structures and data retrieval techniques. Traditional algorithms struggle with variable support measures, which may force repeated reconstruction of the underlying data structures as thresholds change. To address these issues, the premier sequence mining algorithm, AprioriAll, is implemented against an educational and a financial dataset, using the HASH and the TRIE data structures with scan reduction techniques. The primary idea is to study the impact of data structures and retrieval techniques on the rule mining process when handling diverse datasets. The performance evaluation metrics support, confidence, and lift are considered for testing the efficacy of the algorithm in terms of memory requirements and execution time complexity. Results show that hashing excels in tree construction time and memory overhead for fixed sets of pre-defined support thresholds. The TRIE, in contrast, may avoid reconstruction and can handle dynamic support thresholds, leading to shorter rule discovery time but higher memory consumption. This study highlights the effectiveness of the HASH and TRIE data structures given the dataset characteristics during rule mining. It underscores the importance of choosing appropriate data structures based on dataset features, scanning techniques, and user-defined parameters.
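
    The three metrics named in this abstract have standard definitions, and a minimal sketch of them for sequence rules is given here. It is not the paper's implementation: the toy database `db`, the helper names `contains`, `support`, and `rule_metrics`, and the restriction to sequences of single items are all assumptions made for illustration.

```python
def contains(sequence, pattern):
    """True if `pattern` occurs in `sequence` in order (gaps allowed)."""
    it = iter(sequence)
    # `item in it` advances the iterator, so order is enforced.
    return all(item in it for item in pattern)


def support(db, pattern):
    """Fraction of sequences in `db` that contain `pattern`."""
    return sum(contains(seq, pattern) for seq in db) / len(db)


def rule_metrics(db, antecedent, consequent):
    """Return (support, confidence, lift) for the rule antecedent -> consequent."""
    sup_rule = support(db, antecedent + consequent)
    sup_a = support(db, antecedent)
    sup_c = support(db, consequent)
    confidence = sup_rule / sup_a if sup_a else 0.0
    lift = confidence / sup_c if sup_c else 0.0
    return sup_rule, confidence, lift


# Hypothetical toy database: each tuple is one ordered sequence of events.
db = [("a", "b", "c"), ("a", "c"), ("b", "a", "c"), ("a", "b")]

print(rule_metrics(db, ("a",), ("c",)))
```

    A lift of 1.0 means the consequent is no more likely after the antecedent than it is overall; values above 1.0 suggest a positive association.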

    Learning from non-stationary data using a growing network of prototypes

    Proceedings of: 2013 IEEE Congress on Evolutionary Computation (CEC), Cancun, 20-23 June 2013. Learning from non-stationary data requires methods that are able to deal with a continuous stream of data instances, possibly of infinite size, where the class distributions are potentially drifting over time. For handling such datasets, we propose a new method that incrementally creates and adapts a network of prototypes for classifying complex data received in an online fashion. The algorithm includes both accuracy-based and time-based forgetting mechanisms that ensure that the model size does not grow indefinitely with large datasets. We have performed tests on seven benchmark datasets to compare our proposal with several approaches found in the literature, including ensemble algorithms associated with two different base classifiers. The results show that our algorithm is comparable to the best of the ensemble classifiers in terms of accuracy/time trade-off. Moreover, our approach appears to have significant advantages for dealing with data that has a complex, non-linearly separable topology. This article has been funded by the Spanish Ministry of Science and Innovation under the project MOVES with grant reference TIN2011-28336, and NSERC-Canada.

    Social Fingerprinting: detection of spambot groups through DNA-inspired behavioral modeling

    Spambot detection in online social networks is a long-lasting challenge involving the study and design of detection techniques capable of efficiently identifying ever-evolving spammers. Recently, a new wave of social spambots has emerged, with advanced human-like characteristics that allow them to go undetected even by current state-of-the-art algorithms. In this paper, we show that efficient spambot detection can be achieved via an in-depth analysis of their collective behaviors, exploiting the digital DNA technique for modeling the behaviors of social network users. Inspired by its biological counterpart, the digital DNA representation encodes the behavioral lifetime of a digital account as a sequence of characters. Then, we define a similarity measure for such digital DNA sequences. We build upon digital DNA and the similarity between groups of users to characterize both genuine accounts and spambots. Leveraging this characterization, we design the Social Fingerprinting technique, which is able to discriminate between spambots and genuine accounts in both a supervised and an unsupervised fashion. We finally evaluate the effectiveness of Social Fingerprinting and compare it with three state-of-the-art detection algorithms. Among the peculiarities of our approach are the possibility of applying off-the-shelf DNA analysis techniques to study online users' behaviors and the ability to rely efficiently on a limited number of lightweight account characteristics.
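
    The encoding idea in this abstract can be sketched as follows. The abstract does not state the alphabet or the similarity measure, so both are assumptions here: the action-to-character mapping `ALPHABET` is hypothetical, and the longest common substring is used only as one plausible similarity between two behavioral strings.

```python
# Hypothetical alphabet: one character per type of account action.
ALPHABET = {"tweet": "A", "reply": "C", "retweet": "T"}


def encode(actions):
    """Encode a chronological list of actions as a digital DNA string."""
    return "".join(ALPHABET[action] for action in actions)


def longest_common_substring(a, b):
    """Longest contiguous run of characters shared by `a` and `b` (DP)."""
    best_len, best_end = 0, 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                # Extend the common run that ended at a[i-2], b[j-2].
                curr[j] = prev[j - 1] + 1
                if curr[j] > best_len:
                    best_len, best_end = curr[j], i
        prev = curr
    return a[best_end - best_len:best_end]


def similarity(a, b):
    """Shared-run length, normalized by the shorter sequence (assumed measure)."""
    shortest = min(len(a), len(b))
    if shortest == 0:
        return 0.0
    return len(longest_common_substring(a, b)) / shortest


account_1 = encode(["tweet", "tweet", "retweet", "reply", "retweet"])
account_2 = encode(["retweet", "tweet", "retweet", "reply", "tweet"])
print(account_1, account_2, similarity(account_1, account_2))
```

    Under this assumed measure, accounts that share long identical stretches of behavior score close to 1.0, which is the kind of group-level regularity the abstract attributes to spambots.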

    A Transaction Prediction System for Private Bank Customers Using Fuzzy Time Interval Sequential Pattern Mining

    Banking services today are designed as one way to satisfy customers. Operational service is especially important because it takes place directly with the customer. Customer needs can arise at any time, so a bank must be prepared with cash on hand. Transactions at a bank cannot be predicted by simple observation because economic conditions are unstable, so the bank must monitor the amount of cash available. A prediction system is therefore needed that predicts customer transactions in order to determine at which moment a transaction will occur, which transaction will be carried out, and whether the second transaction will follow after a short, medium, or long interval. The system uses the fuzzy time interval sequential pattern method, which predicts customer transactions in combination with moments.
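
    The short, medium, and long intervals mentioned in this abstract are typically modeled with fuzzy membership functions. The sketch below illustrates that idea only: the breakpoints (3, 10, and 30 days), the function names, and the piecewise-linear shapes are assumptions, not values taken from the paper.

```python
def falling(x, full, zero):
    """Membership 1.0 up to `full`, decreasing linearly to 0.0 at `zero`."""
    if x <= full:
        return 1.0
    if x >= zero:
        return 0.0
    return (zero - x) / (zero - full)


def rising(x, zero, full):
    """Membership 0.0 up to `zero`, increasing linearly to 1.0 at `full`."""
    return 1.0 - falling(x, zero, full)


def triangular(x, left, peak, right):
    """Membership 0.0 outside (left, right), peaking at 1.0 at `peak`."""
    if x <= left or x >= right:
        return 0.0
    if x <= peak:
        return (x - left) / (peak - left)
    return (right - x) / (right - peak)


def memberships(days):
    """Degree to which a gap of `days` between transactions is short/medium/long."""
    # Breakpoints of 3, 10, and 30 days are hypothetical.
    return {
        "short": falling(days, 3, 10),
        "medium": triangular(days, 3, 10, 30),
        "long": rising(days, 10, 30),
    }


print(memberships(2))
print(memberships(20))
```

    A gap between two transactions can belong partly to two linguistic terms at once, which is what distinguishes fuzzy time intervals from fixed interval bins.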

    An Event-based Analysis Framework for Open Source Software Development Projects

    The increasing popularity and success of Open Source Software (OSS) development projects have drawn significant attention from academics and open source participants over the last two decades. As one of the key areas in OSS research, assessing and predicting OSS performance is of great value to both OSS communities and organizations that are interested in investing in OSS projects. Most existing research, however, has considered OSS project performance as the outcome of static cross-sectional factors such as number of developers, project activity level, and license choice. While variance studies can identify some predictors of project outcomes, they tend to neglect the actual process of development. Without a closer examination of how events occur, an understanding of OSS projects is incomplete. This dissertation aims to combine process and variance strategies to investigate how OSS projects change over time through their development processes, and to explore how these changes affect project performance. I design, instantiate, and evaluate a framework and an artifact, EventMiner, to analyze OSS projects' evolution through development activities. This framework integrates concepts from various theories such as distributed cognition (DCog) and complexity theory, applying data mining techniques such as decision trees, motif analysis, and hidden Markov modeling to automatically analyze and interpret the trace data of 103 OSS projects from an open source repository. The results support the construction of process theories on OSS development. The study contributes to the literature on DCog, design routines, OSS development, and OSS performance. The resulting framework allows OSS researchers who are interested in OSS development processes to share and reuse data and data analysis processes in an open-source manner.

    MINING FREQUENT PATTERNS FROM PRECISE AND UNCERTAIN DATA

    Data mining has gained popularity over the past two decades and has been considered one of the most prominent areas of current database research. Common data mining tasks include finding frequent patterns, clustering and classifying objects, and detecting anomalies. To handle these tasks, techniques from different fields, such as database systems, machine learning, statistics, information retrieval, and data visualization, are applied to provide business intelligence (BI) solutions to various real-life problems. In this survey, we focus on the task of frequent pattern mining, which non-trivially extracts implicit, previously unknown, and potentially useful information in the form of frequently occurring sets of items. Mined frequent patterns can be considered building blocks for association rules, which help reveal associative relationships between items or events in the antecedent and the consequent of rules. Here, we describe some classical algorithms, as well as some recent innovative algorithms, for mining precise data (in which users are certain about the presence or absence of data items) and uncertain data (in which users are uncertain about the presence or absence of data items and know only that data items probably occur).
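
    The distinction between precise and uncertain data can be made concrete with the commonly used notion of expected support. The sketch below is an illustration rather than any specific algorithm from the survey: the function name `expected_support`, the toy databases, and the assumption that item probabilities within a transaction are independent are all introduced here.

```python
from math import prod


def expected_support(db, itemset):
    """Expected number of transactions containing every item in `itemset`.

    Each transaction maps an item to its existential probability. Items are
    assumed independent, so the probability that a transaction contains the
    whole itemset is the product of the individual probabilities.
    """
    return sum(prod(t.get(item, 0.0) for item in itemset) for t in db)


# Hypothetical uncertain database: probabilities that each item is present.
uncertain_db = [
    {"a": 0.9, "b": 0.5},
    {"a": 0.4, "c": 1.0},
    {"a": 1.0, "b": 0.8},
]

# Precise data is the special case where every listed item has probability 1.0.
precise_db = [
    {"a": 1.0, "b": 1.0},
    {"a": 1.0},
    {"a": 1.0, "b": 1.0},
]

print(expected_support(uncertain_db, {"a", "b"}))
print(expected_support(precise_db, {"a", "b"}))
```

    With precise data the expected support reduces to the ordinary support count, so an itemset is frequent under the same threshold test in both settings.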

    Failure prediction for high-performance computing systems

    The failure rate in high-performance computing (HPC) systems continues to escalate as the number of components in these systems increases. This affects the scalability and the performance of parallel applications in large-scale HPC systems. Fault tolerance (FT) mechanisms help mitigate the impact of failures on parallel applications. However, using such mechanisms incurs additional overhead, and overusing them results in unnecessarily large overhead in the parallel applications. Knowing when and where failures will occur can greatly reduce this excessive overhead. As such, failure prediction is critical for using FT mechanisms effectively. It also helps in system administration and management, as a predicted failure can be handled beforehand with limited impact on the running systems. This dissertation proposes new proficiency metrics for failure prediction, based on failure impact in HPC environments, that the existing proficiency metrics are unable to reflect. Furthermore, an efficient log message clustering algorithm is proposed for system event log data preprocessing and analysis. Then, two novel association rule mining approaches are introduced and employed for HPC failure prediction. Finally, the performance of the existing and the proposed association rule mining methods is compared and analyzed.

    Personalized Market Basket Prediction with Temporal Annotated Recurring Sequences

    A pressing challenge for supermarket chains today is to offer personalized services to their customers. Market basket prediction, i.e., supplying the customer with a shopping list for the next purchase according to her current needs, is one of these services. Current approaches are not capable of capturing at the same time the different factors influencing the customer's decision process: co-occurrence, sequentiality, periodicity, and recurrency of the purchased items. To this aim, we define a pattern, the Temporal Annotated Recurring Sequence (TARS), that is able to capture all these factors simultaneously and adaptively. We define the method to extract TARS and develop a predictor for the next basket named TBP (TARS Based Predictor) that, on top of TARS, is able to understand the level of the customer's stocks and recommend the set of most necessary items. By adopting TBP, supermarket chains could produce tailored suggestions for each individual customer, which in turn could effectively speed up their shopping sessions. Extensive experimentation shows that TARS are able to explain customer purchase behavior, and that TBP outperforms the state-of-the-art competitors.