5,757 research outputs found
Significance of Data Structures and Data Retrieval Techniques on Sequence Rule Mining Efficacy
Sequence mining aims to discover rules from diverse datasets by running rule mining algorithms on efficient data structures with efficient data retrieval techniques. Traditional algorithms struggle with variable support measures, which may require repeated reconstruction of the underlying data structures as thresholds change. To address these issues, the premier sequence mining algorithm, AprioriAll, is implemented against an educational and a financial dataset, using the hash and trie data structures with scan reduction techniques. The primary idea is to study the impact of data structures and retrieval techniques on the rule mining process when handling diverse datasets. The performance evaluation metrics Support, Confidence and Lift are considered in testing the efficacy of the algorithm in terms of memory requirements and execution time. Results show that hashing excels in tree construction time and memory overhead for fixed sets of pre-defined support thresholds, whereas the trie may avoid reconstruction and can handle dynamic support thresholds, leading to shorter rule discovery time but higher memory consumption. This study highlights the effectiveness of the hash and trie data structures in light of the dataset characteristics during rule mining. It underscores the importance of choosing appropriate data structures based on dataset features, scanning techniques, and user-defined parameters
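The trade-off described in this abstract, where a trie is built once and then queried under changing support thresholds, can be illustrated with a minimal sketch. This is not the authors' implementation: it counts only sequence prefixes rather than the general subsequence support that AprioriAll uses, and the names `TrieNode`, `build_trie` and `frequent_prefixes` are hypothetical.

```python
class TrieNode:
    """One node per item; count = number of sequences sharing this prefix."""

    def __init__(self):
        self.children = {}
        self.count = 0


def build_trie(sequences):
    """Build the prefix trie once, independent of any support threshold."""
    root = TrieNode()
    for seq in sequences:
        node = root
        for item in seq:
            node = node.children.setdefault(item, TrieNode())
            node.count += 1
    return root


def frequent_prefixes(root, min_count):
    """Return {prefix: count} for prefixes with count >= min_count.

    Counts never increase along a path, so a branch that falls below
    the threshold can be pruned without visiting its descendants.
    """
    result = {}
    stack = [((), root)]
    while stack:
        prefix, node = stack.pop()
        for item, child in node.children.items():
            if child.count >= min_count:
                new_prefix = prefix + (item,)
                result[new_prefix] = child.count
                stack.append((new_prefix, child))
    return result
```

Calling `frequent_prefixes` with a different `min_count` reuses the same trie, which is the property the abstract credits for shorter rule discovery time; the stored counts for every prefix are also why memory use is higher than with a structure built for one fixed threshold.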
Learning from non-stationary data using a growing network of prototypes
Proceeding of: 2013 IEEE Congress on Evolutionary Computation (CEC), Cancun, 20-23 June 2013. Learning from non-stationary data requires methods that are able to deal with a continuous stream of data instances, possibly of infinite size, where the class distributions are potentially drifting over time. For handling such datasets, we propose a new method that incrementally creates and adapts a network of prototypes for classifying complex data received in an online fashion. The algorithm includes both accuracy-based and time-based forgetting mechanisms that ensure that the model size does not grow indefinitely with large datasets. We have performed tests on seven benchmarking datasets to compare our proposal with several approaches found in the literature, including ensemble algorithms associated with two different base classifiers. The performance obtained shows that our algorithm is comparable to the best of the ensemble classifiers in terms of accuracy/time trade-off. Moreover, our approach appears to have significant advantages for dealing with data that has a complex, non-linearly separable topology. This article has been funded by the Spanish Ministry of Science and Innovation under the project MOVES with grant reference TIN2011-28336, and NSERC-Canada.
Social Fingerprinting: detection of spambot groups through DNA-inspired behavioral modeling
Spambot detection in online social networks is a long-lasting challenge
involving the study and design of detection techniques capable of efficiently
identifying ever-evolving spammers. Recently, a new wave of social spambots has
emerged, with advanced human-like characteristics that allow them to go
undetected even by current state-of-the-art algorithms. In this paper, we show
that efficient spambot detection can be achieved via an in-depth analysis of
their collective behaviors exploiting the digital DNA technique for modeling
the behaviors of social network users. Inspired by its biological counterpart,
in the digital DNA representation the behavioral lifetime of a digital account
is encoded in a sequence of characters. Then, we define a similarity measure
for such digital DNA sequences. We build upon digital DNA and the similarity
between groups of users to characterize both genuine accounts and spambots.
Leveraging such characterization, we design the Social Fingerprinting
technique, which is able to discriminate among spambots and genuine accounts in
both a supervised and an unsupervised fashion. We finally evaluate the
effectiveness of Social Fingerprinting and we compare it with three
state-of-the-art detection algorithms. Among the peculiarities of our approach
is the possibility to apply off-the-shelf DNA analysis techniques to study
online users' behaviors and to efficiently rely on a limited number of
lightweight account characteristics
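The idea of encoding an account's behavior as a character sequence and then comparing sequences can be sketched as follows. The abstract does not specify the alphabet or the similarity measure, so both are assumptions here: the action-to-letter mapping `ALPHABET` is hypothetical, and the length of the longest common substring is used as one plausible similarity between two sequences.

```python
# Hypothetical mapping from account actions to "DNA" letters (an assumption).
ALPHABET = {"tweet": "A", "reply": "C", "retweet": "T"}


def encode(actions):
    """Encode a chronological list of actions as a digital DNA string."""
    return "".join(ALPHABET[action] for action in actions)


def longest_common_substring(s, t):
    """Length of the longest contiguous run shared by s and t.

    Classic dynamic programme: cur[j] holds the length of the common
    substring ending at s[i-1] and t[j-1]; only two rows are kept.
    """
    best = 0
    prev = [0] * (len(t) + 1)
    for i in range(1, len(s) + 1):
        cur = [0] * (len(t) + 1)
        for j in range(1, len(t) + 1):
            if s[i - 1] == t[j - 1]:
                cur[j] = prev[j - 1] + 1
                best = max(best, cur[j])
        prev = cur
    return best
```

Under this assumed measure, accounts that share long identical runs of actions would score high, which is the kind of collective behavior the abstract associates with coordinated spambot groups.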
A Transaction Prediction System for Private Bank Customers Using Fuzzy Time Interval Sequential Pattern Mining // Sistem Prediksi Transaksi Nasabah Bank Swasta Memanfaatkan Fuzzy Time Interval Sequential Pattern Mining
Banking services today are designed as one way of satisfying customers. Operational service is an important service because it takes place directly with the customer. A customer's needs can arise at any time, so the bank must be prepared in terms of cash. Transactions at a bank cannot be predicted by simple observation because economic conditions are unstable, so the bank must pay attention to the amount of cash it has available. It is therefore necessary to build a prediction system that can predict customer transactions, in order to know at which moment a transaction will occur, which transaction it will be, and whether the second transaction will follow after a short, medium or long interval. The system uses the fuzzy time interval sequential pattern method, which can predict customer transactions, combined with moments
An Event-based Analysis Framework for Open Source Software Development Projects
The increasing popularity and success of Open Source Software (OSS) development projects has drawn significant attention of academics and open source participants over the last two decades. As one of the key areas in OSS research, assessing and predicting OSS performance is of great value to both OSS communities and organizations who are interested in investing in OSS projects. Most existing research, however, has considered OSS project performance as the outcome of static cross-sectional factors such as number of developers, project activity level, and license choice. While variance studies can identify some predictors of project outcomes, they tend to neglect the actual process of development. Without a closer examination of how events occur, an understanding of OSS projects is incomplete. This dissertation aims to combine both process and variance strategy, to investigate how OSS projects change over time through their development processes; and to explore how these changes affect project performance. I design, instantiate, and evaluate a framework and an artifact, EventMiner, to analyze OSS projects’ evolution through development activities. This framework integrates concepts from various theories such as distributed cognition (DCog) and complexity theory, applying data mining techniques such as decision trees, motif analysis, and hidden Markov modeling to automatically analyze and interpret the trace data of 103 OSS projects from an open source repository. The results support the construction of process theories on OSS development. The study contributes to literature in DCog, design routines, OSS development, and OSS performance. The resulting framework allows OSS researchers who are interested in OSS development processes to share and reuse data and data analysis processes in an open-source manner
Spatio-temporal patterns of human mobility from geo-social networks for urban computing: Analysis, models & applications
The availability of rich information about fine-grained user mobility in urban environments from increasingly geographically-aware social networking services and the rapid development of machine learning applications greatly facilitate the investigation of urban issues. In this setting, urban computing emerges intending to tackle a variety of challenges faced by cities nowadays and to offer promising approaches to improving our living environment. Leveraging massive amounts of data from geo-social networks with unprecedented richness, we show how to devise novel algorithmic techniques to reveal underlying urban mobility patterns for better policy-making and more efficient mobile applications in this dissertation.
Building upon the foundation of existing research efforts in the urban computing field and basic machine learning techniques, in this dissertation we propose a general framework for urban computing with geo-social network data and develop novel algorithms tailored for three urban computing tasks. We begin by exploring how the transition data recording human movements between urban venues from geo-social networks can be aggregated and utilised to detect spatio-temporal changes of local graphs in urban areas. We further explore how, using supervised machine learning methods, this can serve as a proxy to track and predict changes in socio-economic deprivation as government financial effort is put into developing areas. We then study how to extract latent patterns from collective user-venue interactions with the help of a spatio-temporal aware topic modeling approach for the benefit of urban infrastructure planning. After that, we propose a model to detect the gap between user-side demand and venue-side supply levels for certain types of services in urban environments to suggest further policymaking and investment optimisation. Finally, we address a mobility prediction task, the application aim of which is to recommend new places to explore in the city for mobile users. To this end, we develop a deep learning framework that integrates memory network and topic modeling techniques. Extensive experiments indicate that the proposed architecture can enhance the prediction performance in various recommendation scenarios with high interpretability.
All in all, the insights drawn and the techniques developed in this dissertation make a substantial step in addressing issues in cities and open the door to future possibilities in the promising urban computing area
MINING FREQUENT PATTERNS FROM PRECISE AND UNCERTAIN DATA // MINERAÇÃO DE PADRÕES FREQUENTES A PARTIR DE DADOS PRECISOS E INCERTOS
Data mining has gained popularity over the past two decades and has been considered one of the most prominent areas of current database research. Common data mining tasks include finding frequent patterns, clustering and classifying objects, as well as detecting anomalies. To handle these tasks, techniques from different fields, such as database systems, machine learning, statistics, information retrieval, and data visualization, are applied to provide business intelligence (BI) solutions to various real-life problems. In this survey, we focus on the task of frequent pattern mining, which non-trivially extracts implicit, previously unknown and potentially useful information in the form of frequently occurring sets of items. Mined frequent patterns can be considered as building blocks for association rules, which help reveal associative relationships between items or events on the antecedent and the consequent of rules. Here, we describe some classical algorithms, as well as some recent innovative algorithms, for mining precise data (in which users are certain about the presence or absence of data items) and uncertain data (in which users are uncertain about the presence or absence of data items and they only know that data items probably occur)
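The distinction between precise and uncertain data can be made concrete with expected support, a measure commonly used when items only probably occur. The sketch below is an illustration and is not taken from the survey: it assumes each transaction is a dict mapping items to existence probabilities, and that items within a transaction are independent. The function name `expected_support` is hypothetical.

```python
from math import prod


def expected_support(itemset, transactions):
    """Sum over transactions of the probability that all items co-occur.

    Each transaction maps item -> existence probability; an absent item
    has probability 0. Items are assumed independent (an assumption).
    """
    return sum(
        prod(transaction.get(item, 0.0) for item in itemset)
        for transaction in transactions
    )
```

With precise data every probability is 0 or 1, so the same function reduces to the ordinary support count.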
Failure prediction for high-performance computing systems
The failure rate in high-performance computing (HPC) systems continues to escalate as the number of components in these systems increases. This affects the scalability and the performance of parallel applications in large-scale HPC systems. Fault tolerance (FT) mechanisms help mitigate the impact of failures on parallel applications. However, utilizing such mechanisms incurs additional overhead, and overusing them results in unnecessarily large overhead in the parallel applications. Knowing when and where failures will occur can greatly reduce the excessive overhead. As such, failure prediction is critical in order to effectively utilize FT mechanisms. In addition, it also helps in system administration and management, as the predicted failure can be handled beforehand with limited impact on the running systems.
This dissertation proposes new proficiency metrics for failure prediction, based on failure impact in HPC environments, that the existing proficiency metrics are unable to reflect. Furthermore, an efficient log message clustering algorithm is proposed for system event log data preprocessing and analysis. Then, two novel association rule mining approaches are introduced and employed for HPC failure prediction. Finally, the performances of the existing and the proposed association rule mining methods are compared and analyzed
Personalized Market Basket Prediction with Temporal Annotated Recurring Sequences
Nowadays, a hot challenge for supermarket chains is to offer personalized services to their customers. Market basket prediction, i.e., supplying the customer with a shopping list for the next purchase according to her current needs, is one of these services. Current approaches are not capable of capturing at the same time the different factors influencing the customer's decision process: co-occurrence, sequentiality, periodicity and recurrency of the purchased items. To this end, we define a pattern, the Temporal Annotated Recurring Sequence (TARS), able to capture all these factors simultaneously and adaptively. We define the method to extract TARS and develop a predictor for the next basket named TBP (TARS Based Predictor) that, on top of TARS, is able to understand the level of the customer's stocks and recommend the set of most necessary items. By adopting TBP, supermarket chains could craft tailored suggestions for each individual customer, which in turn could effectively speed up their shopping sessions. Extensive experimentation shows that TARS are able to explain customer purchase behavior, and that TBP outperforms the state-of-the-art competitors