299 research outputs found

    Analysis of group evolution prediction in complex networks

    In a world in which acceptance by and identification with social communities are highly desired, the ability to predict the evolution of groups over time appears to be a vital but very complex research problem. Therefore, we propose a new, adaptable, generic and multi-stage method for Group Evolution Prediction (GEP) in complex networks that facilitates reasoning about the future states of recently discovered groups. The precise GEP modularity enabled us to carry out extensive and versatile empirical studies on many real-world complex / social networks to analyze the impact of numerous setups and parameters such as time window type and size, group detection method, evolution chain length, prediction models, etc. Additionally, many new predictive features reflecting the group state at a given time have been identified and tested. Some other research problems, such as enriching learning evolution chains with external data, have been analyzed as well.

    A survey of cost-sensitive decision tree induction algorithms

    The past decade has seen significant interest in the problem of inducing decision trees that take account of the costs of misclassification and the costs of acquiring the features used for decision making. This survey identifies over 50 algorithms, including approaches that are direct adaptations of accuracy-based methods, use genetic algorithms, use anytime methods, and utilize boosting and bagging. The survey brings together these different studies and novel approaches to cost-sensitive decision tree learning, provides a useful taxonomy and a historical timeline of how the field has developed, and should provide a useful reference point for future research in this field.

    Click-through rate prediction : a comparative study of ensemble techniques in real-time bidding

    Dissertation presented as a partial requirement for obtaining the Master’s degree in Information Management, with a specialization in Business Intelligence and Knowledge Management. Real-Time Bidding is an automated mechanism to buy and sell ads in real time that uses data collected from internet users to accurately deliver the right audience to the best-matched advertisers. It goes beyond contextual advertising by basing bidding on user data, and it differs from the sponsored search auction, where the bid price is associated with keywords. There is extensive literature regarding the classification and prediction of performance metrics such as click-through rate, impression rate and bidding price. However, there is limited research on the application of advanced machine learning techniques, such as ensemble methods, to predicting the click-through rate of real-time bidding campaigns. This paper presents an in-depth analysis of predicting click-through rate in real-time bidding campaigns by comparing the classification results from six traditional classification models (Linear Discriminant Analysis, Logistic Regression, Regularised Regression, Decision Trees, k-Nearest Neighbors and Support Vector Machines) with two popular ensemble learning techniques (Voting and Bootstrap Aggregation). The goal of our research is to determine whether ensemble methods can accurately predict click-through rate and how they compare to standard classifiers. Results showed that ensemble techniques outperformed simple classifiers. Moreover, the results highlight the excellent performance of linear algorithms (Linear Discriminant Analysis and Regularised Regression).
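
    The two ensemble techniques named in this abstract can be illustrated with a minimal sketch. It is not the dissertation's code: the function names and the click / no-click predictions below are hypothetical, and only the general ideas of majority voting and bootstrap resampling are shown.

```python
import random
from collections import Counter


def majority_vote(predictions):
    """Return the most common label among the base classifiers' predictions."""
    return Counter(predictions).most_common(1)[0][0]


def bootstrap_sample(rows, rng):
    """Draw len(rows) items with replacement, as bagging does for each base model."""
    return [rng.choice(rows) for _ in rows]


# Hypothetical click (1) / no-click (0) predictions from three base
# classifiers for four ad impressions (illustrative values only).
base_predictions = [
    [1, 0, 0, 1],  # classifier A
    [1, 0, 1, 0],  # classifier B
    [0, 0, 1, 1],  # classifier C
]

# Voting: combine the three predictions for each impression.
ensemble = [majority_vote(votes) for votes in zip(*base_predictions)]
print(ensemble)

# Bagging: each base model would be trained on its own bootstrap sample.
sample = bootstrap_sample(list(range(10)), random.Random(0))
print(sample)
```

    With an odd number of binary classifiers the vote can never tie, which is one reason such ensembles often use three or five base models.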

    An under-Sampled Approach for Handling Skewed Data Distribution using Cluster Disjuncts

    In data mining and knowledge discovery, hidden and valuable knowledge is discovered from data sources. The traditional algorithms used for knowledge discovery are bottlenecked by the wide range of available data sources. Class imbalance is one of the problems that arise when data sources provide unequal classes, i.e. examples of one class in a training data set vastly outnumber examples of the other class(es). Researchers have rigorously studied several techniques to alleviate the problem of class imbalance, including resampling algorithms and feature selection approaches. In this paper, we present a new hybrid framework dubbed Majority Under-sampling based on Cluster Disjunct (MAJOR_CD) for learning from skewed training data. This algorithm provides a simpler and faster alternative by using the cluster disjunct concept. We conduct experiments on twelve UCI data sets from various application domains, using five algorithms for comparison on six evaluation metrics. The empirical study suggests that MAJOR_CD is effective in addressing the class imbalance problem.
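
    For readers unfamiliar with majority under-sampling, the sketch below shows plain random under-sampling. This is deliberately the simplest baseline and not MAJOR_CD, whose cluster-disjunct step is not described in the abstract; the function name and the 90/10 toy data are assumptions made for illustration.

```python
import random
from collections import Counter


def random_undersample(rows, labels, rng):
    """Keep every example of the smallest class and an equal-sized
    random subset of each larger class (plain random under-sampling)."""
    by_class = {}
    for row, label in zip(rows, labels):
        by_class.setdefault(label, []).append(row)
    target = min(len(group) for group in by_class.values())
    out_rows, out_labels = [], []
    for label, group in by_class.items():
        kept = group if len(group) == target else rng.sample(group, target)
        out_rows.extend(kept)
        out_labels.extend([label] * target)
    return out_rows, out_labels


# Hypothetical skewed training set: 90 majority vs 10 minority examples.
rows = list(range(100))
labels = [0] * 90 + [1] * 10

balanced_rows, balanced_labels = random_undersample(rows, labels, random.Random(0))
print(Counter(balanced_labels))
```

    Random under-sampling discards majority examples blindly, which is the weakness that more structured approaches such as cluster-based selection aim to address.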

    EPRENNID: An evolutionary prototype reduction based ensemble for nearest neighbor classification of imbalanced data

    Classification problems with an imbalanced class distribution have received an increased amount of attention within the machine learning community over the last decade. They are encountered in a growing number of real-world situations and pose a challenge to standard machine learning techniques. We propose a new hybrid method specifically tailored to handle class imbalance, called EPRENNID. It performs an evolutionary prototype reduction focused on providing diverse solutions to prevent the method from overfitting the training set. It also allows us to explicitly reduce the underrepresented class, which the most common preprocessing solutions handling class imbalance usually protect. As part of the experimental study, we show that the proposed prototype reduction method outperforms state-of-the-art preprocessing techniques. The preprocessing step yields multiple prototype sets that are later used in an ensemble, performing a weighted voting scheme with the nearest neighbor classifier. EPRENNID is experimentally shown to significantly outperform previous proposals

    Comparative analysis using supervised learning methods in anti-money laundering of Bitcoin data

    With the advance of Bitcoin technology, the Bitcoin blockchain has become an attractive venue for money laundering, since a user's identity is hidden behind a pseudonym known as an address. Although this trait permits concealing in plain sight, the public ledger of the Bitcoin blockchain gives investigators more power and allows collective intelligence for anti-money laundering and forensic analysis. This fascinating paradox arises from the strength of Bitcoin technology. Machine learning techniques have attained promising results in forensic analysis for spotting suspicious behaviour in the Bitcoin blockchain. This paper presents a comparative analysis of the performance of classical supervised learning methods, using a recently published data set derived from the Bitcoin blockchain, to predict licit and illicit transactions in the network. In addition, an ensemble learning method is built from a combination of the given supervised learning models, and it outperforms the given classical methods. Our main contribution is to show that an ensemble learning approach outperforms the classical learning models used in the original paper on the Elliptic data set, a time series of the Bitcoin transaction graph with transactions as nodes and directed payment flows as edges. Using the same data set, we show that we are able to predict licit/illicit transactions with an accuracy of 98.13% and an F1 score of 83.36% using the proposed method. We discuss the variety of supervised learning methods and their capabilities for assisting forensic analysis, and propose future work directions.
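
    The gap between the reported accuracy and F1 score is typical of imbalanced data, where the rare class contributes little to accuracy. The sketch below works through the arithmetic with hypothetical confusion-matrix counts; they are not taken from the paper or from the Elliptic data set.

```python
def accuracy_and_f1(tp, fp, fn, tn):
    """Compute accuracy and F1 for the positive class from confusion counts.

    Assumes at least one predicted positive and one actual positive,
    so that precision and recall are both defined.
    """
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, f1


# Hypothetical counts: 1000 transactions, of which 50 are illicit (positive).
acc, f1 = accuracy_and_f1(tp=30, fp=10, fn=20, tn=940)
print(f"accuracy = {acc:.2%}, F1 = {f1:.2%}")
```

    Here 970 of 1000 predictions are correct, yet only 30 of the 50 illicit transactions are found, so F1 is much lower than accuracy. This is why both metrics are usually reported together for such data.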

    PROPAGATION OF MISCLASSIFIED INSTANCES TO HANDLE NONSTATIONARY IMBALANCED DATA STREAM

    Learning from data streams that are both nonstationary and imbalanced is an interesting and complicated problem in data mining, as a change in class distribution may result in class imbalance. Many real-time problems such as intrusion detection, credit card fraud detection, weather forecasting and many other applications suffer from concept drift as well as class imbalance as they change with time. The aim of this paper is to present an effective learning approach for nonstationary imbalanced data streams that emphasizes misclassified examples, with a focus on two-class problems. At the end of the paper, the proposed algorithm is compared with existing similar approaches using various evaluation metrics.

    Stacked Generalizations in Imbalanced Fraud Data Sets using Resampling Methods

    This study uses stacked generalization, which is a two-step process of combining machine learning methods, called meta or super learners, for improving the performance of algorithms in step one (by minimizing the error rate of each individual algorithm to reduce its bias in the learning set) and then in step two inputting the results into the meta learner with its stacked blended output (demonstrating improved performance with the weakest algorithms learning better). The method is essentially an enhanced cross-validation strategy. Although the process uses great computational resources, the resulting performance metrics on resampled fraud data show that increased system cost can be justified. A fundamental key to fraud data is that it is inherently not systematic and, as of yet, the optimal resampling methodology has not been identified. Building a test harness that accounts for all permutations of algorithm sample set pairs demonstrates that the complex, intrinsic data structures are all thoroughly tested. Using a comparative analysis on fraud data that applies stacked generalizations provides useful insight needed to find the optimal mathematical formula to be used for imbalanced fraud data sets. Comment: 19 pages, 3 figures, 8 tables
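
    The two-step process described in this abstract can be sketched in a few lines. This is an illustration under stated assumptions, not the study's implementation: the nearest-centroid base learner, the lookup-table meta learner, the two-fold split and the toy data are all hypothetical stand-ins chosen to keep the example short.

```python
from collections import Counter, defaultdict


def fit_centroid(X, y, feature):
    """Base learner: predict the class whose mean on one feature is nearest."""
    sums, counts = defaultdict(float), Counter()
    for row, label in zip(X, y):
        sums[label] += row[feature]
        counts[label] += 1
    means = {label: sums[label] / counts[label] for label in counts}
    return lambda row: min(means, key=lambda label: abs(row[feature] - means[label]))


def fit_table(Z, y):
    """Meta learner: majority label seen for each tuple of base predictions."""
    votes = defaultdict(Counter)
    for z, label in zip(Z, y):
        votes[tuple(z)][label] += 1
    default = Counter(y).most_common(1)[0][0]
    return lambda z: votes[tuple(z)].most_common(1)[0][0] if tuple(z) in votes else default


def fit_stack(X, y, features, folds=2):
    # Step one: collect out-of-fold base predictions, so the meta learner
    # never sees a prediction from a model trained on that same row.
    Z = [None] * len(X)
    for k in range(folds):
        train = [i for i in range(len(X)) if i % folds != k]
        models = [fit_centroid([X[i] for i in train], [y[i] for i in train], f)
                  for f in features]
        for i in range(k, len(X), folds):
            Z[i] = [m(X[i]) for m in models]
    # Step two: train the meta learner on those predictions,
    # then refit the base learners on all of the data.
    meta = fit_table(Z, y)
    bases = [fit_centroid(X, y, f) for f in features]
    return lambda row: meta([m(row) for m in bases])


# Hypothetical toy data: two numeric features, label 1 marks "fraud".
X = [(1.0, 5.0), (1.2, 4.8), (0.9, 5.2), (1.1, 5.1),
     (3.0, 1.0), (3.2, 0.8), (2.9, 1.1), (3.1, 1.2)]
y = [0, 0, 0, 0, 1, 1, 1, 1]

model = fit_stack(X, y, features=[0, 1])
print([model(row) for row in X])
```

    The out-of-fold loop is what makes stacking resemble cross-validation: every row's base predictions come from models that did not train on it. It is also the source of the extra computational cost the abstract mentions.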

    Selecting Representative Data Sets


    Rule-based classification approach for railway wagon health monitoring

    Modern machine learning techniques have encouraged interest in the development of vehicle health monitoring systems that ensure secure and reliable operations of rail vehicles. In an earlier study, an energy-efficient data acquisition method was investigated to develop a monitoring system for railway applications using modern machine learning techniques, more specifically classification algorithms. A suitable classifier was proposed for railway monitoring based on relative weighted performance metrics. To improve the performance of the existing approach, a rule-based learning method using statistical analysis is proposed in this paper to select a unique classifier for the same application. The selected algorithm works more efficiently and improves the overall performance of railway monitoring systems. This study has been conducted using six classifiers, namely REPTree, J48, Decision Stump, IBK, PART and OneR, with twenty-five datasets. The Waikato Environment for Knowledge Analysis (WEKA) learning tool has been used in this study to develop the prediction models.