    Complex graph stream mining

    University of Technology Sydney. Faculty of Engineering and Information Technology. Recent years have witnessed a dramatic increase of information due to the continual development of modern technologies. The large scale of information makes data analysis, particularly data mining and knowledge discovery tasks, unprecedentedly challenging. First, data is becoming more and more interconnected. In a variety of domains such as social networks, chemical compounds, and XML documents, data is no longer represented by a flat table with instance-feature format, but exhibits complex structures indicating dependency relationships. Second, data is evolving more and more dynamically. Emerging applications such as social networks continuously generate information over time. Third, the learning tasks in many real-life applications become more and more complicated in that there are various constraints on the amount of labelled data, class distributions, misclassification costs, the number of learning tasks, etc. Considering the above challenges, this research aims to investigate theoretical foundations and to study new algorithm designs and system frameworks that enable the mining of complex graph streams from three aspects: (1) Correlated Graph Stream Mining, (2) Graph Stream Classifications, and (3) Complex Task Graph Classification. In particular, correlated graph stream mining intends to carry out structured pattern search and support the query of similar graphs from a graph stream. Due to the dynamically changing nature of the streaming data and the inherent complexity of the graph query process, treating graph streams as static datasets is computationally infeasible or ineffective. Therefore, we proposed a novel algorithm, CGStream, to identify correlated graphs from a data stream by using a sliding window that covers a number of consecutive batches of stream data records.
Experimental results demonstrate that the proposed algorithm is several times, or even an order of magnitude, more efficient than the straightforward algorithms. Graph stream classification aims to build effective and efficient classification models for graph streams with continuously growing volumes and dynamic changes. We proposed two methods for complex graph stream classification. Due to the inherent complexity of graph structure, labelling graph data is very expensive. To solve this problem, we proposed a gLSU algorithm, which aims to select discriminative subgraph features with minimum redundancy by using both labelled and unlabelled graphs for graph streams. The second approach handles graph streams with imbalanced class distributions and noise. Both frameworks use an instance weighting scheme to capture the underlying concept drifts of graph streams and achieve significant performance gain on benchmark graph streams. Complex task graph classification aims to address the graph classification problems with complex constraints. We studied two complex task graph classification problems, cost-sensitive graph classification of large-scale graphs and multi-task graph classification. Because misclassification costs/risks differ inherently between classes, as in medical diagnosis, and large-scale graph classification is in high demand in real-life applications, we proposed the CogBoost algorithm for cost-sensitive classification of large-scale graphs. To overcome the limitation of insufficient labelled graphs for a specific learning task, we further proposed effective algorithms to leverage multiple graph learning tasks to select subgraph features and regularize multiple tasks to achieve better generalization performance for all learning tasks.
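
The abstract above describes a sliding window that covers a number of consecutive batches of stream records. The following is a minimal sketch of that windowing idea only, not of CGStream itself; the class name `BatchSlidingWindow` and the string graph identifiers are hypothetical and chosen for illustration.

```python
from collections import deque


class BatchSlidingWindow:
    """Keeps only the most recent `max_batches` batches of a stream."""

    def __init__(self, max_batches):
        # A deque with maxlen drops the oldest batch automatically
        # when a new batch arrives and the window is already full.
        self._batches = deque(maxlen=max_batches)

    def push(self, batch):
        """Add the newest batch of records to the window."""
        self._batches.append(list(batch))

    def records(self):
        """Return every record currently covered by the window, oldest first."""
        return [record for batch in self._batches for record in batch]


window = BatchSlidingWindow(max_batches=2)
for batch in (["g1", "g2"], ["g3"], ["g4", "g5"]):
    window.push(batch)

# Only the two most recent batches remain in the window.
print(window.records())
```

A query for correlated graphs would then run against `window.records()` rather than against the whole stream, which is what keeps the work per batch bounded.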

    Rain Fall Prediction using Ada Boost Machine Learning Ensemble Algorithm

    Every government takes initiatives for the well-being of its citizens with respect to the environment and climate in which they live. Global warming is one of the causes of climate change. With the help of machine learning algorithms, Artificial Intelligence and Data Mining techniques, not only rainfall but also lightning, thunderstorm outbreaks, etc. can be predicted. Management of water reservoirs, flooding, traffic control in smart cities, sewer system functioning and agricultural production are the hydro-meteorological factors that affect human life drastically. Due to the dynamic nature of the atmosphere, existing statistical techniques (Support Vector Machine (SVM), Decision Tree (DT) and logistic regression (LR)) fail to provide good accuracy for rainfall forecasting. Different weather features (Temperature, Relative Humidity, Dew Point, Solar Radiation and Precipitable Water Vapour) are extracted for rainfall prediction. In this research work, data analysis using a machine learning ensemble algorithm, Adaptive Boosting (Ada Boost), is proposed. The dataset used for this classification application is taken from the hydrological department, India, and covers 1901-2015. Overall, the proposed algorithm is feasible for qualitatively predicting rainfall with the help of the R tool and the Ada Boost algorithm. Accuracy and error rates are compared with those of the existing Support Vector Machine (SVM) algorithm, and the proposed algorithm gives the better result.
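
For readers unfamiliar with Adaptive Boosting, here is a minimal sketch of the textbook discrete AdaBoost procedure with one-feature threshold stumps. It is written in Python rather than the R tool named in the abstract, and it is not the study's implementation; the function names `train_stump`, `adaboost` and `predict` and the toy humidity readings are hypothetical.

```python
import math


def train_stump(xs, ys, weights):
    """Return (weighted_error, threshold, polarity) of the best threshold stump."""
    best = None
    for threshold in sorted(set(xs)):
        for polarity in (1, -1):
            preds = [polarity if x >= threshold else -polarity for x in xs]
            err = sum(w for w, p, y in zip(weights, preds, ys) if p != y)
            if best is None or err < best[0]:
                best = (err, threshold, polarity)
    return best


def adaboost(xs, ys, rounds=5):
    """Fit `rounds` weighted stumps; labels must be +1 or -1."""
    n = len(xs)
    weights = [1.0 / n] * n
    ensemble = []
    for _ in range(rounds):
        err, threshold, polarity = train_stump(xs, ys, weights)
        # Clamp the error so the logarithm below stays finite.
        err = min(max(err, 1e-10), 1 - 1e-10)
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, threshold, polarity))
        # Increase the weight of misclassified samples, decrease the rest.
        updated = []
        for w, x, y in zip(weights, xs, ys):
            pred = polarity if x >= threshold else -polarity
            updated.append(w * math.exp(-alpha * y * pred))
        total = sum(updated)
        weights = [w / total for w in updated]
    return ensemble


def predict(ensemble, x):
    """Sign of the alpha-weighted vote of all stumps."""
    score = sum(a * (p if x >= t else -p) for a, t, p in ensemble)
    return 1 if score >= 0 else -1


# Toy, made-up relative humidity readings (%); +1 = rain, -1 = no rain.
humidity = [55, 60, 65, 80, 85, 90]
rain = [-1, -1, -1, 1, 1, 1]
model = adaboost(humidity, rain, rounds=3)
print([predict(model, h) for h in humidity])
```

The reweighting step is the core of the method: each round concentrates on the samples the previous stumps got wrong, and the final prediction is a weighted vote.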

    Analisis Perbandingan Algoritma Svm Dan Knn Untuk Klasifikasi Anime Bergenre Drama (Comparative Analysis of the SVM and KNN Algorithms for Classifying Drama-Genre Anime)

    There are many genres of anime, such as drama, action, romance, and comedy. Because there are so many genres, it is difficult for viewers to find anime in a genre they like, such as drama, which depicts everyday human life in a fairly light manner. To address this problem, a classification method is needed to identify anime that belongs to the drama genre. Classification algorithms include Support Vector Machine (SVM) and K-Nearest Neighbors (KNN), both of which have been widely used and achieve good accuracy. This study carries out a comparative analysis of the two algorithms. The dataset contains 12,294 records with 7 attributes and 2 genre classes, drama and non-drama. The results indicate that the K-Nearest Neighbors (KNN) algorithm achieves a training accuracy of 100% and a test accuracy of 84%, while the Support Vector Machine (SVM) algorithm achieves a training accuracy of 83% and a test accuracy of 82%. KNN therefore has a better test accuracy than SVM, although the difference between the two algorithms is small.

    Cost-Sensitive Classification Methods for the Detection of Smuggled Nuclear Material in Cargo Containers

    Classification problems arise in many different parts of life – from sorting machine parts to diagnosing a disease. Humans make these classifications utilizing vast amounts of data, filtering observations for useful information, and then making a decision based on a subjective level of cost/risk of classifying objects incorrectly. This study investigates the translation of the human decision process into a mathematical problem in the context of a border security problem: How does one find special nuclear material being smuggled inside large cargo crates while balancing the cost of invasively searching suspect containers against the risk of allowing radioactive material to escape detection? This may be phrased as a classification problem in which one classifies cargo containers into two categories – those containing a smuggled source and those containing only innocuous cargo. This task presents numerous challenges, e.g., the stochastic nature of radiation and the low signal-to-noise ratio caused by background radiation and cargo shielding. In the course of this work, we will break the analysis of this problem into three major sections – the development of an optimal decision rule, the choice of most useful measurements or features, and the sensitivity of developed algorithms to physical variations. This will include an examination of how accounting for the cost/risk of a decision affects the formulation of our classification problem. Ultimately, a support vector machine (SVM) framework with F-score feature selection will be developed to provide nearly optimal classification given a constraint on the reliability of detection provided by our algorithm. In particular, this can decrease the fraction of false positives by an order of magnitude over current methods. The proposed method also takes into account the relationship between measurements, whereas current methods deal with detectors independently of one another.
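
To illustrate how accounting for cost/risk changes a decision rule, the sketch below applies the textbook minimum-expected-cost rule to a single container. It is not the SVM framework developed in the study; the function names, the "search"/"clear" labels and the cost values are hypothetical, and the probability that a container holds a source is assumed to be supplied by some upstream model.

```python
def decision_threshold(cost_false_positive, cost_false_negative):
    """Probability above which searching has the lower expected cost."""
    return cost_false_positive / (cost_false_positive + cost_false_negative)


def classify(p_source, cost_false_positive, cost_false_negative):
    """Pick the action with the smaller expected cost.

    Clearing a container that holds a source costs `cost_false_negative`;
    searching an innocuous container costs `cost_false_positive`.
    """
    expected_cost_of_clearing = p_source * cost_false_negative
    expected_cost_of_searching = (1 - p_source) * cost_false_positive
    if expected_cost_of_clearing > expected_cost_of_searching:
        return "search"
    return "clear"


# Assumed costs: a missed source is 99 times worse than a needless search.
print(decision_threshold(1, 99))
print(classify(0.05, 1, 99))
print(classify(0.005, 1, 99))
```

With equal costs the threshold is 0.5; making a missed source more expensive pushes the threshold down, so more containers are searched at the price of more false positives.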

    Origins of Modern Data Analysis Linked to the Beginnings and Early Development of Computer Science and Information Engineering

    The history of data analysis that is addressed here is underpinned by two themes: tabular data analysis, and the analysis of collected heterogeneous data. "Exploratory data analysis" is taken as the heuristic approach that begins with data and information and seeks underlying explanation for what is observed or measured. I also cover some of the evolving context of research and applications, including scholarly publishing, technology transfer and the economic relationship of the university to society. Comment: 26 pages

    Training a Feed-forward Neural Network with Artificial Bee Colony Based Backpropagation Method

    The back-propagation algorithm is one of the most widely used techniques for optimizing feed-forward neural network training. Nature-inspired meta-heuristic algorithms also provide derivative-free solutions for optimizing complex problems. The artificial bee colony algorithm is a nature-inspired meta-heuristic algorithm that mimics the foraging, or food source searching, behaviour of bees in a bee colony, and it has been implemented in several applications for an improved optimized outcome. The method proposed in this paper is an improved artificial bee colony algorithm based back-propagation neural network training method, intended to give the hybrid neural network learning method a fast and improved convergence rate. The results are compared with the genetic algorithm based back-propagation method, which is another hybridized procedure of this kind. Analysis is performed over standard data sets and shows the efficiency of the proposed method in terms of convergence speed and rate. Comment: 14 pages, 11 figures

    Predicting customer's gender and age depending on mobile phone data

    In the age of data-driven solutions, customer demographic attributes such as gender and age play a core role that may enable companies to enhance their service offers and target the right customer at the right time and place. In a marketing campaign, companies want to target the real user of the GSM (global system for mobile communications) line, who is sometimes not the same person as the line owner. This work proposes a method that predicts users' gender and age based on their behavior, services and contract information. We used call detail records (CDRs), customer relationship management (CRM) and billing information as a data source to analyze telecom customer behavior, and applied different types of machine learning algorithms to provide marketing campaigns with more accurate information about customer demographic attributes. The model is built using a reliable data set of 18,000 users provided by SyriaTel Telecom Company, for training and testing. The model was implemented using big data technology and achieved 85.6% accuracy for user gender prediction and 65.5% for user age prediction. The main contributions of this work are the improved accuracy of user gender and age prediction based on mobile phone data and an end-to-end solution that approaches customer data from multiple aspects in the telecom domain.

    Comparative Evaluation of Packet Classification Algorithms for Implementation on Resource Constrained Systems

    This paper provides a comparative evaluation of a number of known classification algorithms that have been considered for both software and hardware implementation. Unlike other sources, the comparison has been carried out on implementations based on the same principles and design choices. Performance measurements are obtained by feeding the implemented classifiers with various traffic traces in the same test scenario. The comparison also takes into account the implementation feasibility of the considered algorithms in resource constrained systems (e.g. embedded processors on special purpose network platforms). In particular, the comparison focuses on achieving a good compromise between performance, memory usage, flexibility and code portability to different target platforms.