
    Multi-objective Optimization for Incremental Decision Tree Learning

    Abstract. Decision tree learning can be roughly divided into two categories: static and incremental induction. Static tree induction applies greedy search in the splitting test to obtain a globally optimal model. Incremental tree induction constructs a decision model by analyzing data in short segments; during each segment, a locally optimal tree structure is formed. The Very Fast Decision Tree [4] is a typical incremental tree induction method that bases its node-splitting test on the Hoeffding bound, but it does not work well on noisy data. In this paper, we propose a new incremental tree induction model called the incrementally Optimized Very Fast Decision Tree (iOVFDT), which uses a multi-objective incremental optimization method. iOVFDT also integrates four classifiers at the leaf level. The proposed model is tested on a large volume of data streams contaminated with noise. Under such noisy data, we investigate how iOVFDT, an incremental induction method working with local optima, compares to C4.5, which loads the whole dataset to build a globally optimal decision tree. Our experimental results show that iOVFDT achieves similar, though slightly lower, accuracy, while its decision tree size and induction time are much smaller than those of C4.5.
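    The Hoeffding bound mentioned above drives VFDT's splitting decision: with n observations of a random variable with range R, the true mean lies within epsilon = sqrt(R^2 ln(1/delta) / (2n)) of the observed mean with probability 1 - delta. A minimal sketch (the parameter values are illustrative, not taken from the paper):

    ```python
    import math

    def hoeffding_bound(value_range: float, delta: float, n: int) -> float:
        """epsilon = sqrt(R^2 * ln(1/delta) / (2n)) for n observations."""
        return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))

    # VFDT splits a leaf when the gain gap between the two best attributes
    # exceeds epsilon. Information gain over [0, 1] gives R = 1.
    eps = hoeffding_bound(value_range=1.0, delta=1e-7, n=1000)
    print(eps)  # shrinks as more examples arrive at the leaf
    ```

    The bound tightens as n grows, so the learner waits for just enough examples before committing to a split.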

    Hoeffding Tree Algorithms for Anomaly Detection in Streaming Datasets: A Survey

    This survey aims to deliver an extensive and well-constructed overview of using machine learning to detect anomalies in streaming datasets. The objective is to assess the effectiveness of Hoeffding Trees as a machine learning solution for detecting anomalies in streaming cyber datasets. In this survey we categorize the existing research on Hoeffding Trees that is feasible for this type of study as follows: distributed Hoeffding Trees, ensembles of Hoeffding Trees, and existing techniques using Hoeffding Trees for anomaly detection. These categories are referred to as compositions within this paper and were selected based on their relation to streaming data and the flexibility of their techniques for use within different domains of streaming data. We discuss how combining techniques from the surveyed works within these compositions can address the anomaly detection problem in streaming cyber datasets. The goal is to show how a combination of techniques from different compositions can solve a prominent problem: anomaly detection.

    Considering Currency in Decision Trees in the Context of Big Data

    In the current age of big data, decision trees are one of the most commonly applied data mining methods. However, for reliable results they require up-to-date input data, which is not always available in practice. We present a two-phase approach based on probability theory for considering the currency of stored data in decision trees. Our approach is efficient and thus suitable for big data applications. Moreover, it is independent of the particular decision tree classifier. Finally, it is context-specific, since the decision tree structure and supplemental data are taken into account. We demonstrate the benefits of the novel approach by applying it to three datasets. The results show a substantial increase in the classification success rate compared to not considering currency. Applying our approach thus prevents wrong classifications and, consequently, wrong decisions.

    Incremental Algorithms for Predicting Quantitative Variables Using Mobile Call Data

    The flow of information generated and circulating nowadays in local and transnational data networks is enormous. That information originates, for example, in the media and results from users' everyday activities. The mass storage of information in diverse, often colossal databases, at an ever-increasing rate, creates growing difficulties for organizations in managing that information, but at the same time it holds a hidden potential value, often misunderstood and poorly exploited. With the emergence of this phenomenon of growing data accumulation, new problems and challenges have arisen. How can one identify, amid seemingly irrelevant data, the significant data, the useful information, and the valuable patterns?
    In the most varied areas, information is stored almost continuously, and in this context a new research area, Data Mining, has evolved over the last three decades. Telecommunications companies in particular hold millions of records of precious information which they could use to provide new services to their clients, if only they could find a clear way to use it. With that information they could perform several tasks, such as predicting the duration of a call at the moment it begins, which is the object of this study. This work intends to contribute to the knowledge of how to transform data from a large database into information relevant to businesses. Ways to add value and knowledge to the available information were sought, in order to make the business more profitable.
    Any study in this area is quickly confronted with a great difficulty: the analysis of an enormous volume of data, which poses a problem of computational complexity. The difficulty lies not only in discovering useful hidden information but also in processing that information in a timely manner. Therefore, the main goal of this project was to study and compare incremental algorithms for predicting the duration of a call at the moment it begins, and to identify the best algorithms for this regression problem, including the preprocessing tasks. It is a supervised learning problem in which regression techniques are used. The following methods are used: distance-based methods (k-Nearest Neighbor), search-based methods (decision trees, VFDT - Very Fast Decision Tree), and homogeneous and heterogeneous ensemble methods, in which several models are combined to make the best decisions. Evaluation methods are used to compare the efficiency of the algorithms. The results are expected to identify which method is the most efficient in predicting the duration of a call, the expected precision of the prediction, and the confidence interval within which the results fall.
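    One of the incremental methods named above, k-Nearest Neighbor regression, can be sketched as a sliding-window learner: the model keeps only recent examples and predicts the mean target of the k closest ones. The class name, the feature choice, and the usage values below are illustrative assumptions, not the thesis's actual implementation:

    ```python
    from collections import deque

    class IncrementalKNNRegressor:
        """Sliding-window k-NN regressor: stores the last `window` examples
        and predicts the mean target of the k nearest stored neighbours."""

        def __init__(self, k: int = 3, window: int = 1000):
            self.k = k
            self.buffer = deque(maxlen=window)  # (features, target) pairs

        def partial_fit(self, x, y):
            """Incrementally add one example; old ones fall off the window."""
            self.buffer.append((tuple(x), float(y)))

        def predict(self, x):
            if not self.buffer:
                raise ValueError("no training examples yet")
            sq_dist = lambda a: sum((ai - xi) ** 2 for ai, xi in zip(a, x))
            nearest = sorted(self.buffer, key=lambda p: sq_dist(p[0]))[: self.k]
            return sum(y for _, y in nearest) / len(nearest)

    # Hypothetical features: (hour of day, caller's mean past duration in s).
    model = IncrementalKNNRegressor(k=2, window=100)
    for x, y in [((9, 120), 110), ((9, 130), 125), ((21, 40), 35)]:
        model.partial_fit(x, y)
    print(model.predict((9, 125)))  # → 117.5, the mean of the two daytime calls
    ```

    Bounding the buffer with `deque(maxlen=...)` is what makes the learner incremental: memory stays constant no matter how long the call stream runs.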