1,446 research outputs found

    A new approach of top-down induction of decision trees for knowledge discovery

    Get PDF
    Top-down induction of decision trees is the most popular technique for classification in the field of data mining and knowledge discovery. Quinlan developed the basic induction algorithm of decision trees, ID3 (1984), and extended to C4.5 (1993). There is a lot of research work for dealing with a single attribute decision-making node (so-called the first-order decision) of decision trees. Murphy and Pazzani (1991) addressed about multiple-attribute conditions at decision-making nodes. They show that higher order decision-making generates smaller decision trees and better accuracy. However, there always exist NP-complete combinations of multiple-attribute decision-makings.;We develop a new algorithm of second-order decision-tree inductions (SODI) for nominal attributes. The induction rules of first-order decision trees are combined by \u27AND\u27 logic only, but those of SODI consist of \u27AND\u27, \u27OR\u27, and \u27OTHERWISE\u27 logics. It generates more accurate results and smaller decision trees than any first-order decision tree inductions.;Quinlan used information gains via VC-dimension (Vapnik-Chevonenkis; Vapnik, 1995) for clustering the experimental values for each numerical attribute. However, many researchers have discovered the weakness of the use of VC-dim analysis. Bennett (1997) sophistically applies support vector machines (SVM) to decision tree induction. We suggest a heuristic algorithm (SVMM; SVM for Multi-category) that combines a TDIDT scheme with SVM. In this thesis it will be also addressed how to solve multiclass classification problems.;Our final goal for this thesis is IDSS (Induction of Decision Trees using SODI and SVMM). We will address how to combine SODI and SVMM for the construction of top-down induction of decision trees in order to minimize the generalized penalty cost

    Combining classification algorithms

    Get PDF
    Dissertação de Doutoramento em Ciência de Computadores apresentada à Faculdade de Ciências da Universidade do PortoA capacidade de um algoritmo de aprendizagem induzir, para um determinado problema, uma boa generalização depende da linguagem de representação usada para generalizar os exemplos. Como diferentes algoritmos usam diferentes linguagens de representação e estratégias de procura, são explorados espaços diferentes e são obtidos resultados diferentes. O problema de encontrar a representação mais adequada para o problema em causa, é uma área de investigação bastante activa. Nesta dissertação, em vez de procurar métodos que fazem o ajuste aos dados usando uma única linguagem de representação, apresentamos uma família de algoritmos, sob a designação genérica de Generalização em Cascata, onde o espaço de procura contem modelos que utilizam diferentes linguagens de representação. A ideia básica do método consiste em utilizar os algoritmos de aprendizagem em sequência. Em cada iteração ocorre um processo com dois passos. No primeiro passo, um classificador constrói um modelo. No segundo passo, o espaço definido pelos atributos é estendido pela inserção de novos atributos gerados utilizando este modelo. Este processo de construção de novos atributos constrói atributos na linguagem de representação do classificador usado para construir o modelo. Se posteriormente na sequência, um classificador utiliza um destes novos atributos para construir o seu modelo, a sua capacidade de representação foi estendida. Desta forma as restrições da linguagem de representação dosclassificadores utilizados a mais alto nível na sequência, são relaxadas pela incorporação de termos da linguagem derepresentação dos classificadores de base. Esta é a metodologia base subjacente ao sistema Ltree e à arquitecturada Generalização em Cascata.O método é apresentado segundo duas perspectivas. Numa primeira parte, é apresentado como uma estratégia paraconstruir árvores de decisão multivariadas. É apresentado o sistema Ltree que utiliza como operador para a construção de atributos um discriminante linear. ..

    Approximation and Relaxation Approaches for Parallel and Distributed Machine Learning

    Get PDF
    Large scale machine learning requires tradeoffs. Commonly this tradeoff has led practitioners to choose simpler, less powerful models, e.g. linear models, in order to process more training examples in a limited time. In this work, we introduce parallelism to the training of non-linear models by leveraging a different tradeoff--approximation. We demonstrate various techniques by which non-linear models can be made amenable to larger data sets and significantly more training parallelism by strategically introducing approximation in certain optimization steps. For gradient boosted regression tree ensembles, we replace precise selection of tree splits with a coarse-grained, approximate split selection, yielding both faster sequential training and a significant increase in parallelism, in the distributed setting in particular. For metric learning with nearest neighbor classification, rather than explicitly train a neighborhood structure we leverage the implicit neighborhood structure induced by task-specific random forest classifiers, yielding a highly parallel method for metric learning. For support vector machines, we follow existing work to learn a reduced basis set with extremely high parallelism, particularly on GPUs, via existing linear algebra libraries. We believe these optimization tradeoffs are widely applicable wherever machine learning is put in practice in large scale settings. By carefully introducing approximation, we also introduce significantly higher parallelism and consequently can process more training examples for more iterations than competing exact methods. While seemingly learning the model with less precision, this tradeoff often yields noticeably higher accuracy under a restricted training time budget

    Process-Oriented Stream Classification Pipeline:A Literature Review

    Get PDF
    Featured Application: Nowadays, many applications and disciplines work on the basis of stream data. Common examples are the IoT sector (e.g., sensor data analysis), or video, image, and text analysis applications (e.g., in social media analytics or astronomy). With our work, we gather different approaches and terminology, and give a broad overview over the topic. Our main target groups are practitioners and newcomers to the field of data stream classification. Due to the rise of continuous data-generating applications, analyzing data streams has gained increasing attention over the past decades. A core research area in stream data is stream classification, which categorizes or detects data points within an evolving stream of observations. Areas of stream classification are diverse—ranging, e.g., from monitoring sensor data to analyzing a wide range of (social) media applications. Research in stream classification is related to developing methods that adapt to the changing and potentially volatile data stream. It focuses on individual aspects of the stream classification pipeline, e.g., designing suitable algorithm architectures, an efficient train and test procedure, or detecting so-called concept drifts. As a result of the many different research questions and strands, the field is challenging to grasp, especially for beginners. This survey explores, summarizes, and categorizes work within the domain of stream classification and identifies core research threads over the past few years. It is structured based on the stream classification process to facilitate coordination within this complex topic, including common application scenarios and benchmarking data sets. Thus, both newcomers to the field and experts who want to widen their scope can gain (additional) insight into this research area and find starting points and pointers to more in-depth literature on specific issues and research directions in the field.</p

    Instance selection for model-based classifiers

    Get PDF
    Aspects of a classifier\u27s training dataset can often make building a helpful and high accuracy classifier difficult. Instance selection addresses some of the issues in a dataset by selecting a subset of the data in such a way that learning from the reduced dataset leads to a better classifier. This work introduces an integer programming formulation of instance selection that relies on column generation techniques to obtain a good solution to the problem. Experimental results show that instance selection improves the usefulness of some classifiers by optimizing the training data so that that the training dataset has easier to learn boundaries between class values. Also included in this paper are two case studies from the Surveillance, Epidemiology, and End Results (SEER) database that further confirm the benefit of instance selection. Overall, results indicate that performing instance selection for a classifier is a competitive classification approach. However, it should be noted that instance selection might overfit classifiers that have already achieved a good fit to the dataset

    Predictive Modelling Approach to Data-Driven Computational Preventive Medicine

    Get PDF
    This thesis contributes novel predictive modelling approaches to data-driven computational preventive medicine and offers an alternative framework to statistical analysis in preventive medicine research. In the early parts of this research, this thesis presents research by proposing a synergy of machine learning methods for detecting patterns and developing inexpensive predictive models from healthcare data to classify the potential occurrence of adverse health events. In particular, the data-driven methodology is founded upon a heuristic-systematic assessment of several machine-learning methods, data preprocessing techniques, models’ training estimation and optimisation, and performance evaluation, yielding a novel computational data-driven framework, Octopus. Midway through this research, this thesis advances research in preventive medicine and data mining by proposing several new extensions in data preparation and preprocessing. It offers new recommendations for data quality assessment checks, a novel multimethod imputation (MMI) process for missing data mitigation, a novel imbalanced resampling approach, and minority pattern reconstruction (MPR) led by information theory. This thesis also extends the area of model performance evaluation with a novel classification performance ranking metric called XDistance. In particular, the experimental results show that building predictive models with the methods guided by our new framework (Octopus) yields domain experts' approval of the new reliable models’ performance. Also, performing the data quality checks and applying the MMI process led healthcare practitioners to outweigh predictive reliability over interpretability. The application of MPR and its hybrid resampling strategies led to better performances in line with experts' success criteria than the traditional imbalanced data resampling techniques. Finally, the use of the XDistance performance ranking metric was found to be more effective in ranking several classifiers' performances while offering an indication of class bias, unlike existing performance metrics The overall contributions of this thesis can be summarised as follow. First, several data mining techniques were thoroughly assessed to formulate the new Octopus framework to produce new reliable classifiers. In addition, we offer a further understanding of the impact of newly engineered features, the physical activity index (PAI) and biological effective dose (BED). Second, the newly developed methods within the new framework. Finally, the newly accepted developed predictive models help detect adverse health events, namely, visceral fat-associated diseases and advanced breast cancer radiotherapy toxicity side effects. These contributions could be used to guide future theories, experiments and healthcare interventions in preventive medicine and data mining

    Holistic Learning for Multi-Target and Network Monitoring Problems

    Get PDF
    abstract: Technological advances have enabled the generation and collection of various data from complex systems, thus, creating ample opportunity to integrate knowledge in many decision making applications. This dissertation introduces holistic learning as the integration of a comprehensive set of relationships that are used towards the learning objective. The holistic view of the problem allows for richer learning from data and, thereby, improves decision making. The first topic of this dissertation is the prediction of several target attributes using a common set of predictor attributes. In a holistic learning approach, the relationships between target attributes are embedded into the learning algorithm created in this dissertation. Specifically, a novel tree based ensemble that leverages the relationships between target attributes towards constructing a diverse, yet strong, model is proposed. The method is justified through its connection to existing methods and experimental evaluations on synthetic and real data. The second topic pertains to monitoring complex systems that are modeled as networks. Such systems present a rich set of attributes and relationships for which holistic learning is important. In social networks, for example, in addition to friendship ties, various attributes concerning the users' gender, age, topic of messages, time of messages, etc. are collected. A restricted form of monitoring fails to take the relationships of multiple attributes into account, whereas the holistic view embeds such relationships in the monitoring methods. The focus is on the difficult task to detect a change that might only impact a small subset of the network and only occur in a sub-region of the high-dimensional space of the network attributes. One contribution is a monitoring algorithm based on a network statistical model. Another contribution is a transactional model that transforms the task into an expedient structure for machine learning, along with a generalizable algorithm to monitor the attributed network. A learning step in this algorithm adapts to changes that may only be local to sub-regions (with a broader potential for other learning tasks). Diagnostic tools to interpret the change are provided. This robust, generalizable, holistic monitoring method is elaborated on synthetic and real networks.Dissertation/ThesisDoctoral Dissertation Industrial Engineering 201

    Unsupervised Extra Trees: a stochastic approach to compute similarities in heterogeneous data.

    Get PDF
    International audienceIn this paper we present a method to compute similarities on unlabeled data, based on extremely randomized trees. The main idea of our method, Unsu-pervised Extremely Randomized Trees (UET) is to randomly split the data in an iterative fashion until a stopping criterion is met, and to compute a similarity based on the co-occurrence of samples in the leaves of each generated tree. Using a tree-based approach to compute similarities is interesting, as the inherent We evaluate our method on synthetic and real-world datasets by comparing the mean similarities between samples with the same label and the mean similarities between samples with different labels. These metrics are similar to intracluster and intercluster similarities, and are used to assess the computed similarities instead of a clustering algorithm's results. Our empirical study shows that the method effectively gives distinct similarity values between samples belonging to different clusters, and gives indiscernible values when there is no cluster structure. We also assess some interesting properties such as in-variance under monotone transformations of variables and robustness to correlated variables and noise. Finally , we performed hierarchical agglomerative clustering on synthetic and real-world homogeneous and heterogeneous datasets using UET versus standard similarity measures. Our experiments show that the algorithm outperforms existing methods in some cases, and can reduce the amount of preprocessing needed with many real-world datasets
    corecore