    Cost-Sensitive Decision Trees with Completion Time Requirements

    In many classification tasks, managing costs and completion times are the main concerns. In this paper, we assume that the completion time for classifying an instance is determined by its class label, and that a late penalty cost is incurred if the deadline is not met. This time requirement enriches the classification problem but poses a challenge to developing a solution algorithm. We propose an innovative approach to decision tree induction, which produces multiple candidate trees by allowing more than one splitting attribute at each node. The user can specify the maximum number of candidate trees to control the computational effort required to produce the final solution. In the tree-induction process, an allocation scheme dynamically distributes the given number of candidate trees to splitting attributes according to their estimated contributions to cost reduction. The algorithm finds the final tree by backtracking. An extensive experiment shows that the algorithm outperforms the top-down heuristic and can effectively obtain optimal or near-optimal decision trees without excessive computation time.

    Keywords: classification, decision tree, cost and time sensitive learning, late penalty
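
    The abstract does not spell out the allocation scheme, so the following is only a minimal sketch of one plausible reading: the budget of candidate trees at a node is split across splitting attributes in proportion to their estimated cost reduction, with leftover trees assigned by largest fractional remainder. The function name, the dictionary input, and the proportional rule itself are assumptions made for illustration, not the authors' method.

```python
from math import floor


def allocate_candidates(cost_reductions, max_trees):
    """Split a budget of candidate trees across splitting attributes.

    cost_reductions: dict mapping attribute name -> estimated cost reduction
                     (hypothetical input; the paper's estimator is not shown here)
    max_trees:       user-specified maximum number of candidate trees at this node
    Returns a dict mapping attribute name -> number of candidate trees assigned.
    """
    # Only attributes expected to reduce cost compete for the budget.
    useful = {a: r for a, r in cost_reductions.items() if r > 0}
    if not useful or max_trees <= 0:
        return {}

    total = sum(useful.values())
    # Ideal fractional share of the budget for each attribute.
    shares = {a: max_trees * r / total for a, r in useful.items()}
    alloc = {a: floor(s) for a, s in shares.items()}

    # Hand out the leftover trees by largest fractional remainder.
    leftover = max_trees - sum(alloc.values())
    by_remainder = sorted(useful, key=lambda a: shares[a] - alloc[a], reverse=True)
    for a in by_remainder[:leftover]:
        alloc[a] += 1

    # Attributes that received no tree are not expanded at this node.
    return {a: n for a, n in alloc.items() if n > 0}
```

    Under this assumed rule, an attribute with three times the estimated cost reduction of another receives roughly three times as many candidate trees, and attributes with no estimated benefit receive none.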

    Machine Learning in Automated Text Categorization

    The automated categorization (or classification) of texts into predefined categories has witnessed a booming interest in the last ten years, due to the increased availability of documents in digital form and the ensuing need to organize them. In the research community the dominant approach to this problem is based on machine learning techniques: a general inductive process automatically builds a classifier by learning, from a set of preclassified documents, the characteristics of the categories. The advantages of this approach over the knowledge engineering approach (consisting in the manual definition of a classifier by domain experts) are very good effectiveness, considerable savings in terms of expert manpower, and straightforward portability to different domains. This survey discusses the main approaches to text categorization that fall within the machine learning paradigm. We will discuss in detail issues pertaining to three different problems, namely document representation, classifier construction, and classifier evaluation.

    Comment: Accepted for publication in ACM Computing Surveys

    Discrimination-aware classification

    Classifier construction is one of the most researched topics within the data mining and machine learning communities. Literally thousands of algorithms have been proposed. The quality of the learned models, however, depends critically on the quality of the training data. No matter which classifier inducer is applied, if the training data is incorrect, poor models will result. In this thesis, we study cases in which the input data is discriminatory and we are supposed to learn a classifier that optimizes accuracy, but does not discriminate in its predictions. Such situations occur naturally as artifacts of the data collection process when the training data is collected from different sources with different labeling criteria, when the data is generated by a biased decision process, or when the sensitive attribute, e.g., gender, serves as a proxy for unobserved features. In many situations, a classifier that detects and uses the racial or gender discrimination is undesirable for legal reasons. The concept of discrimination is illustrated by the following example: Throughout the years, an employment bureau recorded various parameters of job candidates. Based on these parameters, the company wants to learn a model for partially automating the matchmaking between a job and a job candidate. A match is labeled as successful if the company hires the applicant. It turns out, however, that the historical data is biased; for higher board functions, Caucasian males are systematically being favored. A model learned directly on this data will learn this discriminatory behavior and apply it in future predictions. From an ethical and legal point of view it is of course unacceptable that a model discriminating in this way is deployed. Our proposed solutions to the discrimination problem fall into two broad categories. First, we propose pre-processing methods to remove the discrimination from the training dataset. Second, we propose solutions to the discrimination problem that directly push the non-discrimination constraints into classification models or post-process built models. We further studied the discrimination-aware classification paradigm in the presence of explanatory attributes that are correlated with the sensitive attribute, e.g., low income may be explained by a low education level. In such a case, as we show, not all discrimination can be considered bad. Therefore, we introduce a new way of measuring discrimination, by explicitly splitting it up into explainable and bad discrimination, and propose methods to remove the bad discrimination only. We evaluated our discrimination-aware methods on real-world data sets. We observed in our experiments that our methods show promising results and clearly outperform the traditional classification model w.r.t. the accuracy-discrimination trade-off. To conclude, we believe that discrimination-aware classification is a new and exciting area of research addressing a societally relevant problem.
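
    The abstract does not define how discrimination is measured. As a hedged illustration only, one measure commonly used in this line of work is the difference in positive-prediction rates between the favored group and everyone else. The function name and the encoding (predictions as 0/1, one sensitive value per instance) are assumptions for this sketch and may differ from the thesis.

```python
def discrimination_score(predictions, sensitive, favored_group):
    """Difference in positive-prediction rates between two groups.

    predictions:   iterable of 0/1 classifier outputs (1 = positive decision)
    sensitive:     iterable of sensitive-attribute values, aligned with predictions
    favored_group: the sensitive value treated as the favored group (assumption)
    Returns a value in [-1, 1]; 0 means both groups receive positive
    predictions at the same rate.
    """
    pairs = list(zip(predictions, sensitive))
    favored = [p for p, s in pairs if s == favored_group]
    others = [p for p, s in pairs if s != favored_group]
    if not favored or not others:
        raise ValueError("both groups must be non-empty")
    # Positive rate of the favored group minus positive rate of the rest.
    return sum(favored) / len(favored) - sum(others) / len(others)
```

    In the employment example, a score well above zero for predictions on the historical data would indicate that the learned model reproduces the bias toward the favored group. This simple measure does not separate explainable from bad discrimination, which is the refinement the thesis introduces.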

    Utility-Aware Scheduling of Stochastic Real-Time Systems

    Time utility functions offer a reasonably general way to describe the complex timing constraints of real-time and cyber-physical systems. However, utility-aware scheduling policy design is an open research problem. In particular, scheduling policies that optimize expected utility accrual are needed for real-time and cyber-physical domains. This dissertation addresses the problem of utility-aware scheduling for systems with periodic real-time task sets and stochastic non-preemptive execution intervals. We model these systems as Markov Decision Processes. This model provides an evaluation framework by which different scheduling policies can be compared. By solving the Markov Decision Process we can derive value-optimal scheduling policies for moderate-sized problems. However, the time and memory complexity of computing and storing value-optimal scheduling policies necessitates the exploration of other, more scalable solutions. We consider heuristic schedulers, including a generalization we have developed for the existing Utility Accrual Packet Scheduling Algorithm. We compare several heuristics under soft and hard real-time conditions, different load conditions, and different classes of time utility functions. Based on these evaluations we present guidelines for which heuristics are best suited to particular scheduling criteria. Finally, we address the memory complexity of value-optimal scheduling, and examine trade-offs between optimality and memory complexity. We show that it is possible to derive good low-complexity scheduling decision functions based on a synthesis of heuristics and reduced-memory approximations of the value-optimal scheduling policy.
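
    To make the notion of expected utility accrual concrete, here is a minimal sketch assuming a linearly decaying time utility function and a discrete distribution over execution durations. Both the shape of the function and all names below are illustrative assumptions; the dissertation considers several classes of time utility functions and a full Markov Decision Process model, neither of which is reproduced here.

```python
def linear_decay_tuf(max_utility, soft_deadline, hard_deadline):
    """Build a time utility function: full utility up to soft_deadline,
    then linear decay to zero at hard_deadline (requires soft < hard)."""
    if soft_deadline >= hard_deadline:
        raise ValueError("soft_deadline must be earlier than hard_deadline")

    def utility(completion_time):
        if completion_time <= soft_deadline:
            return max_utility
        if completion_time >= hard_deadline:
            return 0.0
        remaining = hard_deadline - completion_time
        return max_utility * remaining / (hard_deadline - soft_deadline)

    return utility


def expected_utility(start_time, duration_pmf, tuf):
    """Expected utility of starting a non-preemptive job at start_time.

    duration_pmf: dict mapping execution duration -> probability (sums to 1)
    tuf:          function mapping completion time -> utility
    """
    return sum(p * tuf(start_time + d) for d, p in duration_pmf.items())
```

    A greedy heuristic could, for example, compare `expected_utility` across ready jobs and run the one with the highest value; a value-optimal policy would additionally account for how that choice affects the utility of later jobs.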

    Time and multiple objectives in scheduling and routing problems

    Many optimization problems encountered in practice are multi-objective by nature, i.e., different objectives are conflicting and equally important. Often it is not desirable to drop some of them or to optimize them as a composite single objective or in a hierarchical manner. Furthermore, cost parameters change over time, which makes optimization problems harder. For instance, in the transport sector, travel costs are a function of travel time, which changes depending on the time of day a vehicle is travelling (e.g., due to road congestion). Road congestion results in tremendous delays, which lead to a decrease in the service quality and the responsiveness of logistic service providers. In Chapter 2, we develop a generic approach to deal with Multi-Objective Scheduling Problems (MOSPs) with State-Dependent Cost Parameters. The aim is to determine the set of Pareto solutions that capture the trade-offs between the different conflicting objectives. Due to the complexity of MOSPs, an efficient approximation based on dynamic programming is developed. The approximation has a provable worst-case performance guarantee. Even though the generated approximate Pareto front consists of fewer solutions, it still provides good coverage of the true Pareto front. Furthermore, considerable gains in computation time are achieved. In Chapter 3, the developed methodology is validated on the multi-objective time-dependent knapsack problem. In the classical knapsack problem, the input consists of a knapsack with a finite capacity and a set of items, each with a certain weight and a cost. A feasible solution to the knapsack problem is a selection of items such that their total weight does not exceed the knapsack capacity. The goal is to maximize the single objective function consisting of the total profit of the selected items. We extend the classical knapsack problem in two ways. First, we consider time-dependent profits (e.g., in a retail environment profit depends on whether it is Christmas or not).
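
    For reference, the classical single-objective knapsack problem described above can be solved with a standard dynamic program over capacities. The sketch below covers only that classical baseline with integer weights; it is not the multi-objective, time-dependent method of the thesis, and the function name and the (weight, profit) tuple layout are assumptions made for illustration.

```python
def knapsack(capacity, items):
    """Maximum total profit of a 0/1 knapsack.

    capacity: non-negative integer knapsack capacity
    items:    list of (weight, profit) pairs with non-negative integer weights
    """
    # best[c] = best profit achievable with total weight at most c,
    # using only the items processed so far.
    best = [0] * (capacity + 1)
    for weight, profit in items:
        # Iterate capacities downward so each item is used at most once.
        for c in range(capacity, weight - 1, -1):
            best[c] = max(best[c], best[c - weight] + profit)
    return best[capacity]
```

    The table has one entry per capacity value, so the running time is proportional to the number of items times the capacity. A multi-objective variant would have to keep a set of non-dominated profit vectors per state instead of a single number, which is where an approximate Pareto front becomes attractive.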
