
    A lexicographic multi-objective genetic algorithm for multi-label correlation-based feature selection

    This paper proposes a new Lexicographic multi-objective Genetic Algorithm for Multi-Label Correlation-based Feature Selection (LexGA-ML-CFS), an extension of the earlier single-objective Genetic Algorithm for Multi-Label Correlation-based Feature Selection (GA-ML-CFS). The extension uses a LexGA as a global search method for generating candidate feature subsets. In our experiments, we compare the results obtained by LexGA-ML-CFS with those of the original hill-climbing-based ML-CFS, the single-objective GA-ML-CFS, and a baseline Binary Relevance method, using ML-kNN as the multi-label classifier. The results show that LexGA-ML-CFS improved predictive accuracy over the other methods in some cases, but in general there was no statistically significant difference between the results of LexGA-ML-CFS and those of the other methods.
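The abstract does not say which objectives LexGA-ML-CFS ranks or how ties are broken. As a rough illustration of lexicographic comparison in general, the sketch below compares two candidates objective by objective in priority order and moves to the next objective only when the current one differs by no more than a tolerance. The function name `lex_better`, the tolerance `tol`, and the "higher is better" convention are assumptions made for this example and are not taken from the paper.

```python
from typing import Sequence


def lex_better(a: Sequence[float], b: Sequence[float], tol: float = 0.01) -> bool:
    """Return True if candidate `a` beats candidate `b` lexicographically.

    Objectives are listed in decreasing priority and higher values are
    better (both are assumptions of this sketch). A lower-priority
    objective is consulted only when all higher-priority ones differ
    by at most `tol`.
    """
    for fa, fb in zip(a, b):
        if abs(fa - fb) > tol:
            # First objective with a meaningful difference decides.
            return fa > fb
    # All objectives are within tolerance: neither candidate wins.
    return False
```

With two hypothetical objectives, `lex_better((0.80, 0.5), (0.70, 0.9))` is decided by the first objective alone, whereas `lex_better((0.805, 0.9), (0.800, 0.5))` falls through to the second because the first differs by less than `tol`.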

    New Multi-Label Correlation-Based Feature Selection Methods for Multi-Label Classification and Application in Bioinformatics

    The very high dimensionality of real-world datasets is a challenging problem for classification algorithms, since many features are often redundant or irrelevant for classification. In addition, a very large number of features leads to high computational time for classification algorithms. Feature selection methods deal with high-dimensional data by selecting a relevant feature subset according to an evaluation criterion. The vast majority of research on feature selection involves conventional single-label classification problems, where each instance is assigned a single class label; but there has been growing research on more complex multi-label classification problems, where each instance can be assigned multiple class labels. This thesis proposes three types of new Multi-Label Correlation-based Feature Selection (ML-CFS) methods, namely: (a) methods based on hill-climbing search, (b) methods that exploit biological knowledge (still using hill-climbing search), and (c) methods based on genetic algorithms as the search method. Firstly, we proposed three versions of ML-CFS based on hill-climbing search. In essence, these versions extend the original CFS method by adapting the merit function (which evaluates candidate feature subsets) to the multi-label classification scenario, as well as modifying the merit function in other ways. A conventional search strategy, hill-climbing, was used to explore the space of candidate solutions (candidate feature subsets) for those three versions of ML-CFS, which are described in detail in Chapter 4. Secondly, in order to try to improve the performance of ML-CFS on cancer-related microarray gene expression datasets, we proposed three versions of the ML-CFS method that exploit biological knowledge. 
These ML-CFS versions are also based on hill-climbing search, but the merit function was modified in a way that favours the selection of genes (features) involved in pre-defined cancer-related pathways, as discussed in detail in Chapter 5. Lastly, we proposed two more sophisticated versions of ML-CFS based on Genetic Algorithms (rather than hill-climbing) as the search method. The first GA-based version uses a conventional single-objective GA, where there is only one objective to be optimized; the second performs lexicographic multi-objective optimization, where there are two objectives to be optimized, as discussed in detail in Chapter 6. In this thesis, all proposed ML-CFS methods for multi-label classification problems were evaluated by measuring the predictive accuracies obtained by two well-known multi-label classification algorithms when using the selected features, namely the Multi-Label k-Nearest Neighbours (ML-kNN) algorithm and the Back-Propagation Multi-Label Learning (BPMLL) neural network algorithm. In general, the results obtained by the best version of the proposed ML-CFS methods, namely a GA-based ML-CFS method, were competitive with the results of other multi-label feature selection methods and baseline approaches. More precisely, one of our GA-based methods achieved the second-best predictive accuracy out of all methods being compared (both with ML-kNN and with BPMLL as the classifier), but there was no statistically significant difference between that GA-based ML-CFS and the best method in terms of predictive accuracy. 
In addition, in the experiments with ML-kNN, the most accurate method selected about twice as many features as our GA-based ML-CFS; whilst in the experiments with BPMLL, the most accurate method was a baseline that performs no feature selection and runs the classifier once (with all original features) for each of the many class labels, which is very computationally expensive. In summary, one of the proposed GA-based ML-CFS methods achieved substantial data reduction (selecting a smaller subset of relevant features) without a significant decrease in predictive accuracy with respect to the most accurate method.
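The abstract above refers to the merit function of the original (single-label) CFS method without stating it. For orientation, the sketch below computes the standard CFS merit of a subset of k features from its average feature-class correlation and average feature-feature correlation. The function and parameter names are chosen for this example, and the multi-label modifications proposed in the thesis are not given in the abstract, so they are not reproduced here.

```python
import math


def cfs_merit(k: int, mean_feature_class_corr: float,
              mean_feature_feature_corr: float) -> float:
    """Merit of a k-feature subset under the original single-label CFS.

    merit = k * r_cf / sqrt(k + k * (k - 1) * r_ff)

    r_cf: average correlation between each selected feature and the class.
    r_ff: average pairwise correlation among the selected features.
    """
    if k < 1:
        raise ValueError("the subset must contain at least one feature")
    denominator = math.sqrt(k + k * (k - 1) * mean_feature_feature_corr)
    return k * mean_feature_class_corr / denominator
```

The merit rises when the selected features correlate with the class and falls when they are redundant with each other, which is why a search method such as hill-climbing or a GA can use it to score candidate subsets.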

    Simpler is better: a novel genetic algorithm to induce compact multi-label chain classifiers

    Multi-label classification (MLC) is the task of assigning multiple class labels to an object based on the features that describe the object. One of the most effective MLC methods is known as Classifier Chains (CC). This approach consists in training q binary classifiers linked in a chain, y1 → y2 → ... → yq, with each responsible for classifying a specific label in {l1, l2, ..., lq}. The chaining mechanism allows each individual classifier to incorporate the predictions of the previous ones as additional information at classification time. Thus, possible correlations among labels can be automatically exploited. Nevertheless, CC suffers from two important drawbacks: (i) the label ordering is decided at random, although it usually has a strong effect on predictive accuracy; (ii) all labels are inserted into the chain, although some of them might carry irrelevant information to discriminate the others. In this paper we tackle both problems at once, by proposing a novel genetic algorithm capable of searching for a single optimized label ordering, while at the same time taking into consideration the utilization of partial chains. Experiments on benchmark datasets demonstrate that our approach is able to produce models that are both simpler and more accurate
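To make the chaining mechanism concrete, the following is a minimal sketch of prediction with a classifier chain: each binary classifier sees the original features plus the predictions of the labels that precede it in the chain. The names `predict_chain`, `chain`, and `classifiers` are hypothetical, the toy classifiers are hand-written rules rather than trained models, and the paper's genetic algorithm (how it encodes, searches, and evaluates orderings or partial chains) is not reproduced.

```python
from typing import Callable, Dict, List, Sequence


def predict_chain(x: Sequence[float], chain: List[int],
                  classifiers: Dict[int, Callable[[List[float]], int]]) -> Dict[int, int]:
    """Predict labels one at a time, in the order given by `chain`.

    Each classifier receives the original features followed by the 0/1
    predictions made for the labels placed before it in the chain.
    """
    augmented = list(x)
    predictions: Dict[int, int] = {}
    for label in chain:
        y = classifiers[label](augmented)
        predictions[label] = y
        augmented.append(y)  # later classifiers can exploit this prediction
    return predictions


# Toy rules standing in for trained binary classifiers (assumptions).
toy_classifiers = {
    0: lambda feats: 1 if feats[0] > 0.5 else 0,
    # Label 2 fires only if an earlier label in the chain was predicted as 1.
    2: lambda feats: 1 if len(feats) > 1 and feats[-1] == 1 else 0,
}
```

Calling `predict_chain([0.9], [0, 2], toy_classifiers)` and `predict_chain([0.9], [2, 0], toy_classifiers)` yields different predictions for label 2, a small illustration of why the label ordering matters.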

    A survey of genetic algorithms for multi-label classification

    In recent years, multi-label classification (MLC) has become an emerging research topic in big data analytics and machine learning. In this problem, each object of a dataset may belong to multiple class labels and the goal is to learn a classification model that can infer the correct labels of new, previously unseen, objects. This paper presents a survey of genetic algorithms (GAs) designed for MLC tasks. The study is organized in three parts. First, we propose a new taxonomy focused on GAs for MLC. In the second part, we provide an up-to-date overview of the work in this area, categorizing the approaches identified in the literature with respect to the taxonomy. In the third and last part, we discuss some new ideas for combining GAs with MLC

    On learning and visualizing lexicographic preference trees

    Preferences are very important in research fields such as decision making, recommender systems and marketing. The focus of this thesis is on preferences over combinatorial domains, which are domains of objects configured with categorical attributes. For example, the domain of cars includes car objects that are constructed with values for attributes such as ‘make’, ‘year’, ‘model’, ‘color’, ‘body type’ and ‘transmission’. Different values can instantiate an attribute. For instance, values for attribute ‘make’ can be Honda, Toyota, Tesla or BMW, and attribute ‘transmission’ can be automatic or manual. To this end, this thesis studies problems of preference visualization and learning for lexicographic preference trees, graphical preference models that are often compact over complex domains of objects built from categorical attributes. Visualizing preferences is essential to provide users with insights into the process of decision making, while learning preferences from data is practically important, as it is ineffective to elicit preference models directly from users. The results of this thesis fall into two parts: 1) for preference visualization, a web-based system is created that visualizes various types of lexicographic preference tree models learned by a greedy learning algorithm; 2) for preference learning, a genetic algorithm, called GA, is designed and implemented that learns a restricted type of lexicographic preference tree, called unconditional importance and unconditional preference trees, or UIUP trees for short. Experiments show that GA achieves higher accuracy than the greedy algorithm at the cost of more computational time. Moreover, a Dynamic Programming Algorithm (DPA) was devised and implemented that computes an optimal UIUP tree model, in the sense that it satisfies as many examples as possible in the dataset. 
This novel exact algorithm (DPA) was used to evaluate the quality of the models computed by GA, and it was found to reduce the factorial time complexity of the brute-force algorithm to exponential. The major contribution of this thesis to the field of machine learning and data mining is the novel learning algorithm DPA, which is an exact algorithm. DPA searches the huge space of UIUP tree models and finds the one that correctly classifies the largest number of examples in the training dataset; such a model is referred to as the optimal model in this thesis. Finally, using datasets produced from randomly generated UIUP trees, this thesis presents experimental results on the performance (e.g., accuracy and computational time) of GA compared to the existing greedy algorithm and DPA.

    New Techniques and Algorithms for Multiobjective and Lexicographic Goal-Based Shortest Path Problems

    Shortest Path Problems (SPP) are among the most extensively studied problems in the fields of Artificial Intelligence (AI) and Operations Research (OR). The SPP consists in finding a path between two given nodes in a graph such that the sum of the weights of its constituent arcs is minimized. However, real-life problems frequently involve the consideration of multiple, and often conflicting, criteria. When multiple objectives must be simultaneously optimized, the concept of a single optimal solution is no longer valid. Instead, a set of efficient or Pareto-optimal solutions defines the optimal trade-off between the objectives under consideration. The Multicriteria Search Problem (MSP), or Multiobjective Shortest Path Problem, is the natural extension of the SPP when more than one criterion is considered. The MSP is computationally harder than the single-objective problem. The number of label expansions can grow exponentially with solution depth, even for the two-objective case. However, with the assumption of bounded integer costs and a fixed number of objectives, the problem becomes tractable for polynomially sized graphs. A wide variety of practical applications in different fields can be identified for the MSP, such as robot path planning, hazardous material transportation, route planning, optimization of public transportation, QoS in networks, or routing in multimedia networks. Goal programming is one of the most successful Multicriteria Decision Making (MCDM) techniques used in Multicriteria Optimization. In this thesis we explore one of its variants in the MSP. Thus, we aim to solve the Multicriteria Search Problem with lexicographic goal-based preferences. To do so, we build on previous work on the algorithm NAMOA*, a successful extension of the A* algorithm to the multiobjective case. 
More precisely, we provide a new algorithm called LEXGO*, an exact label-setting algorithm that returns the subset of Pareto-optimal paths that satisfy a set of lexicographic goals, or the subset that minimizes deviation from the goals if these cannot be fully satisfied. Moreover, LEXGO* is proved to be admissible and expands only a subset of the labels expanded by an optimal algorithm like NAMOA*, which performs a full Multiobjective Search. Since time rather than memory is the limiting factor in the performance of multicriteria search algorithms, we also propose a new technique called t-discarding to speed up dominance checks in the process of discarding new alternatives during the search. The application of t-discarding to the algorithms studied previously, NAMOA* and LEXGO*, leads to the introduction of two new time-efficient algorithms named NAMOA*dr and LEXGO*dr, respectively. All the algorithmic alternatives are tested in two scenarios: random grids and realistic road map problems. The experimental evaluation shows the effectiveness of LEXGO* in both benchmarks, as well as the dramatic reductions in time requirements achieved by the t-discarding versions of the algorithms with respect to the ones with traditional pruning.
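The dominance checks mentioned in the abstract above rest on the usual Pareto dominance relation between cost vectors. The sketch below shows that relation and a plain quadratic-time filter that keeps only non-dominated vectors, assuming every objective is minimized. It is a baseline illustration only: it is not t-discarding, NAMOA*, or LEXGO*, and the function names are chosen for this example.

```python
from typing import List, Sequence, Tuple


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """True if cost vector `a` Pareto-dominates `b` (all objectives minimized).

    `a` must be no worse than `b` in every objective and strictly
    better in at least one.
    """
    no_worse = all(x <= y for x, y in zip(a, b))
    strictly_better = any(x < y for x, y in zip(a, b))
    return no_worse and strictly_better


def pareto_front(costs: List[Tuple[float, ...]]) -> List[Tuple[float, ...]]:
    """Keep the cost vectors that no other vector in `costs` dominates."""
    return [c for c in costs if not any(dominates(other, c) for other in costs)]
```

For example, among the path costs `(1, 5)`, `(2, 2)`, `(3, 3)` and `(5, 1)`, only `(3, 3)` is discarded, because `(2, 2)` is better in both objectives.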

    A distributed and energy‑efficient KNN for EEG classification with dynamic money‑saving policy in heterogeneous clusters

    Universidad de Granada/CBUA. Spanish Ministry of Science, Innovation, and Universities under Grants PGC2018-098813-B-C31, PID2022-137461NB-C32. ERDF fund. Funding for open access charge: University of Granada/ CBU

    The case for hybrid multi-objective optimisation in high-stakes machine learning applications

    Most classification (supervised learning) algorithms optimise a single objective, typically the predictive performance of the learned classification model. However, in high-stakes classification applications, involving e.g. decisions about whether or not an individual should undergo surgery, be granted a loan or be hired for a job, there is often a need to optimise multiple objectives, such as the predictive performance, interpretability or fairness of the learned model. In this context, this position paper discusses the pros and cons of two different multi-objective optimisation approaches (the Pareto and the lexicographic approaches), and proposes a conceptual framework for hybrid multi-objective optimisation that combines those two approaches.

    CLOUD-BASED MACHINE LEARNING AND SENTIMENT ANALYSIS

    The role of a Data Scientist is becoming increasingly ubiquitous as companies and institutions see the need to gain additional insights and information from data in order to make better decisions and improve the quality of service delivered to customers. This thesis comprises three data science projects aimed at improving the tools and techniques used in analyzing and evaluating data. The first research study used a standard cybersecurity dataset, to which cloud-based auto-machine learning algorithms were applied to detect vulnerabilities in the network traffic data. The performance of the algorithms was measured and compared using standard evaluation metrics. The second research study involved text-mining social media, specifically Reddit. We mined up to 100,000 comments in multiple subreddits and tested for hate speech via a custom-designed version of the Python Vader sentiment analysis package. Our work integrated standard sentiment analysis with Hatebase.org, and we demonstrate that our new method can better detect hate speech in social media. Following sentiment analysis and hate speech detection, in the third research project we applied statistical techniques to evaluate the significance of differences in text analytics, specifically in the sentiment categories produced by lexicon-based software and cloud-based tools. We compared the three big cloud providers (AWS, Azure, and GCP) with the standard Python Vader sentiment analysis library. We used statistical analysis to determine that there is a significant difference between the cloud platforms, as well as Vader, and demonstrated that each platform is unique in its analysis scoring mechanism.