2,420 research outputs found

    A lexicographic multi-objective genetic algorithm for multi-label correlation-based feature selection

    This paper proposes a new Lexicographic multi-objective Genetic Algorithm for Multi-Label Correlation-based Feature Selection (LexGA-ML-CFS), which is an extension of the previous single-objective Genetic Algorithm for Multi-Label Correlation-based Feature Selection (GA-ML-CFS). This extension uses a LexGA as a global search method for generating candidate feature subsets. In our experiments, we compare the results obtained by LexGA-ML-CFS with those obtained by the original hill climbing-based ML-CFS, the single-objective GA-ML-CFS and a baseline Binary Relevance method, using ML-kNN as the multi-label classifier. The results show that LexGA-ML-CFS improved predictive accuracy over the other methods in some cases, but in general there was no statistically significant difference between the results of LexGA-ML-CFS and those of the other methods.
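    To make the lexicographic idea concrete, the sketch below compares two candidate feature subsets objective by objective in priority order, only falling through to a lower-priority objective when the higher-priority scores are within a tolerance. The objective names, their ordering, and the tolerance value are illustrative assumptions, not the settings used by LexGA-ML-CFS.

```python
# Hypothetical sketch: lexicographic comparison of two candidate feature
# subsets inside a GA. Objective names, ordering, and the tolerance value
# are assumptions for illustration, not the paper's actual settings.

def lexicographic_better(cand_a, cand_b, objectives, tolerance=0.01):
    """Return True if cand_a is lexicographically better than cand_b.

    `objectives` is a list of (function, maximize) pairs ordered by priority;
    a later objective only matters when earlier ones are within `tolerance`.
    """
    for objective, maximize in objectives:
        a, b = objective(cand_a), objective(cand_b)
        if not maximize:
            a, b = -a, -b
        if abs(a - b) > tolerance:   # objectives differ meaningfully
            return a > b             # decide on this higher-priority objective
    return False                     # effectively tied on every objective

# Example: prefer higher correlation-based merit, then smaller subsets.
objectives = [
    (lambda subset: subset["merit"], True),
    (lambda subset: -len(subset["features"]), True),
]
a = {"merit": 0.620, "features": [1, 4, 9]}
b = {"merit": 0.615, "features": [1, 4]}
print(lexicographic_better(a, b, objectives))  # False: merits tie, b is smaller
```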

    An evaluation of feature selection methods on multi-class imbalance and high dimensionality shape-based leaf image features

    Multi-class imbalanced, shape-based leaf image features require a feature subset that appropriately represents the leaf shape. Multi-class imbalanced data is a type of classification problem in which some classes are highly underrepresented compared to others. This occurs when at least one class, known as the minority class, is represented by only a few training samples compared to the classes that make up the majority. To address this issue in shape-based leaf image feature extraction, this paper discusses the evaluation of several feature selection methods available in Weka and a wrapper-based genetic algorithm feature selection.

    Examining applying high performance genetic data feature selection and classification algorithms for colon cancer diagnosis

    Background and Objectives: This paper examines the accuracy and efficiency (time complexity) of high-performance genetic data feature selection and classification algorithms for colon cancer diagnosis. The need for this research derives from the urgent and increasing need for accurate and efficient algorithms. Colon cancer is a leading cause of death worldwide, hence it is vitally important for cancer tissues to be expertly identified and classified in a rapid and timely manner, both to assure fast detection of the disease and to expedite the drug discovery process. Methods: In this research, a three-phase approach was proposed and implemented: Phases One and Two examined the feature selection algorithms and classification algorithms separately, and Phase Three examined the performance of their combination. Results: Phase One found that the Particle Swarm Optimization (PSO) algorithm performed best on the colon dataset for feature selection (29 genes selected), and Phase Two found that the Support Vector Machine (SVM) algorithm outperformed the other classifiers, with an accuracy of almost 86%. Phase Three found that the combined use of PSO and SVM surpassed the other algorithms in accuracy and performance, and was faster in terms of time analysis (94%). Conclusions: It is concluded that applying feature selection algorithms prior to classification algorithms results in better accuracy than when the latter are applied alone. This conclusion is important and significant to industry and society.
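    The following sketch illustrates the wrapper-style evaluation implied by pairing a feature-selection search such as PSO with an SVM: a binary mask selects a subset of genes and cross-validated SVM accuracy serves as the fitness to maximize. The synthetic data and the random search standing in for the PSO update loop are assumptions for illustration, not the paper's colon-cancer dataset or implementation.

```python
# Minimal sketch of the wrapper evaluation used when pairing a feature-selection
# search (e.g., PSO) with an SVM classifier; the dataset and the random-search
# loop are stand-ins, not the paper's actual colon-cancer data or PSO code.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=60, n_features=200, n_informative=10,
                           random_state=0)  # synthetic stand-in for gene data

def fitness(mask):
    """Cross-validated SVM accuracy on the genes switched on by `mask`."""
    if not mask.any():
        return 0.0
    return cross_val_score(SVC(kernel="linear"), X[:, mask], y, cv=5).mean()

# A PSO (or any metaheuristic) would evolve binary masks to maximize `fitness`;
# here a simple random search stands in to show how the objective is queried.
rng = np.random.default_rng(0)
best_mask, best_acc = None, -1.0
for _ in range(20):
    mask = rng.random(X.shape[1]) < 0.15      # roughly 30 of 200 genes on
    acc = fitness(mask)
    if acc > best_acc:
        best_mask, best_acc = mask, acc
print(f"{best_mask.sum()} genes selected, CV accuracy {best_acc:.2f}")
```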

    Freight transportation and the environment: Using geographic information systems to inform goods movement policy

    The freight transportation sector is a major emitter of the greenhouse gas carbon dioxide (CO2), which has been recognized by numerous experts and science organizations as a significant contributor to climate change. The purpose of this thesis is to develop a framework for obtaining the freight flows for containerized goods movement through the U.S. marine, highway, and rail systems and to estimate CO2 emissions associated with the freight traffic along interstate corridors that serve the three major U.S. ports on the West Coast, namely the Ports of Los Angeles and Long Beach, the Port of Oakland, and the Port of Seattle. This thesis utilizes the Geospatial Intermodal Freight Transportation (GIFT) model, a Geographic Information Systems (GIS) based model that links the U.S. and Canadian water, rail, and road transportation networks through intermodal transfer facilities. The inclusion of environmental attributes of the transportation modes (trucks, locomotives, vessels) traversing the network is what makes GIFT a unique tool for helping policy analysts and decision makers understand the environmental, economic, and energy impacts of intermodal freight transportation. In this research, GIFT is used to model the volumes of freight flowing between multiple origins and destinations and to demonstrate the potential of system improvements in addressing environmental issues related to freight transport. Overall, this thesis demonstrates how the GIFT model, configured with California-specific freight data, can be used to improve understanding and decision-making associated with freight transport at regional scales.

    Attribute Selection Algorithm with Clustering based Optimization Approach based on Mean and Similarity Distance

    With hundreds or thousands of attributes in high-dimensional data, the computational workload is challenging. Attributes that have no meaningful influence on class prediction add to the computing load during classification. The goal of this article is to use attribute selection to reduce the size of high-dimensional data and thus lessen the computational load, while selecting attribute subsets that still cover the information carried by all attributes. The process therefore has two stages: filtering out superfluous attributes, and settling on a single attribute to represent a group of similar, largely redundant ones. Numerous studies on attribute selection, including backward and forward selection, have been undertaken. Based on this experiment and the accuracy of the classification results, a k-means based PSO clustering attribute selection is recommended. Related attributes are likely to fall in the same cluster, while irrelevant attributes are not assigned to any cluster. Datasets for Credit Approval, Ionosphere, Annealing, Madelon, Isolet, and Multiple Attributes are employed alongside two other high-dimensional datasets; each dataset includes the class label for every data point. Our tests demonstrate that attribute selection using k-means clustering can provide a subset of attributes and that doing so produces classification accuracy above 80%.
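    A minimal sketch of clustering-based attribute selection in this spirit is given below: attributes are clustered with k-means on their standardized column profiles, and one representative per cluster (the attribute nearest the centroid) is kept. The dataset, the value of k, and the representative rule are illustrative assumptions rather than the article's exact procedure.

```python
# Hedged sketch of clustering-based attribute selection: cluster the attributes
# with k-means and keep one representative per cluster. The representative rule
# (closest to the cluster centroid) and k are illustrative choices only.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
profiles = StandardScaler().fit_transform(X).T   # one row per attribute

k = 8
km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(profiles)

selected = []
for c in range(k):
    members = np.where(km.labels_ == c)[0]
    dists = np.linalg.norm(profiles[members] - km.cluster_centers_[c], axis=1)
    selected.append(members[np.argmin(dists)])   # attribute nearest the centroid

print("selected attribute indices:", sorted(selected))
```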

    An ensemble of intelligent water drop algorithm for feature selection optimization problem

    Master River Multiple Creeks Intelligent Water Drops (MRMC-IWD) is an ensemble model of the intelligent water drop algorithm in which a divide-and-conquer strategy is utilized to improve the search process. In this paper, the potential of the MRMC-IWD is assessed using real-world optimization problems related to feature selection and classification tasks. An experimental study on a number of publicly available benchmark data sets and two real-world problems, namely human motion detection and motor fault detection, is conducted. Comparative studies pertaining to feature reduction and classification accuracy using different evaluation techniques (consistency-based, CFS, and FRFS) and classifiers (i.e., C4.5, VQNN, and SVM) are conducted. The results ascertain the effectiveness of the MRMC-IWD in improving the performance of the original IWD algorithm as well as in undertaking real-world optimization problems.

    A Comparative Analysis for Filter-Based Feature Selection Techniques with Tree-based Classification

    Feature selection is a crucial pre-processing step in research areas such as Data Mining, Text Mining, and Image Processing. Raw datasets for machine learning comprise many attributes and can be very large, and they are used for making predictions. When such datasets are used for classification, the presence of many inconsistent and redundant features consumes more time and resources and degrades the classification results. To improve the efficiency and performance of classification, these features have to be eliminated. A variety of feature subset selection methods have been presented to find and eliminate as many redundant and useless features as feasible. This research work provides a comparative analysis of filter-based feature selection techniques with tree-based classification. Several feature selection techniques and classifiers are applied to different datasets using the Weka tool. In this comparative analysis, we evaluated the performance of six different feature selection techniques and their effects on decision tree classifiers using 10-fold cross-validation on three datasets. Analysis of the results shows that the ChiSquaredAttributeEval + Ranker search feature selection method with the Random Forest classifier beats the other methods for effective and efficient evaluation, and it is applicable to numerous real datasets in several application domains.
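    As a rough scikit-learn analogue of the best-performing Weka pipeline named above (ChiSquaredAttributeEval + Ranker feeding a Random Forest, scored with 10-fold cross-validation), the sketch below ranks attributes by the chi-squared statistic, keeps the top k, and cross-validates a Random Forest on them. The dataset and the number of retained attributes are assumptions for illustration.

```python
# Rough scikit-learn analogue of a chi-squared ranking + Random Forest pipeline;
# the dataset and k are illustrative, not the datasets used in the study.
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

X, y = load_digits(return_X_y=True)       # non-negative features, as chi2 requires

pipe = make_pipeline(
    SelectKBest(chi2, k=20),               # rank attributes by chi-squared, keep top 20
    RandomForestClassifier(n_estimators=100, random_state=0),
)
scores = cross_val_score(pipe, X, y, cv=10)
print(f"10-fold CV accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```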

    Determining appropriate approaches for using data in feature selection

    Feature selection is increasingly important in data analysis and machine learning in the big data era. However, how to use the data in feature selection, i.e. using either ALL or PART of a dataset, has become a serious and tricky issue. Whilst the conventional practice of using all the data in feature selection may lead to selection bias, using part of the data may, on the other hand, lead to underestimating the relevant features under some conditions. This paper investigates these two strategies systematically in terms of reliability and effectiveness, and then determines their suitability for datasets with different characteristics. The reliability is measured by the Average Tanimoto Index and the Inter-method Average Tanimoto Index, and the effectiveness is measured by the mean generalisation accuracy of classification. The computational experiments are carried out on ten real-world benchmark datasets and fourteen synthetic datasets. The synthetic datasets are generated with a pre-set number of relevant features and varied numbers of irrelevant features and instances, with different levels of added noise. The results indicate that the PART approach is more effective in reducing the bias when the size of a dataset is small, but starts to lose its advantage as the dataset size increases.
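    The sketch below shows one common way a Tanimoto-based stability score can be computed for feature selection: the Tanimoto (Jaccard) index between each pair of selected subsets, averaged over all pairs of runs. This follows the usual definition; the paper's exact Average Tanimoto Index and its inter-method variant may differ in detail.

```python
# Sketch of a Tanimoto-style stability score for feature selection: the
# Tanimoto (Jaccard) index between two selected subsets, averaged over all
# pairs of runs. The paper's exact ATI definition may differ in detail.
from itertools import combinations

def tanimoto(a, b):
    """|A ∩ B| / |A ∪ B| for two sets of selected feature indices."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 1.0

def average_tanimoto_index(subsets):
    """Mean pairwise Tanimoto similarity across the subsets from several runs."""
    pairs = list(combinations(subsets, 2))
    return sum(tanimoto(a, b) for a, b in pairs) / len(pairs)

# Feature subsets selected on, say, three resamples of the same dataset.
runs = [{0, 2, 5, 9}, {0, 2, 5, 7}, {0, 2, 9}]
print(round(average_tanimoto_index(runs), 3))
```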
    • …