11 research outputs found

    An Approach for Optimal Feature Subset Selection using a New Term Weighting Scheme and Mutual Information

    With the development of the web, large numbers of documents are available on the Internet, and their number is growing drastically day by day. Automatic text categorization has therefore become increasingly important for dealing with massive data. However, the major problem of document categorization is the high dimensionality of the feature space. Techniques that decrease the feature dimension without degrading recognition performance are known as feature extraction or feature selection. Working with a reduced, relevant feature set can be more efficient and effective. The objective of feature selection is to find a subset of features that retains the characteristics of the full feature set. Dependency among features is also important for classification, and over the past years various metrics have been proposed to measure the dependency among different features. A popular approach is maximal-relevance feature selection: selecting the features with the highest relevance to the target class. The new term weighting scheme we propose yields substantial improvements in dimensionality reduction of the feature space. The experimental results clearly show that this integrated method works far better than the alternatives.
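The maximal-relevance idea described above can be sketched as ranking features by their empirical mutual information with the class label; the discrete-feature MI estimator and the top-k cut-off below are illustrative assumptions, not the paper's term weighting scheme:

```python
import numpy as np

def mutual_information(x, y):
    """Empirical mutual information (in nats) between two discrete arrays."""
    mi = 0.0
    for xv in np.unique(x):
        for yv in np.unique(y):
            pxy = np.mean((x == xv) & (y == yv))
            px, py = np.mean(x == xv), np.mean(y == yv)
            if pxy > 0:
                mi += pxy * np.log(pxy / (px * py))
    return mi

def max_relevance_select(X, y, k):
    """Rank features by mutual information with the class label, keep top k."""
    scores = [mutual_information(X[:, j], y) for j in range(X.shape[1])]
    return np.argsort(scores)[::-1][:k]
```

A feature identical to the label gets MI equal to the label entropy, while an independent feature scores near zero, so the ranking favours relevant features.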

    Improving Floating Search Feature Selection using Genetic Algorithm

    Classification, the process of predicting the class of a given input, is one of the most fundamental tasks in data mining. Classification performance is negatively affected by noisy data, and therefore selecting features relevant to the problem is a critical step in classification, especially when applied to large datasets. In this article, a novel filter-based floating search technique for selecting an optimal set of features for classification is proposed. A genetic algorithm is employed to improve the quality of the features selected by the floating search method in each iteration, and a criterion function is applied to select relevant, high-quality features that can improve classification accuracy. The proposed method was evaluated using 20 standard machine learning datasets of various sizes and complexity. The results show that the proposed method is effective in general across different classifiers and performs well in comparison with recently reported techniques. In addition, combining the proposed method with a support vector machine provides the best performance among the classifiers studied and outperforms previously reported results on the majority of datasets.
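The floating search underlying the proposed method can be sketched as follows; the criterion function is left abstract and the paper's GA refinement step is omitted, so this is only the plain sequential forward floating selection skeleton:

```python
def sffs(score, n_features, k):
    """Sequential forward floating selection (plain skeleton).

    `score(subset)` is any higher-is-better criterion function. The GA step
    the paper adds after each iteration is omitted, and there is no cycle
    guard (a full SFFS tracks the best subset found for every size)."""
    selected = []
    while len(selected) < k:
        # forward step: add the single most promising remaining feature
        rest = [f for f in range(n_features) if f not in selected]
        best = max(rest, key=lambda f: score(selected + [f]))
        selected.append(best)
        # floating step: drop features whose removal improves the criterion
        improved = True
        while improved and len(selected) > 1:
            improved = False
            for f in selected:
                if f == best:        # never drop the feature just added
                    continue
                reduced = [g for g in selected if g != f]
                if score(reduced) > score(selected):
                    selected = reduced
                    improved = True
                    break
    return selected
```

The conditional backward step is what distinguishes floating search from plain forward selection: a feature added early can still be discarded once better companions are in the subset.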

    The GridDC Method for Reducing Training Sets

    The aim of this work is to develop a method for reducing training sets in recognition systems. A new training-set reduction method, GridDC (Grid-density-center method), is proposed. It is based on covering the feature space with a grid and selecting a single object per cell as an object of the new training set. A principle for forming the grid and ways of constructing the objects of the reduced training set are proposed. To assess the effectiveness of the proposed method, a comparative experimental analysis against known training-set reduction methods was carried out; it showed that GridDC increases classification accuracy while decreasing the size of the training set.
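A minimal sketch of the grid-covering idea follows; using a uniform grid and the cell mean as each cell's representative object are assumptions made here, since the paper defines its own rules for constructing the cell object:

```python
import numpy as np

def grid_reduce(X, n_cells=4):
    """GridDC-style training-set reduction sketch: cover the feature space
    with a uniform grid and keep one representative per occupied cell.
    The cell mean as representative is an illustrative choice."""
    X = np.asarray(X, dtype=float)
    lo, hi = X.min(axis=0), X.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)   # avoid division by zero
    # map every sample to its integer grid-cell index along each dimension
    idx = np.minimum(((X - lo) / span * n_cells).astype(int), n_cells - 1)
    cells = {}
    for cell, x in zip(map(tuple, idx), X):
        cells.setdefault(cell, []).append(x)
    return np.array([np.mean(v, axis=0) for v in cells.values()])
```

The reduced set has at most one object per occupied cell, so its size is bounded by the number of cells regardless of how large the original training set is.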

    Feature Selection for MAUC-Oriented Classification Systems

    Feature selection is an important pre-processing step for many pattern classification tasks. Traditionally, feature selection methods are designed to obtain a feature subset that leads to high classification accuracy. However, classification accuracy has recently been shown to be an inappropriate performance metric for classification systems in many cases, and the Area Under the receiver operating characteristic Curve (AUC) and its multi-class extension, MAUC, have been shown to be better alternatives. Hence, the target of classification system design is gradually shifting from seeking a system with maximum classification accuracy to obtaining a system with maximum AUC/MAUC. Previous investigations have shown that traditional feature selection methods need to be modified to cope with this new objective; moreover, the modified methods are most often restricted to binary classification problems. In this study, a filter feature selection method, namely MAUC Decomposition based Feature Selection (MDFS), is proposed for multi-class classification problems. To the best of our knowledge, MDFS is the first method specifically designed to select features for building classification systems with maximum MAUC. Extensive empirical results demonstrate the advantage of MDFS over several compared feature selection methods. Comment: a journal-length paper.
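The MAUC that the method targets is Hand and Till's average of pairwise AUCs over all class pairs; a plain sketch of that metric (not the paper's decomposition) is:

```python
from itertools import combinations

def auc(pos, neg):
    """Probability that a positive example outscores a negative (ties = 1/2)."""
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def mauc(scores, labels):
    """Hand-and-Till MAUC: the average of pairwise AUCs over all class pairs.
    `scores[i][c]` is the classifier's score for class c on sample i, and
    class labels are assumed to index into those score vectors."""
    classes = sorted(set(labels))
    total = 0.0
    for a, b in combinations(classes, 2):
        sa = [s[a] for s, l in zip(scores, labels) if l == a]
        sb = [s[a] for s, l in zip(scores, labels) if l == b]
        ta = [s[b] for s, l in zip(scores, labels) if l == a]
        tb = [s[b] for s, l in zip(scores, labels) if l == b]
        total += 0.5 * (auc(sa, sb) + auc(tb, ta))  # A(a|b) and A(b|a)
    n = len(classes)
    return 2.0 * total / (n * (n - 1))
```

Unlike accuracy, this metric is insensitive to class priors, which is why it is preferred for imbalanced multi-class problems.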

    Discriminant analysis of multi sensor data fusion based on percentile forward feature selection

    Feature extraction is a widely used approach to extract significant features in multi-sensor data fusion. However, it suffers from some drawbacks, the biggest being its failure to identify discriminative features within multi-group data. Thus, this study proposes a new discriminant analysis of multi-sensor data fusion using feature selection based on the unbounded and bounded Mahalanobis distances, replacing the feature extraction approach at the low and intermediate levels of data fusion. This study also develops percentile forward feature selection (PFFS) to identify discriminative features feasible for sensor data classification. The proposed discriminant procedure begins by computing the average distance between the groups using the unbounded and bounded distances. Feature selection then proceeds by ranking the fused features at the low and intermediate levels according to the computed distances, and feature subsets are selected using the PFFS. The constructed classification rules are evaluated by classification accuracy. The whole investigation was carried out on ten e-nose and e-tongue sensor datasets. The findings indicate that the bounded Mahalanobis distance is superior at selecting important features, needing fewer features than the unbounded criterion; moreover, with the bounded distance approach, feature selection using the PFFS obtains higher classification accuracy. The overall proposed procedure is found fit to replace the traditional discriminant analysis of multi-sensor data fusion due to its greater discriminative power and faster convergence to higher accuracy. In conclusion, feature selection can solve the problems of feature extraction, and the proposed PFFS proves effective at selecting feature subsets of higher accuracy with faster computation. The study also establishes the advantages of the unbounded and bounded Mahalanobis distances for feature selection in high-dimensional data, which benefits both engineers and statisticians in sensor technology.
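The unbounded between-group Mahalanobis distance used to rank fused features can be sketched as below; the pooled-covariance definition is an assumption, and the paper's bounded variant is not reproduced:

```python
import numpy as np

def mahalanobis_between_groups(X1, X2):
    """Unbounded Mahalanobis distance between two group means using the
    pooled within-group covariance. Each group needs at least two samples."""
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    n1, n2 = len(X1), len(X2)
    # pooled covariance = weighted average of the two within-group scatters
    pooled = ((n1 - 1) * np.cov(X1, rowvar=False)
              + (n2 - 1) * np.cov(X2, rowvar=False)) / (n1 + n2 - 2)
    d = m1 - m2
    return float(np.sqrt(d @ np.linalg.solve(np.atleast_2d(pooled), d)))
```

Features (or feature subsets) whose groups sit many pooled standard deviations apart score highly and are ranked first.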

    Evaluation of Classifiers in Software Fault-Proneness Prediction

    The reliability of software depends on its fault-prone modules: the fewer fault-prone units a piece of software contains, the more we may trust it. Therefore, if we are able to predict the number of fault-prone modules in software, it becomes possible to judge its reliability. In predicting fault-prone modules, one of the contributing feature types is software metrics, by which software modules can be classified into fault-prone and non-fault-prone ones. To make such a classification, we investigated 17 classifiers whose features (attributes) are software metrics (39 metrics) and whose instances (software modules) come from 13 datasets reported by NASA. However, two important issues influence prediction accuracy when using data mining methods: (1) selecting the most influential features (i.e., software metrics) when there is a wide diversity of them, and (2) instance sampling to balance the two imbalanced classes, which would otherwise bias the classifier towards the majority class. Based on feature selection and instance sampling, we considered four scenarios in the appraisal of the 17 classifiers for predicting software fault-prone modules. To select features we used Correlation-based Feature Selection (CFS), and to sample instances we used the Synthetic Minority Oversampling Technique (SMOTE). Empirical results showed that suitable sampling of software modules significantly influences the accuracy of predicting software reliability, whereas metric selection has no considerable effect on the prediction.
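SMOTE, the sampling step described above, can be sketched in a few lines; this is a minimal version (libraries such as imbalanced-learn provide a full implementation):

```python
import numpy as np

def smote(X_min, n_new, k=3, rng=None):
    """Minimal SMOTE sketch: for each synthetic sample, pick a random
    minority point, one of its k nearest minority neighbours, and
    interpolate a random fraction of the way between the two."""
    rng = np.random.default_rng(rng)
    X_min = np.asarray(X_min, dtype=float)
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        # distances from the chosen point to every minority point
        dists = np.linalg.norm(X_min - X_min[i], axis=1)
        neighbours = np.argsort(dists)[1:k + 1]   # skip the point itself
        j = rng.choice(neighbours)
        synthetic.append(X_min[i] + rng.random() * (X_min[j] - X_min[i]))
    return np.array(synthetic)
```

Because every synthetic point lies on a segment between two real minority samples, oversampling stays inside the minority region instead of duplicating points.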

    Active Sample Selection Based Incremental Algorithm for Attribute Reduction with Rough Sets

    Attribute reduction with rough sets is an effective technique for obtaining a compact and informative attribute set from a given dataset. However, traditional algorithms have no explicit provision for handling dynamic datasets where data present themselves in successive samples. Incremental algorithms for attribute reduction with rough sets have been recently introduced to handle dynamic datasets with large samples, though they have high complexity in time and space. To address the time/space complexity issue of the algorithms, this paper presents a novel incremental algorithm for attribute reduction with rough sets based on the adoption of an active sample selection process and an insight into the attribute reduction process. This algorithm first decides whether each incoming sample is useful with respect to the current dataset by the active sample selection process. A useless sample is discarded while a useful sample is selected to update a reduct. At the arrival of a useful sample, the attribute reduction process is then employed to guide how to add and/or delete attributes in the current reduct. The two processes thus constitute the theoretical framework of our algorithm. The proposed algorithm is finally experimentally shown to be efficient in time and space. This is a manuscript of the publication Yang, Yanyan, Degang Chen, and Hui Wang. "Active Sample Selection Based Incremental Algorithm for Attribute Reduction With Rough Sets." IEEE Transactions on Fuzzy Systems 25, no. 4 (2017): 825-838. DOI: 10.1109/TFUZZ.2016.2581186. Posted with permission.
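The positive-region bookkeeping behind attribute reduction can be sketched with the classical (non-incremental) dependency degree; the paper's incremental update and active sample selection are not shown:

```python
def partition(rows, attrs):
    """Group row indices by their values on the given attributes
    (the indiscernibility classes)."""
    blocks = {}
    for i, row in enumerate(rows):
        blocks.setdefault(tuple(row[a] for a in attrs), []).append(i)
    return list(blocks.values())

def dependency(rows, labels, attrs):
    """Rough-set dependency degree gamma(attrs): the fraction of samples
    whose indiscernibility class is consistent in the decision label."""
    pos = 0
    for block in partition(rows, attrs):
        if len({labels[i] for i in block}) == 1:  # block lies in the positive region
            pos += len(block)
    return pos / len(rows)

def preserves_positive_region(rows, labels, attrs, all_attrs):
    """attrs can serve as a reduct candidate iff its gamma matches the
    gamma of the full attribute set."""
    return dependency(rows, labels, attrs) == dependency(rows, labels, all_attrs)
```

A reduct is a minimal attribute subset satisfying this check; incremental algorithms like the paper's aim to maintain it without recomputing the partitions from scratch for every new sample.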

    Facilitation of visual pattern recognition by extraction of relevant features from microscopic traffic data

    An experimental approach to traffic flow analysis is presented in which methodology from pattern recognition is applied to a specific dataset to examine its utility in determining traffic patterns. The selected dataset, taken from a 1985 study by JHK and Associates (traffic research) for the Federal Highway Administration, covers an hour-long time period over a quarter-mile section and includes nine different identifying features for traffic at any given time. The initial step is to select the most pertinent of these features as targets for extraction and local storage during the experiment. The tools created for this approach, a two-level hierarchical group of operators, are used to extract features from the dataset to create a feature space; this minimizes the experimental set to a matrix of desirable attributes of the vehicles on the roadway. The aim is to identify whether these data can be readily parsed into four distinct traffic states; here, the state of a vehicle is defined by its velocity and acceleration at a selected timestamp. A three-dimensional plot, with color as the third dimension and viewed from a top-down perspective, is first used to identify vehicle states in a section of roadway over a selected span of time. K-means clustering is then applied to the feature space, with k = 4 to match the four distinct traffic states, to examine its viability in determining the states of vehicles in a time section; the method's accuracy is assessed through silhouette plots. Finally, a group of experiments run through a decision-tree architecture is compared to the k-means clustering approach. Each decision-tree format uses sets of predefined velocity and acceleration values to parse the data into the four states; the acceleration and deceleration thresholds are varied to examine different results. The three-dimensional plots provide a visual example of congested traffic for visual comparison with the clustering results. The silhouette plots of the k-means experiments show inaccuracy for certain clusters; the decision-tree work, on the other hand, shows promise for future research.
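The k-means step (k = 4, one cluster per traffic state) can be sketched with plain Lloyd iterations over (velocity, acceleration) samples; scikit-learn's `KMeans` plus `silhouette_score` would be the usual production route:

```python
import numpy as np

def kmeans(X, k=4, iters=50):
    """Plain Lloyd's k-means, sketching the clustering step that splits
    (velocity, acceleration) samples into four traffic states. Centres are
    initialised from the first k samples for determinism here; random or
    k-means++ initialisation is more usual in practice."""
    X = np.asarray(X, dtype=float)
    centers = X[:k].copy()
    labels = np.zeros(len(X), dtype=int)
    for _ in range(iters):
        # assign every sample to its nearest centre
        labels = np.argmin(np.linalg.norm(X[:, None] - centers, axis=2), axis=1)
        # recompute each centre as the mean of its cluster (keep old if empty)
        for c in range(k):
            if np.any(labels == c):
                centers[c] = X[labels == c].mean(axis=0)
    return labels, centers
```

The silhouette plots mentioned above then score how well each sample sits inside its assigned cluster versus the nearest alternative.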

    Machine learning for scientific data mining and solar eruption prediction

    This dissertation explores new machine learning techniques and adapts them to mine scientific data, specifically data from solar physics and space weather studies. The dissertation tackles three important problems in heliophysics: solar flare prediction, coronal mass ejection (CME) prediction and Stokes inversion. First, the dissertation presents a long short-term memory (LSTM) network for predicting whether an active region (AR) would produce a certain class of solar flare within the next 24 hours. The essence of this approach is to model data samples in an AR as time series and use LSTMs to capture temporal information of the data samples. The LSTM network consists of an LSTM layer, an attention layer, two fully connected layers and an output layer. The attention layer is designed to allow the LSTM network to automatically search for parts of the data samples that are related to the prediction of solar flares. Second, the dissertation presents two recurrent neural networks (RNNs), one based on gated recurrent units and the other based on LSTM, for predicting whether an AR that produces a significant flare will also initiate a CME. Again, data samples in an AR are modeled as time series and the RNNs are used to capture temporal dependencies in the time series. A feature selection technique is employed to enhance prediction accuracy. Third, the dissertation approaches the Stokes inversion problem using a novel convolutional neural network (CNN). This CNN method is faster, and produces cleaner magnetic maps, than a widely used physics-based tool. Furthermore, the CNN method outperforms other machine learning algorithms such as multiple support vector regression and multilayer perceptrons. Findings reported here have been validated by substantial experiments based on different datasets. The dissertation concludes with a fully operational database system containing real-time flare forecasting results produced by the proposed LSTM method. 
This is the first cyberinfrastructure capable of continuous learning and forecasting of solar flares based on deep learning.
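The time-series framing of active-region samples can be sketched as a sliding-window constructor; the window length and labelling convention below are illustrative, not the dissertation's 24-hour windows or flare-class labels:

```python
import numpy as np

def make_sequences(features, flare_labels, window):
    """Slice an active region's per-timestep feature vectors into
    fixed-length windows, each paired with the label at the window's end --
    the (samples, timesteps, features) shape an LSTM consumes."""
    n = len(features) - window + 1
    X = [features[t:t + window] for t in range(n)]        # (window, n_features) slices
    y = [flare_labels[t + window - 1] for t in range(n)]  # label at window end
    return np.array(X), np.array(y)
```

The resulting 3-D array feeds directly into a recurrent layer, which is how both the flare and CME predictors described above capture temporal dependencies.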