
    Hybrid ACO and TOFA feature selection approach for text classification

    With the rapidly increasing availability of text data on the Internet, selecting an appropriate set of features for text classification becomes more important, not only for reducing the dimensionality of the feature space but also for improving classification performance. This paper proposes a novel feature selection approach to improve the performance of text classifiers, based on an integration of the Ant Colony Optimization (ACO) algorithm and Trace Oriented Feature Analysis (TOFA). ACO is a metaheuristic search algorithm inspired by the foraging behavior of real ants, specifically the pheromone communication they use to find the shortest path to a food source. TOFA is a unified optimization framework developed to integrate several state-of-the-art dimension reduction algorithms. Previous research has shown that ACO is one of the promising approaches for optimization and feature selection problems, and that TOFA is capable of dealing with large-scale text data and can be applied to several text analysis applications such as text classification, clustering, and retrieval. To remain effective for classification, the proposed approach uses TOFA and classifier performance as the heuristic information of ACO. Results on the Reuters and Brown public datasets demonstrate the effectiveness of the proposed approach. © 2012 IEEE
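The abstract gives no pseudocode; a generic ACO feature-selection loop of the kind it describes, with a toy fitness function standing in for the TOFA/classifier-accuracy heuristic (all names and parameters here are illustrative, not the paper's), might be sketched as:

```python
import random

def ant_colony_feature_selection(n_features, fitness, n_ants=10, n_iters=20,
                                 subset_size=5, evaporation=0.1, seed=0):
    """Generic ACO feature selection: pheromone values bias which features
    each ant picks; subsets are scored by `fitness`, and the best subset
    found so far reinforces the pheromone on its features."""
    rng = random.Random(seed)
    pheromone = [1.0] * n_features
    best_subset, best_score = None, float("-inf")
    for _ in range(n_iters):
        for _ant in range(n_ants):
            # sample features without replacement, weighted by pheromone
            candidates = list(range(n_features))
            subset = []
            for _ in range(subset_size):
                weights = [pheromone[f] for f in candidates]
                pick = rng.choices(candidates, weights=weights, k=1)[0]
                candidates.remove(pick)
                subset.append(pick)
            score = fitness(subset)
            if score > best_score:
                best_subset, best_score = subset, score
        # evaporate, then deposit pheromone on the best-so-far subset
        pheromone = [(1 - evaporation) * p for p in pheromone]
        for f in best_subset:
            pheromone[f] += best_score
    return sorted(best_subset), best_score

# toy fitness: features 0-4 are "informative", the rest are noise
toy_fitness = lambda subset: sum(1 for f in subset if f < 5)
subset, score = ant_colony_feature_selection(20, toy_fitness)
print(subset, score)
```

In the paper's setting, the toy fitness would be replaced by TOFA's objective combined with classifier accuracy on the candidate subset.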

    Enhanced ontology-based text classification algorithm for structurally organized documents

    Text classification (TC) is an important foundation of information retrieval and text mining. The main task of TC is to predict a text's class according to the type of tag given in advance. Most TC algorithms represent documents by their terms without considering the relations among those terms, in a space where every word is assumed to be a dimension. Such representations generate high dimensionality, which has a negative effect on classification performance. The objectives of this thesis are to formulate algorithms for classifying text by creating a suitable feature vector and reducing the dimension of the data, thereby enhancing classification accuracy. This research combines ontology and text representation for classification by developing five algorithms. The first and second algorithms, Concept Feature Vector (CFV) and Structure Feature Vector (SFV), create feature vectors to represent the document. The third algorithm, Ontology Based Text Classification (OBTC), is designed to reduce the dimensionality of training sets. The fourth and fifth algorithms, Concept Feature Vector_Text Classification (CFV_TC) and Structure Feature Vector_Text Classification (SFV_TC), classify the document into its related set of classes. These proposed algorithms were tested on five different scientific-paper datasets downloaded from different digital libraries and repositories. Experimental results obtained from the proposed CFV_TC and SFV_TC algorithms showed better average results in terms of precision, recall, F-measure, and accuracy compared against the SVM and RSS approaches. The work in this study contributes to exploring related documents in information retrieval and text mining research by using ontology in TC

    Context based mixture model for cell phase identification in automated fluorescence microscopy

    BACKGROUND: Automated identification of the cell cycle phases of individual live cells in a large population captured via automated fluorescence microscopy is important for cancer drug discovery and cell cycle studies. Time-lapse fluorescence microscopy images provide an important method for studying the cell cycle process under different conditions of perturbation. Existing methods are limited in dealing with such time-lapse data sets, while manual analysis is not feasible. This paper presents statistical data analysis and statistical pattern recognition to perform this task. RESULTS: The data are generated from HeLa H2B-GFP cells imaged during a 2-day period, with images acquired 15 minutes apart using automated time-lapse fluorescence microscopy. The patterns are described with four kinds of features: twelve general features, Haralick texture features, Zernike moment features, and wavelet features. To generate a new set of features with more discriminative power, commonly used feature reduction techniques are applied: Principal Component Analysis (PCA), Linear Discriminant Analysis (LDA), Maximum Margin Criterion (MMC), Stepwise Discriminant Analysis based Feature Selection (SDAFS), and Genetic Algorithm based Feature Selection (GAFS). We then propose a Context Based Mixture Model (CBMM) for dealing with the time-series cell sequence information and compare it to traditional classifiers: Support Vector Machine (SVM), Neural Network (NN), and K-Nearest Neighbor (KNN). Following standard practice in machine learning, we systematically compare the performance of a number of common feature reduction techniques and classifiers to select an optimal combination of a feature reduction technique and a classifier. A cellular database containing 100 manually labelled subsequences is built for evaluating the performance of the classifiers, and the generalization error is estimated using cross-validation. The experimental results show that CBMM outperforms all other classifiers in identifying prophase and has the best overall performance. CONCLUSION: The application of feature reduction techniques can improve prediction accuracy significantly. CBMM can effectively utilize the contextual information and has the best overall performance when combined with any of the previously mentioned feature reduction techniques
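The systematic "reduction technique × classifier" comparison described above can be sketched with generic tools such as scikit-learn on synthetic data (an illustrative stand-in, not the paper's code, cell-image features, or CBMM model):

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

# synthetic stand-in for the extracted cell-image feature vectors
X, y = make_classification(n_samples=300, n_features=40, n_informative=8,
                           n_classes=3, n_clusters_per_class=1, random_state=0)

reducers = {"PCA": PCA(n_components=5),
            "LDA": LinearDiscriminantAnalysis(n_components=2)}
classifiers = {"SVM": SVC(),
               "KNN": KNeighborsClassifier(n_neighbors=5)}

# cross-validated accuracy for every reducer/classifier combination
for rname, reducer in reducers.items():
    for cname, clf in classifiers.items():
        pipe = make_pipeline(reducer, clf)
        score = cross_val_score(pipe, X, y, cv=5).mean()
        print(f"{rname}+{cname}: {score:.3f}")
```

The paper's full grid additionally includes MMC, SDAFS, GAFS, NN, and the proposed CBMM, which are not available as off-the-shelf scikit-learn components.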

    Use of interpretable evolved search query classifiers for Sinhala documents

    Document analysis is a well-matured yet still active research field, partly as a result of the intricate nature of building computational tools, but also due to the inherent problems arising from the variety and complexity of human languages. Breaking down language barriers is vital in enabling access to a number of recent technologies. This paper investigates the application of document classification methods to new Sinhala datasets. This language is geographically isolated and rich with many unique features of its own. We examine the interpretability of the classification models, with a particular focus on the use of evolved Lucene search queries generated using a Genetic Algorithm (GA) as a method of document classification, and compare the accuracy and interpretability of these search queries with those of other popular classifiers. The results are promising and are roughly in line with previous work on English-language datasets
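As a rough illustration of the evolved-query idea (not the paper's implementation, and with a hypothetical toy corpus in place of the Sinhala datasets and Lucene), a minimal GA can evolve OR-queries whose match/no-match outcome acts as a binary classifier:

```python
import random

# hypothetical toy corpus: (label, text) pairs
docs = [("sports", "match goal team score win"),
        ("sports", "team player goal league"),
        ("tech", "software code bug release"),
        ("tech", "code compiler release patch")]
vocab = sorted({w for _, text in docs for w in text.split()})

def matches(query_terms, text):
    # an OR-query matches a document if any query term occurs in it
    return any(t in text.split() for t in query_terms)

def fitness(query_terms, target="sports"):
    # accuracy of the rule "query matches => document is `target` class"
    hits = sum(matches(query_terms, text) == (label == target)
               for label, text in docs)
    return hits / len(docs)

rng = random.Random(1)
population = [rng.sample(vocab, 2) for _ in range(20)]
for _ in range(30):
    population.sort(key=fitness, reverse=True)
    survivors = population[:10]
    # mutate survivors by swapping the second term for a random word
    children = [[q[0], rng.choice(vocab)] for q in survivors]
    population = survivors + children
best = max(population, key=fitness)
print(best, fitness(best))
```

The appeal noted in the abstract is that the evolved artifact, a short keyword query, is directly readable by a human, unlike the weights of most statistical classifiers.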

    Comparison of principal component analysis and ANFIS to improve EEVE laboratory energy use prediction performance

    This study addresses energy use in excess of practicum students' needs and the disturbed comfort that students experience when conducting practicums in the Electrical Engineering Vocational Education (EEVE) laboratory. The main objective of this study was to determine how to predict and streamline the use of electrical energy in the EEVE laboratory. The model used to achieve this goal was the adaptive neuro-fuzzy inference system (ANFIS), coupled with principal component analysis (PCA) feature selection. The use of PCA for data grouping aims to improve the performance of the ANFIS model when predicting energy needs in accordance with the standards set by the campus, while still taking students' confidence in conducting practicum activities during campus operating hours into consideration. After experiments and tests, very good results were obtained: R=1 in training; minimum RMSE=0.011900; an epoch of 100 per iteration; and R=0.37522. In conclusion, the ANFIS model coupled with PCA feature selection was excellent at predicting energy needs in the laboratory while keeping the comfort of the students during practicums under consideration
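ANFIS implementations are toolbox-specific, but the PCA preprocessing step the study couples with it can be sketched in plain NumPy (the data here are random stand-ins for the laboratory measurements, and the function name is illustrative):

```python
import numpy as np

def pca_reduce(X, n_components):
    """Project X onto its top principal components (columns centred first)."""
    Xc = X - X.mean(axis=0)
    # SVD of the centred data; rows of Vt are the principal directions,
    # ordered by decreasing explained variance
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 8))       # stand-in for laboratory energy measurements
Z = pca_reduce(X, n_components=3)  # reduced inputs for the ANFIS model
print(Z.shape)
```

The reduced matrix `Z` would then serve as the input variables of the ANFIS model in place of the raw measurements.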

    Effective and efficient approach in IoT Botnet detection

    Internet of Things (IoT) technology brings advantages to daily life, but these advantages are no guarantee of security, because cyber-attacks such as botnets remain a threat to users. Detection systems are one alternative for maintaining the security of an IoT network. A reliable detection system should detect botnets effectively, with high accuracy and a low false-positive rate, and efficiently, performing detection quickly. However, data generated by IoT networks have high dimensionality and high scalability, so they need to be minimized. In network security analysis, high-dimensional data pose challenges such as the curse of dimensionality: correlations between different dimensions make features hard to define and lead to largely unordered datasets, cluster merging, and exponential growth. In this study, we applied feature reduction using the Linear Discriminant Analysis (LDA) method to minimize the features of the IoT network for botnet detection. The reduction is carried out on the N-BaIoT dataset, whose 115 features are reduced to 2. With feature reduction, the detection system becomes more effective and efficient. Experimental results showed that LDA combined with the Decision Tree classification method was able to detect botnets with accuracy reaching 100% in 98.58 s using only two features
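The LDA-then-Decision-Tree pipeline can be sketched with scikit-learn on synthetic data shaped like N-BaIoT's 115 features (an illustrative sketch, not the study's code or dataset; the reported 100% accuracy applies to the real data, not this stand-in):

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# synthetic stand-in for 115-feature N-BaIoT traffic records
X, y = make_classification(n_samples=1000, n_features=115, n_informative=10,
                           n_classes=3, random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.3, random_state=0)

# supervised reduction: 115 features -> 2 discriminant components
lda = LinearDiscriminantAnalysis(n_components=2)
Xtr2 = lda.fit_transform(Xtr, ytr)
Xte2 = lda.transform(Xte)

# classify in the 2-dimensional reduced space
clf = DecisionTreeClassifier(random_state=0).fit(Xtr2, ytr)
acc = accuracy_score(yte, clf.predict(Xte2))
print(f"accuracy with 2 LDA features: {acc:.3f}")
```

Note that LDA can produce at most (number of classes - 1) components, which is why a 2-component projection fits the multi-class botnet setting described above.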