
    Hybrid feature selection based on principal component analysis and grey wolf optimizer algorithm for Arabic news article classification

    The rapid growth of electronic documents has resulted from the expansion and development of internet technologies. Text document classification is a key task in natural language processing that converts unstructured data into structured form and then extracts knowledge from it. This conversion generates high-dimensional data that needs further analysis using data mining techniques such as feature extraction, feature selection, and classification to derive meaningful insights. Feature selection reduces dimensionality by pruning the feature space, which lowers the computational cost and enhances classification accuracy. This work presents a hybrid filter-wrapper method (PCA-GWO) that uses Principal Component Analysis (PCA) as a filter approach to select an appropriate and informative subset of features and Grey Wolf Optimizer (GWO) as a wrapper approach to select further informative features. Logistic Regression (LR) is used as an evaluator to test the classification accuracy of candidate feature subsets produced by GWO. Three Arabic datasets, namely Alkhaleej, Akhbarona, and Arabiya, are used to assess the efficiency of the proposed method. The experimental results confirm that the proposed PCA-GWO method outperforms the baseline classifiers with and without feature selection, as well as other feature selection approaches, in terms of classification accuracy.
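The wrapper stage described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the PCA filter stage is omitted, a nearest-centroid classifier stands in for Logistic Regression so that the example needs only the standard library, and all names (`subset_accuracy`, `gwo_select`, `DEMO_X`, `DEMO_Y`) and the toy data are hypothetical. Each wolf holds a position in [0, 1] per feature, and a feature counts as selected when its coordinate exceeds 0.5.

```python
import random

def subset_accuracy(X, y, mask):
    """Training accuracy of a nearest-centroid classifier on the masked features.
    Assumption: this stands in for the Logistic Regression evaluator."""
    cols = [j for j, keep in enumerate(mask) if keep]
    if not cols:
        return 0.0  # an empty feature subset cannot classify anything
    labels = sorted(set(y))
    centroids = {}
    for c in labels:
        rows = [x for x, t in zip(X, y) if t == c]
        centroids[c] = [sum(r[j] for r in rows) / len(rows) for j in cols]
    correct = 0
    for x, t in zip(X, y):
        pred = min(labels, key=lambda c: sum(
            (x[j] - m) ** 2 for j, m in zip(cols, centroids[c])))
        correct += pred == t
    return correct / len(y)

def gwo_select(X, y, n_wolves=8, n_iter=20, seed=0):
    """Grey Wolf Optimizer over continuous positions, thresholded into a mask."""
    rng = random.Random(seed)
    dim = len(X[0])

    def to_mask(pos):
        return [p > 0.5 for p in pos]

    def fitness(pos):
        return subset_accuracy(X, y, to_mask(pos))

    wolves = [[rng.random() for _ in range(dim)] for _ in range(n_wolves)]
    best_pos, best_fit = None, -1.0
    for it in range(n_iter):
        # The three fittest wolves (alpha, beta, delta) lead the pack.
        ranked = sorted(wolves, key=fitness, reverse=True)
        alpha, beta, delta = (list(w) for w in ranked[:3])
        f = fitness(alpha)
        if f > best_fit:
            best_pos, best_fit = alpha, f
        a = 2.0 * (1 - it / n_iter)  # decreases linearly from 2 towards 0
        for w in wolves:
            for j in range(dim):
                new = 0.0
                for leader in (alpha, beta, delta):
                    A = a * (2 * rng.random() - 1)
                    C = 2 * rng.random()
                    D = abs(C * leader[j] - w[j])
                    new += leader[j] - A * D
                w[j] = min(1.0, max(0.0, new / 3))  # average, clipped to [0, 1]
    # Evaluate the pack once more after the final position update.
    for w in wolves:
        f = fitness(w)
        if f > best_fit:
            best_pos, best_fit = list(w), f
    return to_mask(best_pos), best_fit

# Toy data (hypothetical): only feature 0 separates the two classes.
DEMO_X = [[0.10, 0.9, 0.3], [0.20, 0.1, 0.8], [0.15, 0.5, 0.5],
          [0.90, 0.8, 0.2], [0.80, 0.2, 0.7], [0.85, 0.4, 0.6]]
DEMO_Y = [0, 0, 0, 1, 1, 1]

print(gwo_select(DEMO_X, DEMO_Y, seed=1))
```

Scoring on the training data keeps the sketch short; a realistic wrapper would score each candidate subset on held-out data or by cross-validation.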

    Proceedings of the 2nd Computer Science Student Workshop: Microsoft Istanbul, Turkey, April 9, 2011


    An Ontology-based Two-Stage Approach to Medical Text Classification with Feature Selection by Particle Swarm Optimisation

    © 2019 IEEE. Document classification (DC) is the task of assigning pre-defined labels to unseen documents by utilizing a model trained on the available labeled documents. DC has attracted much attention in medical fields recently because many issues can be formulated as a classification problem. It can assist doctors in decision making, and correct decisions can reduce medical expenses. Medical documents have special attributes that distinguish them from other texts and make them difficult to analyze. For example, the many acronyms, abbreviations, and short expressions make it more challenging to extract information. The classification accuracy of the current medical DC methods is not satisfactory. The goal of this work is to enhance the input feature sets of the DC method to improve the accuracy. To approach this goal, a novel two-stage approach is proposed. In the first stage, a domain-specific dictionary, namely the Unified Medical Language System (UMLS), is employed to extract the key features belonging to the most relevant concepts such as diseases or symptoms. In the second stage, Particle Swarm Optimisation (PSO) is applied to select more relevant features from those extracted in the first stage. The performance of the proposed approach is evaluated on the 2010 Informatics for Integrating Biology and the Bedside (i2b2) data set, which is a widely used medical text dataset. The experimental results show substantial improvement in classification accuracy by the proposed method.

    Improving Feature Selection Techniques for Machine Learning

    As a commonly used technique in data preprocessing for machine learning, feature selection identifies important features and removes irrelevant, redundant or noisy features to reduce the dimensionality of the feature space. It improves the efficiency, accuracy and comprehensibility of the models built by learning algorithms. Feature selection techniques have been widely employed in a variety of applications, such as genomic analysis, information retrieval, and text categorization. Researchers have introduced many feature selection algorithms with different selection criteria. However, it has been discovered that no single criterion is best for all applications. We proposed a hybrid feature selection framework based on genetic algorithms (GAs) that employs a target learning algorithm to evaluate features, that is, a wrapper method. We call it the hybrid genetic feature selection (HGFS) framework. The advantages of this approach include the ability to accommodate multiple feature selection criteria and to find small subsets of features that perform well for the target algorithm. The experiments on genomic data demonstrate that ours is a robust and effective approach that can find subsets of features with higher classification accuracy and/or smaller size compared to each individual feature selection algorithm. A common characteristic of text categorization tasks is multi-label classification with a great number of features, which makes wrapper methods time-consuming and impractical. We proposed a simple filter (non-wrapper) approach called the Relation Strength and Frequency Variance (RSFV) measure. The basic idea is that informative features are those that are highly correlated with the class and distributed most differently among all classes. The approach is compared with two well-known feature selection methods in experiments on two standard text corpora. The experiments show that RSFV generates equal or better performance than the others in many cases.

    Nomenclature and Benchmarking Models of Text Classification Models: Contemporary Affirmation of the Recent Literature

    In this paper we present automated text classification in text mining, which is gaining greater relevance in various fields every day. Text mining primarily focuses on developing text classification systems able to automatically classify huge volumes of documents comprising unstructured and semi-structured data. The process of retrieval, classification, and summarization simplifies the extraction of information by the user. The search for the ideal text classifier, feature generator, and dominant feature selection technique that outperforms all previous research has received attention from researchers in diverse areas such as information retrieval, machine learning, and the theory of algorithms. To automatically classify and discover patterns from different types of documents, techniques such as Machine Learning, Natural Language Processing (NLP), and Data Mining are applied together. In this paper we review some effective feature selection research and show the results in a table for

    Development of Artificial Intelligence systems as a prediction tool in ovarian cancer

    PhD Thesis. Ovarian cancer is the 5th most common cancer in females and the UK has one of the highest incidence rates in Europe. In the UK only 36% of patients will live for at least 5 years after diagnosis. The number of prognostic markers, treatments and sequences of treatments in ovarian cancer is rising. Therefore, it is getting more difficult for the human brain to perform clinical decision making. There is a need for an expert computer system (e.g. Artificial Intelligence (AI)) that is capable of investigating the possible outcomes for each marker, treatment and sequence of treatment. Such expert systems may provide a tool which could help clinicians to analyse and predict outcome using different treatment pathways. Whilst prediction of overall survival of a patient is difficult, there may be some benefits, as this is not only useful information for the patient but may also determine treatment modality. In this project a dataset was constructed of 352 patients who had been treated at a single centre. Clinical data were extracted from the health records. Expert systems were then investigated to determine the optimum model to predict overall survival of a patient. The five year survival period (a standard survival outcome measure in cancer research) was investigated; in addition, the system was developed with the flexibility to predict patient survival rates for many other categories. Comparisons with currently used prognostic models in ovarian cancer demonstrated a significant improvement in performance for the AI model (Area under the Curve (AUC) of 0.72 for AI and AUC of 0.62 for the statistical model). Using various methods, the most important variables in this prediction were identified as: FIGO stage, outcome of the surgery and CA125. This research investigated the effects of increasing the number of cases in prediction models. Results indicated that by increasing the number of cases, the prediction performance improved. Categorization of continuous data did not improve the prediction performance. The project next investigated the possibility of predicting surgical outcomes in ovarian cancer using AI, based on the variables that are available to clinicians prior to the surgery. Such a tool could have direct clinical relevance. Diverse models that can predict the outcome of the surgery were investigated and developed. The developed AI models were also compared against the standard statistical prediction model, and the AI model outperformed it in each task: the prediction of all outcomes (complete or optimal or suboptimal) (AUC of AI: 0.71 and AUC of statistical model: 0.51), the prediction of complete or optimal cytoreduction versus suboptimal cytoreduction (AUC of AI: 0.73 and AUC of statistical model: 0.50) and finally the prediction of complete cytoreduction versus optimal or suboptimal cytoreduction (AUC of AI: 0.79 and AUC of statistical model: 0.47). The most important variables for this prediction were identified as: FIGO stage, tumour grade and histology. The application of transcriptomic analysis to cancer research raises the question of which genes are significantly involved in a particular cancer and which genes can accurately predict survival outcomes in a given cancer. Therefore, AI techniques were employed to identify the most important genes for the prediction of Homologous Recombination (HR), an important DNA repair pathway in ovarian cancer, identifying LIG1 and POLD3 as novel prognostic biomarkers. Finally, AI models were used to predict the HR status for any given patient (AUC: 0.87). This project has demonstrated that AI may have an important role in ovarian cancer. AI systems may provide tools to help clinicians and researchers in ovarian cancer and may allow more informed decisions, resulting in better management of this cancer.
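For readers comparing the AUC values quoted in this abstract, the following minimal sketch shows one common way to compute the metric for a binary outcome: as the fraction of (positive, negative) case pairs in which the positive case receives the higher score, with ties counted as half. The function name `auc` and the example labels and scores are hypothetical and are not taken from the thesis.

```python
def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of scores.
    labels: 1 for a positive case, 0 for a negative case."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    # Each pair contributes 1 if the positive outranks the negative, 0.5 on a tie.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical example: three of the four positive/negative pairs are ordered correctly.
print(auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
```

Under this definition a model that ranks cases at random scores about 0.5, so values such as 0.47 to 0.51 indicate near-chance discrimination.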