51 research outputs found

    A comparison of the effect of feature selection and balancing strategies upon the sentiment classification of portuguese news stories

    Sentiment classification of news stories using supervised learning is a mature task in the field of Natural Language Processing. Supervised learning strategies rely upon training data to induce a classifier. Training data can be imbalanced, with the neutral class typically being the majority class. This imbalance can bias the induced classifier towards the majority class. Balancing and feature selection can mitigate the effects of imbalanced data. This paper surveys a number of common balancing and feature selection techniques, and applies them to an imbalanced data set of manually labelled Brazilian agricultural news stories. The strategies were appraised with a 90:10 holdout evaluation and compared with a baseline strategy. We found that: 1. the feature selection strategies provided no identifiable advantage over a baseline method and 2. balancing produced an advantage over baseline, with random oversampling producing the best results. FAPESP (grant 11/20451-1)
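The evaluation setup named in this abstract (a 90:10 holdout with random oversampling) can be sketched roughly as follows. This is a minimal illustration and not the authors' code: the toy rows, the function names `holdout_split` and `random_oversample`, and the seeds are all assumptions made for the example. Oversampling is applied to the training portion only, so that duplicated rows cannot leak into the held-out set.

```python
import random
from collections import Counter


def holdout_split(rows, test_fraction=0.1, seed=0):
    """Shuffle rows and split them into (train, test); 0.1 gives a 90:10 holdout."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def random_oversample(rows, seed=0):
    """Duplicate randomly chosen rows of the smaller classes until
    every class has as many rows as the largest class."""
    rng = random.Random(seed)
    by_label = {}
    for text, label in rows:
        by_label.setdefault(label, []).append((text, label))
    target = max(len(group) for group in by_label.values())
    balanced = []
    for group in by_label.values():
        balanced.extend(group)
        balanced.extend(rng.choice(group) for _ in range(target - len(group)))
    rng.shuffle(balanced)
    return balanced


# Hypothetical toy data: a neutral majority class and two small classes.
rows = [(f"neutral story {i}", "neutral") for i in range(80)]
rows += [(f"positive story {i}", "positive") for i in range(12)]
rows += [(f"negative story {i}", "negative") for i in range(8)]

train, test = holdout_split(rows, test_fraction=0.1, seed=42)
balanced_train = random_oversample(train, seed=42)  # the test set is left untouched
print(Counter(label for _, label in balanced_train))
```

After balancing, every class present in the training split has the same number of rows, while the held-out 10% keeps its original, imbalanced distribution.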

    Predictive Analytics For Disease Condition Of Patients In Emergency Department

    Emergency Departments (EDs) in hospitals are experiencing severe crowding and prolonged patient waiting times. Reports of hospital crowding describe patients in hallways, long waiting times and full occupancy of ED beds. ED crowding has several potential unfavorable effects, including patient and staff frustration, lower patient satisfaction and poor health outcomes. The primary motivations behind this study are shortening patients’ waiting time and improving patient satisfaction and level of care. The very first interaction between clinicians and a patient is recorded in nurse triage notes, which contain details of the reason for the patient’s visit, including specific symptoms and incidents. Triage notes and vital signs measured by the triage nurse determine the complexity of the patient’s condition. In the case of a minor illness or injury, the patient is treated by nurse practitioners under ED physicians’ supervision. This process, called the fast track system, allows the main ED area to focus on more severe patient conditions. The final decision must be made by physicians, so patients have to wait to be seen in order to find out whether they need to be admitted to the hospital or discharged. In this study, we propose a decision support system based on nurse triage notes and vital signs that can automatically predict the ICD9 code assigned to each patient prior to the visit time. We tested the model on 8000 patient records from the VA Medical Center in Detroit for ICD9 classification and measured performance in terms of accuracy.

    A Survey of Methods for Handling Disk Data Imbalance

    Class imbalance exists in many classification problems, and since the data is designed for accuracy, imbalance in data classes can lead to classification challenges with a few classes having higher misclassification costs. The Backblaze dataset, a widely used dataset related to hard disks, has a small amount of failure data and a large amount of healthy data, and so exhibits a serious class imbalance. This paper provides a comprehensive overview of research in the field of imbalanced data classification. The discussion is organized into three main aspects: data-level methods, algorithmic-level methods, and hybrid methods. For each type of method, we summarize and analyze the existing problems, algorithmic ideas, strengths, and weaknesses. Additionally, the challenges of imbalanced data classification are discussed, along with strategies to address them. This makes it convenient for researchers to choose the appropriate method according to their needs.

    A study of feature extraction techniques for classifying topics and sentiments from news posts

    Many news channels now run their own Facebook pages, on which news posts are released daily. These posts carry temporal opinions about social events that may change over time due to external factors, and they can also serve as a monitor of significant events around the world. As a result, much text mining research has been conducted in the area of Temporal Sentiment Analysis, where one of the most challenging tasks is to detect and extract the key features from news posts that arrive continuously over time. Extracting these features is difficult because of the posts’ complex properties, and because posts about a specific topic may grow or vanish over time, which produces imbalanced datasets. This study therefore presents a comparative analysis of feature extraction techniques (TF-IDF, TF, BTO, IG, Chi-square) with three different n-gram features (Unigram, Bigram, Trigram), using SVM as the classifier. The aim is to discover the optimal Feature Extraction Technique (FET), the one that achieves the best accuracy for both topic and sentiment classification. The analysis is conducted on datasets from three news channels. The experimental results for topic classification show that Chi-square with unigrams is the best FET compared to the other techniques. Furthermore, to overcome the problem of imbalanced data, this study combines the best FET with oversampling. The evaluation results show an improvement in the classifier’s performance, with higher accuracies of 93.37%, 92.89%, and 91.92% for BBC, Al-Arabiya, and Al-Jazeera, respectively, compared to those obtained on the original datasets. Similarly, the same combination (Chi-square+Unigram) was used for sentiment classification and obtained accuracies of 81.87%, 70.01%, and 77.36%. However, testing the identified optimal FET on unseen, randomly selected news posts showed relatively low accuracies for both topic and sentiment classification, due to the changes of topics and sentiments over time.
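For readers unfamiliar with Chi-square scoring of unigrams, the sketch below shows one standard way to compute it from a 2x2 term/class contingency table. It is an illustration under stated assumptions and not the study's implementation: the four toy documents, the whitespace tokenizer and the function names `chi_square` and `score_unigrams` are all hypothetical.

```python
from collections import Counter


def chi_square(a, b, c, d):
    """Chi-square statistic for a 2x2 term/class table.

    a: documents in the class that contain the term
    b: documents outside the class that contain the term
    c: documents in the class without the term
    d: documents outside the class without the term
    """
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0  # the term (or the class) never varies, so it carries no signal
    return n * (a * d - b * c) ** 2 / denom


def score_unigrams(docs, target_label):
    """Score every unigram against one class using document-level presence."""
    n_in = sum(1 for _, label in docs if label == target_label)
    n_out = len(docs) - n_in
    in_class, out_class = Counter(), Counter()
    for text, label in docs:
        terms = set(text.lower().split())  # naive whitespace tokenizer (assumption)
        (in_class if label == target_label else out_class).update(terms)
    scores = {}
    for term in set(in_class) | set(out_class):
        a, b = in_class[term], out_class[term]
        scores[term] = chi_square(a, b, n_in - a, n_out - b)
    return scores


# Hypothetical toy corpus with two topics.
docs = [
    ("election vote parliament", "politics"),
    ("vote minister election", "politics"),
    ("match goal team", "sport"),
    ("team wins match", "sport"),
]
scores = score_unigrams(docs, "politics")
for term in sorted(scores, key=lambda t: (-scores[t], t)):
    print(f"{term:12s} {scores[term]:.3f}")
```

Note that the statistic is symmetric: a term that appears only outside the target class (such as "match" here) scores as highly as one that appears only inside it, because both are equally informative about class membership.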

    A Semi-Supervised Algorithm for Detecting Extremism Propaganda Diffusion on Social Media

    European Social Fund, the Spanish Ministry of Economy and Competitiveness (Project Reference: FFI2016-79748-R); Junta de Andalucía (Project References: P18-FR-5020 and A-HUM-250-UGR18); Spanish Ministry of Economy and Competitiveness 2017 FPI Predoctoral Programme (Grant Number: BES-2017-081202)

    Learning from high-dimensional and class-imbalanced datasets using random forests

    Class imbalance and high dimensionality are two major issues in several real-life applications, e.g., in the fields of bioinformatics, text mining and image classification. However, while both issues have been extensively studied in the machine learning community, they have mostly been treated separately, and little research has so far been conducted on which approaches might be best suited to datasets that are class-imbalanced and high-dimensional at the same time (i.e., with a large number of features). This work attempts to contribute to this challenging research area by studying the effectiveness of hybrid learning strategies that integrate feature selection techniques, to reduce the data dimensionality, with methods that cope with the adverse effects of class imbalance (in particular, data balancing and cost-sensitive methods are considered). Extensive experiments have been carried out across datasets from different domains, leveraging a well-known classifier, the Random Forest, which has proven to be effective in high-dimensional spaces and has also been successfully applied to imbalanced tasks. Our results give evidence of the benefits of such a hybrid approach when compared to using feature selection or imbalance learning methods alone.
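The abstract does not say which cost-sensitive scheme was used. As one common illustration, the sketch below weights each class inversely to its frequency, the same heuristic that scikit-learn applies for `class_weight="balanced"`. The labels, the function names `balanced_class_weights` and `weighted_error`, and the 95:5 split are assumptions chosen for the example.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {label: n / (k * count) for label, count in counts.items()}


def weighted_error(y_true, y_pred, weights):
    """Error rate in which each mistake costs the weight of its true class."""
    total = sum(weights[t] for t in y_true)
    wrong = sum(weights[t] for t, p in zip(y_true, y_pred) if t != p)
    return wrong / total


# Hypothetical 95:5 imbalanced label set.
labels = ["majority"] * 95 + ["minority"] * 5
weights = balanced_class_weights(labels)

# A classifier that always predicts the majority class.
always_majority = ["majority"] * len(labels)
plain_error = sum(t != p for t, p in zip(labels, always_majority)) / len(labels)
cost_sensitive_error = weighted_error(labels, always_majority, weights)
print(plain_error, cost_sensitive_error)
```

Under plain accuracy the always-majority classifier looks strong (5% error), whereas the weighted error is about 50%, which is why cost-sensitive training discourages a model from ignoring the minority class.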