51 research outputs found
A comparison of the effect of feature selection and balancing strategies upon the sentiment classification of Portuguese news stories
Sentiment classification of news stories using supervised learning is a mature task in the field of Natural Language Processing. Supervised learning strategies rely upon training data to induce a classifier. Training data can be imbalanced, with the neutral class typically being the majority class. This imbalance can bias the induced classifier towards the majority class. Balancing and feature selection can mitigate the effects of imbalanced data. This paper surveys a number of common balancing and feature selection techniques, and applies them to an imbalanced data set of manually labelled Brazilian agricultural news stories. The strategies were appraised with a 90:10 holdout evaluation and compared with a baseline strategy. We found that: 1. the feature selection strategies provided no identifiable advantage over a baseline method and 2. balancing produced an advantage over baseline, with random oversampling producing the best results. FAPESP (grant 11/20451-1)
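As a rough illustration of the two ideas named in this abstract, the sketch below performs a 90:10 holdout split and then applies random oversampling to the training portion only. It is not the authors' implementation: the toy data, the function names `holdout_split` and `random_oversample`, and the fixed seeds are all assumptions made for the example.

```python
import random
from collections import Counter


def holdout_split(rows, test_fraction=0.1, seed=0):
    """Shuffle the rows and return (train, test) for a holdout evaluation."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def random_oversample(rows, seed=0):
    """Duplicate randomly chosen rows of each smaller class until every
    class has as many rows as the largest class."""
    rng = random.Random(seed)
    by_label = {}
    for text, label in rows:
        by_label.setdefault(label, []).append((text, label))
    target = max(len(group) for group in by_label.values())
    balanced = []
    for group in by_label.values():
        balanced.extend(group)
        balanced.extend(rng.choice(group) for _ in range(target - len(group)))
    rng.shuffle(balanced)
    return balanced


# Hypothetical toy corpus with a neutral majority class.
rows = [(f"neutral story {i}", "neutral") for i in range(80)]
rows += [(f"positive story {i}", "positive") for i in range(12)]
rows += [(f"negative story {i}", "negative") for i in range(8)]

train, test = holdout_split(rows)            # 90:10 holdout
balanced_train = random_oversample(train)    # balance the training set only
print(Counter(label for _, label in balanced_train))
```

Only the training portion is oversampled here, so the held-out 10% keeps the original class distribution and no duplicated story appears on both sides of the split.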
Predictive Analytics For Disease Condition Of Patients In Emergency Department
Emergency Departments (EDs) in hospitals are experiencing severe crowding and prolonged patient waiting times. Reports of hospital crowding describe patients in hallways, long waiting times and full occupancy of ED beds. ED crowding has several potential unfavorable effects, including patient and staff frustration, lower patient satisfaction and poor health outcomes. The primary motivations behind this study are shortening patients’ waiting time and improving patient satisfaction and level of care.
The first interaction between clinicians and a patient is recorded in the nurse triage notes, which detail the reason for the patient’s visit, including specific symptoms and incidents. The triage notes and the vital signs measured by the triage nurse determine the complexity of the patient’s condition. In the case of a minor illness or injury, the patient is treated by nurse practitioners under ED physicians’ supervision. This process, called the fast track system, allows the main ED area to focus on more severe patient conditions. The final decision must be made by physicians, so patients have to wait to be seen in order to find out whether they will be admitted to the hospital or discharged.
In this study, we propose a decision support system based on nurse triage notes and vital signs that can automatically predict the ICD9 code assigned to each patient prior to the visit time. We tested the model on 8000 patient records from the VA Medical Center in Detroit for ICD9 classification and measured performance in terms of accuracy
A hybrid reduced approach to handle missing values in type 2 diabetes prediction
Diabetes is gaining attention among medical institutions and health care organizations as its prevalence rises around the world. In the United States, 29.1 million people, or 9.3% of the U.S. population, are diagnosed with diabetes. About 86 million people are categorized as prediabetic, and 15-30% of them will develop diabetes within 5 years. To tackle this challenge, the National Diabetes Prevention Program (DPP) was introduced in 2002; it reduces the risk of diabetes by 58% through a lifestyle change program. In order to help select a better group of people with prediabetes for intervention and maximize the cost-effectiveness of the program, we propose a Hybrid Reduced approach to handle missing values when predicting type 2 diabetes. This approach deals with 4 challenges in electronic medical records: missing values, values missing not at random, class imbalance and prediction over a longer window (2 years). We select three ensemble predictive models (AdaBoost.M1, Gradient Boosting and Extremely Randomized Trees) and apply this approach across 7 years to assess its robustness. The Hybrid Reduced approach includes two sub-approaches: Hybrid Reduced Organic and Hybrid Reduced Imputed. Throughout the experiments, Hybrid Reduced Imputed is the best performer and achieves a 5-7% improvement in precision. By simply using this approach, we could save $278 million for healthcare and improve people’s health condition
A Survey of Methods for Handling Disk Data Imbalance
Class imbalance exists in many classification problems, and since the data is
designed for accuracy, imbalance in data classes can lead to classification
challenges with a few classes having higher misclassification costs. The
Backblaze dataset, a widely used dataset related to hard disks, has a small
amount of failure data and a large amount of healthy data, and thus exhibits a
serious class imbalance. This paper provides a comprehensive overview of
research in the field of imbalanced data classification. The discussion is
organized into three main aspects: data-level methods, algorithmic-level
methods, and hybrid methods. For each type of method, we summarize and analyze
the existing problems, algorithmic ideas, strengths, and weaknesses.
Additionally, the challenges of unbalanced data classification are discussed,
along with strategies to address them. It is convenient for researchers to
choose the appropriate method according to their needs
A study of feature extraction techniques for classifying topics and sentiments from news posts
Many news channels now have their own Facebook pages, on which news posts are released on a daily basis. These news posts contain temporal opinions about social events that may change over time due to external factors, and they can also serve as a monitor of significant events happening around the world. As a result, much text mining research has been conducted in the area of Temporal Sentiment Analysis, one of whose most challenging tasks is to detect and extract the key features from news posts that arrive continuously over time. Extracting these features is challenging because of the posts’ complex properties; in addition, posts about a specific topic may grow or vanish over time, producing imbalanced datasets. This study therefore presents a comparative analysis of feature extraction techniques (TF-IDF, TF, BTO, IG, Chi-square) with three different n-gram features (Unigram, Bigram, Trigram), using SVM as the classifier. The aim of this study is to discover the optimal Feature Extraction Technique (FET) that achieves the best accuracy for both topic and sentiment classification. The analysis is conducted on datasets from three news channels. The experimental results for topic classification show that Chi-square with unigrams is the best FET compared with the other techniques. Furthermore, to overcome the problem of imbalanced data, this study combined the best FET with oversampling. The evaluation results show an improvement in the classifier’s performance, with higher accuracies of 93.37%, 92.89%, and 91.92% for BBC, Al-Arabiya, and Al-Jazeera, respectively, compared with those obtained on the original datasets. Similarly, the same combination (Chi-square+Unigram) was used for sentiment classification and obtained accuracies of 81.87%, 70.01%, and 77.36%. However, testing the identified optimal FET on unseen, randomly selected news posts showed relatively low accuracies for both topic and sentiment classification, due to changes in topics and sentiments over time
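To make the Chi-square plus unigram combination concrete, the following sketch scores a unigram against a class with the standard 2x2 chi-square statistic, N(AD - BC)^2 / ((A+B)(C+D)(A+C)(B+D)). The four toy posts, the whitespace tokenizer and the function name `chi_square` are assumptions for illustration; the abstract does not describe the study's own preprocessing or implementation.

```python
def chi_square(term, target, docs):
    """Chi-square statistic between the presence of `term` and membership
    of class `target`. `docs` is a list of (set_of_unigrams, label)."""
    a = b = c = d = 0
    for tokens, label in docs:
        has_term = term in tokens
        in_class = label == target
        if has_term and in_class:
            a += 1          # term present, target class
        elif has_term:
            b += 1          # term present, other class
        elif in_class:
            c += 1          # term absent, target class
        else:
            d += 1          # term absent, other class
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return 0.0 if denom == 0 else n * (a * d - b * c) ** 2 / denom


# Hypothetical toy posts; unigrams come from a plain whitespace split.
posts = [
    ("goal scored in final match", "sport"),
    ("team wins the match", "sport"),
    ("election results announced", "politics"),
    ("parliament debates the budget", "politics"),
]
docs = [(set(text.split()), label) for text, label in posts]

vocabulary = sorted(set().union(*(tokens for tokens, _ in docs)))
ranked = sorted(vocabulary, key=lambda t: chi_square(t, "sport", docs), reverse=True)

print(chi_square("match", "sport", docs))  # 4.0: appears only in sport posts
print(chi_square("the", "sport", docs))    # 0.0: spread evenly over both classes
```

Keeping only the top-ranked unigrams would then give the reduced feature set that is passed to the classifier.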
A Semi-Supervised Algorithm for Detecting Extremism Propaganda Diffusion on Social Media
European Social Fund; the Spanish Ministry of Economy and Competitiveness (Project Reference: FFI2016-79748-R); Junta de Andalucía (Project References: P18-FR-5020 and A-HUM-250-UGR18); Spanish Ministry of Economy and Competitiveness 2017 FPI Predoctoral Programme (Grant Number: BES-2017-081202)
Learning from high-dimensional and class-imbalanced datasets using random forests
Class imbalance and high dimensionality are two major issues in several real-life applications, e.g., in the fields of bioinformatics, text mining and image classification. However, while both issues have been extensively studied in the machine learning community, they have mostly been treated separately, and little research has so far been conducted on which approaches might be best suited to datasets that are class-imbalanced and high-dimensional at the same time (i.e., with a large number of features). This work attempts to contribute to this challenging research area by studying the effectiveness of hybrid learning strategies that integrate feature selection techniques, to reduce the data dimensionality, with methods that cope with the adverse effects of class imbalance (in particular, data balancing and cost-sensitive methods are considered). Extensive experiments have been carried out across datasets from different domains, leveraging a well-known classifier, the Random Forest, which has proven to be effective in high-dimensional spaces and has also been successfully applied to imbalanced tasks. Our results give evidence of the benefits of such a hybrid approach when compared to using feature selection or imbalance learning methods alone
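One common way to make a classifier cost-sensitive is to weight each class inversely to its frequency. The sketch below computes such weights with the widely used heuristic n_samples / (n_classes * class_count) and uses them in a weighted error rate. This is an assumed illustration of the general idea, not the weighting scheme used in the paper; the labels and the names `balanced_class_weights` and `weighted_error` are hypothetical.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count), so that
    rarer classes receive proportionally larger weights."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


def weighted_error(y_true, y_pred, weights):
    """Error rate in which each mistake costs the weight of its true class."""
    total = sum(weights[t] for t in y_true)
    wrong = sum(weights[t] for t, p in zip(y_true, y_pred) if t != p)
    return wrong / total


# Hypothetical imbalanced labels: 90 majority examples, 10 minority examples.
y_true = ["majority"] * 90 + ["minority"] * 10
weights = balanced_class_weights(y_true)

# A trivial classifier that always predicts the majority class.
y_pred = ["majority"] * 100

print(weights)
print(weighted_error(y_true, y_pred, weights))
```

Under these weights, the always-majority classifier is charged half of the total cost even though its plain accuracy is 90%, which is the effect a cost-sensitive learner is meant to exploit.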