14 research outputs found

    Exploring Sentiment Analysis Techniques in Natural Language Processing: A Comprehensive Review

    Full text link
    Sentiment analysis (SA) is the automated process of detecting and understanding the emotions conveyed through written text. Over the past decade, SA has gained significant popularity in the field of Natural Language Processing (NLP). With the widespread use of social media and online platforms, SA has become crucial for companies to gather customer feedback and shape their marketing strategies. Additionally, researchers rely on SA to analyze public sentiment on various topics. In this particular research study, a comprehensive survey was conducted to explore the latest trends and techniques in SA. The survey encompassed a wide range of methods, including lexicon-based, graph-based, network-based, machine learning, deep learning, ensemble-based, rule-based, and hybrid techniques. The paper also addresses the challenges and opportunities in SA, such as dealing with sarcasm and irony, analyzing multi-lingual data, and addressing ethical concerns. To provide a practical case study, Twitter was chosen as one of the largest online social media platforms. Furthermore, the researchers shed light on the diverse application areas of SA, including social media, healthcare, marketing, finance, and politics. The paper also presents a comparative and comprehensive analysis of existing trends and techniques, datasets, and evaluation metrics. The ultimate goal is to offer researchers and practitioners a systematic review of SA techniques, identify existing gaps, and suggest possible improvements. This study aims to enhance the efficiency and accuracy of SA processes, leading to smoother and error-free outcomes

    Intrusion Detection System Berbasis Seleksi Fitur Dengan Kombinasi Filter Information Gain Ratio Dan Correlation

    Get PDF
    Intrusion Detection System merupakan suatu sistem yang dikembangkan untuk memantau dan memfilter aktivitas jaringan dengan mengidentifikasi serangan. Karena jumlah data yang perlu diperiksa oleh IDS sangat besar dan banyaknya fitur-fitur asing yang dapat membuat proses analisis menjadi sulit untuk mendeteksi pola perilaku yang mencurigakan, maka IDS perlu mengurangi jumlah data yang akan diproses dengan cara mengurangi fitur yang dapat dilakukan dengan seleksi fitur. Pada penelitian ini mengkombinasikan dua metode perangkingan fitur yaitu Information Gain Ratio dan Correlation dan mengklasifikasikannya menggunakan algoritma K-Nearest Neighbor. Hasil perankingan dari kedua metode dibagi menjadi dua kelompok. Pada kelompok pertama dicari nilai mediannya dan untuk kelompok kedua dihapus. Lalu dilakukan klasifikasi K-Nearest Neighbor dengan menggunakan 10 kali validasi silang dan dilakukan pengujian dengan nilai k=5. Penerapan pemodelan yang diusulkan menghasilkan akurasi tertinggi sebesar 99.61%. Sedangkan untuk akurasi tanpa seleksi fitur menghasilkan akurasi tertinggi sebesar 99.59%. AbstractIntrusion Detection System is a system that was developed for monitoring and filtering activity in network with identified of attack. Because of the amount of the data that need to be checked by IDS is very large and many foreign feature that can make the analysis process difficult for detection suspicious pattern of behavior, so that IDS need for reduce amount of the data to be processed by reducing features that can be done by feature selection. In this study, combines two methods of feature ranking is Information Gain Ratio and Correlation and classify it using K-Nearest Neighbor algorithm. The result of feature ranking from the both methods divided into two groups. in the first group searched for the median value and in the second group is removed. Then do the classification of  K-Nearest Neighbor using 10 fold cross validation and do the tests with values k=5. The result of the  proposed modelling produce the highest accuracy of 99.61%. While the highest accuracy value of the not using the feature selection is 99.59%

    Comparative Analysis of Building Insurance Prediction Using Some Machine Learning Algorithms

    Get PDF
    In finance and management, insurance is a product that tends to reduce or eliminate in totality or partially the loss caused due to different risks. Various factors affect house insurance claims, some of which contribute to formulating insurance policies including specific features that the house has. Machine Learning (ML) when brought into the field of insurance would enable seamless formulation of insurance policies with a better performance which will also save time. Various classification algorithms have been used since they have a long history and have also got some modifications for optimum functionality. To illustrate the performance of each of the ML algorithms that we used here, we analyzed an insurance dataset drawn from Zindi Africa competition which is said to be from Olusola Insurance Company in Lagos Nigeria. This study therefore, compares the performance of Logistic Regression (LR), Decision Tree (DT), K-Nearest Neighbor (KNN), Kernel Support Vector Machine (kSVM), NaĂŻve Bayes (NB), and Random Forest (RF) Regressors on a dataset got from Zindi.africa competition and their performances are checked using not only accuracy and precision metrics but also recall, and F1 score metrics, all displayed on the confusion matrix. The accuracy result shows that logistic regression and Kernel SVM both gave 78% but kSVM outperformed LR in precision with a percentage of 70.8% for kSVM and 64.8% for LR showing that kSVM offered the best result

    Lexicon‐pointed hybrid N‐gram Features Extraction Model (LeNFEM) for sentence level sentiment analysis

    Get PDF
    open access articleSentiment analysis of social media textual posts can provide information and knowledge that is applicable in social settings, business intelligence, evaluation of citizens' opinions in governance, and in mood triggered devices in the Internet of Things. Feature extraction and selection is a key determinant of accuracy and computational cost of machine learning models for such analysis. Most feature extraction and selection techniques utilize bag of words, N‐grams, and frequency‐based algorithms especially Term Frequency‐Inverse Document Frequency. However, these approaches do not consider relationships between words, they ignore words' characteristics and they suffer high feature dimensionality. In this paper we propose and evaluate a feature extraction and selection approach that utilizes a fixed hybrid N‐gram window for feature extraction and minimum redundancy maximum relevance feature selection algorithm for sentence level sentiment analysis. The approach improves the existing features extraction techniques, specifically the N‐gram by generating a hybrid vector from words, Part of Speech (POS) tags, and word semantic orientation. The vector is extracted by using a static trigram window identified by a lexicon where a sentiment word appears in a sentence. A blend of the words, POS tags, and the sentiment orientations of the static trigram are used to build the feature vector. The optimal features from the vector are then selected using minimum redundancy maximum relevance (MRMR) algorithm. Experiments were carried out using the public Yelp dataset to compare the performance of the proposed model and existing feature extraction models (BOW, normal N‐grams and lexicon‐based bag of words semantic orientations). Using supervised machine learning classifiers the experimental results showed that the proposed model had the highest F‐measure (88.64%) compared to the highest (83.55%) from baseline approaches. Wilcoxon test carried out ascertained that the proposed approach performed significantly better than the baseline approaches. Comparative performance analysis with other datasets further affirmed that the proposed approach is generalizable

    Unionization method for changing opinion in sentiment classification using machine learning

    Get PDF
    Sentiment classification aims to determine whether an opinionated text expresses a positive, negative or neutral opinion. Most existing sentiment classification approaches have focused on supervised text classification techniques. One critical problem of sentiment classification is that a text collection may contain tens or hundreds of thousands of features, i.e. high dimensionality, which can be solved by dimension reduction approach. Nonetheless, although feature selection as a dimension reduction method can reduce feature space to provide a reduced feature subset, the size of the subset commonly requires further reduction. In this research, a novel dimension reduction approach called feature unionization is proposed to construct a more reduced feature subset. This approach works based on the combination of several features to create a more informative single feature. Another challenge of sentiment classification is the handling of concept drift problem in the learning step. Users’ opinions are changed due to evolution of target entities over time. However, the existing sentiment classification approaches do not consider the evolution of users’ opinions. They assume that instances are independent, identically distributed and generated from a stationary distribution, even though they are generated from a stream distribution. In this study, a stream sentiment classification method is proposed to deal with changing opinion and imbalanced data distribution using ensemble learning and instance selection methods. In relation to the concept drift problem, another important issue is the handling of feature drift in the sentiment classification. To handle feature drift, relevant features need to be detected to update classifiers. Since proposed feature unionization method is very effective to construct more relevant features, it is further used to handle feature drift. Thus, a method to deal with concept and feature drifts for stream sentiment classification was proposed. The effectiveness of the feature unionization method was compared with the feature selection method over fourteen publicly available datasets in sentiment classification domain using three typical classifiers. The experimental results showed the proposed approach is more effective than current feature selection approaches. In addition, the experimental results showed the effectiveness of the proposed stream sentiment classification method in comparison to static sentiment classification. The experiments conducted on four datasets, have successfully shown that the proposed algorithm achieved better results and proving the effectiveness of the proposed method

    An ensemble-based anomaly-behavioural crypto-ransomware pre-encryption detection model

    Get PDF
    Crypto-ransomware is a malware that leverages cryptography to encrypt files for extortion purposes. Even after neutralizing such attacks, the targeted files remain encrypted. This irreversible effect on the target is what distinguishes crypto-ransomware attacks from traditional malware. Thus, it is imperative to detect such attacks during pre-encryption phase. However, existing crypto-ransomware early detection solutions are not effective due to inaccurate definition of the pre-encryption phase boundaries, insufficient data at that phase and the misuse-based approach that the solutions employ, which is not suitable to detect new (zero-day) attacks. Consequently, those solutions suffer from low detection accuracy and high false alarms. Therefore, this research addressed these issues and developed an Ensemble-Based Anomaly-Behavioural Pre-encryption Detection Model (EABDM) to overcome data insufficiency and improve detection accuracy of known and novel crypto-ransomware attacks. In this research, three phases were used in the development of EABDM. In the first phase, a Dynamic Pre-encryption Boundary Definition and Features Extraction (DPBD-FE) scheme was developed by incorporating Rocchio feedback and vector space model to build a pre-encryption boundary vector. Then, an improved term frequency-inverse document frequency technique was utilized to extract the features from runtime data generated during the pre-encryption phase of crypto-ransomware attacks’ lifecycle. In the second phase, a Maximum of Minimum-Based Enhanced Mutual Information Feature Selection (MM-EMIFS) technique was used to select the informative features set, and prevent overfitting caused by high dimensional data. The MM-EMIFS utilized the developed Redundancy Coefficient Gradual Upweighting (RCGU) technique to overcome data insufficiency during pre-encryption phase and improve feature’s significance estimation. In the final phase, an improved technique called incremental bagging (iBagging) built incremental data subsets for anomaly and behavioural-based detection ensembles. The enhanced semi-random subspace selection (ESRS) technique was then utilized to build noise-free and diverse subspaces for each of these incremental data subsets. Based on the subspaces, the base classifiers were trained for each ensemble. Both ensembles employed the majority voting to combine the decisions of the base classifiers. After that, the decision of the anomaly ensemble was combined into behavioural ensemble, which gave the final decision. The experimental evaluation showed that, DPBD-FE scheme reduced the ratio of crypto-ransomware samples whose pre-encryption boundaries were missed from 18% to 8% as compared to existing works. Additionally, the features selected by MM-EMIFS technique improved the detection accuracy from 89% to 96% as compared to existing techniques. Likewise, on average, the EABDM model increased detection accuracy from 85% to 97.88% and reduced the false positive alarms from 12% to 1% in comparison to existing early detection models. These results demonstrated the ability of the EABDM to improve the detection accuracy of crypto-ransomware attacks early and before the encryption takes place to protect files from being held to ransom

    Analisis Sentimen Opini Film Menggunakan Metode NaĂŻve Bayes dengan Ensemble Feature dan Seleksi Fitur Pearson Correlation Coefficient

    Get PDF
    Microblogging telah menjadi media informasi yang sangat populer di kalangan pengguna internet. Oleh karena itu, microblogging menjadi sumber data yang kaya untuk opini dan ulasan terutama pada ulasan film. Pada penelitian ini mengusulkan analisis sentimen opini film dengan menggunakan ensemble feature dan Bag of Words dan seleksi Fitur Pearson Correlation Coefficient untuk mengurangi dimensi fitur dan mendapatkan kombinasi fitur yang optimal. Penggunaan seleksi fitur dilakukan untuk meningkatkan performa dari klasifikasi, mengurangi dimensi fitur dan mendapatkan kombinasi fitur yang optimal. Pada penelitian ini menggunakan beberapa model dari klasifikasi Naïve Bayes yaitu Bernoulli Naïve Bayes untuk tipe data biner, Gaussian Naïve Bayes untuk tipe data kontinu dan Multinomial Naïve Bayes untuk tipe data numeric. Hasil penelitian ini menunjukkan bahwa dengan menggunakan kata tidak baku pada tweet hasil evaluasi yang didapatkan accuracy 82%, precision 86%, recall 79,62% dan fmeasure 82,69% pada seleksi Fitur 20%. Kemudian setelah dilakukan pembakuan kata secara manual hasil evaluasi pada accuracy meningkat sebesar 8% sehingga accuracy menjadi 90%, precision 92%, recall 88,46% dan f-measure 90,19% pada seleksi fitur 85%. Berdasarkan hasil tersebut dapat disimpulkan bahwa penggunaan kata baku dapat meningkatkan performa klasifikasi dan seleksi fitur Pearson’s memberikan kombinasi fitur yang optimal dan mengurangi jumlah total dimensi fitur

    A feature selection model based on genetic rank aggregation for text sentiment classification

    No full text
    Sentiment analysis is an important research direction of natural language processing, text mining and web mining which aims to extract subjective information in source materials. The main challenge encountered in machine learning method-based sentiment classification is the abundant amount of data available. This amount makes it difficult to train the learning algorithms in a feasible time and degrades the classification accuracy of the built model. Hence, feature selection becomes an essential task in developing robust and efficient classification models whilst reducing the training time. In text mining applications, individual filter-based feature selection methods have been widely utilized owing to their simplicity and relatively high performance. This paper presents an ensemble approach for feature selection, which aggregates the several individual feature lists obtained by the different feature selection methods so that a more robust and efficient feature subset can be obtained. In order to aggregate the individual feature lists, a genetic algorithm has been utilized. Experimental evaluations indicated that the proposed aggregation model is an efficient method and it outperforms individual filter-based feature selection methods on sentiment classification. © The Author(s) 2015
    corecore