9 research outputs found

    A Modified Support Vector Machine Classifiers Using Stochastic Gradient Descent with Application to Leukemia Cancer Type Dataset

    Support vector machines (SVMs) are supervised learning models that analyze data for classification or regression. For classification, SVM is widely used: it selects an optimal hyperplane that separates two classes. SVM has very good accuracy and is extremely robust compared with some other classification methods such as linear logistic regression, random forest, k-nearest neighbor, and the naïve model. However, working with large datasets can cause problems such as long running times and inefficient results. In this paper, the SVM is modified by using a stochastic gradient descent process. The modified method, stochastic gradient descent SVM (SGD-SVM), is checked using two simulation datasets. Since the classification of different cancer types is important for cancer diagnosis and drug discovery, SGD-SVM is applied to classify the most common leukemia cancer type dataset. The results obtained using SGD-SVM are more accurate than the results of many other studies that used the same leukemia datasets.
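
    The abstract does not give the update rule, so the following is only a minimal sketch of how a linear SVM can be trained with stochastic gradient descent on the L2-regularized hinge loss. The function names, the regularization strength `lam`, and the decaying step-size schedule are illustrative assumptions, not the authors' SGD-SVM.

```python
import random


def train_sgd_svm(X, y, lam=0.01, eta0=0.1, epochs=50, seed=0):
    """Linear SVM trained by SGD on the L2-regularized hinge loss.

    X is a list of feature lists; y holds labels in {-1, +1}.
    lam, eta0 and the step-size schedule are assumptions for illustration.
    """
    rng = random.Random(seed)
    n, d = len(X), len(X[0])
    w = [0.0] * d
    b = 0.0
    t = 0
    for _ in range(epochs):
        order = list(range(n))
        rng.shuffle(order)              # visit samples in random order
        for i in order:
            t += 1
            eta = eta0 / (1.0 + 0.01 * t)   # slowly decaying step size (assumed)
            score = sum(wj * xj for wj, xj in zip(w, X[i])) + b
            margin = y[i] * score
            # Gradient of the regularizer shrinks the weights on every step.
            w = [(1.0 - eta * lam) * wj for wj in w]
            # The hinge loss contributes only when the margin is violated.
            if margin < 1.0:
                w = [wj + eta * y[i] * xj for wj, xj in zip(w, X[i])]
                b += eta * y[i]
    return w, b


def predict(w, b, x):
    """Return +1 or -1 depending on the side of the hyperplane."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b >= 0 else -1
```

    Each update touches a single sample, which is why this style of training scales to large datasets where a batch solver becomes slow.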

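    Evaluasi Kinerja MLLIB APACHE SPARK pada Klasifikasi Berita Palsu dalam Bahasa Indonesia (Performance Evaluation of MLlib Apache Spark on Fake News Classification in the Indonesian Language)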

    Machine learning is used to analyze, classify, or predict data. Machine learning tasks need tools with strong performance and a robust environment to achieve good accuracy and time efficiency. MLlib Apache Spark is a machine learning library with excellent capabilities and speed, because MLlib processes data in memory. This research uses MLlib Apache Spark to classify fake news in the Indonesian language, using 1786 data items obtained from TurnBackHoax.id, a site that provides fake news and facts. The classification algorithms applied were Naïve Bayes, Gradient-Boosted Tree, SVM, and Logistic Regression. The four algorithms were chosen because some have a proven ability to classify and others are rarely used but also classify well. Data processing stages include preprocessing, feature extraction, and algorithm implementation. Evaluation was based on accuracy, test error, f1-score, confusion matrix, and running time. The results showed that MLlib Apache Spark performs fast and well: the fastest running time was 6.46 seconds, using the Logistic Regression algorithm. The accuracy obtained was also quite good, with an average test error across the four algorithms of only 0.180. The F1-scores obtained on the four algorithms were also quite good, averaging 0.818. The confusion matrices were also good, because the number of correct predictions was far greater than the number of incorrect ones.
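
    The evaluation measures named above are all derived from the confusion matrix. The sketch below shows the relationship for a binary task in plain Python. It is not the MLlib evaluator, the function names are hypothetical, and it assumes the binary (positive-class) F1 definition, which may differ from the averaging used in the study.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary task."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive and truth == positive:
            tp += 1
        elif pred == positive:
            fp += 1
        elif truth == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def binary_metrics(y_true, y_pred, positive=1):
    """Derive accuracy, test error and F1 from the confusion counts."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred, positive)
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "accuracy": accuracy,
        "test_error": 1.0 - accuracy,   # test error is the complement of accuracy
        "f1": f1,
        "confusion": {"tp": tp, "fp": fp, "fn": fn, "tn": tn},
    }
```

    Under this definition, an average test error of 0.180 corresponds to an average accuracy of 0.820.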

    SLiSeS: Subsampled Line Search Spectral Gradient Method for Finite Sums

    The spectral gradient method is known to be a powerful low-cost tool for solving large-scale optimization problems. In this paper, our goal is to exploit its advantages in the stochastic optimization framework, especially in the case of mini-batch subsampling that is often used in big data settings. To allow the spectral coefficient to properly explore the underlying approximate Hessian spectrum, we keep the same subsample for several iterations before subsampling again. We analyze the required algorithmic features and the conditions for almost sure convergence, and present initial numerical results that show the advantages of the proposed method.
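
    The idea of holding a subsample fixed for several iterations can be sketched on a least-squares finite sum. The code below is an illustration under stated assumptions, not the SLiSeS algorithm itself: the Barzilai-Borwein (BB1) formula stands in for the spectral coefficient, a plain monotone Armijo backtracking stands in for the paper's line search, and all names, safeguards, and default values are hypothetical.

```python
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def batch_loss(w, A, b, S):
    """Mean of 0.5 * (a_i . w - b_i)^2 over the subsample S."""
    return sum(0.5 * (dot(A[i], w) - b[i]) ** 2 for i in S) / len(S)


def batch_grad(w, A, b, S):
    """Gradient of batch_loss with respect to w."""
    g = [0.0] * len(w)
    for i in S:
        r = dot(A[i], w) - b[i]
        for j, aij in enumerate(A[i]):
            g[j] += r * aij / len(S)
    return g


def subsampled_spectral_gradient(A, b, w0, batch=5, inner=5, outer=60,
                                 step0=1.0, c=1e-4, seed=0):
    rng = random.Random(seed)
    w = list(w0)
    for _ in range(outer):
        S = rng.sample(range(len(A)), batch)    # draw a fresh subsample
        g = batch_grad(w, A, b, S)
        step = step0                            # reset the step for the new subsample
        for _ in range(inner):                  # reuse S for `inner` iterations
            f_old, gg = batch_loss(w, A, b, S), dot(g, g)
            for _ in range(30):                 # Armijo backtracking on the subsample
                w_new = [wj - step * gj for wj, gj in zip(w, g)]
                if batch_loss(w_new, A, b, S) <= f_old - c * step * gg:
                    break
                step *= 0.5
            g_new = batch_grad(w_new, A, b, S)
            s = [p - q for p, q in zip(w_new, w)]
            y = [p - q for p, q in zip(g_new, g)]
            sy = dot(s, y)
            if sy > 1e-12:
                # BB1 spectral step, safeguarded to a bounded interval.
                step = min(max(dot(s, s) / sy, 1e-4), 1e4)
            w, g = w_new, g_new
    return w
```

    Because `s` and `y` are computed on the same subsample, their ratio reflects the curvature of one fixed approximate Hessian, which is the reason for not resampling at every iteration.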

    Credit Card Fraud Detection Using Machine Learning Algorithms

    One of the main challenges to the security of an online business is credit card fraud. For this reason, algorithms based on artificial intelligence and machine learning are being introduced to detect card fraud as accurately and quickly as possible. This paper presents an approach to card fraud detection based on machine learning algorithms, specifically a multilayer perceptron (MLP) and a decision tree. These algorithms were trained and tested on a publicly available card fraud data set. The data set consists of 7 characteristics of each card transaction and a label indicating whether card fraud occurred. In total, the data set contains information on 1,000,000 transactions, and it is highly imbalanced. To handle the class imbalance, random undersampling, SMOTE, and SMOTE-Tomek algorithms were proposed. The results show that the highest performance is achieved when the MLP (AUC = 0.99, f1 = 0.99, MCC = 0.98, and Kappa = 0.98) and the decision tree (AUC = 0.99, f1 = 0.99, MCC = 0.99, and Kappa = 0.98) are trained on the data set re-sampled with the SMOTE-Tomek algorithm. When the algorithms are examined using fewer transaction characteristics, classification performance decreases significantly for the decision tree combined with SMOTE-Tomek. For the MLP combined with SMOTE-Tomek, however, the decrease in performance is significantly smaller, pointing to higher robustness to input vector dimension reduction. Such a robust system can provide information about transaction validity even when the input data is limited to a few input variables. From these results, it can be concluded that the MLP combined with the SMOTE-Tomek algorithm can be used for credit card fraud detection, even with a lower number of input variables.
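
    To illustrate the oversampling half of SMOTE-Tomek, the sketch below generates synthetic minority samples by interpolating between a minority point and one of its nearest minority neighbours. It is a simplified, SMOTE-style illustration with hypothetical names and defaults. It omits the Tomek-link cleaning step and is not the implementation used in the study.

```python
import random


def smote_like_oversample(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points from a list of minority-class points.

    Each synthetic point lies on the segment between a randomly chosen
    minority point and one of its k nearest minority neighbours.
    Requires at least two minority points.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Sort the other minority points by squared Euclidean distance to base.
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(p, base)),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append([a + gap * (b - a) for a, b in zip(base, other)])
    return synthetic
```

    Adding such points to the fraud class until the classes are balanced avoids discarding legitimate transactions, which is the trade-off against random undersampling.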

    Parameter Selection Method for Support Vector Regression Based on Adaptive Fusion of the Mixed Kernel Function

    The support vector regression algorithm is widely used in fault diagnosis of rolling bearings. This paper proposes a new model parameter selection method for support vector regression based on adaptive fusion of the mixed kernel function. We choose the mixed kernel function as the kernel function of support vector regression. The fusion coefficients of the mixed kernel function, the kernel function parameters, and the regression parameters are combined as the components of the state vector. Thus, the model selection problem is transformed into a nonlinear system state estimation problem. We use a 5th-degree cubature Kalman filter to estimate the parameters. In this way, we realize the adaptive selection of the weighting coefficients of the mixed kernel function, the kernel parameters, and the regression parameters. Compared with a single kernel function, unscented Kalman filter (UKF) support vector regression algorithms, and genetic algorithms, the decision regression function obtained by the proposed method has better generalization ability and higher prediction accuracy.
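
    The abstract does not say which kernels are fused. As a hedged illustration, the sketch below assumes a convex combination of an RBF kernel and a polynomial kernel, with a single fusion weight. The function names and default parameter values are hypothetical. In the paper's setting, the weight and kernel parameters would be the quantities estimated by the filter rather than fixed by hand.

```python
import math


def rbf_kernel(x, y, gamma):
    """Gaussian (RBF) kernel: exp(-gamma * ||x - y||^2)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))


def poly_kernel(x, y, degree, coef0):
    """Polynomial kernel: (x . y + coef0)^degree."""
    return (sum(a * b for a, b in zip(x, y)) + coef0) ** degree


def mixed_kernel(x, y, weight, gamma=0.5, degree=2, coef0=1.0):
    """Convex combination of the two kernels; weight is the fusion coefficient."""
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    return (weight * rbf_kernel(x, y, gamma)
            + (1.0 - weight) * poly_kernel(x, y, degree, coef0))
```

    A convex combination of two valid kernels is itself a valid kernel, so the fused function can be used wherever a single kernel would be.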