159 research outputs found

    CABLE NEWS NETWORK (CNN) ARTICLES CLASSIFICATION USING RANDOM FOREST ALGORITHM WITH HYPERPARAMETER OPTIMIZATION

    The volume of news articles on the internet grows rapidly, so articles need to be grouped into categories for easy access. Classification is one method for grouping news articles, and random forest, which is built on decision trees, is one classification method. This research discusses the application of random forest to classify news articles into six categories: business, entertainment, health, politics, sport, and news. The data are Cable News Network (CNN) articles from 2011 to 2022. The data are text and large in volume, so careful handling is needed to avoid overfitting and underfitting. Random forest suits these data because the algorithm works very well on large amounts of data. However, random forest is difficult to interpret if the combination of parameters is not appropriate for the data processing. Therefore, hyperparameter optimization is needed to find the best combination of parameters in the random forest. This research uses the search cross-validation (SearchCV) method to optimize the random forest hyperparameters by testing the combinations one by one and validating each. The resulting classification of news articles into six categories achieves an accuracy of 0.81 on training and 0.76 on testing.
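
The search described above, trying each hyperparameter combination and validating it, can be sketched in plain Python. This is a minimal illustration and not the authors' code: `kfold_indices`, `grid_search_cv`, and `fit_threshold` are hypothetical names, the data are made up, and a one-parameter threshold rule stands in for a random forest so that the example needs no external libraries.

```python
from itertools import product
from statistics import mean


def kfold_indices(n_samples, k):
    """Split range(n_samples) into k contiguous (train, validation) index pairs."""
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in sizes:
        val = list(range(start, start + size))
        train = [i for i in range(n_samples) if i < start or i >= start + size]
        folds.append((train, val))
        start += size
    return folds


def grid_search_cv(fit, score, X, y, param_grid, k=3):
    """Try every combination in param_grid; return (best_params, best_mean_score)."""
    names = sorted(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        fold_scores = []
        for train, val in kfold_indices(len(X), k):
            model = fit([X[i] for i in train], [y[i] for i in train], **params)
            fold_scores.append(score(model, [X[i] for i in val], [y[i] for i in val]))
        avg = mean(fold_scores)
        if avg > best_score:
            best_params, best_score = params, avg
    return best_params, best_score


# Toy stand-in model (assumption): predict class 1 when x exceeds the threshold.
# It ignores the training data; a real model such as a random forest would learn from it.
def fit_threshold(X_train, y_train, threshold):
    return lambda x: 1 if x > threshold else 0


def accuracy(model, X_val, y_val):
    return sum(model(x) == t for x, t in zip(X_val, y_val)) / len(y_val)


# Made-up data for demonstration only.
X = [1, 9, 2, 8, 3, 7, 4, 6, 0, 10, 2, 9]
y = [0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1]
best_params, best_score = grid_search_cv(
    fit_threshold, accuracy, X, y, {"threshold": [0, 5, 8]}, k=3
)
```

In scikit-learn, the same exhaustive pattern is provided by `GridSearchCV`, and a sampled variant by `RandomizedSearchCV`; which of the two the abstract means by "SearchCV" is not stated.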

    Application of Random Forest and AdaBoost for DDoS Attack Classification (Penerapan Random Forest dan Adaboost untuk Klasifikasi Serangan DDoS)

    Among the different types of attacks in the field of Information Technology, DDoS attacks are one of the biggest threats to internet sites and pose a devastating risk to the security of computer systems, mainly due to their potential impact. Consequently, research in this area is growing rapidly, with researchers focusing on new ways to address intrusion detection and prevention. Machine learning and Artificial Intelligence are among the latest technologies studied for intrusion detection classification. This study explores the behavior and application of DDoS datasets for machine learning in the context of intrusion detection. The study first collects raw DDoS datasets from reputable sources. After the data is obtained, the final data set is created for modeling. Data management involves data cleansing, data type transformation, and data exchange on the collected data. The selection process is accompanied by a model. Two separate algorithms, random forest and AdaBoost, are used to train a model on the dataset. The model is validated and retrained with k-fold cross-validation, and is finally evaluated on unseen data. The result is determined by various output measures. The experiment used the CICDDoS_2019 DDoS dataset, and intrusion detection performance on this dataset was analyzed using two machine learning models. The dataset is divided in an 80:20 ratio for model training, validation, and testing. Machine learning models are selected systematically and carefully to ensure that experiments are conducted in the right way. The results were analyzed using a set of performance metrics, including accuracy, precision, recall, F-measure, and compute time.
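
The classification metrics named above can be computed from binary confusion counts. The sketch below is an illustration only: `binary_metrics` is a hypothetical helper, the encoding 1 for attack and 0 for benign traffic is an assumption, and the label lists are made up for demonstration rather than taken from CICDDoS_2019.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Return accuracy, precision, recall, and F-measure for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)

    accuracy = (tp + tn) / len(pairs)
    # Guard against division by zero when a class is never predicted or never present.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_measure = (
        2 * precision * recall / (precision + recall) if precision + recall else 0.0
    )
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
    }


# Made-up labels (assumption): 1 = attack, 0 = benign.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 1, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 1, 1, 0]
metrics = binary_metrics(y_true, y_pred)
```

Here one attack is missed and one benign flow is flagged, so precision and recall both come out at 4/5.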

    Predicting eBay Prices: Selecting and Interpreting Machine Learning Models – Results of the AG DANK 2018 Data Science Competition

    The annual meeting of the work group on data analysis and numeric classification (DANK) took place at Stralsund University of Applied Sciences, Germany, on October 26th and 27th, 2018, with a focus theme on interpretable machine learning. Traditionally, the conference is accompanied by a data science competition where the participants are invited to analyze one or several data sets and to compare and discuss their solutions. In 2018, the task was to predict end prices of eBay auctions. The paper describes the task and discusses the results provided by the conference participants. These cover aspects of preprocessing, comparison of different models, task-specific hyperparameter tuning, the interpretation of the resulting models, and the relevance of additional text information.

    Performance evaluation and hyperparameter tuning of statistical and machine-learning models using spatial data

    Machine-learning algorithms have gained popularity in recent years in the field of ecological modeling due to their promising results in predictive performance of classification problems. While the application of such algorithms has been greatly simplified in recent years due to their well-documented integration in commonly used statistical programming languages such as R, there are several practical challenges in the field of ecological modeling related to unbiased performance estimation, optimization of algorithms using hyperparameter tuning, and spatial autocorrelation. We address these issues in the comparison of several widely used machine-learning algorithms such as Boosted Regression Trees (BRT), k-Nearest Neighbor (WKNN), Random Forest (RF), and Support Vector Machine (SVM) to traditional parametric algorithms such as logistic regression (GLM) and semi-parametric ones like Generalized Additive Models (GAM). Different nested cross-validation methods, including hyperparameter tuning methods, are used to evaluate model performances with the aim of obtaining bias-reduced performance estimates. As a case study, the spatial distribution of forest disease (Diplodia sapinea) in the Basque Country in Spain is investigated using common environmental variables such as temperature, precipitation, soil, or lithology as predictors. Results show that GAM and Random Forest (RF) (mean AUROC estimates 0.708 and 0.699) outperform all other methods in predictive accuracy. The effect of hyperparameter tuning saturates at around 50 iterations for this data set. The AUROC differences between the bias-reduced (spatial cross-validation) and overoptimistic (non-spatial cross-validation) performance estimates of the GAM and RF are 0.167 (24%) and 0.213 (30%), respectively. It is recommended to also use spatial partitioning for cross-validation hyperparameter tuning of spatial data. The models developed in this study enhance the detection of Diplodia sapinea in the Basque Country compared to previous studies.
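
The difference between non-spatial and spatial partitioning can be illustrated with a small sketch. It is not the study's implementation: `random_folds` and `spatial_block_folds` are hypothetical helpers, grouping points into equal-width strips along the x coordinate is an assumed and simplified partitioning scheme, and the coordinates are synthetic.

```python
import random


def random_folds(points, k, seed=0):
    """Non-spatial CV: shuffle point indices and deal them into k folds."""
    idx = list(range(len(points)))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def spatial_block_folds(points, k):
    """Spatial CV: each fold holds the points of one vertical strip of the area."""
    xs = [x for x, _ in points]
    x_min, x_max = min(xs), max(xs)
    width = ((x_max - x_min) / k) or 1.0  # avoid zero width if all x are equal
    folds = [[] for _ in range(k)]
    for i, (x, _) in enumerate(points):
        strip = min(int((x - x_min) / width), k - 1)  # clamp the right edge
        folds[strip].append(i)
    return folds


def x_range(points, fold):
    """Smallest and largest x coordinate among the points of one fold."""
    xs = [points[i][0] for i in fold]
    return min(xs), max(xs)


# Synthetic coordinates (assumption): a 10 x 10 grid of sample locations.
points = [(float(x), float(y)) for x in range(10) for y in range(10)]
spatial = spatial_block_folds(points, k=5)
nonspatial = random_folds(points, k=5)
```

With random folds, validation points typically sit next to training points, so spatially autocorrelated data can make a model look better than it is. With block folds, each validation set is a contiguous region, so its points have training neighbours only along the strip boundary, which is the motivation for the bias-reduced estimates reported above.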

    IMPLEMENTATION OF THE RANDOM FOREST ALGORITHM IN CLASSIFYING THE ACCURACY OF GRADUATION TIME FOR COMPUTER ENGINEERING STUDENTS AT DIAN NUSWANTORO UNIVERSITY

    One way to secure a university's continued existence is to improve student performance so that students graduate on time. A high percentage of on-time graduation results in a good accreditation assessment for the department. However, many factors affect the graduation rate, such as a student's academic performance, extracurricular activities, and other factors. The object of this study is the graduation data of students in the Computer Science program at the Faculty of Computer Science, Dian Nuswantoro University, for the academic years 2008-2017. The objective of this research is to create the best classification model using the Random Forest algorithm to predict whether students graduate on time, which will be useful for future policy making. The classification using this algorithm achieved an accuracy of 93% on the training data and 91% on the test data.

    Evaluation of sleep stage classification using feature importance of EEG signal for big data healthcare

    Sleep analysis is widely studied experimentally because of its importance to health care. Sufficient sleep is essential for a healthy life, and people often spend almost a third of their lives sleeping. However, not every individual has the same sleep pattern, whether because of general health or disorders such as insomnia, apnea, bruxism, epilepsy, and narcolepsy. Therefore, this study aims to determine the classification patterns of sleep stages, using big data for health care. The approach used a high-dimensional FFT extraction algorithm, together with feature importance and classifier tuning, to develop an accurate classification. The results showed that the proposed method led to more accurate classification than previous techniques. This was because the previous experiments had been conducted with the feature selection model, with accuracy used as the performance evaluation. Meanwhile, the EEG sleep stage classification model in the present report was composed of the feature selection and importance of the extraction stage. The previous and present experiments also reached the highest values of accuracy, with the Random Forest and SVM models using 2000 and 3000 features (87.19% and 89.19%, respectively). In this article, we propose an analysis showing that feature importance influences the model's accuracy. This was because the proposed method was easily fine-tuned and optimized for each subject to improve sensitivity and reduce false negative occurrences.
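
FFT-based feature extraction turns a signal window into frequency-domain magnitudes that a classifier can use as features. The sketch below is illustrative only: it uses a naive O(n²) discrete Fourier transform rather than a fast FFT routine, `dft_magnitudes` is a hypothetical helper, and the input is a synthetic sine wave, not EEG data.

```python
import cmath
import math


def dft_magnitudes(signal):
    """Return normalized magnitudes of DFT bins 0..n//2 for a real-valued signal."""
    n = len(signal)
    mags = []
    for k in range(n // 2 + 1):
        # Correlate the signal with a complex sinusoid of k cycles per window.
        coeff = sum(
            x * cmath.exp(-2j * cmath.pi * k * t / n) for t, x in enumerate(signal)
        )
        mags.append(abs(coeff) / n)
    return mags


# Synthetic window (assumption): 64 samples of a sine with 4 cycles per window.
n = 64
signal = [math.sin(2 * math.pi * 4 * t / n) for t in range(n)]
features = dft_magnitudes(signal)

# Index of the strongest frequency bin in the feature vector.
dominant_bin = max(range(len(features)), key=features.__getitem__)
```

Each entry of `features` is one candidate input feature; ranking such features by an importance score and keeping the top ones is the kind of selection step the abstract refers to, though its exact procedure is not given here.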