513 research outputs found

    A comparison of data mining methods for mass real estate appraisal

    We compare the performance of hedonic and non-hedonic pricing models applied to housing valuation in the city of Madrid. Urban areas pose several challenges for data mining because geospatial relations can give rise to distinct market segments. Among the algorithms presented, ensembles of M5 model trees consistently showed superior correlation rates on out-of-sample data. They also improved the mean relative error rate by 23% compared with the popular method of assessing the average price per square meter in each neighborhood, and they outperformed commonplace multiple linear regression models and artificial neural networks on our dataset of 25,415 residential properties. Keywords: mass appraisal, real estate, data mining

    COMPARISON OF DATA MINING ALGORITHM FOR FORECASTING BITCOIN CRYPTO CURRENCY TRENDS

    The popularity of cryptocurrencies has grown in the roughly 10 years since their emergence in 2008. Bitcoin is the most popular cryptocurrency and the one most instrumental to their existence. Like share prices in the capital market, coin prices fluctuate constantly, and they tend to be even more volatile than the stock market. This volatility strongly affects participants in cryptocurrency markets. This study compares forecasting algorithms to identify the one best suited to forecasting the Bitcoin trend. The algorithms compared are supervised learning methods: Neural Network, Linear Regression, Support Vector Machine, Gaussian Process, and Polynomial Regression. Accuracy was measured using 10-fold cross-validation and evaluated by Root Mean Square Error (RMSE). The results showed that the Neural Network was the best forecasting algorithm, with an RMSE of 277,237 +/- 74,736 (micro: 287,208 +/- 0.000), so the Neural Network can be used for forecasting the Bitcoin cryptocurrency.
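    The evaluation protocol named in this abstract, 10-fold cross-validation scored by RMSE and reported as mean +/- spread, can be sketched as follows. This is a minimal illustration only: the one-variable least-squares model, the contiguous (unshuffled) folds, and all function names are assumptions made for the sketch, not the study's actual setup.

```python
import math
import statistics


def rmse(actual, predicted):
    """Root mean square error between two equal-length sequences."""
    return math.sqrt(
        sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)
    )


def k_fold_indices(n, k):
    """Split range(n) into k contiguous folds of near-equal size."""
    sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b * x (stand-in model)."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return my - b * mx, b


def cross_validated_rmse(xs, ys, k=10):
    """Return (mean, standard deviation) of the per-fold RMSE."""
    scores = []
    for test_idx in k_fold_indices(len(xs), k):
        held_out = set(test_idx)
        train_x = [x for i, x in enumerate(xs) if i not in held_out]
        train_y = [y for i, y in enumerate(ys) if i not in held_out]
        a, b = fit_line(train_x, train_y)
        preds = [a + b * xs[i] for i in test_idx]
        scores.append(rmse([ys[i] for i in test_idx], preds))
    return statistics.fmean(scores), statistics.stdev(scores)
```

    Calling `cross_validated_rmse(xs, ys, k=10)` yields a pair that can be printed in the same "mean +/- spread" form as the figure quoted above. For price series, whether folds should be shuffled or kept in time order is a design choice the abstract does not specify.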

    Classifying Network Intrusions: A Comparison of Data Mining Methods

    Network intrusion is an increasingly serious problem experienced by many organizations. In this increasingly hostile environment, networks must be able to detect whether a connection attempt is legitimate or not. The ever-changing nature of these attacks makes them difficult to detect. One solution is to use various data mining methods to determine if the network is being attacked. This paper compares the performance of two data mining methods, a standard artificial neural network (ANN) and an ANN guided by a genetic algorithm (GA), in classifying network connections as normal or attack. Using connection data drawn from a simulated US Air Force local area network, each method was used to construct a predictive model. The models were then applied to validation data and the results were compared. The ANN guided by GA (90.67% correct classification) significantly outperformed the standard ANN (81.75% correct classification), indicating the superiority of the GA-based ANN.

    Comparison of Data Mining and Statistical Techniques for Prediction Model

    The aim of this research is to compare statistical and data mining modeling techniques: statistical Logistic Regression, and the data mining techniques Decision Tree and Neural Network. The performance of these prediction techniques was measured and compared in terms of the overall prediction accuracy (percentage agreement) of each technique. The models were trained using eight training samples drawn with two different sampling techniques. The study also examined how the distribution of the dependent variable's values in the training dataset affects both the overall prediction percentage and the prediction accuracy for the individual "0" and "1" values of the dependent variable. For a given dataset, the results show that the three techniques performed comparably in general, with the Neural Network slightly ahead. One factor that causes prediction accuracy to vary is the distribution of the dependent variable's values, "0" and "1", in the training dataset. For all three techniques, the overall prediction accuracy was high when the distribution ratio of the dependent variable's values in the training data was greater than 1:1, but at the same time the techniques failed to predict the individual values of the dependent variable successfully or at an acceptable rate. If the individual values of the dependent variable need to be predicted with comparable accuracy, the distribution ratio of the dependent variable's values in the training data should be exactly 1:1.

    COMPARISON OF DATA MINING CLASSIFICATION METHODS TO DETECT HEART DISEASE

    Heart disease is deadly and must be treated as early as possible, because late treatment carries a high risk to the patient's life. Factors that cause heart disease include tobacco use, physical inactivity, and an unhealthy diet. Using existing data, this study compares three algorithms, Naive Bayes, Logistic Regression, and Support Vector Machine (SVM), to determine which achieves the best accuracy on the dataset used to predict heart disease. The best accuracy obtained was 87%, produced by the Naive Bayes method.

    COMPARISON OF DATA MINING ALGORITHM FOR CLUSTERING PATIENT DATA HUMAN INFECTIOUS DISEASES

    Tuberculosis is an infectious disease caused by the bacterium Mycobacterium tuberculosis and transmitted through the air. Cases have spread across almost all of Pelalawan Regency, and the number continues to increase every year, which makes it useful to group the areas where the disease spreads. This study groups the areas of tuberculosis distribution by clustering data from the Pelalawan Regency Health Office covering 2020 to 2022. The data for each year were processed with three algorithms: k-medoids, k-means, and x-means. The most suitable algorithm was determined using the Davies-Bouldin Index (DBI). The areas were grouped into three categories: high, medium, and low numbers of cases. For the 2020 data, the best algorithm was k-medoids, with a DBI value of 0.553. For the 2021 data, k-means and x-means performed best, both with a DBI value of 0.582. For the 2022 data, k-means and x-means again performed best, sharing a DBI value of 0.510.
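    For readers unfamiliar with the Davies-Bouldin Index used above to rank the clusterings, the following is a minimal sketch of the standard centroid-based definition with Euclidean distance: for each cluster, take the worst ratio of combined within-cluster scatter to between-centroid distance, then average over clusters. Lower values indicate tighter, better-separated clusters. The function names and toy data are assumptions for illustration; the study's own tooling is not described in the abstract.

```python
import math


def centroid(points):
    """Coordinate-wise mean of a non-empty list of equal-length tuples."""
    dims = len(points[0])
    return tuple(sum(p[d] for p in points) / len(points) for d in range(dims))


def davies_bouldin(clusters):
    """Davies-Bouldin Index for a list of clusters (each a list of points).

    Assumes at least two clusters with distinct centroids.
    """
    centers = [centroid(c) for c in clusters]
    # Scatter: mean distance from each member to its cluster centroid.
    scatter = [
        sum(math.dist(p, ctr) for p in c) / len(c)
        for c, ctr in zip(clusters, centers)
    ]
    k = len(clusters)
    total = 0.0
    for i in range(k):
        # Worst-case similarity of cluster i to any other cluster.
        total += max(
            (scatter[i] + scatter[j]) / math.dist(centers[i], centers[j])
            for j in range(k)
            if j != i
        )
    return total / k


# Toy example: two well-separated clusters score lower than overlapping ones.
separated = [[(0, 0), (0, 2)], [(10, 0), (10, 2)]]
overlapping = [[(0, 0), (0, 2)], [(1, 0), (1, 2)]]
print(davies_bouldin(separated), davies_bouldin(overlapping))
```

    Because the index depends only on the final cluster assignments, the same function can compare outputs of k-medoids, k-means, and x-means on equal terms, which is how two algorithms can end up with identical scores when they produce the same partition.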

    Comparison of Data Mining Techniques for Predicting Compressive Strength of Environmentally Friendly Concrete

    This material may be downloaded for personal use only. Any other use requires prior permission of the American Society of Civil Engineers. This material may be found at https://ascelibrary.org/doi/10.1061/%28ASCE%29CP.1943-5487.0000596. With its growing emphasis on sustainability, the construction industry is increasingly interested in environmentally friendly concrete produced using alternative and/or recycled waste materials. However, the wide application of such concrete is hindered by a lack of understanding of how these materials affect concrete properties. This research investigates and compares the performance of nine data mining models in predicting the compressive strength of a new type of concrete containing three alternative materials: fly ash, Haydite lightweight aggregate, and portland limestone cement. These models include three advanced predictive models (multilayer perceptron, support vector machines, and Gaussian processes regression), four regression tree models (M5P, REPTree, M5-Rules, and decision stump), and two ensemble methods (additive regression and bagging), with each of the seven individual models used as the base classifier.

    Comparison of Data Mining and Mathematical Models for Estimating Fuel Consumption of Passenger Vehicles

    A number of analytical models have been described in the literature to estimate the fuel consumption of vehicles. Most of them require a wide range of vehicle and trip-related parameters as input data, which might limit their practical applicability if such data are not readily available. To overcome this drawback, this study describes the development of three data mining models to estimate the fuel consumption of a vehicle: linear regression, artificial neural network, and support vector machines. The paper presents comparison results with five instantaneous fuel consumption models from the literature using real data collected from three passenger vehicles on three routes. The results indicate that while the prediction accuracy of the instantaneous fuel consumption models varies across the data sets, those obtained by the regression models are significantly better and more robust against changes in input data.

    A Performance Comparison of Data Mining Algorithms Based Intrusion Detection System for Smart Grid

    Smart grid is an emerging and promising technology. It uses the power of information technologies to deliver electrical power to customers intelligently, and it allows the integration of green technology to meet environmental requirements. Unfortunately, information technologies have inherent vulnerabilities and weaknesses that expose the smart grid to a wide variety of security risks. The intrusion detection system (IDS) plays an important role in securing smart grid networks and detecting malicious activity, yet it suffers from several limitations. Many research papers have been published to address these issues using several algorithms and techniques. Therefore, a detailed comparison between these algorithms is needed. This paper presents an overview of four data mining algorithms used by IDS in the smart grid. The performance of these algorithms is evaluated on several metrics, including the probability of detection, probability of false alarm, probability of miss detection, efficiency, and processing time. Results show that Random Forest outperforms the other three algorithms in detecting attacks, with a higher probability of detection, lower probability of false alarm, lower probability of miss detection, and higher accuracy. Comment: 6 pages, 6 figures
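    The detection metrics listed in this abstract are commonly derived from confusion-matrix counts. The sketch below uses the usual textbook definitions, which are an assumption here since the abstract does not give the paper's exact formulas; "efficiency" and processing time are omitted because their definitions are not stated. The class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionCounts:
    tp: int  # attacks correctly flagged as attacks
    fn: int  # attacks missed (classified as normal)
    fp: int  # normal traffic wrongly flagged as attack
    tn: int  # normal traffic correctly passed


def detection_metrics(c):
    """Textbook detection rates; assumes both classes are present."""
    attacks = c.tp + c.fn
    normals = c.fp + c.tn
    return {
        "p_detection": c.tp / attacks,      # true positive rate
        "p_miss": c.fn / attacks,           # 1 - p_detection
        "p_false_alarm": c.fp / normals,    # false positive rate
        "accuracy": (c.tp + c.tn) / (attacks + normals),
    }


# Illustrative counts only, not results from the paper.
print(detection_metrics(ConfusionCounts(tp=90, fn=10, fp=5, tn=95)))
```

    Under these definitions the probability of detection and the probability of miss detection always sum to 1, so a classifier that improves one necessarily improves the other; the false alarm rate is the independent quantity to watch.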

    A comparison of data mining techniques and multi-sensor analysis for inland marshes delineation

    Inland Marsh (IM) is a type of wetland characterized by the presence of non-woody plants such as grasses, reeds, or sedges, with a water surface smaller than 25% of the area. Historically, these areas have suffered impacts from pollution by urban, industrial, and agrochemical waste, as well as drainage for agriculture. IM delineation makes it possible to understand vegetation and hydrological dynamics and to monitor the degradation caused by human activities. This work aimed to compare four machine learning algorithms (classification and regression tree (CART), artificial neural network (ANN), random forest (RF), and k-nearest neighbors (k-NN)) using active and passive remote sensing data in order to address the following questions: (1) which of the four machine learning methods has the greatest potential for inland marsh delineation? (2) are SAR features more important for inland marsh delineation than optical features? and (3) what are the most accurate classification parameters for inland marsh delineation? To address these questions, we used data from Sentinel 1A and Alos Palsar I (SAR) and Sentinel 2A (optical) sensors in a geographic object-based image analysis (GEOBIA) approach. In addition, we vectorized a 1975 Brazilian Army topographic chart (the first official document presenting marsh boundaries) in order to quantify the marsh area losses between 1975 and 2018 by comparing it with a Sentinel 2A image. Our results showed that the method with the highest overall accuracy was k-NN, with 98.5%. The accuracies for the RF, ANN, and CART methods were 98.3%, 96.0%, and 95.5%, respectively. All four classifiers achieved accuracies exceeding 95%, showing that all methods have potential for inland marsh delineation. However, we note that the classification results depend strongly on the input layers. Regarding feature importance, SAR images were more important in the RF and ANN models, especially the HV, HV + VH, and VH channels of the Alos Palsar I L-band satellite, while spectral indices from optical images were more important for marsh delineation with the CART method. In addition, we found that the CART and ANN methods showed the largest variations in overall accuracy (OA) across the different parameters tested. The multi-sensor approach was critical for the high OA values found in the IM delineation (> 95%). The four machine learning methods can be accurately applied for IM delineation, serving as an important low-cost tool for monitoring and managing these environments in the face of agricultural expansion, soil degradation, and pollution of water resources due to agrochemical dumping.