55 research outputs found

    Data pre-processing for database marketing

    To increase the effectiveness of their marketing and Customer Relationship Management (CRM) activities, many organizations are adopting Database Marketing (DBM) strategies. Nowadays, DBM faces new challenges in business knowledge, since current strategies are mainly based on classical statistical inference, which may fail when the available data are complex, multi-dimensional and incomplete. An alternative is to use Knowledge Discovery from Databases (KDD), which aims at the automatic extraction of useful patterns by means of Data Mining (DM) techniques. When applied to DBM, the identified patterns can be used for the efficient characterization of customers. This paper focuses on several problems that arise in the data pre-processing step (e.g. data cleaning), which is necessary for the success of the DM approach to a DBM project.
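
    As a hedged illustration of the kind of pre-processing the paper discusses, the sketch below applies basic cleaning steps (duplicate removal, dropping mostly-empty columns, simple imputation) to a hypothetical customer table with pandas; the column handling and the 50% missing-value threshold are assumptions, not the paper's procedure.

```python
# Minimal data-cleaning sketch for a hypothetical customer table (pandas).
# Thresholds and imputation choices are illustrative assumptions.
import pandas as pd

def clean_customer_table(df: pd.DataFrame) -> pd.DataFrame:
    """Basic pre-processing before mining customer data."""
    df = df.drop_duplicates()                          # remove duplicate records
    df = df.dropna(axis=1, thresh=int(0.5 * len(df)))  # drop columns with >50% missing values
    for col in df.columns:
        if pd.api.types.is_numeric_dtype(df[col]):
            df[col] = df[col].fillna(df[col].median())        # numeric: impute median
        else:
            df[col] = df[col].fillna(df[col].mode().iloc[0])  # categorical: impute mode
    return df

# Hypothetical usage:
# customers = clean_customer_table(pd.read_csv("customers.csv"))
```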

    Geospatial data pre-processing on watershed datasets: A GIS approach

    Spatial data mining helps to identify interesting patterns in spatial data sets. However, geospatial data require substantial pre-processing before they can be interrogated further using data mining techniques. Multi-dimensional spatial data have been used to explain the spatial analysis, and SOLAP for pre-processing the data. This paper examines some methods for pre-processing such data using ArcGIS 10.2 and Spatial Analyst, with a watershed data set as a case study.
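
    The paper itself works in ArcGIS 10.2; as a hedged open-source analogue, the sketch below performs typical vector pre-processing steps (reprojection, clipping to the watershed boundary, dropping incomplete records) with geopandas. The file and column names are hypothetical placeholders.

```python
# Illustrative watershed vector pre-processing with geopandas (not ArcGIS).
import geopandas as gpd

watershed = gpd.read_file("watershed_boundary.shp")   # hypothetical study-area polygon
stations = gpd.read_file("monitoring_stations.shp")   # hypothetical point observations

# Put both layers in the same coordinate reference system before analysis.
stations = stations.to_crs(watershed.crs)

# Keep only stations that fall inside the watershed boundary.
stations_in_basin = gpd.clip(stations, watershed)

# Simple attribute cleaning: drop records with missing measurements.
stations_in_basin = stations_in_basin.dropna(subset=["flow", "elevation"])
print(len(stations_in_basin), "stations retained for mining")
```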

    Dealing with missing data for prognostic purposes

    Centrifugal compressors are considered among the most critical components in the oil industry, making the minimization of their downtime and the maximization of their availability a major target. Maintenance is a key aspect of achieving this goal, and various maintenance schemes have been proposed over the years. Condition based maintenance and prognostics and health management (CBM/PHM), which relies on the concepts of diagnostics and prognostics, has been gaining ground in recent years due to its ability to plan the maintenance schedule in advance. The successful application of this policy is heavily dependent on the quality of the data used, and a major issue affecting it is missing data. The presence of missing data may compromise the information contained in a data set and thus significantly affect the conclusions that can be drawn from it, as results may be biased or misleading. Consequently, it is important to address this matter. A number of methodologies to recover the data, called imputation techniques, have been proposed. This paper reviews the most widely used techniques and presents a case study using actual industrial centrifugal compressor data, in order to identify the most suitable ones.
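
    For readers unfamiliar with imputation, the sketch below runs three commonly reviewed techniques (mean, k-nearest-neighbour and iterative regression-based imputation) on a synthetic sensor matrix using scikit-learn; it is not the paper's case study and the data are randomly generated.

```python
# Common imputation techniques applied to a synthetic sensor matrix with NaNs.
import numpy as np
from sklearn.impute import SimpleImputer, KNNImputer
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (enables the import below)
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))                 # stand-in for compressor sensor readings
X[rng.random(X.shape) < 0.1] = np.nan         # knock out ~10% of the values

imputers = {
    "mean": SimpleImputer(strategy="mean"),
    "knn": KNNImputer(n_neighbors=5),
    "iterative": IterativeImputer(random_state=0),  # MICE-style regression imputation
}
for name, imputer in imputers.items():
    X_filled = imputer.fit_transform(X)
    print(name, "-> remaining NaNs:", int(np.isnan(X_filled).sum()))
```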

    RecSys Challenge 2023: From data preparation to prediction, a simple, efficient, robust and scalable solution

    The RecSys Challenge 2023, presented by ShareChat, consists of predicting whether a user will install an application on their smartphone after seeing advertising impressions in the ShareChat and Moj apps. This paper presents the solution of 'Team UMONS' to this challenge, giving accurate results (our best score is 6.622686) with a relatively small model that can be easily implemented in different production configurations. Our solution scales well as the dataset size increases and can be used with datasets containing missing values.
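
    The sketch below is not the Team UMONS model; it is a hedged baseline for the same kind of task (binary install prediction on data with missing values), using scikit-learn's HistGradientBoostingClassifier, which handles NaN inputs natively, on synthetic stand-in data.

```python
# Compact install-prediction baseline that tolerates missing feature values.
import numpy as np
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 20))                  # stand-in for impression features
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)    # synthetic "installed the app" label
X[rng.random(X.shape) < 0.05] = np.nan           # simulate missing feature values

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
model = HistGradientBoostingClassifier(max_iter=200, random_state=0)
model.fit(X_tr, y_tr)                            # NaNs are handled natively by the trees
print("held-out accuracy:", round(model.score(X_te, y_te), 3))
```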

    An FPGA-based network system with service-uninterrupted remote functional update

    The recent emergence of 5G networks enables the mass deployment of wireless sensors for internet-of-things (IoT) applications. In many cases, IoT sensors in monitoring and data collection applications are required to operate continuously (24/7) to ensure that all data are sampled without loss. Field-programmable gate array (FPGA)-based systems offer a balance of processing throughput and datapath flexibility. Specifically, datapath flexibility comes from system architectures that support dynamic partial reconfiguration. However, a device functional update can interrupt application servicing, especially in an FPGA-based system. This paper presents a standalone FPGA-based system architecture that allows remote functional updates without service interruption by adopting a redundancy mechanism in the application datapath. By utilizing dynamic partial reconfiguration, only the datapath being updated is temporarily inactive while the rest of the circuitry, including the redundant datapath, remains active. Hence, there is no service interruption or downtime when a remote functional update takes place, thanks to the redundant application datapath, which is critical for network and communication systems. The proposed architecture has significant impact for FPGA-based systems that have little or no tolerance for service interruption.
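
    The redundancy idea can be illustrated in software, although the paper's design is hardware (dynamic partial reconfiguration on an FPGA). The hedged Python sketch below only mimics the control flow: traffic is steered to a standby datapath while the primary one is being updated, so processing never pauses; all names are hypothetical.

```python
# Conceptual (software-only) illustration of service-uninterrupted update via redundancy.
class Datapath:
    def __init__(self, name: str, version: int):
        self.name, self.version, self.active = name, version, True

    def process(self, packet: str) -> str:
        return f"{self.name} v{self.version} handled {packet}"

class RedundantSystem:
    def __init__(self):
        self.primary = Datapath("primary", 1)
        self.standby = Datapath("standby", 1)

    def process(self, packet: str) -> str:
        path = self.primary if self.primary.active else self.standby
        return path.process(packet)

    def remote_update(self, new_version: int) -> None:
        self.primary.active = False          # traffic now flows through the standby path
        self.primary.version = new_version   # stands in for partial reconfiguration
        self.primary.active = True           # primary resumes with the new function

system = RedundantSystem()
print(system.process("pkt-1"))
system.remote_update(2)                      # service continues throughout the update
print(system.process("pkt-2"))
```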

    Internet traffic forecasting using neural networks

    Internet traffic forecasting is an important issue that has received little attention from the computer networks field. By improving this task, efficient traffic engineering and anomaly detection tools can be created, resulting in economic gains from better resource management. This paper presents a Neural Network Ensemble (NNE) for the prediction of TCP/IP traffic from a Time Series Forecasting (TSF) point of view. Several experiments were devised by considering real-world data from two large Internet Service Providers. In addition, different time scales (e.g. every five minutes and hourly) and forecasting horizons were analyzed. Overall, the NNE approach is competitive when compared with other TSF methods (e.g. Holt-Winters and ARIMA).
    Funding: Engineering and Physical Sciences Research Council (EP/522885 grant); Portuguese National Conference of Rectors (CRUP)/British Council Portugal (B-53/05 grant); Nuffield Foundation (NAL/001136/A grant).
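
    As a hedged illustration of the time-series-forecasting setup (not the paper's exact NNE), the sketch below builds lagged inputs from a synthetic traffic series and averages the predictions of several small multilayer perceptrons.

```python
# Small neural-network ensemble forecasting a synthetic 5-minute traffic series.
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
t = np.arange(2000)
traffic = 10 + np.sin(2 * np.pi * t / 288) + 0.1 * rng.normal(size=t.size)  # daily cycle, 288 samples/day

lags = 12                                                   # previous hour as inputs
X = np.column_stack([traffic[i:len(traffic) - lags + i] for i in range(lags)])
y = traffic[lags:]

split = int(0.8 * len(y))                                   # time-ordered train/test split
X_tr, X_te, y_tr, y_te = X[:split], X[split:], y[:split], y[split:]

ensemble = [
    MLPRegressor(hidden_layer_sizes=(8,), max_iter=2000, random_state=seed).fit(X_tr, y_tr)
    for seed in range(5)
]
pred = np.mean([m.predict(X_te) for m in ensemble], axis=0)  # average the ensemble members
print("ensemble MAE on held-out traffic:", round(float(np.mean(np.abs(pred - y_te))), 3))
```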

    Predicting inpatient length of stay in a Portuguese hospital using the CRISP-DM methodology

    Using data on inpatient episodes collected from a Portuguese hospital for the period 2000 to 2013, and following the CRISP-DM data mining methodology, we obtained a model for predicting inpatient length of stay based on the random forest algorithm. The model achieved a high prediction quality, superior to that obtained with other data mining techniques, and identified the patients' clinical attributes as the most important factors for explaining length of stay.

    Using data mining for prediction of hospital length of stay: an application of the CRISP-DM Methodology

    Hospitals nowadays collect vast amounts of data related to patient records. These data hold valuable knowledge that can be used to improve hospital decision making. Data mining techniques aim precisely at the extraction of useful knowledge from raw data. This work describes the implementation of a medical data mining project based on the CRISP-DM methodology. Recent real-world data, from 2000 to 2013, were collected from a Portuguese hospital and relate to inpatient hospitalization. The goal was to predict generic hospital Length Of Stay based on indicators that are commonly available at the hospitalization process (e.g., gender, age, episode type, medical specialty). At the data preparation stage, the data were cleaned and variables were selected and transformed, leading to 14 inputs. Next, at the modeling stage, a regression approach was adopted in which six learning methods were compared: Average Prediction, Multiple Regression, Decision Tree, Artificial Neural Network ensemble, Support Vector Machine and Random Forest. The best learning model was obtained by the Random Forest method, which presented a high coefficient of determination (0.81). This model was then opened using a sensitivity analysis procedure that revealed three influential input attributes: the hospital episode type, the physical service where the patient is hospitalized and the associated medical specialty. Such extracted knowledge confirmed that the obtained predictive model is credible and has potential value for supporting the decisions of hospital managers.
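
    As a hedged sketch of the modelling stage described above, the code below fits a random forest regressor to synthetic admission-time indicators, reports the coefficient of determination and lists feature importances (a simple stand-in for the paper's sensitivity analysis); the data and column names are invented for illustration.

```python
# Random-forest regression of length of stay on synthetic admission indicators.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 3000
X = pd.DataFrame({
    "age": rng.integers(18, 95, n),
    "episode_type": rng.integers(0, 3, n),        # e.g. urgent vs. planned, label-encoded
    "medical_specialty": rng.integers(0, 10, n),  # label-encoded specialty
    "gender": rng.integers(0, 2, n),
})
los = 2 + 0.05 * X["age"] + 2 * X["episode_type"] + rng.normal(0, 1, n)  # synthetic target (days)

X_tr, X_te, y_tr, y_te = train_test_split(X, los, test_size=0.2, random_state=0)
rf = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_tr, y_tr)
print("R^2 on held-out episodes:", round(r2_score(y_te, rf.predict(X_te)), 2))
for name, importance in zip(X.columns, rf.feature_importances_):
    print(f"{name}: {importance:.2f}")           # crude proxy for sensitivity analysis
```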

    Variable importance for sustaining macrophyte presence via random forests : data imputation and model settings

    Get PDF
    Data sets plagued with missing data and performance-affecting model parameters represent recurrent issues within the field of data mining. Via random forests, the influence of data reduction, outlier and correlated variable removal, and missing data imputation technique on the performance of habitat suitability models for three macrophytes (Lemna minor, Spirodela polyrhiza and Nuphar lutea) was assessed. Higher performances (Cohen’s kappa values around 0.2–0.3) were obtained for a high degree of data reduction, without outlier or correlated variable removal and with imputation of the median value. Moreover, the influence of model parameter settings on the performance of a random forest trained on this data set was investigated across a range of numbers of individual trees (ntree), while the number of variables considered at each split (mtry) was fixed at two. Altering the number of individual trees did not have a uniform effect on model performance, but clearly changed the required computation time. Combining both criteria provided an ntree value of 100, with the overall effect of ntree on performance being relatively limited. Temperature, pH and conductivity remained as variables and were shown to affect the likelihood of L. minor, S. polyrhiza and N. lutea being present. Generally, high likelihood values were obtained when temperature was high (>20 °C), conductivity was moderately low (50–200 mS m⁻¹) or pH was intermediate (6.9–8), thereby also highlighting that a multivariate management approach for supporting macrophyte presence remains recommended. Yet, as our conclusions are based on only a single freshwater data set, they should be further tested on other data sets.
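
    The hedged sketch below mirrors the workflow on synthetic stand-in data: median imputation of missing predictors, then a random forest with n_estimators=100 (ntree) and max_features=2 (mtry), evaluated with Cohen's kappa; it is illustrative only, not the study's data or model.

```python
# Median imputation + random forest (ntree=100, mtry=2) scored with Cohen's kappa.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.metrics import cohen_kappa_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
n = 1000
X = np.column_stack([
    rng.uniform(5, 30, n),     # temperature (degrees C)
    rng.uniform(5.5, 9, n),    # pH
    rng.uniform(10, 400, n),   # conductivity (mS/m)
])
y = ((X[:, 0] > 20) & (X[:, 1] > 6.9) & (X[:, 1] < 8)).astype(int)  # synthetic presence label
X[rng.random(X.shape) < 0.1] = np.nan                               # introduce missing values

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
model = make_pipeline(
    SimpleImputer(strategy="median"),                               # median imputation
    RandomForestClassifier(n_estimators=100, max_features=2, random_state=0),
)
model.fit(X_tr, y_tr)
print("Cohen's kappa:", round(cohen_kappa_score(y_te, model.predict(X_te)), 2))
```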