
    Data prediction for cases of incorrect data in multi-node electrocardiogram monitoring

    The development of mesh topologies for multi-node electrocardiogram (ECG) monitoring based on the ZigBee protocol still has limitations. When more than one active ECG node sends a data stream, synchronization failures can corrupt the data, and the incorrect data will affect signal interpretation. A mechanism is therefore needed to correct or predict the damaged data. In this study, the expectation-maximization (EM) and regression imputation (RI) methods were proposed to overcome these problems. Real data from previous studies are the main modality used in this study. The predicted ECG signal data are compared with the actual ECG data stored in the main controller memory, and root mean square error (RMSE) is calculated to measure system performance. The simulation was performed on 13 ECG waves, each of which has 1000 samples. The results show that the EM method has a lower prediction error than the RI method: the average RMSE for the EM and RI methods is 4.77 and 6.63, respectively. The proposed method is expected to be applicable to multi-node ECG monitoring, especially in ZigBee applications, to minimize errors.
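    As a rough illustration of the regression-imputation idea scored by RMSE (not the authors' EM/RI pipeline — the test signal, window size, and polynomial degree below are all invented for the sketch), corrupted samples of a smooth signal can be predicted from a local least-squares fit on their observed neighbours:

```python
import numpy as np

rng = np.random.default_rng(0)
t = np.linspace(0, 2 * np.pi, 1000)
signal = np.sin(5 * t) + 0.3 * np.sin(12 * t)   # smooth ECG-like stand-in

corrupted = signal.copy()
missing = rng.choice(1000, size=100, replace=False)
corrupted[missing] = np.nan                      # simulated transmission damage

# Regression imputation: fit a local polynomial on observed neighbours
# and predict each missing sample from it.
def regression_impute(x, y, window=25, degree=3):
    y = y.copy()
    for i in np.flatnonzero(np.isnan(y)):
        lo, hi = max(0, i - window), min(len(y), i + window + 1)
        idx = np.arange(lo, hi)
        obs = idx[~np.isnan(y[idx])]             # usable neighbours only
        coeffs = np.polyfit(x[obs], y[obs], degree)
        y[i] = np.polyval(coeffs, x[i])
    return y

restored = regression_impute(t, corrupted)
# RMSE against the ground truth, evaluated only at the damaged positions
rmse = np.sqrt(np.mean((restored[missing] - signal[missing]) ** 2))
```

    The same RMSE comparison would then be repeated for an EM-based imputer to reproduce the study's head-to-head evaluation.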

    Regression Analysis of University Giving Data

    This project analyzed the giving data of Worcester Polytechnic Institute's alumni and other constituents (parents, friends, neighbors, etc.) from fiscal year 1983 to 2007 using a two-stage modeling approach. Logistic regression analysis was conducted in the first stage to predict the likelihood of giving for each constituent, followed by linear regression in the second stage to predict the amount of contribution to be expected from each contributor. A Box-Cox transformation was performed in the linear regression phase to ensure that the assumptions underlying the model hold. Due to the nature of the data, multiple imputation was performed on the missing information to validate generalization of the models to a broader population. Concepts from the field of direct and database marketing, such as score and lift, were also introduced in this report.
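    A minimal sketch of such a two-stage model, on synthetic data (the features, coefficients, and log transform — the Box-Cox transform with lambda = 0 — are all assumptions for illustration, not the project's actual variables):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 2000
X = rng.normal(size=(n, 3))                      # stand-ins for constituent features
true_w = np.array([1.0, -0.5, 0.8])
gives = rng.random(n) < 1 / (1 + np.exp(-(X @ true_w)))
amount = np.where(gives,
                  np.exp(2 + X @ [0.5, 0.2, -0.1] + 0.3 * rng.normal(size=n)),
                  0.0)

# Stage 1: logistic regression (probability of giving), fit by gradient ascent
# on the log-likelihood.
w = np.zeros(3)
for _ in range(500):
    p = 1 / (1 + np.exp(-(X @ w)))
    w += 0.1 * X.T @ (gives - p) / n

# Stage 2: linear regression on log-amounts for donors only.
Xd = np.c_[np.ones(gives.sum()), X[gives]]
beta, *_ = np.linalg.lstsq(Xd, np.log(amount[gives]), rcond=None)

# Expected contribution = P(give) * predicted amount given giving.
expected = (1 / (1 + np.exp(-(X @ w)))) * np.exp(np.c_[np.ones(n), X] @ beta)
```

    Multiplying the two stages gives an expected-value score per constituent, which is what lift and score rankings in database marketing are computed from.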

    Effects of data cleaning on machine learning model performance

    Abstract. This thesis focuses on the preprocessing and challenges of a university student data set and on how different levels of data preprocessing affect the performance of a prediction model, both in general and in selected groups of interest. The data set comprises the students at the University of Oulu who were admitted to the Faculty of Information Technology and Electrical Engineering during the years 2006–2015. This data set was cleaned at three different levels, resulting in three differently processed data sets: the first is the original data set with only basic cleaning, the second has been cleaned of the most obvious anomalies, and the third has been systematically cleaned of all possible anomalies. Each of these data sets was used to build a Gradient Boosting Machine model that predicted the cumulative number of ECTS credits the students would achieve by the end of their second-year studies, based on their first-year studies and Matriculation Examination results. The effects of the cleaning on model performance were examined by comparing the prediction accuracy and the information the models gave about the factors that might indicate slow ECTS accumulation. The results showed that the prediction accuracy improved after each cleaning stage and that the influences of the features changed significantly, becoming more reasonable.
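    A toy version of the prediction task (gradient boosting for squared loss, built here from regression stumps in plain numpy rather than a GBM library; the features, coefficients, and noise level are invented):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 500
first_year = rng.uniform(0, 60, n)        # first-year ECTS (stand-in feature)
exam = rng.uniform(0, 7, n)               # matriculation grade (stand-in)
X = np.c_[first_year, exam]
y = 1.6 * first_year + 4 * exam + rng.normal(0, 5, n)   # cumulative 2nd-year ECTS

def fit_stump(X, r):
    """Best single-split regression stump on residuals r."""
    best = (np.inf, 0, 0.0, r.mean(), r.mean())
    for j in range(X.shape[1]):
        for s in np.quantile(X[:, j], np.linspace(0.1, 0.9, 9)):
            left = X[:, j] <= s
            if left.all() or not left.any():
                continue
            lv, rv = r[left].mean(), r[~left].mean()
            sse = ((r[left] - lv) ** 2).sum() + ((r[~left] - rv) ** 2).sum()
            if sse < best[0]:
                best = (sse, j, s, lv, rv)
    return best[1:]

# Gradient boosting for squared loss: repeatedly fit stumps to the residuals
# and add them with a shrinkage factor.
pred = np.full(n, y.mean())
lr = 0.3
for _ in range(100):
    j, s, lv, rv = fit_stump(X, y - pred)
    pred += lr * np.where(X[:, j] <= s, lv, rv)

rmse = np.sqrt(np.mean((y - pred) ** 2))
```

    Rerunning the fit on each cleaning level of the data and comparing the RMSEs mirrors the thesis's evaluation design.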

    Near-Lossless Compression for Large Traffic Networks

    With advancements in sensor technologies, intelligent transportation systems can collect traffic data with high spatial and temporal resolution. However, the size of the networks combined with the huge volume of the data puts serious constraints on system resources. Low-dimensional models can help ease these constraints by providing compressed representations of the networks. In this paper, we analyze the reconstruction efficiency of several low-dimensional models for large and diverse networks. The compression performed by low-dimensional models is lossy in nature. To address this issue, we propose a near-lossless compression method for traffic data by applying the principle of lossy-plus-residual coding. To this end, we first develop a low-dimensional model of the network. We then apply Huffman coding (HC) in the residual layer. The resulting algorithm guarantees that the maximum reconstruction error will remain below a desired tolerance limit. For analysis, we consider a large and heterogeneous test network comprising more than 18 000 road segments. The results show that the proposed method can efficiently compress data obtained from a large and diverse road network while maintaining the upper bound on the reconstruction error. Funding: Singapore National Research Foundation (Singapore-MIT Alliance for Research and Technology Center, Future Urban Mobility Program).
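    The lossy-plus-residual principle can be sketched in a few lines: a low-rank SVD model of the traffic matrix supplies the lossy layer, and uniformly quantizing the residual with step 2·tol bounds the reconstruction error by tol (the Huffman stage would then entropy-code the integer symbols; the toy data, rank, and tolerance below are assumptions, not the paper's setup):

```python
import numpy as np

rng = np.random.default_rng(3)
# Toy traffic-speed matrix: segments x time, smooth daily pattern + noise.
t = np.linspace(0, 2 * np.pi, 288)
data = 60 + 15 * np.sin(t) + rng.normal(0, 3, (200, 288))

# Lossy layer: rank-k SVD approximation of the network data.
k, tol = 5, 1.0
U, s, Vt = np.linalg.svd(data, full_matrices=False)
lossy = U[:, :k] * s[:k] @ Vt[:k]

# Residual layer: uniform quantization with step 2*tol guarantees that
# the reconstruction error never exceeds tol; the integer symbols would
# then be entropy-coded (e.g. Huffman) for storage.
symbols = np.round((data - lossy) / (2 * tol)).astype(np.int32)
recon = lossy + symbols * (2 * tol)

max_err = np.abs(data - recon).max()
```

    Because the low-rank layer captures most of the structure, the residual symbols concentrate around zero, which is exactly the skewed distribution Huffman coding compresses well.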

    Condition Monitoring of Wind Turbines Using Intelligent Machine Learning Techniques

    Wind turbine condition monitoring can detect anomalies in turbine performance that have the potential to result in unexpected failure and financial loss. This study examines common Supervisory Control And Data Acquisition (SCADA) data over a period of 20 months for 21 pitch-regulated 2.3 MW turbines and is presented in three manuscripts. First, power curve monitoring is targeted by applying various types of Artificial Neural Networks to increase modeling accuracy; it is shown how the proposed method can significantly improve network reliability compared with existing models. Then, an advanced technique is utilized to create a smoother dataset for network training, followed by the establishment of a dynamic ANFIS network; at this stage, the designed network aims to predict power generation in future hours. Finally, a recursive principal component analysis is performed to extract significant features to be used as input parameters of the network, and a novel fusion technique is then employed to build an advanced model that predicts turbine performance with favorably low errors.
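    The power-curve modeling step amounts to regressing power output on wind speed with a neural network. A minimal one-hidden-layer network in plain numpy (the synthetic sigmoid-shaped curve, layer size, and learning rate are all assumptions; the study's actual ANN and ANFIS architectures are more elaborate):

```python
import numpy as np

rng = np.random.default_rng(4)
# Synthetic power curve for a 2.3 MW turbine: smooth rise toward rated power.
wind = rng.uniform(0, 25, (1000, 1))                  # wind speed, m/s
power = 2.3 / (1 + np.exp(-(wind - 7.5))) + rng.normal(0, 0.05, wind.shape)

# One-hidden-layer network trained with full-batch gradient descent.
W1, b1 = rng.normal(0, 0.5, (1, 16)), np.zeros(16)
W2, b2 = rng.normal(0, 0.5, (16, 1)), np.zeros(1)
lr = 0.05
x = wind / 25.0                                       # scale input to [0, 1]
for _ in range(3000):
    h = np.tanh(x @ W1 + b1)                          # hidden activations
    err = h @ W2 + b2 - power                         # prediction error
    gW2, gb2 = h.T @ err / len(x), err.mean(0)
    gh = err @ W2.T * (1 - h ** 2)                    # backprop through tanh
    gW1, gb1 = x.T @ gh / len(x), gh.mean(0)
    W2 -= lr * gW2; b2 -= lr * gb2; W1 -= lr * gW1; b1 -= lr * gb1

rmse = np.sqrt(np.mean((np.tanh(x @ W1 + b1) @ W2 + b2 - power) ** 2))
```

    Condition monitoring then flags turbines whose observed power deviates persistently from the fitted curve.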

    ARDP: SIMPLIFIED MACHINE LEARNING PREDICTOR FOR MISSING UNIDIMENSIONAL ACADEMIC RESULTS DATASET

    We present a machine learning predictor for academic results datasets (PARD) that predicts missing academic results based on chi-squared expected-value calculation, positional clustering, progressive approximation of relative residuals, and positional averages of the data in a sampled population. Academic results datasets are data originating from academic institutions' results repositories. PARD is a technique designed specifically for predicting missing academic results. Since the whole essence of data mining is to elicit useful information and gain knowledge-driven insights into datasets, PARD positions the data explorer at this advantageous perspective. PARD promises to solve the problem of missing academic results more quickly than the approaches currently found in the literature. The predictor was implemented in Python, and the results obtained show that it achieves an average prediction accuracy of at least 93.6% on the sampled cases. The results demonstrate that PARD tends toward greater precision in solving the problem of predicting missing academic results in university datasets.
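    The chi-squared expected-value idea can be illustrated with a simple additive analogue: a missing score is estimated from its row mean (student strength) and column mean (course difficulty). This is a hedged sketch in the spirit of the abstract, not the actual ARDP/PARD algorithm, and the score matrix is synthetic:

```python
import numpy as np

rng = np.random.default_rng(5)
# Toy results matrix: 40 students x 6 courses, scores 0-100, with a
# per-student ability offset.
full = np.clip(rng.normal(60, 12, (40, 6)) + rng.normal(0, 8, (40, 1)), 0, 100)
scores = full.copy()
scores[rng.random(scores.shape) < 0.1] = np.nan   # knock out ~10% of results

# Positional-average imputation: expected cell value from row and column
# means, analogous to a chi-squared expected-count calculation.
row_mean = np.nanmean(scores, axis=1, keepdims=True)
col_mean = np.nanmean(scores, axis=0, keepdims=True)
grand = np.nanmean(scores)
estimate = row_mean + col_mean - grand
imputed = np.where(np.isnan(scores), estimate, scores)

mask = np.isnan(scores)
rmse = np.sqrt(np.mean((imputed[mask] - full[mask]) ** 2))
```

    The published method layers positional clustering and residual refinement on top of this kind of expected-value baseline.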

    Data mining tool for academic data exploitation: selection of most suitable algorithms

    The SPEET project aims to exploit the potential synergy between the huge amount of academic data already existing at universities and the maturity of data science, in order to provide tools for extracting information from student data. A rich picture can be extracted from this data if it is conveniently processed. The purpose of the project is to apply data mining algorithms to this data in order to extract information about, and identify, student profiles. This document presents the results obtained in the SPEET project during the development of the data mining tools. More specifically, two mechanisms have been developed: a clustering/classification scheme that groups students in terms of academic performance, and a drop-out prediction system. The document starts by addressing the motivation for developing data mining tools, along with the considerations taken into account for academic data gathering; these include the proposed unified dataset format and some details about confidentiality issues. Next, the student clustering and classification schemes are presented in detail, including a description of the machine learning algorithms considered and a discussion of the results obtained on data belonging to the different SPEET project partners. The results show how groups of clusters can be automatically identified and how new students can be classified into existing groups with high accuracy. Finally, the implemented drop-out prediction system is considered and several algorithmic alternatives are presented; in this case, evaluation of the drop-out mechanism is focused on one institution, showing a prediction accuracy of around 91%. The algorithms presented in this document are available in repositories or as inline code, as indicated.
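    The cluster-then-classify workflow — identify student groups automatically, then assign new students to an existing group — can be sketched as k-means clustering followed by nearest-centroid classification (the two synthetic performance profiles and their features are assumptions, not SPEET's dataset format):

```python
import numpy as np

rng = np.random.default_rng(6)
# Two synthetic student-performance profiles: (avg grade, credits/semester).
strong = rng.normal([8.0, 30.0], [0.8, 3.0], (100, 2))
weak = rng.normal([5.5, 15.0], [0.8, 4.0], (100, 2))
X = np.vstack([strong, weak])

# Cluster the existing students with 2-means (one seed point per profile).
centroids = X[[0, 100]].copy()
for _ in range(20):
    labels = np.argmin(((X[:, None] - centroids) ** 2).sum(-1), axis=1)
    centroids = np.array([X[labels == c].mean(0) for c in range(2)])

# Classify a new student by the nearest cluster centroid.
def classify(student):
    return int(np.argmin(((student - centroids) ** 2).sum(-1)))

group = classify(np.array([7.8, 28.0]))   # grades suggest the "strong" profile
```

    A drop-out predictor would sit alongside this, trained as a supervised classifier on labeled historical outcomes rather than on clusters.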

    Segmentation with unsupervised learning: An application using the Walker's data

    In this project, walkers suitable for the service were filtered using a dataset shared by the DogGo company. Unsupervised machine learning methods, namely K-Means, Gaussian mixture models, and Principal Component Analysis, were then used to score and cluster the most suitable walkers according to performance, willingness, and experience. DogGo is the first mobile application in Turkey that provides pet walking and grooming services to its customers in a safe and professional manner: dogs are taken care of in the dog owners' own homes or at the caretaker's home, for whatever need the families may have. The DogGo company wants to provide the best matching of walkers and animals, using machine learning algorithms, through a five-step acquisition process for its walkers. The K-Means models built on the unique walkers were compared with the help of the Elbow method and the Silhouette score, while the Gaussian mixture models were compared using the AIC and BIC criteria. In addition, a classical RFM scoring was also created. When the results of the study were examined in light of the Elbow and Silhouette scores, the model created with K-Means gave the best results, and the number of clusters was decided to be 2.
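    The model-selection step — sweeping k, reading the inertia curve for the elbow, and checking the silhouette of the chosen solution — can be sketched in plain numpy (the three walker features and the two synthetic profiles are invented; the study's actual data and the GMM/AIC/BIC comparison are not reproduced here):

```python
import numpy as np

rng = np.random.default_rng(7)
# Toy walker features: (performance, willingness, experience), two profiles.
X = np.vstack([rng.normal(0.3, 0.08, (30, 3)), rng.normal(0.8, 0.08, (30, 3))])

def kmeans(X, k, iters=30):
    C = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        labels = ((X[:, None] - C) ** 2).sum(-1).argmin(1)
        C = np.array([X[labels == c].mean(0) if (labels == c).any() else C[c]
                      for c in range(k)])
    return labels, ((X - C[labels]) ** 2).sum()

# Elbow method: within-cluster sum of squares (inertia) for k = 1..5.
inertia = {k: kmeans(X, k)[1] for k in range(1, 6)}

# Silhouette of the k = 2 solution: (b - a) / max(a, b) per point.
labels, _ = kmeans(X, 2)
D = np.sqrt(((X[:, None] - X) ** 2).sum(-1))
sil = []
for i in range(len(X)):
    same = labels == labels[i]
    a = D[i, same & (np.arange(len(X)) != i)].mean()   # own-cluster distance
    b = D[i, ~same].mean()                             # other-cluster distance
    sil.append((b - a) / max(a, b))
silhouette = np.mean(sil)
```

    A high silhouette at the elbow's k is what justified settling on two clusters in the study.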

    Manufacturing Process Causal Knowledge Discovery using a Modified Random Forest-based Predictive Model

    A Modified Random Forest algorithm (MRF)-based predictive model is proposed for use in manufacturing processes to estimate the effects of several potential interventions, such as (i) altering the operating ranges of selected continuous process parameters within specified tolerance limits, (ii) choosing particular categories of discrete process parameters, or (iii) choosing combinations of both types of process parameters. The model introduces a non-linear approach to identifying the most critical process inputs by scoring the contribution each input makes to the prediction power for the process output. It uses this contribution to discover optimal operating ranges for the continuous process parameters and/or optimal categories for the discrete process parameters. The set of values used for the process inputs was generated from operating ranges identified using a novel Decision Path Search (DPS) algorithm and Bootstrap sampling. The odds ratio is the ratio between the occurrence probabilities of desired and undesired process output values. The effects of potential interventions, or of proposed confirmation trials, are quantified as posterior odds and used to calculate conditional probability distributions. The advantages of this approach are discussed in comparison to fitting these probability distributions with Bayesian Networks (BN). The proposed explainable data-driven predictive model is scalable to a large number of process factors with non-linear dependence on one or more process responses. It allows the discovery of data-driven process improvement opportunities with minimal interaction with domain expertise. An iterative Random Forest algorithm is proposed to predict the missing values in mixed datasets (continuous and categorical process parameters); it is shown that the algorithm is robust even at high proportions of missing values. The number of observations available in manufacturing process datasets is generally low, e.g. of a similar order of magnitude to the number of process parameters. Hence, Neural Network (NN)-based deep learning methods are generally not applicable, as these techniques require 50-100 times more observations than input factors (process parameters). The results are verified on a number of benchmark examples with datasets published in the literature. They demonstrate that the proposed method outperforms the comparison approaches in terms of accuracy and causality, with linearity assumed. Furthermore, the computational cost is far lower and entirely feasible for heterogeneous datasets.
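    The odds-ratio quantification of an intervention can be shown directly on toy data (the process variable, its operating range, and the outcome probabilities are all invented; the MRF and DPS machinery that would identify the range is not reproduced):

```python
import numpy as np

rng = np.random.default_rng(8)
# Toy process data: the output is more often "good" when a continuous
# input sits inside a favourable operating range.
temp = rng.uniform(150, 250, 5000)
good = rng.random(5000) < np.where((temp > 180) & (temp < 220), 0.9, 0.4)

# Quantify the proposed intervention (restrict temp to 180-220) as an
# odds ratio: odds of a good outcome inside vs. outside the range.
in_range = (temp > 180) & (temp < 220)
p_in = good[in_range].mean()
p_out = good[~in_range].mean()
odds_ratio = (p_in / (1 - p_in)) / (p_out / (1 - p_out))
```

    An odds ratio well above 1 supports running the intervention as a confirmation trial, which is how the posterior-odds evidence is used in the proposed workflow.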