
    Multiple Imputation Ensembles (MIE) for dealing with missing data

    Missing data is a significant issue in many real-world datasets, yet there are few robust methods for dealing with it appropriately. In this paper, we propose a robust approach to dealing with missing data in classification problems: Multiple Imputation Ensembles (MIE). Our method integrates two approaches, multiple imputation and ensemble methods, and compares two types of ensembles: bagging and stacking. We also propose a robust experimental set-up using 20 benchmark datasets from the UCI machine learning repository. For each dataset, we introduce increasing amounts of data Missing Completely at Random. First, we use a number of single/multiple imputation methods to recover the missing values; we then ensemble a number of different classifiers built on the imputed data. We assess the quality of the imputation using dissimilarity measures, and we evaluate MIE performance by comparing classification accuracy on the complete and imputed data. Furthermore, we use the accuracy of simple imputation as a benchmark for comparison. We find that our proposed approach, combining multiple imputation with ensemble techniques, outperforms the alternatives, particularly as the amount of missing data increases.
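
    As a hedged illustration of the idea (not the authors' implementation), the following Python sketch combines multiple imputations with a bagging-style majority vote; the imputer, base learner, and all parameters are placeholder choices:

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.tree import DecisionTreeClassifier

def mie_predict(X_train, y_train, X_test, n_imputations=5, seed=0):
    """One classifier per imputed dataset, combined by majority vote.
    Assumes integer class labels in y_train (illustrative sketch)."""
    votes = []
    for m in range(n_imputations):
        # A different random seed per run yields a different completed
        # dataset -- the "multiple" in multiple imputation.
        imp = IterativeImputer(sample_posterior=True, random_state=seed + m)
        X_tr = imp.fit_transform(X_train)
        X_te = imp.transform(X_test)
        clf = DecisionTreeClassifier(random_state=seed + m).fit(X_tr, y_train)
        votes.append(clf.predict(X_te))
    votes = np.stack(votes).astype(int)
    # Majority vote across ensemble members for each test sample.
    return np.apply_along_axis(lambda v: np.bincount(v).argmax(), 0, votes)
```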

    A Review of Missing Data Handling Techniques for Machine Learning

    Real-world data commonly contain missing values, which adversely affect the performance of most machine learning algorithms trained on such datasets. Missing values are among the recurring challenges of working with real-world data. Since the accuracy and efficiency of machine learning models depend on the quality of the data used, data analysts and researchers working with data need relevant techniques for handling these inescapable missing values. This paper reviews state-of-the-art practices from the literature for handling missing data in machine learning and lists the evaluation metrics used to measure the performance of these techniques. The study presents the techniques and evaluation metrics in clear terms, supported by their mathematical formulations. Furthermore, it provides recommendations to consider when choosing among missing data handling techniques.
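
    The review itself should be consulted for its full list of metrics; as a minimal illustration of the kind of evaluation it covers (our own sketch, not taken from the paper), two common imputation-quality metrics, RMSE and MAE, are computed only over the entries that were artificially masked out:

```python
import numpy as np

def imputation_error(X_true, X_imputed, missing_mask):
    """RMSE and MAE restricted to the entries that were masked out."""
    diff = (X_true - X_imputed)[missing_mask]
    rmse = np.sqrt(np.mean(diff ** 2))
    mae = np.mean(np.abs(diff))
    return rmse, mae
```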

    AI-enabled modeling and monitoring of data-rich advanced manufacturing systems

    The infrastructure of cyber-physical systems (CPS) is based on a meta-concept of cybermanufacturing systems (CMS) that synchronizes the Industrial Internet of Things (IIoT), Cloud Computing, Industrial Control Systems (ICSs), and Big Data analytics in manufacturing operations. Artificial Intelligence (AI) can be incorporated to make intelligent decisions in the day-to-day operations of CMS. Cyberattack spaces in AI-based cybermanufacturing operations pose significant challenges, including unauthorized modification of systems, loss of historical data, destructive malware, and software malfunctioning. However, a cybersecurity framework can be implemented to prevent unauthorized access, theft, damage, or other harmful attacks on electronic equipment, networks, and sensitive data. The framework's five main steps, divided into procedures and countermeasure efforts, are to identify, protect, detect, respond, and recover. Given the major challenges in AI-enabled cybermanufacturing systems, this dissertation proposes three research objectives that incorporate cybersecurity frameworks. The first addresses the in-situ additive manufacturing (AM) process authentication problem using high-volume video streaming data. A side-channel monitoring approach based on an in-situ optical imaging system is established, and a tensor-based layer-wise texture descriptor is constructed to describe the observed printing path. Multilinear principal component analysis (MPCA) is then leveraged to reduce the dimension of the tensor-based texture descriptor, so that low-dimensional features can be extracted for detecting attack-induced alterations. The second research work addresses high-volume data stream problems in multi-channel sensor fusion for diverse bearing fault diagnosis. It proposes a new multi-channel sensor fusion method that integrates acoustic and vibration signals with different sampling rates and limited training data. The frequency-domain tensor is decomposed by MPCA, yielding low-dimensional process features for diverse bearing fault diagnosis with a neural network classifier. Building on the second method, the third research endeavor addresses the recovery of multi-channel sensing signals when a substantial amount of data is missing due to sensor malfunction or transmission issues. This study leverages a fully Bayesian CANDECOMP/PARAFAC (FBCP) factorization method that captures the multi-linear interactions (channels × signals) among the latent factors of sensor signals and imputes missing entries based on the observed signals.
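
    As a rough, non-Bayesian stand-in for the FBCP idea (the actual method is fully Bayesian and beyond this sketch), a masked CP decomposition can reconstruct missing sensor entries from latent factors. The sketch below assumes the TensorLy library; the tensor layout and rank are illustrative choices, not the dissertation's:

```python
import numpy as np
import tensorly as tl
from tensorly.decomposition import parafac

def cp_impute(tensor, observed_mask, rank=5):
    """tensor: e.g. a channels x time x features array (any finite values
    at missing entries); observed_mask: True where an entry was measured."""
    filled = np.where(observed_mask, tensor, 0.0)  # zero-fill missing entries
    # Masked CP decomposition: the latent factors are fit only to the
    # observed entries of the tensor.
    cp = parafac(tl.tensor(filled), rank=rank,
                 mask=tl.tensor(observed_mask.astype(float)))
    recon = tl.to_numpy(tl.cp_to_tensor(cp))
    # Keep measured values; take the low-rank reconstruction elsewhere.
    return np.where(observed_mask, tensor, recon)
```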

    Novel Matrix Completion Methods for Missing Data Recovery in Urban Systems

    Urban systems are composed of multiple interdependent subsystems, for example, the traffic monitoring system, the air quality monitoring system, or the electricity distribution system. Each subsystem generates a different dataset that is used for analyzing, monitoring, and controlling that particular component and the overall system. The quantity and quality of the data are essential for the successful application of the monitoring and control strategies. In practical settings, the sensing and communication infrastructures are prone to telemetry errors that generate incomplete datasets. In this context, recovering missing data is paramount for ensuring the resilience of the urban system. First, the missing data from each subsystem are recovered using only the available observations from that subsystem. The fundamental limits of the missing data recovery problem are characterized by defining the optimal performance theoretically attainable by any estimator. The performance of a standard matrix completion-based algorithm and of the linear minimum mean squared error estimator is benchmarked using real data from a low-voltage distribution system. The comparison with the fundamental limit highlights the drawbacks of both methods in a practical setting. A new recovery method is proposed that combines the optimality of Bayesian estimation with matrix completion-based recovery; numerical simulations show that the novel approach provides robust performance in realistic scenarios. Second, the correlation that results from the interdependence between the subsystems of an urban system is exploited in a joint recovery setting. To that end, the available observations from two interconnected components of an urban system are aggregated into a single data matrix that is used for the recovery process. The fundamental limits of the joint recovery of two datasets are theoretically derived. In addition, when the locations of the missing entries in each dataset are uniformly distributed, theoretical bounds on the probability of recovery are established and used to minimize the acquisition cost. The numerical analysis shows that the proposed algorithm outperforms standard matrix completion-based recovery in various noise and sampling regimes within the joint recovery setting.
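
    The abstract does not spell out the standard matrix-completion baseline; a common algorithm of the kind benchmarked in such studies is SoftImpute-style iterative soft-thresholded SVD, sketched below under our own assumptions about the data layout (this is not the paper's estimator):

```python
import numpy as np

def soft_impute(X, observed_mask, tau=1.0, n_iters=100):
    """X: data matrix with arbitrary values at unobserved entries.
    Recovers missing entries via a low-rank fit to the observed ones."""
    Z = np.where(observed_mask, X, 0.0)
    for _ in range(n_iters):
        U, s, Vt = np.linalg.svd(Z, full_matrices=False)
        s = np.maximum(s - tau, 0.0)   # soft-threshold the singular values
        low_rank = (U * s) @ Vt
        # Keep observed entries fixed; update only the missing ones.
        Z = np.where(observed_mask, X, low_rank)
    return Z
```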

    Representation learning in finance

    Finance studies often employ heterogeneous datasets from different sources with different structures and frequencies. Some data are noisy, sparse, and unbalanced, with missing values; some are unstructured, containing text or networks. Traditional techniques often struggle to combine these datasets and extract information from them effectively. This work explores representation learning, a proven machine learning technique, for learning informative embeddings from complex, noisy, and dynamic financial data. This dissertation proposes novel factorization algorithms and network modeling techniques to learn local and global representations of data in two specific financial applications: analysts’ earnings forecasts and asset pricing. Financial analysts’ earnings forecasts are among the most critical inputs for security valuation and investment decisions. However, it is challenging to fully utilize this type of data because of missing values. This work proposes a matrix-based algorithm, “Coupled Matrix Factorization,” and a tensor-based algorithm, the “Nonlinear Tensor Coupling and Completion Framework,” to impute missing values in analysts’ earnings forecasts and then use the imputed data to predict firms’ future earnings. Experimental analysis shows that missing value imputation and representation learning by coupled matrix/tensor factorization from the observed entries improve the accuracy of firm earnings prediction. The results confirm that representing financial time series in their natural third-order tensor form improves the latent representation of the data: it learns high-quality embeddings by avoiding the information loss incurred when flattening data along spatial or temporal dimensions. Traditional asset pricing models focus on linear relationships among asset pricing factors and often ignore nonlinear interactions among firms and factors. This dissertation formulates novel methods to identify nonlinear asset pricing factors and develops asset pricing models that capture global and local properties of data. First, this work proposes an autoencoder-based artificial neural network model to capture latent asset pricing factors from the global representation of an equity index. It also shows that the autoencoder effectively identifies communal and non-communal assets in an index to facilitate portfolio optimization. Second, the global representation is augmented by propagating information from local communities, where the network determines the strength of this information propagation. Based on the Laplacian spectrum of the equity market network, a network factor, the “Z-score,” is proposed to facilitate pertinent information propagation and capture dynamic changes in network structures. Finally, a “Dynamic Graph Learning Framework for Asset Pricing” is proposed to combine both global and local representations of the data into one end-to-end asset pricing model. Using a graph attention mechanism and an information diffusion function, the proposed model learns new connections for implicit networks and refines connections of explicit networks. Experimental analysis shows that the proposed model incorporates information from negative and positive connections, captures the network evolution of the equity market over time, and outperforms other state-of-the-art asset pricing and predictive machine learning models in stock return prediction.
In a broader context, this is pioneering work in FinTech, particularly in understanding complex financial market structures and developing explainable artificial intelligence models for finance applications. This work effectively demonstrates the application of machine learning to model financial networks, capture nonlinear interactions in data, and provide investors with powerful data-driven techniques for informed decision-making.
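
As a minimal sketch of the autoencoder idea (architecture sizes and the training loop are our own illustrative choices, not the dissertation's), a bottleneck network over a cross-section of asset returns yields latent pricing factors:

```python
import torch
import torch.nn as nn

class ReturnAutoencoder(nn.Module):
    """Bottleneck autoencoder: the low-dimensional code plays the role
    of latent pricing factors for a cross-section of asset returns."""
    def __init__(self, n_assets, n_factors=8):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(n_assets, 64), nn.ReLU(), nn.Linear(64, n_factors))
        self.decoder = nn.Sequential(
            nn.Linear(n_factors, 64), nn.ReLU(), nn.Linear(64, n_assets))

    def forward(self, returns):            # returns: (batch, n_assets)
        factors = self.encoder(returns)    # latent pricing factors
        return self.decoder(factors), factors

def train(model, loader, epochs=50, lr=1e-3):
    """Illustrative loop: minimizing reconstruction error drives the
    bottleneck to capture the communal structure of the index."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        for (batch,) in loader:
            recon, _ = model(batch)
            loss = nn.functional.mse_loss(recon, batch)
            opt.zero_grad()
            loss.backward()
            opt.step()
```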

    Spatial-Temporal Data Mining for Ocean Science: Data, Methodologies, and Opportunities

    With the increasing amount of spatial-temporal (ST) ocean data, numerous spatial-temporal data mining (STDM) studies have been conducted to address various oceanic issues, e.g., climate forecasting and disaster warning. Compared with typical ST data (e.g., traffic data), ST ocean data is more complicated, with unique characteristics such as diverse regionality and high sparsity. These characteristics make it difficult to design and train STDM models. Unfortunately, an overview of these studies is still missing, hindering computer scientists from identifying the research issues in ocean science while discouraging researchers in ocean science from applying advanced STDM techniques. To remedy this situation, we provide a comprehensive survey summarizing existing STDM studies for the ocean. Concretely, we first summarize the widely used ST ocean datasets and identify their unique characteristics. Then, typical ST ocean data quality enhancement techniques are discussed. Next, we classify existing STDM studies for the ocean into four types of tasks, i.e., prediction, event detection, pattern mining, and anomaly detection, and elaborate on the techniques for these tasks. Finally, promising research opportunities are highlighted. This survey will help scientists from the fields of both computer science and ocean science gain a better understanding of the fundamental concepts, key techniques, and open challenges of STDM for the ocean.

    Automatic generation of meta classifiers with large levels for distributed computing and networking

    This paper is devoted to a case study of a new construction of classifiers, called automatically generated multi-level meta classifiers (AGMLMC). The construction combines diverse meta classifiers in a new way to create a unified system, and it can be generated automatically, producing classifiers with many levels: different meta classifiers are incorporated as low-level integral parts of another meta classifier at the top level. The construction is intended for distributed computing and networking, since AGMLMC classifiers are unified classifiers with many parts that can operate in parallel, which makes them easy to adopt in distributed applications. This paper introduces the new construction and undertakes an experimental study of its performance in a case study on the detection and filtering of phishing emails, a potentially important application area for such large and distributed classification systems. Our experiments investigate the effectiveness of combining diverse meta classifiers into one AGMLMC classifier for this task. The results show that the new classifiers with many levels achieved better performance than the base classifiers and simple meta classifiers, demonstrating that the new technique can increase performance when diverse meta classifiers are included in the system.
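
    The AGMLMC construction itself is defined in the paper; the scikit-learn sketch below only illustrates the underlying idea of nesting meta classifiers, with stacking ensembles used as base learners of a higher-level stacking ensemble (all estimator choices are placeholders):

```python
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Two low-level meta classifiers, each a stack of diverse base learners.
level1_a = StackingClassifier(
    estimators=[("nb", GaussianNB()), ("dt", DecisionTreeClassifier())],
    final_estimator=LogisticRegression())
level1_b = StackingClassifier(
    estimators=[("rf", RandomForestClassifier()),
                ("dt", DecisionTreeClassifier())],
    final_estimator=LogisticRegression())

# Top-level meta classifier whose base learners are themselves meta
# classifiers -- each low-level stack could be trained in parallel.
top_level = StackingClassifier(
    estimators=[("meta_a", level1_a), ("meta_b", level1_b)],
    final_estimator=LogisticRegression())
# Usage: top_level.fit(X_train, y_train); top_level.predict(X_test)
```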