569 research outputs found

    Novel Computationally Intelligent Machine Learning Algorithms for Data Mining and Knowledge Discovery

    This thesis addresses three major issues in data mining: feature subset selection in high-dimensional domains, plausible reconstruction of incomplete data in cross-sectional applications, and forecasting of univariate time series. For the automated selection of an optimal subset of features in real time, we present an improved hybrid algorithm: SAGA. SAGA combines the ability of Simulated Annealing to avoid being trapped in local minima with the very high convergence rate of the crossover operator of Genetic Algorithms, the strong local search ability of greedy algorithms, and the high computational efficiency of generalized regression neural networks (GRNNs). For imputing missing values and forecasting univariate time series, we propose a homogeneous neural network ensemble. The proposed ensemble consists of a committee of GRNNs trained on different subsets of features generated by SAGA, and the predictions of the base classifiers are combined by a fusion rule. This approach makes it possible to discover all important interrelations between the values of the target variable and the input features. The proposed ensemble scheme has two innovative features which make it stand out amongst ensemble learning algorithms: (1) the ensemble makeup is optimized automatically by SAGA; and (2) GRNN is used for both the base classifiers and the top-level combiner classifier. Because of GRNN, the proposed ensemble is a dynamic weighting scheme. This is in contrast to existing ensemble approaches, which rely on simple voting and static weighting strategies. The basic idea of the dynamic weighting procedure is to give a higher reliability weight to those scenarios that are similar to the new ones. The simulation results demonstrate the validity of the proposed ensemble model.
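The dynamic weighting idea in the abstract above can be illustrated with the standard GRNN prediction rule, which is a kernel-weighted average of the training targets: training points closer to the query receive larger weights. The following is a minimal sketch of that general rule only, not the thesis's SAGA ensemble. The function name `grnn_predict` and the smoothing parameter `sigma` are illustrative assumptions.

```python
import math


def grnn_predict(train_x, train_y, query, sigma=1.0):
    """Predict a target for `query` as a Gaussian-weighted average of train_y.

    Hypothetical helper for illustration; `sigma` is the kernel width.
    """
    weights = []
    for x in train_x:
        # Squared Euclidean distance between a training point and the query.
        sq_dist = sum((a - b) ** 2 for a, b in zip(x, query))
        # Closer training points get a weight nearer to 1.
        weights.append(math.exp(-sq_dist / (2.0 * sigma ** 2)))
    total = sum(weights)
    if total == 0.0:
        # Every weight underflowed to zero: fall back to the plain mean.
        return sum(train_y) / len(train_y)
    return sum(w * y for w, y in zip(weights, train_y)) / total


# Usage: the weights depend on the query, so they change per prediction.
xs = [[0.0], [1.0]]
ys = [0.0, 1.0]
midpoint = grnn_predict(xs, ys, [0.5])             # both points weighted equally
near_zero = grnn_predict(xs, ys, [0.0], sigma=0.1)  # dominated by the first point
```

Because the weights are recomputed for every query, the same committee of training points contributes differently to each prediction, which is what distinguishes this from a static weighting scheme.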

    Causal Discovery from Temporal Data: An Overview and New Perspectives

    Temporal data, representing chronological observations of complex systems, has always been a typical data structure that can be widely generated by many domains, such as industry, medicine and finance. Analyzing this type of data is extremely valuable for various applications. Thus, different temporal data analysis tasks, e.g., classification, clustering and prediction, have been proposed in the past decades. Among them, causal discovery, learning the causal relations from temporal data, is considered an interesting yet critical task and has attracted much research attention. Existing causal discovery works can be divided into two highly correlated categories according to whether the temporal data is calibrated, i.e., multivariate time series causal discovery and event sequence causal discovery. However, most previous surveys focus only on time series causal discovery and ignore the second category. In this paper, we specify the correlation between the two categories and provide a systematic overview of existing solutions. Furthermore, we provide public datasets, evaluation metrics and new perspectives for temporal data causal discovery. Comment: 52 pages, 6 figures

    Hybrid deep neural networks for mining heterogeneous data

    In the era of big data, the rapidly growing flood of data represents an immense opportunity. New computational methods are desired to fully leverage the potential that exists within massive structured and unstructured data. However, decision-makers are often confronted with multiple diverse heterogeneous data sources. The heterogeneity includes different data types, different granularities, and different dimensions, posing a fundamental challenge in many applications. This dissertation focuses on designing hybrid deep neural networks for modeling various kinds of data heterogeneity. The first part of this dissertation concerns modeling diverse data types, the first kind of data heterogeneity. Specifically, image data and heterogeneous meta data are modeled. Detecting Copy Number Variations (CNVs) in genetic studies is used as a motivating example. A CNN-DNN blended neural network is proposed to authenticate CNV calls made by current state-of-the-art CNV detection algorithms. It utilizes hybrid deep neural networks to leverage both scatter plot image signal and heterogeneous numerical meta data for improving CNV calling and review efficiency. The second part of this dissertation deals with data of various frequencies or scales in time series data analysis, the second kind of data heterogeneity. The stock return forecasting problem in the finance field is used as a motivating example. A hybrid framework of Long Short-Term Memory and Deep Neural Network (LSTM-DNN) is developed to enrich the time-series forecasting task with static fundamental information. The application of the proposed framework is not limited to the stock return forecasting problem, but extends to any time-series-based prediction task. The third part of this dissertation extends the LSTM-DNN framework to account for both temporal and spatial dependency among variables, which is common in many applications. For example, it is known that stock prices of relevant firms tend to fluctuate together. Such coherent price changes among relevant stocks are referred to as spatial dependency. In this part, a Variational Auto Encoder (VAE) is first utilized to recover the latent graphical dependency structure among variables. Then a hybrid deep neural network of Graph Convolutional Network and Long Short-Term Memory network (GCN-LSTM) is developed to model both the graph-structured spatial dependency and the temporal dependency of variables at different scales. Extensive experiments are conducted to demonstrate the effectiveness of the proposed neural networks in solving three representative real-world problems. Additionally, the proposed frameworks can also be applied to other areas with similar heterogeneous inputs.

    Anomaly Detection and Exploratory Causal Analysis for SAP HANA

    Nowadays, the proper functioning of equipment, networks and systems is key to keeping a company's business running, because in the era of big data companies cannot avoid relying on information technology to support their business. However, technology is never infallible, and faults that sometimes give rise to critical situations may appear at any time. To detect and prevent failures, it is essential to have a good monitoring system which is responsible for controlling the technology used by a company (hardware, networks and communications, operating systems or applications, among others) in order to analyze its operation and performance, and to detect and alert about possible errors. The aim of this thesis is thus to further advance the fields of anomaly detection and exploratory causal inference, which are two major research areas in monitoring systems, and to provide efficient algorithms with regard to usability, maintainability and scalability. The analyzed results can be viewed as a starting point for the root cause analysis of system performance issues, helping to avoid system failures or minimize the resolution time of such issues in the future. Finally, the algorithms were applied to historical data from an SAP HANA database, and the results obtained in this thesis indicate that the tools have succeeded in providing some useful information for diagnosing the performance issues of the system.

    Data Mining Applications: Promise and Challenges

    Data mining is an emerging field gaining acceptance in research and industry. This is evidenced by an increasing number of research publications, conferences, journals and industry initiatives focused on this field in the recent past. Data mining aims to solve an intricate problem faced by a number of application domains today, given the deluge of data that exists and is continually collected, typically in large electronic databases: that is, to extract useful, meaningful knowledge from these vast data sets. Human analytical capabilities are limited, especially when it comes to analysing large and complex data sets. Data mining provides a number of tools and techniques that enable analysis of such data sets. Data mining incorporates techniques from a number of fields including statistics, machine learning, database management, artificial intelligence, pattern recognition, and data visualisation.

    Regulatory Snapshots: Integrative Mining of Regulatory Modules from Expression Time Series and Regulatory Networks

    Explaining regulatory mechanisms is crucial to understanding complex cellular responses leading to system perturbations. Some strategies reverse engineer regulatory interactions from experimental data, while others identify functional regulatory units (modules) under the assumption that biological systems yield a modular organization. Most modular studies focus on network structure and static properties, ignoring that gene regulation is largely driven by stimulus-response behavior. Expression time series are key to gaining insight into dynamics, but have been insufficiently explored by current methods, which often (1) apply generic algorithms unsuited for expression analysis over time, due to inability to maintain the chronology of events or incorporate time dependency; (2) ignore local patterns, abundant in most interesting cases of transcriptional activity; (3) neglect physical binding or lack automatic association of regulators, focusing mainly on expression patterns; or (4) limit the discovery to a predefined number of modules. We propose Regulatory Snapshots, an integrative mining approach to identify regulatory modules over time by combining transcriptional control with response, while overcoming the above challenges. Temporal biclustering is first used to reveal transcriptional modules composed of genes showing coherent expression profiles over time. Personalized ranking is then applied to prioritize prominent regulators targeting the modules at each time point using a network of documented regulatory associations and the expression data. Custom graphics are finally depicted to expose the regulatory activity in a module at consecutive time points (snapshots). Regulatory Snapshots successfully unraveled modules underlying yeast response to heat shock and human epithelial-to-mesenchymal transition, based on regulations documented in the YEASTRACT and JASPAR databases, respectively, and available expression data. Regulatory players involved in functionally enriched processes related to these biological events were identified. Ranking scores further suggested an ability to discern the primary role of a gene (target or regulator). A prototype is available at: http://kdbio.inesc-id.pt/software/regulatorysnapshots

    Systematic Review on Missing Data Imputation Techniques with Machine Learning Algorithms for Healthcare

    Missing data is one of the most common issues encountered in the data cleaning process, especially when dealing with medical datasets. A real collected dataset is prone to be incomplete, inconsistent, noisy and redundant due to potential reasons such as human errors, instrumental failures, and adverse death. Therefore, to accurately deal with incomplete data, a sophisticated algorithm is needed to impute those missing values. Many machine learning algorithms have been applied to impute missing data with plausible values. However, among all machine learning imputation algorithms, the KNN algorithm has been widely adopted for imputing missing data due to its robustness and simplicity, and it is also a promising method that can outperform other machine learning methods. This paper provides a comprehensive review of different imputation techniques used to replace missing data. The goal of the review paper is to bring specific attention to potential improvements to existing methods and provide readers with a better grasp of imputation technique trends.
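As a rough illustration of the KNN imputation mentioned in the abstract above, the sketch below fills each missing entry with the mean of that feature over the k most similar rows. It is a generic sketch under stated assumptions, not a method taken from the review: missing values are encoded as `None`, similarity is measured only over features observed in both rows, and the name `knn_impute` is hypothetical.

```python
import math


def knn_impute(rows, k=2):
    """Return a copy of `rows` with None entries filled by a k-nearest-neighbour mean.

    Hypothetical helper for illustration; `rows` is a list of equal-length lists.
    """
    def distance(a, b):
        # Compare only the features that both rows have observed.
        shared = [(p, q) for p, q in zip(a, b) if p is not None and q is not None]
        if not shared:
            return math.inf
        # Root mean squared difference keeps rows with different overlap comparable.
        return math.sqrt(sum((p - q) ** 2 for p, q in shared) / len(shared))

    imputed = [list(r) for r in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not None:
                continue
            # Donors are the other rows in which feature j is observed.
            donors = [(distance(row, other), other[j])
                      for m, other in enumerate(rows)
                      if m != i and other[j] is not None]
            donors.sort(key=lambda d: d[0])
            nearest = [v for _, v in donors[:k]]
            if nearest:
                imputed[i][j] = sum(nearest) / len(nearest)
    return imputed


# Usage: the last row lacks its second feature and borrows it from its two
# closest neighbours (the first two rows), not from the distant third row.
data = [[1.0, 10.0], [1.1, 12.0], [5.0, 50.0], [1.05, None]]
filled = knn_impute(data, k=2)
```

Distances are computed on the original rows rather than on partially filled ones, so the result does not depend on the order in which missing entries are visited.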