
    Machine Learning and Integrative Analysis of Biomedical Big Data.

    Recent developments in high-throughput technologies have accelerated the accumulation of massive amounts of omics data from multiple sources: genome, epigenome, transcriptome, proteome, metabolome, etc. Traditionally, data from each source (e.g., genome) are analyzed in isolation using statistical and machine learning (ML) methods. Integrative analysis of multi-omics and clinical data is key to new biomedical discoveries and advancements in precision medicine. However, data integration poses new computational challenges and exacerbates those associated with single-omics studies. Specialized computational approaches are required to effectively and efficiently perform integrative analysis of biomedical data acquired from diverse modalities. In this review, we discuss state-of-the-art ML-based approaches for tackling five specific computational challenges associated with integrative analysis: the curse of dimensionality, data heterogeneity, missing data, class imbalance, and scalability issues.
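
    As a concrete illustration of the curse-of-dimensionality challenge named above, the sketch below concatenates synthetic omics blocks and applies a univariate filter before classification. The data, feature counts, and the choice of filter (SelectKBest with ANOVA F-scores) are illustrative assumptions, not methods prescribed by the review.

    ```python
    # Minimal sketch: reducing dimensionality of an integrated multi-omics matrix
    # before classification. All data and parameters are synthetic assumptions.
    import numpy as np
    from sklearn.feature_selection import SelectKBest, f_classif
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score
    from sklearn.pipeline import make_pipeline

    rng = np.random.default_rng(0)
    n_samples = 100
    # Three synthetic omics blocks with many more features than samples.
    genome = rng.normal(size=(n_samples, 2000))
    transcriptome = rng.normal(size=(n_samples, 1500))
    proteome = rng.normal(size=(n_samples, 500))
    X = np.hstack([genome, transcriptome, proteome])   # integrated feature matrix
    y = rng.integers(0, 2, size=n_samples)             # binary clinical outcome

    # Univariate filtering reduces 4000 features to 50 before the classifier sees them.
    model = make_pipeline(SelectKBest(f_classif, k=50),
                          LogisticRegression(max_iter=1000))
    print(cross_val_score(model, X, y, cv=5).mean())
    ```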

    Feature-based multi-class classification and novelty detection for fault diagnosis of industrial machinery

    Given the strategic role that maintenance plays in achieving profitability and competitiveness, many industries are dedicating significant effort and resources to improving their maintenance approaches. The concept of the Smart Factory and the possibility of highly connected plants enable the collection of massive data that allow equipment to be monitored continuously and provide real-time feedback on its health status. The main issue faced by industries is the lack of data corresponding to faulty conditions, owing to the environmental and safety hazards that failed machinery might cause, besides production losses and product quality issues. In this paper, a complete and easy-to-implement procedure for streaming fault diagnosis and novelty detection, using different Machine Learning techniques, is applied to an industrial machinery sub-system. The paper aims to offer useful guidelines for practitioners to choose the best solution for their systems, including a model hyperparameter optimization technique that supports the choice of the best model. Results indicate that the methodology is easy, fast, and accurate. Little training data is needed to achieve high accuracy and high generalization ability of the classification models, while the integration of a classifier and an anomaly detector reduces the number of false alarms and the computational time.
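
    A minimal sketch of the pairing the abstract describes: a supervised multi-class fault classifier combined with a novelty detector that screens incoming samples, so conditions unlike the training data raise an alarm instead of being forced into a known class. The specific estimators (SVC, IsolationForest) and the synthetic data are assumptions, not the authors' exact pipeline.

    ```python
    # Sketch: fault classifier + novelty detector for streaming diagnosis.
    # Estimators and data are illustrative assumptions.
    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.ensemble import IsolationForest
    from sklearn.svm import SVC

    # Synthetic stand-in for feature vectors of known machine conditions (3 classes).
    X_train, y_train = make_classification(n_samples=300, n_features=10,
                                           n_informative=6, n_classes=3,
                                           random_state=0)

    clf = SVC(kernel="rbf", C=10, gamma="scale").fit(X_train, y_train)
    detector = IsolationForest(random_state=0).fit(X_train)  # trained on known conditions only

    def diagnose(x):
        """Return a known fault label, or 'novelty' if the sample looks unlike training data."""
        x = x.reshape(1, -1)
        if detector.predict(x)[0] == -1:      # -1 flags an outlier / novel condition
            return "novelty"
        return int(clf.predict(x)[0])

    print(diagnose(X_train[0]))               # a known condition
    print(diagnose(np.full(10, 8.0)))         # far outside the training distribution
    ```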

    Hybrid ACO and SVM algorithm for pattern classification

    Ant Colony Optimization (ACO) is a metaheuristic algorithm that can be used to solve a variety of combinatorial optimization problems. A new direction for ACO is optimizing continuous and mixed (discrete and continuous) variables. Support Vector Machine (SVM) is a pattern classification approach that originated from statistical approaches. However, SVM suffers from two main problems: feature subset selection and parameter tuning. Most approaches to tuning SVM parameters discretize the continuous parameter values, which negatively affects classification performance. This study presents four algorithms for tuning the SVM parameters and selecting the feature subset, which improved SVM classification accuracy with a smaller feature subset. This is achieved by performing SVM parameter tuning and feature subset selection simultaneously. Hybrid algorithms combining ACO and SVM techniques were proposed. The first two algorithms, ACOR-SVM and IACOR-SVM, tune the SVM parameters, while the other two, ACOMV-R-SVM and IACOMV-R-SVM, tune the SVM parameters and select the feature subset simultaneously. Ten benchmark datasets from the University of California, Irvine, were used in the experiments to validate the performance of the proposed algorithms. Experimental results obtained from the proposed algorithms are better than those of other approaches in terms of classification accuracy and feature subset size. The average classification accuracies for the ACOR-SVM, IACOR-SVM, ACOMV-R and IACOMV-R algorithms are 94.73%, 95.86%, 97.37% and 98.1%, respectively. The average feature subset size is eight for the ACOR-SVM and IACOR-SVM algorithms and four for the ACOMV-R and IACOMV-R algorithms. This study contributes a new direction for ACO, enabling it to handle continuous and mixed-variable optimization problems.
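
    The sketch below illustrates only the joint search space shared by the four hybrids: a candidate solution is a tuple (C, gamma, feature mask), scored by cross-validated SVM accuracy. Plain random search stands in for the ant colony; the ACOR/ACOMV pheromone-based update rules of the study are not reproduced, and the dataset and search budget are assumptions.

    ```python
    # Sketch of the fitness evaluation for simultaneous SVM parameter tuning and
    # feature subset selection. Random search replaces the ACO colony here.
    import numpy as np
    from sklearn.datasets import load_breast_cancer
    from sklearn.model_selection import cross_val_score
    from sklearn.svm import SVC

    X, y = load_breast_cancer(return_X_y=True)
    rng = np.random.default_rng(42)

    def fitness(C, gamma, mask):
        """Cross-validated accuracy of an SVM on the selected feature subset."""
        if mask.sum() == 0:                        # at least one feature must be kept
            return 0.0
        return cross_val_score(SVC(C=C, gamma=gamma), X[:, mask], y, cv=3).mean()

    best = (0.0, None)
    for _ in range(20):                            # search budget (ant iterations in the real algorithm)
        C = 10 ** rng.uniform(-2, 3)               # continuous parameter
        gamma = 10 ** rng.uniform(-4, 1)           # continuous parameter
        mask = rng.random(X.shape[1]) < 0.5        # discrete feature subset
        score = fitness(C, gamma, mask)
        if score > best[0]:
            best = (score, (C, gamma, mask.sum()))

    print("best CV accuracy:", round(best[0], 4), "with", best[1][2], "features")
    ```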

    MACHINERY ANOMALY DETECTION UNDER INDETERMINATE OPERATING CONDITIONS

    Anomaly detection is a critical task in system health monitoring. Current practice of anomaly detection in machinery systems is still unsatisfactory. One issue lies in the use of features. Some features are insensitive to changes in health, and some are redundant with each other. These insensitive and redundant features mislead the detection. Another issue arises from the influence of operating conditions, where a change in operating conditions can be mistakenly detected as an anomalous state of the system. Operating conditions are usually changing, and they may not be readily identified. They contribute to false positive detection either through non-predictive features driven by operating conditions, or by influencing predictive features. This dissertation contributes to the reduction of false detection by developing methods to select predictive features and use them to span a space for anomaly detection under indeterminate operating conditions. Available feature selection methods fail to provide consistent results when some features are correlated. A method was developed in this dissertation to explore the correlation structure of the features and group correlated features into the same clusters. A representative feature from each cluster is selected to form a non-correlated set of features, from which an optimized subset of predictive features is selected. After feature selection, the influence of operating conditions through non-predictive variables is removed. To remove the influence on predictive features, a clustering-based anomaly detection method is developed. Observations are collected when the system is healthy, and these observations are grouped into clusters corresponding to the states of the operating conditions, with automatic estimation of the clustering parameters. Anomalies are detected if the test data are not members of the clusters. Correct partitioning of clusters remains an open challenge due to the lack of research on clustering machinery health monitoring data. This dissertation uses unimodality of the data as a criterion for clustering validation, and a unimodality-based clustering method is developed. The methods of this dissertation were evaluated on simulated data, benchmark data, an experimental study, and field data. These methods provide consistent results and outperform representatives of available methods. Although the focus of this dissertation is on machinery systems, the methods developed can be adapted to other application scenarios for anomaly detection, feature selection, and clustering.
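
    A hedged sketch of the two ideas in the abstract: (1) group correlated features and keep one representative per group, and (2) cluster healthy-state observations and flag test points that are not members of any cluster. The correlation threshold, the hierarchical-clustering step, and KMeans (standing in for the dissertation's unimodality-based clustering) are illustrative assumptions, not the dissertation's algorithms.

    ```python
    # Sketch: correlation-based feature grouping + cluster-membership anomaly test.
    import numpy as np
    from scipy.cluster.hierarchy import fcluster, linkage
    from scipy.spatial.distance import squareform
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(1)
    healthy = rng.normal(size=(400, 12))
    healthy[:, 6:] = healthy[:, :6] + 0.05 * rng.normal(size=(400, 6))  # redundant copies

    # (1) Cluster features on correlation distance; keep one representative per cluster.
    corr = np.corrcoef(healthy, rowvar=False)
    dist = 1.0 - np.abs(corr)
    np.fill_diagonal(dist, 0.0)
    groups = fcluster(linkage(squareform(dist, checks=False), method="average"),
                      t=0.2, criterion="distance")
    reps = [np.where(groups == g)[0][0] for g in np.unique(groups)]

    # (2) Cluster healthy observations (proxy for operating-condition states).
    km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(healthy[:, reps])
    radius = np.quantile(np.min(km.transform(healthy[:, reps]), axis=1), 0.99)

    def is_anomaly(x):
        """True if the sample is not a member of any healthy-state cluster."""
        return np.min(km.transform(x[reps].reshape(1, -1))) > radius

    print(is_anomaly(healthy[0]), is_anomaly(np.full(12, 10.0)))
    ```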

    HYBRID FEATURE SELECTION AND SUPPORT VECTOR MACHINE FRAMEWORK FOR PREDICTING MAINTENANCE FAILURES

    The main aim of predictive maintenance is to minimize downtime, failure risks, and maintenance costs in manufacturing systems. Over the past few years, machine learning methods have gained ground with diverse and successful applications in the area of predictive maintenance. This study shows that applying preprocessing techniques such as oversampling and feature selection before failure prediction is promising. For instance, to handle imbalanced data, the SMOTE-Tomek method is used. For feature selection, three different methods can be applied: Recursive Feature Elimination, Random Forest, and Variance Threshold. The data considered in this paper for simulation have been used in the literature; they consist of aircraft engine sensor measurements used to predict engine failure, and the prediction algorithm is a Support Vector Machine. The results show that classification accuracy can be significantly boosted by using these preprocessing techniques.
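
    A minimal sketch of a pipeline of the kind described above: SMOTE-Tomek resampling for class imbalance, Recursive Feature Elimination for feature selection, and an SVM classifier. The synthetic data stands in for the aircraft engine sensor measurements, and the parameter choices are assumptions rather than the study's settings.

    ```python
    # Sketch: SMOTE-Tomek + RFE + SVM for imbalanced failure prediction.
    from imblearn.combine import SMOTETomek
    from imblearn.pipeline import Pipeline
    from sklearn.datasets import make_classification
    from sklearn.feature_selection import RFE
    from sklearn.model_selection import cross_val_score
    from sklearn.svm import SVC

    # Imbalanced synthetic data: roughly 10% failure cases.
    X, y = make_classification(n_samples=1000, n_features=20, weights=[0.9, 0.1],
                               random_state=0)

    pipe = Pipeline([
        ("resample", SMOTETomek(random_state=0)),                    # balance the classes
        ("select", RFE(SVC(kernel="linear"), n_features_to_select=8)),  # keep 8 features
        ("clf", SVC(kernel="rbf", C=10, gamma="scale")),             # final classifier
    ])

    print("CV accuracy:", cross_val_score(pipe, X, y, cv=5).mean())
    ```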

    Review of data mining applications for quality assessment in manufacturing industry: Support Vector Machines

    In many modern manufacturing industries, data that characterize the manufacturing process are electronically collected and stored in databases. Due to advances in data collection systems and analysis tools, data mining (DM) has been widely applied for quality assessment (QA) in manufacturing industries. In DM, the choice of technique for analyzing a dataset and assessing quality depends on the understanding of the analyst. On the other hand, with the advent of improved and efficient prediction techniques, an analyst needs to know which tool performs best for a particular type of dataset. Although a few review papers have recently been published discussing DM applications in manufacturing for QA, this paper provides an extensive review of the application of a specific DM technique, namely the support vector machine (SVM), to QA problems. The review provides a comprehensive analysis of the literature from various points of view: DM preliminaries, data preprocessing, DM applications for each quality task, SVM preliminaries, and application results. Summary tables and figures are also provided in addition to the analyses. Finally, conclusions and future research directions are provided.

    Mathematical optimization and feature selection

    Universidad de Sevilla. Bachelor's Degree in Mathematics.

    A Review of using Data Mining Techniques in Power Plants

    Data mining techniques and their applications have developed rapidly during the last two decades. This paper reviews applications of data mining techniques in power systems, especially in power plants, through a survey of the literature from 2000 to 2015. Keyword indices, article abstracts, and conclusions were used to classify more than 86 articles on the application of data mining in power plants, drawn from many academic journals and research centers. Because this paper concerns the application of data mining in power plants, it begins with a brief introduction to data mining and power systems to give the reader a better view of these two different disciplines. The paper presents a comprehensive survey of the collected articles and classifies them according to three categories: the techniques used, the problem, and the application area. From this review, we found that data mining techniques (classification, regression, clustering, and association rules) can be used to solve many types of problems in power plants, such as predicting the amount of generated power, failure prediction, failure diagnosis, failure detection, and many others. There is also no single standard technique suited to a specific problem. Application of data mining in power plants is a rich research area and still needs more exploration.

    Analysis of Microarray Data using Machine Learning Techniques on Scalable Platforms

    Microarray-based gene expression profiling has emerged as an efficient technique for the classification, diagnosis, prognosis, and treatment of cancer. Frequent changes in the behavior of this disease generate a huge volume of data. The data retrieved from microarrays raise issues of veracity and change over time (velocity). Moreover, they are high-dimensional, with far more features than samples. Therefore, analyzing high-dimensional microarray datasets within a short period is essential. Such datasets often contain a huge number of features, only a fraction of which correspond to significantly expressed genes. Identifying the precise and interesting genes responsible for cancer is imperative in microarray data analysis. Most existing schemes employ a two-phase process, such as feature selection/extraction followed by classification. Our investigation starts with the analysis of microarray data using kernel-based classifiers, followed by feature selection using the statistical t-test. In this work, various kernel-based classifiers, such as the Extreme Learning Machine (ELM), the Relevance Vector Machine (RVM), and a newly proposed method called the Kernel Fuzzy Inference System (KFIS), are implemented. The proposed models are investigated using three microarray datasets: leukemia, breast cancer, and ovarian cancer. Finally, the performance of these classifiers is measured and compared with the Support Vector Machine (SVM). The results reveal that the proposed models classify the datasets efficiently, with performance comparable to existing kernel-based classifiers. As the data size increases, handling and processing these datasets becomes a bottleneck. Hence, a distributed, scalable cluster such as Hadoop is needed for storing (HDFS) and processing (MapReduce as well as Spark) the datasets efficiently. The next contribution of this thesis deals with the implementation of feature selection methods that are able to process the data in a distributed manner. Various statistical tests, such as the ANOVA, Kruskal-Wallis, and Friedman tests, are implemented using the MapReduce and Spark frameworks and executed on top of the Hadoop cluster. The performance of these scalable models is measured and compared with a conventional system. The results show that the proposed scalable models are very efficient at processing data of large size (GBs, TBs, etc.), which cannot be handled by traditional implementations of those algorithms. After selecting the relevant features, the next contribution of this thesis is the scalable implementation of the proximal support vector machine classifier, an efficient variant of the SVM. The proposed classifier is implemented on two scalable frameworks, MapReduce and Spark, and executed on the Hadoop cluster. The obtained results are compared with those of a conventional system, and they show that the scalable cluster is well suited to Big Data. Furthermore, Spark proves more efficient than MapReduce, owing to its intelligent handling of datasets through Resilient Distributed Datasets (RDDs) and in-memory processing. Therefore, the next contribution of the thesis is the implementation of various scalable classifiers based on Spark.
    In this work, various classifiers, such as Logistic Regression (LR), Support Vector Machine (SVM), Naive Bayes (NB), K-Nearest Neighbor (KNN), Artificial Neural Network (ANN), and Radial Basis Function Network (RBFN) with two variants (hybrid and gradient-descent learning algorithms), are proposed and implemented using the Spark framework. The proposed scalable models are executed on the Hadoop cluster as well as on a conventional system, and the results are investigated. The results show that the scalable algorithms are much more efficient than the conventional system for processing Big datasets. The efficacy of the proposed scalable algorithms in handling Big datasets is investigated and compared with the conventional system (where data are not distributed, are kept on a standalone machine, and are processed in a traditional manner). The comparative analysis shows that the scalable algorithms are far more efficient at processing Big datasets on the Hadoop cluster than on the conventional system.
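
    A minimal sketch of a Spark-based classifier of the kind the thesis describes: logistic regression trained on a distributed DataFrame via pyspark.ml. The tiny inline rows, column names, and parameters are illustrative assumptions; the thesis trains on feature-selected microarray data stored in HDFS on a Hadoop cluster.

    ```python
    # Sketch: distributed logistic regression with pyspark.ml on toy data.
    from pyspark.ml.classification import LogisticRegression
    from pyspark.ml.linalg import Vectors
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("microarray-lr-sketch").getOrCreate()

    # Toy rows standing in for (label, selected gene-expression features).
    rows = [(0.0, Vectors.dense([0.1, 1.2, -0.3])),
            (1.0, Vectors.dense([2.0, -0.5, 0.8])),
            (0.0, Vectors.dense([0.0, 1.0, -0.1])),
            (1.0, Vectors.dense([1.8, -0.7, 1.1]))]
    df = spark.createDataFrame(rows, ["label", "features"])

    model = LogisticRegression(maxIter=50).fit(df)          # distributed training
    model.transform(df).select("label", "prediction").show()

    spark.stop()
    ```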