423 research outputs found

    Big Data Analytics for Complex Systems

    The evolution of technology in all fields has led modern systems to generate vast amounts of data. Using data to extract information, make predictions, and make decisions is the current trend in artificial intelligence. Advances in big data analytics tools have made accessing and storing data easier and faster than ever, and machine learning algorithms help to identify patterns in data and extract information from it. Current tools and machines in health, computer technologies, and manufacturing can generate massive amounts of raw data about their products or samples. The author of this work proposes a modern integrative system that uses big data analytics, machine learning, supercomputer resources, and measurements from industrial health machines to build a smart system that can mimic the human intelligence skills of observation, detection, prediction, and decision-making. The applications of the proposed smart systems are included as case studies to highlight the contributions of each system. The first contribution is the ability to apply big data and deep learning technologies on production lines to diagnose incidents and take proper action. In the current era of digital industrial transformation, Industry 4.0 has been receiving attention from researchers because it can be used to automate production-line decisions. Reconfigurable manufacturing systems (RMS) have been widely used to reduce the setup cost of restructuring production lines. However, current RMS modules are not linked to the cloud for online decision-making; these modules must connect to an online server (supercomputer) that has big data analytics and machine learning capabilities. Here, "online" means that data is centralized in the cloud (supercomputer) and accessible in real time. 
In this study, deep neural networks are used to detect the decisive features of a product and build a prediction model with which the iFactory makes the necessary decision for defective products. The Spark ecosystem is used to manage the access, processing, and storage of the streaming big data. This contribution is implemented as a closed cycle; to the best of our knowledge, no one in the literature has introduced big data analysis using deep learning on real-time applications in the manufacturing system. The code shows a high accuracy of 97% for classifying normal versus defective items. The second contribution, which is in Bioinformatics, is the ability to build supervised machine learning approaches based on the gene expression of patients to predict proper treatment for breast cancer. In the trial, to personalize treatment, the machine learns the genes that are active in the patient cohort with a five-year survival period. The initial condition here is that each group must undergo only one specific treatment. After learning about each group (or class), the machine can personalize the treatment of a new patient by diagnosing the patient’s gene expression. The proposed model will help in the diagnosis and treatment of the patient. Future work in this area involves building a protein-protein interaction network with the selected genes for each treatment to first analyze the motifs of the genes and target them with the proper drug molecules. In the learning phase, a couple of feature-selection techniques and standard supervised classifiers are used to build the prediction model. Most of the nodes show high performance, with accuracy, sensitivity, specificity, and F-measure close to 100%. The third contribution is the ability to build semi-supervised learning for the breast cancer survival treatment that advances the second contribution. 
By understanding the relations between the classes, we can design the machine learning phase based on the similarities between classes. In the proposed research, the researcher used the Euclidean distance matrix among the survival treatment classes to build the hierarchical learning model. The distance information, which is learned through an unsupervised approach, can help the prediction model to select the classes that are far from each other to maximize the distance between classes and gain wider class groups. The performance measurement of this approach shows a slight improvement over the second model. However, this model reduced the number of discriminative genes from 47 to 37. The model in the second contribution studies each class individually, while this model focuses on the relationships between the classes and uses this information in the learning phase. Hierarchical clustering is performed to draw the borders between groups of classes before building the classification models. Several distance measurements are tested to identify the best linkages between classes. Most of the nodes show high performance, with accuracy, sensitivity, specificity, and F-measure ranging from 90% to 100%. All the case study models showed high performance in the prediction phase. These modern models can be replicated for different problems within different domains. The comprehensive models of the newer technologies are reconfigurable and modular; any newer learning phase can be plugged in at both ends of the learning phase. Therefore, the output of the system can be an input for another learning system, and a newer feature can be added to the input to be considered for the learning phase.

    New Morphological Features for Grading Pancreatic Ductal Adenocarcinomas


    Improved support vector machine classification for imbalanced medical datasets by novel hybrid sampling combining modified mega-trend-diffusion and bagging extreme learning machine model

    To handle imbalanced datasets in machine learning or deep learning models, some studies suggest sampling techniques to generate virtual examples of minority classes to improve the models' prediction accuracy. However, for kernel-based support vector machines (SVM), some sampling methods suggest generating synthetic examples in an original data space rather than in a high-dimensional feature space. This may be ineffective in improving SVM classification for imbalanced datasets. To address this problem, we propose a novel hybrid sampling technique termed modified mega-trend-diffusion-extreme learning machine (MMTD-ELM) to effectively move the SVM decision boundary toward a region of the majority class. By this movement, the prediction of SVM for minority class examples can be improved. The proposed method combines α-cut fuzzy number method for screening representative examples of majority class and MMTD method for creating new examples of the minority class. Furthermore, we construct a bagging ELM model to monitor the similarity between new examples and original data. In this paper, four datasets are used to test the efficiency of the proposed MMTD-ELM method in imbalanced data prediction. Additionally, we deployed two SVM models to compare prediction performance of the proposed MMTD-ELM method with three state-of-the-art sampling techniques in terms of geometric mean (G-mean), F-measure (F1), index of balanced accuracy (IBA) and area under curve (AUC) metrics. Furthermore, paired t-test is used to elucidate whether the suggested method has statistically significant differences from the other sampling techniques in terms of the four evaluation metrics. The experimental results demonstrated that the proposed method achieves the best average values in terms of G-mean, F1, IBA and AUC. Overall, the suggested MMTD-ELM method outperforms these sampling methods for imbalanced datasets
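
Two of the evaluation metrics named in this abstract, G-mean and F1, can be computed from a binary confusion matrix as in the sketch below. The function names and the toy labels are illustrative assumptions, not taken from the paper; the example only shows why these metrics are preferred over plain accuracy on imbalanced data.

```python
import math


def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary problem."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == positive:
            if pred == positive:
                tp += 1
            else:
                fn += 1
        else:
            if pred == positive:
                fp += 1
            else:
                tn += 1
    return tp, fp, tn, fn


def safe_ratio(numerator, denominator):
    """Return numerator / denominator, or 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def g_mean(tp, fp, tn, fn):
    """Geometric mean of sensitivity (minority recall) and specificity."""
    sensitivity = safe_ratio(tp, tp + fn)
    specificity = safe_ratio(tn, tn + fp)
    return math.sqrt(sensitivity * specificity)


def f1_score(tp, fp, tn, fn):
    """Harmonic mean of precision and recall for the positive class."""
    return safe_ratio(2 * tp, 2 * tp + fp + fn)


def accuracy(tp, fp, tn, fn):
    """Fraction of all examples that were classified correctly."""
    return safe_ratio(tp + tn, tp + fp + tn + fn)


# Hypothetical imbalanced example: 2 minority (1) and 8 majority (0) examples.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 0, 0, 1]

counts = confusion_counts(y_true, y_pred)
print("accuracy:", accuracy(*counts))
print("G-mean:  ", g_mean(*counts))
print("F1:      ", f1_score(*counts))
```

In this toy case the classifier misses half of the minority class, which accuracy largely hides but G-mean and F1 expose.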

    Automated Feature Engineering for Deep Neural Networks with Genetic Programming

    Feature engineering is a process that augments the feature vector of a machine learning model with calculated values that are designed to enhance the accuracy of a model’s predictions. Research has shown that the accuracy of models such as deep neural networks, support vector machines, and tree/forest-based algorithms sometimes benefit from feature engineering. Expressions that combine one or more of the original features usually create these engineered features. The choice of the exact structure of an engineered feature is dependent on the type of machine learning model in use. Previous research demonstrated that various model families benefit from different types of engineered feature. Random forests, gradient-boosting machines, or other tree-based models might not see the same accuracy gain that an engineered feature allowed neural networks, generalized linear models, or other dot-product based models to achieve on the same data set. This dissertation presents a genetic programming-based algorithm that automatically engineers features that increase the accuracy of deep neural networks for some data sets. For a genetic programming algorithm to be effective, it must prioritize the search space and efficiently evaluate what it finds. This dissertation algorithm faced a potential search space composed of all possible mathematical combinations of the original feature vector. Five experiments were designed to guide the search process to efficiently evolve good engineered features. The result of this dissertation is an automated feature engineering (AFE) algorithm that is computationally efficient, even though a neural network is used to evaluate each candidate feature. This approach gave the algorithm a greater opportunity to specifically target deep neural networks in its search for engineered features that improve accuracy. 
Finally, a sixth experiment empirically demonstrated the degree to which this algorithm improved the accuracy of neural networks on data sets augmented by the algorithm’s engineered features

    The 6th Conference of PhD Students in Computer Science


    Harnessing Evolution in-Materio as an Unconventional Computing Resource

    This thesis illustrates the use and development of physical conductive analogue systems for unconventional computing using the Evolution in-Materio (EiM) paradigm. EiM uses an Evolutionary Algorithm to configure and exploit a physical material (or medium) for computation. While EiM processors show promise, fundamental questions and scaling issues remain. Additionally, their development is hindered by slow manufacturing and physical experimentation. This work addressed these issues by implementing simulated models to speed up research efforts, followed by investigations of physically implemented novel in-materio devices. Initial work leveraged simulated conductive networks as single substrate ‘monolithic’ EiM processors, performing classification by formulating the system as an optimisation problem, solved using Differential Evolution. Different material properties and algorithm parameters were isolated and investigated; which explained the capabilities of configurable parameters and showed ideal nanomaterial choice depended upon problem complexity. Subsequently, drawing from concepts in the wider Machine Learning field, several enhancements to monolithic EiM processors were proposed and investigated. These ensured more efficient use of training data, better classification decision boundary placement, an independently optimised readout layer, and a smoother search space. Finally, scalability and performance issues were addressed by constructing in-Materio Neural Networks (iM-NNs), where several EiM processors were stacked in parallel and operated as physical realisations of Hidden Layer neurons. Greater flexibility in system implementation was achieved by re-using a single physical substrate recursively as several virtual neurons, but this sacrificed faster parallelised execution. 
These novel iM-NNs were first implemented using Simulated in-Materio neurons, and trained for classification as Extreme Learning Machines, which were found to outperform artificial networks of a similar size. Physical iM-NN were then implemented using a Raspberry Pi, custom Hardware Interface and Lambda Diode based Physical in-Materio neurons, which were trained successfully with neuroevolution. A more complex AutoEncoder structure was then proposed and implemented physically to perform dimensionality reduction on a handwritten digits dataset, outperforming both Principal Component Analysis and artificial AutoEncoders. This work presents an approach to exploit systems with interesting physical dynamics, and leverage them as a computational resource. Such systems could become low power, high speed, unconventional computing assets in the future
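
The Extreme Learning Machine training mentioned in this abstract follows a simple recipe: the hidden layer is left fixed and only a linear readout is fitted by least squares. The sketch below illustrates that recipe in software only. The random `tanh` hidden units are an assumption standing in for the simulated or physical in-materio neurons of the thesis, and the function names, toy data, and hidden-layer size are hypothetical.

```python
import numpy as np


def train_elm(X, targets, n_hidden, rng):
    """Fit an Extreme Learning Machine: random hidden layer, least-squares readout.

    Assumption: tanh units with random weights replace the in-materio neurons.
    """
    W = rng.normal(size=(X.shape[1], n_hidden))  # fixed random input weights
    b = rng.normal(size=n_hidden)                # fixed random biases
    H = np.tanh(X @ W + b)                       # hidden-layer activations
    beta = np.linalg.pinv(H) @ targets           # readout via pseudo-inverse
    return W, b, beta


def predict_elm(X, W, b, beta):
    """Return the index of the largest readout output for each input row."""
    return np.argmax(np.tanh(X @ W + b) @ beta, axis=1)


rng = np.random.default_rng(0)

# XOR-style toy problem that is not linearly separable in the input space.
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
y = np.array([0, 1, 1, 0])
Y = np.eye(2)[y]  # one-hot targets

W, b, beta = train_elm(X, Y, n_hidden=16, rng=rng)
pred = predict_elm(X, W, b, beta)
print(pred)
```

Because only `beta` is learned, training reduces to a single linear solve, which is what makes the approach attractive when the hidden layer is a physical device that cannot be adjusted by gradient descent.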

    Combined optimization algorithms applied to pattern classification

    Accurate classification by minimizing the error on test samples is the main goal in pattern classification. Combinatorial optimization is a well-known method for solving minimization problems; however, only a few examples of classifiers are described in the literature where combinatorial optimization is used in pattern classification. Recently, there has been a growing interest in combining classifiers and improving the consensus of results for greater accuracy. In the light of the "No Free Lunch Theorems", we analyse the combination of simulated annealing, a powerful combinatorial optimization method that produces high quality results, with the classical perceptron algorithm. This combination is called the LSA machine. Our analysis aims at finding paradigms for problem-dependent parameter settings that ensure high classification results. Our computational experiments on a large number of benchmark problems lead to results that either outperform or are at least competitive with results published in the literature. Apart from parameter settings, our analysis focuses on a difficult problem in computation theory, namely the network complexity problem. The depth vs size problem of neural networks is one of the hardest problems in theoretical computing, with very little progress over the past decades. In order to investigate this problem, we introduce a new recursive learning method for training hidden layers in constant depth circuits. Our findings make contributions to a) the field of Machine Learning, as the proposed method is applicable in training feedforward neural networks, and to b) the field of circuit complexity by proposing an upper bound for the number of hidden units sufficient to achieve a high classification rate. 
One of the major findings of our research is that the size of the network can be bounded by the input size of the problem and an approximate upper bound of 8·√(2^n/n) threshold gates as being sufficient for a small error rate, where n := log|S_L| and S_L is the training set

    INVESTIGATION OF ORTHOGONAL POLYNOMIAL KERNELS AS SIMILARITY FUNCTIONS FOR PATTERN CLASSIFICATION BY SUPPORT VECTOR MACHINES

    A kernel function is an important component in the support vector machine (SVM) kernel-based classifier. This is due to the elegant mathematical characteristics of a kernel, which amount to the mapping of non-linearly separable classes to an implicit higher-dimensional feature space where they can become linearly separable, and hence easier to classify. Such characteristics are those prescribed by the underpinning positive semi-definite (PSD) property. The properties of this feature space can, however, be difficult to interpret, to customize or select an appropriate kernel for the classification task at hand. Moreover, the high-dimensionality of the feature space does not usually provide apparent and intuitive information about the natural representations of the data in the input space, as the construction of this feature space is only implicit. On the other hand, SVM kernels have also been regarded as similarity functions in many contexts to measure the resemblance between two patterns, which can be from the same or different classes. However, despite the elegant theory of PSD kernels, and its remarkable implications on the performance of many learning algorithms, limited research efforts seem to have studied kernels from this similarity perspective. Given that patterns from the same class share more similar characteristics than those belonging to different classes, this similarity perspective can therefore provide more tangible means to craft or select appropriate kernels than the properties of the implicit high-dimensional feature spaces that one might not even be able to calculate. This thesis therefore aims to: (i) investigate the similarity-based properties, which can be exploited to characterise kernels (with focus on the so-called “orthogonal polynomial kernels”) when used as similarity functions, and (ii) assess the influence of these properties on the performance of the SVM classifier. 
An appropriate similarity-based model is therefore defined in the thesis based on what the shape of an SVM kernel should ideally look like when used to measure the similarity between its two inputs. The model proposes that the similarity curve should be maximized when the two kernel inputs are identical, and it should decay monotonically as they differ more and more from each other. Motivated by the pictorial characteristics of the Chebyshev kernels reported in the literature, the thesis adopts this kernel-shape perspective to also study some other orthogonal polynomial kernels (such as the Legendre kernels and Hermite kernels), to underpin the assessment of the proposed ideal shape of the similarity curve for kernel-based pattern classification by SVMs. The analysis of these polynomial kernels revealed that they are naturally constructed from smaller kernel building blocks, which are combined by summation and multiplication operations. A novel similarity fusion framework is therefore developed in this thesis to investigate the effect of these fusion operations on the shape characteristics of the kernels and on their classification performance. This framework is developed in three stages, where Stage 1 kernels are those building blocks constructed from only the polynomial order n (the highest order under consideration), whereas Stage 2 kernels combine all the Stage 1 kernel blocks (from order 0 to n) using a summation fusion operation. The Stage 3 kernels finally combine Stage 2 kernels with another kernel via a multiplication fusion operation. The analysis of the shape characteristics of these three-stage polynomial kernels revealed that their inherent fusion operations are synergistic in nature, as they bring their shapes closer to the ideal similarity function model, and hence enable the calculation of more accurate similarity measures, and accordingly score better classification performance. 
Experimental results showed that these summative and multiplicative fusion operations improved the classification accuracy by average factors of 17.35% and 19.16%, respectively, depending on the dataset and the polynomial function employed. On the other hand, the shapes of the Stage 2 polynomial kernels have also been shown to oscillate after a certain threshold within the standard normalized input space of [-1,1]. A simple adaptive data normalization approach is therefore proposed to confine the data to the threshold window where these kernels exhibit the sought-after ideal shape characteristics, hence eliminating the possibility of any data point being located within the range where these oscillations are observed. The implementation of the adaptive data normalization approach accordingly leads to a more accurate calculation of similarity measures and improves the classification performance. When compared to the standard normalized input space, experimental results (performed on the Stage 2 kernels) demonstrate the effectiveness of the proposed adaptive data normalization approach, with an average accuracy improvement factor of 11.772%, depending on the dataset and the polynomial function utilized. Finally, a new perspective is also introduced whereby the utilization of orthogonal polynomials is perceived as a way of transforming the input space to another vector space, of the same dimensionality as the input space, prior to the kernel calculation step. Based on this perspective, a novel processing approach, based on vector concatenation, is proposed which, unlike the previous approaches, ensures that the quantities processed by each polynomial order are always formulated in vector form. This way, the attributes embedded in the structure of the original vectors are maintained intact. 
The proposed concatenated processing approach can also be used with any polynomial function, regardless of the parity combination of its monomials, whether they are only odd, only even, or a combination of both. Moreover, the Gaussian kernel is also proposed to be evaluated on vectors processed by the polynomial kernels (instead of the linear kernel used in the previous approaches), due to the more accurate similarity shape characteristics of the Gaussian kernel, as well as its renowned ability to implicitly map the input space to a feature space of higher dimensionality. Experimental results demonstrate the superiority of the concatenated approach for all the three polynomial-kernel stages of the developed similarity fusion framework and for all the polynomial functions under investigation. When the Gaussian kernel is evaluated on the vectors processed using the concatenated approach, the observed results show a statistically significant improvement in the average classification accuracy of 22.269%, compared to when the linear kernel is evaluated on the vectors processed using the previously proposed approaches
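
The three-stage fusion framework described in this abstract can be illustrated for scalar inputs with Chebyshev polynomials. The sketch below is an assumption-laden reading of the abstract, not the thesis's formulation: the abstract gives no formulas, so the Stage 1 block is taken to be the product T_n(x)·T_n(z), Stage 2 the sum of those blocks over orders 0 to n, and the "other kernel" in Stage 3 is arbitrarily chosen to be a Gaussian with a hypothetical `gamma` parameter.

```python
import numpy as np
from numpy.polynomial import chebyshev


def cheb_T(order, x):
    """Evaluate the Chebyshev polynomial of the first kind T_order at x."""
    coeffs = np.zeros(order + 1)
    coeffs[order] = 1.0  # select only the T_order term of the series
    return chebyshev.chebval(x, coeffs)


def stage1_kernel(x, z, order):
    """Building block using a single polynomial order (assumed form)."""
    return cheb_T(order, x) * cheb_T(order, z)


def stage2_kernel(x, z, order):
    """Summation fusion of the Stage 1 blocks from order 0 up to `order`."""
    return sum(stage1_kernel(x, z, i) for i in range(order + 1))


def stage3_kernel(x, z, order, gamma=1.0):
    """Multiplication fusion of Stage 2 with another kernel.

    Assumption: a Gaussian is used as the second kernel purely for illustration.
    """
    return stage2_kernel(x, z, order) * np.exp(-gamma * (x - z) ** 2)


# Similarity curve: fix one input and sweep the other over [-1, 1].
x_fixed = 0.5
z_values = np.linspace(-1.0, 1.0, 9)
for z in z_values:
    print(f"z={z:+.2f}  stage2={stage2_kernel(x_fixed, z, 2):+.3f}  "
          f"stage3={stage3_kernel(x_fixed, z, 2):+.3f}")
```

Printing the curve for a fixed `x_fixed` is one way to inspect how closely each stage follows the ideal shape the abstract describes, with a peak where the two inputs coincide and decay as they move apart.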

    Analysing functional genomics data using novel ensemble, consensus and data fusion techniques

    Motivation: Rapid technological development in the biosciences and in computer science in the last decade has enabled the analysis of high-dimensional biological datasets on standard desktop computers. However, in spite of these technical advances, common properties of the new high-throughput experimental data, like small sample sizes in relation to the number of features, high noise levels and outliers, also pose novel challenges. Ensemble and consensus machine learning techniques and data integration methods can alleviate these issues, but often provide overly complex models which lack generalization capability and interpretability. The goal of this thesis was therefore to develop new approaches to combine algorithms and large-scale biological datasets, including novel approaches to integrate analysis types from different domains (e.g. statistics, topological network analysis, machine learning and text mining), to exploit their synergies in a manner that provides compact and interpretable models for inferring new biological knowledge. Main results: The main contributions of the doctoral project are new ensemble, consensus and cross-domain bioinformatics algorithms, and new analysis pipelines combining these techniques within a general framework. This framework is designed to enable the integrative analysis of both large-scale gene and protein expression data (including the tools ArrayMining, Top-scoring pathway pairs and RNAnalyze) and general gene and protein sets (including the tools TopoGSA, EnrichNet and PathExpand), by combining algorithms for different statistical learning tasks (feature selection, classification and clustering) in a modular fashion. Ensemble and consensus analysis techniques employed within the modules are redesigned such that the compactness and interpretability of the resulting models are optimized in addition to the predictive accuracy and robustness. 
The framework was applied to real-world biomedical problems, with a focus on cancer biology, providing the following main results: (1) The identification of a novel tumour marker gene in collaboration with the Nottingham Queens Medical Centre, facilitating the distinction between two clinically important breast cancer subtypes (framework tool: ArrayMining); (2) The prediction of novel candidate disease genes for Alzheimer’s disease and pancreatic cancer using an integrative analysis of cellular pathway definitions and protein interaction data (framework tool: PathExpand, collaboration with the Spanish National Cancer Centre); (3) The prioritization of associations between disease-related processes and other cellular pathways using a new rule-based classification method integrating gene expression data and pathway definitions (framework tool: Top-scoring pathway pairs); (4) The discovery of topological similarities between differentially expressed genes in cancers and cellular pathway definitions mapped to a molecular interaction network (framework tool: TopoGSA, collaboration with the Spanish National Cancer Centre). In summary, the framework combines the synergies of multiple cross-domain analysis techniques within a single, easy-to-use software package and has provided new biological insights in a wide variety of practical settings