
    Geometric generalisation of surrogate model-based optimisation to combinatorial and program spaces

    Surrogate models (SMs) can profitably be employed, often in conjunction with evolutionary algorithms, in optimisation in which it is expensive to test candidate solutions. The spatial intuition behind SMs makes them naturally suited to continuous problems, and the only combinatorial problems that have been previously addressed are those with solutions that can be encoded as integer vectors. We show how radial basis functions can provide a generalised SM for combinatorial problems which have a geometric solution representation, through the conversion of that representation to a different metric space. This approach allows an SM to be cast in a natural way for the problem at hand, without ad hoc adaptation to a specific representation. We test this adaptation process on problems involving binary strings, permutations, and tree-based genetic programs. © 2014 Yong-Hyuk Kim et al.
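
    A minimal sketch of the underlying idea, not the authors' implementation: a radial basis function surrogate defined directly on a solution metric (here, Kendall-tau distance between permutations, chosen as an assumption for illustration), so the model never needs a vector encoding of the combinatorial solutions.

```python
# Sketch: an RBF surrogate over permutations, using Kendall-tau distance as the
# metric. Illustration only; the distance and kernel are assumptions.
import itertools
import numpy as np

def kendall_tau_distance(p, q):
    """Number of pairwise disagreements between two permutations."""
    pos_q = {v: i for i, v in enumerate(q)}
    return sum(1 for i, j in itertools.combinations(range(len(p)), 2)
               if pos_q[p[i]] > pos_q[p[j]])

def fit_rbf_surrogate(X, y, gamma=0.5):
    """Fit RBF interpolation weights from sampled permutations X and costs y."""
    D = np.array([[kendall_tau_distance(a, b) for b in X] for a in X], float)
    K = np.exp(-gamma * D)                     # Gaussian kernel on the metric
    return np.linalg.solve(K + 1e-8 * np.eye(len(X)), y)

def predict(X, w, x_new, gamma=0.5):
    k = np.exp(-gamma * np.array([kendall_tau_distance(x_new, b) for b in X]))
    return float(k @ w)

# Example: surrogate for a toy permutation cost (distance to the identity)
rng = np.random.default_rng(0)
X = [tuple(rng.permutation(6)) for _ in range(20)]
y = np.array([kendall_tau_distance(p, tuple(range(6))) for p in X], float)
w = fit_rbf_surrogate(X, y)
print(predict(X, w, tuple(rng.permutation(6))))
```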

    Combined optimization algorithms applied to pattern classification

    Accurate classification by minimizing the error on test samples is the main goal in pattern classification. Combinatorial optimization is a well-known method for solving minimization problems; however, only a few examples of classifiers are described in the literature where combinatorial optimization is used in pattern classification. Recently, there has been a growing interest in combining classifiers and improving the consensus of results for greater accuracy. In the light of the "No Free Lunch Theorems", we analyse the combination of simulated annealing, a powerful combinatorial optimization method that produces high-quality results, with the classical perceptron algorithm. This combination is called the LSA machine. Our analysis aims at finding paradigms for problem-dependent parameter settings that ensure high classification results. Our computational experiments on a large number of benchmark problems lead to results that either outperform or are at least competitive with results published in the literature. Apart from parameter settings, our analysis focuses on a difficult problem in computation theory, namely the network complexity problem. The depth vs size problem of neural networks is one of the hardest problems in theoretical computing, with very little progress over the past decades. In order to investigate this problem, we introduce a new recursive learning method for training hidden layers in constant-depth circuits. Our findings make contributions to a) the field of Machine Learning, as the proposed method is applicable in training feedforward neural networks, and to b) the field of circuit complexity, by proposing an upper bound for the number of hidden units sufficient to achieve a high classification rate. One of the major findings of our research is that the size of the network can be bounded by the input size of the problem, with an approximate upper bound of 8 + √(2^n/n) threshold gates being sufficient for a small error rate, where n := log |SL| and SL is the training set.
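
    The LSA machine itself is specified in the thesis; the following is only a rough illustration of the general idea of driving a linear threshold classifier with simulated annealing. The neighbourhood move, cooling schedule and toy data are assumptions, not the thesis's parameter settings.

```python
# Sketch: simulated annealing over the weights of a linear threshold unit,
# minimising training error. Illustrates the combination idea only.
import numpy as np

def train_error(w, X, y):
    return np.mean(np.sign(X @ w) != y)

def anneal_perceptron(X, y, T0=1.0, alpha=0.995, steps=5000, seed=0):
    rng = np.random.default_rng(seed)
    w = rng.normal(size=X.shape[1])
    best_w, best_e = w.copy(), train_error(w, X, y)
    T = T0
    for _ in range(steps):
        cand = w + rng.normal(scale=0.1, size=w.shape)   # random neighbour
        de = train_error(cand, X, y) - train_error(w, X, y)
        if de <= 0 or rng.random() < np.exp(-de / T):    # Metropolis acceptance
            w = cand
            if train_error(w, X, y) < best_e:
                best_w, best_e = w.copy(), train_error(w, X, y)
        T *= alpha                                       # geometric cooling
    return best_w, best_e

# Toy linearly separable data with a bias column
rng = np.random.default_rng(1)
X = np.c_[rng.normal(size=(200, 2)), np.ones(200)]
y = np.sign(X[:, 0] + 0.5 * X[:, 1] + 0.1)
w, err = anneal_perceptron(X, y)
print("training error:", err)
```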

    Artificial Intelligence Based Classification for Urban Surface Water Modelling

    Estimations and predictions of surface water runoff can provide very useful insights regarding flood risks in urban areas. To automatically predict the flow behaviour of rainfall-runoff water in real-world satellite images, it is important to precisely identify permeable and impermeable areas. This identification helps to calculate the amount of surface water, by taking into account the amount of water absorbed in permeable areas and what remains on impermeable areas. In this research, a model of surface water has been established to predict the behavioural flow of rainfall-runoff water. This study employs a combination of image processing, artificial intelligence and machine learning techniques for automatic segmentation and classification of permeable and impermeable areas in satellite images. These techniques investigate image classification approaches for classifying three land-use categories (roofs, roads, and pervious areas) commonly found in satellite images of the earth's surface. Three different classification scenarios are investigated to select the best classification model. The first scenario involves pixel-by-pixel classification of images using Classification Tree and Random Forest classification techniques, in two different settings of sequential and parallel execution of the algorithms. In the second classification scenario, the image is divided into objects using the Superpixels (SLIC) segmentation method, and three kinds of feature sets are extracted from the segmented objects. The performance of eight different supervised machine learning classifiers is probed using 5-fold cross-validation for multiple SLIC values, and detailed performance comparisons lead to conclusions about classification into the different classes under object-based and pixel-based classification schemes. Pareto analysis and knee-point selection are used to select the SLIC value and the more suitable of these two types of classification. Furthermore, a new diversity- and weighted-sum-based ensemble classification model, called ParetoEnsemble, is proposed in this classification scenario. Weights are applied to selected component classifiers of an ensemble to create a strong classifier, where classification is based on multiple votes from the candidate classifiers of the ensemble, as opposed to individual classifiers, where classification is based on a single vote from only one classifier. Unbalanced and balanced data-based classification results are also evaluated, to determine the most suitable mode for satellite image classification in this study. Convolutional Neural Networks based on semantic segmentation are employed in the classification phase as a third scenario, to evaluate the strength of the deep learning model SegNet in the classification of satellite imagery. The best results from the three classification scenarios are compared, and the best classification method among the three is used in the next phase of water modelling, with the InfoWorks ICM software, to explore the potential of the modelling process for a partially automated surface water network. Using these parameter settings, with a specified amount of simulated rain falling onto the imaged area, the amount of surface water flow is estimated, to obtain predictions about runoff situations in urban areas, since runoff in such situations can be high enough to pose a dangerous flood risk.
The area of Feock, in Cornwall, is used as the simulation area of study in this research, where some promising results have been derived regarding classification and modelling of runoff. The correlation coefficient estimated between classification and runoff accuracy provides useful insight regarding the dependence of runoff performance on classification performance. The trained system was also tested on images of some unknown areas, demonstrating reasonable performance considering the training and classification limitations and conditions. Furthermore, in these unknown-area images, reasonable estimations were derived regarding surface water runoff. An analysis of unbalanced and balanced data-based classification and runoff estimations, for multiple parameter configurations, aids the selection of classification and modelling parameter values to be used in future predictions on unknown data. This research is founded on the incorporation of satellite imaging into water modelling, using selective images for analysis and assessment of results. The system can be further improved, and runoff predictions of higher precision achieved, by adding more high-resolution images to the classifiers' training. The added variety in the trained model can lead to even better classification of any unknown image, which could eventually provide better modelling and better insights into surface water modelling. Moreover, the modelling phase can be extended in future research to deal with real-time parameters, by calibrating the model after the classification phase, in order to observe the impact of classification on the actual calibration.
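
    The second classification scenario described above (object-based classification of SLIC superpixels) can be illustrated with off-the-shelf tools. The sketch below uses scikit-image and scikit-learn with a stand-in image, mean-colour features and random labels as placeholders, so it shows the workflow only, not the feature sets, ensemble or data used in the study.

```python
# Sketch: SLIC superpixels + Random Forest, object-based classification.
import numpy as np
from skimage import data
from skimage.segmentation import slic
from sklearn.ensemble import RandomForestClassifier

image = data.astronaut()                       # stand-in for a satellite tile
segments = slic(image, n_segments=200, compactness=10, start_label=0)

# One feature vector per superpixel: mean colour of its pixels
n_seg = segments.max() + 1
features = np.array([image[segments == s].mean(axis=0) for s in range(n_seg)])

# Hypothetical labels (e.g. roof / road / pervious) would come from training data
labels = np.random.default_rng(0).integers(0, 3, size=n_seg)

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(features, labels)
predicted = clf.predict(features)              # classify each superpixel

# Paint per-superpixel predictions back onto the pixel grid
classified = predicted[segments]
print(classified.shape, np.unique(classified))
```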

    Probabilistic Mapping and Spatial Pattern Analysis of Grazing Lawns in Southern African Savannahs Using WorldView-3 Imagery and Machine Learning Techniques

    Savannah grazing lawns are a key food resource for large herbivores such as blue wildebeest (Connochaetes taurinus), hippopotamus (Hippopotamus amphibius) and white rhino (Ceratotherium simum), and impact herbivore densities, movement and recruitment rates. They also exert a strong influence on fire behaviour including frequency, intensity and spread. Thus, variation in grazing lawn cover can have a profound impact on broader savannah ecosystem dynamics. However, knowledge of their present cover and distribution is limited. Importantly, we lack a robust, broad-scale approach for detecting and monitoring grazing lawns, which is critical to enhancing understanding of the ecology of these vital grassland systems. We selected two sites in the Lower Sabie and Satara regions of Kruger National Park, South Africa with mesic and semiarid conditions, respectively. Using spectral and texture features derived from WorldView-3 imagery, we (i) parameterised and assessed the quality of Random Forest (RF), Support Vector Machines (SVM), Classification and Regression Trees (CART) and Multilayer Perceptron (MLP) models for general discrimination of plant functional types (PFTs) within a sub-area of the Lower Sabie landscape, and (ii) compared model performance for probabilistic mapping of grazing lawns in the broader Lower Sabie and Satara landscapes. Further, we used spatial metrics to analyse spatial patterns in grazing lawn distribution in both landscapes along a gradient of distance from waterbodies. All machine learning models achieved high F-scores (F1) and overall accuracy (OA) scores in general savannah PFTs classification, with RF (F1 = 95.73±0.004%, OA = 94.16±0.004%), SVM (F1 = 95.64±0.002%, OA = 94.02±0.002%) and MLP (F1 = 95.71±0.003%, OA = 94.27±0.003%) forming a cluster of the better performing models and marginally outperforming CART (F1 = 92.74±0.006%, OA = 90.93±0.003%). Grazing lawn detection accuracy followed a similar trend within the Lower Sabie landscape, with RF, SVM, MLP and CART achieving F-scores of 0.89, 0.93, 0.94 and 0.81, respectively. Transferring models to the Satara landscape however resulted in relatively lower but high grazing lawn detection accuracies across models (RF = 0.87, SVM = 0.88, MLP = 0.85 and CART = 0.75). Results from spatial pattern analysis revealed a relatively higher proportion of grazing lawn cover under semiarid savannah conditions (Satara) compared to the mesic savannah landscape (Lower Sabie). Additionally, the results show strong negative correlation between grazing lawn spatial structure (fractional cover, patch size and connectivity) and distance from waterbodies, with larger and contiguous grazing lawn patches occurring in close proximity to waterbodies in both landscapes. The proposed machine learning approach provides a novel and robust workflow for accurate and consistent landscape-scale monitoring of grazing lawns, while our findings and research outputs provide timely information critical for understanding habitat heterogeneity in southern African savannah
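
    As an illustration of the model comparison described above, the sketch below evaluates scikit-learn stand-ins for RF, SVM, CART and MLP with 5-fold cross-validated macro F1 and then produces per-class probabilities for probabilistic mapping. The synthetic data is a placeholder for the WorldView-3 spectral and texture features.

```python
# Sketch: cross-validated comparison of RF, SVM, CART and MLP classifiers.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import cross_val_score

# Placeholder features/labels standing in for spectral and texture features
X, y = make_classification(n_samples=600, n_features=12, n_classes=3,
                           n_informative=6, random_state=0)

models = {
    "RF":   RandomForestClassifier(n_estimators=200, random_state=0),
    "SVM":  SVC(kernel="rbf", probability=True, random_state=0),
    "CART": DecisionTreeClassifier(random_state=0),
    "MLP":  MLPClassifier(hidden_layer_sizes=(64,), max_iter=1000, random_state=0),
}

for name, model in models.items():
    f1 = cross_val_score(model, X, y, cv=5, scoring="f1_macro")
    print(f"{name}: F1 = {f1.mean():.3f} +/- {f1.std():.3f}")

# Probabilistic mapping then uses per-class membership probabilities
best = models["RF"].fit(X, y)
print(best.predict_proba(X[:5]).round(2))
```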

    Building well-performing classifier ensembles: model and decision level combination.

    There is a continuing drive for better, more robust generalisation performance from classification systems, and prediction systems in general. Ensemble methods, or the combining of multiple classifiers, have become an accepted and successful tool for doing this, though the reasons for success are not always entirely understood. In this thesis, we review the multiple classifier literature and consider the properties an ensemble of classifiers - or collection of subsets - should have in order to be combined successfully. We find that the framework of Stochastic Discrimination provides a well-defined account of these properties, which are shown to be strongly encouraged in a number of the most popular/successful methods in the literature via differing algorithmic devices. This uncovers some interesting and basic links between these methods, and aids understanding of their success and operation in terms of a kernel induced on the training data, with a form particularly well suited to classification. One property that is desirable both in the SD framework and, via the ambiguity decomposition of the error, in a regression context is the de-correlation of individuals. This motivates the introduction of the Negative Correlation Learning method, in which neural networks are trained in parallel in a way designed to encourage de-correlation of the individual networks. The training is controlled by a parameter λ governing the extent to which correlations are penalised. Theoretical analysis of the dynamics of training results in an exact expression for the interval in which we can choose λ while ensuring stability of the training, and a value λ∗ for which the training has some interesting optimality properties. These values depend only on the size N of the ensemble. Decision-level combination methods often result in a difficult-to-interpret model, and NCL is no exception. However, in some applications there is a need for understandable decisions and interpretable models. In response to this, we depart from the standard decision-level combination paradigm to introduce a number of model-level combination methods. As decision trees are one of the most interpretable model structures used in classification, we chose to combine structure from multiple individual trees to build a single combined model. We show that extremely compact, well-performing models can be built in this way. In particular, a generalisation of bottom-up pruning to a multiple-tree context produces good results in this regard. Finally, we develop a classification system for a real-world churn prediction problem, illustrating some of the concepts introduced in the thesis, and a number of more practical considerations which are of importance when developing a prediction system for a specific problem.
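
    A minimal sketch of the Negative Correlation Learning idea discussed above, using an ensemble of linear regressors trained in parallel rather than neural networks, with the penalty strength λ fixed by hand (the thesis derives the admissible interval and the optimal λ∗, which depend only on the ensemble size).

```python
# Sketch of Negative Correlation Learning: each member's gradient follows
# (f_i - y) - lambda_ * (f_i - f_bar), encouraging de-correlated errors.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + 0.1 * rng.normal(size=200)

M, lr, lambda_ = 5, 0.01, 0.5            # ensemble size, step size, NCL penalty
W = rng.normal(size=(M, X.shape[1]))     # one weight vector per member

for epoch in range(200):
    F = X @ W.T                          # (n_samples, M) individual outputs
    f_bar = F.mean(axis=1, keepdims=True)
    for i in range(M):
        err = F[:, [i]] - y[:, None]
        penalty = F[:, [i]] - f_bar
        grad = ((err - lambda_ * penalty) * X).mean(axis=0)   # NCL gradient
        W[i] -= lr * grad

ensemble_pred = (X @ W.T).mean(axis=1)
print("ensemble MSE:", np.mean((ensemble_pred - y) ** 2))
```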

    Genetic Programming for Classification with Unbalanced Data

    In classification, machine learning algorithms can suffer a performance bias when data sets are unbalanced. Binary data sets are unbalanced when one class is represented by only a small number of training examples (called the minority class), while the other class makes up the rest (majority class). In this scenario, the induced classifiers typically have high accuracy on the majority class but poor accuracy on the minority class. As the minority class typically represents the main class-of-interest in many real-world problems, accurately classifying examples from this class can be at least as important as, and in some cases more important than, accurately classifying examples from the majority class. Genetic Programming (GP) is a promising machine learning technique based on the principles of Darwinian evolution to automatically evolve computer programs to solve problems. While GP has shown much success in evolving reliable and accurate classifiers for typical classification tasks with balanced data, GP, like many other learning algorithms, can evolve biased classifiers when data is unbalanced. This is because traditional training criteria, such as the overall success rate in the fitness function in GP, can be influenced by the larger number of examples from the majority class. This thesis proposes a GP approach to classification with unbalanced data. The goal is to develop new internal cost-adjustment techniques in GP to improve classification performance on both the minority class and the majority class. By focusing on internal cost-adjustment within GP rather than traditional data-balancing techniques, the unbalanced data can be used directly or "as is" in the learning process. This removes any dependence on a sampling algorithm to first artificially re-balance the input data prior to the learning process. This thesis shows that by developing a number of new methods in GP, genetic program classifiers with good classification ability on the minority and the majority classes can be evolved. This thesis evaluates these methods on a range of binary benchmark classification tasks with unbalanced data. This thesis demonstrates that, unlike tasks with multiple balanced classes where some dynamic (non-static) classification strategies perform significantly better than the simple static classification strategy, the choice between a static and a dynamic strategy makes no significant difference to the performance of evolved GP classifiers on these binary tasks. For this reason, the rest of the thesis uses this static classification strategy. This thesis proposes several new fitness functions in GP to perform cost adjustment between the minority and the majority classes, allowing the unbalanced data sets to be used directly in the learning process without sampling. Using the area under the Receiver Operating Characteristic (ROC) curve (also known as the AUC) to measure how well a classifier performs on the minority and majority classes, these new fitness functions find genetic program classifiers with high AUC on both classes across the tasks, and with fast GP training times. These GP methods outperform two popular learning algorithms, namely Naive Bayes and Support Vector Machines, on the tasks, particularly when the level of class imbalance is large, where both algorithms show biased classification performance. This thesis also proposes a multi-objective GP (MOGP) approach which treats the accuracies of the minority and majority classes separately in the learning process.
The MOGP approach evolves a good set of trade-off solutions (a Pareto front) in a single run that perform as well as, and in some cases better than, multiple runs of canonical single-objective GP (SGP). In SGP, individual genetic program solutions capture the performance trade-off between the two objectives (minority and majority class accuracy) using an ROC curve; whereas in MOGP, this requirement is delegated to multiple genetic program solutions along the Pareto front. This thesis also shows how multiple Pareto front classifiers can be combined into an ensemble where individual members vote on the class label. Two ensemble diversity measures are developed in the fitness functions which treat the diversity on both the minority and the majority classes as equally important; otherwise, these measures risk being biased toward the majority class. The evolved ensembles outperform their individual members on the tasks due to good cooperation between members. This thesis further improves the ensemble performances by developing a GP approach to ensemble selection, to quickly find small groups of individuals that cooperate very well together in the ensemble. The pruned ensembles use much fewer individuals to achieve performances that are as good as larger (unpruned) ensembles, particularly on tasks with high levels of class imbalance, thereby reducing the total time to evaluate the ensemble
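
    A small sketch of the kind of cost-adjusted fitness measure discussed above: a candidate program is scored by the mean of minority- and majority-class accuracy, or by the AUC, instead of the overall success rate. The GP machinery itself is omitted, and the toy data and candidate program are assumptions for illustration.

```python
# Sketch: class-balanced and AUC-based fitness for a candidate classifier
# `program`, any callable returning a real-valued output thresholded at zero.
import numpy as np
from sklearn.metrics import roc_auc_score

def balanced_fitness(program, X, y):
    """Mean of per-class accuracies; 1.0 is perfect, 0.5 is chance."""
    out = np.array([program(x) for x in X])
    pred = (out >= 0).astype(int)
    acc_min = np.mean(pred[y == 1] == 1)     # minority-class accuracy
    acc_maj = np.mean(pred[y == 0] == 0)     # majority-class accuracy
    return 0.5 * (acc_min + acc_maj)

def auc_fitness(program, X, y):
    out = np.array([program(x) for x in X])
    return roc_auc_score(y, out)             # threshold-free measure

# Toy unbalanced data (about 10% minority) and a trivial candidate "program"
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
y = (rng.random(1000) < 0.1).astype(int)
X[y == 1] += 1.5
program = lambda x: x[0] + x[1] - 1.0
print(balanced_fitness(program, X, y), auc_fitness(program, X, y))
```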

    Boosting en el modelo de aprendizaje PAC

    A review of the idea of Boosting in the PAC learning model is presented. A review of the first practical Boosting method, adaptive boosting (AdaBoost), is also provided, giving details concerning the theoretical guarantees on error convergence and exploring the important concept of margin.
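
    A minimal AdaBoost sketch with decision stumps, illustrating the example-weight update and the margin quantity discussed in the review; it is a generic textbook-style implementation, not code from the reviewed work.

```python
# Sketch: AdaBoost with decision stumps; labels are in {-1, +1}.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost(X, y, n_rounds=20):
    n = len(y)
    w = np.full(n, 1.0 / n)                     # example weights D_t
    stumps, alphas = [], []
    for _ in range(n_rounds):
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(X, y, sample_weight=w)
        pred = stump.predict(X)
        eps = np.sum(w[pred != y])              # weighted error
        alpha = 0.5 * np.log((1 - eps) / max(eps, 1e-10))
        w *= np.exp(-alpha * y * pred)          # up-weight mistakes
        w /= w.sum()
        stumps.append(stump)
        alphas.append(alpha)
    return stumps, np.array(alphas)

def margin(stumps, alphas, X, y):
    """Normalised margin y * f(x) / sum(alpha); positive means correct."""
    F = sum(a * s.predict(X) for s, a in zip(stumps, alphas))
    return y * F / alphas.sum()

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2))
y = np.where(X[:, 0] * X[:, 1] > 0, 1, -1)      # non-linearly separable toy task
stumps, alphas = adaboost(X, y)
print("min margin:", margin(stumps, alphas, X, y).min())
```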

    Data mining techniques for protein sequence analysis

    This thesis concerns two areas of bioinformatics related by their role in protein structure and function: protein structure prediction and post-translational modification of proteins. The dihedral angles Ψ and Φ are predicted using support vector regression. For the prediction of Ψ dihedral angles, the addition of structural information is examined, as is the normalisation of the Ψ and Φ dihedral angles. An application of the dihedral angles is investigated. The relationship between dihedral angles and three-bond J couplings determined from NMR experiments is described by the Karplus equation. We investigate the determination of the correct solution of the Karplus equation using predicted Φ dihedral angles. Glycosylation is an important post-translational modification of proteins involved in many different facets of biology. The work here investigates the prediction of N-linked and O-linked glycosylation sites using the random forest machine learning algorithm and pairwise patterns in the data. This methodology produces more accurate results when compared to state-of-the-art prediction methods. The black-box nature of random forest is addressed by using the Trepan algorithm to generate a decision tree with comprehensible rules that represents the decision-making process of random forest. The prediction of our program GPP does not distinguish between glycans at a given glycosylation site. We use farthest-first clustering, with the idea of classifying each glycosylation site by the sugar linking the glycan to the protein. This thesis demonstrates the prediction of protein backbone torsion angles and improves the current state of the art for the prediction of glycosylation sites. It also investigates potential applications and the interpretation of these methods.
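
    A short sketch of the Karplus-equation step described above: given an observed three-bond coupling, the equation admits several Φ solutions, and a predicted Φ can be used to pick among them. The coefficients are common literature values for ³J(HN-Hα) and are assumptions, not necessarily those used in the thesis.

```python
# Sketch: resolving the multivalued Karplus equation
#   3J(phi) = A*cos^2(phi - 60) + B*cos(phi - 60) + C
# using a predicted phi dihedral angle. Coefficients are assumed values.
import numpy as np

A, B, C = 6.51, -1.76, 1.60                      # common 3J(HN-HA) parameters

def karplus(phi_deg):
    theta = np.radians(phi_deg - 60.0)
    return A * np.cos(theta) ** 2 + B * np.cos(theta) + C

def karplus_solutions(j_obs, tol=0.05):
    """All phi values on a 0.1-degree grid whose predicted J matches j_obs."""
    grid = np.arange(-180.0, 180.0, 0.1)
    close = np.abs(karplus(grid) - j_obs) < tol  # grid points near a root
    return grid[close]

def resolve(j_obs, phi_predicted):
    """Choose the Karplus solution closest to the predicted phi."""
    candidates = karplus_solutions(j_obs)
    return candidates[np.argmin(np.abs(candidates - phi_predicted))]

print(resolve(j_obs=karplus(-65.0), phi_predicted=-70.0))   # ~ -65 degrees
```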

    Computational models and approaches for lung cancer diagnosis

    The success of treatment of patients with cancer depends on establishing an accurate diagnosis. To this end, the aim of this study is to develop novel lung cancer diagnostic models. New algorithms are proposed to analyse the biological data and extract knowledge that assists in achieving accurate diagnosis results.