88 research outputs found

    Novel Pattern Recognition Approaches to Identification of Gene-Expression Pathways in Banana Cultivars

    Bolstered resubstitution is a simple and fast error estimation method that has been shown to perform better than cross-validation and comparably with the bootstrap in small-sample settings. However, its performance can deteriorate in high-dimensional feature spaces. To overcome this issue, we propose a modification of bolstered error estimation based on the principle of Naive Bayes. This estimator is simple to compute and is reducible under feature selection. In experiments using popular classification rules applied to data from a well-known breast cancer gene expression study, the new Naive-Bayes bolstered estimator outperformed the original one, as well as cross-validation and resubstitution, in high-dimensional target feature spaces (after feature selection); it was also superior to the 0.632 bootstrap provided that the sample size was not too small. Model selection is the task of choosing a model of optimal complexity for the given data set. Most model selection criteria minimize the sum of a training error term and a complexity control term, that is, they minimize a complexity-penalized loss. We investigate replacing the training error with bolstered resubstitution in the penalized loss for model selection. Computer simulations indicate that the proposed method improves model selection in terms of choosing the correct model complexity. Besides applying novel error estimation to model selection in pattern recognition, we also apply it to assess the performance of classifiers designed on the banana gene-expression data. Bananas are among the world's most important fruit crops and a vital component of local diets in many countries. Diseases and drought are major threats to banana production. To generate disease- and drought-tolerant bananas, we need to identify disease- and drought-responsive genes and pathways. Towards this goal, we conducted RNA-Seq analysis with wild-type and transgenic banana, with and without inoculation/drought stress, and on different days after applying the stress. By combining several state-of-the-art computational models, we identified stress-responsive genes and pathways. Validation of these genes in Arabidopsis has yielded promising results.
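
    For readers unfamiliar with bolstered resubstitution, the following is a minimal Monte Carlo sketch of the basic (non-Naive-Bayes) estimator, assuming a fitted scikit-learn-style classifier with a predict method; the spherical-Gaussian kernel and the nearest-neighbour-based width heuristic are simplifications of the published formulation, and the function name is illustrative only.

        import numpy as np
        from scipy.spatial.distance import cdist

        def bolstered_resubstitution(clf, X, y, n_mc=100, seed=None):
            # Place a spherical Gaussian "bolstering" kernel around each training
            # point and average a Monte Carlo estimate of the kernel mass that the
            # classifier assigns to the wrong class.
            rng = np.random.default_rng(seed)
            n, d = X.shape
            sigmas = np.empty(n)
            for c in np.unique(y):
                idx = (y == c)
                D = cdist(X[idx], X[idx])
                np.fill_diagonal(D, np.inf)
                # kernel width from the mean nearest-neighbour distance within the class
                # (a simplified stand-in for the correction factor used in the literature)
                sigmas[idx] = D.min(axis=1).mean() / np.sqrt(d)
            err = 0.0
            for i in range(n):
                samples = X[i] + sigmas[i] * rng.standard_normal((n_mc, d))
                err += np.mean(clf.predict(samples) != y[i])
            return err / n

    In the model selection setting described above, an estimate of this kind would stand in for the plain training error inside the complexity-penalized loss.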

    Novel Approaches in Classification Error Estimation, Predicting Generalization in Deep Learning, and Hybrid Compartmental Models

    In data-poor environments, it may not be possible to set aside a large enough test data set to produce accurate test-set error estimates. On the other hand, in modern classification applications where training is time and resource intensive, as when training deep neural networks, classification error estimators based on resampling, such as cross-validation and bootstrap, are too computationally expensive, since they require training tens or hundreds of classifiers on resampled versions of the training data. The alternative in this case is to train and test on the same data, without resampling, i.e., to use resubstitution-like error estimators. Here, a family of generalized resubstitution classifier error estimators is proposed, and their performance in various scenarios is investigated. This family of error estimators is based on empirical measures. The plain resubstitution error estimator corresponds to choosing the standard empirical measure, which places equal probability mass on each training point. Other choices of empirical measure lead to bolstered resubstitution, posterior-probability, and Bayesian error estimators, as well as the newly proposed bolstered posterior-probability error estimators. Empirical results in this dissertation suggest that generalized resubstitution error estimators are particularly useful for various classification rules when sample sizes are small. In particular, bolstering led to remarkable improvement in error estimation in the majority of experiments on traditional classifiers as well as modern deep neural networks. Bolstering is a type of data augmentation that systematically generates meaningful samples, primarily through data-driven bolstering parameters. For low- to moderate-dimensional data, the bolstering parameter was defined based on the Euclidean distance between samples in each class; for images, however, Euclidean distance is neither straightforward nor semantically meaningful. Hence, for experiments with image data, parameters of data augmentation were selected in a different fashion. I introduce three approaches to image augmentation, among which weighted augmented data combined with the posterior probability was most effective in predicting the generalization gap in deep learning. For the study of protein turnover, I propose hybrid compartmental models (HCM), which are useful for multi-substrate experiments. Unlike conventional compartmental models, HCM starts with a partially specified structure for tracer models, estimates the tracer parameters given the data, and finally determines the details of the model's structure by choosing the most physiologically meaningful tracee model among the resulting alternative tracee models. The parameters of the alternative tracee models are computed by simple mathematical operations on the tracer parameters. The proposed HCM was employed to estimate the kinetics of phenylalanine and tyrosine using tracer-to-tracee ratio (TTR) data. Results show that the HCM tracer model fit the TTR-time data points well, and the best tracee model was selected by comparing the alternative tracee models' parameters with values reported in the literature.
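
    As a point of reference, the sketch below contrasts two simple members of the generalized resubstitution family described above: plain resubstitution (the standard empirical measure with 0-1 loss) and a posterior-probability estimator that uses the classifier's own predicted probabilities. It assumes a fitted scikit-learn-style classifier exposing predict and predict_proba; the exact estimators studied in the dissertation may differ in their details.

        import numpy as np

        def resubstitution_error(clf, X, y):
            # standard empirical measure: mass 1/n on each training point, 0-1 loss
            return np.mean(clf.predict(X) != y)

        def posterior_probability_error(clf, X):
            # expected 0-1 loss under the classifier's own posterior estimates:
            # at each training point, the estimated probability that the true
            # label differs from the predicted (argmax) label; note that the
            # training labels themselves are not used
            proba = clf.predict_proba(X)
            return np.mean(1.0 - proba.max(axis=1))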

    Empirical evaluation of resampling procedures for optimising SVM hyperparameters

    Tuning the regularisation and kernel hyperparameters is a vital step in optimising the generalisation performance of kernel methods, such as the support vector machine (SVM). This is most often performed by minimising a resampling/cross-validation based model selection criterion; however, there seems to be little practical guidance on the most suitable form of resampling. This paper presents the results of an extensive empirical evaluation of resampling procedures for SVM hyperparameter selection, designed to address this gap in the machine learning literature. We tested 15 different resampling procedures on 121 binary classification data sets in order to select the best SVM hyperparameters. We used three very different statistical procedures to analyse the results: the standard multi-classifier/multi-data-set procedure proposed by Demšar, the confidence intervals on the excess loss of each procedure in relation to 5-fold cross-validation, and the Bayes factor analysis proposed by Barber. We conclude that a 2-fold procedure is appropriate for selecting the hyperparameters of an SVM on data sets of 1000 or more datapoints, while a 3-fold procedure is appropriate for smaller data sets.
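
    By way of illustration, the following sketch selects SVM hyperparameters by minimising k-fold cross-validation error with scikit-learn; the RBF grid and the choice k=2 for larger data sets merely echo the paper's headline recommendation, and the grids and protocols used in the study itself may differ.

        from sklearn.svm import SVC
        from sklearn.model_selection import GridSearchCV

        # illustrative grid over the regularisation parameter C and RBF kernel width gamma
        param_grid = {"C": [0.1, 1, 10, 100], "gamma": [0.01, 0.1, 1, 10]}

        def tune_svm(X, y, k=2):
            # k=2 for data sets of roughly 1000+ points, k=3 for smaller ones,
            # following the paper's conclusion
            search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=k, scoring="accuracy")
            search.fit(X, y)
            return search.best_estimator_, search.best_params_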

    Performance of machine learning methods in predicting trend in price and trading volume of cryptocurrencies

    This study is motivated by the growing interest in cryptocurrency trading and the need for accurate forecasting tools to guide investment decisions. The main aim is to forecast price and trading volume changes of cryptocurrencies by determining their movement directions. Naïve Bayes, support vector machines, logistic regression, regression trees, and the k-nearest neighbors algorithm are selected to solve the problem and compared. Performance measures such as accuracy, sensitivity, and specificity are used to assess the models. The study shows that some models are better at predicting volume trends than price trends in cryptocurrencies. Naïve Bayes is good at spotting positive trends, while logistic regression is accurate at identifying negative trends. Interestingly, the research reveals that shorter prediction horizons yield more accurate price forecasts, whereas intermediate horizons work better for specificity. These insights help us understand which models work well for different aspects of cryptocurrency forecasting.
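
    The comparison relies on standard binary classification measures; as a quick reference, the sketch below computes accuracy, sensitivity and specificity from predicted and true trend directions (the function and the boolean trend encoding are illustrative, not taken from the paper).

        import numpy as np

        def trend_metrics(y_true, y_pred):
            # True = upward movement, False = downward movement
            y_true = np.asarray(y_true, dtype=bool)
            y_pred = np.asarray(y_pred, dtype=bool)
            tp = np.sum(y_pred & y_true)    # correctly predicted positive trends
            tn = np.sum(~y_pred & ~y_true)  # correctly predicted negative trends
            fp = np.sum(y_pred & ~y_true)
            fn = np.sum(~y_pred & y_true)
            return {
                "accuracy": (tp + tn) / len(y_true),
                "sensitivity": tp / (tp + fn),  # ability to spot positive trends
                "specificity": tn / (tn + fp),  # ability to spot negative trends
            }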

    PV System Information Enhancement and Better Control of Power Systems.

    Due to the rapid penetration of solar power systems in residential areas, there has been a dramatic increase in bidirectional power flow. This phenomenon creates a need to know where photovoltaic (PV) systems are located, how many there are, and how much they generate. However, significant challenges exist for accurate solar panel detection, capacity quantification, and generation estimation with existing methods, because of the limited labeled ground truth and the relatively poor performance of direct supervised learning. To mitigate these issues, this thesis adapts key learning concepts to (1) largely increase the volume of the training data set and expand the labeled data set by creating highly realistic solar panel images, (2) boost detection and quantification learning through physical knowledge, and (3) greatly enhance the generation estimation capability by utilizing effective features and neighboring generation patterns. These techniques not only reshape the machine learning methods in the GIS domain but also provide a highly accurate solution for gaining a better understanding of distribution networks with high PV penetration. The numerical validation and performance evaluation establish the high accuracy and scalability of the proposed methodologies on existing solar power systems in the Southwest region of the United States of America. The distribution and transmission networks both still rely on primitive control methodologies, and it is now high time to develop intelligent control schemes based on reinforcement learning and to show that they can not only perform well but also adapt to changing environments. This thesis proposes a sequence task-based learning method to create an agent that can learn the best action set for overcoming transient over-voltage issues. Masters Thesis, Electrical Engineering, 201

    Validating supervised learning approaches to the prediction of disease status in neuroimaging

    Alzheimer's disease (AD) is a serious global health problem with growing human and monetary costs. Neuroimaging data offers a rich source of information about pathological changes in the brain related to AD, but its high dimensionality makes it difficult to fully exploit using conventional methods. Automated neuroimage assessment (ANA) uses supervised learning to model the relationships between imaging signatures and measures of disease. ANA methods are assessed on the basis of their predictive performance, which is measured using cross-validation (CV). Despite its ubiquity, CV is not always well understood, and there is a lack of guidance as to best practice. This thesis is concerned with the practice of validation in ANA. It introduces several key challenges and considers potential solutions, including several novel contributions. Part I of this thesis reviews the field and introduces key theoretical concepts related to CV. Part II is concerned with bias due to selective reporting of performance results. It describes an empirical investigation to assess the likely level of this bias in the ANA literature and the relative importance of several contributory factors. Mitigation strategies are then discussed. Part III is concerned with the optimal selection of CV strategy with respect to bias, variance and computational cost. Part IV is concerned with the statistical analysis of CV performance results. It discusses the failure of conventional statistical procedures, reviews previous alternative approaches, and demonstrates a new heuristic solution that fares well in preliminary investigations. Though the focus of this thesis is AD ANA, the issues it addresses are of great importance to all applied machine learning fields where samples are limited and predictive performance is critical.
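
    One common source of optimistic bias in reported CV performance is tuning a pipeline and then reporting the same CV score that guided the tuning; a standard safeguard is nested cross-validation, in which selection and evaluation use separate loops. The sketch below shows a generic nested setup with scikit-learn and an SVM as a stand-in classifier; it is offered as general context for the validation issues discussed above and is not claimed to be the procedure recommended in the thesis.

        from sklearn.model_selection import GridSearchCV, cross_val_score
        from sklearn.svm import SVC

        def nested_cv_accuracy(X, y, param_grid, inner_k=3, outer_k=5):
            # inner loop tunes hyperparameters; outer loop reports performance
            # on data never seen during tuning, avoiding optimistic bias
            inner = GridSearchCV(SVC(), param_grid, cv=inner_k)
            return cross_val_score(inner, X, y, cv=outer_k).mean()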

    Computer aided techniques for the attribution of Attic black-figure vase-paintings using the Princeton painter as a model.

    Thesis (Ph.D.), University of KwaZulu-Natal, Durban, 2009. Because of their abundance and because of the insight into the ancient world offered by the depictions on their decorated surfaces, Attic painted ceramics are an extremely valuable source of material evidence. Knowing the identities and personalities of the artists who painted them not only helps us understand the paintings, but also helps in the process of dating them and, in the case of sherds, reconstructing them. However, few of the artists signed their wares, and their identities have to be revealed through close analysis of style in a process called attribution. The vast majority of the attributions of archaic Attic vases are due to John Beazley, whose monumental works set the stage for the dominance of attribution studies in the scholarship of Greek ceramics for most of the 20th century. However, the number of new scholars trained in this arcane art is dwindling as new avenues of archaeological research have gained ascendancy. A computer-aided technique for attribution may preserve the benefits of the art while allowing new scholars to explore previously ignored areas of research. To this end, the present study provides a theoretical framework for computer-aided attribution and, using the corpus of the Princeton Painter, a painter active in the 6th century BCE, demonstrates the principle that, by employing pattern recognition techniques, computers may be trained to serve as an aid in the attribution process. Three different techniques are presented that are capable of distinguishing between paintings of the Princeton Painter and some of his contemporaries with reasonable accuracy. The first uses shape descriptors to distinguish between the methods employed by respective artists to render minor anatomical details. The second shows that the relative positions of cranial features of the male figures on black-figure paintings are an indicator of style and may also be used as part of the attribution process. Finally, a novel technique is presented that can distinguish between pots constructed by different potters based on their shape profiles. This technique may offer valuable clues for attribution when artists are known to have worked mostly with a single potter.

    The role of temporal frequency in continuous flash suppression: A case for a unified framework

    In continuous flash suppression (CFS), a rapidly changing Mondrian sequence is presented to one eye in order to suppress a static target presented to the other eye. Targets generally remain suppressed for several seconds at a time, which has contributed to the widespread use of CFS in studies of unconscious visual processes. Nevertheless, the mechanisms underlying CFS suppression remain unclear, complicating its use and the interpretation of results obtained with the technique. As a starting point, this thesis examined the role of temporal frequency in CFS suppression using carefully controlled stimuli generated by Fourier transform techniques. As a low-level stimulus attribute, temporal frequency allowed us to evaluate the contributions of early visual processes and to test the general assumption that fast update rates drive CFS effectiveness. Three psychophysical studies are described in this thesis, addressing the temporal frequency tuning of CFS (Chapter 2), the relationship between the Mondrian pattern and temporal frequency content (Chapter 3), and the role of temporal frequency selectivity in CFS (Chapter 4). Contrary to conventional wisdom, the results showed that the suppression of static targets is largely driven by high spatial frequencies and low temporal frequencies. Faster masker rates, on the other hand, worked best with transient targets. These findings, indicative of early, feature-selective processes, are reminiscent of binocular rivalry suppression and demonstrate the possible use of a unified framework.

    Probabilistic state estimation in regimes of nonlinear error growth

    Thesis (Ph.D.) by W. Gregory Lawson, Massachusetts Institute of Technology, Dept. of Earth, Atmospheric, and Planetary Sciences, 2005. Includes bibliographical references (p. 273-286). State estimation, or data assimilation as it is often called, is a key component of numerical weather prediction (NWP). Nearly all implementable methods of state estimation suitable for NWP are forced to assume that errors remain in regimes of linear error growth and retain distributions of Gaussian uncertainty, yet nonlinear systems like the atmosphere can readily produce regimes of nonlinear error growth and, in turn, distributions of non-Gaussian uncertainty. State-of-the-art, ensemble-based methods of state estimation suitable for NWP are examined to gauge the consequences and relevance of violating the linear error growth assumption. For quite generic sources of non-Gaussian uncertainty, the methods are observed to fail, as they must, and the obtained analyses become probabilistically unreliable before becoming inaccurate. The mispositioning of coherent features is identified as a specific, geophysically relevant source of non-Gaussian uncertainty that can easily cause the state-of-the-art methods of state estimation to fail. However, an understanding of the relevant phenomenology sometimes allows these same methods to remain successful owing to an available redefinition of the involved errors. The redefinition is phrased as an alternative error model. It is recognized and exploited that non-Gaussian additive Eulerian errors can arise from Gaussian Lagrangian position errors. A two-step, augmented state vector approach is developed that is suitable for use with coherent features and that relies only on implementable methods of state estimation. By combining the dual Eulerian and Lagrangian state information into one vector, an ensemble can approximate their covariance, thus allowing each component's uncertainty to be reduced. The first step of the two-step approach reduces the feature position errors in an effort to render the residual additive errors Gaussian, thereby allowing the second step, an implementable state estimation method, to proceed successfully. Philosophically, the two-step approach uses physical knowledge of the problem (as phrased by the error model) to compensate for important non-Gaussian uncertainty structure that would otherwise be neglected in the state estimation process. The proposed two-step approach successfully allows the use of implementable methods of state estimation to obtain probabilistically reliable analyses in regimes of nonlinear error growth, something unavailable using current standards.
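
    To make the augmented-state idea concrete, the sketch below implements a generic stochastic ensemble Kalman filter analysis step acting on state vectors that stack Eulerian field values with Lagrangian feature positions, so that the ensemble covariance couples the two. It is a textbook-style illustration under a linear observation operator, not the thesis's exact two-step algorithm (the first, position-correcting step is not shown), and the names H and R are assumed placeholders.

        import numpy as np

        def enkf_analysis(ensemble, obs, H, R, seed=None):
            # ensemble: (n_members, n_state) augmented states [Eulerian fields | feature positions]
            # obs:      (n_obs,) observation vector
            # H:        (n_obs, n_state) linear observation operator
            # R:        (n_obs, n_obs) observation error covariance
            rng = np.random.default_rng(seed)
            n_members = ensemble.shape[0]
            anomalies = ensemble - ensemble.mean(axis=0)
            PHt = anomalies.T @ (anomalies @ H.T) / (n_members - 1)  # sample P H^T
            S = H @ PHt + R                                          # innovation covariance
            K = PHt @ np.linalg.inv(S)                               # Kalman gain
            # perturbed observations keep the analysis ensemble spread consistent
            obs_pert = obs + rng.multivariate_normal(np.zeros(len(obs)), R, size=n_members)
            innovations = obs_pert - ensemble @ H.T
            return ensemble + innovations @ K.T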