401 research outputs found

    Evolutionarily Tuned Generalized Pseudo-Inverse in Linear Discriminant Analysis

    Linear Discriminant Analysis (LDA) and the related Fisher's linear discriminant are important techniques for classification and dimensionality reduction. Applying them to real data brings a complication: the class means and the common covariance matrix are unknown and must be estimated. A problem arises if the number of features exceeds the number of observations. In this case the estimate of the covariance matrix does not have full rank, and so cannot be inverted. There are a number of ways to deal with this problem. In our previous paper, we proposed an improvement to LDA that uses a generalization of the Moore-Penrose (MP) pseudo-inverse to remove this weakness. However, for data sets with a larger number of features, our method was computationally too slow to achieve good results. Now we propose a model selection method with a genetic algorithm to solve this problem. Experimental results on different data sets demonstrate that the improvement is efficient.
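
    The rank problem described in this abstract can be illustrated with a minimal sketch. It uses the plain Moore-Penrose pseudo-inverse from NumPy, not the generalized pseudo-inverse or the genetic-algorithm tuning the authors propose; the function names `fit_lda_pinv` and `predict_lda` and the cutoff `rcond=1e-10` are assumptions made for illustration only.

```python
import numpy as np

def fit_lda_pinv(X, y):
    """Fit LDA, replacing the covariance inverse with a pseudo-inverse."""
    classes = np.unique(y)
    means = np.array([X[y == c].mean(axis=0) for c in classes])
    # Pooled within-class covariance; rank-deficient when features > observations.
    centered = X - means[np.searchsorted(classes, y)]
    cov = centered.T @ centered / (len(X) - len(classes))
    # Plain Moore-Penrose pseudo-inverse (an assumption, not the paper's
    # generalized version); the explicit cutoff discards numerically zero
    # eigenvalues instead of inverting them.
    cov_pinv = np.linalg.pinv(cov, rcond=1e-10, hermitian=True)
    priors = np.array([np.mean(y == c) for c in classes])
    return classes, means, cov_pinv, priors

def predict_lda(model, X):
    """Assign each row of X to the class with the largest discriminant score."""
    classes, means, cov_pinv, priors = model
    proj = means @ cov_pinv  # one row per class
    # Score: x^T S^+ mu_k - 0.5 * mu_k^T S^+ mu_k + log(prior_k)
    scores = X @ proj.T - 0.5 * np.sum(proj * means, axis=1) + np.log(priors)
    return classes[np.argmax(scores, axis=1)]
```

    With, say, 20 observations and 50 features, the pooled covariance estimate has rank at most 18, so `np.linalg.inv` would fail or return meaningless values, while the pseudo-inverse still yields usable discriminant scores.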

    Evolving interval-based representation for multiple classifier fusion.

    Designing an ensemble of classifiers is a popular research topic in machine learning, since an ensemble can give better results than any of its constituent members alone. Furthermore, the performance of an ensemble can be improved using selection or adaptation. In the former, the optimal set of base classifiers, meta-classifier, original features, or meta-data is selected to obtain a better ensemble than one using all the classifiers and features. In the latter, the base classifiers, or the combining algorithms working on the outputs of the base classifiers, are made to adapt to a particular problem. Adaptation here means that the parameters of these algorithms are trained to be optimal for each problem. In this study, we propose a novel evolving combining algorithm using the adaptation approach for ensemble systems. Instead of using a numerical value when computing the representation for each class, we propose to use an interval-based representation for the class. The optimal value of the representation is found through Particle Swarm Optimization. During classification, a test instance is assigned to the class whose interval-based representation is closest to the base classifiers' prediction. Experiments conducted on a number of popular datasets confirmed that the proposed method is better than the well-known ensemble systems using Decision Template and Sum Rule as combiners, L2-loss Linear Support Vector Machine, Multiple Layer Neural Network, and the ensemble selection methods based on GA-Meta-data, META-DES, and ACO.
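
    The classification step in this abstract, assigning an instance to the class whose interval representation is closest to the base classifiers' prediction, might look roughly like the sketch below. The abstract does not state the distance measure, so the Euclidean point-to-box distance used here is an assumption, as are the names `interval_distance` and `classify`. The interval bounds are taken as given; the Particle Swarm Optimization step that would tune them is not shown.

```python
import numpy as np

def interval_distance(pred, lower, upper):
    """Euclidean distance from a prediction vector to the box [lower, upper].

    Components that fall inside their interval contribute zero.
    """
    below = np.clip(lower - pred, 0.0, None)  # shortfall under the lower bound
    above = np.clip(pred - upper, 0.0, None)  # excess over the upper bound
    return np.sqrt(np.sum((below + above) ** 2))

def classify(pred, intervals):
    """Return the label whose interval representation is closest to pred.

    intervals maps each label to a (lower, upper) pair of arrays
    (a hypothetical layout chosen for this sketch).
    """
    return min(intervals, key=lambda label: interval_distance(pred, *intervals[label]))
```

    Here `pred` stands for the concatenated outputs of the base classifiers for one test instance, and each class owns one lower and one upper bound per output component.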

    Advanced Data Analysis - Lecture Notes

    Lecture notes for Advanced Data Analysis (ADA1 Stat 427/527 and ADA2 Stat 428/528), Department of Mathematics and Statistics, University of New Mexico, Fall 2016-Spring 2017. Additional material including RMarkdown templates for in-class and homework exercises, datasets, R code, and video lectures are available on the course websites: https://statacumen.com/teaching/ada1 and https://statacumen.com/teaching/ada2

    Contents
    I ADA1: Software: 0 Introduction to R, Rstudio, and ggplot
    II ADA1: Summaries and displays, and one-, two-, and many-way tests of means: 1 Summarizing and Displaying Data; 2 Estimation in One-Sample Problems; 3 Two-Sample Inferences; 4 Checking Assumptions; 5 One-Way Analysis of Variance
    III ADA1: Nonparametric, categorical, and regression methods: 6 Nonparametric Methods; 7 Categorical Data Analysis; 8 Correlation and Regression
    IV ADA1: Additional topics: 9 Introduction to the Bootstrap; 10 Power and Sample size; 11 Data Cleaning
    V ADA2: Review of ADA1: 1 R statistical software and review
    VI ADA2: Introduction to multiple regression and model selection: 2 Introduction to Multiple Linear Regression; 3 A Taste of Model Selection for Multiple Regression
    VII ADA2: Experimental design and observational studies: 4 One Factor Designs and Extensions; 5 Paired Experiments and Randomized Block Experiments; 6 A Short Discussion of Observational Studies
    VIII ADA2: ANCOVA and logistic regression: 7 Analysis of Covariance: Comparing Regression Lines; 8 Polynomial Regression; 9 Discussion of Response Models with Factors and Predictors; 10 Automated Model Selection for Multiple Regression; 11 Logistic Regression
    IX ADA2: Multivariate Methods: 12 An Introduction to Multivariate Methods; 13 Principal Component Analysis; 14 Cluster Analysis; 15 Multivariate Analysis of Variance; 16 Discriminant Analysis; 17 Classification

    Advanced statistical methods for data analysis in particle physics

    The thesis focuses on the use of multivariate statistical methods in High Energy Physics. Starting from the framework described by the current dominant physical theory, known as the Standard Model, the thesis follows two directions, associated with two different physical research questions. The first direction starts from the need to improve knowledge within the Standard Model. From a statistical point of view, such improvement means obtaining more accurate estimates of the parameters describing the Standard Model, in order to gain a better knowledge of the probability distribution of the underlying physical process, known as the background. In practice, estimation of this probability distribution builds on Monte Carlo simulated data, which can be costly and imprecise. To avoid these problems, the physics community has developed a novel procedure to generate artificial background data from the experimental ones. Within the thesis, a formal validation of this procedure is performed by introducing a statistical permutation-based two-sample test for density equality. The test relies on kernel density estimation and is suitably adjusted to be applied to high dimensional data. The second direction of research derives from the incompleteness of the Standard Model, known to be unable to fully describe the Universe and the interactions among its characterising forces. The goal of going beyond the Standard Model is pursued through model-independent searches for new physics, which look for possible new particles not predicted by the Standard Model. Such particles, referred to as a signal, are expected to behave as a deviation from the known background. From a statistical perspective, the problem is recast as a peculiar classification problem in which only partial information is available. Therefore a semi-supervised approach is adopted, either by strengthening the assumptions underlying clustering methods or by relaxing those underlying classification methods. Within this context, the thesis follows two distinct approaches. The first approach consists of developing a parametric semi-supervised method which originates from the framework of model-based clustering. A dimensionality reduction technique is proposed, resorting to penalised methods to circumvent issues related to parameter estimation and the curse of dimensionality. The proposed variable selection approach is extended from the unsupervised to the semi-supervised context, with attention to features exhibiting anomalous properties. The second approach to new physics searches consists of suitably adjusting and statistically validating an existing procedure developed within the physics community. Some improvements to the algorithm are also proposed regarding, among others, cases of high dimensional and correlated data.
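
    The idea of a permutation-based two-sample test built on kernel density estimation can be sketched as follows. This is a one-dimensional toy version under stated assumptions, not the thesis's high-dimensional procedure: the test statistic (integrated squared difference of two Gaussian kernel density estimates), the fixed `bandwidth`, the evaluation grid, and all function names are choices made here for illustration.

```python
import numpy as np

def kde(sample, grid, bandwidth):
    """Gaussian kernel density estimate of a 1-D sample, evaluated on grid."""
    z = (grid[:, None] - sample[None, :]) / bandwidth
    norm = len(sample) * bandwidth * np.sqrt(2.0 * np.pi)
    return np.exp(-0.5 * z**2).sum(axis=1) / norm

def ise_statistic(x, y, grid, bandwidth):
    """Integrated squared difference between the two density estimates."""
    diff = kde(x, grid, bandwidth) - kde(y, grid, bandwidth)
    return np.sum(diff**2) * (grid[1] - grid[0])

def permutation_test(x, y, n_perm=500, bandwidth=0.5, seed=0):
    """Permutation p-value for the null hypothesis of equal densities."""
    rng = np.random.default_rng(seed)
    pooled = np.concatenate([x, y])
    grid = np.linspace(pooled.min() - 3 * bandwidth,
                       pooled.max() + 3 * bandwidth, 200)
    observed = ise_statistic(x, y, grid, bandwidth)
    exceed = 0
    for _ in range(n_perm):
        # Under the null the group labels are exchangeable, so shuffle them.
        perm = rng.permutation(pooled)
        stat = ise_statistic(perm[:len(x)], perm[len(x):], grid, bandwidth)
        exceed += stat >= observed
    # Add-one correction keeps the p-value strictly positive.
    return (exceed + 1) / (n_perm + 1)
```

    A small p-value indicates that the observed difference between the two density estimates is larger than what random relabelling of the pooled sample typically produces.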

    A study on the prediction of flight delays of a private aviation airline

    Delay is a crucial performance indicator of any transportation system, and flight delays have financial and economic consequences for passengers and airlines. Hence, recognizing them through prediction may improve marketing decisions, as well as strategic and operational ones. The goal is to use machine learning techniques to predict a perennial aviation challenge: departure delays of more than 15 minutes, using data from a private airline. Business and data understanding of this particular segment of aviation are reviewed against the current literature, and data preparation, modelling and evaluation are carried out to arrive at a model that may support decision-making in a private aviation environment. The results show which algorithms performed better and which variables contribute the most to the model, and consequently to departure delay.