78 research outputs found

    Learning Timbre Analogies from Unlabelled Data by Multivariate Tree Regression

    This is the Author's Original Manuscript of an article whose final and definitive form, the Version of Record, has been published in the Journal of New Music Research, November 2011, copyright Taylor & Francis. The published article is available online at http://www.tandfonline.com/10.1080/09298215.2011.596938

    Unsupervised Anomaly Detection: investigations on Isolation Forest

    In today's world, the increasing amount of available information makes it possible to analyse many factors, one of which is anomaly detection. In recent years this problem has been addressed with machine learning, which makes it possible to recognise instances that do not conform to the expected behaviour of a system, the so-called outliers. One of the sectors that benefits most is industry, where data is the new wealth of companies; consider, for example, sales optimisation or predictive maintenance. Several classes of methods have been proposed over the years; recently, a new class based on isolation was introduced. The first method of this class is Isolation Forest. It has been very successful both in industrial applications and in academic research, which has produced a large number of variants. The basic intuition is simple: the anomaly score reflects each instance's propensity to be separated, measured by the average number of random splits required to completely isolate it. In this thesis, after a preliminary survey of the state of the art and an in-depth study of Isolation Forest, several variants of the method are developed with the aim of improving anomaly detection. These variants stem from insights into the two main phases: the phase in which the feature and its split value are selected, and the phase in which the anomaly score of each instance is computed. Finally, numerical experiments on both artificial and real-world datasets compare the variants' detection performance with that of the standard method. These experiments show that the Prob Split method appears to be the most promising of those developed: it yields significant gains in detection while keeping the computational cost of Isolation Forest unchanged.
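    The isolation intuition above can be sketched in a few lines: random axis-aligned splits isolate outliers in fewer steps than inliers, and the score 2^(-E[h(x)]/c(n)) maps the average path length h(x) into (0, 1). This is a minimal illustrative sketch (each tree is built on the full sample, without the subsampling the original algorithm uses), not the thesis's implementation.

```python
import math
import random

def build_tree(data, depth, max_depth):
    # Randomly pick a feature and a split value between its min and max,
    # recursing until a single point remains or the depth limit is hit.
    if len(data) <= 1 or depth >= max_depth:
        return {"size": len(data)}
    f = random.randrange(len(data[0]))
    lo, hi = min(p[f] for p in data), max(p[f] for p in data)
    if lo == hi:
        return {"size": len(data)}
    split = random.uniform(lo, hi)
    return {"feature": f, "split": split,
            "left": build_tree([p for p in data if p[f] < split], depth + 1, max_depth),
            "right": build_tree([p for p in data if p[f] >= split], depth + 1, max_depth)}

def c(n):
    # Average path length of an unsuccessful BST search over n points,
    # using the harmonic-number approximation ln(n-1) + Euler's constant.
    if n <= 1:
        return 0.0
    return 2.0 * (math.log(n - 1) + 0.5772156649) - 2.0 * (n - 1) / n

def path_length(tree, point, depth=0):
    if "size" in tree:
        # A truncated leaf with several points gets the expected extra depth c(size).
        return depth + (c(tree["size"]) if tree["size"] > 1 else 0)
    branch = "left" if point[tree["feature"]] < tree["split"] else "right"
    return path_length(tree[branch], point, depth + 1)

def anomaly_score(forest, point, n):
    # Short average path -> score near 1 (anomalous); long path -> near 0.
    avg = sum(path_length(t, point) for t in forest) / len(forest)
    return 2 ** (-avg / c(n))
```

    A lone point far from a dense cluster is separated by almost any early split, so its average path length stays small and its score is pushed towards 1.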

    Semi-Automatic Classification of Cementitious Materials using Scanning Electron Microscope Images

    Segmentation and classification are prolific research topics in the image processing community, and they are increasingly used for the analysis of cementitious materials in images acquired with Scanning Electron Microscopes (SEM). Indeed, there is a need to detect and quantify the materials present in a cement paste in order to follow the chemical reactions occurring in the material even days after solidification. In this paper, we propose a new approach for the segmentation and classification of cementitious materials based on denoising the data with the Block Matching 3D (BM3D) algorithm, Binary Partition Tree (BPT) segmentation, Support Vector Machine (SVM) classification, and interaction with the user. The BPT provides a hierarchical representation of the spatial regions of the data, allowing a segmentation to be selected among the admissible partitions of the image. SVMs are used to obtain a classification map of the image. This approach combines state-of-the-art image processing tools with user interaction to produce a better segmentation, or to help the classifier better discriminate the classes. We show that the proposed approach outperforms a previous method on synthetic data and on several real datasets from cement samples, both qualitatively by visual examination and quantitatively by comparing experimental results with theoretical ones
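    The BPT idea described above can be illustrated on a 1-D signal: start from one region per sample, repeatedly merge the most similar adjacent pair into a parent node, and obtain a segmentation by cutting the resulting tree. The mean-difference merging criterion and largest-first cut below are illustrative assumptions, not the paper's actual criteria.

```python
def build_bpt(signal):
    # Leaves are individual samples; each region tracks its mean, size,
    # and the indices of the leaves it covers.
    regions = [{"mean": v, "size": 1, "leaves": [i]} for i, v in enumerate(signal)]
    while len(regions) > 1:
        # Merge the pair of adjacent regions with the closest means.
        i = min(range(len(regions) - 1),
                key=lambda k: abs(regions[k]["mean"] - regions[k + 1]["mean"]))
        a, b = regions[i], regions[i + 1]
        size = a["size"] + b["size"]
        merged = {"mean": (a["mean"] * a["size"] + b["mean"] * b["size"]) / size,
                  "size": size,
                  "leaves": a["leaves"] + b["leaves"],
                  "children": (a, b)}
        regions[i:i + 2] = [merged]
    return regions[0]  # the root covers the whole signal

def cut(node, n_regions):
    # Select a partition from the hierarchy by repeatedly splitting the
    # largest node (assumes n_regions is small enough to avoid leaves).
    parts = [node]
    while len(parts) < n_regions:
        parts.sort(key=lambda r: r["size"], reverse=True)
        parts.extend(parts.pop(0)["children"])
    return [sorted(p["leaves"]) for p in parts]
```

    On a step signal, the cheap within-step merges happen first, so cutting the tree at two regions recovers the two plateaus.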

    Meta-Learning and the Full Model Selection Problem

    When working as a data analyst, one of my daily tasks is to select appropriate tools from a set of existing data analysis techniques in my toolbox, including data preprocessing, outlier detection, feature selection, learning algorithms and evaluation techniques, for a given data project. This was indeed an enjoyable job at the beginning, because to me finding patterns and valuable information in data is always fun. Things became tricky when several projects needed to be done in a relatively short time. Naturally, as a computer science graduate, I started to ask myself, "What can be automated here?", because, intuitively, part of my work is more or less a loop that can be programmed. Literally, the loop is "choose, run, test and choose again... until some criterion or goal is met". In other words, I use my experience and knowledge of machine learning and data mining to guide and speed up the process of selecting and applying techniques in order to build a relatively good predictive model for a given dataset and purpose. So the following questions arise: "Is it possible to design and implement a system that helps a data analyst choose from a set of data mining tools? Or at least one that provides useful recommendations about tools, potentially saving some time for a human analyst?" To answer these questions, I decided to undertake a long-term study of this topic: to think, define, research, and simulate the problem before coding my dream system. This thesis presents research results, including new methods, algorithms, and theoretical and empirical analyses, from two directions, both of which propose systematic and efficient solutions to the questions above with different resource requirements: the meta-learning-based algorithm/parameter ranking approach and the meta-heuristic search-based full model selection approach.
    Some of the results have been published in research papers; thus, this thesis also serves as a coherent collection of those results in a single volume
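    The "choose, run, test and choose again" loop can be sketched directly: fit each candidate on training data, score it on held-out data, and keep the best. The two candidates here (a mean predictor and 1-D least squares) are toy stand-ins, not the thesis's actual toolbox.

```python
def fit_mean(xs, ys):
    # Baseline: always predict the training mean.
    m = sum(ys) / len(ys)
    return lambda x: m

def fit_linear(xs, ys):
    # Ordinary least squares for y = a*x + b in one dimension.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    b = my - a * mx
    return lambda x: a * x + b

def mse(model, xs, ys):
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def select_model(candidates, train, test):
    # The loop: choose a candidate, run it, test it, choose again.
    scored = []
    for name, fit in candidates:
        model = fit(*train)
        scored.append((mse(model, *test), name, model))
    return min(scored, key=lambda t: t[0])
```

    Meta-learning enters where this sketch brute-forces: instead of running every candidate, a meta-learner ranks them from dataset characteristics so only the promising ones are tried.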

    Optimisation based approaches for machine learning

    Machine learning has attracted a lot of attention in recent years and has become an integral part of many commercial and research projects, with a wide range of applications. With current developments in technology, more data is generated and stored than ever before. Identifying patterns, trends and anomalies in these datasets and summarising them with simple quantitative models is a vital task. This thesis focuses on the development of machine learning algorithms based on mathematical programming for datasets that are relatively small in size. The first topic of this doctoral thesis is piecewise regression, where a dataset is partitioned into multiple regions and a regression model is fitted to each one. This work takes an existing algorithm from the literature and extends its mathematical formulation to include information criteria. Such criteria aim to mitigate overfitting, a common problem in supervised learning tasks, by balancing predictive performance against model complexity. The improvement in overall performance is demonstrated by testing and comparing the proposed method against various algorithms from the literature on a range of regression datasets. Extending the topic of regression, a decision tree regressor is also proposed. Decision trees are powerful and easy-to-understand structures that can be used both for regression and classification. In this work, an optimisation model is used for the binary splitting of nodes. A statistical test is introduced to check whether the partitioning of a node is statistically meaningful and thereby control the tree generation process. Additionally, a novel mathematical formulation is proposed to perform feature selection and ultimately identify the appropriate variable for splitting each node.
    The performance of the proposed algorithm is once again compared with a number of literature algorithms, and it is shown that the variable selection model reduces the training time of the algorithm without major sacrifices in performance. Lastly, a novel decision tree classifier is proposed. This algorithm is based on a mathematical formulation that identifies the optimal splitting variable and break value, applies a linear transformation to the data and then assigns each sample to a class while minimising the number of misclassified samples. The linear transformation step reduces the dimensionality of the examined dataset to a single variable, aiding the classification accuracy of the algorithm on more complex datasets. Popular classifiers from the literature have been used to compare the accuracy of the proposed algorithm on both synthetic and publicly available classification datasets
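    The fit-versus-complexity trade-off behind the information criteria can be illustrated with piecewise-constant regression and BIC = n·ln(RSS/n) + k·ln(n): adding segments always lowers the residual sum of squares, but the k·ln(n) penalty stops the model from growing without bound. BIC is one common choice assumed here; the thesis's exact criterion and mathematical-programming formulation may differ.

```python
import math

def rss_piecewise_constant(ys, breakpoints):
    # Fit each segment with its mean and accumulate squared residuals.
    total = 0.0
    bounds = [0] + list(breakpoints) + [len(ys)]
    for lo, hi in zip(bounds, bounds[1:]):
        seg = ys[lo:hi]
        m = sum(seg) / len(seg)
        total += sum((y - m) ** 2 for y in seg)
    return total

def bic(ys, breakpoints):
    n = len(ys)
    k = len(breakpoints) + 1          # one mean parameter per segment
    rss = max(rss_piecewise_constant(ys, breakpoints), 1e-12)  # guard log(0)
    return n * math.log(rss / n) + k * math.log(n)

def best_segmentation(ys, candidates):
    # Pick the candidate breakpoint set with the lowest BIC.
    return min(candidates, key=lambda b: bic(ys, b))
```

    On a two-level step signal, the single correct breakpoint beats both the unsegmented fit (large RSS) and an over-segmented fit (same RSS, larger penalty).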

    Forecasting workload and airspace configuration with neural networks and tree search methods

    The aim of the research presented in this paper is to forecast air traffic controller workload and required airspace configuration changes with enough lead time and with a good degree of realism. For this purpose, tree search methods were combined with a neural network. The neural network takes relevant air traffic complexity metrics as input and provides a workload indication (high, normal, or low) for any given air traffic control (ATC) sector. It was trained on historical data, i.e. archived sector operations, considering that ATC sectors made up of several airspace modules are usually split into several smaller sectors when the workload is excessive, or merged with other sectors when the workload is low. The input metrics are computed from the sector geometry and from simulated or real aircraft trajectories. The tree search methods explore all possible combinations of elementary airspace modules in order to build an optimal airspace partition where the workload is balanced as well as possible across the ATC sectors. The results are compared both to the real airspace configurations and to the forecasts made by flow management operators in a French "en-route" air traffic control centre
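    The partition search can be illustrated on a drastically simplified model: treat the elementary modules as a sequence with additive workloads, enumerate every grouping into contiguous sectors, and keep the most balanced one. Real airspace modules are not a simple sequence and the paper scores sectors with a neural workload model rather than an additive sum, so this only sketches the search idea.

```python
from itertools import combinations

def partitions(workloads, n_sectors):
    # Each choice of n_sectors - 1 breakpoints defines one partition of
    # the module sequence into contiguous sectors.
    n = len(workloads)
    for cuts in combinations(range(1, n), n_sectors - 1):
        bounds = (0,) + cuts + (n,)
        yield [sum(workloads[lo:hi]) for lo, hi in zip(bounds, bounds[1:])]

def most_balanced(workloads, n_sectors):
    # Minimise the spread between the busiest and quietest sector.
    return min(partitions(workloads, n_sectors),
               key=lambda sectors: max(sectors) - min(sectors))
```

    Exhaustive enumeration is only viable for small module counts, which is why the paper relies on tree search with pruning rather than brute force.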

    Technical and Fundamental Features Analysis for Stock Market Prediction with Data Mining Methods

    Predicting stock prices is an essential objective in the financial world. Forecasting stock returns and their risk is one of the most critical concerns of market decision makers. This thesis investigates stock price forecasting with three approaches from the data mining field and shows how different elements of the stock price can help to enhance the accuracy of the prediction. The first and second approaches capture many fundamental indicators of the stocks and use them as explanatory variables for stock price classification and forecasting. In the third approach, technical features are extracted from the candlestick representation of the share prices and used to enhance the accuracy of the forecasting. In each approach, different tools and techniques from data mining and machine learning are employed to justify why the forecasting works. Furthermore, since the idea is to evaluate the potential of features for stock trend forecasting, the experiments are diversified over both technical and fundamental features. In the first approach, a three-stage methodology is developed: first, a comprehensive investigation identifies all possible features that can affect stock risk and return; next, risk and return are predicted by applying data mining techniques to the given features; finally, a hybrid algorithm based on filters and function-based clustering is developed, and the risk and return of stocks are re-predicted. In the second approach, instead of using single classifiers, a fusion model is proposed based on multiple diverse base classifiers that operate on a common input and a meta-classifier that learns from the base classifiers' outputs to obtain more precise stock return and risk predictions. A set of diversity methods, including Bagging, Boosting, and AdaBoost, is applied to create diversity in the classifier combinations. Moreover, the number of base classifiers and the procedure for selecting them for the fusion schemes are determined using a methodology based on dataset clustering and the candidate classifiers' accuracy. Finally, in the third approach, a novel forecasting model for stock markets is presented, based on a wrapper ANFIS (Adaptive Neuro-Fuzzy Inference System) – ICA (Imperialist Competitive Algorithm) and technical analysis of Japanese candlesticks. Two approaches, raw-based and signal-based, are devised to extract the model's input variables, with buy and sell signals as output variables. To illustrate the methodologies, Tehran Stock Exchange (TSE) data for the period from 2002 to 2012 are used for the first and second approaches, while the third approach uses General Motors and Dow Jones index data
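    The fusion idea in the second approach can be sketched minimally: several base classifiers vote on a common input, and a simple meta-level combiner weights each vote by that classifier's training accuracy. The thesis builds Bagging/Boosting/AdaBoost ensembles with a learned meta-classifier; the threshold rules and accuracy weighting below are toy stand-ins for illustration only.

```python
def accuracy(clf, xs, ys):
    # Fraction of training samples the classifier labels correctly.
    return sum(clf(x) == y for x, y in zip(xs, ys)) / len(xs)

def fit_fusion(base_clfs, xs, ys):
    # Meta-level training: weight each base classifier by how well it
    # reproduces the training labels, then combine signed votes.
    weights = [accuracy(c, xs, ys) for c in base_clfs]

    def predict(x):
        score = sum(w * (1 if c(x) == 1 else -1)
                    for w, c in zip(weights, base_clfs))
        return 1 if score > 0 else 0

    return predict
```

    A usage sketch with hypothetical "up/down" rules on (momentum, volume) features: the accurate momentum rule dominates, so the fused prediction follows it even when the weaker rules disagree.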