886 research outputs found

    Classifier selection with permutation tests

    This work presents a content-based recommender system for machine learning classifier algorithms. Given a new data set, it recommends the classifier likely to perform best, based on classifier performance over similar known data sets. Similarity is measured according to a data set characterization that includes several state-of-the-art metrics taking into account physical structure, statistics, and information theory. A novelty with respect to prior work is the use of a robust approach based on permutation tests to directly assess whether a given learning algorithm is able to exploit the attributes in a data set to predict class labels, and its comparison with the more commonly used F-score metric for evaluating classifier performance. To evaluate our approach, we conducted extensive experiments covering 8 of the main machine learning classification methods with varying configurations and 65 binary data sets, leading to over 2331 experiments. Our results show that using the information from the permutation test clearly improves the quality of the recommendations. Peer reviewed. Postprint (author's final draft).
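The permutation-test idea in this abstract can be sketched as follows. This is a minimal illustration, not the authors' procedure: the 1-D nearest-centroid classifier, the synthetic data, and the helper names `nearest_centroid_accuracy` and `permutation_p_value` are all assumptions made for the example. The test asks how often a classifier trained on shuffled labels scores at least as well as it does on the real labels.

```python
import random

def nearest_centroid_accuracy(xs, ys):
    """Leave-one-out accuracy of a 1-D nearest-centroid classifier (labels 0/1)."""
    correct = 0
    for i in range(len(xs)):
        sums = {0: 0.0, 1: 0.0}
        counts = {0: 0, 1: 0}
        for j, (x, y) in enumerate(zip(xs, ys)):
            if j != i:
                sums[y] += x
                counts[y] += 1
        # Assumes both classes still have members after holding out one point.
        centroids = {c: sums[c] / counts[c] for c in (0, 1)}
        pred = min((0, 1), key=lambda c: abs(xs[i] - centroids[c]))
        correct += pred == ys[i]
    return correct / len(xs)

def permutation_p_value(xs, ys, n_permutations=200, seed=0):
    """Estimate how surprising the observed accuracy is under shuffled labels."""
    rng = random.Random(seed)
    observed = nearest_centroid_accuracy(xs, ys)
    at_least_as_good = 0
    shuffled = list(ys)
    for _ in range(n_permutations):
        rng.shuffle(shuffled)  # breaks any link between attributes and labels
        if nearest_centroid_accuracy(xs, shuffled) >= observed:
            at_least_as_good += 1
    # The +1 terms keep the estimate strictly positive.
    return observed, (at_least_as_good + 1) / (n_permutations + 1)

# Hypothetical toy data: two well-separated Gaussian classes.
data_rng = random.Random(1)
xs = [data_rng.gauss(0.0, 1.0) for _ in range(20)] + [data_rng.gauss(3.0, 1.0) for _ in range(20)]
ys = [0] * 20 + [1] * 20
accuracy, p_value = permutation_p_value(xs, ys)
print(f"accuracy={accuracy:.2f}, p-value={p_value:.3f}")
```

A small p-value suggests the classifier is genuinely exploiting the attributes rather than benefiting from chance or class imbalance, which is the kind of signal the abstract proposes to use alongside the F-score.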

    Unsupervised Heterogeneous Coupling Learning for Categorical Representation.

    Complex categorical data is often hierarchically coupled, with heterogeneous relationships between attributes and attribute values and couplings between objects. Such value-to-object couplings are heterogeneous, with complementary and inconsistent interactions and distributions. The limited existing research on unlabeled categorical data representation ignores the heterogeneous and hierarchical couplings, underestimates data characteristics and complexities, and overuses redundant information. Deep representation learning of unlabeled categorical data is challenging: it overlooks such value-to-object couplings, complementarity and inconsistency, and requires large data, disentanglement, and high computational power. This work introduces a shallow but powerful UNsupervised heTerogeneous couplIng lEarning (UNTIE) approach for representing coupled categorical data by untying the interactions between couplings and revealing the heterogeneous distributions embedded in each type of coupling. UNTIE is efficiently optimized w.r.t. a kernel k-means objective function for unsupervised representation learning of heterogeneous and hierarchical value-to-object couplings. Theoretical analysis shows that UNTIE can represent categorical data with maximal separability while effectively representing heterogeneous couplings and disclosing their roles in categorical data. The UNTIE-learned representations achieve significant performance improvements over state-of-the-art categorical representations and deep representation models on 25 categorical data sets with diversified characteristics.
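For orientation, the generic kernel k-means objective mentioned in this abstract can be written as below. This is the textbook form, not the specific UNTIE objective, which the abstract does not spell out. The symbols $\phi$ (feature map), $\kappa$ (kernel), $C_k$ (cluster $k$) and $m_k$ (its mean in feature space) are notation assumed here.

```latex
\min_{C_1, \dots, C_K} \; \sum_{k=1}^{K} \sum_{i \in C_k} \left\| \phi(x_i) - m_k \right\|^2,
\qquad m_k = \frac{1}{|C_k|} \sum_{j \in C_k} \phi(x_j),

\left\| \phi(x_i) - m_k \right\|^2
  = \kappa(x_i, x_i)
  - \frac{2}{|C_k|} \sum_{j \in C_k} \kappa(x_i, x_j)
  + \frac{1}{|C_k|^2} \sum_{j \in C_k} \sum_{l \in C_k} \kappa(x_j, x_l).
```

The second identity shows that every distance depends only on kernel evaluations $\kappa(x_i, x_j) = \langle \phi(x_i), \phi(x_j) \rangle$, so the feature map never has to be computed explicitly.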

    Identification des régimes et regroupement des séquences pour la prévision des marchés financiers

    Regime switching analysis is extensively advocated to capture complex behaviors underlying financial time series for market prediction. The literature raises two main disadvantages of current approaches to regime identification: 1) the lack of a mechanism for identifying regimes dynamically, which restricts them to switching among a fixed set of regimes with a static transition probability matrix; 2) the failure to utilize cross-sectional regime dependencies among time series, since not all the time series are synchronized to the same regime. As numerical time series can be symbolized into categorical sequences, a third issue arises: 3) the lack of a meaningful and effective measure of the similarity between chronologically dependent categorical values, needed to identify sequence clusters that could serve as regimes for market forecasting. In this thesis, we address the first issue by proposing a dynamic regime identification model that identifies regimes dynamically with a time-varying transition probability. For the second issue, we propose a cluster-based regime identification model that accounts for the cross-sectional regime dependencies underlying financial time series for market forecasting. For the last issue, we develop a dynamic order Markov model, making use of information underlying frequent consecutive patterns and sparse patterns, to identify the clusters that could serve as regimes on categorized financial time series. Experiments on synthetic and real-world datasets show that our two regime models perform well on both regime identification and forecasting, while our dynamic order Markov clustering model also performs well at identifying clusters from categorical sequences.
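To make the terminology concrete, the sketch below symbolizes a numeric series into a categorical sequence and estimates a static first-order transition matrix from it. This is the fixed-matrix baseline the abstract criticizes, not the thesis's dynamic or dynamic-order models. The up/down symbolization, the toy returns, and the function names `symbolize` and `transition_matrix` are assumptions for illustration.

```python
from collections import Counter, defaultdict

def symbolize(returns, threshold=0.0):
    """Map each numeric return to a categorical symbol: 'U' (up) or 'D' (down)."""
    return ["U" if r > threshold else "D" for r in returns]

def transition_matrix(symbols):
    """Estimate first-order transition probabilities from consecutive symbol pairs."""
    counts = defaultdict(Counter)
    for current, nxt in zip(symbols, symbols[1:]):
        counts[current][nxt] += 1
    return {
        state: {nxt: c / sum(row.values()) for nxt, c in row.items()}
        for state, row in counts.items()
    }

# Hypothetical daily returns.
returns = [0.4, 0.1, -0.3, -0.2, 0.5, 0.2, 0.3, -0.1]
symbols = symbolize(returns)
matrix = transition_matrix(symbols)
for state, row in sorted(matrix.items()):
    print(state, dict(sorted(row.items())))
```

Because the matrix is estimated once over the whole sequence, it cannot follow regimes whose switching behavior changes over time, which is the first limitation the abstract sets out to address.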

    Machine Learning for Classification and Clustering of Dementia Data

    Dementia is a term used to describe heterogeneous diseases that can generally be characterised by a decline in cognitive ability that affects daily living. It has been predicted that the prevalence of dementia will increase significantly over the coming years, thus it is a priority worldwide. This thesis discusses research conducted with two primary aims. They were to investigate the use of machine learning for distinguishing between people with and without dementia, as well as differentiating between key dementia subtypes where appropriate; and to gain an understanding of the inherent structure of dementia data, to ultimately investigate disease signatures. Data was acquired from the National Alzheimer's Coordinating Center in the United States, and a data set comprising 32,573 observations and 260 features of mixed type was utilised. It included features whose values were constrained by relations with others, as well as two types of missingness which arose when data was unexpectedly not recorded and when the information was irrelevant or unobtainable for a known reason, respectively. Notably, the former genuinely missing values were imputed where possible, whilst the latter conditionally missing values were handled. An imputation approach was developed, which simultaneously builds a random forest classifier while handling conditionally missing values. It maintained the known relations in the data set, so far as possible. A clustering approach was also developed that ultimately measures the similarity of observations based on the similarity of their paths through the trees of an isolation forest before employing spectral clustering. Crucially, it can naturally draw on variables of mixed type. 
A dementia classifier with an area under the receiver operating characteristic curve (AUC) of 0.99 and 10 pairwise dementia subtype classifiers with AUCs ranging from 0.88 to 1.0 (rounded) were produced, suggesting machine learning could be a useful tool for diagnosing dementia and differentiating between the main subtypes. Key features were identified using these classifiers and were markedly different for the two types of diagnosis. Furthermore, preliminary experiments conducted using the clustering approach suggested that mild cognitive impairment may be a mild form of dementia rather than a distinct clinical entity, a question over which there is much debate, and that there could be evidence for the current subtypes. Ultimately, these findings have the potential to transform the way dementia is diagnosed.
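The AUC figures quoted above have a simple pairwise interpretation: the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case, with ties counted as one half. A minimal sketch follows. The labels and scores are made-up values, not data from the thesis, and the function name `auc` is an assumption.

```python
def auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    positives = [s for label, s in zip(labels, scores) if label == 1]
    negatives = [s for label, s in zip(labels, scores) if label == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))

# Hypothetical classifier scores for three positive and three negative cases.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.5, 0.3, 0.1]
value = auc(labels, scores)
print(round(value, 3))
```

This quadratic-time version is fine for small examples. For a data set with tens of thousands of observations, a rank-based computation would be the usual choice.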

    Multi-objective constrained optimization for energy applications via tree ensembles

    Energy systems optimization problems are complex due to strongly non-linear system behavior and multiple competing objectives, e.g. economic gain vs. environmental impact. Moreover, a large number of input variables and different variable types, e.g. continuous and categorical, are challenges commonly present in real-world applications. In some cases, proposed optimal solutions need to obey explicit input constraints related to physical properties or safety-critical operating conditions. This paper proposes a novel data-driven strategy using tree ensembles for constrained multi-objective optimization of black-box problems with heterogeneous variable spaces for which underlying system dynamics are either too complex to model or unknown. In an extensive case study comprising synthetic benchmarks and relevant energy applications, we demonstrate the competitive performance and sampling efficiency of the proposed algorithm compared to other state-of-the-art tools, making it a useful all-in-one solution for real-world applications with limited evaluation budgets.
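With competing objectives there is usually no single best solution, only a set of non-dominated trade-offs (the Pareto front). The sketch below filters candidate solutions down to that set. It illustrates the multi-objective notion only and is not the tree-ensemble algorithm proposed in the paper. The (cost, emissions) pairs and the names `dominates` and `pareto_front` are assumptions, and both objectives are taken to be minimized.

```python
def dominates(a, b):
    """True if a is no worse than b in every objective and strictly better in at least one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def pareto_front(points):
    """Keep the points that no other point dominates (minimization in all objectives)."""
    return [p for p in points if not any(dominates(q, p) for q in points if q != p)]

# Hypothetical candidates as (cost, emissions) pairs.
candidates = [(1.0, 9.0), (2.0, 7.0), (3.0, 8.0), (4.0, 3.0), (5.0, 4.0), (6.0, 1.0)]
front = pareto_front(candidates)
print(front)
```

Here (3.0, 8.0) drops out because (2.0, 7.0) is better on both objectives, and (5.0, 4.0) drops out because of (4.0, 3.0). The remaining points are trade-offs a decision maker would have to choose among.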

    soMLier: A South African Wine Recommender System

    Though several commercial wine recommender systems exist, they are largely tailored to consumers outside of South Africa (SA). Consequently, these systems are of limited use to novice wine consumers in SA. To address this, the aim of this research is to develop a system for South African consumers that yields high-quality wine recommendations, maximises the accuracy of predicted ratings for those recommendations and provides insights into why those suggestions were made. To achieve this, a hybrid system “soMLier” (pronounced “sommelier”) is built in this thesis that makes use of two datasets. Firstly, a database containing several attributes of South African wines such as the chemical composition, style, aroma, price and description was supplied by wine.co.za (a SA wine retailer). Secondly, for each wine in that database, the numeric 5-star ratings and textual reviews made by users worldwide were scraped from Vivino.com to serve as a dataset of user preferences. Together, these are used to develop and compare several systems, the best of which are combined in the final system. Item-based collaborative filtering methods are investigated first, along with model-based techniques (such as matrix factorisation and neural networks), applied to the user rating dataset to generate wine recommendations by ranking rating predictions. Respectively, these methods are found to excel at generating lists of relevant wine recommendations and at producing accurate corresponding predicted ratings. Next, the wine attribute data is used to explore the efficacy of content-based systems. Numeric features (such as price) are compared along with categorical features (such as style) using various distance measures, and the relationships between the textual descriptions of the wines are determined using natural language processing methods. These methods are found to be most appropriate for explaining wine recommendations.
Hence, the final hybrid system makes use of collaborative filtering to generate recommendations, matrix factorisation to predict user ratings, and content-based techniques to rationalise the wine suggestions made. This thesis contributes the “soMLier” system, which is of specific use to SA wine consumers as it bridges the gap between the technologies used by highly developed existing systems and the SA wine market. Though this final system would benefit from more explicit user data to establish a richer model of user preferences, it can ultimately assist consumers in exploring unfamiliar wines, discovering wines they will likely enjoy, and understanding their preferences for SA wine.
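Item-based collaborative filtering, as named in this abstract, predicts a user's rating for an unseen item from that user's ratings of similar items. The sketch below is a minimal version using cosine similarity over co-rating users. The users, wines, ratings, and function names are invented for illustration and are not taken from the thesis or its datasets.

```python
import math

# Hypothetical 5-star ratings: user -> {wine: rating}.
ratings = {
    "ana": {"pinotage": 5, "chenin": 3, "shiraz": 4},
    "ben": {"pinotage": 4, "chenin": 2, "shiraz": 5},
    "cat": {"pinotage": 1, "chenin": 5},
}

def item_vector(item):
    """Ratings given to one item, keyed by user."""
    return {user: r[item] for user, r in ratings.items() if item in r}

def cosine(a, b):
    """Cosine similarity of two items over the users who rated both."""
    common = set(a) & set(b)
    if not common:
        return 0.0
    dot = sum(a[u] * b[u] for u in common)
    norm_a = math.sqrt(sum(a[u] ** 2 for u in common))
    norm_b = math.sqrt(sum(b[u] ** 2 for u in common))
    return dot / (norm_a * norm_b)

def predict(user, target):
    """Similarity-weighted average of the user's ratings for other items."""
    target_vec = item_vector(target)
    weighted, total = 0.0, 0.0
    for item, rating in ratings[user].items():
        sim = cosine(target_vec, item_vector(item))
        weighted += sim * rating
        total += sim
    return weighted / total if total else None

prediction = predict("cat", "shiraz")
print(round(prediction, 2))
```

Raw cosine similarity on all-positive ratings tends to be high for every item pair, so practical systems often mean-centre ratings per user first. That refinement is omitted to keep the sketch short.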

    Methods for Learning Directed and Undirected Graphical Models

    Probabilistic graphical models provide a general framework for modeling relationships between multiple random variables. The main tool in this framework is a mathematical object called a graph, which visualizes the assertions of conditional independence between the variables. This thesis investigates methods for learning these graphs from observational data. Regarding undirected graphical models, we propose a new scoring criterion for learning the dependence structure of a Gaussian graphical model. The scoring criterion is derived as an approximation to the often intractable Bayesian marginal likelihood. We prove that the scoring criterion is consistent and demonstrate its applicability to high-dimensional problems when combined with an efficient search algorithm. Secondly, we present a non-parametric method for learning undirected graphs from continuous data. The method combines a conditional mutual information estimator with a permutation test in order to perform conditional independence testing without assuming any specific parametric distributions for the involved random variables. Accompanying this test with a constraint-based structure learning algorithm creates a method which performs well in numerical experiments when the data generating mechanisms involve non-linearities. For directed graphical models, we propose a new scoring criterion for learning Bayesian network structures from discrete data. The criterion approximates a hard-to-compute quantity called the normalized maximum likelihood. We study the theoretical properties of the score and compare it experimentally to popular alternatives. Experiments show that the proposed criterion provides a robust and safe choice for structure learning and prediction over a wide variety of different settings. Finally, as an application of directed graphical models, we derive a closed-form expression for the Bayesian network Fisher kernel. 
This provides us with a similarity measure over discrete data vectors, capable of taking into account the dependence structure between the components. We illustrate the similarity measured by this kernel with an example where we use it to seek sets of observations that are important and representative of the underlying Bayesian network model.
Probabilistic graphical models are a general-purpose way to model relationships between several random variables. The central tool in these models is a network, or graph, which can visually represent the dependence structure between the variables. This dissertation deals with methods for learning undirected and directed graphs from observed data. For undirected graphs, the work presents two structure learning methods suited to different situations. The first is a model selection criterion for learning graph structures when the variables are normally distributed. The criterion is derived as an approximation to the Bayesian marginal likelihood, which is often computationally demanding. The work studies the theoretical properties of the criterion and shows experimentally that it works well when the number of variables is large. The second method is non-parametric, meaning roughly that no precise assumptions about the distribution of the input variables are needed. It makes use of information-theoretic quantities estimated from the data together with a permutation test. Experimental results show that the method works well when the dependencies between the variables in the input data are non-linear. The second part of the dissertation deals with Bayesian networks, which are directed graphical models. The work presents a new model selection criterion for learning Bayesian networks over discrete variables. This criterion is studied theoretically and compared experimentally with other commonly used model selection criteria. Finally, the dissertation presents an application of directed graphical models by deriving a Fisher kernel based on a Bayesian network. The resulting Fisher kernel can be used to measure the similarity of data vectors while taking into account the dependencies between the components of the vectors, which is illustrated experimentally.
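For reference, the general definition of a Fisher kernel for a parametric model $p(x \mid \theta)$ is given below. This is the standard textbook form, not the closed-form Bayesian network expression derived in the thesis, which the abstract does not reproduce. The symbols $U_x$ (Fisher score) and $\mathcal{I}$ (Fisher information matrix) are notation assumed here.

```latex
U_x = \nabla_\theta \log p(x \mid \theta), \qquad
\mathcal{I} = \mathbb{E}_{x \sim p(\cdot \mid \theta)}\left[ U_x U_x^{\top} \right], \qquad
K(x, y) = U_x^{\top} \, \mathcal{I}^{-1} \, U_y .
```

Two observations are similar under $K$ when they pull the model parameters in similar directions, which is how the kernel can reflect the dependence structure encoded in the model.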