
    Unsupervised learning in high-dimensional space

    Thesis (Ph.D.)--Boston University. In machine learning, the problem of unsupervised learning is that of explaining key features and finding hidden structure in unlabeled data. In this thesis we focus on three unsupervised learning scenarios: graph-based clustering with imbalanced data, point-wise anomaly detection, and anomalous cluster detection on graphs. In the first part we study spectral clustering, a popular graph-based clustering technique. We investigate why spectral clustering performs poorly on imbalanced and proximal data. We then propose the partition constrained minimum cut (PCut) framework, based on a novel parametric graph construction method that is shown to adapt to different degrees of imbalance. We analyze the limit cut behavior of our approach and demonstrate significant performance improvements through clustering and semi-supervised learning experiments on imbalanced data. [TRUNCATED]
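    To make the setting concrete, the following is a minimal sketch of standard spectral clustering on an affinity graph, the baseline this line of work starts from; the RBF similarity, k-nearest-neighbor sparsification, and k-means step are common illustrative choices, not the PCut construction itself.

    import numpy as np
    from scipy.linalg import eigh
    from sklearn.cluster import KMeans
    from sklearn.metrics import pairwise_distances

    def spectral_clustering(X, n_clusters=2, n_neighbors=10, sigma=1.0):
        # Dense RBF affinities, sparsified to a symmetric k-NN graph.
        D = pairwise_distances(X)
        W = np.exp(-D**2 / (2 * sigma**2))
        far = np.argsort(D, axis=1)[:, n_neighbors + 1:]  # beyond self + k nearest
        for i in range(len(X)):
            W[i, far[i]] = 0.0
        W = np.maximum(W, W.T)                 # symmetrize
        d = W.sum(axis=1)
        # Normalized graph Laplacian L = I - D^{-1/2} W D^{-1/2}.
        D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
        L = np.eye(len(X)) - D_inv_sqrt @ W @ D_inv_sqrt
        # Embed each point with the eigenvectors of the smallest eigenvalues,
        # then cluster the embedding with k-means.
        _, vecs = eigh(L, subset_by_index=[0, n_clusters - 1])
        return KMeans(n_clusters=n_clusters, n_init=10).fit_predict(vecs)

    Imbalanced clusters are exactly the case where this baseline's graph construction misleads the cut, which motivates the parametric construction studied in the thesis.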

    A POWER INDEX BASED FRAMEWORK FOR FEATURE SELECTION PROBLEMS

    One of the most challenging tasks in machine learning is feature selection: choosing the best set of features to use in the training and prediction processes. Pruning the set of operational features has several benefits: reduced computation time, often better prediction quality, and the possibility of building a good predictor from less data. In its most common form the problem is called the single-view feature selection problem, to distinguish it from feature selection in multi-view learning. In the latter, each view corresponds to a set of features, and one would like to perform feature selection on each view subject to some global constraints. A related problem in multi-view learning is feature partitioning: splitting the set of features of a single large view into two or more views so that a good predictor can be built on each view. In this case the best features must be distributed between the views, each view should contain synergistic features, and features that interfere disruptively must be placed in different views. In the semi-supervised multi-view task known as co-training, one also requires that a predictor trained on an individual view be able to teach something to the other views: in classification tasks, for instance, one view should learn to classify unlabelled examples based on the guesses provided by the other views.

    There are several ways to address these problems. One set of techniques is inspired by coalitional game theory, which defines several useful concepts, two of which are of high practical importance: the power index and the interaction index. In the context of feature selection they take the following meaning: the power index is a (context-dependent) synthetic measure of the predictive capability of a feature, and the interaction index is a (context-dependent) synthetic measure of the interaction (constructive/disruptive interference) between two features; it can be used to quantify how the collaboration between two features enhances their predictive capabilities. An important point is that the power index of a feature is different from the predictive power of the feature in isolation: by a suitable averaging, it takes into account the context, i.e. the fact that the feature acts together with other features in training a model. Similarly, the interaction index between two features takes the context into account by suitably averaging the interaction with all the other features.

    In this work we address both the single-view and the multi-view problems as follows. The single-view feature selection problem is formalized as the maximization of a pseudo-Boolean function, i.e. a real-valued set function that maps sets of features into a performance metric. Since one has to search over (a considerable portion of) the Boolean lattice, without any special guarantees except, perhaps, positivity, the problem is in general NP-hard. We address it by producing candidate maximum coalitions from the subset of features with the highest power indices and using such a coalition to approximate the actual maximum. Although the exact computation of the power indices is an exponential task, estimates sufficient for this problem can be obtained in polynomial time, as sketched below.
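    As a concrete illustration, here is a minimal Monte Carlo sketch of permutation-sampling estimation of Shapley-style power indices for feature selection; the decision-tree learner and the cross-validated accuracy used as the characteristic function v(S) are assumptions made for the example, not the thesis's exact choices.

    import numpy as np
    from sklearn.model_selection import cross_val_score
    from sklearn.tree import DecisionTreeClassifier

    def v(S, X, y):
        """Characteristic function: predictive power of feature coalition S."""
        if not S:
            return 0.0
        model = DecisionTreeClassifier(max_depth=3, random_state=0)
        return cross_val_score(model, X[:, sorted(S)], y, cv=3).mean()

    def shapley_estimates(X, y, n_permutations=200, seed=0):
        rng = np.random.default_rng(seed)
        n = X.shape[1]
        phi = np.zeros(n)
        for _ in range(n_permutations):
            S, v_prev = set(), 0.0
            for f in rng.permutation(n):       # random ordering of features
                S.add(f)
                v_new = v(S, X, y)
                phi[f] += v_new - v_prev       # marginal contribution of f
                v_prev = v_new
        return phi / n_permutations            # average over sampled orderings

    The candidate coalition is then the k features with the highest estimated indices, used as an approximation of the maximizing feature subset.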
    The multi-view feature selection problem is formalized as the generalization of the above set-up to multi-variable pseudo-Boolean functions. The multi-view splitting problem is instead formalized as the maximization of a real function defined over the partition lattice; this problem is also typically NP-hard. However, candidate solutions can be found by suitably partitioning the top power-index features and keeping in different views the pairs of features that are less interactive or negatively interactive. The sum of the power indices of the participating features can be used to approximate the predictive capability of a view (i.e. as a proxy for its predictive power), while the sum of feature-pair interactivity across views can be used as a proxy for the orthogonality of the views. The capability of a view to pass information (to teach) to other views within a co-training procedure can also benefit from power indices based on a suitable definition of information transfer (a set of features, a coalition, classifies examples that are subsequently used in the training of a second set of features).

    As to the feature selection task, we not only demonstrate the use of state-of-the-art power index concepts (e.g. the Shapley Value and the Banzhaf Value) along the lines described above, but also define new power indices within the more general class of probabilistic power indices, which contains the Shapley and Banzhaf Values as special cases. Since the number of features to select is often a predefined parameter of the problem, we also introduce some novel power indices, namely the k-Power Index (and its specializations, the k-Shapley Value and the k-Banzhaf Value), which help select the features more efficiently. For feature partitioning, we use the more general class of probabilistic interaction indices, which contains the Shapley and Banzhaf Interaction Indices as members; a sampled pairwise estimate is sketched below. We also address the problem of evaluating the teaching ability of a view, introducing a suitable teaching capability index.

    The last contribution of this work is a comparison of the game-theoretic approach with the classical greedy forward selection approach, in which the candidate is obtained by aggregating one feature at a time to the current maximal coalition, always choosing the feature with the maximal marginal contribution. We show that in typical cases the two methods are complementary, and that used in conjunction they reduce one another's error in the estimate of the maximum value. Moreover, the game-theoretic approach has two advantages: it samples the space of all possible feature subsets, whereas the greedy algorithm scans a selected subspace and excludes the rest entirely, and it is able to assign each feature a score that describes a context-aware measure of its importance in the prediction process.
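    The following self-contained sketch estimates a Banzhaf-style pairwise interaction index by sampling random coalitions, one plausible instance of the probabilistic interaction indices mentioned above; the learner and scoring choices are again illustrative assumptions.

    import numpy as np
    from sklearn.model_selection import cross_val_score
    from sklearn.tree import DecisionTreeClassifier

    def v(S, X, y):
        """Predictive power of feature coalition S (empty set scores 0)."""
        if not S:
            return 0.0
        model = DecisionTreeClassifier(max_depth=3, random_state=0)
        return cross_val_score(model, X[:, sorted(S)], y, cv=3).mean()

    def interaction(i, j, X, y, n_samples=100, seed=1):
        rng = np.random.default_rng(seed)
        others = [f for f in range(X.shape[1]) if f not in (i, j)]
        total = 0.0
        for _ in range(n_samples):
            keep = rng.random(len(others)) < 0.5   # random coalition without i, j
            S = {f for f, k in zip(others, keep) if k}
            # Positive values suggest synergy; negative values suggest
            # disruptive interference between features i and j.
            total += (v(S | {i, j}, X, y) - v(S | {i}, X, y)
                      - v(S | {j}, X, y) + v(S, X, y))
        return total / n_samples

    Feature pairs with strongly negative estimates are natural candidates for placement in different views.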

    MODELING LARGE-SCALE CROSS EFFECT IN CO-PURCHASE INCIDENCE: COMPARING ARTIFICIAL NEURAL NETWORK TECHNIQUES AND MULTIVARIATE PROBIT MODELING

    This dissertation examines cross-category effects in consumer purchases from the big data and analytics perspectives, using data from the Nielsen Consumer Panel and Scanner databases. With big data analytics it becomes possible to examine the cross effects of many product categories on each other. The number of categories whose cross effects are studied is called the category scale, or simply scale, in this dissertation. This dissertation extends research on models of cross effects by (1) examining the performance of the multivariate probit (MVP) model across category scale; (2) customizing artificial neural network (ANN) techniques for large-scale cross effect analysis; (3) examining the performance of ANNs across scale; and (4) developing a conceptual model of spending habits as a source of cross effect heterogeneity. The results provide researchers and managers new knowledge about using the two techniques in large category scale settings. The computational requirements of MVP models grow exponentially with scale, so MVP models are far more constrained by available computing capacity than ANN models are. In our experiments on Nielsen data at scales 4, 8, 16, and 32, MVP models could not be estimated for baskets with 16 or more categories, whereas ANN models could be calibrated at both scales 16 and 32. Surprisingly, the predictive results of ANN models exhibit an inverted-U relationship with scale. As an ancillary result, we provide a method for determining the existence and extent of non-linear own- and cross-category effects on the likelihood of purchase of a category using ANN models. Beyond our empirical studies, we draw on the mental budgeting model and the impulsive spending literature to conceptualize consumer spending habits as a source of heterogeneity in the cross effect context. Finally, after a discussion of conclusions and limitations, the dissertation concludes with open questions for future research.
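    As an illustration of the ANN side of this comparison, the sketch below fits a small multilayer perceptron that predicts the purchase incidence of one category from the incidences of the other categories in the basket; the architecture, the synthetic baskets, and the 30% incidence rate are assumptions made for the example, not the dissertation's specification.

    import numpy as np
    from sklearn.model_selection import train_test_split
    from sklearn.neural_network import MLPClassifier

    rng = np.random.default_rng(0)
    n_baskets, scale = 5000, 16              # scale = number of categories
    # Stand-in basket data: 1 = category purchased on the shopping trip.
    baskets = (rng.random((n_baskets, scale)) < 0.3).astype(int)

    target = 0                               # category whose incidence we model
    X = np.delete(baskets, target, axis=1)   # cross effects: the other categories
    y = baskets[:, target]

    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
    ann = MLPClassifier(hidden_layer_sizes=(32,), max_iter=500, random_state=0)
    ann.fit(X_tr, y_tr)
    print("held-out accuracy:", ann.score(X_te, y_te))

    In this per-category setup the number of models grows only linearly with scale, consistent with the abstract's observation that ANN models remain estimable at scales where MVP estimation breaks down.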

    Postal Reform in Japan: A Comparative New Zealand–Japan Study of Economic Issues in Privatising a Postal System

    The purpose of this thesis is to analyse the economic issues in the postal reform experiences of New Zealand and Japan. New Zealand's reforms were conducted in the 1980s; its experience raises questions about what factors were important for overcoming resistance to reform. Japan's case is a current issue, raising questions of how likely privatisation may be, what dilutions may occur, and what the post-reform organisation might look like. This study charts New Zealand's reform evolution by supplementing the literature with interviews conducted with experts closely tied to the events. Japan's reform is similarly traced to the present day, where a simulation model I have developed proposes final negotiation outcomes. I argue that New Zealand Post's pre-reform institutional environment was incongruent with efficiency and productivity. Reforms created an entirely new institutional environment based on neoliberal ideologies, separating governance from ownership and disentangling commercial and social objectives. This study shows that resistance was overcome by a combination of implementation speed, scale of reform, carefully drafted legislation, and managerial acuity. Japan's pre-reform environment displays a number of parallels to New Zealand's. However, in this case I argue that prolonged implementation and a greater presence of interest groups hamper reform progress, and simulation suggests only a partial reform in which legislation maintains entry barriers, favouring Japan Post over private competition. Key reasons for the differences in outcomes between the two countries are argued to be differences in political ideologies and the strength of reform opposition. The separation of governance and ownership, for instance, appears to be more distinct in New Zealand than in Japan, and whilst the New Zealand model focuses upon shareholder wealth maximisation, the Japanese case appears to place greater emphasis upon stakeholder interests.

    Mixture Model Clustering in the Analysis of Complex Diseases

    The topic of this thesis is the analysis of complex diseases, and specifically the use of certain clustering methods for it. We concern ourselves mostly with the modeling of complex disease phenotypes: the symptoms and signs of diseases, and the multiple other co-phenotypes that accompany them. The two related questions we seek answers to are: 1) how can we use these clustering methods to summarize complex, multivariate phenotype data, for example to be used as a simple phenotype in genetic analyses, and 2) how can we use these clustering methods to find subgroups of sufferers of a particular disease that might share the same causal factors of the disease. Current methods for studies in medical genetics ideally call for a single, or at most a handful of, univariate phenotypes to be compared to genetic markers. Multidimensional phenotypes cannot be handled by the standard methods, and treating each variable as independent and testing a hundred phenotypes with an unclear true dependency structure against thousands of markers results in problems with both running times and multiple testing correction. In this work, clustering is utilized to summarize a multi-dimensional phenotype into something that can then be used in association studies of both genetic and other types of potential causes. I describe a clustering process and some clustering methods used in this work, with comments on practical issues and references to the relevant literature. After some experiments on artificial data to gain insight into the properties of these methods, I present four case studies on real data, highlighting both ways to successfully use these methods and problems that can arise in the process.
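    A minimal sketch of the kind of mixture model clustering described above, using EM-fitted Gaussian mixtures to split multivariate phenotype data into subgroups; the synthetic two-group data and the BIC-based choice of the number of components are illustrative assumptions.

    import numpy as np
    from sklearn.mixture import GaussianMixture

    rng = np.random.default_rng(0)
    # Stand-in phenotype matrix: patients x symptom/sign scores.
    phenotypes = np.vstack([
        rng.normal(0.0, 1.0, size=(100, 5)),   # hypothetical subgroup A
        rng.normal(2.0, 1.0, size=(100, 5)),   # hypothetical subgroup B
    ])

    # Fit mixtures with 1..5 components and keep the best by BIC, a common
    # way to decide how many subgroups the data support.
    models = [GaussianMixture(n_components=k, n_init=5, random_state=0).fit(phenotypes)
              for k in range(1, 6)]
    best = min(models, key=lambda m: m.bic(phenotypes))
    labels = best.predict(phenotypes)          # cluster label = candidate subgroup
    print("chosen components:", best.n_components)

    The resulting labels can then stand in for a simple univariate phenotype in downstream association analyses, which is the summarization use described in the abstract.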

    THREE-DIMENSIONAL VISION FOR STRUCTURE AND MOTION ESTIMATION

    This thesis addresses computer vision techniques for estimating geometric properties of the 3-D world from digital images. Such properties are essential for object recognition and classification, mobile robot navigation, reverse engineering, and the synthesis of virtual environments. In particular, this thesis describes the modules involved in computing the structure of a scene from images, and offers original contributions in the following fields.

    Stereo pair rectification. A novel rectification algorithm is presented, which transforms a stereo pair in such a way that corresponding points in the two images lie on horizontal lines with the same index. Experimental tests prove the correct behavior of the method, as well as the negligible loss of accuracy in 3-D reconstruction when performed directly from the rectified images.

    Stereo matching. The problem of computational stereopsis is analyzed, and a new, efficient stereo matching algorithm addressing robust disparity estimation in the presence of occlusions is presented. The algorithm, called SMW, is an adaptive, multi-window scheme using left-right consistency to compute disparity and its associated uncertainty. Experiments with both synthetic and real stereo pairs show how SMW improves on closely related techniques in both accuracy and efficiency.

    Feature tracking. The Shi-Tomasi-Kanade feature tracker is improved by introducing an automatic scheme for rejecting spurious features, based on robust outlier diagnostics. Experiments with real and synthetic images confirm the improvement over the original tracker, both qualitatively and quantitatively.

    Uncalibrated vision. A review of techniques for computing a three-dimensional model of a scene from a single moving camera, with unconstrained motion and unknown parameters, is presented. The contribution is to give a critical, unified view of some of the most promising techniques; such a review does not yet exist in the literature.

    3-D motion. A robust algorithm for registering and finding correspondences between two sets of 3-D points with a significant percentage of missing data is proposed. The method, called RICP, exploits Least Median of Squares (LMedS) robust estimation to withstand the effect of outliers. Experimental comparison with a closely related technique, ICP, shows RICP's superior robustness and reliability.
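    As a concrete illustration of the left-right consistency idea used by SMW-style matchers, here is a minimal sketch that validates a left disparity map against the right one; the disparity convention and the one-pixel tolerance are illustrative assumptions, not the thesis's exact formulation.

    import numpy as np

    def lr_consistency(disp_left, disp_right, tol=1.0):
        """Keep a left-image disparity only if the right image maps back to it."""
        h, w = disp_left.shape
        valid = np.zeros((h, w), dtype=bool)
        for y in range(h):
            for x in range(w):
                d = int(round(disp_left[y, x]))
                xr = x - d                     # matching column in the right image
                if 0 <= xr < w:
                    # Consistent if the right-to-left disparity agrees within tol;
                    # occluded or mismatched pixels typically fail this check.
                    valid[y, x] = abs(disp_right[y, xr] - d) <= tol
        return valid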