
    Isometry and convexity in dimensionality reduction

    The volume of data generated every year grows exponentially. Both the number of data points and their dimensionality have increased dramatically over the past 15 years, and the gap between industrial demand for data processing and the solutions provided by the machine learning community keeps widening. Despite the growth in memory and computational power, advanced statistical processing of data on the order of gigabytes remains out of reach: most sophisticated machine learning algorithms have at least quadratic complexity, and on current computer architectures algorithms with complexity higher than O(N) or O(N log N) are not considered practical. Dimensionality reduction is a challenging problem in machine learning. Data represented as multidimensional points often have high dimensionality, yet the information they carry can be expressed with far fewer dimensions; moreover, the reduced dimensions can be more interpretable than the original ones. There is a great variety of dimensionality reduction algorithms under the theory of Manifold Learning. Most of the methods, such as Isomap, Local Linear Embedding, Local Tangent Space Alignment, and Diffusion Maps, have been extensively studied under the framework of Kernel Principal Component Analysis (KPCA). In this dissertation we study two state-of-the-art dimensionality reduction methods, Maximum Variance Unfolding (MVU) and Non-Negative Matrix Factorization (NMF), which do not fit under the umbrella of Kernel PCA. MVU is cast as a Semidefinite Program, a modern class of convex nonlinear optimization problems that offers more flexibility and power than KPCA. Although MVU and NMF seem to be two disconnected problems, we show that there is a connection between them: both are special cases of a general nonlinear factorization algorithm that we developed. Two aspects of the algorithms are of particular interest: computational complexity and interpretability. Computational complexity answers the question of how fast we can find the best solution of MVU/NMF for large data volumes. Since we are dealing with optimization programs, we need to find the global optimum, and attaining the global optimum is strongly connected with the convexity of the problem. Interpretability is strongly connected with local isometry, which gives meaning to relationships between data points; another aspect of interpretability is the association of data with labeled information. The contributions of this thesis are the following:
    1. MVU is modified so that it scales more efficiently. Results are shown on speech datasets of up to 1 million points, and the limitations of the method are highlighted.
    2. An algorithm for fast computation of furthest neighbors is presented for the first time in the literature.
    3. Construction of optimal kernels for Kernel Density Estimation with modern convex programming is presented. For the first time we show that the Leave-One-Out Cross-Validation (LOOCV) function is quasi-concave.
    4. For the first time, NMF is formulated as a convex optimization problem.
    5. An algorithm for the problem of Completely Positive Matrix Factorization is presented.
    6. isoNMF, a hybrid of MVU and NMF that combines the advantages of both methods, is presented.
    7. Isometric Separation Maps (ISM), a variation of MVU that incorporates classification information, is presented.
    8. Large-scale nonlinear dimensionality analysis is performed on the TIMIT speech database.
    9. A general nonlinear factorization algorithm based on sequential convex programming is presented.
    Despite the efforts to scale the proposed methods up to 1 million data points in reasonable time, the gap between industrial demand and the current state of the art is still orders of magnitude wide.
    Ph.D. Committee Chair: David Anderson; Committee Co-Chair: Alexander Gray; Committee Member: Anthony Yezzi; Committee Member: Hongyuan Zha; Committee Member: Justin Romberg; Committee Member: Ronald Schafer
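
    The abstract does not spell out the MVU optimization problem; as a rough illustration of how MVU is typically cast as a semidefinite program (a minimal sketch of the standard formulation, not necessarily the exact variant or scaling tricks used in the dissertation; all names and the toy data are illustrative), the following uses CVXPY to maximize the total variance of a Gram matrix subject to centering and local-isometry constraints on nearest neighbors.

        # Minimal sketch of Maximum Variance Unfolding as a semidefinite program.
        # Assumes a small dataset X (N x D) and a k-nearest-neighbor graph.
        import numpy as np
        import cvxpy as cp
        from sklearn.neighbors import NearestNeighbors

        def mvu_embed(X, k=4, dim=2):
            N = X.shape[0]
            nbrs = NearestNeighbors(n_neighbors=k + 1).fit(X)
            _, idx = nbrs.kneighbors(X)              # idx[i, 0] is i itself

            K = cp.Variable((N, N), PSD=True)        # Gram matrix of the unfolded points
            constraints = [cp.sum(K) == 0]           # center the embedding at the origin
            for i in range(N):
                for j in idx[i, 1:]:
                    d2 = np.sum((X[i] - X[j]) ** 2)  # preserve local distances (isometry)
                    constraints.append(K[i, i] - 2 * K[i, j] + K[j, j] == d2)

            # "Unfold" by maximizing total variance, i.e. the trace of K.
            cp.Problem(cp.Maximize(cp.trace(K)), constraints).solve(solver=cp.SCS)

            # Top eigenvectors of K give the low-dimensional coordinates.
            w, V = np.linalg.eigh(K.value)
            return V[:, -dim:] * np.sqrt(np.maximum(w[-dim:], 0))

        # Example: unfold a curve embedded in 3-D down to 2-D.
        t = np.linspace(0, 3, 40)
        X = np.c_[np.cos(t), np.sin(t), t]
        Y = mvu_embed(X, k=4, dim=2)

    Solving this SDP directly scales poorly with N, which is precisely the bottleneck the dissertation's scalable MVU variant targets.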

    Towards a machine-learning architecture for lexical functional grammar parsing

    Data-driven grammar induction aims at producing wide-coverage grammars of human languages. Initial efforts in this field produced relatively shallow linguistic representations such as phrase-structure trees, which only encode constituent structure. Recent work on inducing deep grammars from treebanks addresses this shortcoming by also recovering non-local dependencies and grammatical relations. My aim is to investigate the issues arising when adapting an existing Lexical Functional Grammar (LFG) induction method to a new language and treebank, and to find solutions which will generalize robustly across multiple languages. The research hypothesis is that by exploiting machine-learning algorithms to learn morphological features, lemmatization classes and grammatical functions from treebanks we can reduce the amount of manual specification and improve robustness, accuracy, and domain- and language-independence for LFG parsing systems. Function labels can often be mapped relatively straightforwardly to LFG grammatical functions. Learning them reliably permits grammar induction to depend less on language-specific LFG annotation rules. I therefore propose ways to improve acquisition of function labels from treebanks and translate those improvements into better-quality f-structure parsing. In a lexicalized grammatical formalism such as LFG, a large amount of syntactically relevant information comes from lexical entries. It is therefore important to be able to perform morphological analysis in an accurate and robust way for morphologically rich languages. I propose a fully data-driven supervised method to simultaneously lemmatize and morphologically analyze text, and obtain competitive or improved results on a range of typologically diverse languages.
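
    The abstract does not detail the lemmatization method; one common data-driven formulation, shown here purely as an illustration and not as the author's actual approach, treats lemmatization as classification over suffix-edit rules induced from (form, lemma) pairs in a treebank. All function names and the toy data below are hypothetical.

        # Illustrative sketch: lemmatization as classification over suffix-transformation
        # rules learned from (form, lemma, POS) triples.
        from collections import Counter

        def suffix_rule(form, lemma):
            """Shortest rule 'strip K chars, append S' turning form into lemma."""
            i = 0
            while i < min(len(form), len(lemma)) and form[i] == lemma[i]:
                i += 1
            return (len(form) - i, lemma[i:])          # (chars to strip, suffix to add)

        def apply_rule(form, rule):
            strip, add = rule
            return (form[:-strip] if strip else form) + add

        # "Training": count which rule each (word-final suffix, POS) context prefers.
        train = [("walked", "walk", "VBD"), ("cities", "city", "NNS"), ("talked", "talk", "VBD")]
        stats = Counter()
        for form, lemma, pos in train:
            stats[(form[-2:], pos, suffix_rule(form, lemma))] += 1

        def lemmatize(form, pos):
            # Pick the most frequent rule seen for this (suffix, POS) context.
            candidates = [(n, r) for (sfx, p, r), n in stats.items()
                          if p == pos and form.endswith(sfx)]
            return apply_rule(form, max(candidates)[1]) if candidates else form

        print(lemmatize("jumped", "VBD"))   # -> "jump"
        print(lemmatize("ponies", "NNS"))   # -> "pony"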

    Learning on relevance feedback in content-based image retrieval.

    Hoi, Chu-Hong. Thesis (M.Phil.)--Chinese University of Hong Kong, 2004. Includes bibliographical references (leaves 89-103). Abstracts in English and Chinese.
    Abstract --- p.i
    Acknowledgement --- p.iv
    Chapter 1 --- Introduction --- p.1
        1.1 Content-based Image Retrieval --- p.1
        1.2 Relevance Feedback --- p.3
        1.3 Contributions --- p.4
        1.4 Organization of This Work --- p.6
    Chapter 2 --- Background --- p.8
        2.1 Relevance Feedback --- p.8
            2.1.1 Heuristic Weighting Methods --- p.9
            2.1.2 Optimization Formulations --- p.10
            2.1.3 Various Machine Learning Techniques --- p.11
        2.2 Support Vector Machines --- p.12
            2.2.1 Setting of the Learning Problem --- p.12
            2.2.2 Optimal Separating Hyperplane --- p.13
            2.2.3 Soft-Margin Support Vector Machine --- p.15
            2.2.4 One-Class Support Vector Machine --- p.16
    Chapter 3 --- Relevance Feedback with Biased SVM --- p.18
        3.1 Introduction --- p.18
        3.2 Biased Support Vector Machine --- p.19
        3.3 Relevance Feedback Using Biased SVM --- p.22
            3.3.1 Advantages of BSVM in Relevance Feedback --- p.22
            3.3.2 Relevance Feedback Algorithm by BSVM --- p.23
        3.4 Experiments --- p.24
            3.4.1 Datasets --- p.24
            3.4.2 Image Representation --- p.25
            3.4.3 Experimental Results --- p.26
        3.5 Discussions --- p.29
        3.6 Summary --- p.30
    Chapter 4 --- Optimizing Learning with SVM Constraint --- p.31
        4.1 Introduction --- p.31
        4.2 Related Work and Motivation --- p.33
        4.3 Optimizing Learning with SVM Constraint --- p.35
            4.3.1 Problem Formulation and Notations --- p.35
            4.3.2 Learning Boundaries with SVM --- p.35
            4.3.3 OPL for the Optimal Distance Function --- p.38
            4.3.4 Overall Similarity Measure with OPL and SVM --- p.40
        4.4 Experiments --- p.41
            4.4.1 Datasets --- p.41
            4.4.2 Image Representation --- p.42
            4.4.3 Performance Evaluation --- p.43
            4.4.4 Complexity and Time Cost Evaluation --- p.45
        4.5 Discussions --- p.47
        4.6 Summary --- p.48
    Chapter 5 --- Group-based Relevance Feedback --- p.49
        5.1 Introduction --- p.49
        5.2 SVM Ensembles --- p.50
        5.3 Group-based Relevance Feedback Using SVM Ensembles --- p.51
            5.3.1 (x+1)-class Assumption --- p.51
            5.3.2 Proposed Architecture --- p.52
            5.3.3 Strategy for SVM Combination and Group Aggregation --- p.52
        5.4 Experiments --- p.54
            5.4.1 Experimental Implementation --- p.54
            5.4.2 Performance Evaluation --- p.55
        5.5 Discussions --- p.56
        5.6 Summary --- p.57
    Chapter 6 --- Log-based Relevance Feedback --- p.58
        6.1 Introduction --- p.58
        6.2 Related Work and Motivation --- p.60
        6.3 Log-based Relevance Feedback Using SLSVM --- p.61
            6.3.1 Problem Statement --- p.61
            6.3.2 Soft Label Support Vector Machine --- p.62
            6.3.3 LRF Algorithm by SLSVM --- p.64
        6.4 Experimental Results --- p.66
            6.4.1 Datasets --- p.66
            6.4.2 Image Representation --- p.66
            6.4.3 Experimental Setup --- p.67
            6.4.4 Performance Comparison --- p.68
        6.5 Discussions --- p.73
        6.6 Summary --- p.75
    Chapter 7 --- Application: Web Image Learning --- p.76
        7.1 Introduction --- p.76
        7.2 A Learning Scheme for Searching Semantic Concepts --- p.77
            7.2.1 Searching and Clustering Web Images --- p.78
            7.2.2 Learning Semantic Concepts with Relevance Feedback --- p.73
        7.3 Experimental Results --- p.79
            7.3.1 Dataset and Features --- p.79
            7.3.2 Performance Evaluation --- p.80
        7.4 Discussions --- p.82
        7.5 Summary --- p.82
    Chapter 8 --- Conclusions and Future Work --- p.84
        8.1 Conclusions --- p.84
        8.2 Future Work --- p.85
    Appendix A --- List of Publications --- p.87
    Bibliography --- p.10
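
    The record above is only the thesis table of contents. As generic background on how SVM-based relevance feedback re-ranks images, here is a minimal sketch using a plain soft-margin SVM; it is not the Biased SVM, Soft Label SVM, or ensemble variants the thesis proposes, and the feature vectors are random stand-ins.

        # Sketch of SVM-based relevance feedback for image retrieval: fit an SVM
        # on the user's relevant/irrelevant feedback, then re-rank the database
        # by decision value.
        import numpy as np
        from sklearn.svm import SVC

        rng = np.random.default_rng(0)
        database = rng.normal(size=(500, 64))          # precomputed image features

        # One round of user feedback: indices marked relevant (+1) / irrelevant (-1).
        relevant, irrelevant = [3, 17, 42], [5, 8, 99, 120]
        X = database[relevant + irrelevant]
        y = np.array([1] * len(relevant) + [-1] * len(irrelevant))

        clf = SVC(kernel="rbf", C=10.0, gamma="scale").fit(X, y)

        # Images farthest on the relevant side of the boundary are returned first.
        scores = clf.decision_function(database)
        ranking = np.argsort(-scores)
        print(ranking[:10])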

    High impact bug report identification with imbalanced learning strategies

    Supplementary code and data are available from GitHub: https://github.com/goddding/JCST

    Efficient data mining algorithms for time series and complex medical data


    Taxonomy of datasets in graph learning : a data-driven approach to improve GNN benchmarking

    The core research of this thesis, mostly comprising chapter four, has been accepted to the Learning on Graphs (LoG) 2022 conference for a spotlight presentation as a standalone paper, under the title "Taxonomy of Benchmarks in Graph Representation Learning", and is to be published in the Proceedings of Machine Learning Research (PMLR) series. As a main author of the paper, my specific contributions cover problem formulation, design and implementation of our taxonomy framework and experimental pipeline, collation of our results, and the writing of the article.
    Deep learning on graphs has attained unprecedented levels of success in recent years thanks to Graph Neural Networks (GNNs), specialized neural network architectures that have unequivocally surpassed prior graph learning approaches. GNNs extend the success of neural networks to graph-structured data by accounting for their intrinsic geometry. While extensive research has been done on developing GNNs with superior performance on a collection of graph representation learning benchmarks, current benchmarking procedures are insufficient to provide fair and effective evaluations of GNN models. Perhaps the most prevalent and at the same time least understood problem with respect to graph benchmarking is "domain coverage": despite the growing number of available graph datasets, most of them do not provide additional insights and on the contrary reinforce potentially harmful biases in GNN model development. This problem stems from a lack of understanding of which aspects of a given model are probed by graph datasets. For example, to what extent do they test the ability of a model to leverage graph structure vs. node features? Here, we develop a principled approach to taxonomize benchmarking datasets according to a "sensitivity profile" based on how much GNN performance changes under a collection of graph perturbations. Our data-driven analysis provides a deeper understanding of which benchmarking data characteristics are leveraged by GNNs. Consequently, our taxonomy can aid the selection and development of adequate graph benchmarks and a better-informed evaluation of future GNN methods. Finally, our approach and implementation in the GTaxoGym package (https://github.com/G-Taxonomy-Workgroup/GTaxoGym) are extendable to multiple graph prediction task types and future datasets.
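
    The abstract describes the sensitivity profile only at a high level; the sketch below is a hypothetical illustration of the idea (a couple of graph perturbations plus a user-supplied evaluation routine), not the GTaxoGym implementation. The evaluate() stub and the perturbation set are placeholders.

        # Hypothetical "sensitivity profile": how much a model's score drops when
        # a benchmark graph is perturbed. evaluate() stands in for training and
        # evaluating an actual GNN (e.g. via PyTorch Geometric).
        import networkx as nx
        import numpy as np

        def perturbations(G):
            """A few illustrative graph perturbations (not GTaxoGym's exact set)."""
            no_feat = G.copy()
            for _, data in no_feat.nodes(data=True):
                data["x"] = [1.0]                       # constant node features
            rewired = G.copy()
            nx.double_edge_swap(rewired, nswap=G.number_of_edges() // 2, max_tries=10**5)
            return {"constant_features": no_feat, "rewired_edges": rewired}

        def evaluate(G):
            """Stub: return a task score (e.g. node-classification accuracy) for G."""
            rng = np.random.default_rng(G.number_of_edges())
            return float(rng.uniform(0.6, 0.9))         # placeholder score

        def sensitivity_profile(G):
            base = evaluate(G)
            return {name: (base - evaluate(Gp)) / base  # relative performance change
                    for name, Gp in perturbations(G).items()}

        G = nx.karate_club_graph()
        for n in G.nodes:
            G.nodes[n]["x"] = [1.0, 0.0]
        print(sensitivity_profile(G))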