
    SMOTE for Learning from Imbalanced Data: Progress and Challenges, Marking the 15-year Anniversary

    The Synthetic Minority Oversampling Technique (SMOTE) preprocessing algorithm is considered the "de facto" standard in the framework of learning from imbalanced data. This is due to the simplicity of the procedure's design, as well as its robustness when applied to different types of problems. Since its publication in 2002, SMOTE has proven successful in a variety of applications from several different domains. SMOTE has also inspired several approaches to counter the issue of class imbalance, and has significantly contributed to new supervised learning paradigms, including multilabel classification, incremental learning, semi-supervised learning and multi-instance learning, among others. It is a standard benchmark for learning from imbalanced data, and it is featured in a number of different software packages, from open source to commercial. In this paper, marking the fifteen-year anniversary of SMOTE, we reflect on the SMOTE journey, discuss the current state of affairs with SMOTE and its applications, and identify the next set of challenges to extend SMOTE for Big Data problems. This work has been partially supported by the Spanish Ministry of Science and Technology under projects TIN2014-57251-P, TIN2015-68454-R and TIN2017-89517-P; the Project 887 BigDaP-TOOLS - Ayudas Fundación BBVA a Equipos de Investigación Científica 2016; and the National Science Foundation (NSF) Grant IIS-1447795.
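    As a reminder of what the technique itself does, the core SMOTE step interpolates between a minority-class sample and one of its k nearest minority neighbours. The sketch below is an illustrative NumPy/scikit-learn reconstruction of that step, not code from the paper; the function name smote_sample and its parameters are assumptions.

        import numpy as np
        from sklearn.neighbors import NearestNeighbors

        def smote_sample(X_min, n_synthetic, k=5, random_state=0):
            """Interpolate between minority samples and their k nearest minority
            neighbours to create synthetic minority points (illustrative sketch)."""
            rng = np.random.default_rng(random_state)
            nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
            neighbours = nn.kneighbors(X_min, return_distance=False)[:, 1:]  # drop self
            synthetic = np.empty((n_synthetic, X_min.shape[1]))
            for i in range(n_synthetic):
                j = rng.integers(len(X_min))              # pick a minority sample at random
                nb = X_min[rng.choice(neighbours[j])]     # pick one of its minority neighbours
                synthetic[i] = X_min[j] + rng.random() * (nb - X_min[j])
            return synthetic

    In practice the imbalanced-learn package ships a maintained SMOTE implementation built around the same interpolation idea.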

    Machine learning techniques for music information retrieval

    Doctoral thesis, Informática (Engenharia Informática), Universidade de Lisboa, Faculdade de Ciências, 2015. The advent of digital music has changed the rules of music consumption, distribution and sales. With it has emerged the need to effectively search and manage vast music collections. Music information retrieval is an interdisciplinary field of research that focuses on the development of new techniques with that aim in mind. This dissertation addresses a specific aspect of this field: methods that automatically extract musical information exclusively based on the audio signal. We propose a method for automatic music-based classification, label inference, and music similarity estimation. Our method consists of representing the audio with a finite set of symbols and then modeling the symbols' time evolution. The symbols are obtained via vector quantization, in which a single codebook is used to quantize the audio descriptors. The symbols' time evolution is modeled via a first-order Markov process. Based on systematic evaluations we carried out on publicly available sets, we show that our method achieves performance on par with most techniques found in the literature. We also present and discuss the problems that appear when computers try to classify or annotate songs using the audio as the only source of information. In our method, the separation of the quantization process from the creation and training of classification models helped us in that analysis. It enabled us to examine how instantaneous sound attributes (henceforth features) are distributed in terms of musical genre, and how designing codebooks specially tailored for these distributions affects the performance of our system and other classification systems commonly used for this task. On this issue, we show that there is no apparent benefit in seeking a thorough representation of the feature space. This is somewhat unexpected, since it goes against the assumption, implicit in many genre recognition methods, that features carry equally relevant information loads and somehow capture the specificities of musical facets. Label inference is the task of automatically annotating songs with semantic words; this task is also known as autotagging. In this context, we illustrate the importance of a number of issues that, in our perspective, are often overlooked. We show that current techniques are fragile in the sense that small alterations in the set of labels may lead to dramatically different results. Furthermore, through a series of experiments, we show that autotagging systems fail to learn tag models capable of generalizing to datasets of different origins. We also show that the performance achieved with these techniques is not sufficient to take advantage of the correlations between tags. Fundação para a Ciência e a Tecnologia (FCT)
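    The codebook-plus-Markov pipeline described above can be pictured with a short sketch: quantise descriptor frames with a single k-means codebook, then estimate a first-order transition matrix over the resulting symbol sequence. This is an illustrative reconstruction, not the thesis code; the codebook size and the add-alpha smoothing are assumptions.

        import numpy as np
        from sklearn.cluster import KMeans

        def fit_codebook(frames, n_symbols=64, random_state=0):
            """Learn a single k-means codebook over audio descriptor frames."""
            return KMeans(n_clusters=n_symbols, n_init=10, random_state=random_state).fit(frames)

        def transition_matrix(symbols, n_symbols, alpha=1.0):
            """First-order Markov model of a symbol sequence, with add-alpha smoothing."""
            counts = np.full((n_symbols, n_symbols), alpha)
            for a, b in zip(symbols[:-1], symbols[1:]):
                counts[a, b] += 1
            return counts / counts.sum(axis=1, keepdims=True)

        # Usage sketch: one codebook for the whole collection, one Markov model per song.
        # codebook = fit_codebook(all_training_frames)
        # P_song = transition_matrix(codebook.predict(song_frames), codebook.n_clusters)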

    Enhancing deep transfer learning for image classification

    Deep learning models are applied to many computer vision tasks, such as image classification, yet they require a large amount of labelled training data to yield high performance. Current models also do not perform well across different domain settings, such as changes in illumination, camera angle and real-to-synthetic shifts, and are therefore more likely to misclassify unknown classes as known classes. These issues challenge the supervised learning paradigm of the models and motivate the study of transfer learning approaches. Transfer learning allows us to utilise the knowledge acquired from related domains to improve performance on a target domain. Existing transfer learning approaches lack proper analysis of high-level source-domain features and are prone to negative transfer because they do not exploit proper discriminative information across domains. Current approaches also fail to discover the necessary visual-semantic linkage and are biased towards the source domain. In this thesis, to address these issues and improve image classification performance, we make several contributions to three different deep transfer learning scenarios, i.e., the target domain has i) labelled data; ii) no labelled data; and iii) no visual data. Firstly, to improve inductive transfer learning for the first scenario, we analyse the importance of high-level deep features, propose utilising them in sequential transfer learning approaches, and investigate the conditions suitable for optimal performance. Secondly, to improve image classification across different domains in an open-set setting by reducing negative transfer (second scenario), we propose two novel architectures. The first model has an adaptive weighting module based on underlying domain-distinctive information, and the second model has an information-theoretic weighting module to reduce negative transfer. Thirdly, to learn visual classifiers when no visual data is available (third scenario) and reduce source-domain bias, we propose two novel models. One model has a new two-step dense attention mechanism to discover semantic attribute-guided local visual features, together with a mutual learning loss. The other model utilises bidirectional mapping and adversarial supervision to learn the joint distribution of the source and target domains simultaneously. We propose a new pointwise mutual information-dependent loss in the first model and a distance-based loss in the second one for handling source-domain bias. We perform extensive evaluations on benchmark datasets and demonstrate that the proposed models outperform contemporary works. Doctor of Philosophy
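    For the first scenario (labelled target data), reusing high-level source-domain features and retraining only the classifier is the standard fine-tuning recipe; the PyTorch sketch below illustrates that generic recipe under assumed choices (ResNet-18 backbone, frozen features, new linear head) and is not the thesis's specific architecture.

        import torch
        import torch.nn as nn
        from torchvision import models

        def build_transfer_model(num_target_classes):
            """Reuse a pretrained backbone's high-level features; retrain only the head."""
            backbone = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
            for param in backbone.parameters():
                param.requires_grad = False                    # freeze source-domain features
            backbone.fc = nn.Linear(backbone.fc.in_features, num_target_classes)
            return backbone                                    # only backbone.fc is trainable

        model = build_transfer_model(num_target_classes=10)
        optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)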

    Towards Data Privacy and Utility in the Applications of Graph Neural Networks

    Graph Neural Networks (GNNs) are essential for handling graph-structured data, which often contains sensitive information, so it is vital to maintain a balance between data privacy and utility. To address this, this dissertation introduces three studies aimed at enhancing privacy and utility in GNN applications, particularly in node classification, link prediction, and graph classification. The first work tackles celebrity privacy in social networks. We develop a novel framework using adversarial learning for link-privacy-preserving graph embedding, which effectively safeguards sensitive links without compromising the graph's structure or node attributes. This approach is validated using real social network data. In the second work, we confront challenges in federated graph learning with non-independent and identically distributed (non-IID) data. We introduce PPFL-GNN, a privacy-preserving federated graph neural network framework that mitigates overfitting on the client side and inefficient aggregation on the server side. It leverages local graph data for embeddings and employs embedding alignment techniques for enhanced privacy, addressing the hurdles in federated learning on non-IID graph data. The third work explores few-shot graph classification, which aims to classify novel graph types with limited labeled data. We propose a unique framework combining meta-learning and contrastive learning to better utilize graph structures in molecular and social network datasets. Additionally, we offer benchmark graph datasets with extensive node-attribute dimensions for future research. These studies collectively advance the field of graph-based machine learning by addressing critical issues of data privacy and utility in GNN applications.
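    As a generic illustration of adversarial learning for link privacy (not the dissertation's framework; the module names, loss form and weighting factor are assumptions), an adversary can be trained to predict sensitive links from pairs of node embeddings while the encoder is penalised whenever it succeeds:

        import torch
        import torch.nn as nn
        import torch.nn.functional as F

        class LinkAdversary(nn.Module):
            """Tries to predict whether a sensitive link exists between two nodes."""
            def __init__(self, dim):
                super().__init__()
                self.mlp = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, 1))

            def forward(self, z_u, z_v):
                return self.mlp(torch.cat([z_u, z_v], dim=-1))   # edge logit

        def encoder_objective(task_loss, adversary, z, sensitive_edges, lam=1.0):
            """Utility loss plus a penalty that pushes the adversary towards failure
            on the sensitive links (illustrative adversarial formulation)."""
            u, v = sensitive_edges                               # index tensors of sensitive pairs
            logits = adversary(z[u], z[v])
            adv_success = F.binary_cross_entropy_with_logits(logits, torch.ones_like(logits))
            return task_loss - lam * adv_success                 # encoder maximises the adversary's error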

    Exploiting word embeddings for modeling bilexical relations

    There has been an exponential surge of text data in recent years. As a consequence, unsupervised methods that make use of this data have been steadily growing in the field of natural language processing (NLP). Word embeddings are low-dimensional vectors obtained using unsupervised techniques on large unlabelled corpora, where words from the vocabulary are mapped to vectors of real numbers. Word embeddings aim to capture syntactic and semantic properties of words. In NLP, many tasks involve computing the compatibility between lexical items under some linguistic relation. We call this type of relation a bilexical relation. Our thesis defines statistical models for bilexical relations that centrally make use of word embeddings. Our principal aim is that the word embeddings will favor generalization to words not seen during the training of the model. The thesis is structured in four parts. In the first part of this thesis, we present a bilinear model over word embeddings that leverages a small supervised dataset for a binary linguistic relation. Our learning algorithm exploits low-rank bilinear forms and induces a low-dimensional embedding tailored for a target linguistic relation. This results in compressed task-specific embeddings. In the second part of our thesis, we extend our bilinear model to a ternary setting and propose a framework for resolving prepositional phrase attachment ambiguity using word embeddings. Our models perform competitively with state-of-the-art models. In addition, our method obtains significant improvements on out-of-domain tests by simply using word embeddings induced from source and target domains. In the third part of this thesis, we further extend the bilinear models for expanding vocabulary in the context of statistical phrase-based machine translation. Our model obtains a probabilistic list of possible translations of target-language words, given a word in the source language. We do this by projecting pre-trained embeddings into a common subspace using a log-bilinear model. We empirically observe a significant improvement on an out-of-domain test set. In the final part of our thesis, we propose a non-linear model that maps initial word embeddings to task-tuned word embeddings, in the context of a neural network dependency parser. We demonstrate its use for improved dependency parsing, especially for sentences with unseen words. We also show downstream improvements on a sentiment analysis task.
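    The low-rank bilinear form in the first part can be summarised in a few lines: the relation score is x^T U V^T y, so the projections U^T x and V^T y act as compressed, relation-specific embeddings. The NumPy sketch below is illustrative only; the dimensions, rank and random initialisation are assumptions (in the thesis U and V are learned from labelled pairs).

        import numpy as np

        rng = np.random.default_rng(0)
        d, r = 300, 20                               # embedding dimension and rank (illustrative)
        U = rng.normal(scale=0.01, size=(d, r))      # would be learned, not random
        V = rng.normal(scale=0.01, size=(d, r))

        def score(x, y):
            """Compatibility of word vectors x and y under the target bilexical relation:
            score(x, y) = x^T U V^T y, computed through the rank-r projections."""
            return (x @ U) @ (V.T @ y)

        x, y = rng.normal(size=d), rng.normal(size=d)
        print(score(x, y))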

    Proceedings of the 7th Sound and Music Computing Conference

    Proceedings of the SMC2010 - 7th Sound and Music Computing Conference, July 21st - July 24th 2010

    Tracking the Temporal-Evolution of Supernova Bubbles in Numerical Simulations

    The study of low-dimensional, noisy manifolds embedded in a higher-dimensional space has been extremely useful in many applications, from the chemical analysis of multi-phase flows to simulations of galactic mergers. Building a probabilistic model of the manifolds has helped in describing their essential properties and how they vary in space. However, when the manifold is evolving through time, a joint spatio-temporal modelling is needed in order to fully comprehend its nature. We propose a first-order Markovian process that propagates the spatial probabilistic model of a manifold at a fixed time to its adjacent temporal stages. The proposed methodology is demonstrated using a particle simulation of an interacting dwarf galaxy to describe the evolution of a cavity generated by a supernova.
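    The abstract does not spell out the propagation mechanism, so the sketch below shows only one plausible reading, stated as an assumption: fit a Gaussian mixture to the particle positions at each snapshot and warm-start each fit from the previous snapshot's parameters, giving a first-order dependence between adjacent temporal stages.

        import numpy as np
        from sklearn.mixture import GaussianMixture

        def track_snapshots(snapshots, n_components=8):
            """Fit one mixture model per snapshot, carrying the spatial model forward
            from t-1 to t as the initialisation (illustrative first-order scheme)."""
            models, previous = [], None
            for points in snapshots:                     # points: (n_particles, 3) array
                init = {}
                if previous is not None:
                    init = dict(means_init=previous.means_, weights_init=previous.weights_)
                gmm = GaussianMixture(n_components=n_components, **init).fit(points)
                models.append(gmm)
                previous = gmm
            return models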

    Proceedings of the 19th Sound and Music Computing Conference

    Proceedings of the 19th Sound and Music Computing Conference - June 5-12, 2022 - Saint-Étienne (France). https://smc22.grame.f