18 research outputs found

    Classification of Gene Expression Data: A Hubness-aware Semi-supervised Approach

    Background and Objective. Classification of gene expression data is the common denominator of various biomedical recognition tasks. However, obtaining class labels for large training samples may be difficult or even impossible in many cases. Therefore, semi-supervised classification techniques are required, as semi-supervised classifiers take advantage of unlabeled data. Methods. Gene expression data is high-dimensional, which gives rise to the phenomena known under the umbrella of the curse of dimensionality; one of its recently explored aspects is the presence of hubs, or hubness for short. Hubness-aware classifiers have therefore been developed recently, such as the Naive Hubness-Bayesian k-Nearest Neighbor (NHBNN). In this paper, we propose a semi-supervised extension of NHBNN which follows the self-training schema. As one of the core components of self-training is the certainty score, we propose a new hubness-aware certainty score. Results. We performed experiments on publicly available gene expression data. These experiments show that the proposed classifier outperforms its competitors. We investigated the impact of each component (classification algorithm, semi-supervised technique, hubness-aware certainty score) separately and showed that each of them is relevant to the performance of the proposed approach. Conclusions. Our results imply that our approach may increase classification accuracy and reduce computational costs (i.e., runtime). Based on the promising results presented in the paper, we envision that hubness-aware techniques will be used in various other biomedical machine learning tasks. To accelerate this process, we have made an implementation of hubness-aware machine learning techniques publicly available in the PyHubs software package (http://www.biointelligence.hu/pyhubs), implemented in Python, one of the most popular programming languages of data science.
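    The hubness statistic underlying this line of work can be illustrated with a short sketch. This is not the PyHubs implementation, only an illustration of the k-occurrence score N_k(x), which counts how often a point x appears among the k nearest neighbors of other points; the function name is ours.

```python
# Sketch: computing k-occurrence (hubness) scores with plain NumPy.
# Points with unusually large N_k are "hubs"; in high-dimensional
# data the N_k distribution tends to become strongly right-skewed.
import numpy as np

def k_occurrence(X, k=5):
    """Return N_k for each row of X under Euclidean distance."""
    n = X.shape[0]
    # pairwise squared Euclidean distances
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    np.fill_diagonal(d2, np.inf)          # a point is not its own neighbor
    counts = np.zeros(n, dtype=int)
    for i in range(n):
        # the k nearest neighbors of point i each gain one occurrence
        counts[np.argsort(d2[i])[:k]] += 1
    return counts

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 50))            # high-dimensional Gaussian data
nk = k_occurrence(X, k=5)
# Every point contributes exactly k neighbor slots, so mean(N_k) == k,
# while the maximum reveals how unevenly those slots are distributed.
print(nk.mean(), nk.max())
```

    Note that the mean of N_k is always exactly k; hubness is about the shape of the distribution around that mean, not its level.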

    Reprezentacije i metrike za mašinsko učenje i analizu podataka velikih dimenzija (Representations and Metrics for Machine Learning and High-Dimensional Data Analysis)

    In the current information age, massive amounts of data are gathered, at a rate prohibiting their effective structuring, analysis, and conversion into useful knowledge. This information overload is manifested both in large numbers of data objects recorded in data sets, and large numbers of attributes, also known as high dimensionality. This dissertation deals with problems originating from high dimensionality of data representation, referred to as the “curse of dimensionality,” in the context of machine learning, data mining, and information retrieval. The described research follows two angles: studying the behavior of (dis)similarity metrics with increasing dimensionality, and exploring feature-selection methods, primarily with regard to document representation schemes for text classification. The main results of the dissertation, relevant to the first research angle, include theoretical insights into the concentration behavior of cosine similarity, and a detailed analysis of the phenomenon of hubness, which refers to the tendency of some points in a data set to become hubs by being included in unexpectedly many k-nearest neighbor lists of other points. The mechanisms behind the phenomenon are studied in detail, both from a theoretical and empirical perspective, linking hubness with the (intrinsic) dimensionality of data, describing its interaction with the cluster structure of data and the information provided by class labels, and demonstrating the interplay of the phenomenon and well-known algorithms for classification, semi-supervised learning, clustering, and outlier detection, with special consideration being given to time-series classification and information retrieval.
Results pertaining to the second research angle include quantification of the interaction between various transformations of high-dimensional document representations and feature selection, in the context of text classification.
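    The concentration of cosine similarity mentioned among the dissertation's theoretical results can be demonstrated empirically in a few lines. This is our illustrative sketch, not the dissertation's analysis; the function name and parameters are ours.

```python
# Sketch: as dimensionality grows, pairwise cosine similarities of
# random points concentrate (their spread shrinks), so the measure
# becomes less discriminative.
import numpy as np

def cosine_similarity_spread(dim, n=200, seed=0):
    """Std. dev. of pairwise cosine similarities of n random points."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n, dim))
    X /= np.linalg.norm(X, axis=1, keepdims=True)   # project to unit sphere
    S = X @ X.T                                     # cosine similarity matrix
    iu = np.triu_indices(n, k=1)                    # distinct pairs only
    return S[iu].std()

spreads = [cosine_similarity_spread(d) for d in (5, 50, 500)]
# For Gaussian data the spread shrinks roughly like 1/sqrt(dim).
print(spreads)
```

    The shrinking spread is one face of the curse of dimensionality: with all similarities crowding around the same value, nearest-neighbor distinctions rest on ever thinner margins.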

    Combinando semi-supervisão e hubness para aprimorar o agrupamento de dados em alta dimensão (Combining Semi-Supervision and Hubness to Improve High-Dimensional Data Clustering)

    The curse of dimensionality makes high-dimensional data analysis a challenging task for data clustering techniques. Recent works have efficiently exploited an aspect inherent to high-dimensional data in clustering approaches guided by hubs, which provide information about the distribution of data instances among the k-nearest neighbor lists. However, hubs may not reflect the implicit semantics of the data well, leading to an unsuitable data partition. In order to cope with both issues (i.e., high-dimensional data and meaningful clusters), this dissertation presents a clustering approach that combines two strategies: semi-supervision and density estimation based on hubness scores. The experimental results conducted on 23 real datasets show that the proposed approach performs well when applied to datasets with different characteristics. (Master's dissertation; supported by CAPES - Coordenação de Aperfeiçoamento de Pessoal de Nível Superior and CNPq - Conselho Nacional de Desenvolvimento Científico e Tecnológico.)
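    The idea of using hubness scores as a density estimate for clustering can be sketched briefly. This is our illustrative reconstruction, not the dissertation's algorithm (which additionally incorporates semi-supervision); all names here are hypothetical.

```python
# Sketch: treat k-occurrence (hubness) scores as a cheap local-density
# estimate and seed clusters at the strongest hubs, assigning every
# point to its nearest seed.
import numpy as np

def hub_seeded_clustering(X, n_clusters=2, k=5):
    n = X.shape[0]
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    np.fill_diagonal(d2, np.inf)
    nk = np.zeros(n, dtype=int)               # k-occurrence scores
    for i in range(n):
        nk[np.argsort(d2[i])[:k]] += 1        # i's k neighbors gain one hit
    seeds = np.argsort(nk)[-n_clusters:]      # strongest hubs become seeds
    np.fill_diagonal(d2, 0.0)                 # a seed is closest to itself
    labels = d2[:, seeds].argmin(axis=1)      # nearest-seed assignment
    return labels, seeds

rng = np.random.default_rng(1)
# two well-separated blobs in 4 dimensions
X = np.vstack([rng.normal(0.0, 0.3, (30, 4)),
               rng.normal(5.0, 0.3, (30, 4))])
labels, seeds = hub_seeded_clustering(X, n_clusters=2)
```

    The appeal of this strategy is that hubs are found from neighbor lists alone, with no bandwidth parameter, which is exactly where hubness is informative in high dimensions; its weakness, as the abstract notes, is that hubs need not align with the data's semantics, motivating the added semi-supervision.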