Classification of Gene Expression Data: A Hubness-aware Semi-supervised Approach
Background and Objective.
Classification of gene expression data is the common denominator of various biomedical recognition tasks.
However, obtaining class labels for large training samples may be difficult or even impossible in
many cases. Therefore, semi-supervised classification techniques, which take advantage of unlabeled
data, are required.
Methods. Gene expression data is high-dimensional, which gives rise to phenomena collectively known as the curse of dimensionality; one of its recently explored
aspects is the presence of hubs, or hubness for short. Therefore, hubness-aware classifiers
have been developed recently, such as the Naive Hubness-Bayesian k-Nearest Neighbor (NHBNN). In this paper, we propose a semi-supervised extension of NHBNN which follows the self-training schema. As one of the core components of self-training is the certainty score, we propose a new hubness-aware certainty score.
Results. We performed experiments on publicly available gene expression data. These experiments show that the proposed classifier outperforms its competitors. We investigated the impact of each of the components (classification algorithm, semi-supervised technique, hubness-aware certainty score) separately and showed that each of these components is relevant to the performance of the proposed approach.
Conclusions. Our results imply that our approach may increase classification accuracy and reduce computational costs (i.e., runtime). Based on the promising results presented in the paper, we envision that hubness-aware techniques will be used in various other biomedical machine learning tasks. In order to accelerate this process, we made an implementation of hubness-aware machine learning techniques publicly available in the PyHubs software package (http://www.biointelligence.hu/pyhubs), implemented in Python, one of the most popular programming languages of data science.
Noise simulation in classification with the noisemodel R package: Applications analyzing the impact of errors with chemical data
Classification datasets created from chemical processes can be affected by errors, which impair the accuracy of the models built. This fact highlights the importance of analyzing the robustness of classifiers against different types and levels of noise in order to understand their behavior in the presence of potential errors. In this context, noise models have been proposed to study noise-related phenomenology in a controlled environment, allowing errors to be introduced into the data in a supervised manner. This paper introduces the noisemodel R package, which contains the first extensive implementation of noise models for classification datasets, proposing it as a support tool to analyze the impact of errors related to chemical data. It provides 72 noise models found in the specialized literature that allow errors to be introduced in different ways in classes and attributes. Each of them is properly documented and referenced, and their results are unified through a specific S3 class, which benefits from customized print, summary and plot methods. The usage of the package is illustrated through four application examples considering real-world chemical datasets, where errors are prone to occur. The software presented will help to deepen the understanding of the problem of noisy chemical data, as well as to develop new robust algorithms and noise preprocessing methods properly adapted to the different types of errors in this scenario.
University of Granada/CBU
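For concreteness, one of the simplest schemes of the kind the package catalogues, uniform completely-at-random label noise, can be sketched as below. This is an illustrative Python sketch, not the package's R API; the function and argument names are hypothetical.

```python
import random

def add_label_noise(labels, classes, rate, seed=0):
    """Uniform completely-at-random label noise: each label is,
    with probability `rate`, replaced by a different class drawn
    uniformly at random. The seed makes the corruption reproducible,
    mirroring the 'supervised' (controlled) error introduction that
    noise models provide."""
    rng = random.Random(seed)
    noisy = []
    for y in labels:
        if rng.random() < rate:
            y = rng.choice([c for c in classes if c != y])
        noisy.append(y)
    return noisy
```

With `rate=0.0` the data is returned unchanged; with `rate=1.0` and two classes every label is flipped, so the noise level is directly and verifiably controlled.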
Noise Models in Classification: Unified Nomenclature, Extended Taxonomy and Pragmatic Categorization
This paper presents the first review of noise models in classification covering both label and
attribute noise. This review reveals the lack of a unified nomenclature in the field. In order to address
this problem, a tripartite nomenclature based on the structural analysis of existing noise models is
proposed. Additionally, a revision of their current taxonomies is carried out, which are combined
and updated to better reflect the nature of any model. Finally, a categorization of noise models is
proposed from a practical point of view depending on the characteristics of noise and the study
purpose. These contributions provide a variety of models to introduce noise, their characteristics
according to the proposed taxonomy and a unified way of naming them, which will facilitate their
identification and study, as well as the reproducibility of future research.
Unified processing framework of high-dimensional and overly imbalanced chemical datasets for virtual screening.
Virtual screening in drug discovery involves processing large datasets of unknown molecules in order to find the ones that are likely to have the desired effect on a biological target, typically a protein receptor or an enzyme. Molecules are thereby classified as active or non-active in relation to the target. Misclassification of molecules in settings such as drug discovery and medical diagnosis is costly, both in time and in money. In the process of discovering a drug, it is mainly the inactive molecules classified as active towards the biological target, i.e. false positives, that delay progress and cause high late-stage attrition. However, despite the pool of techniques available, selecting the suitable approach for each situation remains a major challenge. This PhD thesis is designed to develop a pioneering framework which enables the analysis of the virtual screening of chemical compound datasets in a wide range of settings in a unified fashion. The proposed method provides a better understanding of the dynamics of innovatively combining data processing and classification methods in order to screen massive, potentially high-dimensional and overly imbalanced datasets more efficiently.
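One common baseline for the class-imbalance problem the thesis addresses is random undersampling of the majority (inactive) class. The sketch below is only an illustration of that baseline under assumed names, not the thesis's own framework.

```python
import random

def undersample(X, y, minority_label, seed=0):
    """Random undersampling: keep every minority example (e.g. the
    rare active molecules) plus an equal-sized random subset of the
    majority class, yielding a balanced training set."""
    rng = random.Random(seed)
    minority = [i for i, lab in enumerate(y) if lab == minority_label]
    majority = [i for i, lab in enumerate(y) if lab != minority_label]
    keep = minority + rng.sample(majority, len(minority))
    rng.shuffle(keep)                 # avoid class-ordered batches
    return [X[i] for i in keep], [y[i] for i in keep]
```

The trade-off is that discarded majority examples carry information; more elaborate processing pipelines, of the kind the thesis studies, aim to avoid that loss.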