
    Understanding the apparent superiority of over-sampling through an analysis of local information for class-imbalanced data

    Data plays a key role in the design of expert and intelligent systems and, therefore, data preprocessing appears to be a critical step to produce high-quality data and build accurate machine learning models. Over the past decades, increasing attention has been paid to the issue of class imbalance, and this is now a research hotspot in a variety of fields. Although the resampling methods, either by under-sampling the majority class or by over-sampling the minority class, stand among the most powerful techniques to face this problem, their strengths and weaknesses have typically been discussed based only on the class imbalance ratio. However, several questions remain open and need further exploration. For instance, the subtle differences in performance between the over- and under-sampling algorithms are still poorly understood, and we hypothesize that they could be better explained by analyzing the inner structure of the data sets. Consequently, this paper attempts to investigate and illustrate the effects of the resampling methods on the inner structure of a data set by exploiting local neighborhood information, identifying the sample types in both classes and analyzing their distribution in each resampled set. Experimental results indicate that the resampling methods that produce the highest proportion of safe samples and the lowest proportion of unsafe samples correspond to those with the highest overall performance. The significance of this paper lies in the fact that our findings may contribute to a better understanding of how these techniques perform on class-imbalanced data and why over-sampling has been reported to be usually more efficient than under-sampling. The outcomes of this study may have an impact on both research and practice in the design of expert and intelligent systems, since a priori knowledge about the internal structure of the imbalanced data sets could be incorporated into the learning algorithms.
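To make the idea of identifying sample types from local neighborhood information concrete, here is a minimal sketch. It is not the paper's implementation: the function names `sample_types` and `type_distribution`, the Euclidean distance, and the four-way thresholds (safe, borderline, rare, outlier) are assumptions chosen for illustration. The abstract itself only distinguishes safe from unsafe samples.

```python
from collections import Counter
from math import dist


def sample_types(points, labels, k=5):
    """Label each sample by how many of its k nearest neighbours share its class.

    Assumed thresholds (not from the abstract): at least 4/5 of the
    neighbours in the same class -> "safe", at least 2/5 -> "borderline",
    at least one -> "rare", none -> "outlier".
    """
    types = []
    for i, (p, y) in enumerate(zip(points, labels)):
        # Distances from this sample to every other sample, nearest first.
        others = sorted(
            (dist(p, q), labels[j]) for j, q in enumerate(points) if j != i
        )
        same = sum(1 for _, lab in others[:k] if lab == y)
        # Integer comparisons avoid floating-point edge cases.
        if 5 * same >= 4 * k:
            types.append("safe")
        elif 5 * same >= 2 * k:
            types.append("borderline")
        elif same > 0:
            types.append("rare")
        else:
            types.append("outlier")
    return types


def type_distribution(types, labels, target):
    """Count the sample types among the samples of class `target`."""
    return Counter(t for t, y in zip(types, labels) if y == target)
```

Running `sample_types` on a data set before and after resampling, and comparing the `type_distribution` of each class, would give the kind of safe versus unsafe proportions the abstract refers to.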

    Deficient data classification with fuzzy learning

    This thesis first proposes a novel algorithm for handling both missing values and imbalanced data classification problems. Then, algorithms for addressing the class imbalance problem in Twitter spam detection (Network Security Problem) have been proposed. Finally, the security profile of SVM against deliberate attacks has been simulated and analysed.

    Mapping (Dis-)Information Flow about the MH17 Plane Crash

    Digital media enables not only fast sharing of information, but also disinformation. One prominent case of an event leading to circulation of disinformation on social media is the MH17 plane crash. Studies analysing the spread of information about this event on Twitter have focused on small, manually annotated datasets, or used proxies for data annotation. In this work, we examine to what extent text classifiers can be used to label data for subsequent content analysis; in particular, we focus on predicting pro-Russian and pro-Ukrainian Twitter content related to the MH17 plane crash. Even though we find that a neural classifier improves over a hashtag-based baseline, labeling pro-Russian and pro-Ukrainian content with high precision remains a challenging problem. We provide an error analysis underlining the difficulty of the task and identify factors that might help improve classification in future work. Finally, we show how the classifier can facilitate the annotation task for human annotators.

    Knowledge extraction from biomedical data using machine learning

    PhD Thesis. Thanks to the breakthroughs in biotechnologies that have occurred in recent years, biomedical data is accumulating at a previously unseen pace. In the field of biomedicine, decades-old statistical methods are still commonly used to analyse such data. However, the simplicity of these approaches often limits the amount of useful information that can be extracted from the data. Machine learning methods represent an important alternative due to their ability to capture complex patterns within the data that are likely missed by simpler methods. This thesis focuses on the extraction of useful knowledge from biomedical data using machine learning. Within the biomedical context, the vast majority of machine learning applications focus their effort on the generation and validation of prediction models. Rarely are the inferred models used to discover meaningful biomedical knowledge. The work presented in this thesis goes beyond this scenario and devises new methodologies to mine machine learning models for the extraction of useful knowledge. The thesis targets two important and challenging biomedical analytic tasks: (1) the inference of biological networks and (2) the discovery of biomarkers. The first task aims to identify associations between different biological entities, while the second one tries to discover sets of variables that are relevant for specific biomedical conditions. Successful solutions for both problems rely on the ability to recognise complex interactions within the data, hence the use of multivariate machine learning methods. The network inference problem is addressed with FuNeL: a protocol to generate networks based on the analysis of rule-based machine learning models. The second task, the biomarker discovery, is studied with RGIFE, a heuristic that exploits the information extracted from machine learning models to guide its search for minimal subsets of variables.
The extensive analysis conducted for this dissertation shows that the networks inferred with FuNeL capture relevant knowledge complementary to that extracted by standard inference methods. Furthermore, the associations defined by FuNeL are discovered to be more pertinent in a disease context. The biomarkers selected by RGIFE are found to be disease-relevant and to have a high predictive power. When applied to osteoarthritis data, RGIFE confirmed the importance of previously identified biomarkers, whilst also extracting novel biomarkers with possible future clinical applications. Overall, the thesis shows new effective methods to leverage the information, often remaining buried, encapsulated within machine learning models and discover useful biomedical knowledge. Funding: European Union Seventh Framework Programme (FP7/2007-2013), which funded part of this work under the “D-BOARD” project (grant agreement number 305815).

    Developing statistical and bioinformatic analysis of genomic data from tumours

    Previous prognostic signatures for melanoma based on tumour transcriptomic data were developed predominantly on cohorts of AJCC (American Joint Committee on Cancer) stages III and IV melanoma. Since 92% of melanoma patients are diagnosed at AJCC stages I and II, there is an urgent need for better prognostic biomarkers to allow patient stratification for receiving early adjuvant therapies. This study uses genome-wide tumour gene expression levels and clinico-histopathological characteristics of patients from the Leeds Melanoma Cohort (LMC). Several unsupervised and supervised classification approaches were applied to the transcriptomic data, to identify biological classes of melanoma, and to develop prognostic classification models respectively. Unsupervised clustering identified six biologically distinct primary melanoma classes (LMC classes). Unlike previous molecular classes of melanoma, the LMC classes were prognostic in both the whole LMC dataset and in stage I tumours. The prognostic value of the LMC classes was replicated in an independent dataset, but insufficient data were available to replicate in an AJCC stage I subset. Supervised classification using the Random Forest (RF) approach provided improved performances when adjustments were made to deal with class imbalance, while this did not improve performance of the Support Vector Machine (SVM). However, RF and SVM had similar results overall, with RF only marginally better. Combining clinical and transcriptomic information in the RF further improved the performance of the prediction model in comparison to using clinical information alone. Finally, the agnostically derived LMC classes and the supervised RF model showed convergence in their association with outcome in some groups of patients, but not in others. 
In conclusion, this study reports six molecular classes of primary melanoma with prognostic value in stage I disease and overall, and a prognostic classification model that predicts outcome in primary melanoma.

    LIPIcs, Volume 277, GIScience 2023, Complete Volume

    LIPIcs, Volume 277, GIScience 2023, Complete Volume

    Security of Ubiquitous Computing Systems

    The chapters in this open access book arise out of the EU Cost Action project Cryptacus, the objective of which was to improve and adapt existing cryptanalysis methodologies and tools to the ubiquitous computing framework. The cryptanalysis implemented lies along four axes: cryptographic models, cryptanalysis of building blocks, hardware and software security engineering, and security assessment of real-world systems. The authors are top-class researchers in security and cryptography, and the contributions are of value to researchers and practitioners in these domains. This book is open access under a CC BY license.